At SAP TechEd in September we announced the release of SAP Data Hub. Today we deliver the SAP Data Hub, developer edition.

SAP Data Hub is a data sharing, pipelining, and orchestration solution that helps companies accelerate and expand the flow of data across their modern, diverse data landscapes (for more details, take a look at Marc’s excellent FAQ blog post). Simply put, it includes features for:

  • Data governance (metadata management, discovery, profiling…)
  • Data pipelines (flow-based applications)
  • Workflows (orchestration of processes across the data landscape)

The architecture of SAP Data Hub leverages modern container technology and, again simply put, looks like this:

The main (technical) components of SAP Data Hub are:

  • Application based on SAP HANA, XS Advanced Model
  • Distributed Runtime leveraging Kubernetes
    • SAP Vora (to run distributed queries on “Big Data”)
    • SAP Data Hub Pipelines (to run flow-based applications)
  • SAP Data Hub Adapter (central communication endpoint for operations performed from the application on Kubernetes and Hadoop)
  • SAP Vora Spark Extensions (extensions for the Spark execution framework to access data in SAP Vora and SAP HANA)

SAP Data Hub, developer edition

For the developer edition, we looked for a way to run SAP Data Hub on your local computer. While there are possibilities involving SAP HANA, express edition and Kubernetes (Minikube) on your local computer, we decided on a different approach.

We took the parts of SAP Data Hub that are, in our opinion, of most interest to developers (SAP Vora, SAP Data Hub Pipelines) and packaged them together with Hadoop and Zeppelin into a single Docker image / container. This approach is similar to what we have done for SAP Vora in the past.

Now, what are the advantages of this approach?

  • You can easily run SAP Data Hub, developer edition on your local computer (be it Windows, Linux or macOS).
  • Building the Docker image locally takes between 30 and 60 minutes. During this time, you need a stable internet connection. Once the Docker image is built, you can start a container based on the image within a few minutes and without network connectivity.
  • You can build powerful data pipelines (and they can interact with all kinds of other technologies, e.g. SAP HANA, SAP API Business Hub, Kafka, any web service).

Of course, there are also some drawbacks:

  • The SAP Data Hub, developer edition currently does not allow you to use data governance and workflow features of SAP Data Hub.
  • Unfortunately, you cannot observe how the SAP Data Hub usually containerizes and deploys data-driven applications onto Kubernetes.
  • Some of the data pipeline operators (i.e., the reusable and configurable components which you can combine to build data pipelines) will not work inside the container. Most notably, the operators related to machine learning (leveraging TensorFlow) and image processing (leveraging OpenCV) currently cannot be used, at least not “out-of-the-box”.
  • When you stop and restart the Docker container, the tables you have created in SAP Vora are currently lost. You need to recreate them.
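
To make the idea of operators and flow-based pipelines a bit more concrete, here is a minimal sketch in plain Python. It does not use any SAP Data Hub API; the names `make_filter`, `to_upper` and `build_pipeline` are hypothetical and only illustrate the general pattern of small, configurable components that are chained so that messages flow from one to the next.

```python
from typing import Callable, Iterator

# Hypothetical illustration only: none of these names are SAP Data Hub APIs.
# An "operator" here is any function that turns one message stream into another.
Operator = Callable[[Iterator[str]], Iterator[str]]


def make_filter(keyword: str) -> Operator:
    """Configurable operator: pass on only messages that contain `keyword`."""
    def run(messages: Iterator[str]) -> Iterator[str]:
        return (m for m in messages if keyword in m)
    return run


def to_upper(messages: Iterator[str]) -> Iterator[str]:
    """Operator without configuration: upper-case every message."""
    return (m.upper() for m in messages)


def build_pipeline(*operators: Operator) -> Operator:
    """Connect operators so the output of each feeds the input of the next."""
    def run(messages: Iterator[str]) -> Iterator[str]:
        for op in operators:
            messages = op(messages)
        return messages
    return run


# Combine two reusable operators into one pipeline and push messages through it.
pipeline = build_pipeline(make_filter("sensor"), to_upper)
result = list(pipeline(iter(["sensor: 21.5", "heartbeat", "sensor: 22.0"])))
print(result)  # ['SENSOR: 21.5', 'SENSOR: 22.0']
```

In this sketch the configuration (the keyword) is fixed when the operator is created, while the data arrives later as a stream. That separation is what makes such components reusable across different pipelines.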

How to get started?

To give the SAP Data Hub, developer edition a try, please download it from SAP Store. Afterwards visit our Tutorial Navigator. Currently the following tutorials are available:

The tutorials give you a first idea of how to build data-driven applications with SAP Data Hub. You will learn how to create your first pipeline, and you will use a message broker, HDFS and SAP Vora.

Next steps

We know this is just a starting point for enabling developers to work with SAP Data Hub. There are several things we have planned for the future:

  • We continue to work to remove some of the aforementioned drawbacks (in particular, the operators the developer edition currently cannot run).
  • We plan to publish additional tutorials (e.g. for machine learning, how to create your own operators).
  • We have started working on an openSAP course for SAP Data Hub.
  • We have begun to work on a cloud-based trial environment for SAP Data Hub. This will run on Kubernetes and offer additional possibilities, including access to more of the SAP Data Hub functionality.

Stay tuned! If you have questions, problems or proposals in the meantime, feel free to post them as comments on this blog or in the SAP Community. We will try to answer them in a timely manner and collect frequently asked questions here.