X1: Co-designing Big Data Analytics Systems with Modern Networks

Data is now counted among the most important assets in academia and industry. It has been estimated that companies that adopt Big Data Analytics can increase productivity by 10% more than companies that do not, and that Big Data practices in Europe could add 1.9% to GDP between 2014 and 2020. In recent years, many different Big Data Analytics Systems such as MapReduce, Spark, and more recently TensorFlow have been used to process large amounts of data. A basic need of all these systems is that data centre networks provide high throughput for large parallel dataflows. Examples are the massive shuffle traffic in a MapReduce application and the workloads that arise in distributed machine learning systems such as TensorFlow when a central parameter server stores the trained model parameters. Furthermore, these systems are increasingly deployed across multiple data centres, including the edges of the network, to support scenarios in which data needs to be pre-processed close to the data sources; such deployments might even include mobile end-devices.

While communication systems play an important role in ensuring the scalability and efficiency of data-intensive systems, there has not been much work on co-designing data-intensive systems with the capabilities provided by modern communication networks. The advent of flexible networking hardware and expressive data plane programming languages through software-defined networks (SDNs), however, has opened up new opportunities to better align these two worlds. A main challenge here is that systems such as Spark and TensorFlow enable a multitude of distributed workloads with different communication characteristics, ranging from classical SQL-based analytics and graph algorithms to stream processing and more recent deep learning algorithms. To support all these workloads in an ideal manner, the network has to adjust automatically to them by applying the optimal communication mechanisms and transitions between those mechanisms, as proposed in MAKI.

Goal and Challenges

The main goal of this project is to analyze opportunities and challenges of co-designing data-intensive systems such as Spark or TensorFlow with the network layer. The challenges that should be addressed in particular are:

(1) In-network Processing for Big Data Analytics

In this part, we plan to develop transitions of the data-flow graph that better support data-parallel execution.
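
As a rough illustration of what such a transition could look like, the following minimal Python sketch compares two placements of a sum aggregation in a small data flow: at the receiving worker, or on a programmable switch along the path. The function names, the toy data, and the message count used as a cost measure are all assumptions made for this example; they are not taken from the project.

```python
from collections import defaultdict


def aggregate_at_receiver(worker_outputs):
    """Baseline placement: every (key, value) pair crosses the network."""
    messages = [pair for output in worker_outputs for pair in output]
    totals = defaultdict(int)
    for key, value in messages:
        totals[key] += value
    return dict(totals), len(messages)


def aggregate_in_network(worker_outputs):
    """Alternative placement: a (hypothetical) switch pre-aggregates by key."""
    switch_state = defaultdict(int)
    for output in worker_outputs:
        for key, value in output:
            switch_state[key] += value
    # Only one message per distinct key leaves the switch.
    messages = list(switch_state.items())
    return dict(messages), len(messages)


# Toy outputs of three workers (assumed data, for illustration only).
worker_outputs = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]

baseline_result, baseline_msgs = aggregate_at_receiver(worker_outputs)
innet_result, innet_msgs = aggregate_in_network(worker_outputs)
```

Both placements compute the same totals, but in this toy setting the in-network variant forwards fewer messages to the receiver. Whether that holds in practice would depend on the workload and on the capabilities of the switch.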

(2) Workload-aware Path Selection

The higher-level goal of this part is to use cost models and statistics of Big Data engines to apply optimal transitions for path selection.
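
To make the idea more concrete, the sketch below shows one way engine statistics could drive a switch between two path-selection mechanisms. It assumes the analytics engine can provide an estimated flow size and the network can report per-path load. The `Path` class, the size threshold, and the simple transfer-time cost model are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    capacity_gbps: float  # nominal link capacity (assumed known)
    load_gbps: float      # current load, as reported by the network (assumed)


def estimated_transfer_seconds(path, flow_bytes):
    """Toy cost model: flow size divided by the remaining capacity."""
    available_gbps = max(path.capacity_gbps - path.load_gbps, 1e-9)
    return (flow_bytes * 8) / (available_gbps * 1e9)


def select_path(paths, flow_id, flow_bytes, small_flow_threshold=1_000_000):
    """Pick a path with a mechanism that depends on the estimated flow size."""
    if flow_bytes < small_flow_threshold:
        # Small flows: cheap static choice based on the flow identifier.
        return paths[flow_id % len(paths)]
    # Large flows: use the cost model to pick the cheapest path.
    return min(paths, key=lambda p: estimated_transfer_seconds(p, flow_bytes))


paths = [Path("p0", 10.0, 9.0), Path("p1", 10.0, 2.0)]
small_choice = select_path(paths, flow_id=4, flow_bytes=500)
large_choice = select_path(paths, flow_id=4, flow_bytes=10**9)
```

Here the small flow is assigned statically, while the large flow is routed over the less loaded path because the cost model estimates a shorter transfer time for it.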

(3) Hardware Support

We investigate the use of FPGAs to implement specialized processing elements for these functions.

Subproject Leader X1:

Name: Prof. Dr. rer. nat. Carsten Binnig, Data Management (DM)
Contact: +49 6151 16-25601, S2|02 D106