A distributed pipeline for large-scale OTDS processing

Abstract

Processing ever-growing amounts of data requires a scalable processing implementation. There are two approaches to scalability. Vertical scalability is the ability to scale with the resources of a single node executing the data processing workload; it is limited by the most powerful hardware available on the market. Horizontal scalability, on the other hand, is the ability to scale with the number of nodes. In the best case, horizontal scalability allows practically unlimited scaling by adding more nodes.

Data processing pipelines, which are linear sequences of tasks, are difficult to scale horizontally. To obtain a horizontally scalable pipeline, two challenges have to be solved: splitting the work into smaller packets and coordinating the distributed processing of these packets. Existing data processing frameworks, such as Apache Hadoop or Apache Spark, can help with coordinating the work, but they require the implementation to use the programming languages and technology stacks prescribed by the framework. Consequently, existing pipeline implementations built on other technology stacks cannot be reused.
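The two challenges can be illustrated with a minimal sketch. It is not the implementation from the thesis: the function names (split_into_packets, run_pipeline, run_distributed) are hypothetical, a local thread pool stands in for a set of nodes, and the sketch assumes that records can be processed independently of each other, which does not hold for every pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# A task takes one packet (a list of records) and returns the processed packet.
Task = Callable[[List[str]], List[str]]


def split_into_packets(records: Sequence, packet_size: int) -> List[list]:
    """Challenge 1: split the work into smaller packets of at most packet_size records."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    return [list(records[i:i + packet_size]) for i in range(0, len(records), packet_size)]


def run_pipeline(packet: List[str], tasks: Sequence[Task]) -> List[str]:
    """Run the linear sequence of tasks on a single packet."""
    for task in tasks:
        packet = task(packet)
    return packet


def run_distributed(records: Sequence[str], tasks: Sequence[Task],
                    packet_size: int, workers: int) -> List[str]:
    """Challenge 2: coordinate the processing of all packets.

    Worker threads stand in for nodes here (an assumption of this sketch).
    pool.map returns results in packet order, so the output order matches the input.
    """
    packets = split_into_packets(records, packet_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: run_pipeline(p, tasks), packets)
        return [record for packet in results for record in packet]


if __name__ == "__main__":
    # Two toy tasks: trim whitespace, then upper-case each record.
    tasks = [
        lambda packet: [r.strip() for r in packet],
        lambda packet: [r.upper() for r in packet],
    ]
    print(run_distributed([" a", "b ", "c", "d", "e"], tasks, packet_size=2, workers=2))
```

In this sketch every packet runs through the complete task sequence, so adding workers shortens the total run time only as long as there are enough packets to keep them busy.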

We present a concept for executing distributed workflows composed of containerized tasks, which allows a different technology stack to be used for each task. We further discuss how certain pipelines can be transformed into distributed workflows. The presented concepts are evaluated in an industry scenario. First, we transformed an existing pipeline for processing OTDS travel offer data into a distributed workflow; to achieve this, we implemented a mechanism that splits OTDS data into smaller work packets. Second, we built a prototype based on the Kubernetes container orchestration framework that can execute the distributed workflow.

In experiments, the distributed workflow took up to 50% less time to complete than the original pipeline while using only around 20% more resources.
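As a back-of-the-envelope reading of these figures, the following sketch converts a reduction in run time into a speedup factor and relates it to the additional resource usage. The function names are hypothetical, and the sketch assumes that "20% more resources" means total resource consumption relative to the original pipeline; the abstract does not specify this.

```python
def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a relative run-time reduction (0.5 means 50% less time) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_per_resource(reduction: float, extra_resources: float) -> float:
    """Speedup divided by relative resource usage (1.0 + extra_resources)."""
    return speedup_from_time_reduction(reduction) / (1.0 + extra_resources)


if __name__ == "__main__":
    # Best case reported above: 50% less time, around 20% more resources (assumed interpretation).
    print(speedup_from_time_reduction(0.5))
    print(round(speedup_per_resource(0.5, 0.2), 2))
```

Under this assumption, the best case corresponds to a speedup of about 2x at roughly 1.2x the resource usage.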

Project information

Status:

Finished

Thesis for degree:

Bachelor

Student:

Felix Friedberger

Supervisor:
Id:

2019-022