Scalable Distributed Data Processing with Microservices

Processing increasingly large amounts of data requires the underlying system to be scalable. A system can be designed to scale with the resources of a single computer by using shared-memory parallelization. Another approach is to scale with the number of computers: a large work package is split into smaller ones, the smaller work packages are distributed and processed, and the results are then collected and merged. Established frameworks for scalable distributed computing such as Apache Hadoop and Apache Spark assist in realizing this. However, to benefit fully from these frameworks, one has to stay within their ecosystem of supported programming languages and technology stacks. A recent trend in software architecture is to build and compose microservices rather than one monolithic system. This architectural approach has the benefit that developers can choose the best-fitting programming language and technology stack individually for every microservice.
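The split, distribute, process, and merge pattern described above can be illustrated with a minimal Python sketch. It is not taken from the project: the function names `split`, `process`, `merge`, and `run` are hypothetical, word counting is an assumed example task, and a local thread pool merely stands in for a set of remote workers or microservices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items, n_packages):
    """Split a large work package into at most n_packages smaller ones."""
    size = max(1, -(-len(items) // n_packages))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def process(package):
    """Process one small work package (assumed task: count words)."""
    return Counter(word for line in package for word in line.split())


def merge(partial_results):
    """Collect the partial results and merge them into one result."""
    total = Counter()
    for partial in partial_results:
        total.update(partial)
    return total


def run(items, n_workers=4):
    """Split, distribute, process, and merge.

    The thread pool is only a local stand-in for distributed workers.
    """
    packages = split(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(process, packages))
    return merge(partial_results)
```

In a microservice setting, `process` would be replaced by a call to a remote service, while splitting and merging would remain the responsibility of the coordinating component.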

This project investigates approaches to using microservices in distributed data processing scenarios. Research topics in this field include:

  • Managing workflows composed of tasks that are realized by microservices
  • Modelling and executing dynamic workflows that may alter their flow depending on the situation at hand
  • Scheduling tasks with awareness of data transfer costs
  • Robustness against failures of, and changes to, the underlying compute resources
  • Making large volumes of concurrent data processing, and any failures that occur, understandable to humans (e.g., by means of suitable visualization and tool support)
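To make the scheduling topic more concrete, the following Python sketch shows one simple way a scheduler could take data transfer costs into account. It is an illustrative assumption, not the approach developed in the project: the `Task` fields, the `schedule` function, the greedy earliest-finish-time rule, and the uniform bandwidth value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    input_node: str         # node that currently holds the task's input data
    input_size_mb: float    # size of the input data in megabytes
    compute_seconds: float  # assumed pure processing time


def schedule(tasks, nodes, bandwidth_mb_per_s=100.0):
    """Greedily assign each task to the node with the earliest finish time.

    The finish time includes the cost of moving the input data to the
    node unless the data is already stored there.
    """
    busy_until = {node: 0.0 for node in nodes}
    assignment = {}
    for task in tasks:
        def finish_time(node):
            if node == task.input_node:
                transfer = 0.0
            else:
                transfer = task.input_size_mb / bandwidth_mb_per_s
            return busy_until[node] + transfer + task.compute_seconds

        best = min(nodes, key=finish_time)
        busy_until[best] = finish_time(best)
        assignment[task.name] = best
    return assignment
```

With this rule, a task with a large input tends to stay on the node that holds its data, whereas a task with a small input may be moved to an idle node because the transfer is cheap compared with waiting.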

This research is based on a joint project with Amadeus Leisure IT and is motivated by Amadeus’ experience in processing data for the leisure travel market.

Project information

Project start & end:
2017 – 2018