A framework for monitoring workload schedules in compute clusters

Abstract

Distributed systems are the foundation of many successful enterprises that have to store and process large amounts of data and handle large amounts of traffic. Efficient scheduling of workloads in such a system is a key part of their success. Monitoring and evaluating the performance efficiency of cluster systems is therefore becoming increasingly important for reducing operational costs and maximizing profits. In this work, we propose a monitoring framework, built on a microservice architecture, that is capable of recording workload schedules and topology changes across a variety of cluster platforms.
The framework aims to provide cluster operators, system administrators and researchers with a concise interface for building monitoring solutions for one or multiple cluster systems. The conceptual design focuses on maintainability and extensibility: each individual service in the framework serves a single purpose and can be replaced or rewritten in any of several programming languages. Thanks to the Protobuf message format, new components matching the cluster environment can be developed quickly. The proposed framework is intended as a starting point for developers to implement their own visualizations, alerting systems or other clients that make use of its near real-time capabilities. We introduce a sample client for this framework: a visualization that can be used to spot performance inefficiencies. This work starts by defining requirements for such a framework and closes by evaluating, through a series of tests and benchmarks, whether the framework concept meets all of these requirements.

Contact

Christian Plewnia

External PhD Candidate

plewnia@swc.rwth-aachen.de

Project information

Status:

Finished

Thesis for degree:

Bachelor

Student:

Julius Hinze

Supervisor:

Christian Plewnia

Id:

2021-010