An Approach for Supervised Reconstruction of Infrastructure as Code

Problem Introduction

The infrastructure needed to operate legacy applications is often not described formally, but only through infrastructure plans or deployment diagrams. In addition, different environments (e.g. testing, staging, production) may come with different descriptions and different configurations for the components.

This has several drawbacks, both for the software vendor and for the operator of the software. From the vendor’s perspective, the main issues are:

● Documentation is not necessarily updated when the software changes (Evolution Gap).
● Carrying out the documented steps manually is error-prone, which leads to troubleshooting effort.
● It takes a lot of time to build the environment.

For the operator of the software, this has the following effects:
● Documentation may be outdated and therefore no longer work.
● Documentation may work in certain circumstances but not in all cases.
● Installing the software may be costly due to manual activities.

With the current Infrastructure as Code (IaC) [2] approach, infrastructure can be described in domain-specific languages for tools that create the infrastructure automatically. Different automation tools address different layers of automation: for example, Puppet and Chef, among others, provision at the operating system layer, while tools like Docker cover the operating system, network and orchestration layers.

Reconstructing existing systems is not trivial, and the literature does not provide approaches for it. A first basic step was taken with the tool “blueprint” [1], which detects how an existing system deviates from a default operating system installation. However, the generated code is not modularized and therefore not maintainable.

Purpose of the Bachelor Thesis

Since we want to evolve existing systems towards infrastructure as code, our starting point is an existing software system that consists of executables, documentation, running installations and other related infrastructure (monitoring, backup, configuration).

For existing systems there are essentially two ways to reconstruct the existing infrastructure:

● Extracting infrastructure information from documentation and building IaC from it (“offline” reconstruction).
● Analysing a running system and extracting infrastructure information to generate IaC (“online” reconstruction).

A common goal is to reuse existing infrastructure as code modules. Therefore, the infrastructure information has to be abstracted and generalized to create a toolbox of reusable infrastructure code modules. To create a system, these modules have to be parameterized; afterwards, a running system can be generated by common provisioning tools.
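The idea of a parameterized, reusable module can be illustrated with a minimal Python sketch. The `Module` class, the `web_server` module and the rendered text are hypothetical and serve only as illustration; real modules would be written in the language of the chosen provisioning tool.

```python
from dataclasses import dataclass, field
from string import Template


@dataclass
class Module:
    """A reusable infrastructure code module with named parameters (hypothetical)."""

    name: str
    template: Template                      # generalized infrastructure code
    defaults: dict = field(default_factory=dict)

    def render(self, **params: str) -> str:
        """Fill the template with concrete parameters.

        Explicit parameters override defaults; a parameter that is neither
        given nor defaulted makes Template.substitute raise KeyError.
        """
        return self.template.substitute({**self.defaults, **params})


# Hypothetical module; the rendered syntax is illustrative, not a real DSL.
web_server = Module(
    name="web_server",
    template=Template("service $service listens on port $port"),
    defaults={"service": "httpd"},
)

# The same module is parameterized differently per environment.
staging = web_server.render(port="8080")
production = web_server.render(port="80")
```

The same generalized module yields different concrete descriptions per environment, which is the property the toolbox relies on.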

In the offline approach, infrastructure code is developed with the help of reusable modules, based on manual analysis of the existing static documentation such as installation manuals and deployment diagrams.

An online approach would create infrastructure code based only on the automated extraction of information from a running system. As with blueprint, the resulting code would be unstructured and not suitable for further development. Therefore, this thesis proposes a combined approach that uses the IaC modules from the offline approach as templates for the generation step of the online approach.

This use of previously acquired knowledge is called supervised reconstruction.
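One way to picture supervised reconstruction is as a matching step: resources observed on the running system are assigned to known modules, and only the remainder stays unstructured. The following Python sketch is a simplified illustration under assumed names (`TOOLBOX`, the resource keys and the complete-match rule are hypothetical), not the algorithm developed in the thesis.

```python
from __future__ import annotations

# Hypothetical module toolbox: module name -> resource keys the module manages.
TOOLBOX = {
    "web_server": {"package:httpd", "service:httpd", "file:/etc/httpd/conf/httpd.conf"},
    "database": {"package:postgresql", "service:postgresql"},
}


def supervised_reconstruct(observed: set[str]) -> tuple[dict[str, set[str]], set[str]]:
    """Assign observed resources to known modules; return (matches, leftovers)."""
    matches: dict[str, set[str]] = {}
    remaining = set(observed)
    for name, resources in TOOLBOX.items():
        # Assumption of this sketch: a module matches only if all of its
        # resources were observed on the system.
        if resources <= remaining:
            matches[name] = set(resources)
            remaining -= resources
    return matches, remaining


observed = {
    "package:httpd",
    "service:httpd",
    "file:/etc/httpd/conf/httpd.conf",
    "package:postgresql",  # service missing, so the database module does not match
    "package:vim",
}
matches, leftovers = supervised_reconstruct(observed)
```

Here the web server resources are explained by a module, while `leftovers` holds what a purely online approach would have emitted as unstructured code.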

To evaluate the feasibility of this supervised reconstruction, the system derived from the IaC is compared to the system under analysis. In a perfect scenario, this new system exactly matches the functionality of the system under analysis and passes the same test cases.
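As a rough illustration of such a comparison, the sketch below assumes that both systems can be summarized as flat attribute snapshots (a simplifying assumption; the attribute names are hypothetical). Running the shared test cases against both systems would be a separate step that is not shown here.

```python
from __future__ import annotations


def compare_systems(original: dict[str, str], rebuilt: dict[str, str]) -> dict[str, list[str]]:
    """Compare two attribute snapshots and report where the rebuilt system deviates."""
    shared = original.keys() & rebuilt.keys()
    return {
        # present in the system under analysis, absent from the rebuilt one
        "missing": sorted(original.keys() - rebuilt.keys()),
        # present only in the rebuilt system
        "unexpected": sorted(rebuilt.keys() - original.keys()),
        # present in both, but with different values
        "different": sorted(k for k in shared if original[k] != rebuilt[k]),
    }


# Hypothetical snapshots of the system under analysis and the rebuilt system.
system_under_analysis = {"package:httpd": "2.4", "port": "8080", "user:app": "present"}
rebuilt_system = {"package:httpd": "2.4", "port": "80", "package:curl": "7.0"}

report = compare_systems(system_under_analysis, rebuilt_system)
```

In the perfect scenario, all three lists in `report` would be empty.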

Process

We will use a software system from the cooperation partner IVU AG as a case study for IaC reconstruction. Additionally, we will construct a synthetic system for development, testing and evaluation purposes. In the first step, we will try to understand and classify the infrastructure described in the documentation. Then we will design our infrastructure architecture and finally describe it as infrastructure as code.

Based on these results, an appropriate algorithm for the online reconstruction approach will be designed and developed; it uses the IaC modules to describe the analysed system.

In the next step, we will automatically augment these descriptions with the concrete parameters for those modules that would result in the running configuration found on the system. We expect this to be viable only for static system attributes and not for dynamic ones.
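The parameter augmentation step could look roughly like the following Python sketch. The split into static and dynamic attributes, as well as all attribute and parameter names, are assumptions made for illustration only.

```python
from __future__ import annotations

# Hypothetical examples of dynamic attributes: they change at runtime and
# therefore cannot be turned into stable module parameters.
DYNAMIC_ATTRIBUTES = {"pid", "uptime_seconds", "open_connections"}


def static_view(observed: dict[str, str]) -> dict[str, str]:
    """Drop dynamic attributes from an observed system snapshot."""
    return {k: v for k, v in observed.items() if k not in DYNAMIC_ATTRIBUTES}


def bind_parameters(
    module_parameters: set[str], observed: dict[str, str]
) -> tuple[dict[str, str], set[str]]:
    """Return (parameters resolved from static attributes, unresolved parameter names)."""
    static = static_view(observed)
    resolved = {name: static[name] for name in sorted(module_parameters) if name in static}
    unresolved = set(module_parameters) - set(resolved)
    return resolved, unresolved


# Hypothetical snapshot of a running system.
observed_attributes = {
    "port": "8080",
    "document_root": "/var/www/app",
    "pid": "4711",
}
resolved, unresolved = bind_parameters(
    {"port", "document_root", "log_level"}, observed_attributes
)
```

Parameters that remain in `unresolved` would have to be supplied manually or taken from module defaults.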

Finally, we will define viable metrics that allow us to evaluate our reconstruction approach in terms of completeness, defects and the effort spent on building the infrastructure code.
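Since the metrics are still to be defined, the following Python sketch only shows one conceivable way to quantify completeness and defects over attribute snapshots. Both definitions are assumptions for illustration; effort is a process measure and is not covered by the sketch.

```python
from __future__ import annotations


def completeness(original: dict[str, str], rebuilt: dict[str, str]) -> float:
    """Fraction of original attributes reproduced with the same value (assumed definition)."""
    if not original:
        return 1.0
    reproduced = sum(1 for key, value in original.items() if rebuilt.get(key) == value)
    return reproduced / len(original)


def defect_count(original: dict[str, str], rebuilt: dict[str, str]) -> int:
    """Number of wrong or unexpected attributes in the rebuilt system (assumed definition)."""
    wrong = sum(1 for key in original if key in rebuilt and rebuilt[key] != original[key])
    unexpected = len(rebuilt.keys() - original.keys())
    return wrong + unexpected


# Hypothetical snapshots: four original attributes, three reproduced correctly.
original_snapshot = {"a": "1", "b": "2", "c": "3", "d": "4"}
rebuilt_snapshot = {"a": "1", "b": "2", "c": "3", "d": "5", "e": "6"}

score = completeness(original_snapshot, rebuilt_snapshot)
defects = defect_count(original_snapshot, rebuilt_snapshot)
```

In this example, three of four attributes are reproduced, and the rebuilt system contains one wrong value and one unexpected attribute.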