Jupyter Notebooks are widely used in industry, especially in machine learning projects, because they provide an interactive development environment and support quick prototyping. Machine learning projects in Jupyter Notebooks also follow the structure of a machine learning workflow. In addition, Jupyter Notebooks allow cells to be labeled, which helps developers give a clear overview of a Notebook's structure. However, adding labels manually is time-consuming, which motivates an automatic approach. This thesis presents two solutions that infer a label for a Notebook cell by evaluating its code and output. Both solutions are general and are not strictly limited by the choice of frameworks used. The first works on general keywords and uses several heuristics to deduce which activity of the machine learning workflow a cell belongs to. The second uses NLP techniques, combining TF-IDF and XGBoost. We also present statistics on the correctness of both solutions, along with benchmarks that provide insight into real-world performance. The results show that the NLP techniques in particular perform exceptionally well, even with small training sets and minor pre-processing.
Project information
Finished
Bachelor
Miguel Perez
2023-012