Experiment History Tracking for Jupyter Notebooks

Jupyter Notebooks are widely used in data science for solving problems through quick, immediate code experimentation, while allowing code and results to be easily documented and shared. However, they suffer from a number of issues, most notably poor reproducibility, cluttered and unstructured Notebooks that exhibit many bad coding practices, and a lack of version control that takes the exploratory nature and other characteristics of Notebooks into account. This thesis addresses these problems by means of a JupyterLab extension that builds upon versioning and history tracking principles established by tools like Git and Verdant. The extension automatically tracks the history of a Jupyter Notebook as a tree graph, with each node representing a relevant change in the Notebook, such as a cell being added or executed. The history can be freely explored by jumping between different versions or by viewing a summary of particular changes. Because the history is tree-based, data scientists can organise their experiments in different branches. This declutters the Notebook: past experiments can be set aside yet remain easily accessible through the history tree. It also improves the organisation of the Notebook and makes informal versioning practices, like commenting out code for later reuse, mostly obsolete. Because the exact history is apparent from the tree (including which parts of the Notebook were executed when), it also becomes easier to comprehend how the results of a particular Notebook were formed, improving its reproducibility. Two studies were conducted to evaluate the concept of tree-based history tracking in Jupyter Notebooks as well as its specific implementation as a JupyterLab extension. The studies found that the general concept is highly promising and that the proposed extension is both useful and usable, helping to mitigate many of the problems that Jupyter Notebooks suffer from.
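
To make the idea of a tree-shaped history concrete, the following is a minimal sketch in Python. It is not the extension's actual implementation, which is not described here; the names `HistoryNode`, `HistoryTree`, `record`, `checkout` and `path_to_current` are hypothetical, and storing a full snapshot of the cell sources in every node is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class HistoryNode:
    """One recorded change, e.g. a cell being added or executed (hypothetical)."""
    node_id: int
    change: str        # free-form label such as "add cell" (assumed format)
    cells: List[str]   # snapshot of all cell sources after the change
    parent: Optional["HistoryNode"] = field(default=None, repr=False)
    children: List["HistoryNode"] = field(default_factory=list, repr=False)


class HistoryTree:
    """Tree of notebook versions with a pointer to the current version."""

    def __init__(self) -> None:
        self.root = HistoryNode(0, "create notebook", [])
        self.current = self.root
        self._nodes: Dict[int, HistoryNode] = {0: self.root}

    def record(self, change: str, cells: List[str]) -> HistoryNode:
        """Attach a new version below the current one and move to it."""
        node = HistoryNode(len(self._nodes), change, list(cells), parent=self.current)
        self.current.children.append(node)
        self._nodes[node.node_id] = node
        self.current = node
        return node

    def checkout(self, node_id: int) -> List[str]:
        """Jump to an earlier version and return its cell sources."""
        self.current = self._nodes[node_id]
        return list(self.current.cells)

    def path_to_current(self) -> List[str]:
        """List the changes leading from the root to the current version."""
        path = []
        node: Optional[HistoryNode] = self.current
        while node is not None:
            path.append(node.change)
            node = node.parent
        return path[::-1]


# Usage: recording a change after jumping back creates a second branch.
tree = HistoryTree()
added = tree.record("add cell", ["x = 1"])
executed = tree.record("execute cell", ["x = 1"])
tree.checkout(added.node_id)                     # set the first experiment aside
edited = tree.record("edit cell", ["x = 2"])     # start a new branch
```

In this sketch, the node for "add cell" ends up with two children, one per experiment, and the set-aside branch stays reachable through `checkout`. A real tool would likely store differences between versions rather than full snapshots.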

Project information

Status:

Finished

Thesis for degree:

Master

Student:

Laurens Studtmann

Supervisor:

Part of research project:

SE4ML - Processes, People and Tools

Id:

2023-014