Machine-learning (ML) prototypes are commonly developed and shared as Jupyter Notebooks to explore feasibility and communicate results in multi-stakeholder settings. However, Notebook-specific issues (e.g., hidden state and out-of-order execution) and ML-specific complexity make continuous, criteria-based quality assurance difficult and time-consuming. This thesis derives a review framework for Notebook-based ML prototypes using a hermeneutic literature review. The framework combines (i) a domain-agnostic quality model capturing key prototyping qualities, (ii) domain-agnostic information needs to assess documentation completeness, and (iii) domain-specific stakeholder personas to incorporate context-dependent requirements.

Building on this framework, we implement Proto-Check, a JupyterLab extension that pre-processes Notebook content, combines structural analyses (e.g., execution-order indicators) with semantic assessments via an external large language model (LLM), and presents results in an interactive review dashboard.

Proto-Check is evaluated through a task-based user study and performance experiments. The results indicate strong perceived usability and usefulness, generally positive trust in the generated feedback, and practical runtime for interactive review workflows, while also highlighting stability considerations inherent to LLM-based assessments.
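To make the "structural analyses" concrete, the sketch below shows one possible execution-order check of the kind the abstract refers to: flagging code cells whose recorded execution counts are not strictly increasing. This is an illustrative assumption, not Proto-Check's actual implementation; the file name and function name are hypothetical.

```python
# Illustrative sketch of an execution-order indicator check.
# Not Proto-Check's implementation; names and thresholds are assumptions.
import nbformat


def find_out_of_order_cells(path: str) -> list[int]:
    """Return indices of code cells whose execution_count is not strictly
    increasing, a common symptom of out-of-order execution."""
    nb = nbformat.read(path, as_version=4)
    out_of_order = []
    last_count = 0
    for idx, cell in enumerate(nb.cells):
        if cell.cell_type != "code":
            continue
        count = cell.get("execution_count")
        if count is None:
            # Cell was never executed; skip rather than flag.
            continue
        if count <= last_count:
            out_of_order.append(idx)
        last_count = max(last_count, count)
    return out_of_order


if __name__ == "__main__":
    # "analysis.ipynb" is a placeholder path for demonstration.
    print(find_out_of_order_cells("analysis.ipynb"))
```

In such a setup, a structural finding like this could be combined with LLM-based semantic assessments (e.g., of documentation completeness) before being surfaced in a review dashboard.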
Project information
Finished
Master
Miguel Perez
2026-002