Bridging the Gap Between Visual and Analytical ML Testing

machine-learning
testing
Author

Arumoy Shome

Published

May 25, 2023

Abstract

Talk for the poster submission to CAIN 2023. Presented virtually at CAIN 2023 and in person at SEN 2023.

Introduction

Hello everyone and thank you for being here.

I am Arumoy, a PhD candidate in the Software Engineering Research Group at TU Delft. I have the privilege of working with excellent researchers such as Luis and Arie (who are somewhere in the audience). And after two years, I have found my calling.

In this talk, I am going to present the current vision I have for my PhD. I hope to generate some interesting discussions and get some feedback along the way.

Implicit expectations to explicit tests

Hopefully, I don’t need to convince you that testing is important. We are software engineers attending a software engineering conference, after all.

There is a wonderful paper by Zhang et al. (2020) that summarises the existing literature on ML testing. What has been overlooked so far, however, is the role of visualisations, and how we use visual tests in the earlier, more “data-centric” stages of the ML lifecycle.

Zhang, Jie M, Mark Harman, Lei Ma, and Yang Liu. 2020. “Machine Learning Testing: Survey, Landscapes and Horizons.” IEEE Transactions on Software Engineering.

Visualisations enable a rapid, exploratory form of analysis. Practitioners use “visual tests” to check data properties. These visualisations tell a story; they are there for a reason. The practitioner’s expertise and domain knowledge are embedded within the visualisation.
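To make this concrete, here is a toy example of such a visual test; the file name and column name are made up for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical training data: the file and column names are
# illustrative, not from a real project.
df = pd.read_csv("train.csv")

# A typical "visual test": plot the label distribution and eyeball
# whether the classes are roughly balanced.
df["label"].value_counts().plot(kind="bar")
plt.title("Label distribution")
plt.show()
```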

This works really well when we are working on our laptop on, say, a course assignment. But visualisations do not scale well across organisational changes, or when we want to move towards a large-scale production system. Visual tests tend to be left behind as latent expectations rather than explicit, failing tests. This move from visual to analytical tests is exactly the research gap where we wish to contribute.
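A minimal sketch of what the analytical counterpart could look like, reusing the same hypothetical label column; the 10% tolerance is an assumption that a domain expert would need to set, not a universal rule:

```python
import pandas as pd

def test_labels_roughly_balanced():
    # Same hypothetical dataset and column as the visual check above.
    df = pd.read_csv("train.csv")
    proportions = df["label"].value_counts(normalize=True)

    # "The bars look about equal" becomes an explicit, failing
    # assertion. The 10% tolerance is an assumed threshold.
    assert proportions.max() - proportions.min() < 0.10, (
        f"class imbalance exceeds tolerance: {proportions.to_dict()}"
    )
```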

The hunt for data properties

The good news is that we have a rich source of data: Jupyter notebooks. We are using a two-pronged approach. The first step, which I have been working on for the past month, is to collect these visualisations, or data properties, manually. We are exploring two sources of data, namely GitHub and Kaggle.

Once we have found a sufficient quantity of data properties, we will scale the analysis to a larger corpus. We are aware of two such datasets, proposed by Quaranta, Calefato, and Lanubile (2021) and Pimentel et al. (2019), which contain notebooks from Kaggle and GitHub respectively.

Quaranta, Luigi, Fabio Calefato, and Filippo Lanubile. 2021. “KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle.” In 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 550–54. IEEE.
Pimentel, João Felipe, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. 2019. “A Large-Scale Study about Quality and Reproducibility of Jupyter Notebooks.” In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 507–17. IEEE.
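To give a flavour of how such mining could work, here is a rough sketch that scans a notebook for common plotting idioms. nbformat is the standard library for reading notebooks, but the marker list and the notebook path are assumptions of ours:

```python
import nbformat

# Plotting idioms we treat as markers of a potential "visual test".
# This list is a heuristic, not an exhaustive catalogue.
PLOT_MARKERS = ("plt.", "sns.", ".plot(", ".hist(", ".boxplot(")

def find_visual_tests(path):
    nb = nbformat.read(path, as_version=4)
    return [
        (i, cell.source)
        for i, cell in enumerate(nb.cells)
        if cell.cell_type == "code"
        and any(marker in cell.source for marker in PLOT_MARKERS)
    ]

# "example.ipynb" is a placeholder path.
for index, source in find_visual_tests("example.ipynb"):
    print(f"cell {index}:\n{source}\n")
```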

Next steps and beyond

Our immediate objective is to provide a formal definition of “visual tests”, along with examples of alternative analytical tests that practitioners can use.

Our ultimate research goal is to recommend such analytical tests automatically to the practitioner. Here it becomes a mining challenge, since Jupyter notebooks contain three sources of information: text, code and images.
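To illustrate why this is a multimodal mining challenge, the sketch below separates those three sources from a notebook; the notebook path is again a placeholder:

```python
import base64
import nbformat

nb = nbformat.read("example.ipynb", as_version=4)
text, code, images = [], [], []

for cell in nb.cells:
    if cell.cell_type == "markdown":
        text.append(cell.source)      # natural-language narrative
    elif cell.cell_type == "code":
        code.append(cell.source)      # the analysis code itself
        for out in cell.get("outputs", []):
            # Rendered visualisations are stored as base64-encoded PNGs.
            png = out.get("data", {}).get("image/png")
            if png:
                images.append(base64.b64decode(png))

print(len(text), "markdown cells,", len(code), "code cells,", len(images), "images")
```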

We see several opportunities to collaborate with researchers working in other areas. Besides ML testing, I see implications for the reproducibility and code quality of Jupyter notebooks, explainable AI and HCI.
