Introduction
Hello everyone and thank you for being here.
I am Arumoy. I am a PhD Candidate at the Software Engineering Research Group at TU Delft. I have the privilege of working with excellent researchers such as Luis and Arie (who are somewhere in the audience). And after 2 years, I have found my calling.
In this talk, I am going to present the current vision I have for my PhD. I hope to generate some interesting discussions and get some feedback along the way.
Implicit expectations to explicit tests
Hopefully, I don’t need to convience you that testing is important. We are software engineers, attending a software engineering conference afterall.
There is a wonderful paper by Zhang et al. (2020) that summarises the existing literature on ML testing. However, what has been ignored—until now—is the role of visualisations and how we use visual tests in the earlier, more “data-centric” stages of the ML lifecycle.
Visualisations enable a rapid, exploratory form of analysis. Practitioners use “visual tests” to check for data properties. These visualisations tell a story. They are there for a reason. The expertise and domain knowledge is embedded within the visualisation.
This works really well when we are working on our laptop, on say an assignment. But visualisations do not scale well across organisation changes or when we want to move towards a large-scale production system. Visual tests tend to be left behind as latent expectations, rather than explicit failing tests. This research gap between moving from visual to analytical tests is where we wish to contribute.
Visualisations become latent expectations rather than explicit tests. And this gap between going from latent visualisations to more analytical tests is exactly the research gap where we wish to contribute.
The hunt for data properties
The good news is that we have a rich source of data—jupyter notebooks. We are using a two-pronged approach. The first step—which I have been working on for the past month—is to collect these visualisations or data properties manually. We are exploring two sources of data, namely github and kaggle.
Once we have found a sufficient quantity of data properties, we will scale it to a larger subset. We are aware of two such datasets proposed by Quaranta, Calefato, and Lanubile (2021) and Pimentel et al. (2019) which contains notebooks from Kaggle and Github respectively.
Next steps and beyond
Our immediate objective is to provide a format definition of “visual tests” along with examples of alternative analytical tests that can be used by the practitioners.
Our ultimate research goal is to recommend such analytical tests automatically to the practitioner. Here it becomes a mining challenge which jupyter notebooks contain three sources of information: text, code and images.
We see several opportunities to collaborate with researchers working in other areas. Besides ML testing, I see implications in reproducibity and code quality of jupyter notebooks, explainable AI and HCI.