Visualisation Zoo


Arumoy Shome


March 17, 2024

Collection of data visualisations I have created using Python.

This is a collection of data visualisations I created for my past research publications. The title of this post is inspired by Heer, Bostock, and Ogievetsky (2010).

Heer, Jeffrey, Michael Bostock, and Vadim Ogievetsky. 2010. “A Tour Through the Visualization Zoo.” Communications of the ACM 53 (6): 59–67.

Joint Distribution of Categorical Variables

The visualisation in Figure 1 comes from the very first conference paper I wrote (Shome, Cruz, and Deursen 2022). The paper explored the presence of anti-patterns in popular ML datasets that lead to the accumulation of technical debt in the downstream stages of the pipeline.

Shome, Arumoy, Luís Cruz, and Arie van Deursen. 2022. “Data Smells in Public Datasets.” In Proceedings of the 1st International Conference on AI Engineering: Software Engineering for AI. CAIN ’22. ACM.

We curated a catalogue of 14 such anti-patterns or “data-smells” and manually analysed their presence in 25 popular ML datasets from Kaggle.

I created the visualisation using a JointGrid from the Seaborn library. The main subplot uses a two-dimensional histogram to show the distribution of the data-smells across all the datasets that were analysed. The marginal subplots show a histogram of each of the corresponding categorical variables.

Figure 1: Joint distribution of two categorical variables.

Heatmap of Correlation Between Numerical Variables

The next visualisation comes from our Shome, Cruz, and Deursen (2024) paper. Here we analysed the relationship between data-dependent and model-dependent fairness metrics. Figure 2 shows the results of the empirical study, which was conducted using 8 datasets and 4 ML models.

Shome, Arumoy, Luís Cruz, and Arie van Deursen. 2024. “Data Vs. Model Machine Learning Fairness Testing: An Empirical Study.” In International Workshop on Deep Learning for Testing and Testing for Deep Learning. DeepTest ’24. IEEE/ACM.

Each heatmap represents the results obtained for one fairness metric (we used Disparate Impact and Statistical Parity Difference). The ML models are represented along the Y axis, while the datasets are along the X axis. Each block shows the correlation between the data and model variants of the corresponding fairness metric. The statistically significant cases are marked with an asterisk. The strength of the correlation is denoted using colour: bright hues of red indicate positive correlation, while cooler hues of blue indicate negative correlation.

Figure 2: Heatmap of correlation between numerical variables
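A rough sketch of how such a pair of heatmaps can be drawn with Seaborn's `heatmap`. The correlation values and p-values here are random placeholders, the model and dataset labels are hypothetical, and the 0.05 significance threshold is an assumption rather than a value taken from the paper.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def significance_labels(pvals, alpha=0.05):
    """Return '*' where p < alpha and an empty string elsewhere (alpha is assumed)."""
    return np.where(np.asarray(pvals) < alpha, "*", "")


rng = np.random.default_rng(0)
models = [f"model_{i}" for i in range(1, 5)]      # hypothetical labels
datasets = [f"dataset_{j}" for j in range(1, 9)]  # hypothetical labels
metrics = ["Disparate Impact", "Statistical Parity Difference"]

fig, axes = plt.subplots(1, 2, figsize=(12, 3.5), sharey=True)
for ax, metric in zip(axes, metrics):
    # Placeholder correlations and p-values; real values would come from the study.
    corr = pd.DataFrame(rng.uniform(-1, 1, size=(4, 8)),
                        index=models, columns=datasets)
    pvals = rng.uniform(0, 0.2, size=corr.shape)
    sns.heatmap(corr, annot=significance_labels(pvals), fmt="",
                cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title(metric)
fig.tight_layout()
```

Fixing `vmin` and `vmax` to -1 and 1 keeps the diverging colour map centred on zero, so red always means positive correlation and blue always means negative correlation in both panels.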