Arumoy Shome

Professional Summary

Data Scientist and Software Engineer with 5+ years of experience at Delft University of Technology, Netherlands, and prior web development experience at Shopify, Canada. Led four end-to-end projects and pioneered three frameworks to improve the MLOps workflow. Hands-on in Python, large-scale data mining, and machine learning.

Skills

  • Data Science and Machine Learning: Python, Scikit-Learn, PyTorch, NumPy
  • Data Engineering and Analysis: Pandas, SQL, PySpark
  • Visualization and Analytics: Seaborn, Matplotlib, D3.js, Three.js
  • DevOps and MLOps: Git, Docker, Linux, Unix, Nix
  • Programming Languages: Python, Ruby, JavaScript, Bash, HTML, CSS, LaTeX
  • Statistical Analysis: Hypothesis Testing, Correlation Analysis, Regression
  • Web Development: REST APIs, Ruby on Rails, UI/UX Development, Object-Relational Mapping, D3.js
  • Research: Data Quality Analysis, Fairness Testing, Static Analysis, Quantitative Research

Experience

Data Scientist and Applied Researcher (PhD Candidate), Delft University of Technology, Delft, The Netherlands, June 2021 - Present

  • Developed a large-scale data mining pipeline to process 297,800 Jupyter notebooks (283 GB) collected from GitHub and Kaggle using REST APIs. Extracted 3 million lines of code using Python, Pandas, and Bash. Released the results as an open-source dataset to enable future research. Project website: https://github.com/arumoy-shome/shome2023notebook.
  • Applied embedding-based clustering using CodeBERT, UMAP and HDBSCAN to identify a representative sample using stratified sampling. Used NLP techniques (text representation, feature extraction and vectorization) to perform exploratory data analysis (EDA) and generated 26 insights to inform ML pipeline reliability.
  • Developed an end-to-end automated ML pipeline (data ingestion, feature engineering, model training and evaluation) using Python, Scikit-Learn and Bash. Evaluated 4 ML classification algorithms (Logistic Regression, Decision Trees, Random Forest and AdaBoost) against 2 fairness metrics (Disparate Impact and Statistical Parity Difference) across 5 tabular datasets. Project website: https://github.com/arumoy-shome/shome2022qualitative.
  • Proposed a novel data-centric ML fairness testing methodology to reduce development time and computational costs of ML pipelines. Designed experiments using AIOps data from 1,600+ pipeline executions and used statistical hypothesis testing (Student's t-test), correlation and regression analysis to validate findings.
  • Pioneered the Data Smells framework and published the findings as an open-source catalog to aid students, academics and data practitioners with ML dataset quality assessment. Project website: https://arumoy.me/data-smells.
  • Performed EDA and feature engineering on the top 25 tabular ML datasets from Kaggle using Python and Pandas. Identified 14 data quality antipatterns and proposed cost-effective fixes to reduce technical debt of ML-enabled systems.
  • Collaborated with 13 experts from industry and academia across technology, healthcare and financial service sectors. Identified 23 engineering best practices to reduce technical debt and deploy ML prototypes to production.
  • Deployed the data and ML pipelines on Linux infrastructure and distributed the workloads across 20 CPU cores using Bash and Unix commands. Used Git and Docker to provide reproducible, open-source software artifacts released under the Creative Commons Attribution (CC-BY) license.
  • Presented results through 6 publications at international conferences and in scientific journals. Delivered technical talks, poster presentations and guest lectures to academic and industry audiences, translating technical concepts into actionable insights for diverse stakeholders.
  • Managed 2 M.Sc. research projects and an edX MOOC on Unix tools with 1,000+ active students.

Deep Learning Research Engineer (Internship), Netherlands eScience Centre, Amsterdam, The Netherlands, July 2019 - March 2020

  • Built an end-to-end deep learning pipeline for real-time neutrino detection in the KM3NeT Neutrino Telescope using Python, Pandas and PyTorch. Project website: https://github.com/arumoy-shome/km3net.
  • Developed a novel signal processing pipeline using Multilayer Perceptrons, achieving 92% accuracy in data filtration quality and a 12% improvement over the state-of-the-art GPU-based solution.
  • Designed and implemented a Graph Convolutional Neural Network for event node classification in particle physics data, achieving 67% accuracy on event detection.
  • Collaborated with an interdisciplinary team of particle physicists, GPU engineers, and computer scientists to translate complex AI requirements into practical implementations.

Software Engineer (Internship), Shopify, Ottawa, Canada, September 2015 - September 2016

  • Collaborated with developers, designers and product managers to implement 25+ UI/UX features such as web components, animations and styling on a mature Ruby on Rails project using Ruby, JavaScript, HTML, and CSS.
  • Applied Object-Oriented Programming (OOP) principles and Test-Driven Development (TDD) to refactor code and improve test coverage by 7%.
  • Used Object-Relational Mapping (ORM) to optimize database queries and keep server response times under 100 ms.

Projects

3D Kadaster, September 2018 - December 2018

  • Developed a 3D model of all buildings in the Netherlands using the AHN2 point cloud dataset (1.6 TB) and the BAG building polygon dataset (177 GB).
  • Processed geospatial datasets using the PySpark distributed computing framework for scalable data processing.
  • Executed algorithms on the SURFsara supercomputer infrastructure for high-performance geospatial analysis.
  • Created interactive 3D visualizations using Three.js for web-based exploration of national building infrastructure.

ACE: Art, Color and Emotions, https://youtu.be/B1ZM6EQgEvU, January 2019 - June 2019

  • Built ACE, a visual sentiment analysis platform, by developing custom ML models trained on the large-scale OmniArt dataset (512 GB) to enable data-driven analysis of artistic emotions.
  • Designed and implemented a full-stack solution featuring an intuitive D3.js interface with optimized interaction patterns and a scalable web architecture capable of handling high-volume image processing and real-time sentiment analysis.

Elevate, https://arumoy.me/elevate, September 2016 - May 2017

  • Developed an improved and cost-effective alternative to state-of-the-art cognitive assessment tools for Down syndrome using web technologies, adaptive learning, and human-centric design.
  • Validated and refined the prototype through 3 usability testing sessions with human participants with Down syndrome.
  • Established 2 international research partnerships with Waterloo Regional Down Syndrome Society (Canada) and Fundacion Paraiso Down (El Salvador) to explore specialized educational resource needs through comprehensive user surveys and interviews.
  • Developed a comprehensive business plan and presented the product at multiple startup incubators and pitch competitions to obtain funding.

Education

  • PhD Computer Science, Software Engineering for Artificial Intelligence. Delft University of Technology. Delft, The Netherlands. Present.
  • M.Sc. Computer Science, Big Data Engineering. VU University Amsterdam. Amsterdam, The Netherlands. 2018-2020.
  • B.Sc. Applied Science, Systems Design Engineering, Minor in Entrepreneurship. University of Waterloo. Waterloo, Canada. 2012-2018.

Languages

English (Native), Dutch (Basic - A2), Hindi (Native), Bengali (Native)
