Data Smells

| Title | Description |
| --- | --- |
| Binary missing values | A high proportion of missing data concentrated within a single column (as opposed to being distributed across rows and columns) can be a smell that the missing values carry the implicit meaning of a negative binary response. |
| Binning categorical features | One-hot encoding a feature with high cardinality results in a large feature space and incurs higher memory, disk space and computation costs. |
| Correlated Features | Correlated features present an opportunity to perform feature selection and drop redundant features which do not affect the model’s performance. |
| Duplicate examples | Duplicate examples make the dataset “bloated” and can lead to overfitting. |
| Hierarchy from label encoding | Label encoding sensitive categorical features can introduce unwanted hierarchy amongst the values and lead to biased predictions. |
| Imbalanced examples | An unbalanced number of examples across the classes in a dataset can lead to biased predictions. |
| Nulltype Missing Values | Missing values are ignored by data analysis tools when performing statistical computations, leading to inaccurate and biased conclusions. |
| Numerical feature as string | String features with names that indicate a numerical data type (e.g., “current_ver”, “android_ver”) are a smell that the data analysis tool identified the column's data type incorrectly. |
| Presence of sensitive features | Presence of sensitive features such as sex, gender, race or income can lead to biased and unfair model predictions. |
| Special missing values | Using special characters (“?”), keywords (“null”, “nil”) or numbers (-9999, -6666) to represent missing values is a smell for problems in downstream stages. |
| Strings in human-friendly formats | Numerical information represented in a human-friendly format (“90 min”, “2 seasons”) is a smell for potential problems during the data analysis stage. |
| Strings with special characters | The presence of leading and trailing whitespace and special characters such as punctuation marks is a smell for potential problems in the data analysis stage. |
| Unique Identifiers | Columns containing unique identifiers are redundant when training machine learning models and may lead to problems in downstream stages. |
| Unknown unit of measure | Lack of a common unit of measure (distance, area, size, etc.) for all numerical features is an early indicator of potential problems that can arise during model training. |
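
Several of the smells listed above can be flagged with simple heuristics. The following is a minimal plain-Python sketch, not a reference implementation. The function names `find_smells` and `duplicate_rows`, the `SENTINELS` set, the `missing_ratio` threshold of 0.5, and the sample column names are all assumptions made for illustration. The checks are deliberately crude: for example, any column whose values are all distinct is reported as a unique identifier, which will also catch some legitimate continuous features.

```python
import re
from collections import Counter

# Hypothetical sentinel values taken from the examples in the catalogue;
# extend this set for your own data.
SENTINELS = {"?", "null", "nil", "-9999", "-6666"}

# A number followed by a word, e.g. "90 min" or "2 seasons".
HUMAN_FORMAT = re.compile(r"^\s*\d+(\.\d+)?\s+[A-Za-z]+\s*$")


def find_smells(columns, missing_ratio=0.5):
    """Return {column name: [smell titles]} for a dict of column -> list of values.

    Missing entries are expected to be None; values must be hashable.
    """
    report = {}
    for name, values in columns.items():
        smells = []
        present = [v for v in values if v is not None]
        strings = [v for v in present if isinstance(v, str)]

        # Binary missing values: most of this one column is empty.
        if values and (len(values) - len(present)) / len(values) >= missing_ratio:
            smells.append("Binary missing values")

        # Special missing values: a sentinel stands in for "missing".
        if any(str(v).strip().lower() in SENTINELS for v in present):
            smells.append("Special missing values")

        # Strings in human-friendly formats.
        if any(HUMAN_FORMAT.match(s) for s in strings):
            smells.append("Strings in human-friendly formats")

        # Strings with special characters: only leading and trailing
        # whitespace is checked in this sketch.
        if any(s != s.strip() for s in strings):
            smells.append("Strings with special characters")

        # Unique identifiers: every present value is distinct.
        if len(present) > 1 and len(set(present)) == len(present):
            smells.append("Unique Identifiers")

        if smells:
            report[name] = smells
    return report


def duplicate_rows(columns):
    """Count rows that repeat an earlier row (the 'Duplicate examples' smell)."""
    rows = list(zip(*columns.values()))
    return sum(count - 1 for count in Counter(rows).values())


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "app_id": ["a1", "a2", "a3", "a4"],
        "duration": ["90 min", "2 seasons", "90 min", "?"],
        "promo_code": [None, None, "X1", None],
        "title": [" Up", "Heat", "Heat", "Up"],
    }
    for column, smells in find_smells(sample).items():
        print(column, "->", ", ".join(smells))
    print("duplicate rows:", duplicate_rows(sample))
```

Smells that depend on the modelling context, such as sensitive features, hierarchy from label encoding, or an unknown unit of measure, are not covered here because they generally need domain knowledge rather than a value-level check.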