Data Smells

Title	Description
Binary missing values	Presence of high quantities of missing data primarily within a column (as opposed to being distributed across rows and columns) can be a smell that the missing data might carry implicit meaning of a negative binary response.
Binning categorical features	One-hot encoding a feature with high cardinality results in a large feature space and incurs higher memory, disk space and computation costs.
Correlated Features	Correlated features present an opportunity to perform feature selection and drop redundant features which do not affect the model’s performance.
Duplicate examples	Duplicate examples make the dataset “bloated” and can lead to overfitting.
Hierarchy from label encoding	Label encoding sensitive categorical features can introduce unwanted hierarchy amongst the values and lead to biased predictions.
Imbalanced examples	Presence of unbalanced examples for the classes in a dataset can lead to biased predictions.
Nulltype Missing Values	Missing values are ignored by data analysis tools when performing statistical computations, leading to inaccurate and biased conclusions.
Numerical feature as string	String features with names that indicate a numerical data type (e.g., “current_ver”, “android_ver”) are a smell that the data type of the column was identified incorrectly by the data analysis tool.
Presence of sensitive features	Presence of sensitive features such as sex, gender, race or income can lead to biased and unfair model predictions.
Special missing values	Using special characters (“?”), keywords (“null”, “nil”) and numbers (-9999, -6666) to represent missing values is a smell for problems in downstream stages.
Strings in human-friendly formats	Numerical information being represented in a human-friendly format (“90 min”, “2 seasons”) is a smell for potential problems during the data analysis stage.
Strings with special characters	The presence of leading and trailing whitespaces and special characters such as punctuation marks is a smell for potential problems in the data analysis stage.
Unique Identifiers	Columns containing unique identifiers are redundant when training machine learning models and may lead to problems in downstream stages.
Unknown unit of measure	Lack of a common unit of measure (distance, area, size, etc.) for all numerical features is an early indicator of potential problems that can arise during model training.
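
Several of the smells above can be flagged with simple per-column checks. The following is a minimal sketch in standard-library Python, not a definitive detector: the function names (detect_smells, count_duplicates), the sentinel set, the regular expressions, and the sample rows are all assumptions made for illustration. The sentinel tokens and the "90 min" style values are borrowed from the examples in the table. The unique-identifier check is a rough heuristic (every value in the column is distinct) and will also fire on any column that happens to have no repeats.

```python
import re
from collections import Counter

# Assumed sentinel tokens, borrowed from the examples in the table.
SPECIAL_MISSING = {"?", "null", "nil", "-9999", "-6666"}
# A number followed by a word, e.g. "90 min" or "2 seasons".
HUMAN_FRIENDLY = re.compile(r"^\s*\d+(\.\d+)?\s*[A-Za-z]+\s*$")
# A string that holds nothing but a number, e.g. "4.5".
NUMERIC = re.compile(r"^\s*-?\d+(\.\d+)?\s*$")


def detect_smells(rows):
    """Return {column: sorted smell names} for a list of dict rows."""
    smells = {}
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [row[col] for row in rows]
        strings = [v for v in values if isinstance(v, str)]
        # Strings that are real values rather than missing-value sentinels.
        present = [s for s in strings if s.strip().lower() not in SPECIAL_MISSING]
        found = set()
        if any(v is None for v in values):
            found.add("null-type missing values")
        if any(str(v).strip().lower() in SPECIAL_MISSING
               for v in values if v is not None):
            found.add("special missing values")
        if present and all(NUMERIC.match(s) for s in present):
            found.add("numerical feature as string")
        if any(HUMAN_FRIENDLY.match(s) for s in present):
            found.add("strings in human-friendly formats")
        if any(s != s.strip() for s in strings):
            found.add("leading or trailing whitespace")
        # Heuristic: every value distinct suggests an identifier column.
        if len(values) > 1 and len(set(values)) == len(values):
            found.add("possible unique identifier")
        if found:
            smells[col] = sorted(found)
    return smells


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


# Made-up sample data, for illustration only.
rows = [
    {"id": "a1", "duration": "90 min", "rating": "4.5", "genre": " drama"},
    {"id": "a2", "duration": "2 seasons", "rating": "?", "genre": "comedy"},
    {"id": "a3", "duration": "90 min", "rating": "4.5", "genre": None},
    {"id": "a4", "duration": "45 min", "rating": "3.0", "genre": "comedy"},
]

report = detect_smells(rows)
for column, names in report.items():
    print(column, "->", ", ".join(names))
print("duplicate rows:", count_duplicates(rows + [rows[0]]))
```

Smells that depend on the modelling context (sensitive features, unknown units of measure, class imbalance, correlated features) are deliberately left out of this sketch, since they need domain knowledge or statistics beyond a per-value check.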

Categories