Data Smells
Title | Description |
---|---|
Binary missing values | A large amount of missing data concentrated within a single column (as opposed to being distributed across rows and columns) can be a smell that the missing values implicitly encode a negative binary response. |
Binning categorical features | One-hot encoding a feature with high cardinality results in a large feature space and incurs higher memory, disk space and computation costs. |
Correlated Features | Correlated features present an opportunity to perform feature selection and drop redundant features which do not affect the model’s performance. |
Duplicate examples | Duplicate examples make the dataset “bloated” and can lead to overfitting. |
Hierarchy from label encoding | Label encoding sensitive categorical features can introduce unwanted hierarchy amongst the values and lead to biased predictions. |
Imbalanced examples | Presence of unbalanced examples for the classes in a dataset can lead to biased predictions. |
Nulltype Missing Values | Missing values are ignored by data analysis tools when performing statistical computations, leading to inaccurate and biased conclusions. |
Numerical feature as string | String features with names that indicate a numerical data type (e.g., “current_ver”, “android_ver”) are a smell that the data type of the column was identified incorrectly by the data analysis tool. |
Presence of sensitive features | Presence of sensitive features such as sex, gender, race or income can lead to biased and unfair model predictions. |
Special missing values | Using special characters (“?”), keywords (“null”, “nil”) and numbers (-9999, -6666) to represent missing values is a smell for problems in downstream stages. |
Strings in human-friendly formats | Numerical information being represented in a human-friendly format (“90 min”, “2 seasons”) is a smell for potential problems during the data analysis stage. |
Strings with special characters | The presence of leading and trailing whitespaces and special characters such as punctuation marks is a smell for potential problems in the data analysis stage. |
Unique Identifiers | Columns containing unique identifiers are redundant when training machine learning models and may lead to problems in downstream stages. |
Unknown unit of measure | Lack of a common unit of measure (distance, area, size, etc.) for all numerical features is an early indicator of potential problems that can arise during model training. |
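
A few of the smells in the table above can be checked mechanically. The following is a minimal sketch, not a definitive detector: it assumes the dataset is a list of dictionaries with identical keys, and the function name `find_smells`, the sentinel set and the regular expression are all hypothetical choices made for illustration. It covers special missing values, strings in human-friendly formats, leading or trailing whitespace, unique identifiers and duplicate examples.

```python
import re
from collections import Counter

# Hypothetical sentinel tokens, taken from the examples in the table above.
SPECIAL_MISSING = {"?", "null", "nil", "-9999", "-6666"}
# Matches a number followed by a unit word, e.g. "90 min" or "2 seasons".
HUMAN_FRIENDLY = re.compile(r"^\d+(?:\.\d+)?\s+[A-Za-z]+$")


def find_smells(rows):
    """Return a dict mapping each smell to the affected columns (or a count)."""
    report = {
        "special_missing_values": set(),
        "human_friendly_strings": set(),
        "padded_strings": set(),
        "unique_identifiers": set(),
        "duplicate_examples": 0,
    }
    if not rows:
        return report

    # Duplicate examples: count rows that exactly repeat an earlier row.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    report["duplicate_examples"] = sum(n - 1 for n in counts.values())

    for col in rows[0]:
        values = [row[col] for row in rows]
        for value in values:
            # Special missing values: compare the text form with the sentinels.
            if str(value).strip().lower() in SPECIAL_MISSING:
                report["special_missing_values"].add(col)
            if isinstance(value, str):
                # Strings with leading or trailing whitespace.
                if value != value.strip():
                    report["padded_strings"].add(col)
                # Numerical information written in a human-friendly format.
                if HUMAN_FRIENDLY.match(value.strip()):
                    report["human_friendly_strings"].add(col)
        # Unique identifiers (heuristic): every value in the column is distinct.
        if len(rows) > 1 and len(set(values)) == len(values):
            report["unique_identifiers"].add(col)
    return report


# Made-up sample data, for illustration only.
rows = [
    {"id": 1, "duration": "90 min", "rating": "PG", "country": " US"},
    {"id": 2, "duration": "2 seasons", "rating": "?", "country": "UK"},
    {"id": 3, "duration": "90 min", "rating": "PG", "country": "UK"},
]
report = find_smells(rows)
print(report)
```

The uniqueness check is only a heuristic: a genuinely informative column whose values happen to be distinct would also be flagged, so the report is best read as a list of candidates for manual review rather than a verdict.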