Having tidy data in Python is every analyst's dream. In real-world scenarios, however, data is usually untidy and needs a lot of work before it can be used in an analysis or a machine learning model. Small errors in the way your data is recorded, stored, or merged can make your entire analysis go haywire. This is where tidy data in Python comes into the picture.
In this article, we'll go over a short list of the most common problems to check for in a dataset when converting raw data to tidy data in Python.
Note: This list is NOT exhaustive.
Tidy Data in Python: The Basic Checklist
To convert raw data to tidy data in Python, we'll work through the following checklist:
- Dropping unnecessary columns
- Changing indexes
- Renaming columns
- Checking correlations for similarity
- Checking Datatypes
For a quick pandas tutorial, check out this link: Getting Started with Community Tutorials on Pandas
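It is important to import pandas and inspect your data with a few summary commands first, so that you don't dive into data cleaning blindly. The following code shows some summary statistics of your data.
import pandas as pd

df = pd.read_csv('data.csv')  # placeholder path; load your own dataset here
df.info()      # column names, non-null counts and data types
df.head()      # first five rows
df.describe()  # summary statistics for numeric columns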
Dropping Unnecessary Columns:
Suppose your data describes housing in the state of California, with details of almost 5,000 houses: the address, the number of rooms, the number of bathrooms, whether the house has a backyard, the number of floors, whether it has a balcony, whether it is sea-facing, its longitude and latitude, and so on. Your target variable to predict is the price of the house.
In this case, if you fit a simple regression model (linear or multiple linear) to your data, the model will take every numeric column to be a continuous variable. Discrete variables can be marked as such so that the model handles them correctly, but what about columns like longitude and latitude? The model will treat them as ordinary continuous quantities, when in fact they are geographic coordinates whose raw values have no simple linear relationship with price.
Pushing such columns into a model as they are is not an example of tidy data in Python.
Therefore, it is imperative to know how to drop columns that are not needed in your analysis using python. Use the following code to drop a column in Pandas.
df.drop('column_name', inplace=True, axis=1)
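As a minimal sketch of the step above, the example below drops two columns in one call. The `houses` DataFrame, its column names, and its values are made up for illustration, and the `columns=` keyword is used here as an alternative to `axis=1`.

```python
import pandas as pd

# Hypothetical housing data; names and values are invented for illustration.
houses = pd.DataFrame({
    "rooms": [3, 4, 2],
    "bathrooms": [2, 3, 1],
    "longitude": [-122.23, -118.24, -117.16],
    "latitude": [37.88, 34.05, 32.72],
    "price": [450000, 620000, 380000],
})

# Drop several columns at once. drop() returns a new DataFrame, so the
# original `houses` is left untouched. A misspelled column name raises a
# KeyError, which helps catch typos early.
columns_to_drop = ["longitude", "latitude"]
tidy_houses = houses.drop(columns=columns_to_drop)

print(tidy_houses.columns.tolist())
```

Assigning the result to a new name instead of using `inplace=True` keeps the raw data available in case a dropped column turns out to be needed later.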
Changing Indexes:
If your data has a column such as an ID number or another unique identifier that you want to use as the row index instead of the default serial number, you can change the index with Pandas, which is fairly easy. Locating rows is simpler when the index is the unique identifier in your dataset.
This is commonly done in HR datasets of employee records, library catalogues, or any other dataset whose entries each have a unique identifier.
To change your index using Pandas, use the following code:
df = df.set_index('column_name')
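The following sketch shows why a unique identifier makes a convenient index. The `employees` data and the `employee_id` column are hypothetical. Passing `verify_integrity=True` is optional and is used here so that duplicate identifiers raise an error instead of silently producing an ambiguous index.

```python
import pandas as pd

# Hypothetical HR data; names and IDs are invented for illustration.
employees = pd.DataFrame({
    "employee_id": ["E001", "E002", "E003"],
    "name": ["Asha", "Ben", "Chen"],
    "department": ["HR", "Finance", "IT"],
})

# verify_integrity=True raises a ValueError if the new index has duplicates.
employees = employees.set_index("employee_id", verify_integrity=True)

# Rows can now be looked up directly by identifier with .loc.
ben_department = employees.loc["E002", "department"]
print(ben_department)
```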
Renaming Columns:
Sometimes columns are stored under abbreviated names to save space. This can make a dataset hard to work with in Python: your analysis can be perfect, but if readers don't know what a column name means, the results are not clearly presented. For tidy data in Python, column names should therefore make sense and be easily readable. For example, if columns like first name, middle name, and last name are entered as 1, 2, 3, nobody can make sense of them.
To rename columns, use the following code:
df.rename(columns={'old_name' : 'new_name'}, inplace=True)
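To illustrate the first name, middle name, last name case, here is a small sketch that renames several columns with one mapping. The `people` DataFrame and its values are invented. The `errors="raise"` argument is an optional choice that makes `rename` fail if a key in the mapping does not match an existing column.

```python
import pandas as pd

# Hypothetical data with uninformative column names.
people = pd.DataFrame({
    "1": ["Maria"],
    "2": ["Elena"],
    "3": ["Lopez"],
})

# Map each old name to a readable one.
readable_names = {"1": "first_name", "2": "middle_name", "3": "last_name"}

# errors="raise" turns a mistyped old name into a KeyError rather than
# silently leaving the column unchanged.
people = people.rename(columns=readable_names, errors="raise")

print(people.columns.tolist())
```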
Checking correlations for similarity:
Occasionally, two columns that are almost identical are given different names and both put into a model like regression, where the second one adds little or no new information. Tidy data in Python should not contain two or three columns that are very similar in nature. For example, the number of bedrooms and the number of bathrooms move together in almost all houses, with only a few anomalies, so they will have similar correlations with the other columns and one of them can be dropped.
To check correlations, use the following code:
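import seaborn as sns

corr = df.corr(numeric_only=True)  # numeric columns only (requires pandas 1.5 or later)
sns.heatmap(corr, annot=True)      # annotate each cell with its correlation value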
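A heatmap is easy to read for a handful of columns, but with many columns it can help to list the highly correlated pairs directly. The sketch below is one possible way to do that. The `houses` data is invented, and the 0.95 threshold is an arbitrary assumption that you should adjust for your own dataset. The target column is left out because the goal is to find redundant features.

```python
import numpy as np
import pandas as pd

# Hypothetical housing features; values are invented for illustration.
houses = pd.DataFrame({
    "bedrooms": [2, 3, 4, 5, 3, 2],
    "bathrooms": [1, 2, 3, 4, 2, 1],
    "floors": [1, 2, 1, 2, 2, 1],
    "price": [300, 420, 510, 640, 400, 310],
})

# Absolute correlations between the features only (target excluded).
corr = houses.drop(columns="price").corr().abs()

# Keep the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cut-off, not a universal rule
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > threshold
]

print(high_pairs)
```

Each pair returned is a candidate for dropping one of its two columns. Which one to keep is a judgment call that depends on your analysis.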
Checking Datatypes
Consider how data is recorded in a survey with, say, just two questions: the date and the respondent's height on that date. Even here, there are many possible sources of error.
The date can be entered in any of the following formats:
DD/MM
DD/MM/YYYY
DD/Month/YY
DD/Month/Year
Date/Month/Year
Date/Month
Day/DD/MM/YYYY
and Height can be entered in formats like Centimeters, Feet, Inches, Meters, etc.
Even if you specify the format answers should be entered in, an open-ended question always leaves room for error. In tidy data in Python, a single column cannot mix different data types or formats.
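To check which data types are present in your dataset, use the following code:
df.dtypes.value_counts()  # replaces df.get_dtype_counts(), which was removed in pandas 1.0
If a column holds values in several formats, you can often spot them by looking at the rows with df.head(), while df.dtypes shows the data type pandas assigned to each column.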
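A column that mixes formats usually ends up with the generic `object` data type, which hides the problem. The sketch below shows one way to look inside such columns. The `survey` data is invented, and the approach is a starting point under those assumptions, not a complete cleaning routine.

```python
import pandas as pd

# Hypothetical survey responses with inconsistent entries.
survey = pd.DataFrame({
    "date": ["12/03/2021", "2021-03-14", "not recorded"],
    "height": [172, "5.8", "170cm"],
})

# Count the distinct Python types stored in each column.
# A count above 1 means the column mixes types.
types_per_column = survey.apply(lambda col: col.map(type).nunique())

# errors="coerce" turns entries that cannot be parsed as numbers into NaN,
# so the problematic rows can be pulled out and reviewed by hand.
height_numeric = pd.to_numeric(survey["height"], errors="coerce")
unparsed_rows = survey[height_numeric.isna()]

print(types_per_column)
print(unparsed_rows)
```

Note that a value such as "5.8" parses as a number even though it is probably in feet instead of centimeters, so a successful conversion does not guarantee consistent units.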
Conclusion
Tidy data is imperative in any analysis you do, in or outside of Python. Tidy data in Python gives you more reliable results and can improve your model's score. Data cleaning, or tidying, is often seen as a painful task and one of the longest in the entire data science lifecycle, but what many people overlook is that this is the phase that can make or break your entire analysis.
Kaggle grandmasters have admitted to finding small data leakages while cleaning and exploring their data, which took them to the top of the rankings in a competition.
Data cleaning should not be skipped for any reason, and the more time you put into cleaning the data, the fewer problems you'll find in your model.
Try out the methods above and let us know if they improve your analysis.