Pandas Crosstab Function in Python: How to utilize it in a better way?

Introduction

When I am using group by(),pivot_table() to analyze the data and do data manipulation methods like a heat map, pandas introduce me to a brand-new function called crosstab(). The crosstab function is used to calculate the cross-tabulation, which helps me to understand the relationship between variables easily. In this article, we discuss how to use crosstab() and why you should use it instead of other similar functions. They are widely used in polling, business intelligence, engineering, and scientific research. It provides a basic picture of the relationship between two variables and can aid in the discovery of inter-relationship between them.

In Data Analysis, Pandas has many functions to analyze the data, group the data, and Summarize it. A crosstab (also called a contingency table or cross-tabulation) is a table that displays the frequency distribution of one variable in rows and another in columns.

OverView

  1. Pandas Crosstab Parameters
  2. Pandas Crosstab Basics
  3. Pandas Crosstab with aggregation functions
  4. Pandas Crosstab Vs Pivot table Vs Groupby functions
  5. Pandas Crosstab using plot
  6. Conclusion

Pandas Crosstab Parameters

Pandas Crosstab syntax is,

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)

The table represents the parameters in the crosstab function:

NameDescription Type
IndexValues to group by in rowsArray/List/Series
ColumnsValues to group by in columnsArray/Series
ValuesThe array of Values to aggregate based on factors.Array
rownamesIf passed, the number of row arrays must match.Sequential
colnamesIf passed, the number of columns arrays must match.Sequential
aggfuncIf specified, values must also be specified.Function
marginsInclude row/column margins (subtotals).Bool
margins_nameWhen margins are True, the name of the row/column that will contain the totals is used.Str
dropnaColumns with all NaN values should not be included.Boolean
normalizeNormalize by dividing all values by their sum.bool, {‘all’, ‘index’, ‘columns’}, or {0,1}
Pandas Crosstab Parameters and their Description

Pandas CrossTab Basics

Learning the Pandas Crosstab method by implementing it on Credit Card Customer Dataset

import pandas as pd
df = pd.read_csv(r"./BUGGY_PROGRAMMER/DATASET/BankChurners.csv")
df.head()
image 3 crosstab function,crosstab,pandas,python
Credit_Card Data Description

crosstab() always returns a data frame, as shown below. The data frame is a cross-tabulation of two variables: Education_level and Gender. Cross tabulation simply means displaying one variable’s groups as indexes and the other variable’s groups as columns.

Implementation of Crosstab function using Python
crosstab_table = pd.crosstab(df['Education_Level'],df['Gender'])
print(crostab_table)
The output of Crosstab function and their relationship
Relationship between Education_level and Gender

Here, We are implementing crosstab function to get relationship pattern between Income_Category and Gender Variables

"""Implementing Crosstab function using Python"""
crosstab_table2 =pd.crosstab(df['Income_Category'],df['Gender'])
print(crosstab_table2)
Cross tab relationship between Income_category and Gender
Relationship between Income Category Vs Gender
table3 = pd.crosstab(df['Gender'],df['Income_Category'])
print(table3)
Gender Vs Income_Category
Gender Vs Income_Category

Pandas Crosstab function Sum

Here sum function is used to add values of crosstab

sum_of_crosstab = pd.crosstab(df['Gender'],df['Income_Category']).sum()
print(sum_of_crosstab)
Cross tab sum of values
Cross tab sum of values

Pandas Cross tab Normalize

  1. If ‘all’ or True is specified, the overall data will be normalized.
  2. If ‘index’ is given, it will normalize each row.
  3. If ‘columns’ are supplied, it will normalize each column.
  4. If margins are set to True, the margin values will be normalized as well.
Implementing Normalize using Python
"""Normalizing the overall data values"""
normalize_all = pd.crosstab(df['Gender'],df['Income_Category'],normalize='all')
print(normalize_all)

Here, We are normalizing the overall data values(rows and columns) of two variables Income_Category and Gender with percentages

Crosstab Normalize the overall data
Crosstab Normalize the overall data
"""Normalizing the row / index values"""
normalize_row = pd.crosstab(df['Gender'],df['Income_Category'],normalize='index')
print(normalize_row)
image 9 crosstab function,crosstab,pandas,python
Crosstab normalization of row/index values
"""Normalizing the columns values"""
normalize_columns = pd.crosstab(df['Gender'],df['Income_Category'],normalize='columns')
print(normalize_columns)
Crosstab Normalization for columns
Crosstab Normalization for columns
normalize_true = pd.crosstab(df['Gender'],df['Income_Category'],normalize=True)
print(normalize_true)
Crosstab Normalization by parsing true
Crosstab Normalization by parsing true
custom_pd.crosstab(index=df['Card_Category'],
            columns=df['Attrition_Flag'],
            margins=True,
            margins_name='Total Percentage',
            normalize=True)
Crosstab Customized normalize percentage
Crosstab Customized normalize percentage

Pandas Crosstab Count

Pandas Crosstab function we can perform count operation as shown below, where the count is the total number of times a specific thing. To perform count we have the set margin to be True and margin_name is to be all, which default one, or else we can change the column name to be like Total_count.

crosstab_count = pd.crosstab(index = 
                                               df['Gender'],columns=df['Income_Category'],margins=True,margins_name="Total_count")
print(crosstab_Count)
Crosstab counting  operation
Crosstab counting operation

Pandas Crosstab Multiple

Multiple_indexes = pd.crosstab(index =[df['Gender'],df['Card_Category']],columns =df['Income_Category'])
print(Multiple_indexes)
Crosstab with MultiLevel Indexes
Crosstab with MultiLevel Indexes

Pandas Crosstab Multiple Columns

multi_col = pd.crosstab(index =df['Gender'],columns =[df['Income_Category'],df['Card_Category']])
print(multi_col)
Pandas Crosstab with Multiple Columns
Pandas Crosstab with Multiple Columns

Pandas Crosstab with aggregation function

import numpy as np
pd.crosstab(
    df.Card_Category, 
    df.Attrition_Flag, 
    values=df.Months_on_book, 
    aggfunc=np.mean
)
Average Number of Customers for Card_Category
Average Number of Customers for Card_Category
pd.crosstab(
    df.Card_Category, 
    df.Attrition_Flag, 
    values=df.Months_on_book, 
    aggfunc=np.median,
    margins_name ="Median",margins = True
)
Crosstab with Aggfunction 'Median'
Crosstab with Aggfunction ‘Median’

Pandas Crosstab Vs Pivot table Vs Groupby functions

We compare the differences between the three functions of crosstab(),pivot_table(), and groupby() in syntax and return of their results.

Group by Function in Pandas

df.groupby(['Card_Category', 'Attrition_Flag'])['Months_on_book']\
                                           .agg([np.mean]).reset_index()
Group by Card_category with the average number of customers
Group by Card_Category with the average number of customers

Pivot_table Function in Pandas

df.pivot_table(values='Months_on_book',
                     index='Card_Category',
                     columns='Attrition_Flag',
                     aggfunc=np.mean)
The Output of Pivot Table
The Output of Pivot Table

The pivot table expects your input data to already be a DataFrame, while you pass a data frame to the pivot table and specify the index, columns, and values bypassing the column names as strings. You don’t need to pass a dataFrame into crosstab because you just pass array-like objects for index, columns, and values. Crosstab has the normalized parameter to perform percentage by dividing all values by their sum where pivot table doesn’t have.

groupby()pivot_table()crosstab()
Input data should be in DataframeInput data should be in DataframeYou can pass array-like objects
Passing Columns name in String formatPassing Columns in string formatNo need to pass in string format
groupby() Vs pivot_table() Vs crosstab()

Pandas Crosstab using plot

How to prepare plots for crosstab function in pandas as shown below

Implementing the line, bar, pie charts, and heat map using python
pd.crosstab(df['Income_Category'],df['Gender']).plot(kind='line',figsize=(12,5))
Line chart of  Income_Category and Gender
Line chart of Income_Category and Gender
pd.crosstab(df['Income_Category'],df['Gender']).plot(kind='bar',figsize=(12,5))

Bar_Chart of Income_Category and Gender using Crosstab function
Bar_Chart of Income_Category and Gender using Crosstab function
pd.crosstab(df['Income_Category'],df['Gender']).plot(kind='pie',subplots =True,figsize=(12,12))
Pie_Chart for Income_category and Gender using Crosstab
Pie_Chart for Income_category and Gender using Crosstab
import seaborn as sns
cross_tab = pd.crosstab(
    df.Card_Category, 
    df.Attrition_Flag, 
    values=df.Months_on_book, 
    aggfunc=np.mean
)
sns.heatmap(cross_tab, cmap='rocket_r', annot=True, fmt='g')
Heat map of Crosstab function
Heat map of Crosstab function

Also, read more about Data Visualization – Bokeh Vs Plotly

Conclusion

In this article, We understand the crosstab function utilization by implementing with credit card customer dataset, and with help of parameters in the function, we can count the values, normalize them with percentages and we perform different types of aggregate functions like mean, median e.t.c., how to plot the different types of the plots for crosstab to have better analysis compared to pivot table and group by functions in pandas. Feel free to ask questions or comment your thoughts on the topic, if any need math and details further.

Thanks for reading the article.

cropped Bug logo 1 crosstab function,crosstab,pandas,python

Data Scientist with 3+ years of experience in building data-intensive applications in diverse industries. Proficient in predictive modeling, computer vision, natural language processing, data visualization etc. Aside from being a data scientist, I am also a blogger and photographer.

Share this post

Read next...

Subscribe
Notify of
guest

0 Comments
Newest
Oldest Most Voted
Inline Feedbacks
View all comments