GROUPBY IN PANDAS

In this tutorial, you will learn about the groupby() method in pandas and how to split an existing dataframe into groups based on the categories you care about.

What is Groupby in Pandas?

Pandas lets you split data into groups with the groupby() method. Grouping can be done along either axis, although grouping rows by the values of one or more columns is by far the most common case. In short, groupby means analyzing a Series or dataframe category by category.

If your dataset has a column with repeated categories, you can use those categories to split the data into subgroups. Grouping on a column whose values are all unique is not useful, because every group would contain exactly one row. Let’s look at the signature of groupby to understand it in more depth:

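DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

Note that this is the signature from older pandas releases. The squeeze parameter was removed in pandas 2.0 and axis has been deprecated since pandas 2.1, so check the documentation for the version you have installed.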
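To make a few of these parameters concrete, here is a minimal sketch on a small made-up dataframe. The `sales` data and its column names are hypothetical and are not part of the countries dataset used in the rest of this tutorial. It shows the default behaviour, the effect of `as_index=False` and `sort=False`, and why grouping on unique values gains nothing.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters.
sales = pd.DataFrame({
    "store": ["B", "A", "B", "A", "C"],
    "units": [3, 5, 2, 1, 4],
})

# Default: the group keys become the index of the result and are sorted.
default_sum = sales.groupby("store")["units"].sum()

# as_index=False keeps the group key as a regular column instead.
flat_sum = sales.groupby("store", as_index=False)["units"].sum()

# sort=False keeps the groups in order of first appearance.
unsorted_sum = sales.groupby("store", sort=False)["units"].sum()

# Grouping on a column of unique values gives one row per group.
per_unit = sales.groupby("units").size()

print(default_sum)
print(flat_sum)
print(unsorted_sum)
print(per_unit)
```

In everyday use, `by` and `as_index` are probably the two arguments you will reach for most often.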

We will load the dataset from a CSV file, countries.csv, using the read_csv() function:

import pandas as pd
countries = pd.read_csv('countries.csv')
countries.head()

Output:

[Table: the first five rows of the countries dataframe]

Suppose we want to group the dataframe by region. We can simply call groupby() with the column name:

countries.groupby('Region')

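Output:

[A DataFrameGroupBy object, displayed as its class name and memory address]

Calling groupby() does not return a dataframe. It returns a DataFrameGroupBy object that describes how the rows are split, and no aggregation is computed until we call a method on it. To reuse the grouping, we store it in a variable: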

region_groupby = countries.groupby('Region')
region_groupby

Output:

[The same DataFrameGroupBy object]

The result is still a DataFrameGroupBy object, but it is now stored in a variable, so we can call different methods on the grouped data. Let’s start with the number of rows in each region:

region_groupby.size()

Output:

Region
ASIA (EX. NEAR EAST)                   28
BALTICS                                 3
C.W. OF IND. STATES                    12
EASTERN EUROPE                         12
LATIN AMER. & CARIB                    45
NEAR EAST                              16
NORTHERN AFRICA                         6
NORTHERN AMERICA                        5
OCEANIA                                21
SUB-SAHARAN AFRICA                     51
WESTERN EUROPE                         28
dtype: int64
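Besides size(), a grouped object offers a few other ways to inspect the groups before aggregating anything. The sketch below uses a small stand-in dataframe with the same column names as the tutorial (`Country`, `Region`, `Population`); the rows and values are made up for illustration and are not taken from countries.csv.

```python
import pandas as pd

# Stand-in data; rows and values are illustrative only.
countries = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Region": ["BALTICS", "OCEANIA", "BALTICS", "NEAR EAST", "OCEANIA"],
    "Population": [100, 250, 300, 50, 150],
})
region_groupby = countries.groupby("Region")

# Number of distinct groups.
print(region_groupby.ngroups)

# Mapping from each group key to the row labels that belong to it.
print(region_groupby.groups)

# Pull out the rows of a single group as an ordinary dataframe.
baltics = region_groupby.get_group("BALTICS")
print(baltics)

# Iterating over the grouped object yields (key, sub-dataframe) pairs.
for region, frame in region_groupby:
    print(region, len(frame))
```

get_group() is handy when you want to check what actually landed in one group before trusting an aggregate.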

Next, let’s compute the total population of each region:

region_groupby.Population.sum()

Output:

Region
ASIA (EX. NEAR EAST)                   3687982236
BALTICS                                   7184974
C.W. OF IND. STATES                     280081548
EASTERN EUROPE                          119914717
LATIN AMER. & CARIB                     561824599
NEAR EAST                               195068377
NORTHERN AFRICA                         161407133
NORTHERN AMERICA                        331672307
OCEANIA                                  33131662
SUB-SAHARAN AFRICA                      749437000
WESTERN EUROPE                          396339998
Name: Population, dtype: int64
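The attribute form region_groupby.Population works because the column name is a valid Python identifier. Bracket selection is the more general form, and it also handles column names that contain spaces. Here is a short sketch on made-up data; the values and the `Land Area` column are hypothetical.

```python
import pandas as pd

# Stand-in data; the values and the "Land Area" column are hypothetical.
countries = pd.DataFrame({
    "Region": ["BALTICS", "OCEANIA", "BALTICS", "OCEANIA"],
    "Population": [100, 250, 300, 50],
    "Land Area": [10, 40, 20, 5],
})
region_groupby = countries.groupby("Region")

# Attribute and bracket selection give the same result.
by_attr = region_groupby.Population.sum()
by_bracket = region_groupby["Population"].sum()

# Bracket selection also works for names that are not valid identifiers.
area_sum = region_groupby["Land Area"].sum()

# The result is a Series, so it can be sorted like any other Series.
largest_first = by_bracket.sort_values(ascending=False)
print(largest_first)
```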

You can also apply several aggregation functions at once to the population of each region with agg():

region_groupby.Population.agg(['count','sum','min','max'])

Output:

[Table: count, sum, min, and max of Population for each Region]
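If you would like more descriptive column names in the result, agg() also accepts keyword arguments of the form new_name=(column, function), which pandas calls named aggregation. The sketch below assumes a small made-up dataframe; the output column names such as `n_countries` are chosen here for illustration.

```python
import pandas as pd

# Stand-in data; values are illustrative only.
countries = pd.DataFrame({
    "Region": ["BALTICS", "OCEANIA", "BALTICS", "OCEANIA", "OCEANIA"],
    "Population": [100, 250, 300, 50, 150],
})

# Each keyword names an output column; the tuple is (input column, function).
summary = countries.groupby("Region").agg(
    n_countries=("Population", "count"),
    total_population=("Population", "sum"),
    smallest=("Population", "min"),
    largest=("Population", "max"),
)
print(summary)
```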

Groupby in Pandas: Plotting with Matplotlib

You can also visualize the grouped result with the matplotlib library to make the analysis easier to read. For example, suppose you want to know the number of countries in each region:

import matplotlib.pyplot as plt
countries.groupby('Region')['Country'].count()

Output:

Region
ASIA (EX. NEAR EAST)                   28
BALTICS                                 3
C.W. OF IND. STATES                    12
EASTERN EUROPE                         12
LATIN AMER. & CARIB                    45
NEAR EAST                              16
NORTHERN AFRICA                         6
NORTHERN AMERICA                        5
OCEANIA                                21
SUB-SAHARAN AFRICA                     51
WESTERN EUROPE                         28
Name: Country, dtype: int64

Let’s plot the result now:

countries.groupby('Region')['Country'].count().plot(kind="barh")
plt.show()

Output:

[Figure: horizontal bar chart of the number of countries per region]
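A bar chart is usually easier to read when the bars are sorted and the axes are labelled. The following is one possible sketch, again on made-up stand-in data. The output file name is arbitrary, and the Agg backend is selected only so that the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data; country names are placeholders.
countries = pd.DataFrame({
    "Country": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "Region": ["OCEANIA", "NEAR EAST", "OCEANIA", "BALTICS", "NEAR EAST", "OCEANIA"],
})

# Count countries per region and sort so the longest bar ends up on top.
counts = countries.groupby("Region")["Country"].count().sort_values()

ax = counts.plot(kind="barh")
ax.set_xlabel("Number of countries")
ax.set_ylabel("Region")
ax.set_title("Countries per region")
plt.tight_layout()
plt.savefig("countries_per_region.png")  # arbitrary file name
```

In an interactive session you can call plt.show() instead of plt.savefig().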