AGGREGATION IN PANDAS
In this tutorial, we will learn about aggregation in pandas by exploring different aggregation functions such as min, max, sum, and mean.
Understanding Aggregation in Pandas
As we know, pandas is a great package for data analysis because it integrates flexibly with other libraries. The aggregation function applies one or more operations along the rows or columns of a dataframe to summarize the data. The syntax of the aggregation function is:
df.aggregate(func, axis=0, *args, **kwargs)
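Note: axis=0 (the default) applies the function to each column, aggregating down the index, whereas axis=1 applies it to each row, aggregating across the columns.
Let’s create a dataframe that holds some numeric values, since these aggregations are most meaningful on numeric rows or columns: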
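To make the axis argument concrete, here is a minimal sketch. The `scores` dataframe and its column names are hypothetical and exist only for this illustration; they are not part of the tutorial's data.

```python
import pandas as pd

# Hypothetical numeric frame used only to illustrate the axis argument.
scores = pd.DataFrame(
    {"math": [80, 90, 70], "physics": [60, 85, 90]},
    index=["a", "b", "c"],
)

# axis=0 (default): the function runs down each column, one result per column.
per_column = scores.aggregate("sum", axis=0)

# axis=1: the function runs across each row, one result per row.
per_row = scores.aggregate("sum", axis=1)

print(per_column)
print(per_row)
```

Here `per_column` is indexed by the column names (math, physics), while `per_row` is indexed by the row labels (a, b, c).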
import pandas as pd

# Initialise the data as a dictionary of lists.
data = {'Name': ['Hira', 'Sanjeev', 'Rahul', 'Ali'],
        'Occupation': ['Entrepreneur', 'Doctor', 'Actor', 'Chef'],
        'Salary': [30000, 40000, 25000, 32000],
        'Age': [25, 24, 27, 29]}

# Create the DataFrame.
df = pd.DataFrame(data, index=['Second', 'Fourth', 'Fifth', 'First'])

# Print the output.
print(df)
Let’s apply the aggregation function to our dataframe and find the min and max values of its columns, including Salary and Age:
df.agg(['min','max'])
Output:
| | Name | Occupation | Salary | Age |
---|---|---|---|---|
min | Ali | Actor | 25000 | 24 |
max | Sanjeev | Entrepreneur | 40000 | 29 |
The returned data looks confusing. The min and max of Salary and Age are calculated correctly, but for the text columns pandas returns the alphabetically first and last values, and every column is aggregated independently. As a result, the Occupation in a row does not correspond to the Name in the same row, and neither belongs to the person with that salary or age. So what alternative do we have?
We can also apply the aggregation functions separately to the labels we want. Let’s use some of the aggregate functions on a specific label:
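One possible way to avoid the mix-up is to aggregate only the numeric columns. The sketch below rebuilds the same dataframe and shows two options: selecting the columns first, or passing a dictionary that maps each column to the functions it should receive. The variable names `numeric_summary` and `per_column_summary` are chosen here for illustration.

```python
import pandas as pd

data = {'Name': ['Hira', 'Sanjeev', 'Rahul', 'Ali'],
        'Occupation': ['Entrepreneur', 'Doctor', 'Actor', 'Chef'],
        'Salary': [30000, 40000, 25000, 32000],
        'Age': [25, 24, 27, 29]}
df = pd.DataFrame(data, index=['Second', 'Fourth', 'Fifth', 'First'])

# Option 1: select the numeric columns explicitly, then aggregate.
numeric_summary = df[['Salary', 'Age']].agg(['min', 'max'])

# Option 2: map each column to its own list of functions.
# Cells for functions that were not requested for a column are NaN.
per_column_summary = df.agg({'Salary': ['min', 'max'], 'Age': ['mean']})

print(numeric_summary)
print(per_column_summary)
```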
Aggregation in Pandas: Max Function
# Using the max function on Salary
df['Salary'].max()
Output:
40000
Aggregation in Pandas: Mean Function
# Using the mean function on Salary
df['Salary'].mean()
Output:
31750.0
Aggregation in Pandas: Median Function
# Using the median function on Salary
df['Salary'].median()
Output:
31000.0
Aggregation in Pandas: Sum Function
# Using the sum function on Salary
df['Salary'].sum()
Output:
127000
Aggregation in Pandas: Standard Deviation Function
# Using the std (standard deviation) function on Salary
df['Salary'].std()
Output:
6238.322424070967
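The separate calls above can also be collected in a single step by passing a list of function names to `agg` on the column. This is a minimal sketch that rebuilds only the columns it needs; `salary_stats` is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Salary': [30000, 40000, 25000, 32000], 'Age': [25, 24, 27, 29]},
    index=['Second', 'Fourth', 'Fifth', 'First'],
)

# One call returns a Series with one entry per requested function.
salary_stats = df['Salary'].agg(['max', 'mean', 'median', 'sum', 'std'])
print(salary_stats)
```

Each value can then be read by label, for example `salary_stats['median']`.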
Aggregation in Pandas: Describe Function
# Using the describe function on the whole dataframe (only numeric columns are summarized by default)
df.describe()
Output:
| | Salary | Age |
---|---|---|
count | 4.000000 | 4.000000 |
mean | 31750.000000 | 26.250000 |
std | 6238.322424 | 2.217356 |
min | 25000.000000 | 24.000000 |
25% | 28750.000000 | 24.750000 |
50% | 31000.000000 | 26.000000 |
75% | 34000.000000 | 27.500000 |
max | 40000.000000 | 29.000000 |
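As a small variation, describe can also be called on a single column, or asked to include the text columns as well. The sketch below assumes the same dataframe as above; `salary_summary` and `full_summary` are illustrative names.

```python
import pandas as pd

data = {'Name': ['Hira', 'Sanjeev', 'Rahul', 'Ali'],
        'Occupation': ['Entrepreneur', 'Doctor', 'Actor', 'Chef'],
        'Salary': [30000, 40000, 25000, 32000],
        'Age': [25, 24, 27, 29]}
df = pd.DataFrame(data, index=['Second', 'Fourth', 'Fifth', 'First'])

# Summary statistics for one column only (returns a Series).
salary_summary = df['Salary'].describe()

# include='all' adds the text columns, with rows such as unique, top and freq.
full_summary = df.describe(include='all')

print(salary_summary)
print(full_summary)
```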
This is an important tutorial in this series, as it covers the basic aggregation functions for working with data, such as min, max, sum, mean, median, std, and describe. Another important way of condensing a dataframe into a summary is groupby, where you classify rows by the values of chosen columns and then apply aggregation functions to each group.
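As a brief preview of groupby, here is a minimal sketch. The `Department` column and its values are invented for this example, since every Occupation in the tutorial's dataframe is unique and would produce one group per person.

```python
import pandas as pd

# Hypothetical data: 'Department' is an assumed column added for illustration.
staff = pd.DataFrame({
    'Name': ['Hira', 'Sanjeev', 'Rahul', 'Ali'],
    'Department': ['Health', 'Health', 'Media', 'Media'],
    'Salary': [30000, 40000, 25000, 32000],
})

# Group the rows by department, then aggregate Salary within each group.
by_department = staff.groupby('Department')['Salary'].agg(['min', 'max', 'mean'])
print(by_department)
```

The result has one row per department and one column per aggregation function.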