Book 4 — Data Analysis with Python

Python for All

Chapter Six — Groupby and Aggregation

Thanasis Troboukis  ·  All Books

Book Four · Chapter Six

Groupby and Aggregation

The single most powerful pattern in data analysis: split your data into groups, compute a statistic for each group, and compare. pandas makes this a one-liner with .groupby().

Split–Apply–Combine

In Chapter 5 you computed average prices by filtering for each category individually — one filter call per category. That works for five categories. It does not scale to fifty, or five hundred. .groupby() solves this elegantly.

The idea behind groupby is called split–apply–combine:

  1. Split — divide the DataFrame into groups based on the values of one column.
  2. Apply — compute a statistic (mean, sum, count…) for each group independently.
  3. Combine — gather the results into a new table, one row per group.
Python · Try it — groupby mean

      

In one line you get the mean price for every category at once. The result is a Series with the category names as the index. From this alone you can already report: meat and fish are by far the most expensive categories; grains and vegetables are the most affordable.

Different Aggregations

You can call any of the statistical methods from Chapter 5 on a grouped Series: .mean(), .median(), .sum(), .min(), .max(), .count(), .std().

Python · Try it

      

.agg() — Multiple Statistics at Once

When you want several statistics per group simultaneously, .agg() (short for aggregate) lets you pass a list of function names. The result is a clean table — perfect for a published data summary.

Python · Try it — .agg() with a list

      

This is publication-ready output. You can see at a glance that the Fish category has the widest range (€1.59 to €8.99) and the Meat category has the highest mean price per item. This type of table is the backbone of most data-driven food price stories.

Naming your columns: You can give the aggregated columns custom names using a dictionary inside .agg():
df.groupby("category")["price"].agg(items="count", avg="mean", max_price="max")
Python · Try it — named aggregations

      

Groupby Multiple Columns

You can group by more than one column at a time by passing a list to .groupby(). The result has a multi-level index — one level per grouping column. This lets you cross-tabulate, for example: average price by category and organic status.

Python · Try it — two-column groupby

      

The output shows a MultiIndex — categories on the outer level, True/False on the inner level. For categories where only organic or only non-organic items exist (like Bakery), only one row appears. This kind of cross-tabulation is extremely useful for investigating whether the organic premium varies by food category.

Your Turn — Category Price Report

Use .groupby() and .agg() to build a summary table that a journalist could use in a food price article. Include: number of items per category, average price, and the range (max − min) per category.

Python · Your turn

      
What you learned in this chapter: the split–apply–combine concept; .groupby() with a single column and with multiple columns; computing statistics per group; using .agg() to get multiple statistics in one table; and naming aggregated columns. Next chapter: sorting results and adding computed columns to your DataFrame.

Chapter Navigation

Move between chapters.

Loading Python environment — this may take a moment…