Book 4 · Chapter Five — Summary Statistics

Part One

The Core Statistical Methods

Every numeric Series in pandas has a full set of statistical methods built in. You call them directly on a column. Here are the ones you will reach for most often.

Python · Try it — all the basics
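
A minimal sketch of these methods, assuming the grocery dataset defined in Part Three of this chapter (the variable name `prices` is only for illustration):

```python
import pandas as pd

# Assumption: the same grocery prices that Part Three of this chapter uses
data = {
    "item": ["Bread", "Milk", "Eggs", "Cheese", "Butter", "Olive Oil", "Pasta", "Rice",
             "Tomatoes", "Onions", "Potatoes", "Chicken", "Beef", "Salmon", "Tuna"],
    "price": [1.29, 0.99, 2.49, 3.79, 1.89, 5.49, 1.19, 0.89,
              1.49, 0.69, 0.79, 4.99, 7.49, 8.99, 1.59],
}
df = pd.DataFrame(data)

# Selecting one column gives a Series; the statistics are methods on it
prices = df["price"]

print(f"Count:  {prices.count()}")        # number of non-missing values
print(f"Sum:    €{prices.sum():.2f}")
print(f"Mean:   €{prices.mean():.2f}")
print(f"Median: €{prices.median():.2f}")
print(f"Std:    €{prices.std():.2f}")     # sample standard deviation (ddof=1)
print(f"Min:    €{prices.min():.2f}")
print(f"Max:    €{prices.max():.2f}")
```

Each method returns a single number, so the results can be stored in variables and reused in later calculations.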

The mean (€2.94) is well above the median (€1.59). A few very expensive items, such as Salmon at €8.99 and Beef at €7.49, pull the average up, while most items are inexpensive staples. The median, being the middle value, is more robust to these extremes.

Mean vs median: When reporting a "typical" price, journalists often prefer the median because it is not skewed by outliers. The mean is useful when you want to calculate totals (e.g., the average spend per item times the number of items). Always think about which measure tells the more honest story.
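
One way to see this difference in practice is to drop the two most expensive items and watch how far each measure moves. The sketch below assumes the chapter's grocery prices; the names `trimmed`, `mean_shift`, and `median_shift` are illustrative only.

```python
import pandas as pd

# Assumption: the chapter's grocery prices, indexed by item name
prices = pd.Series(
    [1.29, 0.99, 2.49, 3.79, 1.89, 5.49, 1.19, 0.89,
     1.49, 0.69, 0.79, 4.99, 7.49, 8.99, 1.59],
    index=["Bread", "Milk", "Eggs", "Cheese", "Butter", "Olive Oil", "Pasta", "Rice",
           "Tomatoes", "Onions", "Potatoes", "Chicken", "Beef", "Salmon", "Tuna"],
    name="price",
)

# Remove the two priciest items by their index labels
trimmed = prices.drop(["Salmon", "Beef"])

# How much does each measure change once the outliers are gone?
mean_shift = prices.mean() - trimmed.mean()
median_shift = prices.median() - trimmed.median()
print(f"Mean falls by:   €{mean_shift:.2f}")
print(f"Median falls by: €{median_shift:.2f}")

# The mean ties back to totals: mean × count reproduces the sum
total_from_mean = prices.mean() * prices.count()
print(f"Mean × count: €{total_from_mean:.2f}  (sum: €{prices.sum():.2f})")
```

The mean moves several times further than the median, which is the robustness described above; the last line shows why the mean remains the right tool when a total is what you need.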

Part Two

Percentiles and the Spread of Prices

Percentiles divide your data into hundredths. The 25th percentile (Q1) is the value below which 25% of the data falls. The 50th percentile is the median. The 75th percentile (Q3) is the value below which 75% of the data falls.

The gap between Q1 and Q3 is called the interquartile range (IQR) — a robust measure of spread that is not influenced by extreme values.

Python · Try it — percentiles
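
A minimal sketch of the quartiles and IQR using `.quantile()`, assuming the chapter's grocery prices. pandas interpolates linearly between neighbouring values by default, so a quartile need not equal any actual price in the data.

```python
import pandas as pd

# Assumption: the chapter's grocery prices, indexed by item name
prices = pd.Series(
    [1.29, 0.99, 2.49, 3.79, 1.89, 5.49, 1.19, 0.89,
     1.49, 0.69, 0.79, 4.99, 7.49, 8.99, 1.59],
    index=["Bread", "Milk", "Eggs", "Cheese", "Butter", "Olive Oil", "Pasta", "Rice",
           "Tomatoes", "Onions", "Potatoes", "Chicken", "Beef", "Salmon", "Tuna"],
    name="price",
)

# .quantile() takes a fraction between 0 and 1, not a percentage
q1 = prices.quantile(0.25)
q2 = prices.quantile(0.50)   # identical to prices.median()
q3 = prices.quantile(0.75)
iqr = q3 - q1

print(f"Q1 (25th percentile): €{q1:.2f}")
print(f"Median (50th):        €{q2:.2f}")
print(f"Q3 (75th percentile): €{q3:.2f}")
print(f"IQR (Q3 - Q1):        €{iqr:.2f}")

# Items priced above Q3 make up the top quarter
top_quarter = prices[prices > q3]
print(top_quarter.sort_values(ascending=False))
```

Passing a list, for example `prices.quantile([0.1, 0.5, 0.9])`, returns several percentiles at once as a Series.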

Items above the 75th percentile cost more than €4.39 (using pandas' default linear interpolation). That threshold tells you which items are in the top quarter of prices for this market, a useful data point for a story about food affordability.

Part Three

Statistics on Filtered Data

You can chain filtering and statistics together. This is where the power of pandas really begins to show. A question like "what is the average price of organic items?" becomes a single readable line.

Python · Try it

import pandas as pd

data = {
    "item": ["Bread","Milk","Eggs","Cheese","Butter","Olive Oil","Pasta","Rice","Tomatoes","Onions","Potatoes","Chicken","Beef","Salmon","Tuna"],
    "category": ["Bakery","Dairy","Dairy","Dairy","Dairy","Oils","Grains","Grains","Vegetables","Vegetables","Vegetables","Meat","Meat","Fish","Fish"],
    "price": [1.29,0.99,2.49,3.79,1.89,5.49,1.19,0.89,1.49,0.69,0.79,4.99,7.49,8.99,1.59],
    "organic": [False,True,True,False,True,True,False,False,True,True,False,True,False,True,False]
}
df = pd.DataFrame(data)

# Average price: organic vs non-organic
avg_organic     = df[df["organic"]]["price"].mean()    # boolean column used directly as the mask
avg_nonorganic  = df[~df["organic"]]["price"].mean()   # ~ inverts the mask

print(f"Average price (organic):     €{avg_organic:.2f}")
print(f"Average price (non-organic): €{avg_nonorganic:.2f}")
print(f"Organic premium:             €{(avg_organic - avg_nonorganic):.2f}")
print()

# Average price for each category
for cat in df["category"].unique():
    avg = df[df["category"] == cat]["price"].mean()
    print(f"{cat:<12}  avg €{avg:.2f}")

The output shows that organic items are more expensive on average, by roughly €0.94 in this dataset. Part of that gap, however, reflects which categories happen to contain organic items rather than the organic label itself. A rigorous comparison would control for category (see Chapter 6).
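
As a rough illustration of what controlling for category can reveal, the sketch below repeats the organic comparison inside a single category, using only the filtering shown so far. Dairy is an arbitrary choice, and the names `dairy_organic` and `dairy_conventional` are illustrative.

```python
import pandas as pd

data = {
    "item": ["Bread", "Milk", "Eggs", "Cheese", "Butter", "Olive Oil", "Pasta", "Rice",
             "Tomatoes", "Onions", "Potatoes", "Chicken", "Beef", "Salmon", "Tuna"],
    "category": ["Bakery", "Dairy", "Dairy", "Dairy", "Dairy", "Oils", "Grains", "Grains",
                 "Vegetables", "Vegetables", "Vegetables", "Meat", "Meat", "Fish", "Fish"],
    "price": [1.29, 0.99, 2.49, 3.79, 1.89, 5.49, 1.19, 0.89,
              1.49, 0.69, 0.79, 4.99, 7.49, 8.99, 1.59],
    "organic": [False, True, True, False, True, True, False, False,
                True, True, False, True, False, True, False],
}
df = pd.DataFrame(data)

# Narrow to one category first, then split by the organic flag
dairy = df[df["category"] == "Dairy"]
dairy_organic = dairy[dairy["organic"]]["price"].mean()
dairy_conventional = dairy[~dairy["organic"]]["price"].mean()

print(f"Dairy, organic:     €{dairy_organic:.2f}")
print(f"Dairy, non-organic: €{dairy_conventional:.2f}")
```

Within Dairy the direction reverses: the organic items average less than the single non-organic item (Cheese). One item is far too small a sample to support a conclusion, but it shows how an overall average can hide differences between categories.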

Part Four

Finding the Extreme Items — .idxmin() and .idxmax()

Knowing the minimum or maximum value is useful. Knowing which row has the minimum or maximum is even more useful. .idxmin() and .idxmax() return the index label of the extreme value, which you can then use to look up the full row.

Python · Try it
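
A minimal sketch of this pattern, assuming the chapter's grocery DataFrame with its default integer index; the names `cheapest_label` and `priciest_label` are illustrative.

```python
import pandas as pd

data = {
    "item": ["Bread", "Milk", "Eggs", "Cheese", "Butter", "Olive Oil", "Pasta", "Rice",
             "Tomatoes", "Onions", "Potatoes", "Chicken", "Beef", "Salmon", "Tuna"],
    "category": ["Bakery", "Dairy", "Dairy", "Dairy", "Dairy", "Oils", "Grains", "Grains",
                 "Vegetables", "Vegetables", "Vegetables", "Meat", "Meat", "Fish", "Fish"],
    "price": [1.29, 0.99, 2.49, 3.79, 1.89, 5.49, 1.19, 0.89,
              1.49, 0.69, 0.79, 4.99, 7.49, 8.99, 1.59],
}
df = pd.DataFrame(data)

# idxmin / idxmax return the index label of the row, not the price itself
cheapest_label = df["price"].idxmin()
priciest_label = df["price"].idxmax()

# .loc looks up the full row by that label
cheapest_row = df.loc[cheapest_label]
priciest_row = df.loc[priciest_label]

print(f"Cheapest: {cheapest_row['item']} ({cheapest_row['category']}) at €{cheapest_row['price']:.2f}")
print(f"Priciest: {priciest_row['item']} ({priciest_row['category']}) at €{priciest_row['price']:.2f}")
```

If several rows tie for the extreme value, both methods return the label of the first one.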

Part Five

Your Turn — A Cost-of-Living Report

Imagine you are writing a short data-driven report on food affordability. Use the statistics tools to answer these questions:

What percentage of items cost more than the mean price? (Hint: filter, count, divide by total rows.)
What is the price range (max − min) for vegetables?
Is the median price of dairy higher or lower than the overall median?

Python · Your turn

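import pandas as pd

# Rebuild the chapter's grocery DataFrame so this cell runs on its own
data = {
    "item": ["Bread","Milk","Eggs","Cheese","Butter","Olive Oil","Pasta","Rice","Tomatoes","Onions","Potatoes","Chicken","Beef","Salmon","Tuna"],
    "category": ["Bakery","Dairy","Dairy","Dairy","Dairy","Oils","Grains","Grains","Vegetables","Vegetables","Vegetables","Meat","Meat","Fish","Fish"],
    "price": [1.29,0.99,2.49,3.79,1.89,5.49,1.19,0.89,1.49,0.69,0.79,4.99,7.49,8.99,1.59],
    "organic": [False,True,True,False,True,True,False,False,True,True,False,True,False,True,False]
}
df = pd.DataFrame(data)
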
mean_price = df["price"].mean()
above_mean = df[df["price"] > mean_price].shape[0]
pct_above  = above_mean / df.shape[0] * 100
print(f"Mean price: €{mean_price:.2f}")
print(f"Items above mean: {above_mean} ({pct_above:.0f}%)")
print()

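veg = df[df["category"] == "Vegetables"]["price"]
veg_range = veg.max() - veg.min()   # the range is a single number: max minus min
print(f"Vegetable price range: €{veg_range:.2f} (from €{veg.min():.2f} to €{veg.max():.2f})")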
print()

dairy_median   = df[df["category"] == "Dairy"]["price"].median()
overall_median = df["price"].median()
print(f"Dairy median:   €{dairy_median:.2f}")
print(f"Overall median: €{overall_median:.2f}")

What you learned in this chapter: .count(), .sum(), .mean(), .median(), .std(), .min(), .max(); computing percentiles with .quantile(); combining filtering and statistics to compare subgroups; and finding extreme rows with .idxmin() and .idxmax(). Next chapter: groupby — computing statistics for every category at once.
