Book 4 — Data Analysis with Python

Python for All

Chapter Five — Summary Statistics

Thanasis Troboukis

Numbers in a column are raw material. Statistics such as the mean, median, standard deviation, and percentiles turn that raw material into findings you can report. This chapter covers the core statistical methods in pandas.

The Core Statistical Methods

Every numeric Series in pandas has a full set of statistical methods built in. You call them directly on a column. Here are the ones you will reach for most often.

Python · Try it — all the basics
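A minimal sketch of the core methods, assuming a small hypothetical price table (the item list and the column names `item` and `price` are invented for illustration; only the Salmon and Beef prices come from the chapter). Because this is not the chapter's dataset, the figures it prints will differ from the ones quoted in the text.

```python
import pandas as pd

# Hypothetical price table for illustration only (not the chapter's dataset).
df = pd.DataFrame({
    "item": ["Bread", "Milk", "Eggs", "Rice", "Apples", "Beef", "Salmon"],
    "price": [1.20, 0.99, 2.49, 1.89, 2.10, 7.49, 8.99],
})

prices = df["price"]  # a numeric Series

print("count :", prices.count())            # number of non-missing values
print("sum   :", round(prices.sum(), 2))    # total of all prices
print("mean  :", round(prices.mean(), 2))   # arithmetic average
print("median:", round(prices.median(), 2)) # middle value when sorted
print("std   :", round(prices.std(), 2))    # sample standard deviation
print("min   :", prices.min())
print("max   :", prices.max())
```

Note that `.std()` in pandas computes the sample standard deviation (dividing by n - 1) by default.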

      

The mean (€2.94) is noticeably higher than the median (€1.89). A few very expensive items (Salmon at €8.99, Beef at €7.49) pull the average up, while most items are inexpensive staples. The median, being the middle value, is more robust to these extremes.

Mean vs median: When reporting a "typical" price, journalists often prefer the median because outliers barely affect it. The mean is the better choice when you need totals, since the mean price per item times the number of items gives the total spend. Always think about which measure tells the more honest story.
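To see how differently the two measures react to an outlier, here is a small sketch with hypothetical prices: five cheap staples, then the same list with one expensive item added. The numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical prices: five inexpensive staples.
staples = pd.Series([0.99, 1.20, 1.89, 2.10, 2.49])

# The same list plus a single expensive item.
with_outlier = pd.concat([staples, pd.Series([8.99])], ignore_index=True)

# How far does each measure move when the outlier is added?
mean_shift = with_outlier.mean() - staples.mean()
median_shift = with_outlier.median() - staples.median()

print("mean shift  :", round(mean_shift, 3))
print("median shift:", round(median_shift, 3))

# The mean is still the right tool for totals: mean * count equals the sum.
total_from_mean = with_outlier.mean() * with_outlier.count()
print("total spend :", round(total_from_mean, 2))
```

In this sketch the mean moves by more than a euro while the median moves by about ten cents, which is why the median is the safer choice for a "typical" value.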

Percentiles and the Spread of Prices

Percentiles divide your data into hundredths. The 25th percentile (Q1) is the value below which 25% of the data falls. The 50th percentile is the median. The 75th percentile (Q3) is the value below which 75% of the data falls.

The gap between Q1 and Q3 is called the interquartile range (IQR), a robust measure of spread that is largely unaffected by extreme values.

Python · Try it — percentiles
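A minimal sketch of computing quartiles and the IQR with `.quantile()`, again assuming a hypothetical price table rather than the chapter's dataset, so the thresholds will not match the €4.14 quoted in the text.

```python
import pandas as pd

# Hypothetical price table for illustration only (not the chapter's dataset).
df = pd.DataFrame({
    "item": ["Bread", "Milk", "Eggs", "Rice", "Apples", "Beef", "Salmon"],
    "price": [1.20, 0.99, 2.49, 1.89, 2.10, 7.49, 8.99],
})
prices = df["price"]

# .quantile() takes a fraction between 0 and 1, not a percentage.
q1 = prices.quantile(0.25)      # 25th percentile
median = prices.quantile(0.50)  # 50th percentile, same as .median()
q3 = prices.quantile(0.75)      # 75th percentile
iqr = q3 - q1                   # interquartile range

print("Q1    :", round(q1, 2))
print("median:", round(median, 2))
print("Q3    :", round(q3, 2))
print("IQR   :", round(iqr, 2))

# Passing a list returns a Series with one value per requested quantile.
quartiles = prices.quantile([0.25, 0.50, 0.75])
print(quartiles)
```

When a percentile falls between two data points, pandas interpolates linearly between them by default.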

      

Items above the 75th percentile cost more than €4.14. That threshold tells you which items are in the top quarter of prices for this market — a useful data point for a story about food affordability.
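Turning that threshold into a list of items can be sketched as follows. The table is hypothetical, so its 75th percentile is not the chapter's €4.14.

```python
import pandas as pd

# Hypothetical price table for illustration only (not the chapter's dataset).
df = pd.DataFrame({
    "item": ["Bread", "Milk", "Eggs", "Rice", "Apples", "Beef", "Salmon"],
    "price": [1.20, 0.99, 2.49, 1.89, 2.10, 7.49, 8.99],
})

# Compute the threshold, then keep only rows priced above it.
q3 = df["price"].quantile(0.75)
top_quarter = df[df["price"] > q3]

print("75th percentile:", round(q3, 2))
print(top_quarter)
```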

Statistics on Filtered Data

You can chain filtering and statistics together. This is where the power of pandas really begins to show. A question like "what is the average price of organic items?" becomes a single readable line.

Python · Try it
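A sketch of chaining a filter and a statistic, assuming a hypothetical table with a boolean `organic` column. Both the column name and the values are invented for illustration; the chapter's dataset may be structured differently.

```python
import pandas as pd

# Hypothetical price table for illustration only (not the chapter's dataset).
df = pd.DataFrame({
    "item": ["Bread", "Milk", "Eggs", "Rice", "Apples", "Beef", "Salmon"],
    "price": [1.20, 0.99, 2.49, 1.89, 2.10, 7.49, 8.99],
    "organic": [False, False, True, False, True, False, True],
})

# Filter rows with a boolean mask, select the price column, then summarise.
organic_mean = df.loc[df["organic"], "price"].mean()

# ~ inverts the mask, selecting the non-organic rows.
conventional_mean = df.loc[~df["organic"], "price"].mean()

print("organic mean     :", round(organic_mean, 2))
print("conventional mean:", round(conventional_mean, 2))
```

In this invented table the organic mean is higher largely because Salmon happens to be organic, which illustrates why a comparison across the whole table can mislead.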

      

Organic items are, on average, more expensive, but part of the gap comes from which categories the organic items belong to. A rigorous comparison would control for category (see Chapter 6).

Finding the Extreme Items — .idxmin() and .idxmax()

Knowing the minimum or maximum value is useful. Knowing which row has the minimum or maximum is even more useful. .idxmin() and .idxmax() return the index label of the extreme value, which you can then use to look up the full row.

Python · Try it
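A minimal sketch of looking up the cheapest and most expensive rows, assuming a hypothetical price table with the default integer index.

```python
import pandas as pd

# Hypothetical price table for illustration only (not the chapter's dataset).
df = pd.DataFrame({
    "item": ["Bread", "Milk", "Eggs", "Rice", "Apples", "Beef", "Salmon"],
    "price": [1.20, 0.99, 2.49, 1.89, 2.10, 7.49, 8.99],
})

# idxmin/idxmax return the index label of the row, not the value itself.
cheapest_label = df["price"].idxmin()
priciest_label = df["price"].idxmax()

# Use .loc with that label to retrieve the full row.
cheapest_row = df.loc[cheapest_label]
priciest_row = df.loc[priciest_label]

print("cheapest:", cheapest_row["item"], cheapest_row["price"])
print("priciest:", priciest_row["item"], priciest_row["price"])
```

If several rows tie for the extreme value, `.idxmin()` and `.idxmax()` return the label of the first one.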

      

Your Turn — A Cost-of-Living Report

Imagine you are writing a short data-driven report on food affordability. Use the statistics tools to answer these questions:

Python · Your turn

      
What you learned in this chapter: .count(), .sum(), .mean(), .median(), .std(), .min(), .max(); computing percentiles with .quantile(); combining filtering and statistics to compare subgroups; and finding extreme rows with .idxmin() and .idxmax(). Next chapter: groupby — computing statistics for every category at once.
