Part One
The Dataset We Will Work With
Throughout this chapter, and the rest of Book 4, we will work with a grocery price survey. Imagine a market reporter has recorded the prices of 15 common food items, along with each item's category, unit of measurement, and whether it is certified organic. This kind of data is the raw material of food price journalism and consumer research.
Run the cell below to create our dataset. You will use this same table for all the exercises in this chapter.
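The original dataset cell is not reproduced in this excerpt, so the cell below is an illustrative reconstruction. The item names, prices, and column names (item, category, price_eur, unit, organic) are assumptions, chosen so the table has the 15-row, 5-column shape and is consistent with the summary figures quoted later in the chapter (median price €1.49, mean €2.94, Dairy as the most common category with 4 items).

```python
import pandas as pd

# Illustrative grocery price survey: 15 items, 5 columns.
# Item names, prices, and column names are assumed, not from the
# original survey; they are chosen to match the chapter's figures.
df = pd.DataFrame({
    "item": ["Milk", "Butter", "Yogurt", "Cheese",
             "Apples", "Bananas", "Oranges",
             "Bread", "Rolls",
             "Potatoes", "Tomatoes", "Carrots",
             "Salmon", "Beef", "Chicken"],
    "category": ["Dairy", "Dairy", "Dairy", "Dairy",
                 "Fruit", "Fruit", "Fruit",
                 "Bakery", "Bakery",
                 "Vegetables", "Vegetables", "Vegetables",
                 "Fish", "Meat", "Meat"],
    "price_eur": [1.24, 2.49, 0.99, 3.79,
                  1.49, 1.29, 1.99,
                  1.39, 0.89,
                  0.99, 2.29, 0.79,
                  9.99, 8.99, 5.49],
    "unit": ["litre", "250 g", "500 g", "200 g",
             "kg", "kg", "kg",
             "loaf", "6 pack",
             "kg", "kg", "kg",
             "kg", "kg", "kg"],
    "organic": [False, False, True, False,
                True, False, False,
                False, False,
                False, True, True,
                False, False, False],
})
```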
That is 15 rows and 5 columns — a modest but realistic slice of a supermarket price list. Now let us learn how to interrogate it.
Part Two
Shape, Columns, and Data Types
The first questions you ask about any dataset are: how big is it, and what kind of data does it contain? pandas answers each in a single line.
.shape — how many rows and columns?
df.shape returns a tuple: (rows, columns). On a real dataset with hundreds of thousands of rows, this is the very first thing you check. Real-world data files often contain far more rows than you expect.
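A minimal sketch of .shape in action, using a small throwaway table (the column names here are illustrative):

```python
import pandas as pd

# A tiny three-row table, just to demonstrate .shape.
df = pd.DataFrame({
    "item": ["Milk", "Bread", "Salmon"],
    "price_eur": [1.24, 1.39, 9.99],
})

print(df.shape)        # (3, 2) -> 3 rows, 2 columns
rows, cols = df.shape  # it is a plain tuple, so it unpacks
print(rows)            # 3
```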
.columns and .dtypes — what are the fields?
The dtypes output tells you how pandas has interpreted each column. object is the dtype pandas uses for text (strings) and other general Python values. float64 means decimal numbers. bool means True/False. Knowing the data type of each column matters because many pandas operations only work on numeric columns.
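For example, a small table mixing the three dtypes just mentioned (column names are illustrative):

```python
import pandas as pd

# One column of each dtype discussed above.
df = pd.DataFrame({
    "item": ["Milk", "Salmon"],    # text       -> object
    "price_eur": [1.24, 9.99],     # decimals   -> float64
    "organic": [False, True],      # True/False -> bool
})

print(list(df.columns))  # ['item', 'price_eur', 'organic']
print(df.dtypes)
# item          object
# price_eur    float64
# organic         bool
# dtype: object
```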
int64 — whole numbers · float64 — decimals · object — text · bool — True/False · datetime64 — dates and times
Part Three
.head() and .tail() — Peeking at the Data
When a dataset has thousands of rows, printing the whole thing is unhelpful. pandas provides two methods for a quick look at the top or bottom of the table.
By default, .head() shows the first 5 rows and .tail() shows the last 5. Pass a number to get a different count: .head(10), .tail(3). Professional data analysts almost always start an exploration session by calling .head() immediately after loading a file.
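A quick sketch of both methods on a ten-row table (the generated item names are placeholders):

```python
import pandas as pd

# Ten numbered rows so the head/tail split is easy to see.
df = pd.DataFrame({
    "item": [f"item_{i}" for i in range(10)],
    "price_eur": [round(0.5 + 0.3 * i, 2) for i in range(10)],
})

print(df.head())      # first 5 rows by default
print(df.tail(3))     # last 3 rows
print(df.head(2))     # pass a number for a different count
```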
Part Four
.describe() — Instant Summary Statistics
One of the most powerful quick-look tools in pandas is .describe(). Call it on a DataFrame and you get a full statistical summary of every numeric column in a single table.
Read the output row by row:
- count — how many non-missing values exist (useful for spotting missing data)
- mean — the average value
- std — standard deviation, a measure of how spread out the values are
- min / max — the smallest and largest values
- 25% / 50% / 75% — the quartiles; 50% is the median
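The rows above can be read directly off the .describe() output. A sketch with assumed prices, chosen so the median and mean match the figures discussed next:

```python
import pandas as pd

# Fifteen prices, assumed for illustration; picked so that the
# median (1.49) and mean (2.94) match the chapter's discussion.
prices = [1.24, 2.49, 0.99, 3.79, 1.49, 1.29, 1.99, 1.39,
          0.89, 0.99, 2.29, 0.79, 9.99, 8.99, 5.49]
df = pd.DataFrame({"price_eur": prices})

summary = df.describe()
print(summary)
print(summary.loc["50%", "price_eur"])             # 1.49 (the median)
print(round(summary.loc["mean", "price_eur"], 2))  # 2.94
```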
From this one table you can already see that the median food price (50th percentile) is €1.49, but the mean is dragged up to €2.94 by the expensive fish and meat items. That gap between mean and median is a classic sign of a right-skewed distribution.
.describe() only summarises numeric columns. To include text columns too, use df.describe(include="all"). This adds count, unique, top (most common value), and freq (how often it appears) for text columns.
The top and freq rows for the category column tell you that Dairy is the most common category, appearing 4 times — something you might not have noticed by eye.
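A sketch of reading top and freq from the extended summary (the category values here are assumed, arranged so Dairy is the most common, as in the chapter's table):

```python
import pandas as pd

# Mixed text/numeric table; Dairy deliberately appears 4 times.
df = pd.DataFrame({
    "category": ["Dairy", "Dairy", "Dairy", "Dairy",
                 "Fruit", "Fruit", "Meat"],
    "price_eur": [1.24, 2.49, 0.99, 3.79, 1.49, 1.29, 8.99],
})

full = df.describe(include="all")
print(full.loc["top", "category"])     # Dairy  (most common value)
print(full.loc["freq", "category"])    # 4      (how often it appears)
print(full.loc["unique", "category"])  # 3      (distinct categories)
```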
Part Five
Your Turn — First Look at a New Dataset
Below is a second dataset — weekly prices at a different market. Use the exploration tools you have learned to answer these questions without reading the raw data by eye:
- How many rows and columns does the dataset have?
- What data types are the columns?
- What is the most expensive item?
- What is the average price?
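The second dataset itself is not reproduced in this excerpt, so here is a hypothetical stand-in together with the calls that answer the four questions. The item names, categories, and prices are invented.

```python
import pandas as pd

# Hypothetical weekly market prices: an invented stand-in for the
# chapter's second dataset.
market = pd.DataFrame({
    "item": ["Honey", "Olives", "Trout", "Figs", "Goat cheese"],
    "category": ["Other", "Vegetables", "Fish", "Fruit", "Dairy"],
    "price_eur": [4.50, 2.80, 11.20, 3.10, 4.90],
})

print(market.shape)   # how many rows and columns?
print(market.dtypes)  # what data types are the columns?
# Most expensive item: sort by price, descending, and take the top row.
print(market.sort_values("price_eur", ascending=False).head(1))
print(round(market["price_eur"].mean(), 2))  # average price
```

On a larger file you would reach for .describe() instead of computing the mean by hand, but the individual calls make each question's answer explicit.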
You have now met .shape, .columns, .dtypes, .head(), .tail(), and .describe(). These six tools are the first things every data analyst runs on a new file. In the next chapter you will learn to select specific rows and columns.