Book 4 · Chapter Two — Exploring Your Dataset

Part One

The Dataset We Will Work With

Throughout this chapter, and the rest of Book 4, we will work with a grocery price survey. Imagine a market reporter has recorded the price of 15 common food items, along with each item's category, its unit of measurement, and whether it is certified organic. This kind of data is the raw material of food price journalism and consumer research.

Run the cell below to create our dataset. You will use this same table for all the exercises in this chapter.

Python · Run this first
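The cell's original code is not part of this text, so what follows is a stand-in: a minimal sketch that assumes the column names item, category, unit, price, and organic, with invented items and prices in euros. The values are chosen so that the figures quoted later in the chapter (a median of €1.49, a mean of about €2.94, and Dairy appearing 4 times) still come out the same.

```python
import pandas as pd

# Stand-in survey table. Every item and price below is invented for illustration.
rows = [
    # (item, category, unit, price in EUR, organic)
    ("Whole milk", "Dairy", "1 l", 1.09, False),
    ("Butter", "Dairy", "250 g", 2.49, False),
    ("Natural yoghurt", "Dairy", "500 g", 1.49, True),
    ("Cheddar", "Dairy", "200 g", 2.99, False),
    ("White bread", "Bakery", "loaf", 1.29, False),
    ("Baguette", "Bakery", "each", 0.99, False),
    ("Apples", "Produce", "1 kg", 1.69, True),
    ("Bananas", "Produce", "1 kg", 1.39, False),
    ("Carrots", "Produce", "1 kg", 0.89, True),
    ("Chicken breast", "Meat", "500 g", 4.99, False),
    ("Beef steak", "Meat", "500 g", 8.99, False),
    ("Salmon fillet", "Fish", "400 g", 9.99, False),
    ("Tinned tuna", "Fish", "3-pack", 3.49, False),
    ("Rice", "Pantry", "1 kg", 1.15, False),
    ("Pasta", "Pantry", "500 g", 1.19, False),
]

# Build the DataFrame; the column names are assumptions, not the course's originals.
df = pd.DataFrame(rows, columns=["item", "category", "unit", "price", "organic"])
print(df)
```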

That is 15 rows and 5 columns — a modest but realistic slice of a supermarket price list. Now let us learn how to interrogate it.

Part Two

Shape, Columns, and Data Types

The first questions to ask about any dataset are "How big is it?" and "What kind of data does it contain?" pandas answers each with a single line of code.

.shape — how many rows and columns?

Python · Try it
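Here is a small sketch of .shape in action. It uses a made-up three-row table called demo (a hypothetical name, chosen so the chapter's df is left untouched); the same attribute works on df.

```python
import pandas as pd

# Tiny stand-in table with invented values.
demo = pd.DataFrame({
    "item": ["Whole milk", "Salmon fillet", "Carrots"],
    "price": [1.09, 9.99, 0.89],
})

# .shape is an attribute, so there are no parentheses after it.
print(demo.shape)  # (3, 2)

# The tuple can be unpacked into two separate numbers.
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")
```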

df.shape returns a tuple: (rows, columns). It is an attribute rather than a method, so it takes no parentheses. On a real dataset with hundreds of thousands of rows, this is the very first thing to check, because real-world data files often contain far more rows than you expect.

.columns and .dtypes — what are the fields?

Python · Try it
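A minimal sketch of both attributes, again on a made-up table named demo with invented values and assumed column names; on the chapter's table you would write df.columns and df.dtypes.

```python
import pandas as pd

# Tiny stand-in table: one text column, one decimal column, one True/False column.
demo = pd.DataFrame({
    "item": ["Whole milk", "Salmon fillet", "Carrots"],
    "price": [1.09, 9.99, 0.89],
    "organic": [False, False, True],
})

# Column names, converted to a plain Python list for easy reading.
print(demo.columns.tolist())  # ['item', 'price', 'organic']

# One dtype per column, showing how pandas interpreted the values.
print(demo.dtypes)
```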

The dtypes output tells you how pandas has interpreted each column. object is the dtype pandas has traditionally reported for text (strings); recent releases may report a dedicated string dtype such as str instead. float64 means decimal numbers. bool means True/False. Knowing the data type of each column matters because many pandas operations, such as computing a mean, only work on numeric columns.

Common data types: int64 — whole numbers · float64 — decimals · object — text · bool — True/False · datetime64 — dates and times
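To see these types side by side, the sketch below builds a made-up table with one column of each kind. The column names units_sold and surveyed_on are invented for illustration. It then uses select_dtypes to keep only the numeric columns, which is one way to find out which columns numeric operations will apply to.

```python
import pandas as pd

# Made-up table with one column per common dtype.
demo = pd.DataFrame({
    "units_sold": [12, 7, 30],                           # whole numbers
    "price": [1.09, 9.99, 0.89],                         # decimals
    "item": ["Whole milk", "Salmon fillet", "Carrots"],  # text
    "organic": [False, False, True],                     # True/False
    "surveyed_on": pd.to_datetime(
        ["2024-03-04", "2024-03-04", "2024-03-11"]       # dates
    ),
})
print(demo.dtypes)

# Keep only the numeric columns; text, bool, and date columns are left out.
numeric = demo.select_dtypes(include="number")
print(numeric.columns.tolist())
```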

Part Three

.head() and .tail() — Peeking at the Data

When a dataset has thousands of rows, printing the whole thing is unhelpful. pandas provides two methods for a quick look at the top or bottom of the table.

Python · Try it
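A short sketch of both methods on a 12-row stand-in table. The table is named demo and filled with placeholder items and prices, so that asking for 10 rows actually returns 10; the calls are identical on df.

```python
import pandas as pd

# 12 placeholder rows: item_1 ... item_12 with made-up prices.
demo = pd.DataFrame({
    "item": [f"item_{n}" for n in range(1, 13)],
    "price": [0.5 * n for n in range(1, 13)],
})

print(demo.head())    # first 5 rows (the default)
print(demo.tail())    # last 5 rows (the default)
print(demo.head(10))  # first 10 rows
print(demo.tail(3))   # last 3 rows
```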

By default, .head() shows the first 5 rows and .tail() shows the last 5. Pass a number to get a different count: .head(10), .tail(3). Professional data analysts almost always start an exploration session by calling .head() immediately after loading a file.

Part Four

.describe() — Instant Summary Statistics

One of the most powerful quick-look tools in pandas is .describe(). Call it on a DataFrame and you get a full statistical summary of every numeric column in a single table.

Python · Try it
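A minimal sketch of .describe() on a stand-in table. The 15 prices are invented, chosen to reproduce the median and mean quoted in this part; with the chapter's table loaded you would simply call df.describe().

```python
import pandas as pd

# 15 invented prices in euros, standing in for the survey's price column.
prices = [1.09, 2.49, 1.49, 2.99, 1.29, 0.99, 1.69, 1.39,
          0.89, 4.99, 8.99, 9.99, 3.49, 1.15, 1.19]
demo = pd.DataFrame({"price": prices})

# One call returns count, mean, std, min, the quartiles, and max.
summary = demo.describe()
print(summary.round(2))

# Individual figures can be picked out by row label and column name.
print(summary.loc["50%", "price"])
```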

Read the output row by row:

count — how many non-missing values exist (useful for spotting missing data)
mean — the average value
std — standard deviation, a measure of how spread out the values are
min / max — the smallest and largest values
25% / 50% / 75% — the quartiles; 50% is the median

From this one table you can already see that the median food price (50th percentile) is €1.49, but the mean is dragged up to €2.94 by the expensive fish and meat items. That gap between mean and median is a classic sign of a right-skewed distribution.
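To see why a few expensive items pull the mean above the median, here is a sketch with five made-up prices, four cheap and one expensive. The numbers are invented purely to make the arithmetic easy to follow.

```python
import pandas as pd

# Four cheap items and one expensive one (invented values).
prices = pd.Series([1.0, 1.2, 1.5, 1.8, 9.5])

mean_price = prices.mean()      # (1.0 + 1.2 + 1.5 + 1.8 + 9.5) / 5 = 3.0
median_price = prices.median()  # the middle value once sorted: 1.5
print(mean_price, median_price)

# A positive skew value is another signal of a long right-hand tail.
print(prices.skew() > 0)
```

The single 9.5 doubles the mean relative to the median, while the median stays with the typical cheap items.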

Include text columns: By default, .describe() only summarises numeric columns. To include text columns too, use df.describe(include="all"). This adds count, unique, top (most common value), and freq (how often it appears) for text columns.

Python · Try it — describe all columns
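A sketch of include="all" on a stand-in table with one text column and one numeric column. The categories and prices are invented, arranged so that Dairy appears 4 times as in the chapter's table; on the real table the call is df.describe(include="all").

```python
import pandas as pd

# Invented stand-in: 15 rows, Dairy being the most frequent category.
demo = pd.DataFrame({
    "category": ["Dairy"] * 4 + ["Bakery"] * 2 + ["Produce"] * 3
                + ["Meat"] * 2 + ["Fish"] * 2 + ["Pantry"] * 2,
    "price": [1.09, 2.49, 1.49, 2.99, 1.29, 0.99, 1.69, 1.39,
              0.89, 4.99, 8.99, 9.99, 3.49, 1.15, 1.19],
})

# Text columns get count, unique, top, and freq; numeric rows show NaN for them.
full = demo.describe(include="all")
print(full)

# The most common category and how often it appears.
print(full.loc["top", "category"], full.loc["freq", "category"])
```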

The top and freq rows for the category column tell you that Dairy is the most common category, appearing 4 times — something you might not have noticed by eye.

Part Five

Your Turn — First Look at a New Dataset

Below is a second dataset — weekly prices at a different market. Use the exploration tools you have learned to answer these questions without reading the raw data by eye:

How many rows and columns does the dataset have?
What data types are the columns?
What is the most expensive item?
What is the average price?

Python · Your turn
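If you do not have the second dataset to hand, the sketch below offers a hypothetical one to practise on. The name market2 and every value in it are invented. It also shows one possible way to answer the four questions. Finding the most expensive item goes one small step beyond the tools covered so far: .describe() reports the highest price but not which item it belongs to, so the sketch uses idxmax to locate that row.

```python
import pandas as pd

# Hypothetical second-market table (invented values).
market2 = pd.DataFrame({
    "item": ["Yoghurt", "Olive oil", "Tomatoes", "Lamb chops", "Oat milk", "Sourdough"],
    "category": ["Dairy", "Pantry", "Produce", "Meat", "Dairy", "Bakery"],
    "price": [2.20, 6.50, 1.80, 11.00, 1.90, 3.00],
})

print(market2.shape)       # rows and columns
print(market2.dtypes)      # data type of each column
print(market2.describe())  # includes the max and the mean price

# idxmax gives the row label of the highest price; .loc looks up its item.
priciest = market2.loc[market2["price"].idxmax(), "item"]
average = market2["price"].mean()
print(priciest, round(average, 2))
```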

What you learned in this chapter: how to inspect a new dataset with .shape, .columns, .dtypes, .head(), .tail(), and .describe(). These six tools are the first things every data analyst runs on any new file. In the next chapter you will learn to select specific rows and columns.
