Book 4 — Data Analysis with Python

Python for All

Chapter Two — Exploring Your Dataset

Thanasis Troboukis

When you receive a new dataset, the first thing you do is look around. pandas gives you a set of quick-inspection tools that tell you the size, shape, and contents of your data in seconds.

The Dataset We Will Work With

Throughout this chapter — and the rest of Book 4 — we will work with a grocery price survey. Imagine a market reporter has recorded the price of 15 common food items, their category, unit of measurement, and whether they are organically certified. This kind of data is the raw material of food price journalism and consumer research.

Run the cell below to create our dataset. You will use this same table for all the exercises in this chapter.

Python · Run this first

      

That is 15 rows and 5 columns — a modest but realistic slice of a supermarket price list. Now let us learn how to interrogate it.

Shape, Columns, and Data Types

The first questions you ask about any dataset are: how big is it? and what kind of data does it contain? pandas answers both in one line each.

.shape — how many rows and columns?

Python · Try it

      

df.shape returns a tuple: (rows, columns). It is an attribute, not a method, so you write it without parentheses. On a real dataset with hundreds of thousands of rows, this is the very first thing you check, because real-world data files often contain far more rows than you expect.
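As a minimal sketch of reading the shape, the snippet below builds a small stand-in table and unpacks df.shape into two variables. The six rows and the column names (item, price, category, unit, organic) are hypothetical values chosen for illustration; they are not the chapter's 15-row survey.

```python
import pandas as pd

# Hypothetical stand-in table (not the chapter's 15-row dataset)
df = pd.DataFrame({
    "item": ["Milk", "Bread", "Cheese", "Salmon", "Apples", "Yogurt"],
    "price": [1.20, 0.95, 8.40, 12.50, 1.90, 0.85],
    "category": ["Dairy", "Bakery", "Dairy", "Fish", "Fruit", "Dairy"],
    "unit": ["litre", "loaf", "kg", "kg", "kg", "pot"],
    "organic": [False, False, True, False, True, False],
})

# .shape is an attribute: no parentheses
rows, cols = df.shape
print(df.shape)                          # (6, 5)
print(f"{rows} rows, {cols} columns")    # 6 rows, 5 columns
```

Unpacking the tuple this way is handy when you want to use the row count later, for example to report what share of rows has missing values.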

.columns and .dtypes — what are the fields?

Python · Try it

      

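The dtypes output tells you how pandas has interpreted each column. object is the dtype pandas has traditionally reported for text (strings); depending on your pandas version and settings, text columns may instead show a dedicated string dtype. float64 means decimal (floating-point) numbers. bool means True/False. Knowing the data type of each column matters because many pandas operations, such as sums and averages, only make sense on numeric columns.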

Common data types: int64 — whole numbers  ·  float64 — decimals  ·  object — text  ·  bool — True/False  ·  datetime64 — dates and times
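One way to put the data types to work, sketched here on a small hypothetical table (the rows and column names are assumptions, not the chapter's dataset), is to list the columns, print their dtypes, and then ask pandas which columns are numeric.

```python
import pandas as pd

# Hypothetical stand-in table (not the chapter's 15-row dataset)
df = pd.DataFrame({
    "item": ["Milk", "Bread", "Cheese", "Salmon", "Apples", "Yogurt"],
    "price": [1.20, 0.95, 8.40, 12.50, 1.90, 0.85],
    "category": ["Dairy", "Bakery", "Dairy", "Fish", "Fruit", "Dairy"],
    "unit": ["litre", "loaf", "kg", "kg", "kg", "pot"],
    "organic": [False, False, True, False, True, False],
})

# Column names as a plain Python list
column_names = df.columns.tolist()
print(column_names)

# One dtype per column; how text columns are labelled depends on the pandas version
print(df.dtypes)

# Keep only the numeric columns (bool columns are not counted as numbers here)
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)   # ['price']
```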

.head() and .tail() — Peeking at the Data

When a dataset has thousands of rows, printing the whole thing is unhelpful. pandas provides two methods for a quick look at the top or bottom of the table.

Python · Try it

      

By default, .head() shows the first 5 rows and .tail() shows the last 5. Pass a number to get a different count: .head(10), .tail(3). Professional data analysts almost always start an exploration session by calling .head() immediately after loading a file.
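The defaults and the optional row count can be tried out on a small hypothetical table. The six rows below are invented for illustration and are not the chapter's dataset.

```python
import pandas as pd

# Hypothetical stand-in table (not the chapter's 15-row dataset)
df = pd.DataFrame({
    "item": ["Milk", "Bread", "Cheese", "Salmon", "Apples", "Yogurt"],
    "price": [1.20, 0.95, 8.40, 12.50, 1.90, 0.85],
    "category": ["Dairy", "Bakery", "Dairy", "Fish", "Fruit", "Dairy"],
    "unit": ["litre", "loaf", "kg", "kg", "kg", "pot"],
    "organic": [False, False, True, False, True, False],
})

first_default = df.head()   # no argument: first 5 rows
first_three = df.head(3)    # first 3 rows
last_two = df.tail(2)       # last 2 rows

print(first_three)
print(last_two)
```

Both methods return a new DataFrame and leave df unchanged, so you can call them as often as you like while exploring.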

.describe() — Instant Summary Statistics

One of the most powerful quick-look tools in pandas is .describe(). Call it on a DataFrame and you get a full statistical summary of every numeric column in a single table.

Python · Try it

      

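Read the output row by row:

count (number of non-missing values)  ·  mean (arithmetic average)  ·  std (standard deviation, a measure of spread)  ·  min (smallest value)  ·  25%, 50%, 75% (quartiles; the 50% row is the median)  ·  max (largest value)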

From this one table you can already see that the median food price (50th percentile) is €1.49, but the mean is dragged up to €2.94 by the expensive fish and meat items. That gap between mean and median is a classic sign of a right-skewed distribution.
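The same mean-versus-median comparison can be sketched on a small hypothetical table. The prices below are invented for illustration, so the numbers differ from the €1.49 and €2.94 quoted above, but the pattern is the same: two expensive items pull the mean well above the median.

```python
import pandas as pd

# Hypothetical stand-in table (not the chapter's 15-row dataset)
df = pd.DataFrame({
    "item": ["Milk", "Bread", "Cheese", "Salmon", "Apples", "Yogurt"],
    "price": [1.20, 0.95, 8.40, 12.50, 1.90, 0.85],
    "category": ["Dairy", "Bakery", "Dairy", "Fish", "Fruit", "Dairy"],
    "unit": ["litre", "loaf", "kg", "kg", "kg", "pot"],
    "organic": [False, False, True, False, True, False],
})

# Summary statistics; only the numeric column (price) is included by default
summary = df.describe()
print(summary)

# The 50% row of describe() is the median
mean_price = df["price"].mean()
median_price = df["price"].median()
print(f"mean = {mean_price:.2f}, median = {median_price:.2f}")
# mean = 4.30, median = 1.55
```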

Include text columns: By default, .describe() only summarises numeric columns. To include text columns too, use df.describe(include="all"). This adds count, unique, top (most common value), and freq (how often it appears) for text columns.

Python · Try it: describe all columns

      

The top and freq rows for the category column tell you that Dairy is the most common category, appearing 4 times — something you might not have noticed by eye.
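To see where top and freq come from, the sketch below runs describe(include="all") on a small hypothetical table and cross-checks the result with value_counts(). The rows are invented for illustration, so Dairy appears 3 times here rather than the 4 times reported for the chapter's dataset.

```python
import pandas as pd

# Hypothetical stand-in table (not the chapter's 15-row dataset)
df = pd.DataFrame({
    "item": ["Milk", "Bread", "Cheese", "Salmon", "Apples", "Yogurt"],
    "price": [1.20, 0.95, 8.40, 12.50, 1.90, 0.85],
    "category": ["Dairy", "Bakery", "Dairy", "Fish", "Fruit", "Dairy"],
    "unit": ["litre", "loaf", "kg", "kg", "kg", "pot"],
    "organic": [False, False, True, False, True, False],
})

# Summarise every column; cells that do not apply to a column show NaN
summary_all = df.describe(include="all")
print(summary_all)

top_category = summary_all.loc["top", "category"]
top_freq = summary_all.loc["freq", "category"]
print(top_category, top_freq)   # Dairy 3

# value_counts() gives the full breakdown behind top and freq
category_counts = df["category"].value_counts()
print(category_counts)
```

value_counts() is worth running whenever top and freq look interesting, because it shows every category and its count rather than just the winner.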

Your Turn — First Look at a New Dataset

Below is a second dataset — weekly prices at a different market. Use the exploration tools you have learned to answer these questions without reading the raw data by eye:

Python · Your turn

      
What you learned in this chapter: how to inspect a new dataset with .shape, .columns, .dtypes, .head(), .tail(), and .describe(). These six tools are the first things every data analyst runs on any new file. In the next chapter you will learn to select specific rows and columns.
