Book 4 — Data Analysis with Python

Python for All

Chapter Eight — Cleaning and Final Analysis

Thanasis Troboukis

Real data is messy. Prices go missing. Categories are mislabelled. Entries are duplicated. This chapter teaches you how to detect and handle these problems — and then brings everything together in a complete food price analysis.

Detecting Missing Values

Real-world datasets are rarely perfect. A price might be missing because a product was out of stock. A category might be blank because it was not recorded. pandas marks missing numeric values as NaN (Not a Number), treats None as missing too, and provides methods to find and handle both.

Python · Try it — a messy dataset

      

.isnull() returns a boolean DataFrame where True marks every missing value. Calling .sum() on it counts the True values per column — giving you a quick inventory of how much data is missing. This is always one of the first checks in any real-world analysis.
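A minimal sketch of this check, assuming a small made-up survey (the product names, categories, and prices below are invented for illustration and are not the chapter's dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical messy survey: two prices and one category are missing.
df = pd.DataFrame({
    "product": ["Bread", "Milk", "Eggs", "Rice", "Olive oil"],
    "category": ["Bakery", "Dairy", None, "Grains", "Oils"],
    "price": [1.20, np.nan, 3.10, np.nan, 7.50],
})

# Boolean DataFrame: True wherever a value is missing.
missing_mask = df.isnull()

# Summing the True values gives the number of missing entries per column.
missing_counts = missing_mask.sum()
print(missing_counts)
```

With this data the counts should show no missing products, one missing category, and two missing prices.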

Python · Try it — find rows with any missing value
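One way to list the affected rows, sketched on the same kind of invented data: .any(axis=1) collapses the boolean mask to one True/False per row, and that Series can then filter the DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical messy survey (invented values).
df = pd.DataFrame({
    "product": ["Bread", "Milk", "Eggs", "Rice", "Olive oil"],
    "category": ["Bakery", "Dairy", None, "Grains", "Oils"],
    "price": [1.20, np.nan, 3.10, np.nan, 7.50],
})

# True for every row that has at least one missing value.
has_missing = df.isnull().any(axis=1)

# Keep only those rows so they can be inspected by eye.
rows_with_missing = df[has_missing]
print(rows_with_missing)
```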

      

.dropna() and .fillna() — Handling Missing Data

Once you know where the missing values are, you have two main options: drop the affected rows, or fill in the missing values with a substitute.

.dropna() — remove rows with missing values

Python · Try it — drop rows with any NaN
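A short sketch on invented data. By default .dropna() removes every row that contains any NaN; the subset argument limits the check to the columns you name. Both calls return a new DataFrame and leave the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical messy survey (invented values).
df = pd.DataFrame({
    "product": ["Bread", "Milk", "Eggs", "Rice", "Olive oil"],
    "category": ["Bakery", "Dairy", None, "Grains", "Oils"],
    "price": [1.20, np.nan, 3.10, np.nan, 7.50],
})

# Drop every row that has a missing value in any column.
complete_rows = df.dropna()

# Drop only the rows where the price is missing; a blank category is tolerated.
priced_rows = df.dropna(subset=["price"])

print(len(df), len(complete_rows), len(priced_rows))
```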

      

.fillna() — replace missing values

Sometimes dropping rows loses too much data. .fillna() replaces NaN with a value you specify — the mean, a fixed default, or a placeholder string.

Python · Try it — fill missing values
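A sketch of two common fills, again on invented data: the column mean for a numeric column and a placeholder string for a text column. The placeholder "Unknown" is an arbitrary choice made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical messy survey (invented values).
df = pd.DataFrame({
    "product": ["Bread", "Milk", "Eggs", "Rice", "Olive oil"],
    "category": ["Bakery", "Dairy", None, "Grains", "Oils"],
    "price": [1.20, np.nan, 3.10, np.nan, 7.50],
})

filled = df.copy()

# .mean() skips NaN by default, so this is the mean of the known prices.
mean_price = filled["price"].mean()
filled["price"] = filled["price"].fillna(mean_price)

# Text columns usually get an explicit placeholder instead of a statistic.
filled["category"] = filled["category"].fillna("Unknown")

print(filled)
```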

      
Choosing a strategy: Dropping rows is safe when missing data is rare and random. Filling with the mean is common for prices, but when many values are missing it distorts the analysis: the average stays the same while the spread of the column shrinks, so prices look more uniform than they really are. Always document which strategy you chose and why. It is part of your methodology.
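To see the distortion in numbers, here is a small illustration with an invented price column where half of the values are missing. The mean survives the fill, but the standard deviation drops noticeably.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: three known values, three missing.
prices = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 9.0])

# Replace every NaN with the mean of the known values.
filled = prices.fillna(prices.mean())

# The mean is the same before and after the fill.
print(prices.mean(), filled.mean())

# The spread is smaller after the fill, because the filled
# values all sit exactly on the mean.
print(prices.std(), filled.std())
```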

Renaming Columns and Removing Duplicates

.rename() — fix column names

Python · Try it
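A sketch with invented column names. .rename() takes a dictionary that maps old names to new ones and returns a new DataFrame. The second half shows an optional bulk clean-up through the .str accessor on the column index, which is one possible convention and not the only one.

```python
import pandas as pd

# Hypothetical survey with awkward column names (invented for illustration).
df = pd.DataFrame({
    "Product Name": ["Bread", "Milk"],
    "Price (EUR)": [1.20, 0.95],
})

# Rename specific columns: {old name: new name}.
renamed = df.rename(columns={"Product Name": "product", "Price (EUR)": "price"})
print(list(renamed.columns))

# Bulk clean-up: trim whitespace, lowercase, replace spaces with underscores.
tidy = df.copy()
tidy.columns = (
    tidy.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)
print(list(tidy.columns))
```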

      

.drop_duplicates() — remove repeated rows

Python · Try it
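A sketch on invented data. .duplicated() flags rows that repeat an earlier row exactly, which is a useful count to check before removing anything. .drop_duplicates() then removes them, and the subset argument narrows the comparison to chosen columns.

```python
import pandas as pd

# Hypothetical survey where the first Bread entry was recorded twice.
df = pd.DataFrame({
    "product": ["Bread", "Milk", "Bread", "Milk"],
    "store": ["A", "A", "A", "B"],
    "price": [1.20, 0.95, 1.20, 0.99],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = df.duplicated().sum()

# Remove exact duplicates; the first occurrence is kept.
deduped = df.drop_duplicates()

# Compare on one column only: keep the first row seen for each product.
one_per_product = df.drop_duplicates(subset=["product"], keep="first")

print(n_duplicates, len(deduped), len(one_per_product))
```

Note that the two Milk rows differ in store and price, so the plain call keeps both; only the subset version treats them as repeats.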

      

Putting It All Together — A Complete Food Price Analysis

You now have all the tools needed to conduct a full data analysis. Below is a complete, commented script that takes a raw (and slightly messy) price survey dataset, cleans it, and produces a set of findings you could publish.

Python · Complete analysis

      

Read through the script section by section. Notice how each step builds on the previous one: load → clean → enrich → analyse. This is the standard structure of any data analysis pipeline, whether you are working with 15 rows or 15 million.
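The four stages can be sketched in a few lines. This is a condensed illustration on invented data, not the chapter's full script: the dataset is built inline instead of being read from a file, and the column name above_average is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# 1. Load: hypothetical raw survey with a duplicate row, a missing price,
#    and a missing category (all values invented).
raw = pd.DataFrame({
    "Product": ["Bread", "Milk", "Feta", "Milk", "Rice", "Pasta", "Apples"],
    "Category": ["Bakery", "Dairy", "Dairy", "Dairy", "Grains", "Grains", None],
    "Price": [1.20, 0.95, 4.60, 0.95, np.nan, 1.10, 2.30],
})

# 2. Clean: lowercase the column names, drop exact duplicates, drop rows
#    without a price, and label blank categories.
clean = (
    raw.rename(columns=str.lower)
       .drop_duplicates()
       .dropna(subset=["price"])
       .assign(category=lambda d: d["category"].fillna("Unknown"))
       .reset_index(drop=True)
)

# 3. Enrich: flag the items priced above the average.
enriched = clean.assign(above_average=lambda d: d["price"] > d["price"].mean())

# 4. Analyse: average price per category, most expensive first.
by_category = (
    enriched.groupby("category")["price"].mean().sort_values(ascending=False)
)
most_expensive = enriched.loc[enriched["price"].idxmax(), "product"]

print(by_category)
print("Most expensive item:", most_expensive)
```

Chaining the cleaning steps keeps each stage in a single expression, which makes the order of operations easy to audit when you write up your methodology.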

Final Project — Your Own Price Survey

Build a complete analysis of your own food price dataset. Walk to a local supermarket or browse an online grocery store and record the prices of 15–20 items. Then, using the full toolkit from this book, answer:

Python · Your final project — replace the data with your own

      
Congratulations — you have completed Book 4. You can now load, inspect, clean, filter, aggregate, sort, and enrich tabular data using pandas. These are the core skills of data journalism and data science. The next book covers web scraping: how to collect data automatically from websites using Python — the first step in building your own datasets from scratch.
