Book 4 · Chapter Eight — Cleaning and Final Analysis

Part One

Detecting Missing Values

In the real world, datasets are rarely perfect. A price might be missing because a product was out of stock. A category might be blank because it was not recorded. pandas represents missing numeric values as NaN (Not a Number) and also treats None as missing, and it provides methods to find and handle both.

Python · Try it — a messy dataset
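The interactive example is not reproduced here, so the following is a minimal sketch of what a messy dataset might look like. The items and prices are made up for illustration and are not the course's original data.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: np.nan and None mark values that were not recorded
messy = pd.DataFrame({
    "item":     ["Bread", "Milk", "Eggs", "Cheese", "Butter"],
    "category": ["Bakery", "Dairy", None, "Dairy", "Dairy"],
    "price":    [1.29, np.nan, 2.49, 3.79, np.nan],
})

print(messy)

# True/False table of missing cells, summed column by column
missing_per_column = messy.isnull().sum()
print(missing_per_column)
```

In this sketch the price column has two gaps and the category column has one.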

.isnull() returns a boolean DataFrame where True marks every missing value. Calling .sum() on it counts the True values per column — giving you a quick inventory of how much data is missing. This is always one of the first checks in any real-world analysis.

Python · Try it — find rows with any missing value
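One way to list the affected rows, sketched here on a small made-up dataset, is to combine .isnull() with .any(axis=1), which reduces each row to a single True or False.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a few gaps
messy = pd.DataFrame({
    "item":     ["Bread", "Milk", "Eggs", "Cheese", "Butter"],
    "category": ["Bakery", "Dairy", None, "Dairy", "Dairy"],
    "price":    [1.29, np.nan, 2.49, 3.79, np.nan],
})

# True for every row that has at least one missing cell
has_missing = messy.isnull().any(axis=1)

# Use the boolean Series as a filter to keep only those rows
rows_with_missing = messy[has_missing]
print(rows_with_missing)
```

Looking at the actual rows, rather than only the counts, often hints at why the values are missing.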

Part Two

.dropna() and .fillna() — Handling Missing Data

Once you know where the missing values are, you have two main options: drop the affected rows, or fill in the missing values with a substitute.

.dropna() — remove rows with missing values

Python · Try it — drop rows with any NaN
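A minimal sketch of .dropna() on a made-up dataset follows. The subset parameter shown in the second call is optional and limits the check to the columns you name.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a few gaps
messy = pd.DataFrame({
    "item":     ["Bread", "Milk", "Eggs", "Cheese", "Butter"],
    "category": ["Bakery", "Dairy", None, "Dairy", "Dairy"],
    "price":    [1.29, np.nan, 2.49, 3.79, np.nan],
})

# Drop every row that has a missing value in any column
complete = messy.dropna()

# Drop only the rows where the price is missing; a blank category is tolerated
priced = messy.dropna(subset=["price"])

print(f"Original rows:       {len(messy)}")
print(f"Fully complete rows: {len(complete)}")
print(f"Rows with a price:   {len(priced)}")
```

.dropna() returns a new DataFrame, so messy itself is unchanged until you assign the result back.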

.fillna() — replace missing values

Sometimes dropping rows loses too much data. .fillna() replaces NaN with a value you specify — the mean, a fixed default, or a placeholder string.

Python · Try it — fill missing values
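The sketch below fills a made-up dataset in two ways: the mean for a numeric column and a placeholder string for a text column. The placeholder "Unknown" is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with a few gaps
messy = pd.DataFrame({
    "item":     ["Bread", "Milk", "Eggs", "Cheese", "Butter"],
    "category": ["Bakery", "Dairy", None, "Dairy", "Dairy"],
    "price":    [1.29, np.nan, 2.49, 3.79, np.nan],
})

filled = messy.copy()

# Numeric column: replace NaN with the mean of the known prices
filled["price"] = filled["price"].fillna(filled["price"].mean())

# Text column: replace missing entries with a placeholder string
filled["category"] = filled["category"].fillna("Unknown")

print(filled)
print(f"Missing values left: {filled.isnull().sum().sum()}")
```

Working on a copy keeps the raw data available, which makes it easier to compare strategies later.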

Choosing a strategy: Dropping rows is safe when missing data is rare and random. Filling with the mean is common for prices, but every filled row then sits exactly at the mean, which shrinks the apparent spread of prices and can distort results when many values are missing. Always document which strategy you chose and why: it is part of your methodology.
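To see how mean-filling can distort an analysis, the sketch below uses a deliberately extreme, made-up price list in which half the values are missing. The mean is preserved, but the standard deviation drops.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: half of them are missing
prices = pd.Series([1.0, 2.0, 3.0, np.nan, np.nan, np.nan])

# Fill the gaps with the mean of the known values
filled = prices.fillna(prices.mean())

# pandas skips NaN by default when computing mean and std
print(f"Mean before: {prices.mean():.2f}   after: {filled.mean():.2f}")
print(f"Std  before: {prices.std():.2f}   after: {filled.std():.2f}")
```

The filled values add rows without adding any variation, so the data looks more uniform than it really is.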

Part Three

Renaming Columns and Removing Duplicates

.rename() — fix column names

Python · Try it
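A minimal sketch of .rename() follows. The awkward column names are invented for illustration, as if they had come from a spreadsheet export.

```python
import pandas as pd

# Hypothetical data with spreadsheet-style column names
survey = pd.DataFrame({
    "Item Name":   ["Bread", "Milk"],
    "Price (EUR)": [1.29, 0.99],
})

# Map old names to new ones; columns not listed keep their names
survey = survey.rename(columns={"Item Name": "item", "Price (EUR)": "price"})

print(survey.columns.tolist())  # ['item', 'price']
```

Short, lowercase names without spaces are easier to type and less error-prone in later steps.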

.drop_duplicates() — remove repeated rows

Python · Try it
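The sketch below, on made-up data, shows .duplicated() for spotting repeats and .drop_duplicates() for removing them. The subset and keep parameters in the last call are optional.

```python
import pandas as pd

# Hypothetical data: Bread is entered twice with identical values,
# Milk is entered twice with different prices
survey = pd.DataFrame({
    "item":  ["Bread", "Milk", "Bread", "Milk"],
    "price": [1.29, 0.99, 1.29, 1.09],
})

# True marks a row that exactly repeats an earlier row
print(survey.duplicated())

# Remove rows that are identical in every column
deduped = survey.drop_duplicates()

# Treat rows as duplicates if the item matches, keeping the first occurrence
one_per_item = survey.drop_duplicates(subset="item", keep="first")

print(deduped)
print(one_per_item)
```

Note that the two Milk rows survive the plain .drop_duplicates() call because their prices differ; whether they count as duplicates is a judgment you have to make about your data.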

Part Four

Putting It All Together — A Complete Food Price Analysis

You now have all the tools needed to conduct a full data analysis. Below is a complete, commented script that takes a raw (and slightly messy) price survey dataset, cleans it, and produces a set of findings you could publish.

Python · Complete analysis

import pandas as pd
import numpy as np

# ── 1. Load the raw data ──────────────────────────────────────────
raw = {
    "item": [
        "Bread", "Milk", "Eggs", "Cheese", "Butter",
        "Olive Oil", "Pasta", "Rice", "Tomatoes", "Onions",
        "Potatoes", "Chicken", "Beef", "Salmon", "Tuna",
        "Bread"  # duplicate entry
    ],
    "category": [
        "Bakery", "Dairy", "Dairy", "Dairy", "Dairy",
        "Oils", "Grains", "Grains", "Vegetables", "Vegetables",
        "Vegetables", "Meat", "Meat", "Fish", "Fish",
        "Bakery"
    ],
    "price": [
        1.29, 0.99, np.nan, 3.79, 1.89,
        5.49, 1.19, 0.89, 1.49, 0.69,
        0.79, 4.99, 7.49, 8.99, 1.59,
        1.29
    ],
    "organic": [
        False, True, True, False, True,
        True, False, False, True, True,
        False, True, False, True, False,
        False
    ]
}
df = pd.DataFrame(raw)

print(f"Raw dataset: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Missing values: {df.isnull().sum().sum()}")
print(f"Duplicate rows: {df.duplicated().sum()}")

# ── 2. Clean ─────────────────────────────────────────────────────
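# Remove exact duplicate rows first, so a repeated item cannot skew the medians
df = df.drop_duplicates()
# Fill missing prices with the median of the item's own category
# (a closer estimate than one overall mean across unrelated products)
df["price"] = df.groupby("category")["price"].transform(
    lambda x: x.fillna(x.median())
)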

print(f"\nClean dataset: {df.shape[0]} rows")

# ── 3. Enrich ────────────────────────────────────────────────────
# Add a price band column
def band(p):
    if p < 1.5:
        return "Budget"
    elif p < 4.0:
        return "Mid-range"
    else:
        return "Premium"

df["band"] = df["price"].apply(band)

# ── 4. Analyse ───────────────────────────────────────────────────
print("\n=== FINDINGS ===\n")

# Average price by category (sorted)
cat_avg = df.groupby("category")["price"].mean().sort_values()
print("Average price by category:")
for cat, avg in cat_avg.items():
    print(f"  {cat:<12} €{avg:.2f}")

print()

# Organic premium
organic_avg    = df.loc[df["organic"], "price"].mean()
nonorganic_avg = df.loc[~df["organic"], "price"].mean()
premium = organic_avg - nonorganic_avg
print(f"Organic average:     €{organic_avg:.2f}")
print(f"Non-organic average: €{nonorganic_avg:.2f}")
print(f"Organic premium:     €{premium:.2f} ({premium/nonorganic_avg*100:.0f}%)")

print()

# Budget items
budget_items = df[df["band"] == "Budget"].sort_values("price")
print(f"Budget items (under €1.50): {len(budget_items)}")
print(budget_items[["item", "price"]].to_string(index=False))

Read through the script section by section. Notice how each step builds on the previous one: load → clean → enrich → analyse. This is a common structure for a data analysis pipeline, whether you are working with 15 rows or 15 million.

Part Five

Final Project — Your Own Price Survey

Build a complete analysis of your own food price dataset. Walk to a local supermarket or browse an online grocery store and record the prices of 15–20 items. Then, using the full toolkit from this book, answer:

What are the three most expensive categories?
What is the organic premium in your market?
Which items represent the best value (lowest price-to-median ratio)?
Are there any suspicious outliers in your data?
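The last two questions can be approached in several ways; the sketch below shows one. It rests on two assumptions not stated in the text: that "median" means the median price within the item's own category, and that an outlier is a price more than 1.5 times the interquartile range beyond the quartiles (a common rule of thumb). The data, including the "Truffle Pasta" row, is made up.

```python
import pandas as pd

# Hypothetical survey rows, for illustration only
df = pd.DataFrame({
    "item":     ["Milk", "Yoghurt", "Cheese", "Eggs",
                 "Pasta", "Rice", "Truffle Pasta"],
    "category": ["Dairy", "Dairy", "Dairy", "Dairy",
                 "Grains", "Grains", "Grains"],
    "price":    [1.09, 0.89, 4.29, 2.79, 1.39, 0.99, 19.99],
})

# Assumption: compare each price with the median of its own category
category_median = df.groupby("category")["price"].transform("median")
df["price_to_median"] = df["price"] / category_median
best_value = df.sort_values("price_to_median").head(3)
print("Best value:")
print(best_value[["item", "price", "price_to_median"]].to_string(index=False))

# Assumption: flag prices beyond 1.5 x IQR from the quartiles of all prices
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print("\nPossible outliers:")
print(outliers[["item", "price"]].to_string(index=False))
```

A flagged price is only a prompt to check the source: it may be a typing error, or it may be a genuinely expensive product.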

Python · Your final project — replace the data with your own

import pandas as pd

# Replace this with your own survey data
my_data = {
    "item":     ["Bread", "Milk", "Eggs", "Cheese", "Tomatoes",
                 "Chicken", "Pasta", "Rice", "Olive Oil", "Yoghurt"],
    "category": ["Bakery", "Dairy", "Dairy", "Dairy", "Vegetables",
                 "Meat", "Grains", "Grains", "Oils", "Dairy"],
    "price":    [1.49, 1.09, 2.79, 4.29, 1.69,
                 5.49, 1.39, 0.99, 6.29, 0.89],
    "organic":  [False, True, False, False, True,
                 False, False, False, True, True]
}
df = pd.DataFrame(my_data)

# --- Your analysis below ---
print("=== MY MARKET PRICE SURVEY ===\n")

print("Dataset overview:")
print(f"  {df.shape[0]} items across {df['category'].nunique()} categories\n")

print("Price summary:")
print(f"  Cheapest:  {df.loc[df['price'].idxmin(), 'item']} (€{df['price'].min():.2f})")
print(f"  Priciest:  {df.loc[df['price'].idxmax(), 'item']} (€{df['price'].max():.2f})")
print(f"  Mean:      €{df['price'].mean():.2f}")
print(f"  Median:    €{df['price'].median():.2f}\n")

print("Average price by category:")
cat_summary = df.groupby("category")["price"].mean().sort_values(ascending=False).round(2)
print(cat_summary.to_string())
print()

organic_avg    = df[df["organic"]]["price"].mean()
nonorganic_avg = df[~df["organic"]]["price"].mean()
print(f"Organic premium: €{(organic_avg - nonorganic_avg):.2f}")

Congratulations — you have completed Book 4. You can now load, inspect, clean, filter, aggregate, sort, and enrich tabular data using pandas. These are the core skills of data journalism and data science. The next book covers web scraping: how to collect data automatically from websites using Python — the first step in building your own datasets from scratch.
