Book 2 — The Magic World of the Terminal

Python for All

Chapter Four — Reading and Searching Files

Thanasis Troboukis

The terminal can read, count, search, and sort the contents of any text file — no spreadsheet needed. These tools are the foundation of command-line data journalism.

cat — Reading a File

The most direct way to see the contents of a file is cat. Short for concatenate, it prints the entire contents of a file to the screen in one go.

cat notes.txt
cat cities.csv
cat press_release.txt

more is an older alternative that pages through a long file screenful by screenful. In this sandbox it behaves the same as cat, but on a real terminal it lets you scroll through large files without the output flooding your screen.

Use cat when the file is short enough to read in full. For long files — a dataset with thousands of rows, for example — use head or tail instead.
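As a quick sketch of the idea, the snippet below creates a small stand-in file (the name notes.txt just mirrors the examples above) and prints it in one go:

```shell
# Create a short stand-in file with three lines
printf 'line one\nline two\nline three\n' > notes.txt

# cat prints the entire file to the screen at once
cat notes.txt
```

For a file this short, cat is exactly the right tool; the moment the output scrolls past your screen, reach for head or tail instead.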

Terminal · cat Lab

Try: ls · cat cities.txt · cat press_release.txt · cat earthquakes.csv

head and tail — Reading Parts of a File

When a dataset has thousands of rows, printing the entire file is not useful. head and tail let you peek at just the beginning or end.

head -n 5 filename prints the first 5 lines. Replace 5 with any number. Without a number, head defaults to 10 lines.

tail -n 5 filename prints the last 5 lines. Again, the default is 10.

head -n 3 earthquakes.csv
tail -n 3 earthquakes.csv
head earthquakes.csv

Why this matters: before analysing a CSV dataset, always run head to see the column names and check the structure. A common mistake is running analysis without realising the first row is a header, not data.
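A minimal sketch of that check, using an invented three-row CSV (the file name and contents here are made up for illustration):

```shell
# Build a tiny stand-in CSV: a header row plus three records
printf 'date,place,magnitude\n2024-01-01,Athens,4.2\n2024-01-02,Patras,3.1\n2024-01-03,Chania,5.0\n' > quakes.csv

# First 2 lines: the header plus the first record,
# enough to see the column names and structure
head -n 2 quakes.csv

# Last line: the final record in the file
tail -n 1 quakes.csv
```

Running head first would have immediately revealed that the first row is column names, not data.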

Terminal · head and tail Lab

Try: head earthquakes.csv · head -n 3 earthquakes.csv · tail -n 3 earthquakes.csv · tail -n 2 press_release.txt

wc — Counting Lines, Words, Characters

wc stands for word count, but it counts much more than words. Run on a file, it returns three numbers: lines, words, and bytes (for plain ASCII text, the byte count is the same as the character count).

wc press_release.txt
    9  58  398 press_release.txt

Reading left to right: 9 lines, 58 words, 398 characters.

The most useful flag for data journalism is -l, which returns only the line count:

wc -l earthquakes.csv
    11 earthquakes.csv

A CSV with 11 lines has a header row plus 10 data rows. This quick sanity check tells you how many records are in your dataset — before you open it in Python or a spreadsheet.

Terminal · wc Lab

Try: wc earthquakes.csv · wc -l earthquakes.csv · wc press_release.txt · wc -l cities.txt

grep — Searching Inside Files

grep searches a file and returns every line that contains a given word or pattern. It is arguably the single most useful tool in a journalist's terminal toolkit.

grep Athens earthquakes.csv
grep Turkey earthquakes.csv
grep ministry press_release.txt

The syntax is always: grep [search term] [filename].

The search is case-sensitive by default: grep athens and grep Athens return different results. Note that in the sandbox you cannot use the -i flag for case-insensitive search, but on a real terminal this is very useful.
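Here is a small sketch of the case-sensitivity point, on an invented file (names and contents are made up for illustration):

```shell
# Stand-in file mixing capitalised and lowercase spellings
printf 'Athens,4.2\nathens suburb,2.0\nPatras,3.1\n' > places.txt

# Case-sensitive by default: each search matches a different line
grep Athens places.txt   # matches only the capitalised line
grep athens places.txt   # matches only the lowercase line

# On a real terminal, -i ignores case and matches both
grep -i athens places.txt
```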

Journalism use cases for grep include pulling every record for a specific place out of a dataset, and finding every mention of a ministry or official in a long press release or transcript.

Terminal · grep Lab

Try: grep Greece earthquakes.csv · grep Turkey earthquakes.csv · grep Athens cities.txt · grep minister sources.txt

sort and uniq — Order and Deduplicate

sort sorts the lines of a file alphabetically (or numerically with extra flags). uniq removes consecutive duplicate lines — meaning identical lines that appear one after another.

On their own, these two commands are simple:

sort cities.txt
uniq cities.txt

The key insight is that uniq only removes consecutive duplicates. If duplicates are scattered throughout the file, uniq alone will miss them. The standard workflow is: sort first, then run uniq.

In the sandbox, try uniq cities.txt — notice it may still show duplicates because they were not adjacent. Then try sort cities.txt — now all the Athens entries are grouped together. If you could save that sorted output to a file, running uniq on it would give you the unique city list. (Piping commands together — sort cities.txt | uniq — is the full real-terminal technique. The sandbox handles each command separately.)

Journalism use case: you have a list of cities from a dataset and want to know how many unique cities appear. sort then uniq gives you the deduplicated list. wc -l on that list gives you the count.
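On a real terminal, that whole workflow is three commands joined by pipes. A sketch on an invented city list (the file name and contents are assumptions):

```shell
# Stand-in city list with scattered, non-adjacent duplicates
printf 'Athens\nPatras\nAthens\nChania\nPatras\n' > cities_demo.txt

# uniq alone misses the scattered duplicates
uniq cities_demo.txt

# sort groups identical lines together, then uniq removes them
sort cities_demo.txt | uniq

# Piping the deduplicated list into wc -l counts the unique cities (3)
sort cities_demo.txt | uniq | wc -l
```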

Terminal · sort and uniq Lab

Try: cat cities.txt · sort cities.txt · uniq cities.txt · sort sources.txt · uniq sources.txt

Full Lab and What Comes Next

Bring everything from this chapter together. The sandbox below has a realistic set of files. Work through the following tasks:

  1. How many rows of data are in earthquakes.csv? (Use wc -l, then subtract 1 for the header.)
  2. Which earthquakes were recorded in Greece? (Use grep.)
  3. What are the unique countries in the dataset? (Use sort, then uniq on the result — or just look at the sorted list.)
  4. What are the first 3 lines of press_release.txt?
  5. How many unique source names are in sources.txt?
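The first two tasks can be sketched like this, on an invented stand-in for earthquakes.csv (the contents below are made up; the sandbox file will differ):

```shell
# Stand-in dataset: a header plus three records, two of them in Greece
printf 'date,country,magnitude\n2024-01-01,Greece,4.2\n2024-01-02,Turkey,5.1\n2024-01-03,Greece,3.3\n' > eq.csv

# Task 1: data rows = total lines minus 1 for the header
echo $(( $(wc -l < eq.csv) - 1 ))

# Task 2: every record mentioning Greece
grep Greece eq.csv
```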
Terminal · Full Lab

Use all commands: cat · head · tail · wc · grep · sort · uniq

What you have learned in Chapter 4: cat reads a full file; head -n and tail -n read the first or last N lines; wc -l counts lines; grep finds lines containing a search term; sort orders lines alphabetically; uniq removes consecutive duplicates.
