Book 2 — The Magic World of the Terminal

Python for All

Chapter Four — Reading and Searching Files

Thanasis Troboukis

The terminal can read, count, search, and sort the contents of any text file — no spreadsheet needed. These tools are the foundation of command-line data journalism.

cat — Reading a File

The most direct way to see the contents of a file is cat. Short for concatenate, it prints the entire contents of a file to the screen in one go.

cat notes.txt
cat cities.csv
cat press_release.txt

more is an older alternative that pages through a long file screenful by screenful. In this sandbox it behaves the same as cat, but on a real terminal it lets you scroll through large files without the output flooding your screen.

Use cat when the file is short enough to read in full. For long files — a dataset with thousands of rows, for example — use head or tail instead.
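As a quick sketch of the idea, the snippet below creates a small stand-in file (the name notes.txt just mirrors the examples above) and prints it in one go:

```shell
# Create a short stand-in file with three lines
printf 'line one\nline two\nline three\n' > notes.txt

# cat prints the entire file to the screen at once
cat notes.txt
```

For a file this short, cat is exactly the right tool; the moment the output scrolls past your screen, reach for head or tail instead.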

Terminal · cat Lab

Try: ls · cat cities.txt · cat press_release.txt · cat earthquakes.csv

head and tail — Reading Parts of a File

When a dataset has thousands of rows, printing the entire file is not useful. head and tail let you peek at just the beginning or end.

head -n 5 filename prints the first 5 lines. Replace 5 with any number. Without a number, head defaults to 10 lines.

tail -n 5 filename prints the last 5 lines. Again, the default is 10.

head -n 3 earthquakes.csv
tail -n 3 earthquakes.csv
head earthquakes.csv

Why this matters: before analysing a CSV dataset, always run head to see the column names and check the structure. A common mistake is running analysis without realising the first row is a header, not data.
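A minimal sketch of that check, using an invented three-row CSV (the file name and contents here are made up for illustration):

```shell
# Build a tiny stand-in CSV: a header row plus three records
printf 'date,place,magnitude\n2024-01-01,Athens,4.2\n2024-01-02,Patras,3.1\n2024-01-03,Chania,5.0\n' > quakes.csv

# First 2 lines: the header plus the first record,
# enough to see the column names and structure
head -n 2 quakes.csv

# Last line: the final record in the file
tail -n 1 quakes.csv
```

Running head first would have immediately revealed that the first row is column names, not data.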

Terminal · head and tail Lab

Try: head earthquakes.csv · head -n 3 earthquakes.csv · tail -n 3 earthquakes.csv · tail -n 2 press_release.txt

wc — Counting Lines, Words, Characters

wc stands for word count, but it counts much more than words. Run on a file, it returns three numbers: lines, words, and bytes (for plain ASCII text, the byte count is the same as the character count).

wc press_release.txt
    9  58  398 press_release.txt

Reading left to right: 9 lines, 58 words, 398 characters.

The most useful flag for data journalism is -l, which returns only the line count:

wc -l earthquakes.csv
    11 earthquakes.csv

A CSV with 11 lines has a header row plus 10 data rows. This quick sanity check tells you how many records are in your dataset — before you open it in Python or a spreadsheet.

Terminal · wc Lab

Try: wc earthquakes.csv · wc -l earthquakes.csv · wc press_release.txt · wc -l cities.txt

grep — Searching Inside Files

grep searches a file and returns every line that contains a given word or pattern. It is arguably the single most useful tool in a journalist's terminal toolkit.

grep Athens earthquakes.csv
grep Turkey earthquakes.csv
grep ministry press_release.txt

The syntax is always: grep [search term] [filename].

The search is case-sensitive by default: grep athens and grep Athens return different results. Note that in the sandbox you cannot use the -i flag for case-insensitive search, but on a real terminal this is very useful.
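Here is a small sketch of the case-sensitivity point, on an invented file (names and contents are made up for illustration):

```shell
# Stand-in file mixing capitalised and lowercase spellings
printf 'Athens,4.2\nathens suburb,2.0\nPatras,3.1\n' > places.txt

# Case-sensitive by default: each search matches a different line
grep Athens places.txt   # matches only the capitalised line
grep athens places.txt   # matches only the lowercase line

# On a real terminal, -i ignores case and matches both
grep -i athens places.txt
```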

Journalism use cases for grep include pulling every record for a specific place out of a dataset, and finding every mention of a ministry or official in a long press release or transcript.

Terminal · grep Lab

Try: grep Greece earthquakes.csv · grep Turkey earthquakes.csv · grep Athens cities.txt · grep minister sources.txt

sort and uniq — Order and Deduplicate

sort sorts the lines of a file alphabetically (or numerically with extra flags). uniq removes consecutive duplicate lines — meaning identical lines that appear one after another.

On their own, these two commands are simple:

sort cities.txt
uniq cities.txt

The key insight is that uniq only removes consecutive duplicates. If duplicates are scattered throughout the file, uniq alone will miss them. The standard workflow is: sort first, then run uniq.

In the sandbox, try uniq cities.txt — notice it may still show duplicates because they were not adjacent. Then try sort cities.txt — now all the Athens entries are grouped together. If you could save that sorted output to a file, running uniq on it would give you the unique city list. (Piping commands together — sort cities.txt | uniq — is the full real-terminal technique. The sandbox handles each command separately.)

Journalism use case: you have a list of cities from a dataset and want to know how many unique cities appear. sort then uniq gives you the deduplicated list. wc -l on that list gives you the count.
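On a real terminal, that whole workflow is three commands joined by pipes. A sketch on an invented city list (the file name and contents are assumptions):

```shell
# Stand-in city list with scattered, non-adjacent duplicates
printf 'Athens\nPatras\nAthens\nChania\nPatras\n' > cities_demo.txt

# uniq alone misses the scattered duplicates
uniq cities_demo.txt

# sort groups identical lines together, then uniq removes them
sort cities_demo.txt | uniq

# Piping the deduplicated list into wc -l counts the unique cities (3)
sort cities_demo.txt | uniq | wc -l
```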

Terminal · sort and uniq Lab

Try: cat cities.txt · sort cities.txt · uniq cities.txt · sort sources.txt · uniq sources.txt

Full Lab and What Comes Next

Bring everything from this chapter together. The sandbox below has a realistic set of files. Work through the following tasks:

  1. How many rows of data are in earthquakes.csv? (Use wc -l, then subtract 1 for the header.)
  2. Which earthquakes were recorded in Greece? (Use grep.)
  3. What are the unique countries in the dataset? (Use sort, then uniq on the result — or just look at the sorted list.)
  4. What are the first 3 lines of press_release.txt?
  5. How many unique source names are in sources.txt?
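The first two tasks can be sketched like this, on an invented stand-in for earthquakes.csv (the contents below are made up; the sandbox file will differ):

```shell
# Stand-in dataset: a header plus three records, two of them in Greece
printf 'date,country,magnitude\n2024-01-01,Greece,4.2\n2024-01-02,Turkey,5.1\n2024-01-03,Greece,3.3\n' > eq.csv

# Task 1: data rows = total lines minus 1 for the header
echo $(( $(wc -l < eq.csv) - 1 ))

# Task 2: every record mentioning Greece
grep Greece eq.csv
```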
Terminal · Full Lab

Use all commands: cat · head · tail · wc · grep · sort · uniq

What you have learned in Chapter 4: cat reads a full file; head -n and tail -n read the first or last N lines; wc -l counts lines; grep finds lines containing a search term; sort orders lines alphabetically; uniq removes consecutive duplicates.
