Part One
cat — Reading a File
The most direct way to see the contents of a file is cat. Short for concatenate, it prints the entire contents of a file to the screen in one go.
cat cities.txt
cat press_release.txt
more is an older alternative that pages through a long file screenful by screenful. In this sandbox it behaves the same as cat, but on a real terminal it lets you scroll through large files without the output flooding your screen.
Use cat when the file is short enough to read in full. For long files — a dataset with thousands of rows, for example — use head or tail instead.
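The "concatenate" in cat's name comes from its ability to join files: given more than one filename, it prints them back to back. A quick sketch with throwaway demo files (the filenames and contents here are invented for illustration, not the sandbox's files):

```shell
# Create two tiny demo files (invented for illustration)
printf 'line one\nline two\n' > /tmp/a.txt
printf 'line three\n' > /tmp/b.txt

# One file: print it in full
cat /tmp/a.txt

# Two files: print them one after another (the "concatenate" in cat)
cat /tmp/a.txt /tmp/b.txt
```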
$
Try: ls · cat cities.txt · cat press_release.txt · cat earthquakes.csv
Part Two
head and tail — Reading Parts of a File
When a dataset has thousands of rows, printing the entire file is not useful. head and tail let you peek at just the beginning or end.
head -n 5 filename prints the first 5 lines. Replace 5 with any number. Without a number, head defaults to 10 lines.
tail -n 5 filename prints the last 5 lines. Again, the default is 10.
tail -n 3 earthquakes.csv
head earthquakes.csv
Why this matters: before analysing a CSV dataset, always run head to see the column names and check the structure. A common mistake is running analysis without realising the first row is a header, not data.
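On a real terminal, that header check looks like this. The CSV below is a stand-in with invented rows, not the sandbox's earthquakes.csv:

```shell
# Stand-in CSV: one header row plus three data rows (invented data)
printf 'country,magnitude\nGreece,5.1\nTurkey,6.2\nJapan,7.0\n' > /tmp/demo.csv

# First 2 lines: the header row plus the first record
head -n 2 /tmp/demo.csv

# Last line: the final record in the file
tail -n 1 /tmp/demo.csv
```

The first line of output is column names, not data, which is exactly what you need to know before counting or analysing rows.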
$
Try: head earthquakes.csv · head -n 3 earthquakes.csv · tail -n 3 earthquakes.csv · tail -n 2 press_release.txt
Part Three
wc — Counting Lines, Words, Characters
wc stands for word count, but it counts much more than words. Run on a file, it returns three numbers: lines, words, and characters.
wc press_release.txt
9 58 398 press_release.txt
Reading left to right: 9 lines, 58 words, 398 characters.
The most useful flag for data journalism is -l, which returns only the line count:
wc -l earthquakes.csv
11 earthquakes.csv
A CSV with 11 lines has a header row plus 10 data rows. This quick sanity check tells you how many records are in your dataset — before you open it in Python or a spreadsheet.
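The header-plus-records arithmetic is easy to sketch on a real terminal. The file below is invented demo data; on a system with the full toolset you can also skip the header directly with tail -n +2 (print from line 2 onward), a flag not covered in the sandbox:

```shell
# Demo CSV: one header row and three data rows (invented data)
printf 'country,magnitude\nGreece,5.1\nTurkey,6.2\nJapan,7.0\n' > /tmp/quakes_demo.csv

# Total line count is 4: the header plus 3 records
wc -l < /tmp/quakes_demo.csv

# Skip the header with tail -n +2, then count only the data rows
tail -n +2 /tmp/quakes_demo.csv | wc -l
```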
$
Try: wc earthquakes.csv · wc -l earthquakes.csv · wc press_release.txt · wc -l cities.txt
Part Four
grep — Searching Inside Files
grep searches a file and returns every line that contains a given word or pattern. It is arguably the single most useful tool in a journalist's terminal toolkit.
grep Turkey earthquakes.csv
grep ministry press_release.txt
The syntax is always: grep [search term] [filename].
The search is case-sensitive by default: grep athens and grep Athens return different results. Note that in the sandbox you cannot use the -i flag for case-insensitive search, but on a real terminal this is very useful.
Journalism use cases for grep:
- Find every row in a large CSV that mentions a specific city, politician, or company
- Search interview notes for every time a source said a specific phrase
- Find all entries in a financial dataset for a specific institution
- Quickly verify whether a name appears in a leaked document
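The case-sensitivity point is worth seeing once. A minimal sketch with an invented notes file (not one of the sandbox's files):

```shell
# Demo notes file (invented lines, standing in for interview notes)
printf 'Meeting with ministry aide\nCall transcript\nMinistry press office reply\n' > /tmp/notes_demo.txt

# Case-sensitive by default: matches "ministry" but not "Ministry",
# so only the first line is returned
grep ministry /tmp/notes_demo.txt
```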
$
Try: grep Greece earthquakes.csv · grep Turkey earthquakes.csv · grep Athens cities.txt · grep minister sources.txt
Part Five
sort and uniq — Order and Deduplicate
sort sorts the lines of a file alphabetically (or numerically with extra flags). uniq removes consecutive duplicate lines — meaning identical lines that appear one after another.
On their own, these two commands are simple:
sort cities.txt
uniq cities.txt
The key insight is that uniq only removes consecutive duplicates. If duplicates are scattered throughout the file, uniq alone will miss them. The standard workflow is: sort first, then run uniq.
In the sandbox, try uniq cities.txt — notice it may still show duplicates because they were not adjacent. Then try sort cities.txt — now all the Athens entries are grouped together. If you could save that sorted output to a file, running uniq on it would give you the unique city list. (Piping commands together — sort cities.txt | uniq — is the full real-terminal technique. The sandbox handles each command separately.)
Journalism use case: you have a list of cities from a dataset and want to know how many unique cities appear. sort then uniq gives you the deduplicated list. wc -l on that list gives you the count.
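On a real terminal, the whole workflow is one pipeline. The city list below is invented demo data with deliberately scattered duplicates:

```shell
# Demo city list with non-adjacent duplicates (invented data)
printf 'Athens\nRome\nAthens\nParis\nAthens\n' > /tmp/cities_demo.txt

# uniq alone misses the scattered Athens duplicates
uniq /tmp/cities_demo.txt

# sort groups the duplicates, uniq removes them
sort /tmp/cities_demo.txt | uniq

# wc -l on the deduplicated list gives the unique-city count
sort /tmp/cities_demo.txt | uniq | wc -l
```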
$
Try: cat cities.txt · sort cities.txt · uniq cities.txt · sort sources.txt · uniq sources.txt
Part Six
Full Lab and What Comes Next
Bring everything from this chapter together. The sandbox below has a realistic set of files. Work through the following tasks:
- How many rows of data are in earthquakes.csv? (Use wc -l, then subtract 1 for the header.)
- Which earthquakes were recorded in Greece? (Use grep.)
- What are the unique countries in the dataset? (Use sort, then uniq on the result — or just look at the sorted list.)
- What are the first 3 lines of press_release.txt?
- How many unique source names are in sources.txt?
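For a real terminal, here is one way the dataset tasks can be attacked, sketched against a stand-in file with invented rows (the sandbox's actual data will differ; cut, which extracts a column from comma-separated lines, is not covered in this chapter):

```shell
# Stand-in earthquakes file (invented rows, not the sandbox's data)
printf 'country,magnitude\nGreece,5.1\nTurkey,6.2\nGreece,4.8\n' > /tmp/eq_demo.csv

# Rows of data: total lines minus the header row
wc -l < /tmp/eq_demo.csv

# Earthquakes recorded in Greece
grep Greece /tmp/eq_demo.csv

# Unique countries: skip the header, take the first column, sort, dedupe
tail -n +2 /tmp/eq_demo.csv | cut -d, -f1 | sort | uniq
```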
$
Use all commands: cat · head · tail · wc · grep · sort · uniq
cat reads a full file; head -n and tail -n read the first or last N lines; wc -l counts lines; grep finds lines containing a search term; sort orders lines alphabetically; uniq removes consecutive duplicates.
Chapter Navigation
Move between chapters.