4 Pipes and Filters
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways. We’ll start with the directory shell-lesson-data/exercise-data/alkanes that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
First, copy the shell-lesson-data folder into your working directory if you have not already done so:
cp -r /mnt/s-ws/everyone/shell-lesson-data .
cd shell-lesson-data/exercise-data/alkanes/
ls
4.1 wc ‘word count’ command
wc is the ‘word count’ command: it counts the number of lines, words, and characters in files (returning the values in that order from left to right).
Let’s run an example command:
wc cubane.pdb
If we run the command wc *.pdb, the * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:
wc *.pdb
Note that wc *.pdb also reports totals across all files on the last line of the output.
If we run wc -l instead of just wc, the output shows only the number of lines per file:
wc -l *.pdb
The -m and -w options can also be used with the wc command to show only the number of characters or the number of words, respectively.
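For example, to report only word counts or only character counts for the same files, you could run:
wc -w *.pdb    # number of words per file
wc -m *.pdb    # number of characters per file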
4.2 Capturing output from commands: the > (redirect) operator
Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths.txt
The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. This command prints no screen output, because everything that wc would have printed has gone into the file lengths.txt instead. If the file doesn’t exist prior to issuing the command, the shell will create the file. If the file exists already, it will be silently overwritten, which may lead to data loss. Thus, redirect commands require caution.
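To see the overwriting behaviour without risking real data, here is a small sketch using a throwaway file (notes.txt is just a hypothetical name):
echo "first run" > notes.txt
echo "second run" > notes.txt
cat notes.txt    # prints only "second run"; the first line was silently replaced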
ls lengths.txt confirms that the file exists:
$ ls lengths.txt
We can now send the content of lengths.txt to the screen using cat lengths.txt. The cat command gets its name from ‘concatenate’ i.e. join together, and it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:
$ cat lengths.txt
4.3 Filtering output
Next we’ll use the sort command to sort the contents of the lengths.txt file. But first we’ll do an exercise to learn a little about the sort command:
4.3.1 What Does sort -n Do?
The file shell-lesson-data/exercise-data/numbers.txt contains some lines with numbers:
sort ../numbers.txt
sort -n ../numbers.txt
The -n option specifies a numerical rather than an alphanumerical sort.
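If you would like to see the difference without opening numbers.txt, here is a minimal sketch using a throwaway file with made-up values (tmp-numbers.txt is just a hypothetical name):
printf '10\n2\n19\n22\n6\n' > tmp-numbers.txt
sort tmp-numbers.txt       # alphanumerical order: 10, 19, 2, 22, 6
sort -n tmp-numbers.txt    # numerical order: 2, 6, 10, 19, 22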
We will also use the -n option to specify that the sort is numerical instead of alphanumerical. This does not change the file; instead, it sends the sorted result to the screen:
sort -n lengths.txt
We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we’ve done that, we can run another command called head to get the first few lines in sorted-lengths.txt:
sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
Using -n 1 with head tells it that we only want the first line of the file; -n 20 would get the first 20, and so on. Since sorted-lengths.txt contains the lengths of our files ordered from least to greatest, the output of head must be the file with the fewest lines.
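head has a counterpart called tail, which shows the last lines of its input (both appear in the summary at the end of this chapter). As a small sketch: because sorted-lengths.txt is in ascending order and wc appended a ‘total’ line, the longest file sits just before the end:
tail -n 2 sorted-lengths.txt    # the longest file, followed by wc's "total" line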
4.4 The >> operator
4.4.1 Using >> in Bash Commands
In Bash, the >> operator is used to append the output of a command to a file. If the file does not already exist, it will be created. This operator is particularly useful when you want to add content to the end of an existing file without overwriting its current content.
4.4.2 Syntax
command >> filename
- command: The command whose output you want to append.
- filename: The file to which the output will be appended.
4.4.3 Example
Consider you have a file named logfile.txt and you want to append the current date and time to it each time a certain script runs.
4.4.3.1 Step 1: Create or check the initial content of logfile.txt
echo "Initial log entry" > logfile.txt
cat logfile.txt
4.4.3.2 Step 2: Append the date and time to logfile.txt
date >> logfile.txt
4.4.3.3 Step 3: Check the updated content of logfile.txt
cat logfile.txt
4.4.4 Multiple Appends
You can use the >> operator multiple times to append different outputs to the same file. For example:
echo "First append" >> logfile.txt
echo "Second append" >> logfile.txt4.5 Passing output to another command
In our example of finding the file with the fewest lines, we are using two intermediate files lengths.txt and sorted-lengths.txt to store output. This is a confusing way to work because even once you understand what wc, sort, and head do, those intermediate files make it hard to follow what’s going on. We can make it easier to understand by running sort and head together:
$ sort -n lengths.txt | head -n 1
9 methane.pdb
The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.
This has removed the need for the sorted-lengths.txt file.
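Pipes are not limited to this particular pair of commands. As one more small sketch, you could count how many .pdb files are in the current directory by piping the output of ls into wc -l:
ls *.pdb | wc -l    # one file name per line, counted by wc -l (prints 6 here)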
4.6 Combining multiple commands
Nothing prevents us from chaining pipes consecutively. We can for example send the output of wc directly to sort, and then send the resulting output to head. This removes the need for any intermediate files.
We’ll start by using a pipe to send the output of wc to sort:
$ wc -l *.pdb | sort -n
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
We can then send that output through another pipe, to head, so that the full pipeline becomes:
$ wc -l *.pdb | sort -n | head -n 1
9 methane.pdb
This is exactly like a mathematician nesting functions like log(3x) and saying ‘the log of three times x’. In our case, the algorithm is ‘head of sort of line count of *.pdb’.
4.7 grep: a powerful tool for pattern search
The grep command is a powerful tool used in Unix-like operating systems to search for patterns in text. It can be used to find specific lines in files that match a given pattern. The basic syntax of the grep command is as follows:
grep [options] pattern [file...]
- [options]: Optional flags that modify the behavior of the grep command.
- pattern: The text pattern you want to search for.
- [file...]: Optional file names or paths where you want to search for the pattern. If no files are specified, grep will read from standard input (e.g., data piped into it).
Let’s use nano to create a file named example.txt with the following content:
This is a sample file.
It contains some lines.
Let’s search for a word in this file.
The word we’ll search for is “search.”
grep "search" example.txtThe output would be:
Let's search for a word in this file.
The word we'll search for is "search."
4.7.1 Common Options
- -i: Ignore case distinctions in the pattern and input files.
- -v: Invert the match, displaying lines that do not match the pattern.
- -r or -R: Recursively search directories for the pattern.
- -l: Print only the names of files with matching lines.
- -n: Prefix each line of output with the line number within its file.
- -c: Print only a count of matching lines per file.
- -H: Print the filename for each match.
4.7.2 Examples
- Simple Search:
grep 'hello' file.txt
Searches for lines containing the string “hello” in file.txt.
- Case-Insensitive Search:
grep -i 'hello' file.txt
Searches for lines containing “hello”, “Hello”, “HELLO”, etc., in file.txt.
- Recursive Search:
grep -r 'function' /path/to/directory
Searches for the string “function” in all files within the specified directory and its subdirectories.
- Count Matches:
grep -c 'error' logfile.txt
Counts the number of lines containing the string “error” in logfile.txt.
- Exclude Matches:
grep -v 'test' file.txt
Displays all lines that do not contain the string “test” in file.txt.
- Display Line Numbers:
grep -n 'main' program.c
Displays matching lines containing the string “main” in program.c, along with their line numbers.
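Options can also be combined in a single call. As a minimal sketch, reusing the example.txt file created earlier, the following matches regardless of case and prefixes each hit with its line number:
grep -in 'SEARCH' example.txt    # -i ignores case, -n adds line numbers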
4.7.3 Use in Bioinformatics
grep is particularly useful in bioinformatics for:
- Searching for specific sequences or patterns in large text files, such as FASTA or FASTQ files.
- Filtering lines in output files from various bioinformatics tools.
- Quickly identifying and extracting relevant information from log files, configuration files, and other textual data.
4.7.4 Example in Bioinformatics
Suppose you have a FASTA file (sequences.fasta) and you want to find all sequences containing the motif “ATGCGA”:
grep -B 1 'ATGCGA' sequences.fasta
This command searches for the motif “ATGCGA” and displays the matching lines along with the preceding line (which typically contains the sequence identifier in FASTA format).
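Another quick check along the same lines (a sketch, assuming the same sequences.fasta file) is to count how many sequences the file contains by counting its header lines, which in FASTA format all start with >:
grep -c '^>' sequences.fasta    # number of sequence records (each header line starts with ">")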
4.8 Summary
- wc counts lines, words, and characters in its inputs.
- cat displays the contents of its inputs.
- sort sorts its inputs.
- head displays the first 10 lines of its input.
- tail displays the last 10 lines of its input.
- command > [file] redirects a command’s output to a file (overwriting any existing content).
- command >> [file] appends a command’s output to a file.
- [first] | [second] is a pipeline: the output of the first command is used as the input to the second.
- The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
- grep is a powerful and versatile tool for text searching and processing.