4 Pipes and Filters
Now that we know a few basic commands, we can finally look at the shell’s most powerful feature: the ease with which it lets us combine existing programs in new ways. We’ll start with the directory shell-lesson-data/exercise-data/alkanes that contains six files describing some simple organic molecules. The .pdb extension indicates that these files are in Protein Data Bank format, a simple text format that specifies the type and position of each atom in the molecule.
First, copy the shell-lesson-data folder into your working directory if you have not already done so:
cp -r /mnt/s-ws/everyone/shell-lesson-data .
cd shell-lesson-data/exercise-data/alkanes/
ls
4.1 wc ‘word count’ command
wc is the ‘word count’ command: it counts the number of lines, words, and characters in files (returning the values in that order from left to right).
Let’s run an example command:
wc cubane.pdb
If we run the command wc *.pdb, the * in *.pdb matches zero or more characters, so the shell turns *.pdb into a list of all .pdb files in the current directory:
wc *.pdb
Note that wc *.pdb also reports totals across all files on the last line of the output.
If we run wc -l instead of just wc, the output shows only the number of lines per file:
wc -l *.pdb
The -m and -w options can also be used with the wc command to show only the number of characters or the number of words, respectively.
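For example, to report only word counts or only character counts for the same files, you could run:
wc -w *.pdb    # number of words per file
wc -m *.pdb    # number of characters per file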
4.2 Capturing output from commands: the > (redirect) operator
Which of these files contains the fewest lines? It’s an easy question to answer when there are only six files, but what if there were 6000? Our first step toward a solution is to run the command:
$ wc -l *.pdb > lengths.txt
The greater than symbol, >, tells the shell to redirect the command’s output to a file instead of printing it to the screen. This command prints no screen output, because everything that wc would have printed has gone into the file lengths.txt instead. If the file doesn’t exist prior to issuing the command, the shell will create the file. If the file exists already, it will be silently overwritten, which may lead to data loss. Thus, redirect commands require caution.
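To see the overwriting behaviour without risking real data, here is a small sketch using a throwaway file (notes.txt is just a hypothetical name):
echo "first run" > notes.txt
echo "second run" > notes.txt
cat notes.txt    # prints only "second run"; the first line was silently replaced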
ls lengths.txt confirms that the file exists:
$ ls lengths.txt
We can now send the content of lengths.txt to the screen using cat lengths.txt. The cat command gets its name from ‘concatenate’ i.e. join together, and it prints the contents of files one after another. There’s only one file in this case, so cat just shows us what it contains:
$ cat lengths.txt
4.3 Filtering output
Next we’ll use the sort command to sort the contents of the lengths.txt file. But first we’ll do an exercise to learn a little about the sort command:
4.3.1 What Does sort -n Do?
The file shell-lesson-data/exercise-data/numbers.txt contains some lines with numbers:
sort ../numbers.txt
sort -n ../numbers.txt
The -n option specifies a numerical rather than an alphanumerical sort.
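If you would like to see the difference without opening numbers.txt, here is a minimal sketch using a throwaway file with made-up values (tmp-numbers.txt is just a hypothetical name):
printf '10\n2\n19\n22\n6\n' > tmp-numbers.txt
sort tmp-numbers.txt       # alphanumerical order: 10, 19, 2, 22, 6
sort -n tmp-numbers.txt    # numerical order: 2, 6, 10, 19, 22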
We will also use the -n option to specify that the sort is numerical instead of alphanumerical. This does not change the file; instead, it sends the sorted result to the screen:
sort -n lengths.txt
We can put the sorted list of lines in another temporary file called sorted-lengths.txt by putting > sorted-lengths.txt after the command, just as we used > lengths.txt to put the output of wc into lengths.txt. Once we’ve done that, we can run another command called head to get the first few lines in sorted-lengths.txt:
sort -n lengths.txt > sorted-lengths.txt
head -n 1 sorted-lengths.txt
Using -n 1 with head tells it that we only want the first line of the file; -n 20 would get the first 20, and so on. Since sorted-lengths.txt contains the lengths of our files ordered from least to greatest, the output of head must be the file with the fewest lines.
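head has a counterpart called tail, which shows the last lines of its input (both appear in the summary at the end of this chapter). As a small sketch: because sorted-lengths.txt is in ascending order and wc appended a ‘total’ line, the longest file sits just before the end:
tail -n 2 sorted-lengths.txt    # the longest file, followed by wc's "total" line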
4.4 The >> operator
4.4.1 Using >> in Bash Commands
In Bash, the >> operator is used to append the output of a command to a file. If the file does not already exist, it will be created. This operator is particularly useful when you want to add content to the end of an existing file without overwriting its current content.
4.4.2 Syntax
command >> filename
- command: The command whose output you want to append.
- filename: The file to which the output will be appended.
4.4.3 Example
Consider you have a file named logfile.txt and you want to append the current date and time to it each time a certain script runs.
4.4.3.1 Step 1: Create or check the initial content of logfile.txt
echo "Initial log entry" > logfile.txt
cat logfile.txt
4.4.3.2 Step 2: Append the date and time to logfile.txt
date >> logfile.txt
4.4.3.3 Step 3: Check the updated content of logfile.txt
cat logfile.txt
4.4.4 Multiple Appends
You can use the >> operator multiple times to append different outputs to the same file. For example:
echo "First append" >> logfile.txt
echo "Second append" >> logfile.txt4.5 Passing output to another command
In our example of finding the file with the fewest lines, we are using two intermediate files lengths.txt and sorted-lengths.txt to store output. This is a confusing way to work because even once you understand what wc, sort, and head do, those intermediate files make it hard to follow what’s going on. We can make it easier to understand by running sort and head together:
$ sort -n lengths.txt | head -n 1
9 methane.pdb
The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.
This has removed the need for the sorted-lengths.txt file.
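Pipes are not limited to this particular pair of commands. As one more small sketch, you could count how many .pdb files are in the current directory by piping the output of ls into wc -l:
ls *.pdb | wc -l    # one file name per line, counted by wc -l (prints 6 here)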
4.6 Combining multiple commands
Nothing prevents us from chaining pipes consecutively. We can for example send the output of wc directly to sort, and then send the resulting output to head. This removes the need for any intermediate files.
We’ll start by using a pipe to send the output of wc to sort:
$ wc -l *.pdb | sort -n
9 methane.pdb
12 ethane.pdb
15 propane.pdb
20 cubane.pdb
21 pentane.pdb
30 octane.pdb
107 total
We can then send that output through another pipe, to head, so that the full pipeline becomes:
$ wc -l *.pdb | sort -n | head -n 1
9 methane.pdb
This is exactly like a mathematician nesting functions like log(3x) and saying ‘the log of three times x’. In our case, the algorithm is ‘head of sort of line count of *.pdb’.
4.7 grep: a powerful tool for pattern search
The grep command is a powerful tool used in Unix-like operating systems to search for patterns in text. It can be used to find specific lines in files that match a given pattern. The basic syntax of the grep command is as follows:
grep [options] pattern [file...]
- [options]: Optional flags that modify the behavior of the grep command.
- pattern: The text pattern you want to search for.
- [file...]: Optional file names or paths where you want to search for the pattern. If no files are specified, grep will read from standard input (e.g., data piped into it).
Let’s use nano to create a file named example.txt with the following content:
This is a sample file.
It contains some lines.
Let’s search for a word in this file.
The word we’ll search for is “search.”
grep "search" example.txtThe output would be:
Let's search for a word in this file.
The word we'll search for is "search."
4.7.1 Common Options
- -i: Ignore case distinctions in the pattern and input files.
- -v: Invert the match, displaying lines that do not match the pattern.
- -r or -R: Recursively search directories for the pattern.
- -l: Print only the names of files with matching lines.
- -n: Prefix each line of output with the line number within its file.
- -c: Print only a count of matching lines per file.
- -H: Print the filename for each match.
4.7.2 Examples
- Simple Search:
grep 'hello' file.txt
Searches for lines containing the string “hello” in file.txt.
- Case-Insensitive Search:
grep -i 'hello' file.txt
Searches for lines containing “hello”, “Hello”, “HELLO”, etc., in file.txt.
- Recursive Search:
grep -r 'function' /path/to/directory
Searches for the string “function” in all files within the specified directory and its subdirectories.
- Count Matches:
grep -c 'error' logfile.txt
Counts the number of lines containing the string “error” in logfile.txt.
- Exclude Matches:
grep -v 'test' file.txt
Displays all lines that do not contain the string “test” in file.txt.
- Display Line Numbers:
grep -n 'main' program.c
Displays matching lines containing the string “main” in program.c, along with their line numbers.
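Options can also be combined in a single call. As a minimal sketch, reusing the example.txt file created earlier, the following matches regardless of case and prefixes each hit with its line number:
grep -in 'SEARCH' example.txt    # -i ignores case, -n adds line numbers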
4.7.3 Use in Bioinformatics
grep is particularly useful in bioinformatics for:
- Searching for specific sequences or patterns in large text files, such as FASTA or FASTQ files.
- Filtering lines in output files from various bioinformatics tools.
- Quickly identifying and extracting relevant information from log files, configuration files, and other textual data.
4.7.4 Example in Bioinformatics
Suppose you have a FASTA file (sequences.fasta) and you want to find all sequences containing the motif “ATGCGA”:
grep -B 1 'ATGCGA' sequences.fasta
This command searches for the motif “ATGCGA” and displays the matching lines along with the preceding line (which typically contains the sequence identifier in FASTA format).
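Another quick check along the same lines (a sketch, assuming the same sequences.fasta file) is to count how many sequences the file contains by counting its header lines, which in FASTA format all start with >:
grep -c '^>' sequences.fasta    # number of sequence records (each header line starts with ">")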
4.8 Summary
- wc counts lines, words, and characters in its inputs.
- cat displays the contents of its inputs.
- sort sorts its inputs.
- head displays the first 10 lines of its input.
- tail displays the last 10 lines of its input.
- command > [file] redirects a command’s output to a file (overwriting any existing content).
- command >> [file] appends a command’s output to a file.
- [first] | [second] is a pipeline: the output of the first command is used as the input to the second.
- The best way to use the shell is to use pipes to combine simple single-purpose programs (filters).
- grep is a powerful and versatile tool for text searching and processing.