This lesson is in the early stages of development (Alpha version)

Introduction to the UNIX shell

Overview

Teaching: 90 min
Exercises: 25 min
Questions
  • What is a command shell and why would I use one?

  • How can I see what files and directories I have?

  • How can I use loops and scripts to my advantage?

Objectives
  • Use the shell to navigate directories

  • Perform operations on files in directories outside your working directory

  • Interconvert between relative and absolute paths

  • View, search within, copy, move, and rename files. Create new directories.

  • Construct command pipelines with two or more stages.

  • Use for loops to run the same command for several input files.

Introduction to Unix commandline (BASH)

These introductory notes are adapted from The Carpentries Introduction to the Command Line for Genomics lessons


Introducing the Shell

A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard instead of controlling graphical user interfaces (GUIs) with a mouse/keyboard combination.

Many bioinformatics tools can only be used through a command line interface, or have extra capabilities in the command line version that are not available in the GUI. This is true, for example, of BLAST, which offers many advanced functions only accessible to users who know how to use a shell.

In this lesson you will learn how to use the command line interface to move around in your file system.

Open up a terminal in Jupyter.

[matt.bixley@wbg004 ~/.jupyter/jobs/22861436] $

Note you will automatically start off in a directory related to the slurm job running your Jupyter session e.g. ~/.jupyter/jobs/22861436

The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.

Let’s find out where we are by running a command called pwd (which stands for “print working directory”). At any moment, our current working directory is our current default directory, i.e., the directory that the computer assumes we want to run commands in, unless we explicitly specify something else. Here, the computer’s response is /home/yourname/.jupyter/jobs/yourjobid; the name and number you see will be different, because they relate to your username and the specific Jupyter session running on NeSI.

$ pwd
/home/matt.bixley/.jupyter/jobs/22861436

Let’s look at how our file system is organized. We can see what files and subdirectories are in this directory by running ls, which stands for “listing”:

$ ls
home  nesi02659  nobackup_nesi02659

ls prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns.

We can also add flags to our commands which will modify the behaviour of the command.

$ ls -l
total 3
lrwxrwxrwx 1 matt.bixley matt.bixley 19 Oct 18 21:02 home -> /home/matt.bixley
lrwxrwxrwx 1 matt.bixley matt.bixley 23 Oct 18 21:02 nesi02659 -> /nesi/project/nesi02659
lrwxrwxrwx 1 matt.bixley matt.bixley 24 Oct 18 21:02 nobackup_nesi02659 -> /nesi/nobackup/nesi02659

The -> in this output tells us that we have some shortcut links in this directory.


Full vs. Relative Paths

Directories can be specified using either a relative path or a full absolute path. The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy.

Here is an example of how the workshop directory hierarchy is set up:

/
|-- home/
  |-- matt.bixley/
    |-- obss_2021/
      |-- edna/
      |-- gbs/
      |-- genomic_dna/
      |-- intro_bash/
      |-- intro_r/
      `-- rnaseq/

Home

Your home directory is where your user account lives on the filesystem and is found at /home/yourname. Because your home directory is special to you, it can also be referred to using ~/ which bash will automatically convert to the full path of /home/yourname.

The command to change locations in our file system is cd, followed by a directory name to change our working directory. cd stands for “change directory”.

Let’s use cd to navigate to our home directory using ~/ to represent our home.

$ cd ~/
$ pwd
/home/matt.bixley

Using cd without specifying a directory will, by default, send you to your home directory.

The output of pwd above is the full name of your home directory. This tells you that you are in a directory called matt.bixley, which sits inside a directory called home which sits inside the very top directory in the hierarchy. The very top of the hierarchy is a directory called / which is usually referred to as the root directory. So, to summarize: matt.bixley is a directory in home which is a directory in / (root).

Let’s look at what is in your home directory:

$ ls
obss_2021

obss_2021 is where we will be working for the entirety of the workshop. Let’s change into that directory now using a relative path.

$ cd obss_2021

We can make the ls output more comprehensible by using the flag -F, which tells ls to add a trailing / to the names of directories:

$ ls -F
edna/  gbs/  genomic_dna/  intro_bash/  intro_r/  rnaseq/

Anything with a “/” after it is a directory. Things with a “*” after them are programs. If there are no decorations, it’s a file.
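As a minimal sketch of these decorations, the following builds a throwaway directory (the names data, notes.txt, and run.sh are hypothetical and not part of the workshop data) containing one directory, one plain file, and one executable file, then lists it with -F.

```shell
# Build a scratch directory so that nothing in the workshop data is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

mkdir data              # a directory: listed with a trailing /
touch notes.txt run.sh  # two empty files
chmod +x run.sh         # mark run.sh as executable: listed with a trailing *

listing=$(ls -F)
echo "$listing"
```

Only notes.txt is listed without a decoration, because it is neither a directory nor executable.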

$ pwd
/home/matt.bixley/obss_2021

We have a special command to tell the computer to move us back or up one directory level.

$ cd ..
$ pwd

You will see:

/home/yourname

Now enter the following command:

$ cd /home/yourname/obss_2021/intro_bash/shell_data/

This jumps forward multiple levels to the shell_data directory. Now go back to the home directory with cd or cd ~:

$ cd

You can also navigate to the shell_data directory using:

$ cd obss_2021/intro_bash/shell_data

These two commands have the same effect: they both take us to the shell_data directory. The first uses the absolute path, giving the full address from the root directory. The second uses a relative path, giving only the address from the working directory. A full path always starts with a /. A relative path does not.

A relative path is like getting directions from someone on the street. They tell you to “go right at the stop sign, and then turn left on Main Street”. That works great if you’re standing there together, but not so well if you’re trying to tell someone how to get there from another country. A full path is like GPS coordinates. It tells you exactly where something is no matter where you are right now.
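The difference can be tried out safely in a throwaway directory tree. This is only a sketch: demo_root, project, and data are hypothetical names, and mktemp -d is assumed to return an absolute path (its usual behaviour).

```shell
# Build a small scratch hierarchy to experiment in.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/project/data"

# Absolute path: starts with / and works from any working directory.
cd "$demo_root/project/data"
via_absolute=$(pwd -P)

# Relative path: interpreted starting from the current working directory.
cd "$demo_root"
cd project/data
via_relative=$(pwd -P)

# Both routes end in the same place.
echo "$via_absolute"
echo "$via_relative"
```

The relative form only works because we first moved to demo_root; from any other directory, cd project/data would fail or land somewhere else.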


Examining Files

We now know how to switch directories, run programs, and look at the contents of directories, but how do we look at the contents of files?

One way to examine a file is to print out all of the contents using the program cat.

Enter the following command from within the untrimmed_fastq directory (it sits inside shell_data, so run cd untrimmed_fastq first if you are still in shell_data):

$ cat SRR098026.fastq

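This will print out all of the contents of SRR098026.fastq to the screen.

For a large file this scrolls past too quickly to read. The program less opens a file read-only and lets you move through it one screen at a time. Enter the following command: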

$ less SRR097977.fastq

Some navigation commands in less:

key      action
Space    go forward
b        go backward
g        go to the beginning
G        go to the end
q        quit

less also gives you a way of searching through files. Use the “/” key to begin a search. Enter the word you would like to search for and press enter. The screen will jump to the next location where that word is found.

There’s another way that we can look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it’s formatted.

The commands are head and tail and they let you look at the beginning and end of a file, respectively.

$ head SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
$ tail SRR098026.fastq
+SRR098026.247 HWUSI-EAS1599_1:2:1:2:1311 length=35
#!##!#################!!!!!!!######
@SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
GNTGNGGTCATCATACGCGCCCNNNNNNNGGCATG
+SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
B!;?!A=5922:##########!!!!!!!######
@SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
CNCTNTATGCGTACGGCAGTGANNNNNNNGGAGAT
+SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
A!@B!BBB@ABAB#########!!!!!!!######

The -n option to either of these commands can be used to print the first or last n lines of a file.

$ head -n 1 SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
$ tail -n 1 SRR098026.fastq
A!@B!BBB@ABAB#########!!!!!!!######
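Because each FASTQ record is four lines long, -n 4 selects exactly one record. The sketch below uses a small made-up file (example.fastq, with two invented records) rather than the workshop data.

```shell
# Create an 8-line example file holding two made-up FASTQ-style records.
demo_dir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'NNNN' '+' '!!!!' > "$demo_dir/example.fastq"

# The first four lines are the first record.
first_record=$(head -n 4 "$demo_dir/example.fastq")

# The last four lines are the last record.
last_record=$(tail -n 4 "$demo_dir/example.fastq")

echo "$first_record"
echo "$last_record"
```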

Shortcut: Tab Completion

Typing out file or directory names can waste a lot of time and it’s easy to make typing mistakes. Instead we can use tab complete as a shortcut. When you start typing out the name of a directory or file, then hit the Tab key, the shell will try to fill in the rest of the directory or file name.

Return to the intro_bash directory:

$ cd ~/obss_2021/intro_bash

then enter:

$ cd she<tab>

The shell will fill in the rest of the directory name for shell_data.

Now change directories to untrimmed_fastq in shell_data

$ cd shell_data
$ cd untrimmed_fastq

Using tab complete can be very helpful. However, it will only autocomplete a file or directory name if you’ve typed enough characters to provide a unique identifier for the file or directory you are trying to access.

For example, if we now try to list the files whose names start with SR by using tab complete:

$ ls SR<tab>

The shell auto-completes your command to SRR09, because all file names in the directory begin with this prefix. When you hit Tab again, the shell will list the possible choices.

$ ls SRR09<tab><tab>
SRR097977.fastq  SRR098026.fastq

Tab completion can also fill in the names of programs, which can be useful if you remember the beginning of a program name.

Copying Files

When working with computational data, it’s important to keep a safe copy of that data that can’t be accidentally overwritten or deleted. For this lesson, our raw data is our FASTQ files. We don’t want to accidentally change the original files, so we’ll make a copy of them and change the file permissions so that we can read from, but not write to, the files.

First, let’s make a copy of one of our FASTQ files using the cp command.

Navigate to the shell_data/untrimmed_fastq directory and enter:

$ cp SRR098026.fastq SRR098026-copy.fastq
$ ls -F
SRR097977.fastq  SRR098026-copy.fastq  SRR098026.fastq

Creating Directories

The mkdir command is used to make a directory. Enter mkdir followed by a space, then the directory name you want to create:

$ mkdir backup

Moving

We can now move our backup file to this directory. We can move files around using the command mv:

$ mv SRR098026-copy.fastq backup/
$ ls backup
SRR098026-copy.fastq

The mv command is also how you rename files. Let’s rename this file to make it clear that this is a backup:

$ cd backup
$ mv SRR098026-copy.fastq SRR098026-backup.fastq
$ ls
SRR098026-backup.fastq
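The copy, make-directory, move, and rename steps can be rehearsed in a scratch directory before touching real data. This is a sketch only; sample.fastq is a hypothetical stand-in for a FASTQ file.

```shell
# Work in a scratch directory so no real data is affected.
scratch=$(mktemp -d)
cd "$scratch"
echo "example data" > sample.fastq

cp sample.fastq sample-copy.fastq    # copy: the original stays in place
mkdir backup                         # new directory to hold the copy
mv sample-copy.fastq backup/         # move: the file changes directory
# rename: mv with a new name in the same directory
mv backup/sample-copy.fastq backup/sample-backup.fastq

ls -F . backup
```

Note that mv overwrites an existing destination file without asking; the -i flag makes it prompt first.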

Redirection/Pipes

We discussed in a previous section how to look at a file using less and head. We can also search within files without even opening them, using grep. We can then send what we find to somewhere else.

$ cd ~/obss_2021/intro_bash/shell_data/untrimmed_fastq

Suppose we want to see how many reads in our file have really bad segments containing 10 consecutive unknown nucleotides (Ns). Let’s search for the string NNNNNNNNNN in the SRR098026 file:

$ grep NNNNNNNNNN SRR098026.fastq

One of the lines returned by this command is: CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN.

Because each read in a FASTQ file spans four lines, we want a way to return the metadata and the quality associated with that sequence.

FastQ files

We will cover the FastQ format in more depth as part of the Genomic DNA variant calling lesson tomorrow.

@SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+SRR098026.177 HWUSI-EAS1599_1:2:1:1:2025 length=35
#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

We can use the -B argument for grep to return a specific number of lines before each match. The -A argument returns a specific number of lines after each matching line. Here we want the line before and the two lines after each matching line, so we add -B1 -A2 to our grep command:

$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq

Instead of the found text spilling onto the screen, it would be useful to be able to send it to a file so that we could browse through it in a controlled manner e.g. using less. The command for redirecting output to a file is >.

Let’s try out this command and copy all the records (including all four lines of each record) in our FASTQ files that contain ‘NNNNNNNNNN’ to another file called bad_reads.txt.

$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq > bad_reads.txt

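The UNIX shell comes with many useful programs. One such program is wc (“word count”), which counts the lines, words, and characters in a file; the -l flag restricts the output to the number of lines.

We can then see how many lines of bad reads we have. Because each record spans four lines, the number of bad reads is roughly the line count divided by four (grep also prints a -- separator line between non-adjacent groups of matches).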

$ wc -l bad_reads.txt
537 bad_reads.txt
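To see how -B1 -A2, > and wc -l fit together, here is a minimal sketch on a made-up file, tiny.fastq, with three invented records of which only the second contains a run of Ns. The shorter pattern NNNN is used only to keep the example small.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Three made-up records; only read2 contains a run of Ns.
printf '%s\n' \
  '@read1' 'ACGTACGT' '+read1' 'IIIIIIII' \
  '@read2' 'ANNNNNNT' '+read2' '#!!!!!!#' \
  '@read3' 'TTGACCAA' '+read3' 'IIIIIIII' > tiny.fastq

# Keep one line before and two lines after each match: the whole record.
grep -B1 -A2 NNNN tiny.fastq > bad_reads.txt

# Count the lines that were written; four lines make up one record.
line_count=$(wc -l < bad_reads.txt)
echo "lines: $((line_count)), reads: $((line_count / 4))"
```

Reading the file through < makes wc print only the number, without the file name.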

Help and Manuals

You can find out what flags are available to most UNIX programs by using the --help flag. Or you can use the in-built manual by using man <program>, e.g. man wc will let you find out what the -l flag does.

We created the files to store the reads and then counted the lines in the file to see how many reads matched our criteria. There’s a way to do this, however, that doesn’t require us to create these intermediate files - the pipe command (|).

This is probably not a key on your keyboard you use very much, so let’s all take a minute to find that key. For the standard QWERTY keyboard layout, the | character can be found using the key combination Shift + \ (Shift and the backslash key).

What | does is take the output that is scrolling by on the terminal and use it as input to another command. When our output was scrolling by, we might have wished we could slow it down and look at it, like we can with less. Well, it turns out that we can: we can send the output of our grep call through less with a pipe. In the same way, we can pipe it to wc -l to count the lines without creating an intermediate file:

$ grep -B1 -A2 NNNNNNNNNN SRR098026.fastq | wc -l 
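As a sketch of why the pipe is equivalent to the two-step version, the following counts the same grep output both ways on a made-up file (tiny.fastq, with invented records in which read1 and read3 contain runs of Ns).

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Made-up records: read1 and read3 contain ten consecutive Ns, read2 does not.
printf '%s\n' \
  '@read1' 'NNNNNNNNNNAC' '+read1' '!!!!!!!!!!II' \
  '@read2' 'ACGTACGTACGT' '+read2' 'IIIIIIIIIIII' \
  '@read3' 'GTNNNNNNNNNN' '+read3' 'II!!!!!!!!!!' > tiny.fastq

# Two steps: write an intermediate file, then count its lines.
grep -B1 -A2 NNNNNNNNNN tiny.fastq > bad_reads.txt
from_file=$(wc -l < bad_reads.txt)

# One step: send the grep output straight into wc through a pipe.
from_pipe=$(grep -B1 -A2 NNNNNNNNNN tiny.fastq | wc -l)

echo "via file: $((from_file)), via pipe: $((from_pipe))"
```

Both counts agree; the piped version simply leaves no bad_reads.txt behind to clean up.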

Variables

A variable is a way to store information, e.g. a list, and use it again (or several times) without having to write it out each time.

$ foo=abc
$ echo foo is $foo
foo is abc

We can append text to a variable’s value by wrapping the variable name in curly brackets {}, which mark where the name ends:

$ echo foo is ${foo}EFG
foo is abcEFG
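The braces matter because, without them, the shell reads the longest possible variable name. In this sketch fooEFG is a separate, hypothetical variable that is set to an empty string to make the difference visible.

```shell
foo=abc
fooEFG=""   # a different variable from foo; empty here

# Without braces the shell substitutes the variable named fooEFG.
without_braces="foo is $fooEFG"

# With braces the name ends at }, so EFG is appended to the value of foo.
with_braces="foo is ${foo}EFG"

echo "$without_braces"
echo "$with_braces"
```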

Writing for loops

Loops are key to productivity improvements through automation as they allow us to execute commands repeatedly. Similar to wildcards and tab completion, using loops also reduces the amount of typing (and typing mistakes). Loops are helpful when performing operations on groups of sequencing files, such as unzipping or trimming multiple files. We will use loops for these purposes in subsequent analyses, but will cover the basics of them for now.

When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value by putting $ in front of it. The $ tells the shell interpreter to treat the variable as a variable name and substitute its value in its place, rather than treat it as text or an external command.

Let’s write a for loop to show us the first two lines of the fastq files we downloaded earlier. You will notice the shell prompt changes from $ to > and back again as we type in our loop. The second prompt, >, is different to remind us that we haven’t finished typing a complete command yet. A semicolon, ;, can be used to separate two commands written on a single line.

$ cd ../untrimmed_fastq/
$ for filename in *.fastq
> do
> head -n 2 ${filename}
> done

The for loop begins with the formula for <variable> in <group to iterate over>. In this case, the word filename is designated as the variable to be used over each iteration. In our case SRR097977.fastq and SRR098026.fastq will be substituted for filename because they fit the pattern of ending with .fastq in the directory we’ve specified. The next line of the for loop is do. The next line is the code that we want to execute. We are telling the loop to print the first two lines of each file we iterate over. Finally, the word done ends the loop.

After executing the loop, you should see the first two lines of both fastq files printed to the terminal. Let’s create a loop that will save this information to a file.

$ for filename in *.fastq
> do
> head -n 2 ${filename} >> seq_info.txt
> done

When writing a loop, you will not be able to return to previous lines once you have pressed Enter. Remember that we can cancel the current command using Ctrl-C if you notice a mistake that is going to prevent your loop from executing correctly.

Note that we are using >> to append the text to our seq_info.txt file. If we used >, the seq_info.txt file would be rewritten every time the loop iterates, so it would only have text from the last file processed. Instead, >> adds to the end of the file.
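The difference between > and >> inside a loop can be seen side by side in a small sketch. The files a.fastq and b.fastq and the output names overwritten.txt and appended.txt are hypothetical.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' '@a1' 'AAAA' > a.fastq
printf '%s\n' '@b1' 'CCCC' > b.fastq

for filename in *.fastq
do
  head -n 1 "${filename}" > overwritten.txt   # > replaces the file on every iteration
  head -n 1 "${filename}" >> appended.txt     # >> adds to the end on every iteration
done

wc -l overwritten.txt appended.txt
```

After the loop, overwritten.txt holds only the header from the last file, while appended.txt holds one header per file.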

Extended for loops

basename is a program in UNIX that is helpful for removing a uniform part of a name from a list of files. In this case, we will use basename to remove the .fastq extension from the files that we’ve been working with.

$ basename SRR097977.fastq .fastq

We see that this returns just the SRR accession, and no longer has the .fastq file extension on it.

SRR097977

If we try the same thing but use .fasta as the file extension instead, the name is returned unchanged. This is because basename only removes a suffix that exactly matches the end of the file name.

$ basename SRR097977.fastq .fasta
SRR097977.fastq
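The behaviour of basename can be checked without any files on disk, since it only manipulates the text of the name. In this sketch the directory path in the third call is a made-up example.

```shell
# The suffix matches the end of the name, so it is removed.
matched=$(basename SRR097977.fastq .fastq)

# The suffix does not match, so the name is returned unchanged.
unmatched=$(basename SRR097977.fastq .fasta)

# basename also strips any leading directories from a path.
with_dir=$(basename /home/yourname/obss_2021/SRR097977.fastq .fastq)

echo "$matched"
echo "$unmatched"
echo "$with_dir"
```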

basename is really powerful when used in a for loop. It allows us to access just the file prefix, which we can use to name things. Let’s try this.

Inside our for loop, we create a new name variable. We call basename inside $( ), which captures the output of the command, then give our variable from the for loop, in this case ${filename}, and finally state that .fastq should be removed from the file name. It’s important to note that we’re not changing the actual files, we’re creating a new variable called name. The line echo ${name} will print the value of name to the terminal each time the for loop runs. Because we are iterating over two files, we expect to see two lines of output.

$ for filename in *.fastq
> do
> name=$(basename ${filename} .fastq)
> echo ${name}
> done

One way this is really useful is to move files. Let’s rename all of our .txt files using mv so that they have the year on them, which will document when we created them.

$ for filename in *.txt
> do
> name=$(basename ${filename} .txt)
> mv ${filename} ${name}_2019.txt
> done
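Before renaming many files at once, it can help to do a dry run that prints each mv command instead of executing it. This is a sketch of that habit, not part of the original lesson; notes.txt and summary.txt are hypothetical files.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch notes.txt summary.txt

for filename in *.txt
do
  name=$(basename "${filename}" .txt)
  # Dry run: echo prints the command that would be executed.
  echo mv "${filename}" "${name}_2019.txt"
done
```

Once the printed commands look right, removing the word echo performs the renames. Quoting "${filename}" keeps names containing spaces intact.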

Extra exercises

Writing files

We’ve been able to do a lot of work with files that already exist, but what if we want to write our own files? We’re not going to type in a FASTA file, but we’ll see as we go through other tutorials, there are a lot of reasons we’ll want to write a file, or edit an existing file.

To add text to files, we’re going to use a text editor called Nano. We’re going to create a file to take notes about what we’ve been doing with the data files in ~/obss_2021/intro_bash/shell_data/untrimmed_fastq.

This is good practice when working in bioinformatics. We can create a file called README.txt that describes the data files in the directory or documents how the files in that directory were generated. As the name suggests, it’s a file that we or others should read to understand the information in that directory.

Let’s change our working directory to ~/obss_2021/intro_bash/shell_data/untrimmed_fastq using cd, then run nano to create a file called README.txt:

$ cd ~/obss_2021/intro_bash/shell_data/untrimmed_fastq
$ nano README.txt

You should see something like this:

[Screenshot of the nano editor (nano201711.png)]

The text at the bottom of the screen shows the keyboard shortcuts for performing various tasks in nano. We will talk more about how to interpret this information soon.

Which Editor?

When we say, “nano is a text editor,” we really do mean “text”: it can only work with plain character data, not tables, images, or any other human-friendly media. We use it in examples because it is one of the least complex text editors. However, because of this trait, it may not be powerful enough or flexible enough for the work you need to do after this workshop. On Unix systems (such as Linux and Mac OS X), many programmers use Emacs or Vim (both of which require more time to learn), or a graphical editor such as Gedit. On Windows, you may wish to use Notepad++. Windows also has a built-in editor called notepad that can be run from the command line in the same way as nano for the purposes of this lesson.

No matter what editor you use, you will need to know where it searches for and saves files. If you start it from the shell, it will (probably) use your current working directory as its default location. If you use your computer’s start menu, it may want to save files in your desktop or documents directory instead. You can change this by navigating to another directory the first time you “Save As…”

Let’s type in a few lines of text. Describe what the files in this directory are or what you’ve been doing with them. Once we’re happy with our text, we can press Ctrl-O (press the Ctrl or Control key and, while holding it down, press the O key) to write our data to disk. You’ll be asked what file we want to save this to: press Return to accept the suggested default of README.txt.

Once our file is saved, we can use Ctrl-X to quit the editor and return to the shell.

Control, Ctrl, or ^ Key

The Control key is also called the “Ctrl” key. There are various ways in which using the Control key may be described. For example, you may see an instruction to press the Ctrl key and, while holding it down, press the X key, described as any of:

  • Control-X
  • Control+X
  • Ctrl-X
  • Ctrl+X
  • ^X
  • C-x

In nano, along the bottom of the screen you’ll see ^G Get Help ^O WriteOut. This means that you can use Ctrl-G to get help and Ctrl-O to save your file.

Now you’ve written a file. You can take a look at it with less or cat, or open it up again and edit it with nano.

Exercise

Open README.txt and add the date to the top of the file and save the file.

Solution

Use nano README.txt to open the file.
Add today’s date and then use Ctrl-X followed by y and Enter to save.

Writing scripts

A really powerful thing about the command line is that you can write scripts. Scripts let you save commands to run them again and also let you put multiple commands together. Though writing scripts may require an additional time investment initially, this can save you time as you run them repeatedly. Scripts can also address the challenge of reproducibility: if you need to repeat an analysis, you retain a record of your command history within the script.

One thing we will commonly want to do with sequencing results is pull out bad reads and write them to a file to see if we can figure out what’s going on with them. We’re going to look for reads with long sequences of N’s like we did before, but now we’re going to write a script, so we can run it each time we get new sequences, rather than type the code in by hand each time.

We’re going to create a new file to put this command in. We’ll call it bad-reads-script.sh. The sh isn’t required, but using that extension tells us that it’s a shell script.

$ nano bad-reads-script.sh

Bad reads have a lot of N’s, so we’re going to look for NNNNNNNNNN with grep. We want the whole FASTQ record, so we’re also going to get the one line above the sequence and the two lines below. We also want to look in all the files that end with .fastq, so we’re going to use the * wildcard.

grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt

Custom grep control

We introduced the -v option in the previous episode; it inverts the match, so grep -v '^--' drops the -- separator lines that grep prints between groups of matches. Now we are also using -h to “Suppress the prefixing of file names on output” according to the documentation shown by man grep.

Type your grep command into the file and save it as before. Be careful not to add the $ at the beginning of the line.

Now comes the neat part. We can run this script. Type:

$ bash bad-reads-script.sh

It will look like nothing happened, but now if you look at scripted_bad_reads.txt, you can see that there are now reads in the file.
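The whole cycle can be rehearsed on made-up data. In this sketch tiny.fastq holds three invented records (read1 and read3 contain runs of Ns), and the script is written with a here-document instead of nano purely so that the example is self-contained.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Made-up records: read1 and read3 contain ten consecutive Ns, read2 does not.
printf '%s\n' \
  '@read1' 'NNNNNNNNNNAC' '+read1' '!!!!!!!!!!II' \
  '@read2' 'ACGTACGTACGT' '+read2' 'IIIIIIIIIIII' \
  '@read3' 'GTNNNNNNNNNN' '+read3' 'II!!!!!!!!!!' > tiny.fastq

# Write the one-line script to a file (the lesson does this in nano).
cat > bad-reads-script.sh <<'EOF'
grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt
EOF

# Run the script, then count the records it wrote (four lines each).
bash bad-reads-script.sh
record_lines=$(wc -l < scripted_bad_reads.txt)
echo "$((record_lines / 4)) bad reads written"
```

Because grep -v '^--' removes the separator lines, the line count here is an exact multiple of four.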

Exercise

We want the script to tell us when it’s done.

  1. Open bad-reads-script.sh and add the line echo "Script finished!" after the grep command and save the file.
  2. Run the updated script.

Solution

  $ bash bad-reads-script.sh
  Script finished!

Making the script into a program

We had to type bash because we needed to tell the computer what program to use to run this script. Instead, we can turn this script into its own program. We need to tell it that it’s a program by making it executable. We can do this by changing the file permissions. We talked about permissions in an earlier episode.

First, let’s look at the current permissions.

$ ls -l bad-reads-script.sh
-rw-rw-r-- 1 dcuser dcuser 0 Oct 25 21:46 bad-reads-script.sh

We see that it says -rw-rw-r--. This shows that the file can be read by any user and written to by the file owner (you) and members of your group. We want to change these permissions so that the file can be executed as a program. We use the command chmod like we did earlier when we removed write permissions. Here we are adding (+) executable permissions (+x).

$ chmod +x bad-reads-script.sh

Now let’s look at the permissions again.

$ ls -l bad-reads-script.sh
-rwxrwxr-x 1 dcuser dcuser 0 Oct 25 21:46 bad-reads-script.sh

Now we see that it says -rwxrwxr-x. The x’s that are there now tell us we can run it as a program. So, let’s try it! We’ll need to put ./ at the beginning so the computer knows to look here in this directory for the program.

$ ./bad-reads-script.sh

The script should run the same way as before, but now we’ve created our very own computer program!
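The effect of chmod +x on the permission string can be observed directly. This sketch uses a hypothetical one-line script, hello.sh, and captures the first ten characters of the ls -l output before and after the change.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
echo 'echo "hello from a script"' > hello.sh

# The first ten characters of ls -l are the file type and permission bits.
before=$(ls -l hello.sh | cut -c 1-10)
chmod +x hello.sh
after=$(ls -l hello.sh | cut -c 1-10)

echo "before: $before"
echo "after:  $after"
```

The exact strings depend on your system’s default permissions (the umask), but the owner’s execute position should change from - to x.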

You will learn more about writing scripts in a later lesson.

Key Points

  • The shell gives you the ability to work more efficiently by using keyboard commands rather than a GUI.

  • Useful commands for navigating your file system include: ls, pwd, and cd.

  • Tab completion can reduce errors from mistyping and make work more efficient in the shell.