Data101: From raw data to individual samples files

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How do we process data from raw reads to individual fastq files?

Objectives

Check the quality of the data

Assign reads to individual samples

Developed by: Ludovic Dutoit, Alana Alexander

Adapted from: Julian Catchen, Nicolas Rochette

Background

In this tutorial, we will be working with data from Ingram et al. (2020). They sampled common bullies (Gobiomorphus cotidianus; AKA: Lake Fish) from Lake Wanaka and Lake Wakatipu in Aotearoa New Zealand. These fishes look different at different depths, and we found ourselves asking whether those morphological differences were supported by genetic structure. Today we are using a reduced dataset from this paper of 12 fishes, a reasonable size to be able to run all our commands quickly: 6 fishes sampled each from each lake, 3 from each lake at shallow depth, and three coming from deeper depths..

Common blly *Gobiomorphus cotidianus*. Copyright: Zureks

Working with single-end reads

The first step in the analysis of all short-read sequencing data, including RAD-seq data, is separating out reads from different samples that were individually barcoded. This ‘de-multiplexing’ associates reads with the different individuals or population samples from which they were derived.

Getting set up

Let’s organise our space, get comfortable moving around, and copy our data :
- Log into Jupyter at https://jupyter.nesi.org.nz/hub/login. Remember, the second factor is on your phone.
- You will set up a directory structure on the remote server that will hold your data and the different steps of your analysis. We will start by moving to the gbs/ directory in your working space (~/obss_2021/), so let’s cd (change directory) to your working location:

$ cd ~/obss_2021/gbs

$ pwd

/home/yourname/obss_2021/gbs

NOTE: If you get lost at any time today, you can always cd to your directory with the above command.

The exercise from now on is hands-on: the instructions are here to guide you through the process but you will have to come up with the specific commands yourself. Fear not, the instructors are there to help you and the room is full of friendly faces to help you get through it.

We will create more subdirectories to hold our analyses. Be careful that you are reading and writing files to the appropriate directories within your hierarchy. You’ll be making many directories, so stay organized! We will name the directories in a way that corresponds to each stage of our analysis and that allows us to remember where all our files are! A structured workspace makes analyses easier and prevents data from being accidentally overwritten.

Exercise

In gbs/, we will create two directories: lane1 and samples. lane1 will contain the raw data. By the end of the exercise, samples will contain one file per sample with all the reads for that individual.
Solution
$ mkdir lane1 samples

As a check that we’ve set up our workspace correctly use the ls -R (the ls command with the recursive flag). It should show you the following (./ is a placeholder for ‘current directory’):

./lane1:

./samples:

Exercise

Copy the dataset containing the raw reads to your lane1 directory. The dataset is the file /nesi/project/nesi02659/obss_2021/resources/gbs/lane1.fastq.gz (hint: cp /path/to/what/you/want/to/copy /destination/you/want/it/to/go)
Solution
$ cp /nesi/project/nesi02659/obss_2021/resources/gbs/lane1.fastq.gz lane1/

Examining the data

Let’s have a closer look at this data. Over the last couple of days, you learnt to run FastQC to evaluate the quality of the data. Run it on these files. Load the module first and then run FastQC over the gzipped file. We will help you out with these commands, but bonus question as you work through these: What do the commands module spider, module load, module purge do?

$ module purge
$ module spider fastqc 
$ module load FastQC
$ cd lane1
$ fastqc lane1.fastq.gz

Explanations of this code: module purge get rids of any pre-existing potentially conflicting modules. module spider searches for modules e.g. module spider fastqc looks for a module called fastqc (or something similar!). Once we know what this module is actually called (note: almost everything we do on terminal is case-sensitive) we can use module load to load the module. Finally, we ran fastqc.

You just generated a FastQC report. Use the Jupyter hub navigator tool to follow the path to your current folder (hint: If you’re not quite sure where you are, use pwd in your terminal window. Also, if obss_2021 doesn’t show up in the menu on the left, you might need to also click the littler folder icon just above Name). Once you’ve navigated to the correct folder, you can then double click on a fastqc .html report.

Exercise

What is this odd thing in the per base sequence content from base 7 to 12-13?

Solution

It is the enzyme cutsite. It is identical on all the fragments that were correctly digested by the enzyme before being ligated to a restriction enzyme.

You probably noticed that not all of the data is high quality (especially if you check the ‘Per base sequence quality’ tab!). In general with next generation sequencing, you want to remove the lowest quality sequences and remaining adapters from your data set before you proceed to do anything. However, the stringency of the filtering will depend on the final application. In general, higher stringency is needed for de novo assemblies as compared to alignments to a reference genome. However, low quality data can affect downstream analysis for both de novo and reference-based approaches, producing false positives, such as errant SNP predictions.

We will use the Stacks’s program process_radtags to remove low quality sequences (also known as cleaning data) and to demultiplex our samples. Here is the Stacks manual as well as the specific manual page for process_radtags on the Stacks website to find information and examples.

Exercise

Get back into your gbs folder by running cd ../ in the terminal (hint: if you are lost use pwd to check where you are).

It is time to load the Stacks module to be able to access the process_radtags command. Find the module and load it (hint: Do for Stacks what we did above for FastQC).

You will need to specify the set of barcodes used in the construction of the RAD library. Each P1 adaptor in RAD read starts with a particular DNA barcode that gets sequenced first, allowing data to be associated with individual samples. To save you some time, the barcode file is at: /nesi/project/nesi02659/obss_2021/resources/gbs/lane1_barcodes.txt Copy it to the gbs/ where you currently are. Have a look at the barcodes.

Based on the barcode file, can you check how many samples were multiplexed together in this RAD library

Have a look at the help online to prepare your process_radtags command. You will need to specify:

the restriction enzyme used to construct the library (pstI)

the directory of input files (the lane1 directory)

the list of barcodes (lane1_barcodes.txt)

the output directory (samples)

finally, specify that process_radtags needs clean, discard, and rescue reads as options of process_radtags

You should now be able to run the process_radtags command from the gbs/ directory using these options. It will take a couple of minutes to run. Take a breath or think about what commands we’ve run through so far.
Solution
### Load the module
$ module load Stacks
### copy the barcodes
$ cp /nesi/project/nesi02659/obss_2021/resources/gbs/lane1_barcodes.txt .
### count barcodes
wc -l lane1_barcodes.txt
### run process_radtags
$ process_radtags  -p lane1/ -o ./samples/ -b lane1_barcodes.txt -e pstI -r -c -q

The process_radtags program will write a log file into the output directory. Have a look in there. Examine the log and answer the following questions:

How many reads were retained?
Of those discarded, what were the reasons?
Use ls -lh samples to have a quick look at the size of the samples files, and make sure all files have data. Well done! Take a break, sit back or help your neighbour, we will be back shortly!

Key Points

Raw data should always be checked with FastQC.

Assigning reads to specific samples is called demultiplexing

process_radtags is the built in demultiplexing tools of Stacks and it includes some basic quality control

lesson home

Genotype by Sequencing

next episode

Data101: From raw data to individual samples files

Overview

Background

Working with single-end reads

Getting set up

Exercise

Solution

Exercise

Solution

Examining the data

Exercise

Solution

Exercise

Solution

Key Points

lesson home

next episode