Data101: From raw data to individual samples files
Overview
Teaching: 0 min
Exercises: 0 minQuestions
How do we process data from raw reads to individual fastq files?
Objectives
Check the quality of the data
Assign reads to individual samples
Developed by: Ludovic Dutoit, Alana Alexander
Adapted from: Julian Catchen, Nicolas Rochette
Background
In this tutorial, we will be working with data from Ingram et al. (2020). They sampled common bullies (Gobiomorphus cotidianus; AKA: Lake Fish) from Lake Wanaka and Lake Wakatipu in Aotearoa New Zealand. These fishes look different at different depths, and we found ourselves asking whether those morphological differences were supported by genetic structure. Today we are using a reduced dataset from this paper of 12 fishes, a reasonable size to be able to run all our commands quickly: 6 fishes sampled each from each lake, 3 from each lake at shallow depth, and three coming from deeper depths..
Working with single-end reads
The first step in the analysis of all short-read sequencing data, including RAD-seq data, is separating out reads from different samples that were individually barcoded. This ‘de-multiplexing’ associates reads with the different individuals or population samples from which they were derived.
Getting set up
-
Let’s organise our space, get comfortable moving around, and copy our data :
- Log into Jupyter at https://jupyter.nesi.org.nz/hub/login. Remember, the second factor is on your phone.
- You will set up a directory structure on the remote server that will hold your data and the different steps of your analysis. We will start by moving to the
gbs/
directory in your working space (~/obss_2021/
), so let’scd
(change directory) to your working location:
$ cd ~/obss_2021/gbs
$ pwd
/home/yourname/obss_2021/gbs
NOTE: If you get lost at any time today, you can always
cd
to your directory with the above command.
The exercise from now on is hands-on: the instructions are here to guide you through the process but you will have to come up with the specific commands yourself. Fear not, the instructors are there to help you and the room is full of friendly faces to help you get through it.
- We will create more subdirectories to hold our analyses. Be careful that you are reading and writing files to the appropriate directories within your hierarchy. You’ll be making many directories, so stay organized! We will name the directories in a way that corresponds to each stage of our analysis and that allows us to remember where all our files are! A structured workspace makes analyses easier and prevents data from being accidentally overwritten.
Exercise
In
gbs/
, we will create two directories:lane1
andsamples
.lane1
will contain the raw data. By the end of the exercise,samples
will contain one file per sample with all the reads for that individual.Solution
$ mkdir lane1 samples
- As a check that we’ve set up our workspace correctly use the
ls -R
(thels
command with the recursive flag). It should show you the following (./
is a placeholder for ‘current directory’):
./lane1:
./samples:
Exercise
Copy the dataset containing the raw reads to your
lane1
directory. The dataset is the file/nesi/project/nesi02659/obss_2021/resources/gbs/lane1.fastq.gz
(hint:cp /path/to/what/you/want/to/copy /destination/you/want/it/to/go
)Solution
$ cp /nesi/project/nesi02659/obss_2021/resources/gbs/lane1.fastq.gz lane1/
Examining the data
Let’s have a closer look at this data. Over the last couple of days, you learnt to run FastQC to evaluate the quality of the data. Run it on these files. Load the module first and then run FastQC over the gzipped file. We will help you out with these commands, but bonus question as you work through these: What do the commands module spider
, module load
, module purge
do?
$ module purge
$ module spider fastqc
$ module load FastQC
$ cd lane1
$ fastqc lane1.fastq.gz
Explanations of this code:
module purge
get rids of any pre-existing potentially conflicting modules.module spider
searches for modules e.g.module spider fastqc
looks for a module called fastqc (or something similar!). Once we know what this module is actually called (note: almost everything we do on terminal is case-sensitive) we can usemodule load
to load the module. Finally, we ranfastqc
.
You just generated a FastQC report. Use the Jupyter hub navigator tool to follow the path to your current folder (hint: If you’re not quite sure where you are, use pwd
in your terminal window. Also, if obss_2021
doesn’t show up in the menu on the left, you might need to also click the littler folder icon just above Name
). Once you’ve navigated to the correct folder, you can then double click on a fastqc .html
report.
Exercise
What is this odd thing in the per base sequence content from base 7 to 12-13?
Solution
It is the enzyme cutsite. It is identical on all the fragments that were correctly digested by the enzyme before being ligated to a restriction enzyme.
You probably noticed that not all of the data is high quality (especially if you check the ‘Per base sequence quality’ tab!). In general with next generation sequencing, you want to remove the lowest quality sequences and remaining adapters from your data set before you proceed to do anything. However, the stringency of the filtering will depend on the final application. In general, higher stringency is needed for de novo assemblies as compared to alignments to a reference genome. However, low quality data can affect downstream analysis for both de novo and reference-based approaches, producing false positives, such as errant SNP predictions.
We will use the Stacks’s program process_radtags to remove low quality sequences (also known as cleaning data) and to demultiplex our samples. Here is the Stacks manual as well as the specific manual page for process_radtags on the Stacks website to find information and examples.
Exercise
Get back into your
gbs
folder by runningcd ../
in the terminal (hint: if you are lost usepwd
to check where you are).It is time to load the
Stacks
module to be able to access theprocess_radtags
command. Find the module and load it (hint: Do for Stacks what we did above for FastQC).You will need to specify the set of barcodes used in the construction of the RAD library. Each P1 adaptor in RAD read starts with a particular DNA barcode that gets sequenced first, allowing data to be associated with individual samples. To save you some time, the barcode file is at:
/nesi/project/nesi02659/obss_2021/resources/gbs/lane1_barcodes.txt
Copy it to thegbs/
where you currently are. Have a look at the barcodes.Based on the barcode file, can you check how many samples were multiplexed together in this RAD library
- Have a look at the help online to prepare your
process_radtags
command. You will need to specify:
- the restriction enzyme used to construct the library (pstI)
- the directory of input files (the
lane1
directory)- the list of barcodes (
lane1_barcodes.txt
)- the output directory (
samples
)- finally, specify that process_radtags needs
clean, discard, and rescue reads
as options ofprocess_radtags
- You should now be able to run the
process_radtags
command from thegbs/
directory using these options. It will take a couple of minutes to run. Take a breath or think about what commands we’ve run through so far.Solution
### Load the module $ module load Stacks ### copy the barcodes $ cp /nesi/project/nesi02659/obss_2021/resources/gbs/lane1_barcodes.txt . ### count barcodes wc -l lane1_barcodes.txt ### run process_radtags $ process_radtags -p lane1/ -o ./samples/ -b lane1_barcodes.txt -e pstI -r -c -q
The process_radtags program will write a log file into the output directory. Have a look in there. Examine the log and answer the following questions:
- How many reads were retained?
- Of those discarded, what were the reasons?
- Use
ls -lh samples
to have a quick look at the size of the samples files, and make sure all files have data. Well done! Take a break, sit back or help your neighbour, we will be back shortly!
Key Points
Raw data should always be checked with FastQC.
Assigning reads to specific samples is called demultiplexing
process_radtags is the built in demultiplexing tools of Stacks and it includes some basic quality control