Assembly with a reference genome
Overview
Teaching: 0 min
Exercises: 0 minQuestions
How do we identify genetic variants if we have a reference genome?
Objectives
Learn to call SNPs with a reference genome
Practice running SLURM scripts
Practice mapping reads to a reference genome
Developed by: Ludovic Dutoit, Alana Alexander
Adapted from: Julian Catchen, Nicolas Rochette
Introduction
Obtaining an assembly without a reference genome is easy and possible. However, having a reference genome allows us to avoid several issues. We do not have to make assumptions about the “best” value for the -M
parameter, and we reduce the risk of collapsing different loci together (“lumping”) or separating one “real” locus into several “erroneous loci” (“splitting”). Studies have demonstrated that having some kind of reference genome is the single best way you can improve GBS SNP calling (see for example Shafer et al. 2016). Using a related species genome might be good enough in several cases too (but it sometimes depends on your application).
The stacks pipeline for samples with a reference genome is ref_map.pl. It skips the creation of loci and the catalog steps as reads belong to the same stack/locus if they map to the same location of the reference genome. This means we don’t have to rely on assumptions derived from genetic distances (-M
and -n
of denovo_map.pl
) about whether reads belong to the same locus or not.
We don’t have a genome for our common bullies yet. Instead, we will map the reads to the catalog of loci that stacks as created for one of our denovo runs. You would never do that in the real world, but that gives us a chance to run the ref_map.pl pipeline.
Map your samples to the reference
Organise yourself
Get back to
gbs/
In the folder
gbs/
create an output folder for this analysis,refmap_output/
as well as the foldersamples_mapped
Copy the reference catalog below that we will use as a pretend reference genome to inside the
gbs/
folder you should currently be in.
/nesi/project/nesi02659/obss_2021/resources/gbs/reference_catalog.fa
Have a quick look inside the first few lines of reference_catalog.fa using
head -n 10 reference_catalog.fa
. It contains each loci identified in the denovo approach as a complete sequence.Solution
$ mkdir samples_mapped $ mkdir refmap_output $ cp /nesi/project/nesi02659/obss_2021/resources/gbs/reference_catalog.fa . $ head -n 10 reference_catalog.fa
It is about time to remind ourselves about a couple of mapping softwares. BWA is a piece of software that is commonly used for mapping. bwa mem
is the ideal algorithm to align our short reads to the pretend reference genome. We will also use Samtools, a suite of tools designed to interact with the sam/bam alignment format (i,e, sorting, merging, splitting, subsetting, it is all there).
Load the software
Identify the 2 modules you need for bwa and Samtools and load them. You need to index your reference genome. This is basically creating a map of your genome so that BWA can navigate it extra fast instead of reading it entirely for each read. Use bwa index YOURGENOME.fa for that.
Solution
$ module spider BWA $ module load BWA $ module spider samtools $ module load SAMtools $ bwa index reference_catalog.fa
Well done, we are now ready to do the mapping!
Mapping command
For each sample, we will now take the raw reads and map them to the reference genome, outputting a sorted .bam
file inside the folder samples_mapped
.
For a single sample, the command looks like this:
$ bwa mem -t 4 reference_catalog.fa samples/MYSAMPLE.fq.gz | samtools view -b | samtools sort --threads 4 > samples_mapped/MYSAMPLE.bam
Explanations:
bwa mem
uses 4 threads to align samples/MYSAMPLE.fq.gz to the reference catalog. The output is piped using the | symbol into the next command instead of being printed to the screen.samtools view
creates a bam file using-b
. That bam output is piped into thesamtools sort
command before finally being outputted as a file using>
into samples_mapped.
Now this is all well and good, but we don’t want to do it manually for each sample. The for
loop below is doing it for all samples by going through all the samples/*.fq.gz
files.
This is the chunkiest piece of code today. So no problem if you don’t soak it all in. If you are in a classroom right now, we’ll probably look at loops together. Otherwise, you could have a look at how they work here.
$ for filename in samples/*fq.gz
do base=$(basename ${filename} .fq.gz)
echo $base
bwa mem -t 4 reference_catalog.fa samples/${base}.fq.gz | samtools view -b | samtools sort --threads 4 > samples_mapped/${base}.bam
done
Explanations:
for
each filename in the folder ‘samples’ that ends with.fq.gz
, extract only the prefix of that filename:samples/PREFIX.fq.gz
with thebasename
function. Print the filename we are currently working with using echo. Use thebwa
+samtools
mapping explained above, using the base name to output a filePREFIX.bam
.
Well done, we only have ref_map.pl
to run now.
Run the ref_map pipeline
ref_map.pl
has fewer parameters since the mapping takes care of many of the steps from denovo_map.pl
, such as the creation of loci for each individual before a comparison of all loci across all individuals. ref_map.pl
uses the mapped files to identify the variable positions.
build your ref_map.pl command
Use the online help to build your
refmap.pl
command. you can also checkref_map.pl --help
.
• Unlike in the previous exercise, ask for 2 threads• Specify the path to the input folder
samples_mapped/
• Specify the path to the output folder
refmap_output/
• Specify the path to the population map. We will be able to use the same popmap as for the denovo analysis, since we are using the same samples.
• We only want to keep the loci that are found in 80% of individuals. This is done by passing specific arguments to the
populations
software inside Stacks.• Is your command ready? Run it briefly to check that it starts properly, once it does stop it using ctrl + c, we’ll run it as a job.
Solution
ref_map.pl -T 2 --samples samples_mapped/ -o refmap_output/ --popmap popmap.txt -r 0.8
Create the job file
• Time to put that into a job script. You can use the job script from the last exercise as a template. We will simply make a copy of it under a different name. In practice, we often end up reusing our own scripts. Let’s copy the
denovojob.sh
into a newrefmapjob.sh
:cp denovojob.sh refmapjob.sh
• Open
refmapjob.sh
using a text editor. Adjust the job name, adjust the number of threads/cpus to 2, adjust the running time to10min
, and the job output torefmap.log
instad ofdenovo.log
. Most importantly, replace thedenovo_map.pl
command with yourref_map.pl
command.• Save it, and run the job:
sbatch refmapjob.sh
That should take about 5 minutes. Remember you can use squeue -u <yourusername>
to check the status of the job. Once it is not there anymore, it has finished. This job will write out an output log called refmap_output/refmap.log
that you can check using less
.
Analyse your results
## Looking into the output
• Examine the output of the populations program in the fileref_map.log
inside yourrefmap_output
folder. (hint: use theless
command).• How many SNPs were identified?
• Why did
refmap.pl
run faster thandenovo_map.pl
?
Well done, you now know how to call SNPs with or without a reference genome. It is worth re-iterating that even a poor-quality reference genome should improve the quality of your SNP calling by avoiding lumping and splitting errors. But beware of some applications. Inbreeding analyses are one example of applications that are sensitive to the quality of your reference genome.
Only one thing left, the cool biology bits.
Key Points
Reference genomes, even of poor quality or from a related species are great for SNP identification
Reference-based SNP calling takes the guess work of distance between and within loci away by mapping reads to individual location within the genome