Trimming and Filtering reads

Overview

Teaching: 30 min
Exercises: 0 min

Questions

What steps are required for basic data cleaning in RNA-seq studies?

Objectives

Understand the various data cleaning steps for RNA-seq data.

Learn how to perfrorm adapter removal and quality trimming.

Cleaning Reads

In the previous section, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This doesn’t mean, though, that our samples should be thrown out! It’s very common to have some quality metrics fail, and this may or may not be a problem for your downstream application.

Adapter removal

“Adapters” are short DNA sequences that are added to each read as part of the sequencing process (we won’t get into “why” here).
These are removed as part of the data generation steps that occur during the sequencing run, but sometimes there is still a non-trivial amount of adapter sequence present in the FASTQ files.
Since the sequence is not part of the target genome (i.e., the genome if the species from which teh samples were derived) then we need to remove it to prevent it affecting the downstream analysis.
The FastQC application get detection adapter contamination in samples.

We will use a program called CutAdapt to filter poor quality reads and trim poor quality bases from our samples.

How to act on fastq after QC.

We can do several trimming:

on quality using Phred score. What will be the Phred score?
on the sequences, if they contain adaptor sequences.

To do so, we can use on tools: The cutadapt application is often used to remove adapter sequence from FASTQ files.

The following syntax will remove the adapter sequence AACCGGTT from the file SRR014335-chr1.fastq, create a new file called SRR014335-chr1_trimmed.fastq, and write a summary to the log file SRR014335-chr1.log:

$ pwd
/home/[Your_Username]/obss_2021/RNA_seq

$ mkdir Trimmed

$ module load cutadapt/2.10-gimkl-2020a-Python-3.8.2

$ cutadapt -q 20 -a AACCGGTT -o Trimmed/SRR014335-chr1_cutadapt.fastq Raw/SRR014335-chr1.fastq > Trimmed/SRR014335-chr1.log

We can have a look at the log file to see what cutadapt has done.

$ less Trimmed/SRR014335-chr1.log

Now we should trim all samples.

$ cd Raw

$ ls

SRR014335-chr1.fastq  SRR014336-chr1.fastq  SRR014337-chr1.fastq  SRR014339-chr1.fastq  SRR014340-chr1.fastq  SRR014341-chr1.fastq

$ for filename in *.fastq
> do base=$(basename ${filename} .fastq)
> cutadapt -q 20 -a AACCGGTT -o ../Trimmed/${base}.trimmed.fastq ${filename} > ../Trimmed/${base}.log
> done

MultiQC: `cutadapt` log files

If the log files from cutadapt are added to the directory containing the FastQC output, this information will also be incorporated into the MultiQC report the next time it is run.

$ cd ../MultiQC
 
$ cp ../Trimmed/*log .

$ multiqc .

Alt text

Key Points

Adapter removal and trimming (optional) are important steps in processign RNA-seq data.

previous episode

RNA-seq

next episode

Trimming and Filtering reads

Overview

Cleaning Reads

Adapter removal

How to act on fastq after QC.

MultiQC: `cutadapt` log files

Key Points

previous episode

next episode

previous episode

RNA-seq

next episode

Trimming and Filtering reads

Overview

Cleaning Reads

Adapter removal

How to act on fastq after QC.

MultiQC: cutadapt log files

Key Points

previous episode

next episode

MultiQC: `cutadapt` log files