Workshop Overview

Overview

Teaching: 30 min
Exercises: 10 min

Questions

What is RNA-seq?

What data will we be using in this workshop?

Objectives

Understand the basics of RNA-seq, from sample collection through to data generation.

Learn a little bit about teh yeast data set being used in this session.

RNA-seq Workflow

Before beginning an RNA-Seq experiment, you should understand and carefully consider each step of the RNA-Seq workflow: Experimental design, Extraction, Library preparation, Sequencing, and Data analysis.

RNAseq workflow

Experimental design

The design stage of your experiment is arguably the most critical step in ensuring the success of an RNA-Seq experiment. Researchers must make key decisions at the start of any NGS project, including the type of assay and the number of samples to analyze. The optimal approach will depend largely on the objectives of the experiment, hypotheses to be tested, and expected information to be gathered.

Extraction

The first step in characterizing the transcriptome involves isolating and purifying cellular RNA. The quality and quantity of the input material have a significant impact on data quality; therefore, care must be taken when isolating and preparing RNA for sequencing. Given the chemical instability of RNA, there are two major reasons for RNA degradation during experiments:

RNA contains ribose sugar and is not stable in alkaline conditions because of the reactive hydroxyl bonds. RNA is also more prone to heat degradation than DNA.
Ribonucleases (RNases) are ubiquitous and very stable, so avoiding them is nearly impossible. It is essential to maintain an RNase-free environment by wearing sterile disposable gloves when handling reagents and RNA samples, employing RNase inhibitors, and using DEPC-treated water instead of PCR-grade water. Additionally, proper storage of RNA is crucial to avoid RNA degradation.

In the short term, RNA may be stored in RNase-free water or TE buffer at -80°C for 1 year without degradation. For the long term, RNA samples may be stored as ethanol precipitates at -20°C. Avoid repeated freeze-thaw cycles of samples, which can lead to degradation. RNA of high integrity will maximize the likelihood of obtaining reliable and informative results.

Library Preparation

This involves generating a collection of RNA fragments that are compatible for sequencing. The process involves enrichment of target (non-ribosomal) RNA, fragmentation, reverse transcription (i.e. cDNA synthesis), and addition of sequencing adapters and amplification. The enrichment method determines which types of transcripts (e.g. mRNA, lncRNA, miRNA) will be included in the library. In addition, the cDNA synthesis step can be performed in a such a way as to maintain the original strand orientation of the transcript, generating what is known as ‘strand-specific’ or ‘directional’ libraries.

Sequencing

Parameters for sequencing—such as read length, configuration, and output—depend on the goals of your project and will influence your choice of instrument and sequencing chemistry. The main NGS technologies can be grouped into two categories: short-read (or ‘second generation’) sequencing, and long-read (or ‘third generation’) sequencing. Both have distinct benefits for RNA-Seq.

Short-read sequencing is relatively inexpensive on a per-base basis and can generate billions of reads in a massively parallel manner, with single-end read lengths ranging between 50 and 300 bp. The high-throughput nature of this technology is ideal for quantifying the relative abundance of transcripts or identifying rare transcripts. Several platforms available on the market offer flexible outputs using roughly similar chemistry. Each cDNA fragment can be sequenced from only one end, called single-end (SE) sequencing, or both ends, called paired-end (PE) sequencing. The former is generally less expensive and faster than the latter. However, paired-end sequencing helps detect genomic rearrangements and repetitive sequence alignments better than the single- end configuration, since more information is collected from each fragment.
Long-read sequencing can resolve inaccessible regions of the genome and read through the entire length of RNA transcripts, allowing precise determination of specific isoforms. Two of the leading long-read sequencing platform providers include Pacific Biosciences (PacBio), and Oxford Nanopore Technologies®.

However, if cost reduction is paramount and/or high data output is required, short-read sequencing is a better choice.

Data Analysis

Evaluating your data quality and extracting biologically relevant information is the final and most rewarding step in an RNA-Seq experiment. It is important to discuss your project with an experienced bioinformatician to find the best analysis pipeline for your data. One pipeline does not fit all approaches.

Exercise

Discussion: what are some possible uses of RNA-seq?

Solution

Rank genes based on expression

Identify differentially expressed genes after inducing a drug

Identify novel transcripts

Identify bacterial and eukaryotic genes in a sample

Investigate alternative splicing and isoform usage

De novo transcriptome assembly

Others?

Introduction

This workshop uses the dataset from yeast RNA-seq experiment, Lee et al 2008

Yeast Dataset

Wild-type versus RNA degradation mutants
Subset of data (chromosome 1)
Six samples (3 WT / 3 MT)
Single end sequencing

Overview of Illumina Sequencing

Here is a video to illumina sequencing.

Key Points

RNA-seq is a commonly used technology for profiling the transcriptome.

There are a number of different applications for RNA-seq data - we’ll be looking at detecetign differentially expressed genes using data from an experiment involving RNA-seq data from yeast.

Quality control of the sequencing data.

Overview

Teaching: 30 min
Exercises: 0 min

Questions

What tools do we use to assess the data quality data in RNA-seq experiments?

Objectives

Learn how to assess data quality

Use the FastQC application to perform quality assessment

Use the MultiQC application to combine and view quality information

Several tools available to do quality assessemnt. For this workshop, we will use fastqc.

First, it is always good to verify where we are:

$ pwd
/home/[your_username]
# good I am ready to work

cd ~/obss_2021/
# copy the needed data into your directory
cp -r /nesi/projects/nesi02659/obss_2021/Resources/RNA_seq/ .

Checking to make sure we have the Raw files for the workshop.

$ ls

edna gbs genomic_dna intro_bash intro_r RNA_seq . ..

Creating a directory where to store the QC data:

$ cd RNA_seq

$ ls
Genome  Raw  rsmodules.sh  yeast_counts_all_chr.txt

$ mkdir QC

Since we are working on the NeSI HPC, we need to search and load the package before we start using it.

$ module spider fastqc

and then load

$ module purge
$ module load FastQC/0.11.9

hint : there is a file named rsmodules.sh which is a shell script to load the required modules at once. Running source ~/RNA_seq/rsmodules.sh command will excute it.

Now we can start the quality control:

$ fastqc -o QC/ Raw/*

You will see an automatically updating output message telling you the progress of the analysis. It will start like this:

Started analysis of SRR014335-chr1.fastq
Approx 5% complete for SRR014335-chr1.fastq
Approx 10% complete for SRR014335-chr1.fastq
Approx 15% complete for SRR014335-chr1.fastq
Approx 20% complete for SRR014335-chr1.fastq
Approx 25% complete for SRR014335-chr1.fastq
Approx 30% complete for SRR014335-chr1.fastq
Approx 35% complete for SRR014335-chr1.fastq

The FastQC program has created several new files within our RNA_seq/QC/ directory.

$ ls QC

SRR014335-chr1_fastqc.html  SRR014336-chr1_fastqc.zip   SRR014339-chr1_fastqc.html  SRR014340-chr1_fastqc.zip
SRR014335-chr1_fastqc.zip   SRR014337-chr1_fastqc.html  SRR014339-chr1_fastqc.zip   SRR014341-chr1_fastqc.html
SRR014336-chr1_fastqc.html  SRR014337-chr1_fastqc.zip   SRR014340-chr1_fastqc.html  SRR014341-chr1_fastqc.zip

Viewing the FastQC results

If we were working on our local computers, we’d be able to look at each of these HTML files by opening them in a web browser.

However, these files are currently sitting on our remote NeSI HPC, where our local computer can’t see them. And, since we are only logging into NeSI via the command line - it doesn’t have any web browser setup to display these files either.

So the easiest way to look at these webpage summary reports will be to transfer them to our local computers (i.e. your laptop).

To transfer a file from a remote server to our own machines, we will use scp.

First we will make a new directory on our computer to store the HTML files we’re transferring. Let’s put it on our desktop for now. Open a new tab in your terminal program (you can use the pull down menu at the top of your screen or the Cmd+t keyboard shortcut) and type:

$ mkdir -p ~/Desktop/fastqc_html 

$ scp -r [Your_UserName]p@login.mahuika.nesi.org.nz:~/obss_2021/RNA_seq/QC/ ~/Desktop/fastqc_html

Working with the FastQC text output

Now that we’ve looked at our HTML reports to get a feel for the data, let’s look more closely at the other output files. Go back to the tab in your terminal program that is connected to NeSI and make sure you’re in our results subdirectory.

$ cd /home/fayfa80p/obss_2021/RNA_seq/QC

$ ls

SRR014335-chr1_fastqc.html  SRR014336-chr1_fastqc.zip   SRR014339-chr1_fastqc.html  SRR014340-chr1_fastqc.zip
SRR014335-chr1_fastqc.zip   SRR014337-chr1_fastqc.html  SRR014339-chr1_fastqc.zip   SRR014341-chr1_fastqc.html
SRR014336-chr1_fastqc.html  SRR014337-chr1_fastqc.zip   SRR014340-chr1_fastqc.html  SRR014341-chr1_fastqc.zip

Let’s unzip the files to look at the FastQC text file outputs.

$ for filename in *.zip
> do
> unzip $filename
> done

Inside each unzipped folder, there is a summary text which shows results of the statistical tests done by FastQC

$ ls SRR014335-chr1_fastqc

fastqc_data.txt  fastqc.fo  fastqc_report.html	Icons/	Images/  summary.txt

Use less to preview the summary.txt file

$ less SRR014335-chr1_fastqc/summary.txt

We can make a record of the results we obtained for all our samples by concatenating all of our summary.txt files into a single file using the cat command. We’ll call this fastqc_summaries.txt.

$ cat */summary.txt > ~/obss_2021/RNA_seq/QC/fastqc_summaries.txt 

Have a look at the fastqc_summaries.txt and search for any of the samples that have failed the QC statistical tests.

MultiQC - multi-sample analysis

The FastQC analysis is applied to each sample separately, and produces a report for each.
The application MultiQC provides a way to combine multiple sets of results (i.e., from MANY different software packages) across multipel samples.
To generate multqc results, run the following command in the directory with the output files you want to summarise (e.g., fastqc reports generated above):

$ module load MultiQC/1.9-gimkl-2020a-Python-3.8.2
$ cd ~/obss_2021/RNA_seq/
$ mkdir MultiQC
$ cd MultiQC
$ cp ../QC/* ./
$ multiqc .
$ ls -F

multiqc_data/  multiqc_report.html

The html report shows the MultiQC summary

Alt text

Key Points

Quality assessmet is a key initial step in the analysis of RNA-seq data

The FastQC and MultiQC applications are useful tools for quality assessent of RNA-seq data.

Trimming and Filtering reads

Overview

Teaching: 30 min
Exercises: 0 min

Questions

What steps are required for basic data cleaning in RNA-seq studies?

Objectives

Understand the various data cleaning steps for RNA-seq data.

Learn how to perfrorm adapter removal and quality trimming.

Cleaning Reads

In the previous section, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This doesn’t mean, though, that our samples should be thrown out! It’s very common to have some quality metrics fail, and this may or may not be a problem for your downstream application.

Adapter removal

“Adapters” are short DNA sequences that are added to each read as part of the sequencing process (we won’t get into “why” here).
These are removed as part of the data generation steps that occur during the sequencing run, but sometimes there is still a non-trivial amount of adapter sequence present in the FASTQ files.
Since the sequence is not part of the target genome (i.e., the genome if the species from which teh samples were derived) then we need to remove it to prevent it affecting the downstream analysis.
The FastQC application get detection adapter contamination in samples.

We will use a program called CutAdapt to filter poor quality reads and trim poor quality bases from our samples.

How to act on fastq after QC.

We can do several trimming:

on quality using Phred score. What will be the Phred score?
on the sequences, if they contain adaptor sequences.

To do so, we can use on tools: The cutadapt application is often used to remove adapter sequence from FASTQ files.

The following syntax will remove the adapter sequence AACCGGTT from the file SRR014335-chr1.fastq, create a new file called SRR014335-chr1_trimmed.fastq, and write a summary to the log file SRR014335-chr1.log:

$ pwd
/home/[Your_Username]/obss_2021/RNA_seq

$ mkdir Trimmed

$ module load cutadapt/2.10-gimkl-2020a-Python-3.8.2

$ cutadapt -q 20 -a AACCGGTT -o Trimmed/SRR014335-chr1_cutadapt.fastq Raw/SRR014335-chr1.fastq > Trimmed/SRR014335-chr1.log

We can have a look at the log file to see what cutadapt has done.

$ less Trimmed/SRR014335-chr1.log

Now we should trim all samples.

$ cd Raw

$ ls

SRR014335-chr1.fastq  SRR014336-chr1.fastq  SRR014337-chr1.fastq  SRR014339-chr1.fastq  SRR014340-chr1.fastq  SRR014341-chr1.fastq

$ for filename in *.fastq
> do base=$(basename ${filename} .fastq)
> cutadapt -q 20 -a AACCGGTT -o ../Trimmed/${base}.trimmed.fastq ${filename} > ../Trimmed/${base}.log
> done

MultiQC: `cutadapt` log files

If the log files from cutadapt are added to the directory containing the FastQC output, this information will also be incorporated into the MultiQC report the next time it is run.

$ cd ../MultiQC
 
$ cp ../Trimmed/*log .

$ multiqc .

Alt text

Key Points

Adapter removal and trimming (optional) are important steps in processign RNA-seq data.

Map and count reads

Overview

Teaching: 45 min
Exercises: 0 min

Questions

How do we turn fastq data into read counts?

Objectives

Align reads to reference genome.

Assess read mapping for each sample.

Generate per-sample count data for each genomic feature (e.g., genes, exons etc) that you are interested in.

Alignment to a reference genome

RNA-seq generate gene expression information by quantifying the number of transcripts (per gene) in a sample. This is acompished by counting the number of transcripts that have been sequenced - the more active a gene is, the more transcripts will be in a sample, and the more reads will be generated from that transcript.

For RNA-seq, we need to align or map each read back to the genome, to see which gene produced it.

Highly expressed genes will generate lots of transcripts, so there will be lots of reads that map back to the position of that transcript in the genome.
The per-gene data we work with in an RNA-seq experiment are counts: the number of reads from each sample that originated from that gene.

Preparation of the genome

To be able to map (align) sequencing reads on the genome, the genome needs to be indexed first. In this workshop we will use HISAT2. Note for speed reason, the reads will be aligned on the chr5 of the mouse genome.

$ cd /home/[Your_Username]/obss_2021/RNA_seq/Genome

#to list what is in your directory:
$ ls /home/[Your_Username]/obss_2021/RNA_seq/Genome

Saccharomyces_cerevisiae.R64-1-1.99.gtf  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa

$ module load HISAT2/2.2.0-gimkl-2020a

# index file:
$ hisat2-build -p 4 -f Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa Saccharomyces_cerevisiae.R64-1-1.dna.toplevel

#list what is in the directory:
$ ls /home/[Your_Username]/obss_2021/RNA_seq/Genome

Saccharomyces_cerevisiae.R64-1-1.99.gtf              Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.4.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.8.ht2
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.1.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.5.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.2.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.6.ht2
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.3.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.7.ht2

Option info:

-p number of threads
-f fasta file

How many files were created during the indexing process?

Alignment on the genome

Now that the genome is prepared. Sequencing reads can be aligned.

Information required:

Where the sequence information is stored (e.g. fastq files …) ?
What kind of sequencing: Single End or Paired end ?
Where are stored the indexes and the genome?
Where will the mapping files be stored?
Now, lets move one folder up (into the RNA_seq folder):

  
$ cd ..
  
$ ls

Genome  QC  Raw  Trimmed

Let’s map one of our sample to the genome

$ hisat2 -x Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U Raw/SRR014335-chr1.fastq -S SRR014335.sam

125090 reads; of these:
  125090 (100.00%) were unpaired; of these:
    20537 (16.42%) aligned 0 times
    85066 (68.00%) aligned exactly 1 time
    19487 (15.58%) aligned >1 times
83.58% overall alignment rate

Now we need to align all the rest of the samples.

$ pwd

/home/[Your_Username]/obss_2021/RNA_seq/

$ mkdir Mapping

$ ls

Genome  Mapping  QC  Raw  SRR014335.sam  Trimmed

let’s use a for loop to process our samples:

$ cd Trimmed

$ ls

SRR014335-chr1.fastq  SRR014336-chr1.fastq  SRR014337-chr1.fastq  SRR014339-chr1.fastq  SRR014340-chr1.fastq  SRR014341-chr1.fastq

$ for filename in *.trimmed.fastq
> do
> base=$(basename ${filename} .trimmed.fastq)
> hisat2 -p 4 -x ../Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U $filename -S ../Mapping/${base}.sam --summary-file ../Mapping/${base}_summary.txt
> done

Now we can explore our SAM files.

$ cd ../Mapping

$ ls

SRR014335-chr1.sam          SRR014336-chr1_summary.txt  SRR014339-chr1.sam          SRR014340-chr1_summary.txt
SRR014335-chr1_summary.txt  SRR014337-chr1.sam          SRR014339-chr1_summary.txt  SRR014341-chr1.sam
SRR014336-chr1.sam          SRR014337-chr1_summary.txt  SRR014340-chr1.sam          SRR014341-chr1_summary.txt

Converting SAM files to BAM files

The SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we do not have time to go into detail about the features of the SAM format, the paper by Heng Li et al. provides a lot more detail on the specification.

The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.

A quick look into the sam file

$ less SRR014335-chr1.sam

The file begins with a header, which is optional. The header is used to describe the source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is displayed below with the different fields highlighted.

We will convert the SAM file to BAM format using the samtools program with the view command and tell this command that the input is in SAM format (-S) and to output BAM format (-b):

$ module load SAMtools/1.10-GCC-9.2.0

$ for filename in *.sam
> do
> base=$(basename ${filename} .sam)
> samtools view -S -b ${filename} -o ${base}.bam
> done

$ ls

SRR014335-chr1.bam  SRR014336-chr1.bam  SRR014337-chr1.bam  SRR014339-chr1.bam  SRR014340-chr1.bam  SRR014341-chr1.bam
SRR014335-chr1.sam  SRR014336-chr1.sam  SRR014337-chr1.sam  SRR014339-chr1.sam  SRR014340-chr1.sam  SRR014341-chr1.sam

Sorting BAM files

Next we sort the BAM file using the sort command from samtools. -o tells the command where to write the output.

$ for filename in *.bam
> do
> base=$(basename ${filename} .bam)
> samtools sort -o ${base}_sorted.bam ${filename}
> done

SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.

You can use samtools to learn more about the bam file as well.

Some stats on your mapping:

$ samtools flagstat SRR014335-chr1_sorted.bam

156984 + 0 in total (QC-passed reads + QC-failed reads)
+ 0 secondary
+ 0 supplementary
+ 0 duplicates
136447 + 0 mapped (86.92% : N/A)
+ 0 paired in sequencing
+ 0 read1
+ 0 read2
+ 0 properly paired (N/A : N/A)
+ 0 with itself and mate mapped
+ 0 singletons (N/A : N/A)
+ 0 with mate mapped to a different chr
+ 0 with mate mapped to a different chr (mapQ>=5)

MultiQC: `HISAT2` output

The HISAT2 output data can also be incorporated into the MultiQC report the next time it is run.

 $ cd ~/obss_2021/RNA_seq/MultiQC
 
 $ cp ../Mapping/*summary* ./
 
 $ multiqc .

Alt text

Please note: running HISAT2 with either option --summary-file or older versions (< v2.1.0) gives log output identical to Bowtie2. These logs are indistinguishable and summary statistics will appear in MultiQC reports labelled as Bowtie2.

Read Summarization

Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome. This process is often called read summarization or read quantification. Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.

Counting

We need to do some counting!
Want to generate count data for each gene (actually each exon) - how many reads mapped to each exon in the genome, from each of our samples?
Once we have that information, we can start thinking about how to determine which genes were differentially expressed in our study.

Subread and FeatureCounts

The featureCounts tool from the Subread package can be used to count how many reads aligned to each genome feature (exon).
Need to specify the annotation informatyion (.gtf file) You can process all the samples at once:

$ module load Subread

$ pwd
/home/[Your_Username]/obss_2021/RNA_seq

$ mkdir Counts

$ cd Counts

$ featureCounts -a ../Genome/Saccharomyces_cerevisiae.R64-1-1.99.gtf -o ./yeast_counts.txt -T 2 -t exon -g gene_id ../Mapping/*sorted.bam

MultiQC: featureCounts output

If the samples are processed individually, the output data can be incorporated into the MultiQC report the next time it is run.

$ cd ../MultiQC

$ cp ../Counts/* .

$ multiqc .

Alt text

Since we now have all the count data in one file, we need to transfer it to our local computers so we could start working on RStudio to get differentially expressed genes.

And the code to do it is below, however for this workshop, we are going to use a different file (yeast_counts_all_chr.txt) that you can download from section 4. Differential_Expression. This file has all the data for all the chromosomes.

#In a new terminal that you can access you computer files, cd to the directory you want to save the counts file.

$ scp fayfa80p@login.mahuika.nesi.org.nz:/home/fayfa80p/RNA_seq/Counts/yeast_counts_all_chr.txt ./

Download with Jupyter

An alternative is to navigate to the directory ~/obss_2021/RNA_seq/Counts/ in the navigation panel of Jupyter then right-click on yeast_counts_all_chr.txt and select “Download”

Key Points

Alignment and feature counting are used to generate read counts for each genomic feature (e.g., genes) of interest, per sample.

The count data can then be used for stataistical analysis (e.g., to identify differentially expressed genes).

Differential Expression

Overview

Teaching: 75 min
Exercises: 0 min

Questions

How can we identify genes which exhibit changes in expresion between experimental conditions?

Objectives

Understand pre-processing steps required for standarising count data.

Use count data and the limma package to perform normalisation and identify differentially expressed genes.

Understand need for multiple comparisons correction, and be able to apply FWER and FDR correction methods.

Explore alternative methods (egdeR and DESeq2) for identifying differentially expressed genes.

RNA-seq: overview

Recap

In the last section we worked through the process of quality assessment and alignment for RNA-seq data
This typically takes place on the command line, but can also be done from within R.
The end result was the generation of count data (counts of reads aligned to each gene, per sample) using the FeatureCounts command from Subread/Rsubread.
Now that we’ve got count data in R, we can begin our differential expression analysis.

Data set reminder

Data obtained from yeast RNA-seq experiment, Lee et al 2008
Wild-type versus RNA degradation mutants
Six samples (3 WT / 3 MT)
NOTE: since we are dealing with an RNA degradation mutant yeast strain, we will see MAJOR transcriptomic differences between the wild-type and mutant groups (usually the differences you observe in a typical RNA-seq study wouldn’t be this extreme).

Getting organised

NOTE: skip to the next section (“Count Data”) if you are working within Jupyter on NeSI

Create an RStudio project

One of the first benefits we will take advantage of in RStudio is something called an RStudio Project. An RStudio project allows you to more easily:

Save data, files, variables, packages, etc. related to a specific analysis project

Restart work where you left off

Collaborate, especially if you are using version control such as git.

To create a project:

Open RStudio and go to the File menu, and click New Project.

In the window that opens select Existing Project, and browse to the RNA_seq folder.

Finally, click Create Project.

Save source from untitled to yeast_data.R and continue saving regularly as you work.

Lecture notes

In addition to the code below, we’ll also be going through some lecture notes that explain the various concepts being covered in this part of the workshop.

Lecture notes (PDF): Differential Expression

Count data

Note: I have now aligned the data for ALL CHROMOSOMES and generated counts, so we are working with data from all 7127 genes.

Let’s look at our data set and perform some basic checks before we do a differential expression analysis.

Working directory in Jupyter

Before launching your R notebook in Jupyter, navigate to ~/obss_2021/RNA_seq/ in the navigation panel. Ensure your working directory by using getwd() and if it isn’t ~/obss_2021/RNA_seq/, set it with setwd('~/obss_2021/RNA_seq').

library(dplyr)
fcData = read.table('yeast_counts_all_chr.txt', sep='\t', header=TRUE)
fcData %>% head()

    ##      Geneid Chr Start   End Strand Length ...STAR.SRR014335.Aligned.out.sam
    ## 1   YDL248W  IV  1802  2953      +   1152                                52
    ## 2 YDL247W-A  IV  3762  3836      +     75                                 0
    ## 3   YDL247W  IV  5985  7814      +   1830                                 2
    ## 4   YDL246C  IV  8683  9756      -   1074                                 0
    ## 5   YDL245C  IV 11657 13360      -   1704                                 0
    ## 6   YDL244W  IV 16204 17226      +   1023                                 6
    ##   ...STAR.SRR014336.Aligned.out.sam ...STAR.SRR014337.Aligned.out.sam
    ## 1                                46                                36
    ## 2                                 0                                 0
    ## 3                                 4                                 2
    ## 4                                 0                                 1
    ## 5                                 3                                 0
    ## 6                                 6                                 5
    ##   ...STAR.SRR014339.Aligned.out.sam ...STAR.SRR014340.Aligned.out.sam
    ## 1                                65                                70
    ## 2                                 0                                 1
    ## 3                                 6                                 8
    ## 4                                 1                                 2
    ## 5                                 5                                 7
    ## 6                                20                                30
    ##   ...STAR.SRR014341.Aligned.out.sam
    ## 1                                78
    ## 2                                 0
    ## 3                                 5
    ## 4                                 0
    ## 5                                 4
    ## 6                                19

Check dimensions:

dim(fcData)

    ## [1] 7127   12

names(fcData)

    ##  [1] "Geneid"                            "Chr"                              
    ##  [3] "Start"                             "End"                              
    ##  [5] "Strand"                            "Length"                           
    ##  [7] "...STAR.SRR014335.Aligned.out.sam" "...STAR.SRR014336.Aligned.out.sam"
    ##  [9] "...STAR.SRR014337.Aligned.out.sam" "...STAR.SRR014339.Aligned.out.sam"
    ## [11] "...STAR.SRR014340.Aligned.out.sam" "...STAR.SRR014341.Aligned.out.sam"

Rename data columns to reflect group membership

names(fcData)[7:12] = c("WT1", "WT2", "WT3", "MT1", "MT2", "MT3")
 
fcData %>% head()

    ##      Geneid Chr Start   End Strand Length WT1 WT2 WT3 MT1 MT2 MT3
    ## 1   YDL248W  IV  1802  2953      +   1152  52  46  36  65  70  78
    ## 2 YDL247W-A  IV  3762  3836      +     75   0   0   0   0   1   0
    ## 3   YDL247W  IV  5985  7814      +   1830   2   4   2   6   8   5
    ## 4   YDL246C  IV  8683  9756      -   1074   0   0   1   1   2   0
    ## 5   YDL245C  IV 11657 13360      -   1704   0   3   0   5   7   4
    ## 6   YDL244W  IV 16204 17226      +   1023   6   6   5  20  30  19

Extract count data

Remove annotation columns
Add row names

counts = fcData[, 7:12]
rownames(counts) = fcData$Geneid
counts %>% head()

    ##           WT1 WT2 WT3 MT1 MT2 MT3
    ## YDL248W    52  46  36  65  70  78
    ## YDL247W-A   0   0   0   0   1   0
    ## YDL247W     2   4   2   6   8   5
    ## YDL246C     0   0   1   1   2   0
    ## YDL245C     0   3   0   5   7   4
    ## YDL244W     6   6   5  20  30  19

Visualising the data

Data are highly skewed (suggests that logging might be useful):

boxplot(as.matrix(counts) ~ col(counts))

Boxplot of counts

Some genes have zero counts:

colSums(counts==0)

    ## WT1 WT2 WT3 MT1 MT2 MT3 
    ## 562 563 573 437 425 435

Log transformation (add 0.5 to avoid log(0) issues):

logCounts = log2(as.matrix(counts)+ 0.5)

Now we can see the per-sample distributions more clearly:

boxplot(as.matrix(logCounts) ~ col(logCounts))

"Boxplot of log(counts)"

Density plots are also a good way to visualise the data:

lineColour <- c("blue", "blue", "blue", "red", "red", "red")
lineColour

    ## [1] "blue" "blue" "blue" "red"  "red"  "red"

plot(density(logCounts[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCounts)) lines(density(logCounts[,i]), col=lineColour[i])

Density plot of counts

The boxplots and density plots show clear differences between the sample groups - are these biological, or experimental artifacts? (often we don’t know).

Remember: the wild-type and mutant yeast strains are VERY different. You wouldn’t normally see this amount of difference in the distributions between the groups.

Read counts per sample

Normalisation process (slightly different for each analysis method) takes “library size” (number of reads generated for each sample) into account.

A good blog about normalization: Link

colSums(counts)

    ##     WT1     WT2     WT3     MT1     MT2     MT3 
    ## 4915975 4892227 4778158 4618409 4719413 4554283

Visualise via bar plot

colSums(counts) %>% barplot(., ylab="Reads mapped per sample")

Bar plot of reads mapped per sample

Now we are ready for differential expression analysis

Detecting differential expression:

We are going to identify genes that are differential expressed using 3 different packages (time allowing) and compare the results.

But first, lets take a brief aside, and talk about the process of detecting genes that have undergone a statistically significance change in expression between the two groups.

OPEN UP THIS PDF DOCUMENT

Limma

Limma can be used for analysis, by transforming the RNA-seq count data in an appropriate way (log-scale normality-based assumption rather than Negative Binomial for count data)
Use data transformation and log to satisfy normality assumptions (CPM = Counts per Million).

library(limma)
library(edgeR)

dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=3)

options(width=100)
head(logCPM, 3)

    ##                  WT1        WT2        WT3        MT1        MT2        MT3
    ## YDL248W    3.7199528  3.5561232  3.2538405  3.6446399  3.7156488  3.9155366
    ## YDL247W-A -0.6765789 -0.6765789 -0.6765789 -0.6765789 -0.3140297 -0.6765789
    ## YDL247W    0.1484688  0.6727144  0.1645731  0.7843936  1.0395626  0.6349276

Aside: RPKM

We can use R to generate RPKM values (or FPKM if using paired-end reads):

Need gene length information to do this
Can get this fromgoseq package (we’ll use this later for our pathway analysis)

library(goseq)
geneLengths = getlength(rownames(counts), "sacCer2","ensGene")
rpkmData <- rpkm(dge, geneLengths)
rpkmData %>% round(., 2) %>% head()

    ##             WT1  WT2  WT3   MT1   MT2   MT3
    ## YDL248W   10.89 9.66 7.73 10.30 10.85 12.55
    ## YDL247W-A  0.00 0.00 0.00  0.00  2.35  0.00
    ## YDL247W    0.26 0.53 0.27  0.60  0.78  0.51
    ## YDL246C    0.00 0.00 0.23  0.17  0.33  0.00
    ## YDL245C    0.00 0.43 0.00  0.54  0.73  0.44
    ## YDL244W    1.41 1.42 1.21  3.57  5.24  3.44

range(rpkmData, na.rm=TRUE)

## [1]     0.00 69786.58

Compare RPKM to logCPM

par(mfrow=c(2,3))
for(i in 1:ncol(rpkmData)){
  plot(logCPM[,i], log2(rpkmData[,i]), pch='.', 
       xlab="logCPM", ylab="RPKM", main=colnames(logCPM)[i])
}

"RPKM vs log(CPM)"

We’re NOT going to use RPKM data here. I just wanted to show you how to calculate it, and how it relates to the logCPM data

Back to the analysis… (using logCPM)

What if we just did a t-test?

## The beeswarm package is great for making jittered dot plots
library(beeswarm)

# Specify "conditions" (groups: WT and MT)
conds <- c("WT","WT","WT","MT","MT","MT")

## Perform t-test for "gene number 6" (because I like that one...)
t.test(logCPM[6,] ~ conds)

    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  logCPM[6, ] by conds
    ## t = 7.0124, df = 2.3726, p-value = 0.01228
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  0.5840587 1.9006761
    ## sample estimates:
    ## mean in group MT mean in group WT 
    ##         2.244310         1.001943

beeswarm(logCPM[6,] ~ conds, pch=16, ylab="Expression (logCPM)", xlab="Group",
         main=paste0(rownames(logCPM)[6], ": MT vs WT"))

Beeswarm plot WT vs MT

we basically want to do this sort of analysis, for every gene
we’ll use a slightly more sophisticated approach though

However, before we get to statistical testing, we (might) first need to do a little bit more normalisation.

Limma: voom

The “voom” function estimates relationship between the mean and the variance of the logCPM data, normalises the data, and creates “precision weights” for each observation that are incorporated into the limma analysis.

design <- model.matrix(~conds)
design

    ##   (Intercept) condsWT
    ## 1           1       1
    ## 2           1       1
    ## 3           1       1
    ## 4           1       0
    ## 5           1       0
    ## 6           1       0
    ## attr(,"assign")
    ## [1] 0 1
    ## attr(,"contrasts")
    ## attr(,"contrasts")$conds
    ## [1] "contr.treatment"

v <- voom(dge, design, plot=TRUE)

Voom: mean-variance trend

Voom: impact on samples

Mainly affects the low-end (low abundance genes)

par(mfrow=c(2,3))
for(i in 1:ncol(logCPM)){
  plot(logCPM[,i], v$E[,i], xlab="LogCPM",
       ylab="Voom",main=colnames(logCPM)[i])
  abline(0,1)
}

Hasn’t removed the differences between the groups

boxplot(v$E ~ col(v$E))

lineColour <- ifelse(conds=="MT", "red", "blue")
lineColour

## [1] "blue" "blue" "blue" "red"  "red"  "red"

plot(density(v$E[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCPM)) lines(density(v$E[,i]), col=lineColour[i])

We can deal with this via “quantile normalisation”. Specify

normalize="quantile"

in the voom function.

q <- voom(dge, design, plot=TRUE, normalize="quantile")

Quantile normalisation forces all of the distributions to be the same.

it is a strong assumption, and can have a major impact on the analysis
potentially removing biological differences between groups.

par(mfrow=c(2,3))
for(i in 1:ncol(logCPM)){
  plot(logCPM[,i], q$E[,i], xlab="LogCPM",
       ylab="Voom (+quantile norm)",main=colnames(logCPM)[i])
  abline(0,1)
}

boxplot(q$E ~ col(q$E))

lineColour <- ifelse(conds=="MT", "red", "blue")
lineColour

## [1] "blue" "blue" "blue" "red"  "red"  "red"

plot(density(q$E[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCPM)) lines(density(q$E[,i]), col=lineColour[i])

Note: we’re NOT going to use the quantile normalised data here, but I wanted to show you how that method works.

Statistical testing: fitting a linear model

The next step is to fit a linear model to the transformed count data.

The lmFit() command does this for each gene (all at once).

fit <- lmFit(v, design)

The eBayes() function then performs Emprical Bays shrinkage estimation to adjust the variability estimates for each gene, and produce the moderated t-statistics.

fit <- eBayes(fit)

We then summarize this information using the topTable() function, which list the genes in order of the strength of statistical support for differential expression (i.e., genes with lowest p-values are at the top of the list):

tt <- topTable(fit, coef=ncol(design), n=nrow(counts))
head(tt)

##            logFC  AveExpr        t      P.Value    adj.P.Val        B
## YAL038W 2.313985 10.80214 319.5950 3.725452e-13 8.850433e-10 21.08951
## YOR161C 2.568389 10.80811 321.9628 3.574510e-13 8.850433e-10 21.08187
## YML128C 1.640664 11.40819 286.6167 6.857932e-13 9.775297e-10 20.84520
## YMR105C 2.772539  9.65092 331.8249 3.018547e-13 8.850433e-10 20.16815
## YHL021C 2.034496 10.17510 269.4034 9.702963e-13 1.152550e-09 20.07857
## YDR516C 2.085424 10.05426 260.8061 1.163655e-12 1.184767e-09 19.87217

Significant genes

Adjusted p-values less than 0.05:

sum(tt$adj.P.Val < 0.05)

## [1] 5140

Adjusted p-values less than 0.01:

sum(tt$adj.P.Val <= 0.01)

## [1] 4566

By default, limma uses FDR adjustment (but lets check):

sum(p.adjust(tt$P.Value, method="fdr") <= 0.01)

## [1] 4566

Volcano plots are a popular method for summarising the limma output:

volcanoplot(fit, coef=2)
abline(h=-log10(0.05))

Significantly DE genes:

sigGenes = which(tt$adj.P.Val <= 0.05)
length(sigGenes)

## [1] 5140

volcanoplot(fit, coef=2)
points(tt$logFC[sigGenes], -log10(tt$P.Value[sigGenes]), col='red', pch=16, cex=0.5)

Significantly DE genes above fold-change threshold:

sigGenes = which(tt$adj.P.Val <= 0.05 & (abs(tt$logFC) > log2(2)))
length(sigGenes)

## [1] 1891

volcanoplot(fit, coef=2)
points(tt$logFC[sigGenes], -log10(tt$P.Value[sigGenes]), col='red', pch=16, cex=0.5)

Get the rows of top table with significant adjusted p-values - we’ll save these for later to compare with the other methods.

limmaPadj <- tt[tt$adj.P.Val <= 0.01, ]

DESeq2

The DESeq2 package uses the Negative Binomial distribution to model the count data from each sample.
A statistical test based on the Negative Binomial distribution (via a generalized linear model, GLM) can be used to assess differential expression for each gene.
Use of the Negative Binomial distribution attempts to accurately capture the variation that is observed for count data.

More information about DESeq2: article by Love et al, 2014

Set up “DESeq object” for analysis:

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = as.matrix(counts), 
                              colData = data.frame(conds=factor(conds)), 
                              design = formula(~conds))

Can access the count data in teh dds object via the counts() function:

counts(dds) %>% head()

##           WT1 WT2 WT3 MT1 MT2 MT3
## YDL248W    52  46  36  65  70  78
## YDL247W-A   0   0   0   0   1   0
## YDL247W     2   4   2   6   8   5
## YDL246C     0   0   1   1   2   0
## YDL245C     0   3   0   5   7   4
## YDL244W     6   6   5  20  30  19

Fit DESeq model to identify DE transcripts

dds <- DESeq(dds)

Take a look at the results table

res <- DESeq2::results(dds)
knitr::kable(res[1:6,])

	baseMean	log2FoldChange	lfcSE	stat	pvalue	padj
YDL248W	56.2230316	-0.2339470	0.2278417	-1.0267961	0.3045165	0.3663639
YDL247W-A	0.1397714	-0.5255769	4.0804729	-0.1288030	0.8975136	0.9173926
YDL247W	4.2428211	-0.8100552	0.8332543	-0.9721584	0.3309718	0.3948536
YDL246C	0.6182409	-1.0326364	2.1560899	-0.4789394	0.6319817	0.6915148
YDL245C	2.8486580	-1.9781751	1.1758845	-1.6822869	0.0925132	0.1230028
YDL244W	13.0883354	-1.5823354	0.5128788	-3.0852037	0.0020341	0.0032860

Summary of differential gene expression

summary(res) 

## 
## out of 6830 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2520, 37%
## LFC < 0 (down)     : 2521, 37%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Note: p-value adjustment

The “padj” column of the DESeq2 results (res) contains adjusted p-values (FDR).
Can use the p.adjust function to manually adjust the DESeq2 p-values if needed (e.g., to use Holm correction)

# Remove rows with NAs
res = na.omit(res)

# Get the rows of "res" with significant adjusted p-values
resPadj<-res[res$padj <= 0.05 , ]

# Get dimensions
dim(resPadj)

## [1] 4811    6

Number of adjusted p-values less than 0.05

sum(res$padj <= 0.05)

## [1] 4811

Check that this is the same using p.adjust with FDR correction

sum(p.adjust(res$pvalue, method="fdr") <= 0.05)

## [1] 4811

Number of Holm adjusted p-values less than 0.05

sum(p.adjust(res$pvalue, method="holm") <= 0.01)

## [1] 3429

Sort summary list by p-value and save the table as a CSV file that can be read in Excel (or any other spreadhseet program).

res <- res[order(res$padj),]

write.csv(as.data.frame(res),file='deseq2.csv')

edgeR

The edgeR package also uses the negative binomial distribution to model the RNA-seq count data.
Takes an empirical Bayes approach to the statistical analysis, in a similar way to how the limma package handles RNA-seq data.

Identify DEGs with edgeR’s “Exact” Method

Load edgeR package

library(edgeR)

Construct DGEList object

y <- DGEList(counts=counts, group=conds)

Calculate library size (counts per sample)

y <- calcNormFactors(y)

Estimate common dispersion (overall variability)

y <- estimateCommonDisp(y)

Estimate tagwise dispersion (per gene variability)

y <- estimateTagwiseDisp(y)

edgeR analysis and output

Compute exact test for the negative binomial distribution.

et <- exactTest(y) 

knitr::kable(topTags(et, n=4)$table)

	logFC	logCPM
YCR077C	13.098599	6.629306
snR71	-9.870811	8.294459
snR59	-8.752555	10.105453
snR53	-8.503101	7.687908

Adjusted p-values

edge <- as.data.frame(topTags(et, n=nrow(counts))) 
sum(edge$FDR <= 0.05)

## [1] 5242

sum(p.adjust(edge$PValue, method="fdr") <= 0.05)

## [1] 5242

Get the rows of “edge” with significant adjusted p-values

edgePadj <- edge[edge$FDR <= 0.05, ]

DESeq2 vs edgeR

Generate Venn diagram to compare DESeq2 and edgeR results.

There is fairly good overlap:

library(gplots)
setlist <- list(edgeRexact=rownames(edgePadj), DESeq2=rownames(resPadj))
venn(setlist)

Limma vs edgeR vs DESeq2

Again, fairly good overlap across the three methods we’ve looked at:

setlist <- list(edgeRexact=rownames(edgePadj), 
                DESeq2=rownames(resPadj),
                LimmaVoom=rownames(limmaPadj))
venn(setlist)

Summary

Once we’ve generated count data, there are a number of ways to perform a differential expression analysis.
- DESeq2 and edgeR model the count data, and assume a Negative Binomial distribution
- Limma transforms (and logs) the data and assumes normality
Here we’ve seen that these three approaches give quite similar results.
The next step in a “standard” RNA-seq workflow is to perform “pathway analysis”.

Save limma topTable results for next session…

save(list='tt', file='topTable.RData')

Key Points

Statistical analysis is required to identify genes exhibiting altered expression between experimental conditions.

The limma processing pipeline is a fairly standard (and robust) way to do this.

DESeq2 and edgeR offer alternative methods for identifying differetially expressed genes.

Overrepresentation analysis (Gene Ontology)

Overview

Teaching: 60 min
Exercises: 0 min

Questions

How can we identify changes in biological mechanisms (e.g., pathways) in gene expression studies?

Objectives

Gain basic understanding of Gene Ontology.

Understand how over-representation analysis works for collections of genes.

Gain insight into the need for gene-length adjustment when performing over-representation analysis with RNA-seq data.

Use the GOseq methodology to identify gene-set changes based on Gene Ontology groups.

Lecture notes

In addition to the code below, we’ll also be going through some lecture notes that explain the various concepts being covered in this part of thw workshop.

Lecture notes (PDF): Annotation and Pathways

Overrepresentation analysis (Gene Ontology)

Over Representation Analysis (Boyle et al. 2004) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

Can perform overrepresentation analysis online (e.g., Enrichr, GeneSetDB, PantherDB), and also in R.
The basic principles are to:
- identify a collection of differentially expressed genes
- test to see if genes that are members of specific gene sets (e.g., Reactome pathways, Gene Ontology categories) are differentially expressed more often than would be expected by chance.

Some caveats for RNA-seq data

The gene-set analysis methods are applicable to transcriptomic data from both microarrays and RNA-seq.
One caveat, however, is that the results need to take gene length into account.
RNA-seq tends to produce higher expression levels (i.e., greater counts) for longer genes: a longer transcript implies more aligned fragments, and thus higher counts. This also gives these genes a great chance of being statistically differentially expressed.
Some gene sets (pathways, GO terms) tend to involve families of long genes: if long genes have a great chance of being detected as differentially expressed, then gene sets consisting of long genes will have a great chance of appeared to be enriched in the analysis.

GOseq

The GOseq methodology (Young et al., 2010) overcomes this issue by allowing the over-representation analysis to be adjusted for gene length.
modification to hypergeometric sampling probability
exact method (resampling) and an approximation-based method
More recent publications have also applied this gene-length correction to GSEA-based methods.
Still not widely understood to be an issue when performing RNA-seq pathway analysis, but REALLY important to take into account.

Results from the Young et al (2010) publication:

Proportion DE by gene length and reads

Gene set ranks by standard analysis

Gene set ranks by GOseq

GOseq analysis

Need to figure out if our organism is supported… (code is “sacCer”)

library(dplyr)
library(goseq)
supportedOrganisms() %>% head()

##          Genome         Id  Id Description Lengths in geneLeneDataBase
## 10      anoCar1    ensGene Ensembl gene ID                        TRUE
## 11      anoGam1    ensGene Ensembl gene ID                        TRUE
## 132     anoGam3                                                  FALSE
## 12      apiMel2    ensGene Ensembl gene ID                        TRUE
## 137 Arabidopsis                                                  FALSE
## 56      bosTau2 geneSymbol     Gene Symbol                        TRUE
##     GO Annotation Available
## 10                    FALSE
## 11                     TRUE
## 132                    TRUE
## 12                    FALSE
## 137                    TRUE
## 56                     TRUE

Easier to find if we use View() (NB - this only works in RStudio. Can’t use in Jupyter on NeSI).

supportedOrganisms() %>% View()

Define differentially expressed genes

Create a vector of 0’s and 1’s to denote whether or not genes are differentially expressed (limma analysis: topTable).
Add gene names to the vector so that GOSeq knows which gene each data point relates to.

Load our topTable results from last session:

load('topTable.RData')

# Note: If you want to use tt from DESeq, replace $adj.P.Val with $padj below

genes <- ifelse(tt$adj.P.Val < 0.05, 1, 0) 
names(genes) <- rownames(tt)
head(genes)

## YAL038W YOR161C YML128C YMR105C YHL021C YDR516C 
##       1       1       1       1       1       1

table(genes)

## genes
##    0    1 
## 1987 5140

Calculate gene weights

Put genes into length-based “bins”, and plot length vs proportion differentially expressed
Likely restricts to only those genes with GO annotation

pwf=nullp(genes, "sacCer1", "ensGene")

Inspect output

Report length (bias) and weight data per gene.

head(pwf)

    ##         DEgenes bias.data       pwf
    ## YAL038W       1      1504 0.8205505
    ## YOR161C       1      1621 0.8205505
    ## YML128C       1      1543 0.8205505
    ## YMR105C       1      1711 0.8205505
    ## YHL021C       1      1399 0.8205505
    ## YDR516C       1      1504 0.8205505

Gene lengths and weights

par(mfrow=c(1,2))
hist(pwf$bias.data,30)
hist(pwf$pwf,30)

Gene length vs average expression

Is there an association between gene length and expression level?

library(ggplot2)
data.frame(logGeneLength = log2(pwf$bias.data), 
           avgExpr = tt$AveExpr) %>% 
  ggplot(., aes(x=logGeneLength, y=avgExpr)) + 
  geom_point(size=0.2) + 
  geom_smooth(method='lm')

How about gene length and statistical evidence supporting differential expression?

(Kinda hard to see, but it is apparently there…)

data.frame(logGeneLength = log2(pwf$bias.data), 
           negLogAdjP = -log10(tt$adj.P.Val)) %>% 
  ggplot(., aes(x=logGeneLength, negLogAdjP)) + 
  geom_point(size=0.2) + 
  geom_smooth(method='lm')

Length correction in GOSeq

Uses “Wallenius approximation” to perform correction.
Essentially it is performing a weighted Fisher’s Exact Test, but each gene in the 2x2 data does not contribute equally to the per cell count - it instead contributes its weight (based on its length).
This means that a gene set (e.g., GO term) containing lots of significant short genes will be considered more likely to be enriched that a gene set with a similar proportion of long genes that are differentially expressed.

Run GOSeq with gene length correction:

GO.wall=goseq(pwf, "sacCer1", "ensGene")

Output: Wallenius method

head(GO.wall)

    ##        category over_represented_pvalue under_represented_pvalue numDEInCat
    ## 1325 GO:0005622            1.817597e-17                        1       4298
    ## 7912 GO:0110165            8.200109e-12                        1       4456
    ## 5245 GO:0042254            9.441146e-12                        1        364
    ## 3817 GO:0022613            1.355385e-11                        1        439
    ## 5446 GO:0043226            2.516156e-11                        1       3826
    ## 5449 GO:0043229            4.433252e-11                        1       3817
    ##      numInCat                                 term ontology
    ## 1325     5179                        intracellular       CC
    ## 7912     5417           cellular anatomical entity       CC
    ## 5245      395                  ribosome biogenesis       BP
    ## 3817      482 ribonucleoprotein complex biogenesis       BP
    ## 5446     4609                            organelle       CC
    ## 5449     4599              intracellular organelle       CC

P-value adjustment

GO.wall.padj <- p.adjust(GO.wall$over_represented_pvalue, method="fdr")
sum(GO.wall.padj < 0.05)

## [1] 45

GO.wall.sig <- GO.wall$category[GO.wall.padj < 0.05]
length(GO.wall.sig)

## [1] 45

head(GO.wall.sig)

    ## [1] "GO:0005622" "GO:0110165" "GO:0042254" "GO:0022613" "GO:0043226"
    ## [6] "GO:0043229"

Filtering by gene set length

We are usually not interested in pathways or ontology terms that involve large numbers of genes (they are often rather broad terms or mechanisms), so we can exclude these from the results (here we only include significant GO terms that involve less than 500 genes):

GO.wall[GO.wall.padj < 0.05, ] %>% filter(numInCat < 500)

    ##      category over_represented_pvalue under_represented_pvalue numDEInCat
    ## 1  GO:0042254            9.441146e-12                1.0000000        364
    ## 2  GO:0022613            1.355385e-11                1.0000000        439
    ## 3  GO:0030684            3.348423e-08                1.0000000        163
    ## 4  GO:0016072            3.873737e-08                1.0000000        274
    ## 5  GO:0006364            4.532117e-08                1.0000000        259
    ## 6  GO:0034470            2.776978e-07                1.0000000        360
    ## 7  GO:0030490            9.891093e-07                0.9999998        111
    ## 8  GO:0030687            1.502461e-06                1.0000000         62
    ## 9  GO:0042274            2.616777e-06                0.9999993        134
    ## 10 GO:0005730            3.753501e-06                1.0000000        264
    ## 11 GO:0003735            3.799990e-06                1.0000000        194
    ## 12 GO:0022626            4.824970e-06                0.9999984        140
    ## 13 GO:0044391            6.732064e-06                1.0000000        202
    ## 14 GO:0000462            1.221128e-05                0.9999977         99
    ## 15 GO:0005840            1.780243e-05                0.9999942        221
    ## 16 GO:0002181            2.156069e-05                0.9999913        169
    ## 17 GO:0034660            7.782098e-05                0.9999571        409
    ## 18 GO:0042273            8.604282e-05                0.9999745        117
    ##    numInCat
    ## 1       395
    ## 2       482
    ## 3       171
    ## 4       299
    ## 5       282
    ## 6       401
    ## 7       116
    ## 8        62
    ## 9       143
    ## 10      292
    ## 11      221
    ## 12      156
    ## 13      231
    ## 14      104
    ## 15      254
    ## 16      191
    ## 17      467
    ## 18      126
    ##                                                                                        term
    ## 1                                                                       ribosome biogenesis
    ## 2                                                      ribonucleoprotein complex biogenesis
    ## 3                                                                               preribosome
    ## 4                                                                    rRNA metabolic process
    ## 5                                                                           rRNA processing
    ## 6                                                                          ncRNA processing
    ## 7                                                                    maturation of SSU-rRNA
    ## 8                                                      preribosome, large subunit precursor
    ## 9                                                        ribosomal small subunit biogenesis
    ## 10                                                                                nucleolus
    ## 11                                                       structural constituent of ribosome
    ## 12                                                                       cytosolic ribosome
    ## 13                                                                        ribosomal subunit
    ## 14 maturation of SSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)
    ## 15                                                                                 ribosome
    ## 16                                                                  cytoplasmic translation
    ## 17                                                                  ncRNA metabolic process
    ## 18                                                       ribosomal large subunit biogenesis
    ##    ontology
    ## 1        BP
    ## 2        BP
    ## 3        CC
    ## 4        BP
    ## 5        BP
    ## 6        BP
    ## 7        BP
    ## 8        CC
    ## 9        BP
    ## 10       CC
    ## 11       MF
    ## 12       CC
    ## 13       CC
    ## 14       BP
    ## 15       CC
    ## 16       BP
    ## 17       BP
    ## 18       BP

GO terms

Can use the GO.db package to get more information about the significant gene sets.

library(GO.db)

GOTERM[[GO.wall.sig[1]]]

## GOID: GO:0005622
## Term: intracellular
## Ontology: CC
## Definition: The living contents of a cell; the matter contained within
##     (but not including) the plasma membrane, usually taken to exclude
##     large vacuoles and masses of secretory or ingested material. In
##     eukaryotes it includes the nucleus and cytoplasm.
## Synonym: internal to cell
## Synonym: protoplasm
## Synonym: nucleocytoplasm
## Synonym: protoplast

NB - the code below is demosntrating the difference between running with and without the gene length correction that GOseq implements. You wouldn’t usually work through this as part of a standard pathway analysis.

Run GOSeq without gene length correction

GO.nobias=goseq(pwf, "sacCer1", "ensGene", method="Hypergeometric")

Output: Hypergeomtric (Fisher) method

head(GO.nobias)

##        category over_represented_pvalue under_represented_pvalue numDEInCat
## 1325 GO:0005622            8.650971e-29                        1       4298
## 7912 GO:0110165            3.271100e-21                        1       4456
## 5446 GO:0043226            9.826546e-16                        1       3826
## 2650 GO:0009987            1.330459e-15                        1       4146
## 5449 GO:0043229            2.060920e-15                        1       3817
## 5245 GO:0042254            7.222008e-11                        1        364
##      numInCat                       term ontology
## 1325     5179              intracellular       CC
## 7912     5417 cellular anatomical entity       CC
## 5446     4609                  organelle       CC
## 2650     5023           cellular process       BP
## 5449     4599    intracellular organelle       CC
## 5245      395        ribosome biogenesis       BP

P-value adjustment

GO.nobias.padj <- p.adjust(GO.nobias$over_represented_pvalue, method="fdr")
sum(GO.nobias.padj < 0.05)

## [1] 44

GO.nobias.sig <- GO.nobias$category[GO.nobias.padj < 0.05]
length(GO.nobias.sig)

## [1] 44

head(GO.nobias.sig)

## [1] "GO:0005622" "GO:0110165" "GO:0043226" "GO:0009987" "GO:0043229"
## [6] "GO:0042254"

Compare with and without adjustment

library(gplots)
venn(list(GO.wall=GO.wall.sig, GO.nobias=GO.nobias.sig))

Extract out the different parts of the Venn diagram (yes, there are definitely better ways to do this).

## Only significant in Hypergeomtric analysis
onlySig.nobias <- setdiff(GO.nobias.sig, GO.wall.sig)

## Only significant in Wallenius analysis
onlySig.wall <- setdiff(GO.wall.sig, GO.nobias.sig)

## Significant in both
sig.wall.nobias <- intersect(GO.wall.sig, GO.nobias.sig)

Gene lengths and GO term membership

Can also extract gene length and GO membership information.

len=getlength(names(genes),"sacCer1","ensGene")

head(len)

## [1] 1504 1621 1543 1711 1399 1504

go = getgo(names(genes),"sacCer1","ensGene")

names(go) %>% head()

## [1] "YAL038W" "YOR161C" "YML128C" "YMR105C" "YHL021C" "YDR516C"

class(go)

## [1] "list"

head(go[[1]])

## [1] "GO:0006082" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0019752"
## [6] "GO:0032787"

head(go[[2]])

## [1] "GO:0006810" "GO:0008150" "GO:0009987" "GO:0051179" "GO:0051234"
## [6] "GO:0055085"

Getting fancy…

Figure out which genes are in the significant GO groups, and then gets their lengths.

lengths.onlySig.nobias <- list()

for(i in 1:length(onlySig.nobias)){
  inGo <- lapply(go, function(x)  onlySig.nobias[i] %in% x) %>% unlist()
  lengths.onlySig.nobias[[i]] <- len[inGo]
}

lengths.onlySig.wall <- list()

for(i in 1:length(onlySig.wall)){
  inGo <- lapply(go, function(x)  onlySig.wall[i] %in% x) %>% unlist()
  lengths.onlySig.wall[[i]] <- len[inGo]
}

Significant: Hypergeometric vs Wallenius

Only Hypergeometric (pink) vs only Wallenius (blue)
Hypergeometric method is findings GO terms containing longer genes.

cols <- rep(c("lightpink", "lightblue"), c(10,7))
boxplot(c(lengths.onlySig.nobias, lengths.onlySig.wall), col=cols)

All significant GO terms

lengths.sig.wall.nobias <- list()

for(i in 1:length(sig.wall.nobias)){
  inGo <- lapply(go, function(x)  sig.wall.nobias[i] %in% x) %>% unlist()
  lengths.sig.wall.nobias[[i]] <- len[inGo]
}

cols <- rep(c("lightpink", grey(0.7), "lightblue"), c(10,37,7))

avgLength <- lapply(c(lengths.onlySig.nobias, lengths.sig.wall.nobias, lengths.onlySig.wall),
                    median) %>% unlist()

oo <- order(avgLength, decreasing=TRUE)

boxplot(c(lengths.onlySig.nobias, lengths.sig.wall.nobias, lengths.onlySig.wall)[oo],
        col=cols[oo], ylab="Gene Length", xlab = "GO term")

Gene length versus P-value

avgLength.wall <- lapply(c(lengths.onlySig.wall, lengths.sig.wall.nobias), median)

avgLength.nobias <- lapply(c(lengths.onlySig.nobias, lengths.sig.wall.nobias), median)

cols <- rep(c("blue", "lightblue", "red","lightpink"),
            c(length(lengths.onlySig.wall), length(lengths.sig.wall.nobias),
              length(lengths.onlySig.nobias), length(lengths.sig.wall.nobias)))

plot(c(avgLength.wall, avgLength.nobias), 
     -log(c(GO.nobias.padj[GO.nobias.padj < 0.05], GO.wall.padj[GO.wall.padj < 0.05])), 
     col=cols, pch=16, xlab="Median Gene Length", ylab ="-log(FDR adj-pval)")
legend('topright', c("Only sig in NoBias", "Sig in both (nobias adjp)", 
                     "Sig in both (wal adjp)", "Only sig in Wall"), 
       fill=c("red", "pink", "lightblue", "blue"))

Summary

Once we’ve generated count data, there are a number of ways to perform a differential expression analysis.
DESeq2 and edgeR model the count data, and assume a Negative Binomial distribution
Limma transforms (and logs) the data and assumes normality
Here we’ve seen that these three approches give quite similar results.
For Gene Set analysis, gene length needs to be accounted for, since longer transcripts are more likely to be found to be differentially expressed.
- GOSeq adjusts for transcript length to take this into account.
- It is also possible to use GOSeq with other types of annotation (e.g., Reactome or KEGG pathways).

Key Points

Coordinated changes in groups of functionally related genes can tell us about the underlying biological mechanisms that changing between exprimental conditions.

The characteristics of RNA-seq experiments mean that gene-length correction is required, to avoid standard approaches to over-representation analysis givign erroneous results.

RNA-seq

Workshop Overview

Overview

RNA-seq Workflow

Experimental design

Extraction

Library Preparation

Sequencing

Data Analysis

Exercise

Solution

Introduction

Yeast Dataset

Overview of Illumina Sequencing

Key Points

Quality control of the sequencing data.

Overview

Viewing the FastQC results

Working with the FastQC text output

MultiQC - multi-sample analysis

Key Points

Trimming and Filtering reads

Overview

Cleaning Reads

Adapter removal

How to act on fastq after QC.

MultiQC: cutadapt log files

Key Points

Map and count reads

Overview

Alignment to a reference genome

Preparation of the genome

Alignment on the genome

Converting SAM files to BAM files

A quick look into the sam file

Sorting BAM files

Some stats on your mapping:

MultiQC: HISAT2 output

Read Summarization

Counting

Subread and FeatureCounts

MultiQC: featureCounts output

Download with Jupyter

Key Points

Differential Expression

Overview

RNA-seq: overview

Recap

Data set reminder

Getting organised

Create an RStudio project

Lecture notes

Count data

Working directory in Jupyter

Visualising the data

Read counts per sample

Detecting differential expression:

Limma

Aside: RPKM

Back to the analysis… (using logCPM)

Limma: voom

Voom: impact on samples

Statistical testing: fitting a linear model

Significant genes

DESeq2

edgeR

Identify DEGs with edgeR’s “Exact” Method

edgeR analysis and output

DESeq2 vs edgeR

Limma vs edgeR vs DESeq2

Summary

Save limma topTable results for next session…

Key Points

Overrepresentation analysis (Gene Ontology)

Overview

Lecture notes

Overrepresentation analysis (Gene Ontology)

GOseq

Results from the Young et al (2010) publication:

GOseq analysis

MultiQC: `cutadapt` log files

MultiQC: `HISAT2` output