This lesson is still being designed and assembled (Pre-Alpha version)

RNA-seq

Workshop Overview

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • What is RNA-seq?

  • What data will we be using in this workshop?

Objectives
  • Understand the basics of RNA-seq, from sample collection through to data generation.

  • Learn a little bit about teh yeast data set being used in this session.

RNA-seq Workflow

Before beginning an RNA-Seq experiment, you should understand and carefully consider each step of the RNA-Seq workflow: Experimental design, Extraction, Library preparation, Sequencing, and Data analysis.

RNAseq workflow

Experimental design

The design stage of your experiment is arguably the most critical step in ensuring the success of an RNA-Seq experiment. Researchers must make key decisions at the start of any NGS project, including the type of assay and the number of samples to analyze. The optimal approach will depend largely on the objectives of the experiment, hypotheses to be tested, and expected information to be gathered.

Extraction

The first step in characterizing the transcriptome involves isolating and purifying cellular RNA. The quality and quantity of the input material have a significant impact on data quality; therefore, care must be taken when isolating and preparing RNA for sequencing. Given the chemical instability of RNA, there are two major reasons for RNA degradation during experiments:

In the short term, RNA may be stored in RNase-free water or TE buffer at -80°C for 1 year without degradation. For the long term, RNA samples may be stored as ethanol precipitates at -20°C. Avoid repeated freeze-thaw cycles of samples, which can lead to degradation. RNA of high integrity will maximize the likelihood of obtaining reliable and informative results.

Library Preparation

This involves generating a collection of RNA fragments that are compatible for sequencing. The process involves enrichment of target (non-ribosomal) RNA, fragmentation, reverse transcription (i.e. cDNA synthesis), and addition of sequencing adapters and amplification. The enrichment method determines which types of transcripts (e.g. mRNA, lncRNA, miRNA) will be included in the library. In addition, the cDNA synthesis step can be performed in a such a way as to maintain the original strand orientation of the transcript, generating what is known as ‘strand-specific’ or ‘directional’ libraries.

Sequencing

Parameters for sequencing—such as read length, configuration, and output—depend on the goals of your project and will influence your choice of instrument and sequencing chemistry. The main NGS technologies can be grouped into two categories: short-read (or ‘second generation’) sequencing, and long-read (or ‘third generation’) sequencing. Both have distinct benefits for RNA-Seq.

However, if cost reduction is paramount and/or high data output is required, short-read sequencing is a better choice.

Data Analysis

Evaluating your data quality and extracting biologically relevant information is the final and most rewarding step in an RNA-Seq experiment. It is important to discuss your project with an experienced bioinformatician to find the best analysis pipeline for your data. One pipeline does not fit all approaches.

Exercise

Discussion: what are some possible uses of RNA-seq?

Solution

  • Rank genes based on expression
  • Identify differentially expressed genes after inducing a drug
  • Identify novel transcripts
  • Identify bacterial and eukaryotic genes in a sample
  • Investigate alternative splicing and isoform usage
  • De novo transcriptome assembly
  • Others?

Introduction

This workshop uses the dataset from yeast RNA-seq experiment, Lee et al 2008

Yeast Dataset

Overview of Illumina Sequencing

Here is a video to illumina sequencing.

Key Points

  • RNA-seq is a commonly used technology for profiling the transcriptome.

  • There are a number of different applications for RNA-seq data - we’ll be looking at detecetign differentially expressed genes using data from an experiment involving RNA-seq data from yeast.


Quality control of the sequencing data.

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What tools do we use to assess the data quality data in RNA-seq experiments?

Objectives
  • Learn how to assess data quality

  • Use the FastQC application to perform quality assessment

  • Use the MultiQC application to combine and view quality information

Several tools available to do quality assessemnt. For this workshop, we will use fastqc.

First, it is always good to verify where we are:


$ pwd
/home/[your_username]
# good I am ready to work

cd ~/obss_2021/
# copy the needed data into your directory
cp -r /nesi/projects/nesi02659/obss_2021/Resources/RNA_seq/ .

Checking to make sure we have the Raw files for the workshop.

$ ls
edna gbs genomic_dna intro_bash intro_r RNA_seq . ..

Creating a directory where to store the QC data:

$ cd RNA_seq
$ ls
Genome  Raw  rsmodules.sh  yeast_counts_all_chr.txt
$ mkdir QC

Since we are working on the NeSI HPC, we need to search and load the package before we start using it.

Search

$ module spider fastqc

and then load

$ module purge
$ module load FastQC/0.11.9

hint : there is a file named rsmodules.sh which is a shell script to load the required modules at once. Running source ~/RNA_seq/rsmodules.sh command will excute it.

Now we can start the quality control:

$ fastqc -o QC/ Raw/*

You will see an automatically updating output message telling you the progress of the analysis. It will start like this:

Started analysis of SRR014335-chr1.fastq
Approx 5% complete for SRR014335-chr1.fastq
Approx 10% complete for SRR014335-chr1.fastq
Approx 15% complete for SRR014335-chr1.fastq
Approx 20% complete for SRR014335-chr1.fastq
Approx 25% complete for SRR014335-chr1.fastq
Approx 30% complete for SRR014335-chr1.fastq
Approx 35% complete for SRR014335-chr1.fastq

The FastQC program has created several new files within our RNA_seq/QC/ directory.

$ ls QC
SRR014335-chr1_fastqc.html  SRR014336-chr1_fastqc.zip   SRR014339-chr1_fastqc.html  SRR014340-chr1_fastqc.zip
SRR014335-chr1_fastqc.zip   SRR014337-chr1_fastqc.html  SRR014339-chr1_fastqc.zip   SRR014341-chr1_fastqc.html
SRR014336-chr1_fastqc.html  SRR014337-chr1_fastqc.zip   SRR014340-chr1_fastqc.html  SRR014341-chr1_fastqc.zip

Viewing the FastQC results

If we were working on our local computers, we’d be able to look at each of these HTML files by opening them in a web browser.

However, these files are currently sitting on our remote NeSI HPC, where our local computer can’t see them. And, since we are only logging into NeSI via the command line - it doesn’t have any web browser setup to display these files either.

So the easiest way to look at these webpage summary reports will be to transfer them to our local computers (i.e. your laptop).

To transfer a file from a remote server to our own machines, we will use scp.

First we will make a new directory on our computer to store the HTML files we’re transferring. Let’s put it on our desktop for now. Open a new tab in your terminal program (you can use the pull down menu at the top of your screen or the Cmd+t keyboard shortcut) and type:

$ mkdir -p ~/Desktop/fastqc_html 
$ scp -r [Your_UserName]p@login.mahuika.nesi.org.nz:~/obss_2021/RNA_seq/QC/ ~/Desktop/fastqc_html

Working with the FastQC text output

Now that we’ve looked at our HTML reports to get a feel for the data, let’s look more closely at the other output files. Go back to the tab in your terminal program that is connected to NeSI and make sure you’re in our results subdirectory.

$ cd /home/fayfa80p/obss_2021/RNA_seq/QC

$ ls
SRR014335-chr1_fastqc.html  SRR014336-chr1_fastqc.zip   SRR014339-chr1_fastqc.html  SRR014340-chr1_fastqc.zip
SRR014335-chr1_fastqc.zip   SRR014337-chr1_fastqc.html  SRR014339-chr1_fastqc.zip   SRR014341-chr1_fastqc.html
SRR014336-chr1_fastqc.html  SRR014337-chr1_fastqc.zip   SRR014340-chr1_fastqc.html  SRR014341-chr1_fastqc.zip

Let’s unzip the files to look at the FastQC text file outputs.

$ for filename in *.zip
> do
> unzip $filename
> done

Inside each unzipped folder, there is a summary text which shows results of the statistical tests done by FastQC

$ ls SRR014335-chr1_fastqc
fastqc_data.txt  fastqc.fo  fastqc_report.html	Icons/	Images/  summary.txt

Use less to preview the summary.txt file

$ less SRR014335-chr1_fastqc/summary.txt

We can make a record of the results we obtained for all our samples by concatenating all of our summary.txt files into a single file using the cat command. We’ll call this fastqc_summaries.txt.

$ cat */summary.txt > ~/obss_2021/RNA_seq/QC/fastqc_summaries.txt 

MultiQC - multi-sample analysis

$ module load MultiQC/1.9-gimkl-2020a-Python-3.8.2
$ cd ~/obss_2021/RNA_seq/
$ mkdir MultiQC
$ cd MultiQC
$ cp ../QC/* ./
$ multiqc .
$ ls -F
multiqc_data/  multiqc_report.html

The html report shows the MultiQC summary

Alt text

Key Points

  • Quality assessmet is a key initial step in the analysis of RNA-seq data

  • The FastQC and MultiQC applications are useful tools for quality assessent of RNA-seq data.


Trimming and Filtering reads

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • What steps are required for basic data cleaning in RNA-seq studies?

Objectives
  • Understand the various data cleaning steps for RNA-seq data.

  • Learn how to perfrorm adapter removal and quality trimming.

Cleaning Reads

In the previous section, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of read quality at each base across all reads in a sample and extracted information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. This doesn’t mean, though, that our samples should be thrown out! It’s very common to have some quality metrics fail, and this may or may not be a problem for your downstream application.

Adapter removal

We will use a program called CutAdapt to filter poor quality reads and trim poor quality bases from our samples.

How to act on fastq after QC.

We can do several trimming:

To do so, we can use on tools: The cutadapt application is often used to remove adapter sequence from FASTQ files.

$ pwd
/home/[Your_Username]/obss_2021/RNA_seq

$ mkdir Trimmed

$ module load cutadapt/2.10-gimkl-2020a-Python-3.8.2

$ cutadapt -q 20 -a AACCGGTT -o Trimmed/SRR014335-chr1_cutadapt.fastq Raw/SRR014335-chr1.fastq > Trimmed/SRR014335-chr1.log

We can have a look at the log file to see what cutadapt has done.

$ less Trimmed/SRR014335-chr1.log

Now we should trim all samples.

$ cd Raw

$ ls
SRR014335-chr1.fastq  SRR014336-chr1.fastq  SRR014337-chr1.fastq  SRR014339-chr1.fastq  SRR014340-chr1.fastq  SRR014341-chr1.fastq
$ for filename in *.fastq
> do base=$(basename ${filename} .fastq)
> cutadapt -q 20 -a AACCGGTT -o ../Trimmed/${base}.trimmed.fastq ${filename} > ../Trimmed/${base}.log
> done

MultiQC: cutadapt log files

$ cd ../MultiQC
 
$ cp ../Trimmed/*log .

$ multiqc .

Alt text

Key Points

  • Adapter removal and trimming (optional) are important steps in processign RNA-seq data.


Map and count reads

Overview

Teaching: 45 min
Exercises: 0 min
Questions
  • How do we turn fastq data into read counts?

Objectives
  • Align reads to reference genome.

  • Assess read mapping for each sample.

  • Generate per-sample count data for each genomic feature (e.g., genes, exons etc) that you are interested in.

Alignment to a reference genome

RNA-seq generate gene expression information by quantifying the number of transcripts (per gene) in a sample. This is acompished by counting the number of transcripts that have been sequenced - the more active a gene is, the more transcripts will be in a sample, and the more reads will be generated from that transcript.

For RNA-seq, we need to align or map each read back to the genome, to see which gene produced it.

Preparation of the genome

To be able to map (align) sequencing reads on the genome, the genome needs to be indexed first. In this workshop we will use HISAT2. Note for speed reason, the reads will be aligned on the chr5 of the mouse genome.

$ cd /home/[Your_Username]/obss_2021/RNA_seq/Genome

#to list what is in your directory:
$ ls /home/[Your_Username]/obss_2021/RNA_seq/Genome
Saccharomyces_cerevisiae.R64-1-1.99.gtf  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
$ module load HISAT2/2.2.0-gimkl-2020a

# index file:
$ hisat2-build -p 4 -f Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa Saccharomyces_cerevisiae.R64-1-1.dna.toplevel

#list what is in the directory:
$ ls /home/[Your_Username]/obss_2021/RNA_seq/Genome
Saccharomyces_cerevisiae.R64-1-1.99.gtf              Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.4.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.8.ht2
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.1.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.5.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.2.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.6.ht2
Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.3.ht2  Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.7.ht2

Option info:

How many files were created during the indexing process?

Alignment on the genome

Now that the genome is prepared. Sequencing reads can be aligned.

Information required:

  
$ cd ..
  
$ ls
Genome  QC  Raw  Trimmed

Let’s map one of our sample to the genome

$ hisat2 -x Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U Raw/SRR014335-chr1.fastq -S SRR014335.sam
125090 reads; of these:
  125090 (100.00%) were unpaired; of these:
    20537 (16.42%) aligned 0 times
    85066 (68.00%) aligned exactly 1 time
    19487 (15.58%) aligned >1 times
83.58% overall alignment rate

Now we need to align all the rest of the samples.

$ pwd
/home/[Your_Username]/obss_2021/RNA_seq/
$ mkdir Mapping

$ ls
Genome  Mapping  QC  Raw  SRR014335.sam  Trimmed

let’s use a for loop to process our samples:

$ cd Trimmed

$ ls
SRR014335-chr1.fastq  SRR014336-chr1.fastq  SRR014337-chr1.fastq  SRR014339-chr1.fastq  SRR014340-chr1.fastq  SRR014341-chr1.fastq
$ for filename in *.trimmed.fastq
> do
> base=$(basename ${filename} .trimmed.fastq)
> hisat2 -p 4 -x ../Genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel -U $filename -S ../Mapping/${base}.sam --summary-file ../Mapping/${base}_summary.txt
> done

Now we can explore our SAM files.

$ cd ../Mapping

$ ls
SRR014335-chr1.sam          SRR014336-chr1_summary.txt  SRR014339-chr1.sam          SRR014340-chr1_summary.txt
SRR014335-chr1_summary.txt  SRR014337-chr1.sam          SRR014339-chr1_summary.txt  SRR014341-chr1.sam
SRR014336-chr1.sam          SRR014337-chr1_summary.txt  SRR014340-chr1.sam          SRR014341-chr1_summary.txt

Converting SAM files to BAM files

The SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. While we do not have time to go into detail about the features of the SAM format, the paper by Heng Li et al. provides a lot more detail on the specification.

The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.

A quick look into the sam file

$ less SRR014335-chr1.sam 

The file begins with a header, which is optional. The header is used to describe the source of data, reference sequence, method of alignment, etc., this will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. Each alignment line has 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information. An example entry from a SAM file is displayed below with the different fields highlighted.

We will convert the SAM file to BAM format using the samtools program with the view command and tell this command that the input is in SAM format (-S) and to output BAM format (-b):

$ module load SAMtools/1.10-GCC-9.2.0

$ for filename in *.sam
> do
> base=$(basename ${filename} .sam)
> samtools view -S -b ${filename} -o ${base}.bam
> done

$ ls
SRR014335-chr1.bam  SRR014336-chr1.bam  SRR014337-chr1.bam  SRR014339-chr1.bam  SRR014340-chr1.bam  SRR014341-chr1.bam
SRR014335-chr1.sam  SRR014336-chr1.sam  SRR014337-chr1.sam  SRR014339-chr1.sam  SRR014340-chr1.sam  SRR014341-chr1.sam

Sorting BAM files

Next we sort the BAM file using the sort command from samtools. -o tells the command where to write the output.

$ for filename in *.bam
> do
> base=$(basename ${filename} .bam)
> samtools sort -o ${base}_sorted.bam ${filename}
> done

SAM/BAM files can be sorted in multiple ways, e.g. by location of alignment on the chromosome, by read name, etc. It is important to be aware that different alignment tools will output differently sorted SAM/BAM, and different downstream tools require differently sorted alignment files as input.

You can use samtools to learn more about the bam file as well.

Some stats on your mapping:

$ samtools flagstat SRR014335-chr1_sorted.bam 
156984 + 0 in total (QC-passed reads + QC-failed reads)
31894 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
136447 + 0 mapped (86.92% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

MultiQC: HISAT2 output

 $ cd ~/obss_2021/RNA_seq/MultiQC
 
 $ cp ../Mapping/*summary* ./
 
 $ multiqc .

Alt text

Please note: running HISAT2 with either option --summary-file or older versions (< v2.1.0) gives log output identical to Bowtie2. These logs are indistinguishable and summary statistics will appear in MultiQC reports labelled as Bowtie2.


Read Summarization

Sequencing reads often need to be assigned to genomic features of interest after they are mapped to the reference genome. This process is often called read summarization or read quantification. Read summarization is required by a number of downstream analyses such as gene expression analysis and histone modification analysis. The output of read summarization is a count table, in which the number of reads assigned to each feature in each library is recorded.

Counting

Subread and FeatureCounts

$ module load Subread

$ pwd
/home/[Your_Username]/obss_2021/RNA_seq

$ mkdir Counts

$ cd Counts

$ featureCounts -a ../Genome/Saccharomyces_cerevisiae.R64-1-1.99.gtf -o ./yeast_counts.txt -T 2 -t exon -g gene_id ../Mapping/*sorted.bam

MultiQC: featureCounts output

$ cd ../MultiQC

$ cp ../Counts/* .

$ multiqc .

Alt text


Since we now have all the count data in one file, we need to transfer it to our local computers so we could start working on RStudio to get differentially expressed genes.

#In a new terminal that you can access you computer files, cd to the directory you want to save the counts file.

$ scp fayfa80p@login.mahuika.nesi.org.nz:/home/fayfa80p/RNA_seq/Counts/yeast_counts_all_chr.txt ./

Download with Jupyter

An alternative is to navigate to the directory ~/obss_2021/RNA_seq/Counts/ in the navigation panel of Jupyter then right-click on yeast_counts_all_chr.txt and select “Download”

Key Points

  • Alignment and feature counting are used to generate read counts for each genomic feature (e.g., genes) of interest, per sample.

  • The count data can then be used for stataistical analysis (e.g., to identify differentially expressed genes).


Differential Expression

Overview

Teaching: 75 min
Exercises: 0 min
Questions
  • How can we identify genes which exhibit changes in expresion between experimental conditions?

Objectives
  • Understand pre-processing steps required for standarising count data.

  • Use count data and the limma package to perform normalisation and identify differentially expressed genes.

  • Understand need for multiple comparisons correction, and be able to apply FWER and FDR correction methods.

  • Explore alternative methods (egdeR and DESeq2) for identifying differentially expressed genes.

RNA-seq: overview

Recap

Data set reminder

Getting organised

NOTE: skip to the next section (“Count Data”) if you are working within Jupyter on NeSI

Create an RStudio project

One of the first benefits we will take advantage of in RStudio is something called an RStudio Project. An RStudio project allows you to more easily:

  • Save data, files, variables, packages, etc. related to a specific analysis project
  • Restart work where you left off
  • Collaborate, especially if you are using version control such as git.

To create a project:

  • Open RStudio and go to the File menu, and click New Project.
  • In the window that opens select Existing Project, and browse to the RNA_seq folder.
  • Finally, click Create Project.

Save source from untitled to yeast_data.R and continue saving regularly as you work.

Lecture notes

In addition to the code below, we’ll also be going through some lecture notes that explain the various concepts being covered in this part of the workshop.

Lecture notes (PDF): Differential Expression

Count data

Note: I have now aligned the data for ALL CHROMOSOMES and generated counts, so we are working with data from all 7127 genes.

Let’s look at our data set and perform some basic checks before we do a differential expression analysis.

Working directory in Jupyter

Before launching your R notebook in Jupyter, navigate to ~/obss_2021/RNA_seq/ in the navigation panel. Ensure your working directory by using getwd() and if it isn’t ~/obss_2021/RNA_seq/, set it with setwd('~/obss_2021/RNA_seq').

library(dplyr)
fcData = read.table('yeast_counts_all_chr.txt', sep='\t', header=TRUE)
fcData %>% head()
    ##      Geneid Chr Start   End Strand Length ...STAR.SRR014335.Aligned.out.sam
    ## 1   YDL248W  IV  1802  2953      +   1152                                52
    ## 2 YDL247W-A  IV  3762  3836      +     75                                 0
    ## 3   YDL247W  IV  5985  7814      +   1830                                 2
    ## 4   YDL246C  IV  8683  9756      -   1074                                 0
    ## 5   YDL245C  IV 11657 13360      -   1704                                 0
    ## 6   YDL244W  IV 16204 17226      +   1023                                 6
    ##   ...STAR.SRR014336.Aligned.out.sam ...STAR.SRR014337.Aligned.out.sam
    ## 1                                46                                36
    ## 2                                 0                                 0
    ## 3                                 4                                 2
    ## 4                                 0                                 1
    ## 5                                 3                                 0
    ## 6                                 6                                 5
    ##   ...STAR.SRR014339.Aligned.out.sam ...STAR.SRR014340.Aligned.out.sam
    ## 1                                65                                70
    ## 2                                 0                                 1
    ## 3                                 6                                 8
    ## 4                                 1                                 2
    ## 5                                 5                                 7
    ## 6                                20                                30
    ##   ...STAR.SRR014341.Aligned.out.sam
    ## 1                                78
    ## 2                                 0
    ## 3                                 5
    ## 4                                 0
    ## 5                                 4
    ## 6                                19

Check dimensions:

dim(fcData)
    ## [1] 7127   12
names(fcData)
    ##  [1] "Geneid"                            "Chr"                              
    ##  [3] "Start"                             "End"                              
    ##  [5] "Strand"                            "Length"                           
    ##  [7] "...STAR.SRR014335.Aligned.out.sam" "...STAR.SRR014336.Aligned.out.sam"
    ##  [9] "...STAR.SRR014337.Aligned.out.sam" "...STAR.SRR014339.Aligned.out.sam"
    ## [11] "...STAR.SRR014340.Aligned.out.sam" "...STAR.SRR014341.Aligned.out.sam"

Rename data columns to reflect group membership

names(fcData)[7:12] = c("WT1", "WT2", "WT3", "MT1", "MT2", "MT3")
 
fcData %>% head()
    ##      Geneid Chr Start   End Strand Length WT1 WT2 WT3 MT1 MT2 MT3
    ## 1   YDL248W  IV  1802  2953      +   1152  52  46  36  65  70  78
    ## 2 YDL247W-A  IV  3762  3836      +     75   0   0   0   0   1   0
    ## 3   YDL247W  IV  5985  7814      +   1830   2   4   2   6   8   5
    ## 4   YDL246C  IV  8683  9756      -   1074   0   0   1   1   2   0
    ## 5   YDL245C  IV 11657 13360      -   1704   0   3   0   5   7   4
    ## 6   YDL244W  IV 16204 17226      +   1023   6   6   5  20  30  19

Extract count data

counts = fcData[, 7:12]
rownames(counts) = fcData$Geneid
counts %>% head()
    ##           WT1 WT2 WT3 MT1 MT2 MT3
    ## YDL248W    52  46  36  65  70  78
    ## YDL247W-A   0   0   0   0   1   0
    ## YDL247W     2   4   2   6   8   5
    ## YDL246C     0   0   1   1   2   0
    ## YDL245C     0   3   0   5   7   4
    ## YDL244W     6   6   5  20  30  19

Visualising the data

Data are highly skewed (suggests that logging might be useful):

boxplot(as.matrix(counts) ~ col(counts))

Boxplot of counts

Some genes have zero counts:

colSums(counts==0)
    ## WT1 WT2 WT3 MT1 MT2 MT3 
    ## 562 563 573 437 425 435

Log transformation (add 0.5 to avoid log(0) issues):

logCounts = log2(as.matrix(counts)+ 0.5)

Now we can see the per-sample distributions more clearly:

boxplot(as.matrix(logCounts) ~ col(logCounts))

"Boxplot of log(counts)"

Density plots are also a good way to visualise the data:

lineColour <- c("blue", "blue", "blue", "red", "red", "red")
lineColour
    ## [1] "blue" "blue" "blue" "red"  "red"  "red"
plot(density(logCounts[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCounts)) lines(density(logCounts[,i]), col=lineColour[i])

Density plot of counts

The boxplots and density plots show clear differences between the sample groups - are these biological, or experimental artifacts? (often we don’t know).

Remember: the wild-type and mutant yeast strains are VERY different. You wouldn’t normally see this amount of difference in the distributions between the groups.

Read counts per sample

Normalisation process (slightly different for each analysis method) takes “library size” (number of reads generated for each sample) into account.

colSums(counts)
    ##     WT1     WT2     WT3     MT1     MT2     MT3 
    ## 4915975 4892227 4778158 4618409 4719413 4554283

Visualise via bar plot

colSums(counts) %>% barplot(., ylab="Reads mapped per sample")

Bar plot of reads mapped per sample

Now we are ready for differential expression analysis

Detecting differential expression:

We are going to identify genes that are differential expressed using 3 different packages (time allowing) and compare the results.

But first, lets take a brief aside, and talk about the process of detecting genes that have undergone a statistically significance change in expression between the two groups.

OPEN UP THIS PDF DOCUMENT


Limma

library(limma)
library(edgeR)

dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)
logCPM <- cpm(dge, log=TRUE, prior.count=3)

options(width=100)
head(logCPM, 3)
    ##                  WT1        WT2        WT3        MT1        MT2        MT3
    ## YDL248W    3.7199528  3.5561232  3.2538405  3.6446399  3.7156488  3.9155366
    ## YDL247W-A -0.6765789 -0.6765789 -0.6765789 -0.6765789 -0.3140297 -0.6765789
    ## YDL247W    0.1484688  0.6727144  0.1645731  0.7843936  1.0395626  0.6349276

Aside: RPKM

We can use R to generate RPKM values (or FPKM if using paired-end reads):

library(goseq)
geneLengths = getlength(rownames(counts), "sacCer2","ensGene")
rpkmData <- rpkm(dge, geneLengths)
rpkmData %>% round(., 2) %>% head()
    ##             WT1  WT2  WT3   MT1   MT2   MT3
    ## YDL248W   10.89 9.66 7.73 10.30 10.85 12.55
    ## YDL247W-A  0.00 0.00 0.00  0.00  2.35  0.00
    ## YDL247W    0.26 0.53 0.27  0.60  0.78  0.51
    ## YDL246C    0.00 0.00 0.23  0.17  0.33  0.00
    ## YDL245C    0.00 0.43 0.00  0.54  0.73  0.44
    ## YDL244W    1.41 1.42 1.21  3.57  5.24  3.44
range(rpkmData, na.rm=TRUE)
## [1]     0.00 69786.58

Compare RPKM to logCPM

par(mfrow=c(2,3))
for(i in 1:ncol(rpkmData)){
  plot(logCPM[,i], log2(rpkmData[,i]), pch='.', 
       xlab="logCPM", ylab="RPKM", main=colnames(logCPM)[i])
}

"RPKM vs log(CPM)"

We’re NOT going to use RPKM data here. I just wanted to show you how to calculate it, and how it relates to the logCPM data

Back to the analysis… (using logCPM)

What if we just did a t-test?

## The beeswarm package is great for making jittered dot plots
library(beeswarm)

# Specify "conditions" (groups: WT and MT)
conds <- c("WT","WT","WT","MT","MT","MT")

## Perform t-test for "gene number 6" (because I like that one...)
t.test(logCPM[6,] ~ conds)
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  logCPM[6, ] by conds
    ## t = 7.0124, df = 2.3726, p-value = 0.01228
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  0.5840587 1.9006761
    ## sample estimates:
    ## mean in group MT mean in group WT 
    ##         2.244310         1.001943
beeswarm(logCPM[6,] ~ conds, pch=16, ylab="Expression (logCPM)", xlab="Group",
         main=paste0(rownames(logCPM)[6], ": MT vs WT"))

Beeswarm plot WT vs MT

However, before we get to statistical testing, we (might) first need to do a little bit more normalisation.

Limma: voom

design <- model.matrix(~conds)
design
    ##   (Intercept) condsWT
    ## 1           1       1
    ## 2           1       1
    ## 3           1       1
    ## 4           1       0
    ## 5           1       0
    ## 6           1       0
    ## attr(,"assign")
    ## [1] 0 1
    ## attr(,"contrasts")
    ## attr(,"contrasts")$conds
    ## [1] "contr.treatment"
v <- voom(dge, design, plot=TRUE)

Voom: mean-variance trend

Voom: impact on samples

Mainly affects the low-end (low abundance genes)

par(mfrow=c(2,3))
for(i in 1:ncol(logCPM)){
  plot(logCPM[,i], v$E[,i], xlab="LogCPM",
       ylab="Voom",main=colnames(logCPM)[i])
  abline(0,1)
}

Hasn’t removed the differences between the groups

boxplot(v$E ~ col(v$E))

lineColour <- ifelse(conds=="MT", "red", "blue")
lineColour
## [1] "blue" "blue" "blue" "red"  "red"  "red"
plot(density(v$E[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCPM)) lines(density(v$E[,i]), col=lineColour[i])

We can deal with this via “quantile normalisation”. Specify

normalize="quantile"

in the voom function.

q <- voom(dge, design, plot=TRUE, normalize="quantile")

Quantile normalisation forces all of the distributions to be the same.

par(mfrow=c(2,3))
for(i in 1:ncol(logCPM)){
  plot(logCPM[,i], q$E[,i], xlab="LogCPM",
       ylab="Voom (+quantile norm)",main=colnames(logCPM)[i])
  abline(0,1)
}

boxplot(q$E ~ col(q$E))

lineColour <- ifelse(conds=="MT", "red", "blue")
lineColour
## [1] "blue" "blue" "blue" "red"  "red"  "red"
plot(density(q$E[,1]), ylim=c(0,0.3), col=lineColour[1])
for(i in 2:ncol(logCPM)) lines(density(q$E[,i]), col=lineColour[i])

Note: we’re NOT going to use the quantile normalised data here, but I wanted to show you how that method works.

Statistical testing: fitting a linear model

The next step is to fit a linear model to the transformed count data.

The lmFit() command does this for each gene (all at once).

fit <- lmFit(v, design)

The eBayes() function then performs Emprical Bays shrinkage estimation to adjust the variability estimates for each gene, and produce the moderated t-statistics.

fit <- eBayes(fit)

We then summarize this information using the topTable() function, which list the genes in order of the strength of statistical support for differential expression (i.e., genes with lowest p-values are at the top of the list):

tt <- topTable(fit, coef=ncol(design), n=nrow(counts))
head(tt)
##            logFC  AveExpr        t      P.Value    adj.P.Val        B
## YAL038W 2.313985 10.80214 319.5950 3.725452e-13 8.850433e-10 21.08951
## YOR161C 2.568389 10.80811 321.9628 3.574510e-13 8.850433e-10 21.08187
## YML128C 1.640664 11.40819 286.6167 6.857932e-13 9.775297e-10 20.84520
## YMR105C 2.772539  9.65092 331.8249 3.018547e-13 8.850433e-10 20.16815
## YHL021C 2.034496 10.17510 269.4034 9.702963e-13 1.152550e-09 20.07857
## YDR516C 2.085424 10.05426 260.8061 1.163655e-12 1.184767e-09 19.87217

Significant genes

Adjusted p-values less than 0.05:

sum(tt$adj.P.Val < 0.05)
## [1] 5140

Adjusted p-values less than 0.01:

sum(tt$adj.P.Val <= 0.01)
## [1] 4566

By default, limma uses FDR adjustment (but lets check):

sum(p.adjust(tt$P.Value, method="fdr") <= 0.01)
## [1] 4566

Volcano plots are a popular method for summarising the limma output:

volcanoplot(fit, coef=2)
abline(h=-log10(0.05))

Significantly DE genes:

sigGenes = which(tt$adj.P.Val <= 0.05)
length(sigGenes)
## [1] 5140
volcanoplot(fit, coef=2)
points(tt$logFC[sigGenes], -log10(tt$P.Value[sigGenes]), col='red', pch=16, cex=0.5)

Significantly DE genes above fold-change threshold:

sigGenes = which(tt$adj.P.Val <= 0.05 & (abs(tt$logFC) > log2(2)))
length(sigGenes)
## [1] 1891
volcanoplot(fit, coef=2)
points(tt$logFC[sigGenes], -log10(tt$P.Value[sigGenes]), col='red', pch=16, cex=0.5)

Get the rows of top table with significant adjusted p-values - we’ll save these for later to compare with the other methods.

limmaPadj <- tt[tt$adj.P.Val <= 0.01, ]

DESeq2

More information about DESeq2: article by Love et al, 2014

Set up “DESeq object” for analysis:

library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = as.matrix(counts), 
                              colData = data.frame(conds=factor(conds)), 
                              design = formula(~conds))

Can access the count data in teh dds object via the counts() function:

counts(dds) %>% head()
##           WT1 WT2 WT3 MT1 MT2 MT3
## YDL248W    52  46  36  65  70  78
## YDL247W-A   0   0   0   0   1   0
## YDL247W     2   4   2   6   8   5
## YDL246C     0   0   1   1   2   0
## YDL245C     0   3   0   5   7   4
## YDL244W     6   6   5  20  30  19

Fit DESeq model to identify DE transcripts

dds <- DESeq(dds)

Take a look at the results table

res <- DESeq2::results(dds)
knitr::kable(res[1:6,])
  baseMean log2FoldChange lfcSE stat pvalue padj
YDL248W 56.2230316 -0.2339470 0.2278417 -1.0267961 0.3045165 0.3663639
YDL247W-A 0.1397714 -0.5255769 4.0804729 -0.1288030 0.8975136 0.9173926
YDL247W 4.2428211 -0.8100552 0.8332543 -0.9721584 0.3309718 0.3948536
YDL246C 0.6182409 -1.0326364 2.1560899 -0.4789394 0.6319817 0.6915148
YDL245C 2.8486580 -1.9781751 1.1758845 -1.6822869 0.0925132 0.1230028
YDL244W 13.0883354 -1.5823354 0.5128788 -3.0852037 0.0020341 0.0032860

Summary of differential gene expression

summary(res) 
## 
## out of 6830 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2520, 37%
## LFC < 0 (down)     : 2521, 37%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

Note: p-value adjustment

# Remove rows with NAs
res = na.omit(res)

# Get the rows of "res" with significant adjusted p-values
resPadj<-res[res$padj <= 0.05 , ]

# Get dimensions
dim(resPadj)
## [1] 4811    6

Number of adjusted p-values less than 0.05

sum(res$padj <= 0.05)
## [1] 4811

Check that this is the same using p.adjust with FDR correction

sum(p.adjust(res$pvalue, method="fdr") <= 0.05)
## [1] 4811

Number of Holm adjusted p-values less than 0.05

sum(p.adjust(res$pvalue, method="holm") <= 0.01)
## [1] 3429

Sort summary list by p-value and save the table as a CSV file that can be read in Excel (or any other spreadhseet program).

res <- res[order(res$padj),]

write.csv(as.data.frame(res),file='deseq2.csv')

edgeR

Identify DEGs with edgeR’s “Exact” Method

Load edgeR package

library(edgeR)

Construct DGEList object

y <- DGEList(counts=counts, group=conds)

Calculate library size (counts per sample)

y <- calcNormFactors(y)

Estimate common dispersion (overall variability)

y <- estimateCommonDisp(y)

Estimate tagwise dispersion (per gene variability)

y <- estimateTagwiseDisp(y)

edgeR analysis and output

Compute exact test for the negative binomial distribution.

et <- exactTest(y) 

knitr::kable(topTags(et, n=4)$table)
  logFC logCPM PValue FDR
YCR077C 13.098599 6.629306 0 0
snR71 -9.870811 8.294459 0 0
snR59 -8.752555 10.105453 0 0
snR53 -8.503101 7.687908 0 0

Adjusted p-values

edge <- as.data.frame(topTags(et, n=nrow(counts))) 
sum(edge$FDR <= 0.05)
## [1] 5242
sum(p.adjust(edge$PValue, method="fdr") <= 0.05)
## [1] 5242

Get the rows of “edge” with significant adjusted p-values

edgePadj <- edge[edge$FDR <= 0.05, ]

DESeq2 vs edgeR

Generate Venn diagram to compare DESeq2 and edgeR results.

There is fairly good overlap:

library(gplots)
setlist <- list(edgeRexact=rownames(edgePadj), DESeq2=rownames(resPadj))
venn(setlist)


Limma vs edgeR vs DESeq2

Again, fairly good overlap across the three methods we’ve looked at:

setlist <- list(edgeRexact=rownames(edgePadj), 
                DESeq2=rownames(resPadj),
                LimmaVoom=rownames(limmaPadj))
venn(setlist)

Summary

Save limma topTable results for next session…

save(list='tt', file='topTable.RData')

Key Points

  • Statistical analysis is required to identify genes exhibiting altered expression between experimental conditions.

  • The limma processing pipeline is a fairly standard (and robust) way to do this.

  • DESeq2 and edgeR offer alternative methods for identifying differetially expressed genes.


Overrepresentation analysis (Gene Ontology)

Overview

Teaching: 60 min
Exercises: 0 min
Questions
  • How can we identify changes in biological mechanisms (e.g., pathways) in gene expression studies?

Objectives
  • Gain basic understanding of Gene Ontology.

  • Understand how over-representation analysis works for collections of genes.

  • Gain insight into the need for gene-length adjustment when performing over-representation analysis with RNA-seq data.

  • Use the GOseq methodology to identify gene-set changes based on Gene Ontology groups.

Lecture notes

In addition to the code below, we’ll also be going through some lecture notes that explain the various concepts being covered in this part of thw workshop.

Lecture notes (PDF): Annotation and Pathways

Overrepresentation analysis (Gene Ontology)

Over Representation Analysis (Boyle et al. 2004) is a widely used approach to determine whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed genes (DEGs).

Some caveats for RNA-seq data

GOseq

Results from the Young et al (2010) publication:

Proportion DE by gene length and reads

Gene set ranks by standard analysis

Gene set ranks by GOseq

GOseq analysis

Need to figure out if our organism is supported… (code is “sacCer”)

library(dplyr)
library(goseq)
supportedOrganisms() %>% head()
##          Genome         Id  Id Description Lengths in geneLeneDataBase
## 10      anoCar1    ensGene Ensembl gene ID                        TRUE
## 11      anoGam1    ensGene Ensembl gene ID                        TRUE
## 132     anoGam3                                                  FALSE
## 12      apiMel2    ensGene Ensembl gene ID                        TRUE
## 137 Arabidopsis                                                  FALSE
## 56      bosTau2 geneSymbol     Gene Symbol                        TRUE
##     GO Annotation Available
## 10                    FALSE
## 11                     TRUE
## 132                    TRUE
## 12                    FALSE
## 137                    TRUE
## 56                     TRUE

Easier to find if we use View() (NB - this only works in RStudio. Can’t use in Jupyter on NeSI).

supportedOrganisms() %>% View()

Define differentially expressed genes

Load our topTable results from last session:

load('topTable.RData')
# Note: If you want to use tt from DESeq, replace $adj.P.Val with $padj below

genes <- ifelse(tt$adj.P.Val < 0.05, 1, 0) 
names(genes) <- rownames(tt)
head(genes)
## YAL038W YOR161C YML128C YMR105C YHL021C YDR516C 
##       1       1       1       1       1       1
table(genes)
## genes
##    0    1 
## 1987 5140

Calculate gene weights

pwf=nullp(genes, "sacCer1", "ensGene")

Inspect output

Report length (bias) and weight data per gene.

head(pwf)
    ##         DEgenes bias.data       pwf
    ## YAL038W       1      1504 0.8205505
    ## YOR161C       1      1621 0.8205505
    ## YML128C       1      1543 0.8205505
    ## YMR105C       1      1711 0.8205505
    ## YHL021C       1      1399 0.8205505
    ## YDR516C       1      1504 0.8205505

Gene lengths and weights

par(mfrow=c(1,2))
hist(pwf$bias.data,30)
hist(pwf$pwf,30)

Gene length vs average expression

Is there an association between gene length and expression level?

library(ggplot2)
data.frame(logGeneLength = log2(pwf$bias.data), 
           avgExpr = tt$AveExpr) %>% 
  ggplot(., aes(x=logGeneLength, y=avgExpr)) + 
  geom_point(size=0.2) + 
  geom_smooth(method='lm')

How about gene length and statistical evidence supporting differential expression?

(Kinda hard to see, but it is apparently there…)

data.frame(logGeneLength = log2(pwf$bias.data), 
           negLogAdjP = -log10(tt$adj.P.Val)) %>% 
  ggplot(., aes(x=logGeneLength, negLogAdjP)) + 
  geom_point(size=0.2) + 
  geom_smooth(method='lm')

Length correction in GOSeq

Run GOSeq with gene length correction:

GO.wall=goseq(pwf, "sacCer1", "ensGene")

Output: Wallenius method

head(GO.wall)
    ##        category over_represented_pvalue under_represented_pvalue numDEInCat
    ## 1325 GO:0005622            1.817597e-17                        1       4298
    ## 7912 GO:0110165            8.200109e-12                        1       4456
    ## 5245 GO:0042254            9.441146e-12                        1        364
    ## 3817 GO:0022613            1.355385e-11                        1        439
    ## 5446 GO:0043226            2.516156e-11                        1       3826
    ## 5449 GO:0043229            4.433252e-11                        1       3817
    ##      numInCat                                 term ontology
    ## 1325     5179                        intracellular       CC
    ## 7912     5417           cellular anatomical entity       CC
    ## 5245      395                  ribosome biogenesis       BP
    ## 3817      482 ribonucleoprotein complex biogenesis       BP
    ## 5446     4609                            organelle       CC
    ## 5449     4599              intracellular organelle       CC

P-value adjustment

GO.wall.padj <- p.adjust(GO.wall$over_represented_pvalue, method="fdr")
sum(GO.wall.padj < 0.05)
## [1] 45
GO.wall.sig <- GO.wall$category[GO.wall.padj < 0.05]
length(GO.wall.sig)
## [1] 45
head(GO.wall.sig)
    ## [1] "GO:0005622" "GO:0110165" "GO:0042254" "GO:0022613" "GO:0043226"
    ## [6] "GO:0043229"

Filtering by gene set length

We are usually not interested in pathways or ontology terms that involve large numbers of genes (they are often rather broad terms or mechanisms), so we can exclude these from the results (here we only include significant GO terms that involve less than 500 genes):

GO.wall[GO.wall.padj < 0.05, ] %>% filter(numInCat < 500)
    ##      category over_represented_pvalue under_represented_pvalue numDEInCat
    ## 1  GO:0042254            9.441146e-12                1.0000000        364
    ## 2  GO:0022613            1.355385e-11                1.0000000        439
    ## 3  GO:0030684            3.348423e-08                1.0000000        163
    ## 4  GO:0016072            3.873737e-08                1.0000000        274
    ## 5  GO:0006364            4.532117e-08                1.0000000        259
    ## 6  GO:0034470            2.776978e-07                1.0000000        360
    ## 7  GO:0030490            9.891093e-07                0.9999998        111
    ## 8  GO:0030687            1.502461e-06                1.0000000         62
    ## 9  GO:0042274            2.616777e-06                0.9999993        134
    ## 10 GO:0005730            3.753501e-06                1.0000000        264
    ## 11 GO:0003735            3.799990e-06                1.0000000        194
    ## 12 GO:0022626            4.824970e-06                0.9999984        140
    ## 13 GO:0044391            6.732064e-06                1.0000000        202
    ## 14 GO:0000462            1.221128e-05                0.9999977         99
    ## 15 GO:0005840            1.780243e-05                0.9999942        221
    ## 16 GO:0002181            2.156069e-05                0.9999913        169
    ## 17 GO:0034660            7.782098e-05                0.9999571        409
    ## 18 GO:0042273            8.604282e-05                0.9999745        117
    ##    numInCat
    ## 1       395
    ## 2       482
    ## 3       171
    ## 4       299
    ## 5       282
    ## 6       401
    ## 7       116
    ## 8        62
    ## 9       143
    ## 10      292
    ## 11      221
    ## 12      156
    ## 13      231
    ## 14      104
    ## 15      254
    ## 16      191
    ## 17      467
    ## 18      126
    ##                                                                                        term
    ## 1                                                                       ribosome biogenesis
    ## 2                                                      ribonucleoprotein complex biogenesis
    ## 3                                                                               preribosome
    ## 4                                                                    rRNA metabolic process
    ## 5                                                                           rRNA processing
    ## 6                                                                          ncRNA processing
    ## 7                                                                    maturation of SSU-rRNA
    ## 8                                                      preribosome, large subunit precursor
    ## 9                                                        ribosomal small subunit biogenesis
    ## 10                                                                                nucleolus
    ## 11                                                       structural constituent of ribosome
    ## 12                                                                       cytosolic ribosome
    ## 13                                                                        ribosomal subunit
    ## 14 maturation of SSU-rRNA from tricistronic rRNA transcript (SSU-rRNA, 5.8S rRNA, LSU-rRNA)
    ## 15                                                                                 ribosome
    ## 16                                                                  cytoplasmic translation
    ## 17                                                                  ncRNA metabolic process
    ## 18                                                       ribosomal large subunit biogenesis
    ##    ontology
    ## 1        BP
    ## 2        BP
    ## 3        CC
    ## 4        BP
    ## 5        BP
    ## 6        BP
    ## 7        BP
    ## 8        CC
    ## 9        BP
    ## 10       CC
    ## 11       MF
    ## 12       CC
    ## 13       CC
    ## 14       BP
    ## 15       CC
    ## 16       BP
    ## 17       BP
    ## 18       BP


GO terms

library(GO.db)

GOTERM[[GO.wall.sig[1]]]
## GOID: GO:0005622
## Term: intracellular
## Ontology: CC
## Definition: The living contents of a cell; the matter contained within
##     (but not including) the plasma membrane, usually taken to exclude
##     large vacuoles and masses of secretory or ingested material. In
##     eukaryotes it includes the nucleus and cytoplasm.
## Synonym: internal to cell
## Synonym: protoplasm
## Synonym: nucleocytoplasm
## Synonym: protoplast

NB - the code below is demosntrating the difference between running with and without the gene length correction that GOseq implements. You wouldn’t usually work through this as part of a standard pathway analysis.


Run GOSeq without gene length correction

GO.nobias=goseq(pwf, "sacCer1", "ensGene", method="Hypergeometric")

Output: Hypergeomtric (Fisher) method

head(GO.nobias)
##        category over_represented_pvalue under_represented_pvalue numDEInCat
## 1325 GO:0005622            8.650971e-29                        1       4298
## 7912 GO:0110165            3.271100e-21                        1       4456
## 5446 GO:0043226            9.826546e-16                        1       3826
## 2650 GO:0009987            1.330459e-15                        1       4146
## 5449 GO:0043229            2.060920e-15                        1       3817
## 5245 GO:0042254            7.222008e-11                        1        364
##      numInCat                       term ontology
## 1325     5179              intracellular       CC
## 7912     5417 cellular anatomical entity       CC
## 5446     4609                  organelle       CC
## 2650     5023           cellular process       BP
## 5449     4599    intracellular organelle       CC
## 5245      395        ribosome biogenesis       BP

P-value adjustment

GO.nobias.padj <- p.adjust(GO.nobias$over_represented_pvalue, method="fdr")
sum(GO.nobias.padj < 0.05)
## [1] 44
GO.nobias.sig <- GO.nobias$category[GO.nobias.padj < 0.05]
length(GO.nobias.sig)
## [1] 44
head(GO.nobias.sig)
## [1] "GO:0005622" "GO:0110165" "GO:0043226" "GO:0009987" "GO:0043229"
## [6] "GO:0042254"

Compare with and without adjustment

library(gplots)
venn(list(GO.wall=GO.wall.sig, GO.nobias=GO.nobias.sig))

Extract out the different parts of the Venn diagram (yes, there are definitely better ways to do this).

## Only significant in Hypergeomtric analysis
onlySig.nobias <- setdiff(GO.nobias.sig, GO.wall.sig)

## Only significant in Wallenius analysis
onlySig.wall <- setdiff(GO.wall.sig, GO.nobias.sig)

## Significant in both
sig.wall.nobias <- intersect(GO.wall.sig, GO.nobias.sig)

Gene lengths and GO term membership

Can also extract gene length and GO membership information.

len=getlength(names(genes),"sacCer1","ensGene")

head(len)
## [1] 1504 1621 1543 1711 1399 1504
go = getgo(names(genes),"sacCer1","ensGene")

names(go) %>% head()
## [1] "YAL038W" "YOR161C" "YML128C" "YMR105C" "YHL021C" "YDR516C"
class(go)
## [1] "list"
head(go[[1]])
## [1] "GO:0006082" "GO:0008150" "GO:0008152" "GO:0009987" "GO:0019752"
## [6] "GO:0032787"
head(go[[2]])
## [1] "GO:0006810" "GO:0008150" "GO:0009987" "GO:0051179" "GO:0051234"
## [6] "GO:0055085"

Getting fancy…

Figure out which genes are in the significant GO groups, and then gets their lengths.

lengths.onlySig.nobias <- list()

for(i in 1:length(onlySig.nobias)){
  inGo <- lapply(go, function(x)  onlySig.nobias[i] %in% x) %>% unlist()
  lengths.onlySig.nobias[[i]] <- len[inGo]
}

lengths.onlySig.wall <- list()

for(i in 1:length(onlySig.wall)){
  inGo <- lapply(go, function(x)  onlySig.wall[i] %in% x) %>% unlist()
  lengths.onlySig.wall[[i]] <- len[inGo]
}

Significant: Hypergeometric vs Wallenius

cols <- rep(c("lightpink", "lightblue"), c(10,7))
boxplot(c(lengths.onlySig.nobias, lengths.onlySig.wall), col=cols)

All significant GO terms

lengths.sig.wall.nobias <- list()

for(i in 1:length(sig.wall.nobias)){
  inGo <- lapply(go, function(x)  sig.wall.nobias[i] %in% x) %>% unlist()
  lengths.sig.wall.nobias[[i]] <- len[inGo]
}
cols <- rep(c("lightpink", grey(0.7), "lightblue"), c(10,37,7))

avgLength <- lapply(c(lengths.onlySig.nobias, lengths.sig.wall.nobias, lengths.onlySig.wall),
                    median) %>% unlist()

oo <- order(avgLength, decreasing=TRUE)
boxplot(c(lengths.onlySig.nobias, lengths.sig.wall.nobias, lengths.onlySig.wall)[oo],
        col=cols[oo], ylab="Gene Length", xlab = "GO term")

Gene length versus P-value

avgLength.wall <- lapply(c(lengths.onlySig.wall, lengths.sig.wall.nobias), median)

avgLength.nobias <- lapply(c(lengths.onlySig.nobias, lengths.sig.wall.nobias), median)

cols <- rep(c("blue", "lightblue", "red","lightpink"),
            c(length(lengths.onlySig.wall), length(lengths.sig.wall.nobias),
              length(lengths.onlySig.nobias), length(lengths.sig.wall.nobias)))

plot(c(avgLength.wall, avgLength.nobias), 
     -log(c(GO.nobias.padj[GO.nobias.padj < 0.05], GO.wall.padj[GO.wall.padj < 0.05])), 
     col=cols, pch=16, xlab="Median Gene Length", ylab ="-log(FDR adj-pval)")
legend('topright', c("Only sig in NoBias", "Sig in both (nobias adjp)", 
                     "Sig in both (wal adjp)", "Only sig in Wall"), 
       fill=c("red", "pink", "lightblue", "blue"))


Summary

Key Points

  • Coordinated changes in groups of functionally related genes can tell us about the underlying biological mechanisms that changing between exprimental conditions.

  • The characteristics of RNA-seq experiments mean that gene-length correction is required, to avoid standard approaches to over-representation analysis givign erroneous results.