This lesson is still being designed and assembled (Pre-Alpha version)

Statistics and Graphing with R

Overview

Teaching: 10 min
Exercises: 50 min
Questions
  • How do I use ggplot and other programs with Phyloseq?

  • What statistical approaches do I use to explore my metabarcoding results?

Objectives
  • Create bar plots of taxonomy and use them to show differences across the metadata

  • Create distance ordinations and use them to make PCoA plots

  • Make subsets of the data to explore specific aspects of your results

In this lesson there are several examples for plotting the results of your analysis. As mentioned in the previous lesson, here are some links with additional graphing examples:

setwd(../plots)
# load the packages
library('phyloseq')
library('tibble')
library('ggplot2')
library('ape')
library('vegan')

physeq <- readRDS('fish_phyloseq.rds')
print(physeq)

Plotting taxonomy

Using the plot_bar function, we can plot the relative abundance of taxa across samples

plot_bar(physeq, fill='genus')

png

Plotting the taxonomy with the rarefied dataset can help to compare abundance across samples

plot_bar(physeq.rarefied, fill='genus')

png

You can collapse the frequency counts across metadata categories, such as location

plot_bar(physeq.rarefied, x='location', fill='genus')

png

Exercise: collapse counts around temperature

Try to collapse counts around another metadata category, like temperature. Then try filling at a different taxonomic level, like order or family

Solution

plot_bar(physeq.rarefied, x='temperature', fill='family')

Incorporate ggplot into your Phyloseq graph

Now we will try to add ggplot elements to the graph

Add ggplot facet_wrap tool to graph

Try using facet_wrap to break your plot into separate parts

Hint: First save your plot to a variable, then you can add to it

Solution

barplot1 <- plot_bar(physeq.rarefied, fill='order') 

barplot1 + facet_wrap(~location, scales="free_x",nrow=1)

png

Combine visuals to present more information

Now that you can use facet_wrap, try to show both location and temperature in a single graph.

Hint: Try to combine from the above two techniques

Solution

barplot2 <- plot_bar(physeq.rarefied, x="temperature", fill='order') +
facet_wrap(~location, scales="free_x", nrow=1)

barplot2

png


Community ecology

The Phyloseq package has many functions for exploring diversity within and between samples. This includes multiple methods of calculating distance matrices between samples.

An ongoing debate with metabarcoding data is whether relative frequency of OTUs should be taken into account, or whether they should be scored as presence or absence. Phyloseq allows you to calculate distance using both qualitative (i.e. presence/absence) and quantitative (abundance included) measures, taken from community ecology diversity statistics.

For qualitative distance use the Jaccard method

jac_dist <- distance(physeq.rarefied, method = "jaccard", binary = TRUE)
jac_dist
jac_dist
              AM1       AM2       AM3       AM4       AM5       AM6       AS2
    AM2 0.7692308                                                            
    AM3 0.6923077 0.8181818                                                  
    AM4 0.5714286 0.6666667 0.7857143                                        
    AM5 0.6666667 0.8235294 0.7647059 0.6666667                              
    AM6 0.4285714 0.8000000 0.7333333 0.6250000 0.5555556                    
    AS2 0.8461538 0.7777778 0.9090909 0.7500000 0.8823529 0.8666667          
    AS3 0.8666667 0.8181818 0.9230769 0.8666667 0.7647059 0.8823529 0.6666667
    AS4 0.8333333 0.7500000 0.9000000 0.7272727 0.8750000 0.8571429 0.2000000
    AS5 0.8571429 0.8000000 0.9166667 0.7692308 0.8888889 0.8750000 0.1666667
    AS6 0.8750000 0.8333333 0.9285714 0.8000000 0.9000000 0.8888889 0.5555556
              AS3       AS4       AS5
    AM2                              
    AM3                              
    AM4                              
    AM5                              
    AM6                              
    AS2                              
    AS3                              
    AS4 0.6250000                    
    AS5 0.7000000 0.3333333          
    AS6 0.7500000 0.5000000 0.6000000

Now you can run the ordination function in Phyloseq and then plot it.

qual_ord <- ordinate(physeq.rarefied, method="PCoA", distance=jac_dist)
plot_jac <- plot_ordination(physeq.rarefied, qual_ord, color="location", title='Jaccard') + theme(aspect.ratio=1) + geom_point(size=4)
plot_jac

png

Using quantitative distances

For quantitative distance use the Bray-Curtis method:

bc_dist <- distance(physeq.rarefied, method = "bray", binary = FALSE)

Now run ordination on the quantitative distances and plot them

Also try to make the points of the graph smaller

Solution

quant_ord <- ordinate(physeq.rarefied, method="PCoA", distance=bc_dist)
plot_bc <- plot_ordination(physeq.rarefied, quant_ord, color="location", title='Bray-Curtis') + theme(aspect.ratio=1) + geom_point(size=3)
plot_bc

png

View multiple variables on a PCoA plot

Phyloseq allows you to graph multiple characteristics. Try to plot location by color and temperature by shape

Solution

plot_locTemp <- plot_ordination(physeq.rarefied, quant_ord, color="location", shape='temperature', title="Jaccard Location and Temperature") + 
theme(aspect.ratio=1) + 
geom_point(size=4)

plot_locTemp

png


Permutational ANOVA

Phyloseq includes the adonis function to run a PERMANOVA to determine if OTUs from specific metadata fields are significantly different from each other.

adonis(bc_dist ~ sample_data(physeq.rarefied)$location)
    Call:
    adonis(formula = bc_dist ~ sample_data(physeq.rarefied)$location) 
    
    Permutation: free
    Number of permutations: 999
    
    Terms added sequentially (first to last)
    
                                          Df SumsOfSqs  MeanSqs F.Model      R2
    sample_data(physeq.rarefied)$location  1   0.13144 0.131438   5.938 0.39751
    Residuals                              9   0.19921 0.022135         0.60249
    Total                                 10   0.33065                  1.00000
                                          Pr(>F)   
    sample_data(physeq.rarefied)$location  0.002 **
    Residuals                                      
    Total                                          
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Is temperature significantly different?

Run PERMANOVA on the temperature variable

Solution

adonis(bc_dist ~ sample_data(physeq.rarefied)$temperature)
    Call:
    adonis(formula = bc_dist ~ sample_data(physeq.rarefied)$temperature) 
    
    Permutation: free
    Number of permutations: 999
    
    Terms added sequentially (first to last)
    
                                             Df SumsOfSqs  MeanSqs F.Model      R2
    sample_data(physeq.rarefied)$temperature  2   0.06606 0.033031 0.99869 0.19979
    Residuals                                 8   0.26459 0.033074         0.80021
    Total                                    10   0.33065                  1.00000
                                             Pr(>F)
    sample_data(physeq.rarefied)$temperature  0.441
    Residuals                                      
    Total                                          

Temperature is not significantly different across samples

Subsetting data

It is possible to take subsets of the Phyloseq object you have created. You can subset by taxa or samples.

Subsetting by taxonomy

Often there are so many taxonomic groups represented, that it is useful to create a subset of the data that contains only OTUs belonging to a taxonomic rank. This is done with the subset_taxa function. In the example dataset provided, there are not too many taxa included, but for larger datasets this can be very useful

scomb <- subset_taxa(physeq, order=='Scombriformes')

the subsetted database is still a valid Phyloseq object:

scomb
    phyloseq-class experiment-level object
    otu_table()   OTU Table:         [ 3 taxa and 11 samples ]
    sample_data() Sample Data:       [ 11 samples by 8 sample variables ]
    tax_table()   Taxonomy Table:    [ 3 taxa by 8 taxonomic ranks ]
plot_bar(scomb, fill='species')

png

Exercise: take a subset of a taxon and plot it


Subset samples

Now we will try to make a subset of the Phyloseq object based on a field from the sample metadata table.

First, let’s look at the metadata:

sample_data(physeq)
A sample_data: 11 × 8
fwd_barcoderev_barcodeforward_primerreverse_primerlocationtemperaturesalinitysample
<chr><chr><chr><chr><chr><fct><fct><chr>
AM1GAAGAGTAGCGTCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTmudflats1232AM1
AM2GAAGAGTCTACTCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTmudflats1432AM2
AM3GAAGAGATGACTCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTmudflats1232AM3
AM4GAAGAGATCTATCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTmudflats1032AM4
AM5GAAGAGACAGATCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTmudflats1234AM5
AM6GAAGAGATACTGCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTmudflats1034AM6
AS2GAAGAGAGATACTCGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTshore 1232AS2
AS3GAAGAGATGCGATGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTshore 1232AS3
AS4GAAGAGTGCTACTCGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTshore 1034AS4
AS5GAAGAGACGTCATGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTshore 1434AS5
AS6GAAGAGTCATGTCGGACCCTATGGAGCTTTAGACCGCTGTTATCCCTADRGTAACTshore 1034AS6

The command is similar to the subset_taxa command. For example, if we want to only analyse the samples found in the mudflats:

mud <- subset_samples(physeq, location=='mudflats')

mud

The output is still a Phyloseq object:

    phyloseq-class experiment-level object
    otu_table()   OTU Table:         [ 33 taxa and 6 samples ]
    sample_data() Sample Data:       [ 6 samples by 8 sample variables ]
    tax_table()   Taxonomy Table:    [ 33 taxa by 8 taxonomic ranks ]

A taxonomy bar plot will only show the mudflats samples:

plot_bar(mud, fill='genus')

png

Exercise: create a PCoA plot of the mudflats subset

Hint: is there anything that should be done with the data before plotting?

Solution

First, rarefy the new dataset:

mud.rarefied <- rarefy_even_depth(mud, rngseed=1, sample.size=0.9*min(sample_sums(mud)), replace=F)

Now, calculate the distance and ordination

mud_dist <- distance(mud.rarefied, method = 'bray', binary=FALSE)

mud_ord <- ordinate(mud.rarefied, method="PCoA", distance=mud_dist)

Finally, plot the ordination

plot_mud <- plot_ordination(mud.rarefied, mud_ord, color='temperature',shape='salinity', title="Mud Flats BC PCoA")+
theme(aspect.ratio=1)+ geom_point(size=4)

plot_mud

png

Merge OTUs by taxa

The final function we will present is to merge the OTUs by taxonomy. You will notice that there are often multiple OTUs that are assigned to a single species or genus. This can be due to multiple factors, such as sequence error, gaps in the reference database, or a species that varies in the target region. If you are interested in focusing on the taxonomy, then it is useful to merge OTUs that assign to the same taxon. Fortunately, Phyloseq provides a useful function to do this.

We will merge all OTUs at the lowest taxonomic level

phyMerge <- tax_glom(physeq, taxrank = 'species',NArm = T)

Now have a look at the taxonomy table. You should see that there are fewer OTUs.

tax_table(phyMerge)
A taxonomyTable: 16 × 8 of type chr
kingdomphylumclassorderfamilygenusspeciesConfidence
OTU.8EukaryotaChordataActinopteriClupeiformes Clupeidae Sprattus Sprattus_muelleri NA
OTU.21EukaryotaChordataActinopteriClupeiformes Clupeidae Sprattus Sprattus_antipodum NA
OTU.17EukaryotaChordataActinopteriBlenniiformes TripterygiidaeEnneanectes Enneanectes_carminalis NA
OTU.4EukaryotaChordataActinopteriBlenniiformes TripterygiidaeForsterygionForsterygion_lapillum NA
OTU.11EukaryotaChordataActinopteriOsmeriformes Retropinnidae Retropinna Retropinna_retropinna NA
OTU.10EukaryotaChordataActinopteriBeloniformes Hemiramphidae HyporhamphusHyporhamphus_melanochirNA
OTU.3EukaryotaChordataActinopteriPerciformes Bovichtidae Bovichtus Bovichtus_variegatus NA
OTU.7EukaryotaChordataActinopteriPerciformes Nototheniidae Notothenia Notothenia_angustata NA
OTU.2EukaryotaChordataActinopteriMugiliformes Mugilidae Aldrichetta Aldrichetta_forsteri NA
OTU.5EukaryotaChordataActinopteriLabriformes Labridae Notolabrus Notolabrus_fucicola NA
OTU.22EukaryotaChordataActinopteriLabriformes Odacidae Odax Odax_pullus NA
OTU.14EukaryotaChordataActinopteriGadiformes Moridae Lotella Lotella_rhacina NA
OTU.23EukaryotaChordataActinopteriGadiformes Moridae PseudophycisPseudophycis_barbata NA
OTU.19EukaryotaChordataActinopteriScombriformes Trichiuridae Lepidopus Lepidopus_caudatus NA
OTU.1EukaryotaChordataActinopteriScombriformes Gempylidae Thyrsites Thyrsites_atun NA
OTU.12EukaryotaChordataActinopteriMyctophiformesMyctophidae Hygophum Hygophum_hanseni NA

Exercise

In the previous example, you might have noticed that any OTU that was not resolved to species (i.e. ‘NA’ in the species column) was removed from the table. Try to rerun the function, keeping all the unresolved species.

Solution

phyMergeNA <- tax_glom(physeq, taxrank = 'species',NArm = F)
tax_table(phyMergeNA)
A taxonomyTable: 23 × 8 of type chr
kingdomphylumclassorderfamilygenusspeciesConfidence
OTU.33EukaryotaChordataNA NA NA NA NA NA
OTU.8EukaryotaChordataActinopteri Clupeiformes Clupeidae Sprattus Sprattus_muelleri NA
OTU.21EukaryotaChordataActinopteri Clupeiformes Clupeidae Sprattus Sprattus_antipodum NA
OTU.29EukaryotaChordataActinopteri Clupeiformes Clupeidae Sprattus NA NA
OTU.17EukaryotaChordataActinopteri Blenniiformes TripterygiidaeEnneanectes Enneanectes_carminalis NA
OTU.4EukaryotaChordataActinopteri Blenniiformes TripterygiidaeForsterygionForsterygion_lapillum NA
OTU.11EukaryotaChordataActinopteri Osmeriformes Retropinnidae Retropinna Retropinna_retropinna NA
OTU.15EukaryotaChordataActinopteri Salmoniformes Salmonidae Salmo NA NA
OTU.10EukaryotaChordataActinopteri Beloniformes Hemiramphidae HyporhamphusHyporhamphus_melanochirNA
OTU.3EukaryotaChordataActinopteri Perciformes Bovichtidae Bovichtus Bovichtus_variegatus NA
OTU.7EukaryotaChordataActinopteri Perciformes Nototheniidae Notothenia Notothenia_angustata NA
OTU.2EukaryotaChordataActinopteri Mugiliformes Mugilidae Aldrichetta Aldrichetta_forsteri NA
OTU.9EukaryotaChordataActinopteri NA NA NA NA NA
OTU.5EukaryotaChordataActinopteri Labriformes Labridae Notolabrus Notolabrus_fucicola NA
OTU.22EukaryotaChordataActinopteri Labriformes Odacidae Odax Odax_pullus NA
OTU.31EukaryotaChordataChondrichthyesRajiformes Rajidae Dipturus NA NA
OTU.14EukaryotaChordataActinopteri Gadiformes Moridae Lotella Lotella_rhacina NA
OTU.23EukaryotaChordataActinopteri Gadiformes Moridae PseudophycisPseudophycis_barbata NA
OTU.28EukaryotaChordataActinopteri Gadiformes Moridae PseudophycisNA NA
OTU.19EukaryotaChordataActinopteri Scombriformes Trichiuridae Lepidopus Lepidopus_caudatus NA
OTU.1EukaryotaChordataActinopteri Scombriformes Gempylidae Thyrsites Thyrsites_atun NA
OTU.13EukaryotaChordataActinopteri Scombriformes CentrolophidaeSeriolella NA NA
OTU.12EukaryotaChordataActinopteri MyctophiformesMyctophidae Hygophum Hygophum_hanseni NA


Exercise

What would happen if you set the taxrank parameter in the tax_glom function to a higher taxonomic level, like family?

Solution

phyMergeFAM <- tax_glom(physeq, taxrank = 'family',NArm = T)

tax_table(phyMergeFAM)
A taxonomyTable: 16 × 8 of type chr
kingdomphylumclassorderfamilygenusspeciesConfidence
OTU.8EukaryotaChordataActinopteri Clupeiformes Clupeidae NANANA
OTU.4EukaryotaChordataActinopteri Blenniiformes TripterygiidaeNANANA
OTU.11EukaryotaChordataActinopteri Osmeriformes Retropinnidae NANANA
OTU.15EukaryotaChordataActinopteri Salmoniformes Salmonidae NANANA
OTU.10EukaryotaChordataActinopteri Beloniformes Hemiramphidae NANANA
OTU.3EukaryotaChordataActinopteri Perciformes Bovichtidae NANANA
OTU.7EukaryotaChordataActinopteri Perciformes Nototheniidae NANANA
OTU.2EukaryotaChordataActinopteri Mugiliformes Mugilidae NANANA
OTU.5EukaryotaChordataActinopteri Labriformes Labridae NANANA
OTU.22EukaryotaChordataActinopteri Labriformes Odacidae NANANA
OTU.31EukaryotaChordataChondrichthyesRajiformes Rajidae NANANA
OTU.14EukaryotaChordataActinopteri Gadiformes Moridae NANANA
OTU.19EukaryotaChordataActinopteri Scombriformes Trichiuridae NANANA
OTU.1EukaryotaChordataActinopteri Scombriformes Gempylidae NANANA
OTU.13EukaryotaChordataActinopteri Scombriformes CentrolophidaeNANANA
OTU.12EukaryotaChordataActinopteri MyctophiformesMyctophidae NANANA

Key Points

  • The Phyloseq package has many useful features to conduct statistical analyses of your data