Metabarcoding analyses - Bioinformatics part

# Metabarcoding analyses - Bioinformatics part
## Migale bioinformatics facility
### Olivier Rué - Cédric Midoux
### 2022-06-07

---

# Practical information

- 9h00 - 17h00

- 2 breaks: morning and afternoon

- Possibility to have lunch in the INRAE restaurant

---

# Getting to know each other

* Who are you?
  - Institution, laboratory, position…
* What are your needs in metagenomics?
* Have you already worked with metagenomics data?
  - Which kind of data?
  - Aim of the study?
* Have you generated data for a new analysis?
  - Which design? How many samples? Which sequencing technology?

---

# Migale team

* <a href="https://migale.inrae.fr/">Migale website</a>
* A service dedicated to data analysis
 - Specialists in Metagenomics, Genomics, Bacterial genome assembly and annotation
 - Bioinformatics & Statistics
 - 84 projects since 2016
 - Collaboration or Support
* Developments
 - FROGS
 - easy16S, affiliationExplorer

Discover the service offer <a href="https://analyses.migale.inra.fr/presentation/service-offer/service-offer.html">here</a>

---

# Objectives

After this 4-day training, you will:

* Know the main principles, advantages and limitations of amplicon sequencing data analysis
* Be able to use .large[**FROGS**] (through Galaxy) and .large[**phyloseq**] (through easy16S) tools on the training data set
* Be able to identify tools and parameters adapted to your own analyses

---

# Program

* Day 1 & 2: Bioinformatics
* Day 3 & 4: Statistics

Plus a session to practice on your own data or on another dataset.

---

# Introduction to amplicon analyses

---

# Meta-omics using next-generation sequencing (NGS)

---

# Strengths and weaknesses of amplicon analyses?

http://scrumblr.ca/strengths_weaknesses

---

## Strengths

* Detect subdominant microorganisms present in complex samples → microbial inventories 
* Get (approximate) relative abundances of different taxa in samples
* Analyze and compare many taxa (hundreds) at the same time 
* Taxonomic profiles of the communities (usually up to genus level, and sometimes up to species or strain)
* Low cost

## Weaknesses

* Compositional data, many biases -> no absolute quantification
* Exact identification of the organisms is difficult
* Hard to distinguish live and dead fractions of the communities 
* No functional view of the ecosystem

---
# Gene marker power

---

# Choice of a marker gene

The ideal marker gene:

* is ubiquitous
* is conserved among taxa
* is divergent enough to distinguish strains
* is not subject to lateral gene transfer
* has only one copy per genome
* has conserved regions to design *specific* primers
* is well enough characterized to be present in databases for taxonomic affiliation

---

# Bacterial targets

The genes that have been proposed for this task include those encoding:

* 16S / 23S rRNA
* DNA gyrase subunit B (gyrB)
* RNA polymerase subunit B (rpoB)
* Elongation factor Tu (tuf)
* DNA recombinase protein (recA)
* protein synthesis elongation factor-G (fusA)
* dinitrogenase protein subunit D (nifD) ...

Bacterial lineages vary in their genomic contents, which suggests that different genes might be needed to resolve the diversity within certain taxonomic groups.

---

# The gene encoding the small subunit of the ribosomal RNA

* The most widely used gene in molecular phylogenetic studies
* Ubiquitous gene: 16S rDNA in prokaryotes; 18S rDNA in eukaryotes
* Encodes a ribosomal RNA: a non-coding RNA (not translated) that is part of the small subunit of the ribosome, which translates mRNA into proteins
* Not subject to lateral gene transfer
* Availability of databases facilitating comparison
 
---

## 16S rRNA structure

<img src="images/frogs003.png" width="90%" style="display: block; margin: auto;" />

<img src="images/frogs004.png" width="100%" style="display: block; margin: auto;" />

---

# gyrB as an example of an interesting marker gene

* A single-copy housekeeping gene that encodes the subunit B of DNA gyrase, a type II DNA topoisomerase, and therefore plays an essential role in DNA replication. 
* Essential and ubiquitous in bacteria
* Higher base substitution rate than 16S rDNA
* Sufficiently large in size for use in analysis of microbial communities.
* Also present in Eukarya and sometimes in Archaea but it shows enough sequence dissimilarity between the three domains of life to be used selectively for Bacteria.

---

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1371/journal.pone.0204629">Poirier et al (2018)</a></cite>

---

# Eukaryotic counterpart

* 18S (small subunit ribosomal RNA)
* ITS (Internal Transcribed Spacers)
  - Length variability (50-1000 nt)
  - Many copies (up to hundreds!)

---
# Primer choice

* 16S rRNA gene

<img src="images/frogs014.png" width="100%" style="display: block; margin: auto;" />

* Internal Transcribed Spacer

<img src="images/frogs015.png" width="100%" style="display: block; margin: auto;" />

* And many others...

---
class: inverse, middle, center

# Planning an experiment

---

# Planning an experiment

<img src="images/mermaid1.png" width="80%" style="display: block; margin: auto;" />

# Expected output after bioinformatics

* An abundance table: "species" (OTUs) in rows, samples in columns

<table>
 <thead>
 <tr>
 <th style="text-align:left;"> OTU </th>
 <th style="text-align:left;"> Affiliation </th>
 <th style="text-align:right;"> Sample1 </th>
 <th style="text-align:right;"> Sample2 </th>
 <th style="text-align:right;"> Sample3 </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> OTU1 </td>
 <td style="text-align:left;"> SpeciesA </td>
 <td style="text-align:right;"> 0 </td>
 <td style="text-align:right;"> 500 </td>
 <td style="text-align:right;"> 0 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> OTU2 </td>
 <td style="text-align:left;"> GenusA </td>
 <td style="text-align:right;"> 200 </td>
 <td style="text-align:right;"> 41 </td>
 <td style="text-align:right;"> 100 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> OTU3 </td>
 <td style="text-align:left;"> SpeciesB </td>
 <td style="text-align:right;"> 1000 </td>
 <td style="text-align:right;"> 100 </td>
 <td style="text-align:right;"> 1000 </td>
 </tr>
</tbody>
</table>
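
Such a table is usually read as relative abundances rather than raw counts. The sketch below is only an illustration, not FROGS code: it stores the counts of the example table in a dictionary and divides each count by its sample total. The names `counts` and `relative_abundances` are made up for this example.

```python
# Counts copied from the example table (OTU -> counts in Sample1..Sample3).
samples = ["Sample1", "Sample2", "Sample3"]
counts = {
    "OTU1": [0, 500, 0],
    "OTU2": [200, 41, 100],
    "OTU3": [1000, 100, 1000],
}


def relative_abundances(counts):
    """Divide each count by the total of its sample (column)."""
    totals = [sum(column) for column in zip(*counts.values())]
    return {
        otu: [c / t if t else 0.0 for c, t in zip(row, totals)]
        for otu, row in counts.items()
    }


rel = relative_abundances(counts)
for otu, row in rel.items():
    print(otu, [f"{value:.1%}" for value in row])
```

Each column of `rel` sums to 1, which is exactly why these data are called compositional.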

---

# Experimental design

---

# Thinking before acting

<img src="images/mermaid1.png" width="80%" style="display: block; margin: auto;" />

.pull-right[
<img src="images/frogs_question.png" width="40%" style="display: block; margin: auto;" />
]

---

# Sampling

<img src="images/mermaid2.png" width="80%" style="display: block; margin: auto;" />

* Number of samples?
* Associated metadata are essential (too much is better than too little)
* Contamination in the lab
* Preservation / Transportation
* Storage

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/bib/bbz155">Bharti and Grimm (2019)</a></cite>

---

# DNA extraction and preparation

<img src="images/mermaid3.png" width="80%" style="display: block; margin: auto;" />

* Mechanical or chemical lysis?
* Choice of DNA extraction kit
* PCR amplification biases


---

## Universal primers are not so universal

* _Akkermansia_ genus detected (qPCR) but not found in metabarcoding results
* Primers used for amplification
  - F343: TACGGRAGGCAGCAG
  - R784: TACCAGGGTATCTAATCCT
* Mismatches in primers
  - 2 mismatches in Forward
  - 1 mismatch in Reverse
* No amplification...

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1111/1462-2920.13181">Alard et al (2016)</a></cite>
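
Counting mismatches between a degenerate primer and a candidate binding site can be sketched as follows. The forward primer is the F343 sequence quoted on this slide; the two binding sites are hypothetical strings invented for the example, not real _Akkermansia_ sequences, and `count_mismatches` is an illustrative helper rather than part of any tool.

```python
# IUPAC nucleotide codes: each symbol stands for a set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the site base is not allowed by the primer symbol."""
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have the same length")
    return sum(base not in IUPAC[symbol] for symbol, base in zip(primer, site))


F343 = "TACGGRAGGCAGCAG"           # forward primer from the slide
matching_site = "TACGGGAGGCAGCAG"  # hypothetical site: R pairs with G
divergent_site = "TACGGGAGGCTGCTG" # hypothetical site with two substitutions

print(count_mismatches(F343, matching_site))
print(count_mismatches(F343, divergent_site))
```

The comparison is done on the same strand and without gaps, which is enough to show why a taxon can be absent from the results even though it is present in the sample.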

---

## Biological biases

* 16S rRNA gene copy number spans more than an order of magnitude: from 1 to 15 in Bacteria, but only up to 5 in Archaea
* Only a minority of bacterial genomes harbor identical 16S rRNA gene copies
* Sequence diversity increases with increasing copy numbers. 
* While certain taxa harbor dissimilar 16S rRNA genes, others contain sequences common to multiple species.
* Quantification is impossible (in real life)!

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1371/journal.pone.0057923">Vetrovsky and Baldrian (2013)</a> ; <a href="https://doi.org/10.1186/2049-2618-2-11">Angly et al (2014)</a></cite>

---

# PCR amplification bias

* PCR amplification efficiency is sequence-dependent, especially for the region that binds to the primers.
* If one sequence is amplified 10% more efficiently than another in each cycle, it will be 1.1^30 ≈ 17.4 times more abundant after 30 cycles.
* This effect is most important when the sequence has one or more mismatches with the primer. 
* With one mismatch, amplification efficiency is usually significantly less, and with two or more mismatches the sequence may not be amplified to detectable levels.
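
The compounding effect of a small per-cycle advantage can be checked in a few lines. This is plain arithmetic under the simplifying assumption that the advantage is constant over all cycles; `fold_bias` is a name chosen for this example.

```python
def fold_bias(advantage_per_cycle, cycles):
    """Relative over-representation after repeated cycles of biased amplification."""
    return (1 + advantage_per_cycle) ** cycles


for cycles in (10, 20, 30):
    print(cycles, round(fold_bias(0.10, cycles), 1))
```

Reducing the number of PCR cycles is therefore one simple way to limit this bias.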

---

## PCR problems

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/nar/gkv717">Kebschull and Zador (2015)</a></cite>

---

# Sequencing

<img src="images/mermaid4.png" width="80%" style="display: block; margin: auto;" />

---

# Sequencing technologies?

---

## Main sequencing technologies for metabarcoding

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12866-017-1101-8">Allali et al (2017)</a></cite>

---

## Sequencing errors

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12864-015-2194-9">D'Amore et al (2016)</a></cite>

---

## Overlapping reads make it possible to correct some errors

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12864-015-2194-9">D'Amore et al (2016)</a></cite>
---

# Illumina MiSeq Sequencing

* DNA fragments are bound to the flowcell and sequenced (by synthesis) <a href="https://www.youtube.com/watch?v=fCd6B5HRaZ8">Video</a>
* Paired-end reads make it possible to reconstruct fragments longer than a single 250 or 300 bp read
* Low error rate
* Substitution-type miscalls are the dominant source of errors
* Affordable cost thanks to *multiplexing*
 
---

## Multiplexing

* Maximum of 384 indexes per run

---

# PacBio promises

* Get the full 16S sequence!
* <cite>We further demonstrate that full-length sequencing platforms are sufficiently accurate to resolve subtle nucleotide substitutions (but not insertions/deletions) that exist between intragenomic copies of the 16S gene</cite>.

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1038/s41467-019-13036-1">Johnson et al (2019)</a></cite>

---

# PacBio caveats

* <cite>Low sequencing accuracy and low coverage of terminal regions in public 16S rRNA databases deteriorate the advantages of long read length, resulting in low taxonomic resolution in amplicon sequencing of human gut microbiota</cite>

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1038/sdata.2018.68">Whon et al (2018)</a></cite>
---

## Sequencing biases

* Contamination between samples during the same run
* Contamination between samples during different runs (residual contaminants)
* Variability between runs: take it into account in the experimental design
* Variability within a run: add controls

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12915-014-0087-z">Salter et al (2014)</a></cite>

---

## Negative controls are important!

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12915-014-0087-z">Salter et al (2014)</a></cite>
---

## Negative controls are important!

<img src="images/frogs007.png" width="60%" style="display: block; margin: auto;" />
<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12915-014-0087-z">Salter et al (2014)</a></cite>

---

## Illustration

.pull-left[
<img src="images/frogs_ticks_paper.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
<cite>Here, we showed that contaminant OTUs from extraction and amplification steps can represent *more than half* the total sequence yield in sequencing runs, and lead to unreliable results when characterizing *tick microbial communities*. We thus strongly advise the routine use of negative controls in tick microbiota studies, and more generally in studies involving low biomass samples</cite>
]

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.3389/fmicb.2020.01093">Lejal et al (2020)</a></cite>

---

# Bioinformatics & Biostatistics

<img src="images/mermaid5.png" width="80%" style="display: block; margin: auto;" />

* Tools
* Databases
* Normalization
* Diversity indices

---
## Impact of method and targeted region

<div class="figure" style="text-align: center">
<img src="images/frogs005.png" alt="Compositions at the phylum level for Human gut and, using a range of different methods (separate subpanels within each group). " width="80%" />
Compositions at the phylum level for Human gut and, using a range of different methods (separate subpanels within each group). 
</div>

---
## Benchmarks

.pull-left[
<img src="images/frogs_benchmark_tools.jpg" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
* Be cautious with benchmarks!
* Input data are never identical -> results are never exactly the same
]

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1128/mSystems.00003-15">Kopylova et al (2016)</a></cite>

---

## Benchmarks

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12866-017-1101-8">Allali et al (2017)</a></cite>
---
class: inverse, middle, center

# Conclusion 1: sequencing data do not contain exactly what you sampled...

---

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1186/s12866-015-0351-6">Brooks et al (2015)</a></cite>

---

# Conclusion 2: but keep biases in mind when analyzing your data!

---

# Summary

---

## Key advice

* Discuss with everyone involved in the experiment, from the field technician to the statistician
* Each choice affects the following steps!

---

# Bioinformatics

---

## The aim is to find the correct amplicon sequences and their abundances for each sample

---

## Step 1: construct real amplicon sequences

---

## Step 2: Assign a taxonomy to sequences

* Is that easy?

---

## Bioinformatics solutions

* MG-RAST (2008)
* Mothur (2009)
* Qiime (2010)
* UPARSE (2013)
* FROGS (2014)
* DADA2 (2016)
* Qiime2 (2019)
* ...

---

## Main differences

* Ease of use 
  - command line vs graphical interfaces
  - fitting complexity
* Scaling
* Paradigm: Clustering or denoising
* Chimera detection
* Taxonomic affiliation method
  - with training set
  - blast alignment

---

## FROGS: Find Rapidly OTUs with Galaxy Solution

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1093/bioinformatics/btx791">Escudié et al (2017)</a></cite>

* Easy to use for biologists
* Up-to-date, well-suited tools
* Innovative affiliation tag to highlight database conflicts and uncertainties
* Designed by a group of experts in metabarcoding analysis
* Better accuracy than other tools on simulated and real 16S and ITS data
* [Full documentation](http://frogs.toulouse.inra.fr)


---

# Switch to hands-on: Galaxy

[Introduction](https://training.galaxyproject.org/training-material/topics/introduction/slides/introduction.html#1)

---

# Sequencing data

---

# Content of sequenced fragments

* The expected amplicon sequence

ACTGGGTGTAAGAGCT

* The primers are sequenced too

ACTGACTGGGTGTAAGAGCTCTTA

* With two reads:

  - R1 ACTGACTGGGTGTAAG
  - R2 TAAGAGCTCTTACACC
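
The relationship between the sequenced fragment and its two reads can be reproduced with a short sketch: R1 is the start of the fragment, and R2 is the start of its reverse complement. The read length of 16 is simply the one used in this toy example, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (R1, R2): R1 reads the fragment, R2 reads the opposite strand."""
    r1 = fragment[:read_length]
    r2 = reverse_complement(fragment)[:read_length]
    return r1, r2


fragment = "ACTGACTGGGTGTAAGAGCTCTTA"  # primers + amplicon, from the slide
r1, r2 = paired_end_reads(fragment, 16)
print("R1", r1)
print("R2", r2)
```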

---
# Content of sequenced fragments in multiplexed file

* Barcodes are added to each end

TTTTACTGACTGGGTGTAAGAGCTCTTACCCC

* With two reads:

  - R1 TTTTACTGACTGGGTG
  - R2 GGGGTAAGAGCTCTTA

---
## Demultiplexing

* Assign each read to a FASTQ file according to the barcode found
* The BARCODE FILE is expected to be tabular:
  - the first column is the sample name (unique, without spaces)
  - the second is the forward barcode sequence (None if only a reverse barcode is used)
  - the optional third is the reverse barcode sequence
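
The principle of demultiplexing can be sketched as below. This is not the FROGS implementation: it assumes exact barcode matches at the start of R1 and R2, and the barcode file content (`sampleA`, `sampleB` and their barcodes) is hypothetical.

```python
import csv
import io

# Hypothetical barcode file (tab-separated): sample, forward, optional reverse.
BARCODE_TSV = "sampleA\tTTTT\tGGGG\nsampleB\tAAAA\tCCCC\n"


def load_barcodes(handle):
    """Return {sample: (forward_barcode or None, reverse_barcode or None)}."""
    barcodes = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        sample, forward = row[0], row[1]
        reverse = row[2] if len(row) > 2 and row[2] else None
        barcodes[sample] = (None if forward == "None" else forward, reverse)
    return barcodes


def assign_sample(r1, r2, barcodes):
    """Return the sample whose barcodes start R1 and R2, or None if unmatched."""
    for sample, (forward, reverse) in barcodes.items():
        forward_ok = forward is None or r1.startswith(forward)
        reverse_ok = reverse is None or r2.startswith(reverse)
        if forward_ok and reverse_ok:
            return sample
    return None


barcodes = load_barcodes(io.StringIO(BARCODE_TSV))
print(assign_sample("TTTTACTGACTGGGTG", "GGGGTAAGAGCTCTTA", barcodes))
```

Real demultiplexers usually tolerate a configurable number of barcode mismatches and write unmatched reads to a separate file.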

---

# Switch to hands-on: FROGS Demultiplex

---

# FASTQ

---

# FASTQ syntax

The FASTQ format consists of 4 sections:

1. A FASTA-like header, but starting with the <code>@</code> symbol instead of <code>></code>. It is followed by an ID and optional text, similar to FASTA headers.
2. The measured sequence (typically on a single line), which may be wrapped until the <code>+</code> sign starts the next section.
3. A separator marked by the <code>+</code> sign, optionally followed by the same sequence ID and header as the first section.
4. The quality values for the sequence in section 2, which must be of the same length as that sequence.

---
# FASTQ syntax

Example

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```
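
A record like this one can be read with a few lines of Python. The sketch assumes the common case of exactly four lines per record (no wrapped sequences); `read_fastq` is an illustrative helper, and dedicated libraries are preferable for real data.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


example = io.StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)
records = list(read_fastq(example))
print(records[0][0], len(records[0][1]))
```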

---

# FASTQ quality

The odd-looking characters in the 4th section are encoded numerical values.
Each character represents one number, a so-called *Phred score*,
using a single-character encoding.

```bash
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
|    |    |    |    |    |    |    |    |
0....5...10...15...20...25...30...35...40
|    |    |    |    |    |    |    |    |
worst................................best
```

---
# FASTQ quality

In the scale shown earlier, the quality characters are on the top line. The numbers in the middle,
from 0 to 40, are called Phred scores. A Phred score P encodes the probability that the base call is wrong via the formula:

Error = 10^(-P/10)

In short:

- P=0 means 1/1 (100% probability of error)
- P=10 means 1/10 (10% probability of error)
- P=20 means 1/100 (1% probability of error)
- P=30 means 1/1000 (0.1% probability of error)
- P=40 means 1/10000 (0.01% probability of error)
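
The decoding can be written directly from the formula. The sketch assumes the Sanger (+33) encoding; the function names are chosen for this example.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Error = 10^(-P/10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!5I")
for score in scores:
    print(score, error_probability(score))
```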

---

# FASTQ quality encoding specificities

There was a time when instrument manufacturers could not agree on the
character at which the scale should start. The **current standard** shown previously is the so-called Sanger (+33)
format, where the Phred score is shifted by 33 to obtain the ASCII code. There is also an older +64 format that
starts close to where the +33 scale ends.

<div class="figure" style="text-align: center">
<img src="images/qualityscore.png" alt="FASTQ encoding values" width="80%" />
FASTQ encoding values
</div>
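
A rough way to tell the two encodings apart is to look at the range of characters in a file. The sketch below is only a rule of thumb for Illumina-style data: characters with an ASCII code below 59 cannot occur in the +64 formats, while the upper cutoff of 74 is an assumption that breaks for platforms reporting very high scores. `guess_offset` is a hypothetical helper.

```python
def guess_offset(quality_strings):
    """Guess the quality offset (33 or 64), or return None when ambiguous."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        return None
    if min(codes) < 59:
        return 33  # these characters do not exist in the +64 encodings
    if min(codes) >= 64 and max(codes) > 74:
        return 64  # rule of thumb only, see the assumptions above
    return None    # the observed range fits both encodings


print(guess_offset(["!''*((((***+))%%%++)"]))
print(guess_offset(["hhhhffffeeee"]))
print(guess_offset(["IIIIIIII"]))
```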

---

# FASTQ files

* R1

```bash
@id1:1
ACTGACTGGGTGTAAG
+
EF!![!:;;;;;::;A
```

* R2

```bash
@id1:2
TAAGAGCTCTTACACC
+
;:,??!!???;..FFF
```

---

# FASTQ files

* It is crucial to check the quality of the raw data
  - expected number of files?
  - expected number of reads per file?
  - data quality?

* Do not start an analysis if something is wrong
  - It is useless if some data are missing
  - You can (and should) discuss with the sequencing platform to understand what happened

---

# Switch to hands-on: Quality control

---

## Quality profiles

---

# Preprocess

---

## Preprocess

* Remove non-biological information
  - primers, barcodes, remaining sequencing primers...
* Filter on length
* Filter on nucleotide content
* Merge overlapping reads if possible
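
Primer removal and the two filters can be sketched on the toy fragment used earlier in these slides. The primers `ACTG` and `CTTA` are read off that toy example, matching is exact, and the length bounds are arbitrary values chosen for illustration; none of this reflects the defaults of a specific tool.

```python
def trim_primers(seq, forward_primer, reverse_primer_rc):
    """Return the sequence between both primers, or None if one is missing."""
    if not (seq.startswith(forward_primer) and seq.endswith(reverse_primer_rc)):
        return None
    return seq[len(forward_primer):len(seq) - len(reverse_primer_rc)]


def passes_filters(seq, min_length, max_length):
    """Keep sequences of the expected length without ambiguous bases (N)."""
    return min_length <= len(seq) <= max_length and "N" not in seq


merged = "ACTGACTGGGTGTAAGAGCTCTTA"  # primers + amplicon, from the toy example
amplicon = trim_primers(merged, "ACTG", "CTTA")
print(amplicon, passes_filters(amplicon, 10, 20))
```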

---

## Overlap
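
Merging a read pair amounts to finding where the end of R1 matches the start of the reverse complement of R2. The sketch below uses the toy reads from the earlier slides and requires an exact overlap; real mergers tolerate mismatches and use quality scores to resolve them. The function names and the `min_overlap` default are assumptions for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(r1, r2, min_overlap=5):
    """Merge R1 with reverse-complemented R2 on their longest exact overlap."""
    r2_rc = reverse_complement(r2)
    for size in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-size:] == r2_rc[:size]:
            return r1 + r2_rc[size:]
    return None  # no sufficient overlap: the pair cannot be merged


print(merge_pair("ACTGACTGGGTGTAAG", "TAAGAGCTCTTACACC"))
```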

---

# Switch to hands-on: FROGS Preprocess

---

# Clustering

---
## Sequencing data are noisy

---

## OTU paradigm

* Operational Taxonomic Unit

---

## ASV paradigm

* Amplicon Sequence Variants

<cite style="font-size:0.7em">ASV are inferred by a de novo process in which biological sequences are discriminated from errors on the basis of the expectation that biological sequences are more likely to be repeatedly observed than are error-containing sequences.</cite>

---

## ASV promises

<img src="images/frogs_asv_resolution.png" width="80%" style="display: block; margin: auto;" />
---

## Operational Taxonomic Units

---

## OTUs construction strategies

* De novo OTU picking
  - by choosing a fixed sequence dissimilarity threshold
  - by relying on a small local linking threshold, representing the maximum number of differences between two amplicons
* Closed-reference OTU picking
  - by using a reference databank
  - discards all reads not similar to the reference databank
* Open-reference OTU picking
  - by using a reference databank
  - de novo clusters remaining reads

---

## Fixed sequence dissimilarity: the traditional 97%...

---
## ... is input order dependent
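
The order dependence can be reproduced with a toy greedy clustering. To keep the example small, it uses a maximum number of substitutions between equal-length sequences instead of a 97% identity threshold; the three sequences and the function names are invented for the illustration.

```python
def differences(a, b):
    """Number of substitutions; sequences are assumed to have the same length."""
    return sum(x != y for x, y in zip(a, b))


def greedy_clusters(sequences, max_differences):
    """Assign each sequence to the first centroid within the threshold."""
    clusters = {}  # centroid -> members, in order of creation
    for seq in sequences:
        for centroid in clusters:
            if differences(seq, centroid) <= max_differences:
                clusters[centroid].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no centroid is close enough: new cluster
    return list(clusters.values())


a, b, c = "AAAA", "AAAT", "AATT"
print(greedy_clusters([a, b, c], 1))
print(greedy_clusters([b, a, c], 1))
```

The same three sequences form two clusters in the first order and a single cluster in the second, because the centroid is whichever sequence happens to come first.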

---

## 97%?

---

## Swarm: A smart idea

---
## Swarm

* A robust and fast clustering method for amplicon-based studies
* The purpose of swarm is to provide a novel clustering algorithm to handle large sets of amplicons
* swarm results are resilient to input-order changes and rely on a small local linking threshold d, the maximum number of differences between two amplicons
* swarm forms stable high-resolution clusters, with a high yield of biological information
* Drawback: forms many low-abundance OTUs that are in fact artifacts and need to be removed
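
The idea of a local linking threshold can be sketched as follows. This is a simplified illustration, not the swarm algorithm itself: it counts substitutions only, ignores swarm's abundance-based cluster breaking and its optimizations, and the amplicons and abundances are made up.

```python
def differences(a, b):
    """Substitutions between two sequences, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def swarm_like(abundances, d=1):
    """Grow clusters by repeatedly linking amplicons at most d differences apart."""
    unassigned = set(abundances)
    clusters = []
    # Seeds go from most to least abundant; ties are broken alphabetically.
    for seed in sorted(abundances, key=lambda s: (-abundances[s], s)):
        if seed not in unassigned:
            continue
        unassigned.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = sorted(s for s in unassigned if differences(current, s) <= d)
            for s in linked:
                unassigned.remove(s)
                cluster.append(s)
                frontier.append(s)
        clusters.append(cluster)
    return clusters


amplicons = {"AAAA": 10, "AAAT": 3, "AATT": 1, "GGGG": 5}
print(swarm_like(amplicons))
```

Because seeds are ordered by abundance and links are followed step by step, shuffling the input does not change the clusters.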

---

## d: the small local linking threshold

---

## Swarm steps

---
class: center, inverse, middle

# Switch to hands-on: FROGS clustering

---
class: center, inverse, middle

# Chimera removal

---
# Chimera

.pull-left[
<img src="images/frogs_chimera.jpg" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="images/chimera.gif" width="100%" style="display: block; margin: auto;" />
]
---
# Chimera detection strategies

* Reference based: against a database of «genuine» sequences
* De novo: against abundant sequences in the samples

* FROGS uses <a href="https://doi.org/10.7717/peerj.2584">vsearch</a> as chimera removal tool

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.7717/peerj.2584">Rognes et al (2016)</a></cite>
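
The de novo idea can be illustrated with a toy check: a sequence is suspicious if it equals the start of one more abundant sequence joined to the end of another. This is not how vsearch works internally (it relies on alignments and scoring); the sketch assumes equal-length sequences, exact matches and hypothetical parents.

```python
def is_two_parent_chimera(query, parents):
    """True if query is the start of one parent joined to the end of another."""
    if query in parents:
        return False
    same_length = [p for p in parents if len(p) == len(query)]
    for left in same_length:
        for right in same_length:
            if left == right:
                continue
            for cut in range(1, len(query)):
                if left[:cut] + right[cut:] == query:
                    return True
    return False


parents = ["AAAAAAAA", "CCCCCCCC"]  # hypothetical abundant sequences
print(is_two_parent_chimera("AAAACCCC", parents))
print(is_two_parent_chimera("AAGACCCC", parents))
```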

---

# Sample-cross validation

* FROGS adds a sample-cross validation

<img src="images/frogs031.png" width="80%" style="display: block; margin: auto;" />
---
# Chimera rates in samples

* From 5 to 40% in 16S data

* Few with ITS (<10%)

---
class: center, inverse, middle

# Switch to hands-on: FROGS remove chimera

---
class: center, inverse, middle

# Abundance filters
---

# Filters

* Scientific considerations:
  - Low-abundance sequences are often chimeric
  - Impossible to distinguish the rare biosphere from artefacts
  - Better accuracy after removing singletons
  - Replicates are a smart way to keep only reliable OTUs
  - Contaminations?
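
Two common filters, a minimum total abundance and a minimum number of samples in which an OTU is present, can be sketched as below. The thresholds are illustrative and are not FROGS defaults; `OTU4` is a hypothetical singleton added to the example counts.

```python
def filter_otus(counts, min_total=2, min_samples=1):
    """Keep OTUs with enough reads overall and present in enough samples."""
    kept = {}
    for otu, row in counts.items():
        present_in = sum(1 for count in row if count > 0)
        if sum(row) >= min_total and present_in >= min_samples:
            kept[otu] = row
    return kept


counts = {
    "OTU1": [0, 500, 0],
    "OTU2": [200, 41, 100],
    "OTU4": [1, 0, 0],  # hypothetical singleton
}
print(sorted(filter_otus(counts)))
print(sorted(filter_otus(counts, min_samples=2)))
```

The second call shows the replicate idea: an OTU seen in only one sample is dropped even when it is abundant there, so such a filter must match the experimental design.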

---
class: center, inverse, middle

# Switch to hands-on: FROGS filters

---

# Taxonomic affiliation

* Blast
* RDP-classifier
* IDTAXA
* QIIME
* SINTAX
* ...

---

## RDP classifier caveats


<img src="images/frogs038.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/frogs033.png" width="80%" style="display: block; margin: auto;" />

---

# Advantages of FROGS affiliation

* Reports % coverage and % identity for each hit
* Gives all hits in a multi-affiliation file
  - Sometimes allows a smart correction (e.g. hits to the same species but different strains)
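
When several equally good hits exist, a consensus can be built rank by rank: a rank is kept as long as all hits agree, and replaced by a placeholder from the first disagreement onwards. The sketch below illustrates the principle only; the placeholder text, the rank layout (genus, species, strain) and the hit names are assumptions for this example.

```python
def consensus_taxonomy(hits, placeholder="Multi-affiliation"):
    """Keep each rank while all hits agree; use the placeholder afterwards."""
    consensus = []
    conflict = False
    for values in zip(*hits):
        if not conflict and len(set(values)) == 1:
            consensus.append(values[0])
        else:
            conflict = True
            consensus.append(placeholder)
    return consensus


# Hypothetical hits: same species, different strains.
hits = [
    ("GenusA", "SpeciesA", "strain1"),
    ("GenusA", "SpeciesA", "strain2"),
]
print(consensus_taxonomy(hits))
```

Here the species is still reliable even though the strain is ambiguous, which is the kind of correction the multi-affiliation file makes possible.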

---

# Switch to hands-on: FROGS affiliation OTU

---

# Phylogeny
---

## Phylogenetic similarity provides additional information about unknown OTUs

<cite style="font-size: 0.7em;position: absolute;bottom: 5px;"><a href="https://doi.org/10.1016/j.heliyon.2019.e03089">Chakraborty et al (2020)</a></cite>

---

# Phylogenetic tree

* FROGS allows you to build a phylogenetic tree:
  - MAFFT to compute the multiple sequence alignment
  - FastTree to build the rooted phylogenetic tree
  - Essential to compute UniFrac distances

* Not always possible if sequence diversity is too high (e.g. ITS)

---

# After this training... the real life

* Specific to FROGS
  - frogs-support@inrae.fr

* Migale support
  - help-migale@inrae.fr

* Want to collaborate with us?
 - <a href="https://migale.inrae.fr/ask-data-analysis">https://migale.inrae.fr/ask-data-analysis</a>