# Microbial genome comparison
## Hands-on bioinformatics training cycle
### Hélène Chiapello - Valentin Loux
<br/>(helene.chiapello|valentin.loux)<span
class="citation">@inrae.fr</span>
### 2022/05/10

---

# Practical information

- 9h30 - 17h00

- 2 breaks in the morning and in the afternoon

- Lunch break of 1 hour

<!-- 
- First session remote for this module … please be comprehensive !

-->

<p xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/"><a property="dct:title" rel="cc:attributionURL" href="https://formations.migale.inrae.fr/Comparative_Genomics/slides.html">These supports, </a> by <a rel="cc:attributionURL dct:creator" property="cc:attributionName" href="https://migale.inrae.fr">INRAE-Migale Bioinformatics Facility</a> are licensed under <a href="http://creativecommons.org/licenses/by-sa/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">CC BY-SA 4.0<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1"></a></p>

---
# A quick round table presentation

* Who are you ?
  - Institution, laboratory, position …
* Are you (somewhat) familiar with Galaxy ?
* What are your needs in microbial genomes comparison ?
* Have you already dealt with microbial genomics data ?
  - Aim of the study ?
  - Species studied
  - Number of genomes
  - Difficulties ?
* How do you feel today ? Ok or Ko ?

---

# Migale team

* <a href="https://migale.inrae.fr/">Migale website</a>

* INRAE infrastructure dedicated to providing
  - Computing & storage infrastructure
  - Training
  - Data analysis service (collaboration or support)
  - Bioinformatics tool development
  
* Member of the Institut Français de Bioinformatique

---

# Objectives

After this training, you will:

* Be able to construct a genomic dataset from public resources and evaluate its quality and diversity
* Know the outlines, advantages and limitations of the main microbial genome comparison approaches
* Be able to use several tools like .large[**dRep**], .large[**MAUVE**] and .large[**ROARY**]  under Galaxy or using a graphical interface on the training data set
* Have some keys to interpret results

---

# Program

* Morning: 
  + Dataset construction 
  + Dataset quality evaluation
  + Dataset diversity analysis 
  + Genome alignment
* Afternoon: 
  + Pan-Genome construction
  + First steps in phylogenomics
  + Data visualization and interpretation

---
class: heading-slide, middle, center
# Microbial comparative genomics

---
# A huge number of microbial genomes
Bacterial and metagenomic genome projects rank first among sequencing projects  
.pull-left[
<img src="images/gold-total-genomes.png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right[
<img src="images/gold-proka-genomes.png" width="90%" style="display: block; margin: auto;" />
]

Proteobacteria and Firmicutes: the two most sequenced groups of genomes

Source: <a href="https://gold.jgi.doe.gov/statistics">GOLD statistics</a>
---
# And there is still a lot more to explore, especially for microbes
.pull-left[
<img src="images/TreeofLife_New_BactNonCultivables.jpg" width="80%" style="display: block; margin: auto;" />
]
--
.pull-right[
- genomic data were recovered from diverse metagenomic samples 
- tree reconstructed from an alignment of 16 ribosomal proteins
- red dots indicate lineages lacking an isolated representative
- there are a large number of major lineages without isolated representatives
]
Source : Hug, L., Baker, B., Anantharaman, K. et al. A new view of the tree of life. Nat Microbiol 1, 16048 (2016). https://doi.org/10.1038/nmicrobiol.2016.48

---
# Frequent problems for microbial genome analysis and comparison

* Heterogeneous quality of sequencing and assembly
* Either a huge number of public genomes OR no closely related genome of the same species in public databases
* Difficulties with microbial taxonomy (classification) and nomenclature (naming of genus, species and strain) for many non-model organisms

---
# Why comparative genomics?

* To answer (not so simple) questions such as:
  - What is the genomic diversity within a microbial species / genus?
  - Is the genome structure conserved within a species / genus?
  - How does the gene repertoire evolve within a species / genus?
  - Could this diversity explain a given phenotype:
      - metabolism
      - probiotics (anti-inflammatory)
      - pathogenicity
  - …

---
# The training dataset

We will work on a reduced dataset of public *Salmonella* genomes

.pull-left[
<img src="images/SalmonellaClassification.Png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right[

<img src="images/SalmonellaNCBI.png" width="100%" style="display: block; margin: auto;" />
13,327 *Salmonella enterica* public assemblies at NCBI!
]
---
# The training dataset: a list of 16 *Salmonella enterica* public genomes (part 1)

Assembly_accession|Subspecies|Serotype|Strain|assembly_level
------------------|----------|--------|------|--------------
GCF_001951465.1|arizonae|18:z4,z23|CVM N27|Scaffold
GCF_001448925.1|arizonae|62:z36|5335/86|Contig
GCF_000756465.1|arizonae|62:z36|RKS2983|Complete Genome
GCF_000018625.1|arizonae|62:z4,z23| |Complete Genome
GCF_000983595.1|enterica|ParatyphiA|na|Scaffold
GCF_000026565.1|enterica|ParatyphiA|AKU_12601|Complete Genome
GCF_000011885.1|enterica|ParatyphiA|ATCC 9150|Complete Genome
GCF_000484015.1|enterica|ParatyphiB|SARA61|Contig

---
# The training dataset: a list of 16 *Salmonella enterica* public genomes (part 2)

Assembly_accession|Subspecies|Serotype|Strain|assembly_level
------------------|----------|--------|------|--------------
GCF_900002585.1|enterica|Typhi|na|Scaffold
GCF_000256015.1|enterica|Typhi|BL196|Contig
GCF_000195995.1|enterica|Typhi|CT18|Complete Genome
GCF_000007545.1|enterica|Typhi|Ty2|Complete Genome
GCF_001120665.1|enterica|Typhimurium|DT104|Scaffold
GCF_000006945.2|enterica|Typhimurium|LT2|Complete Genome
GCF_000210855.2|enterica|Typhimurium|SL1344|Complete Genome
GCF_000312745.2|enterica|Typhimurium|STm6|Contig

---
class: heading-slide, middle, center
# Dataset construction

---
# Dataset building

* Genomes of interest can be
  - already published and available in public databanks (ENA, NCBI, …)
  - **private**, not yet published.

* At a minimum, we need:
  - genome assemblies that are as complete as possible (contigs / scaffolds in FASTA format)
  - Structural and functional annotation:
    - GenBank or GFF format

* For private genomes, you could/should use Prokka  [*See module 9*]

* It is always better if the annotation is homogeneous across genomes

---
class: heading-slide, middle, center
# Quick reminder on format

---
# FASTA format

The FASTA format is used to represent sequence information. The format is very simple:
- A <code>></code> symbol at the start of a line marks the header line, where a FASTA record begins.
- A string of characters called the sequence id may follow the <code>></code> symbol.
- The header line may contain an arbitrary amount of text (including spaces) after the id.
- Subsequent lines contain the sequence.

<i>Example</i>

```bash
>foo
ATGCC
>bar other optional text could go here
CCGTA
>bidou
ACTGCAGT
TTCGN
>repeatmasker
ATGTGTcggggggATTTT
>prot2; my_favourite_prot
MTSRRSVKSGPREVPRDEYEDLYYTPSSGMASP
```
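
The rules above can be sketched in a few lines of Python. This is a minimal illustration that assumes plain (uncompressed) FASTA text; the function name `parse_fasta` is hypothetical and not part of any library.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence id to its sequence."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first word after '>'; the rest is free text.
            fields = line[1:].split(None, 1)
            seq_id = fields[0] if fields else ""
            records[seq_id] = []
        elif seq_id is not None:
            # A sequence may span several lines: collect them all.
            records[seq_id].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>foo
ATGCC
>bar other optional text could go here
CCGTA
>bidou
ACTGCAGT
TTCGN
"""
sequences = parse_fasta(example)
print(sequences["bidou"])  # ACTGCAGTTTCGN
```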
---
# GenBank format
The GenBank format is used to represent sequence **and** annotation information together. 
- The start of the annotation section is marked by a line beginning with the word **“LOCUS”**. 
- Features (CDS, genes) are annotated with their position, strand and qualifiers that contain the annotation.
- The start of the sequence section is marked by a line beginning with the word **“ORIGIN”** and the end of the section is marked by a line containing only **“//”**.

- NCBI, ENA (European Nucleotide Archive) and DDBJ (Japan) entries are synchronized every day.
- These three databanks agree on the list of features and qualifiers that can be used to annotate a sequence.
---

# Genbank entry example

```bash
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae partial genes.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
//
```
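
As a rough illustration of how the **LOCUS**, **ORIGIN** and **//** markers delimit a record, here is a minimal Python sketch that extracts the locus name and the sequence. The function name `read_genbank_sequences` and the shortened record are made up for the example; in practice a dedicated parser such as Biopython's `Bio.SeqIO` is usually preferable.

```python
def read_genbank_sequences(text):
    """Yield (locus_name, sequence) for each record of a GenBank flat file."""
    locus, chunks, in_origin = None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            locus = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit it and reset the state.
            yield locus, "".join(chunks)
            locus, chunks, in_origin = None, [], False
        elif in_origin:
            # Sequence lines look like "   21 atctccacct ...":
            # drop the leading coordinate and the spaces between blocks.
            chunks.extend(line.split()[1:])


# Shortened, illustrative record (not a complete GenBank entry).
record = """LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae partial genes.
ORIGIN
        1 gatcctccat atacaacggt
       21 atctccacct
//
"""
for name, seq in read_genbank_sequences(record):
    print(name, len(seq), seq[:10])
```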
---
# GFF format 
The **General Feature Format** contains annotation and (optionally) sequence.
It consists of one line per feature, each containing 9 tab-separated columns of data, plus optional track definition lines.

```bash
##gff-version 3
##sequence-region NZ_LHTK01000001 1 688985
# organism Salmonella enterica subsp. arizonae serovar 62:z36:- str. 5335/86
# date 17-JAN-2020
NZ_LHTK01000001	GenBank	contig	1	688985	.	+	1	ID=NZ_LHTK01000001;Dbxref=BioProject:PRJNA224116,taxon:1245396;Name=NZ_LHTK01000001;Note=Salmonella enterica subsp. arizonae serovar 62:z36:- str. 5335/86 ssp-IIIa_O62_mirahybrid1_c1%2C whole genome shotgun sequence.,REFSEQ INFORMATION: The reference sequence was derived from LHTK01000001. The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ \n##Genome-Assembly-Data-START##\nAssembly Method :: MIRA v. 3.9.18\nGenome Representation :: Full\nExpected Final Version :: Yes\nGenome Coverage :: 80.89x\nSequencing Technology :: 454,Illumina MiSeq\n##Genome-Assembly-Data-END##;collected_by=Institut Pasteur%2C Paris%2C France;collection_date=1986;comment1=REFSEQ INFORMATION: The reference sequence was derived from LHTK01000001. The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ \n##Genome-Assembly-Data-START##\nAssembly Method :: MIRA v. 3.9.18\nGenome Representation :: Full\nExpected Final Version :: Yes\nGenome Coverage :: 80.89x\nSequencing Technology :: 454%3B Illumina MiSeq\n##Genome-Assembly-Data-END##;country=USA;date=17-JAN-2020;host=Homo sapiens;mol_type=genomic DNA;organism=Salmonella enterica subsp. arizonae serovar 62:z36:- str. 5335/86;serogroup=O62;serovar=62:z36:-;strain=5335/86;sub_species=arizonae;submitter_seqid=ssp-IIIa_O62_mirahybrid1_c1
NZ_LHTK01000001	GenBank	pseudogene	1	1014	.	-	1	ID=LFZ49_RS22320.pseudogene;Alias=LFZ49_RS22320;Name=LFZ49_RS22320;pseudo=_no_value
NZ_LHTK01000001	GenBank	gene	1011	1634	.	-	1	ID=LFZ49_RS00010;Name=LFZ49_RS00010;old_locus_tag=LFZ49_00010
NZ_LHTK01000001	GenBank	mRNA	1011	1634	.	-	1	ID=LFZ49_RS00010.t01;Parent=LFZ49_RS00010
```
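
To make the 9-column layout concrete, the following Python sketch splits one feature line and decodes its attribute column. It is a simplified illustration (the names `GFF_COLUMNS` and `parse_gff_line` are hypothetical) and ignores comment and directive lines starting with `#`.

```python
from urllib.parse import unquote

GFF_COLUMNS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")


def parse_gff_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(values)}")
    feature = dict(zip(GFF_COLUMNS, values))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 is a ';'-separated list of key=value pairs, percent-encoded.
    attributes = {}
    for pair in feature["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = unquote(value)
    feature["attributes"] = attributes
    return feature


line = ("NZ_LHTK01000001\tGenBank\tgene\t1011\t1634\t.\t-\t1\t"
        "ID=LFZ49_RS00010;Name=LFZ49_RS00010;old_locus_tag=LFZ49_00010")
gene = parse_gff_line(line)
print(gene["type"], gene["start"], gene["end"], gene["attributes"]["ID"])
```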

---
# Practical : public genomes

## 1. How to gather a list of public genomes of interest?

- Work from the [prokaryotic public genomes available at NCBI](https://www.ncbi.nlm.nih.gov/genome/browse#!/prokaryotes/)
- Use the interface to filter, then download this table

- From this list of **accessions** you will have to download a list of files.

---
class: heading-slide, middle, center
# Demonstration : download GenBank and nucleotide FASTA files from NCBI

---
# Practical :  Public genomes -  NCBI web site

*  Go to the NCBI web site
* https://www.ncbi.nlm.nih.gov/
* browse to the "Genomes" section
 
<div class="figure" style="text-align: center">
<img src="images/recup-genomes-1.png" alt="NCBI web site " width="50%" />
<p class="caption">NCBI web site </p>
</div>

---
# Practical : Public genomes  list

* You will obtain a list of *complete* genomes with the following information :
  - accession (unique id) number
  - species
  - strain
  - completeness
  - **a link to download the genome file(s)** (Refseq or Deposited)

<div class="figure" style="text-align: center">
<img src="images/recup-genomes-2.png" alt="NCBI web site public genome list " width="50%" />
<p class="caption">NCBI web site public genome list </p>
</div>

https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/

---
# Practical : Public genomes - filter and download

* The list can be 
  - filtered with the *filter* button
  - downloaded (csv file) with the "download" button
<div class="figure" style="text-align: center">
<img src="images/recup-genomes-3.png" alt="NCBI web site public genome list- filter" width="50%" />
<p class="caption">NCBI web site public genome list- filter</p>
</div>

---
# Practical : Public genomes - Remote Web Site Structure Exploration

* Explore the remote web site.
* Example : 
  - accession **GCA_003181115.1_ASM318111v1**
  - **FTP directory** : [ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/GCA_003181115.1_ASM318111v1](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/GCA_003181115.1_ASM318111v1)
* Several file formats, including :
  - *accession*_genomic.gbff.gz : compressed **GenBank file**
  - *accession*_genomic.fna.gz : compressed **genomic FASTA file**
  - Full description : [ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt)

<div class="figure" style="text-align: center">
<img src="images/recup-genomes-4.png" alt="NCBI web site public genome list- filter" width="50%" />
<p class="caption">NCBI web site public genome list- filter</p>
</div>

---
# How to download a list of genome files in Galaxy?

* **Galaxy** can download a list of files.
* It only needs a **list of URLs** (http or ftp protocols)
* However, the table offers no direct download link to a (GenBank|GFF|FASTA) file.
  - We have to manipulate the tabular file to reconstruct the URL by concatenating
      - the FTP directory (column **FTP**) 
      - the accession number (end of the URL in column **FTP**)
      - the file suffix (e.g. *genomic.fna.gz*)
  
From :
[ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/GCA_003181115.1_ASM318111v1](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/GCA_003181115.1_ASM318111v1)

To :
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/GCA_003181115.1_ASM318111v1/GCA_003181115.1_ASM318111v1_genomic.fna.gz

* **Two ways** of doing this :
  - In your favorite spreadsheet software (Excel, LibreOffice)
  - Directly in **Galaxy** with Rule-based upload.
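
For readers who prefer scripting, the same URL reconstruction can be sketched in Python. This is an illustration only: `build_download_url` is a hypothetical helper, and the default suffix `genomic.fna.gz` follows the file naming used in NCBI assembly directories.

```python
def build_download_url(ftp_dir, suffix="genomic.fna.gz"):
    """Build the URL of one file from the FTP directory of an assembly."""
    ftp_dir = ftp_dir.rstrip("/")
    # The last path component is the full assembly name,
    # e.g. GCA_003181115.1_ASM318111v1
    assembly_name = ftp_dir.rsplit("/", 1)[-1]
    return f"{ftp_dir}/{assembly_name}_{suffix}"


ftp_dir = ("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/"
           "GCA_003181115.1_ASM318111v1")
print(build_download_url(ftp_dir))
print(build_download_url(ftp_dir, suffix="genomic.gbff.gz"))
```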
  
---
# Practical : Public genomes - Connect to Galaxy

![NCBI web site](images/recup-genomes-5.png)
  https://galaxy.migale.inrae.fr

---
# Practical : Public genomes list - Upload

- Select upper-left upload button
  - Upload the csv file, convert it to tabular (pen icon)
  - Upload button (again)
- Rule-based tab
- Load tabular from history 
- Build

You will then be able to apply a list of **rules and transformations** to this tabular file.

![NCBI web site](images/recup-genomes-8.png)

---
# Practical : Public genomes list - Remove first row

---
# Practical : Public genomes list - Extract id (1)

Use a *regular expression* to extract the id

---
# Practical : Public genomes list - Extract id (2)

Use a *regular expression* to extract the id :
  - Apply it to column P
  - Create columns matching expression groups (the parts between parentheses) :
    - ftp://.\*/(.\*)
      - ".*" matches any sequence of characters
      - Because matching is greedy, this expression captures all the characters found after the last /
      - It creates a new column containing what was captured on each line
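
The behaviour of this expression can be checked outside Galaxy. The short Python sketch below assumes that the rule builder uses the same greedy-matching semantics as Python's `re` module, which holds for this simple pattern.

```python
import re

# The first ".*" is greedy, so it swallows everything up to the LAST "/";
# the group "(.*)" then captures what remains: the assembly name.
pattern = re.compile(r"ftp://.*/(.*)")

url = ("ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/181/115/"
       "GCA_003181115.1_ASM318111v1")
match = pattern.match(url)
assembly_id = match.group(1)
print(assembly_id)  # GCA_003181115.1_ASM318111v1
```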

---
# Practical : Public genomes list - Identify Column  with ID

- Column Q is now filled with the ID

![NCBI web site](images/recup-genomes-12.png)

---
# Practical : Public genomes list - Add a column with fixed value

- Add a column with "/"
- Add a column with the "suffix" (*e.g.* _genomic.fna.gz)

---
# Practical : Public genomes list - Concatenate columns

- Concatenate column "URL" and "fixed value with /"
- Concatenate the preceding column and the accession
- Concatenate the preceding column and the suffix

---
# Practical : Public genomes list - Define columns

- Define the last column (containing the file URL you have constructed) as a URL
- It will tell Galaxy where to look for the files to download

<img src="images/recup-genomes-16.png" width="30%" style="display: block; margin: auto;" />
---
# Practical : Public genomes list - Define columns(2)

- [Optional] define the accession column as a "name"
- It will tell Galaxy where to look for the name to give to the files downloaded (otherwise it gives the URL as the name)

---
# Practical : Public genomes list - Upload files from the built list

- Check the rules
- Save it (wrench icon) for later
- click on upload

---
# Practical : Public genomes list - Launch Upload

* The tabular genome description file is in "Shared Data/ Data Library/ Formation Génomique Comparée/DataSet/DataSalmonella.tabular"

* The backup of the rules file is in "Shared Data/ Data Library/ Formation Génomique Comparée/ Correction/rule_based_ipload.json".

* Rules should be adapted to your tabular file

---
# Practical : create your dataset in Galaxy

- Connect to Galaxy (https://galaxy.migale.inrae.fr) with your own (or the training) account.

- Do not forget to log in (upper right …)

- Create a new history

- Copy all the genome FASTA & GFF files from "Shared Data / Data Libraries/ Formation Génomique Comparée/ Dataset/Fasta" and "Shared Data / Data Libraries/ Formation Génomique Comparée/ Dataset/GFF"

---
class: heading-slide, middle, center
# Quality control
---
## Why QC your genomes?

**Try to answer (not always) simple questions:**
--

- What is the "quality" of an assembly [compared to what we expect]? Is the assembly fragmented?
  - Length
  - Number of contigs
  - Number of scaffolds
  - GC%
- What is the "quality" of an annotation [compared to what we expect]?
  - Number of (pseudo)genes
  - Number of rRNA genes
  - Number of tRNA genes
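
The assembly-level metrics listed above are simple to compute from contig sequences. The Python sketch below is a toy illustration on made-up sequences; `assembly_stats` is a hypothetical helper, not a Quast function.

```python
def assembly_stats(sequences):
    """Compute basic QC metrics from a list of contig sequences."""
    total_length = sum(len(seq) for seq in sequences)
    gc_count = sum(seq.upper().count("G") + seq.upper().count("C")
                   for seq in sequences)
    return {
        "n_contigs": len(sequences),
        "total_length": total_length,
        "largest_contig": max((len(seq) for seq in sequences), default=0),
        "gc_percent": 100.0 * gc_count / total_length if total_length else 0.0,
    }


contigs = ["ATGCGC", "ATAT", "GGCC"]  # made-up contigs
stats = assembly_stats(contigs)
print(stats)
```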

---
# Tools to QC your dataset :

**Quast** (Quality Assessment Tool for Genome Assemblies, <a name=cite-Gurevich2013></a>([Gurevich, Saveliev, Vyahhi, and Tesler, 2013](https://doi.org/10.1093/bioinformatics/btt086))) is an easy-to-use tool to evaluate genome assemblies.

It gives you, in a single report, several metrics about one or more assemblies.

*Without* a reference:
- Number of contigs / scaffolds (>0, >500bp, > 1kb)
- Largest contig
- N50: the length of the **shortest contig** among the longest contigs that together cover 50% of the total assembly length (a length-weighted median of contig lengths)
- Number of Ns in the consensus sequence.

Additional metrics **with a reference** genome:
- NG50 (N50 computed against the reference genome size instead of the assembly length)
- Number of "misassemblies"
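
As a worked example of the N50 and NG50 definitions, here is a minimal Python sketch on made-up contig lengths. The function `n50` is illustrative and is not how Quast is implemented.

```python
def n50(contig_lengths, genome_size=None):
    """Return N50, or NG50 when a reference genome_size is given."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    target = total / 2
    running_total = 0
    # Walk through contigs from longest to shortest until half of the
    # total (assembly or reference) length is covered.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None  # the assembly covers less than half of genome_size


lengths = [80, 70, 50, 40, 30, 20, 10]  # total = 300
print(n50(lengths))        # 70  (80 + 70 = 150 >= 150)
print(n50(lengths, 400))   # 50  (80 + 70 + 50 = 200 >= 200)
```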

---
# Practical : Quast your dataset !

## Apply Quast to the 16 assemblies of your dataset.

---
class: center, middle

![coffee-break](https://www.nicepng.com/png/full/66-667402_coffee-break-logo-png.png)

---
class: heading-slide, middle, center
# Dataset diversity analysis

---
# Genome diversity evaluation
## Why ? 
- Build and de-replicate genome datasets
- Estimate genome similarity in a dataset and design an adapted comparative strategy

## How ? 
- Alignment based approaches (ANI)
- k-mer based approaches (MASH)

---
# Average Nucleotide Identity (ANI)
.pull-left[
- Meets the need for a robust measure of genomic relatedness and a systematic and scalable species assignment technique 
- Mean percent identity of the aligned regions of a pair of genomes
- Relies on pairwise alignments that may come either from aligned core genes or from genomic alignments
- Can easily be used to build phylogenetic trees using distance methods
- Is implemented in several bioinformatics tools (gANI, fastANI)
]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/spiroplasma-pangenome.png" alt="Pangenomics, phylogenomics, and ANI of 31 Spiroplasma genomes." width="40%" />
<p class="caption">Pangenomics, phylogenomics, and ANI of 31 Spiroplasma genomes.</p>
</div>
]
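
To illustrate "mean identity of aligned regions", the Python sketch below averages the identity of aligned regions weighted by their length. This is one possible convention, assumed here for the example: tools differ in how they fragment, filter and weight alignments, and the numbers are made up.

```python
def average_nucleotide_identity(alignments):
    """Length-weighted mean identity.

    alignments: list of (aligned_length, percent_identity) pairs.
    """
    total_length = sum(length for length, _ in alignments)
    if total_length == 0:
        raise ValueError("no aligned regions")
    weighted = sum(length * identity for length, identity in alignments)
    return weighted / total_length


# Two made-up aligned regions: 1 kb at 99% and 3 kb at 97% identity.
regions = [(1000, 99.0), (3000, 97.0)]
print(average_nucleotide_identity(regions))  # 97.5
```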

---
# Average Nucleotide Identity (ANI)
- ANI strongly correlates (R = 0.79 for logarithmic correlation) with the 16S rRNA gene sequence identity and can resolve areas where the 16S rRNA gene is inadequate (intra-species level)

- The average rate of synonymous substitutions shows a tight correspondence to ANI, suggesting that ANI may also be a useful descriptor of the evolutionary distance

- ANI shows a strong linear correlation to DNA–DNA reassociation values, and the 70% DNA–DNA reassociation standard corresponds to ≈93–94% ANI i.e. strains that show >94% ANI should belong to the same species
--

<div class="figure" style="text-align: center">
<img src="images/ANI_F2.large.jpg" alt="Relationships between ANI, 16S rRNA, mutation rate, and DNA–DNA reassociation" width="80%" />
<p class="caption">Relationships between ANI, 16S rRNA, mutation rate, and DNA–DNA reassociation</p>
</div>
Source : <a name=cite-Konstantinidis2567></a>([Konstantinidis and Tiedje, 2005](https://www.pnas.org/content/102/7/2567))

---
# MASH: fast (meta)genome distance estimation using MinHash
.pull-left[
### Mash computes a pairwise mutation distance without alignment, using shared k-mers
### Mash provides two basic functions for sequence comparisons: 
- sketch: converts a sequence or collection of sequences into a MinHash sketch 
- dist: compares two sketches and returns an estimate of the Jaccard index (i.e. the fraction of shared k-mers), a P value, and the Mash distance, which estimates the rate of sequence mutation under a simple evolutionary model 
]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/MASH_13059_2016_997_Fig1_HTML.gif" alt="Overview of the MinHash bottom sketch strategy for estimating the Jaccard index. " width="50%" />
<p class="caption">Overview of the MinHash bottom sketch strategy for estimating the Jaccard index. </p>
</div>
]
Source : <a name=cite-Ondov></a>([Ondov, Treangen, Melsted, Mallonee, Bergman, Koren, and Phillippy, 2016](https://doi.org/10.1186/s13059-016-0997-x))
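
The two functions can be mimicked with a toy Python sketch. It follows the bottom-sketch idea and the Mash distance formula D = -1/k · ln(2j / (1 + j)) from the Mash paper, but simplifies the rest: it hashes with MD5 and uses forward-strand k-mers only, whereas Mash uses its own hash function and canonical k-mers. All function names and sequences are illustrative.

```python
import hashlib
import math


def kmers(sequence, k):
    """Return the set of k-mers of a sequence (forward strand only)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def sketch(sequence, k=5, size=100):
    """Bottom sketch: the `size` smallest k-mer hash values."""
    hashes = {int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")
              for kmer in kmers(sequence, k)}
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=100):
    """Estimate the Jaccard index from two non-empty bottom sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k=5):
    """Mash distance D = -1/k * ln(2j / (1 + j)), with D = 1 when j = 0."""
    if jaccard == 0:
        return 1.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k


seq_a = "ACGTACGTTGCAGGCTAAGCTTAGGCATCG"
seq_b = "ACGTACGTTGCAGGATAAGCTTAGGCATCG"  # one substitution
j = jaccard_estimate(sketch(seq_a), sketch(seq_b))
print(j, mash_distance(j))
```

On real genomes, larger values of `k` and `size` are needed (Mash itself defaults to k = 21 and 1000 hashes per sketch).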

---
#  MASH distances correlate well with ANI
.pull-left[
- Dataset: 500 complete *E. coli* genomes
- Gray lines: model relationship D = 1 – ANI
- Each plot column shows a different sketch size
- Each plot row shows a different k-mer size k

- Increasing the sketch size improves the accuracy of the MASH distance, especially for more divergent sequences.
- There is a limit on how well the MASH distance can approximate ANI, especially for more divergent genomes (e.g. ANI considers only the core genome)
]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/MASH_13059_2016_997_Fig2_HTML.gif" alt="Scatterplots illustrating the relationship between ANI and Mash distance for a collection of Escherichia genomes." width="40%" />
<p class="caption">Scatterplots illustrating the relationship between ANI and Mash distance for a collection of Escherichia genomes.</p>
</div>
]
Source : ([Ondov, Treangen, Melsted, et al., 2016](https://doi.org/10.1186/s13059-016-0997-x))

---
# dRep: comparison and de-replication

.pull-left[
 
- dRep is a Python program which performs rapid pairwise genome comparisons using genomic distances
- It can be used for genome de-replication: identification of the 'same' genomes in a large set, then selection of the highest-quality genome in each replicate set

dRep uses 2 main steps:
1. a first (rapid) clustering of genomes using MASH similarity (90% by default) 
2. a second, more sensitive step based on ANI, applied to pairs of genomes that have at least a minimum level of MASH similarity 
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/dRep_Figure1.png" alt="Assembly and de-replication with dRep" width="90%" />
<p class="caption">Assembly and de-replication with dRep</p>
</div>
]

Source : <a name=cite-Olm></a>([Olm, Brown, Brooks, and Banfield, 2017](https://doi.org/10.1038/ismej.2017.126))

---
# dRep important concepts and parameters
1. **dRep primary clustering uses a greedy algorithm**, i.e. an algorithm that takes shortcuts to run faster and generally produces "quasi-optimal" solutions. *Genomes that are not in the same MASH primary cluster will never be compared with ANI*
2. **Importance of genome completeness:** MASH is very sensitive to genome completeness. The more incomplete the genomes you allow into your genome list, the more you must decrease the primary cluster threshold.
3. **The secondary ANI threshold** (default value: 99%, limit: 99.99%) indicates how similar genomes need to be to be considered the “same”. Depending on the application, you may modify this parameter, e.g. 95% ANI for species-level de-replication or 98% ANI to generate a set of genomes that are distinct when mapping short reads.
4. **The score used to pick representative genomes** takes into account several parameters such as completeness, contamination, strain heterogeneity and centrality (a measure of how similar a genome is to all other genomes in its cluster).
---
# dRep commands and parameters
1. **dRep compare**: compare and cluster a set of genomes using one or two clustering steps. 
2. **dRep dereplicate**: compare, cluster and de-replicate a set of genomes. During de-replication, the first step identifies groups of similar genomes, and the second step picks a Representative Genome (RG) for each cluster. 

**Parameters of primary and secondary clustering may have to be adjusted depending on the diversity of the dataset and on the objective of the comparison/dereplication**

**Default values of dRep clustering parameters:**

    -pa P_ANI, --P_ani P_ANI
                          ANI threshold to form primary (MASH) clusters
                          (default: 0.9)
    -sa S_ANI, --S_ani S_ANI
                          ANI threshold to form secondary clusters (default:
                          0.99)
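
Outside Galaxy, these thresholds are passed on the command line. The Python sketch below only builds such a command, without running it; the helper `drep_dereplicate_command`, the output directory and the genome file names are hypothetical, and the exact options should be checked against `dRep dereplicate --help` for the installed version.

```python
import shlex


def drep_dereplicate_command(out_dir, genomes, p_ani=0.9, s_ani=0.99):
    """Build (but do not run) a `dRep dereplicate` command line."""
    return ["dRep", "dereplicate", out_dir,
            "-pa", str(p_ani),   # primary (MASH) clustering threshold
            "-sa", str(s_ani),   # secondary (ANI) clustering threshold
            "-g", *genomes]      # genome FASTA files


# Species-level de-replication: secondary threshold lowered to 95% ANI.
cmd = drep_dereplicate_command("drep_out", ["g1.fasta", "g2.fasta"], s_ani=0.95)
print(shlex.join(cmd))
# dRep dereplicate drep_out -pa 0.9 -sa 0.95 -g g1.fasta g2.fasta
```

The resulting list could then be passed to `subprocess.run(cmd, check=True)` on a machine where dRep is installed.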

---
# dRep produces many result files

.pull-left[
### dRep relies on several other programs:
1. **Mash**: to build the primary clusters
2. **MUMmer**: to perform the ANI computation on pairwise genome alignments (used by default, but **fastANI** or **gANI** may also be used)
3. **CheckM** (Parks et al. 2015): to determine contamination and completeness of genomes
4. **Prodigal** (Hyatt et al. 2010): to predict genes (used by CheckM and gANI)
5. **SciPy** (Jones et al. 2001): to produce a final hierarchical clustering. 
]

.pull-right[
### Output files of dRep
<div class="figure" style="text-align: center">
<img src="images/dRep_output_files.png" alt="dRep results" width="90%" />
<p class="caption">dRep results</p>
</div>
]
Source : ([Olm, Brown, Brooks, et al., 2017](https://doi.org/10.1038/ismej.2017.126))

---
# Practice
- Use **dRep dereplicate** to explore the diversity and completeness of the Salmonella genome dataset and to de-replicate it
- explore and interpret results
- input : 16 genome fasta files
<div class="figure" style="text-align: center">
<img src="images/dRep_galaxy.png" alt="dRep on Migale Galaxy server" width="60%" />
<p class="caption">dRep on Migale Galaxy server</p>
</div>

---

# dRep results interpretation
Important outputs of dRep 
.pull-left[
The "Secondary_clustering_dendrograms.pdf" output file 
<div class="figure" style="text-align: center">
<img src="images/dRep_Secondary_clustering_dendrograms.png" alt="Secondary_clustering_dendrograms.pdf" width="70%" />
<p class="caption">Secondary_clustering_dendrograms.pdf</p>
</div>
]
--
.pull-right[
The "Winning_genomes.pdf" output file and the list of de-replicated genomes
<div class="figure" style="text-align: center">
<img src="images/dRep_Scoring_Winning_genomes.png" alt="Winning_genomes.pdf" width="70%" />
<p class="caption">Winning_genomes.pdf</p>
</div>
]

---
class: heading-slide, middle, center
# Genome alignment

---

# Genome alignment
.pull-left[
- Mostly targeted at **close genome comparisons** (generally at the intra-species level)
- A variety of applications: 
  - help with genome assembly, scaffolding and annotation
  - genome architecture comparison
  - genome micro-evolution analysis
  - discovery of DNA motifs or elements in conserved non-coding regions
  - ...
- Aligning whole genome sequences is a challenge: 
  - computationally intensive
  - heterogeneous quality of assemblies
  - broad variety of mutational and evolutionary events (including rearrangements)
  - result analysis, interpretation and visualisation are tricky
  ]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Alignement_geneomes_12859_2006_Article_1172_Fig1_HTML.jpg" alt="An approximate phylogeny of genome comparison tools over the past 30 years" width="30%" />
<p class="caption">An approximate phylogeny of genome comparison tools over the past 30 years</p>
</div>
Source : <a name=cite-Treangen></a>([Treangen and Messeguer, 2006](https://doi.org/10.1186/1471-2105-7-433)) 
]

---
# MUMmer: pairwise alignment with rearrangements
Based on three main steps: 
#### Step 1: Perform a maximal unique match (MUM) decomposition of the two genomes using suffix trees

<div class="figure" style="text-align: center">
<img src="images/Mummer1.png" alt="A maximal unique matching subsequence (MUM) of 39 nt (shown in uppercase) shared by Genome A and Genome B" width="40%" />
<p class="caption">A maximal unique matching subsequence (MUM) of 39 nt (shown in uppercase) shared by Genome A and Genome B</p>
</div>
--
<div class="figure" style="text-align: center">
<img src="images/Mummer2.png" alt="A Suffix tree for the sequence gaaccgacct" width="40%" />
<p class="caption">A Suffix tree for the sequence gaaccgacct</p>
</div>

Source : <a name=cite-Delcher1999></a>([Delcher, Kasif, Fleischmann, Peterson, White, and Salzberg, 1999](https://doi.org/10.1093/nar/27.11.2369))
---
# MUMmer: pairwise alignment with rearrangements
#### Step 2: Sort the matches found in the MUM alignment, and extract the longest possible set of matches that occur in the same order in both genomes
<div class="figure" style="text-align: center">
<img src="images/Mummer3.png" alt="LIS algorithm to find the longest set of MUMs whose sequences occur in ascending order in both Genome A and Genome B" width="50%" />
<p class="caption">LIS algorithm to find the longest set of MUMs whose sequences occur in ascending order in both Genome A and Genome B</p>
</div>

--
 
#### Step 3: Close the gaps (regions between the MUMs) by  
- detecting SNPs between MUMs
- identifying large inserts (transpositions or insertions) and repeats (overlapping MUMs)
- aligning small polymorphic regions using a standard dynamic programming algorithm

Source : ([Delcher, Kasif, Fleischmann, et al., 1999](https://doi.org/10.1093/nar/27.11.2369))
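
Step 2 is essentially a longest increasing subsequence (LIS) problem. The Python sketch below shows a plain quadratic-time LIS on made-up MUM positions, assuming each MUM is summarised by its start position in each genome. It is a simplified illustration: MUMmer itself uses a variant of LIS, and `longest_increasing_mums` is a hypothetical name.

```python
def longest_increasing_mums(mums):
    """Return the longest chain of MUMs ordered identically in both genomes.

    mums: list of (pos_in_a, pos_in_b) pairs, one per MUM, with distinct
    positions in each genome.
    """
    ordered = sorted(mums)  # sort by position in genome A
    if not ordered:
        return []
    best_len = [1] * len(ordered)      # length of the best chain ending at i
    previous = [None] * len(ordered)   # back-pointer to rebuild the chain
    for i in range(len(ordered)):
        for j in range(i):
            if ordered[j][1] < ordered[i][1] and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                previous[i] = j
    # Walk back from the end of the longest chain.
    i = max(range(len(ordered)), key=best_len.__getitem__)
    chain = []
    while i is not None:
        chain.append(ordered[i])
        i = previous[i]
    return chain[::-1]


# Made-up example: MUMs 3 and 6 are out of order in genome B.
mums = [(1, 1), (2, 2), (3, 6), (4, 4), (5, 5), (6, 3), (7, 7)]
print(longest_increasing_mums(mums))
# [(1, 1), (2, 2), (4, 4), (5, 5), (7, 7)]
```

The MUMs left out of the chain, here (3, 6) and (6, 3), are candidates for rearranged regions.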

---
# MUMmer: pairwise alignment with rearrangements
.pull-left[
#### Example of Nucmer results
- Alignment of M. genitalium (580 074 nt) vs M. pneumoniae (816 394 nt)
- The MUM alignment clearly shows five translocations of M. genitalium sequence with respect to M. pneumoniae, in agreement with the analysis of Himmelreich et al. 1997

Source : ([Delcher, Kasif, Fleischmann, et al., 1999](https://doi.org/10.1093/nar/27.11.2369))

]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Mummer5.png" alt="Alignment of M.genitalium and M.pneumoniae using FASTA (top), 25mers (middle) and MUMs (bottom)" width="40%" />
<p class="caption">Alignment of M.genitalium and M.pneumoniae using FASTA (top), 25mers (middle) and MUMs (bottom)</p>
</div>
]

---
# Practice
- Use **Galaxy-Nucmer** to align the two complete genomes of Salmonella Typhi CT18 (RefSeq accession: GCF_000195995.1) and Ty2 (RefSeq accession: GCF_000007545.1)
- Look at the result files
- What do you conclude regarding their genome structure?
- Generate a list of coordinates of aligned regions using the **Show-Coords** program
--
<div class="figure" style="text-align: center">
<img src="images/Nucmer_Galaxy.png" alt="Nucmer on Galaxy" width="70%" />
<p class="caption">Nucmer on Galaxy</p>
</div>
---
# Nucmer result interpretation
The Galaxy-nucmer outputs
- The *dotplot* output

<div class="figure" style="text-align: center">
<img src="images/Nucmer_dotplot.png" alt="Dotplot Salmonella SPA CT18 vs STY2" width="70%" />
<p class="caption">Dotplot Salmonella SPA CT18 vs STY2</p>
</div>
---
# Nucmer result interpretation
The Galaxy-nucmer outputs

- The *alignment* output

<div class="figure" style="text-align: center">
<img src="images/Nucmer_tabular_format.png" alt="Nucmer alignment output Salmonella SPA CT18 vs STY2" width="90%" />
<p class="caption">Nucmer alignment output Salmonella SPA CT18 vs STY2</p>
</div>

---
# Nucmer result interpretation
The Galaxy-nucmer outputs

- The *show-coords* output

<div class="figure" style="text-align: center">
<img src="images/show-coords.png" alt="Show-coords Salmonella SPA CT18 vs STY2" width="90%" />
<p class="caption">Show-coords Salmonella SPA CT18 vs STY2</p>
</div>
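
A coordinate table like this one can also be scanned programmatically. The sketch below is a rough illustration, assuming that each data row starts with the four coordinates S1, E1, S2, E2 and that a query start greater than the query end denotes a match on the reverse strand; the function names and the example rows are hypothetical.

```python
def parse_coords(lines):
    """Return (ref_start, ref_end, qry_start, qry_end, strand) tuples."""
    segments = []
    for line in lines:
        # Column separators "|" are treated as whitespace.
        fields = line.replace("|", " ").split()
        if len(fields) < 4:
            continue
        try:
            s1, e1, s2, e2 = (int(f) for f in fields[:4])
        except ValueError:
            continue  # header or other non-numeric line
        strand = "+" if s2 <= e2 else "-"
        segments.append((s1, e1, s2, e2, strand))
    return segments


def inverted_fraction(segments):
    """Fraction of aligned reference bases found on the reverse strand."""
    total = sum(e1 - s1 + 1 for s1, e1, _, _, _ in segments)
    inverted = sum(e1 - s1 + 1 for s1, e1, _, _, strand in segments if strand == "-")
    return inverted / total if total else 0.0


# Hypothetical rows, not real results from the CT18 vs Ty2 comparison.
example_rows = [
    "[S1]\t[E1]\t[S2]\t[E2]\t[LEN 1]\t[LEN 2]\t[% IDY]",
    "1\t1000\t1\t1000\t1000\t1000\t99.50",
    "2000\t3500\t5500\t4000\t1501\t1501\t98.00",
]
segments = parse_coords(example_rows)
```

A large inverted fraction concentrated in one region is the kind of signal that shows up as an anti-diagonal on the dotplot.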

---
#Mauve: multiple alignment with rearrangements
http://darlinglab.org/mauve/mauve.html

- One of the first multiple genome aligners that can deal with rearrangements
- Well suited to bacterial genome alignment
- Success largely due to its Graphical User Interface

Source : <a name=cite-Darling></a>([Darling, Mau, Blattner, and Perna, 2004](https://doi.org/10.1101/gr.2289704))

---
#Mauve: how does it work?
.pull-left[
Mauve alignment algorithm main steps:

- Find local alignments (multi-MUMs).
- Use the multi-MUMs to calculate a phylogenetic guide tree.
- Select a subset of the multi-MUMs to use as anchors—these anchors are partitioned into collinear groups called LCBs.
- Perform recursive anchoring to identify additional alignment anchors within and outside each LCB.
- Perform a progressive alignment of each LCB using the guide tree. 
 
Source : ([Darling, Mau, Blattner, et al., 2004](https://doi.org/10.1101/gr.2289704)) 
  ]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Mauve_78942-21f1_4o.jpg" alt="A pictorial representation of greedy breakpoint elimination in three genomes" width="30%" />
<p class="caption">A pictorial representation of greedy breakpoint elimination in three genomes</p>
</div>

]

---
#Mauve: Alignment of Nine Enterobacterial Genomes
.pull-left[
Genome alignment features
- Each contiguously colored region is a locally collinear block (LCB)
- LCBs can be in reverse complement orientation relative to the reference genome (K12)
- 45 LCBs with a minimum weight of 69, consisting of 2.86 Mb of conserved backbone sequence broken into 1252 segments
- Several known inversions are confirmed such as the O157:H7 EDL933 inversion relative to K12 and the large inversion about the origin of replication among the S. enterica serovars Typhi CT18 and Ty2

Source : ([Darling, Mau, Blattner, et al., 2004](https://doi.org/10.1101/gr.2289704)) 
  ]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Mauve_78942-21f6_4o.jpg" alt="Locally collinear blocks identified among the nine enterobacterial genomes" width="50%" />
<p class="caption">Locally collinear blocks identified among the nine enterobacterial genomes</p>
</div>

]
---
 
#Mauve companion tools
## Mauve Contig Mover (Rissman et al. 2009)
- Can order contigs of a draft genome relative to a related reference genome
- Based on iterative genome alignment using Mauve and requires anchors at both ends of contigs
- The reference used may be draft quality itself, or may have divergent genetic content

## ProgressiveMauve (Darling & Perna 2010)
- Can align regions conserved only in subsets of the genomes
- Sets up an anchor scoring function that penalizes alignment anchoring in repetitive regions of the genome and penalizes genomic rearrangement
- Uses a probabilistic scoring strategy (HMM) to reject erroneous alignments of unrelated sequence produced by Mauve
- In summary: aligns larger datasets of more distant genomes faster and more accurately than Mauve

---
# Practice Mauve
- Use *Mauve* **on your local computer** to align the 3 complete genomes of serotypes Typhi (CT18, RefSeq accession: GCF_000195995.1), Typhimurium (LT2, RefSeq accession: GCF_000006945.2) and Paratyphi A (ATCC 9150, RefSeq accession: GCF_000011885.1)
- Mauve input: fasta (or Genbank) files
- Choose *Mauve* and **not** *ProgressiveMauve* algorithm

<div class="figure" style="text-align: center">
<img src="images/Mauve.png" alt="Mauve on my local computer" width="70%" />
<p class="caption">Mauve on my local computer</p>
</div>

---
# Mauve results interpretation
- Genome alignment of serotypes Typhi (CT18), Typhimurium (LT2) and Paratyphi A
- Look at the LCB output (description of the other output files: http://darlinglab.org/mauve/user-guide/files.html )
- What do you conclude regarding genome structure ?
--

<div class="figure" style="text-align: center">
<img src="images/MauveLT2_CT18_1TCC9150_LCB.png" alt="Mauve on my local computer" width="70%" />
<p class="caption">Mauve on my local computer</p>
</div>

---
class: heading-slide, middle, center
# The microbial pan-genome
---
 
#The microbial pan-genome

.pull-left[
The term first appeared in 2005 in two publications
  * Tettelin et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: Implications for the microbial “pan-genome” Proc Natl Acad Sci U S A.
  * Medini et al. "The microbial pangenome" Curr Opin Genet Dev.  
  
*A bacterial species can be described by its **pan-genome** composed of a **core genome** containing genes present in all strains, and a **dispensable genome** containing genes present in two or more strains and genes unique to single strains.*
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/PanGenome.jpg" alt="Streptococcus group B pan genome" width="85%" />
<p class="caption">Streptococcus group B pan genome</p>
</div>
]
 
References: <a name=cite-Tettelin2005></a>([Tettelin, Masignani, Cieslewicz, et al., 2005](https://doi.org/10.1073/pnas.0506758102)) and <a name=cite-Medini2005></a>([Medini, Donati, Tettelin, Masignani, and Rappuoli, 2005](http://www.sciencedirect.com/science/article/pii/S0959437X05001759))
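
The definition above maps directly onto set operations. Here is a minimal sketch, assuming each strain is described by a set of gene family identifiers; the function name, the strain names and the gene names are invented for illustration.

```python
from collections import Counter


def pan_core_dispensable(genomes):
    """genomes: dict mapping a strain name to its set of gene families."""
    gene_sets = list(genomes.values())
    # Pan-genome: every gene seen in at least one strain.
    pan = set().union(*gene_sets)
    # Core genome: genes present in all strains.
    core = set.intersection(*gene_sets) if gene_sets else set()
    # Dispensable genome: everything that is not core.
    dispensable = pan - core
    # Strain-specific genes: found in exactly one strain.
    counts = Counter(gene for genes in gene_sets for gene in genes)
    unique = {gene for gene, n in counts.items() if n == 1}
    return pan, core, dispensable, unique


# Toy dataset (hypothetical gene content).
strains = {
    "strain_A": {"dnaA", "gyrB", "capA"},
    "strain_B": {"dnaA", "gyrB", "capA", "tetM"},
    "strain_C": {"dnaA", "gyrB", "phage1"},
}
pan, core, dispensable, unique = pan_core_dispensable(strains)
```

In practice the hard part is upstream: deciding which genes from different strains belong to the same family.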
---

#The microbial pan-genome
- Definition refinement by Koonin (2008) and Collins (2012): the 3 classes of prokaryotic genes
  * **core (or persistent) genes**: a small fraction of highly conserved genes 
  * **shell genes**: a larger set of moderately conserved genes 
  * **cloud genes**: (nearly) unique genes

.pull-left[
<div class="figure" style="text-align: center">
<img src="images/CoreGenome.jpg" alt="Streptococcus group B core genome" width="60%" />
<p class="caption">Streptococcus group B core genome</p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Koonin_panGenes_classes.jpg" alt="A Common and rare genes in selected archaeal and bacterial genomes. Red, core; green, shell; light gray, cloud; dark gray, ORFans." width="40%" />
<p class="caption">A Common and rare genes in selected archaeal and bacterial genomes. Red, core; green, shell; light gray, cloud; dark gray, ORFans.</p>
</div>
]

Source : <a name=cite-Koonin2008></a>([Koonin and Wolf, 2008](https://doi.org/10.1093/nar/gkn668))  
Source : <a name=cite-Collins2012></a>([Collins and Higgs, 2012](https://doi.org/10.1093/molbev/mss163))
---
#Open or closed pan-genome
- Some bacterial species are considered to have an unlimited gene repertoire => **open pan-genome** 
- Other species seem to be limited to a maximum number of genes in their gene pool => **closed pan-genome** 
- Authors fit a **power law (Heaps' law)** to the total number of genes (pan-genome) as a function of the number of sequenced genomes 
 
 
.pull-left[
<div class="figure" style="text-align: center">
<img src="images/open_closed_pangenome.png" alt="Open and closed pangenomes" width="60%" />
<p class="caption">Open and closed pangenomes</p>
</div>
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/OpenAndClosePanGenomes.jpg" alt="Power law regression for species with open and closed pan-genomes. Red curves indicate closed pan-genomes, green curves indicate open ones." width="60%" />
<p class="caption">Power law regression for species with open and closed pan-genomes. Red curves indicate closed pan-genomes, green curves indicate open ones.</p>
</div>
]

Source : <a name=cite-Tettelin2008></a>([Tettelin, Riley, Cattuto, and Medini, 2008](http://www.sciencedirect.com/science/article/pii/S1369527408001239))
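
Such a fit can be sketched in a few lines of Python. This is a minimal sketch, assuming the pan-genome size follows P(N) = κ·N^γ and estimating κ and γ by ordinary least squares on log-transformed values; the function name `fit_power_law` is hypothetical and the data points are synthetic. Under this formulation, an exponent γ clearly above 0 suggests that new genes keep being added (open pan-genome), while γ close to 0 suggests saturation (closed pan-genome).

```python
import math


def fit_power_law(n_genomes, pan_sizes):
    """Fit pan_size = kappa * n_genomes ** gamma by log-log linear regression."""
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(p) for p in pan_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                       # slope in log-log space
    kappa = math.exp(mean_y - gamma * mean_x)  # intercept back-transformed
    return kappa, gamma


# Synthetic data generated from a known power law (not real measurements).
genome_counts = list(range(1, 9))
synthetic_sizes = [2000 * n ** 0.3 for n in genome_counts]
kappa, gamma = fit_power_law(genome_counts, synthetic_sizes)
```

With real data, the pan-genome size for each N is usually averaged over many random orderings of the genomes before fitting.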

---

# Roary: rapid large-scale prokaryote pan genome analysis

Roary, the pan genome pipeline, takes *closely related* annotated genomes in GFF3 file format and calculates the pan genome.

.pull-left[
## Input:
* annotated genomes in **GFF3** format
  - Roary is *very* sensitive to the validity of the GFF format
    - GFFs generated by **Prokka** are valid
    - Locus tags must be unique across datasets.
    - GFF from NCBI are **invalid** (sequence is missing)
      - Must be converted from Genbank using "Genbank to GFF3" converter
]
--
.pull-right[
# What does Roary do ?
  - converts annotated coding sequences (CDS) into protein sequences
  - clusters these protein sequences iteratively by several methods (CD-HIT, all-vs-all BLASTP)
  - further refines clusters into orthologous genes
  - for each sample, determines if a gene is present/absent
  - uses this information to build a tree, using FastTree
  - overall, calculates the number of genes that are shared, and unique
  - optionally does an alignment of the core genes for downstream analyses
]
---

# Roary workflow
<img src="images/roary_wkfw.png" width="60%" style="display: block; margin: auto;" />
---
# Practical : Roary your dataset !

## Apply Roary to the 16 assemblies of your dataset.
.pull-left[
* Input : 
  - the 16 gff files
* Parameters:
  - All the output files selected
  - No specific parameter (-split-paralog to "yes")
]
.pull-right[
<img src="images/roary-galaxy.png" width="60%" style="display: block; margin: auto;" />
]
---
## Roary outputs
.pull-left[
* *Summary statistics* about the number of genes in the core/pan/accessory genomes
* *Gene Presence Absence*: lists each cluster of genes, the most common annotation within the cluster and which genomes it is present in.  
* *Core gene alignment*: a multiple alignment file of the core genes created using PRANK
* *Clustered Proteins*: a file that gives, for each cluster id, the list of locus tags it is made of
* *pan-genome reference*: this fasta file contains a single nucleotide sequence (representative) from each of the clusters in the pan genome
* Various other files in R or CSV format.
]
.pull-right[
<img src="images/roary-output.png" width="80%" style="display: block; margin: auto;" />
]
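
The *Gene Presence Absence* table can be explored with a short script. The sketch below rests on several assumptions: the file is a CSV with one row per gene cluster, a `Gene` column holding the cluster name, and one column per genome that is empty when the gene is absent. The function name, the strain column names and the toy rows are hypothetical, and "core" is taken here in the strict sense (present in every genome).

```python
import csv
import io


def summarize_presence_absence(handle, samples):
    """Split gene clusters into strict core and accessory lists."""
    core, accessory = [], []
    for row in csv.DictReader(handle):
        # A non-empty cell means the gene is present in that genome.
        present = [s for s in samples if (row.get(s) or "").strip()]
        if len(present) == len(samples):
            core.append(row["Gene"])
        elif present:
            accessory.append(row["Gene"])
    return core, accessory


# Toy table standing in for a real Roary output file (hypothetical content).
toy_csv = io.StringIO(
    "Gene,Annotation,strain1,strain2,strain3\n"
    "dnaA,replication initiator,s1_0001,s2_0001,s3_0001\n"
    "tetM,tetracycline resistance,,s2_0450,\n"
)
core_genes, accessory_genes = summarize_presence_absence(
    toy_csv, ["strain1", "strain2", "strain3"]
)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")` and pass the actual genome column names.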
---

![coffee-break](https://www.nicepng.com/png/full/66-667402_coffee-break-logo-png.png)

---
class: heading-slide, middle, center
# Phylogenomics basics

---
#A few concepts on phylogenomics 
- Phylogenomics definition
<div class="figure" style="text-align: center">
<img src="images/Phylogenomics_definition.png" alt="Wikipedia phylogenomics definition" width="60%" />
<p class="caption">Wikipedia phylogenomics definition</p>
</div>

---
#A few concepts on phylogenomics 
- Original definition 
 + The application of phylogenetic methods for gene function analysis (Eisen, 1996)
 +  Organism evolution based on whole genome analyses 
- Recent usage: Various types of studies mixing genomics and phylogenetics, such as: 
 + Global patterns of synteny (conserved gene order) across species  
 + Global patterns of gene presence and absence studies across species  
 + Genome rearrangement analyses 
 + Analyses of DNA substitution patterns in noncoding regions 
 + Genomic epidemiological studies
 + ... 
- These analyses can be used to understand metabolism, pathogenicity, physiology, behavior, speciation...

Reference: <a name=cite-Eisen2003></a>([Eisen and Fraser, 2003](https://science.sciencemag.org/content/300/5626/1706))

---
#Some basics about phylogenetic tree reconstruction methods
3 main methods:
- Neighbor-Joining (distance matrix)
- Parsimony (presence/absence patterns)
- Maximum likelihood method (alignment)

<div class="figure" style="text-align: center">
<img src="images/Phylogenetics_methods.jpg" alt="Phylogenetics main methods" width="60%" />
<p class="caption">Phylogenetics main methods</p>
</div>

Reference: <a name=cite-Sleator2015></a>([Sleator, 2015](https://doi.org/10.1007/978-1-4899-7478-5_708))

---
# The Newick tree format
*Newick* is a text-based format for representing trees in computer-readable form using (nested) parentheses and commas
.pull-left[
- The tree ends with a semicolon
- Interior nodes are represented by a pair of matched parentheses enclosing their child nodes, which are separated by commas
- Branch lengths are given as a real number placed after a node and preceded by a colon
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/NewickFormat.png" alt="The Newick format" width="90%" />
<p class="caption">The Newick format</p>
</div>
]

Reference: <a name=cite-Stephens2016></a>([Stephens, Bhattacharya, Ragan, and Chan, 2016](https://doi.org/10.7717/peerj.2038))
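
The three rules above are enough to write a small parser. The sketch below is a simplified illustration, assuming a Newick string without quoted labels, comments or whitespace inside the tree; the function names and the dictionary layout are choices made for this example, not a standard API.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dictionaries."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with a semicolon")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip the opening parenthesis
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # skip the closing parenthesis
        # Optional node label (leaf name or internal label).
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length, introduced by a colon.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text after the tree")
    return root


def leaf_names(node):
    """List leaf names from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
```

For real analyses, an established library is a safer choice than a hand-written parser.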

---
#FastTree:  Approximately Maximum-Likelihood Trees for Large Alignments

FastTree 2 allows the inference of maximum-likelihood phylogenies for huge alignments
- Can deal with core-gene or core-genome alignments
- Can deal with hundreds of thousands of sequences
- Relies on robust Maximum-Likelihood statistical models
- Computes local support values with the Shimodaira-Hasegawa test to estimate the reliability of each split in the tree

FastTree in practice:
- takes as input an alignment file (Fasta or Phylip interleaved format)
- needs an evolution model: JTT or WAG or LG for protein, JC or GTR for nucleotide
- produces a tree in Newick format with SH support values [0-1] given as names for the internal nodes

http://www.microbesonline.org/fasttree/
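
Since the support values are stored as internal node names, they can be pulled out of the Newick string directly. This is a rough sketch, assuming supports are plain decimal numbers written right after a closing parenthesis; the function names, the 0.7 threshold and the toy tree are assumptions for illustration, not FastTree recommendations.

```python
import re

# A number that directly follows ")" and precedes ":", ",", ")" or ";".
SUPPORT_PATTERN = re.compile(r"\)([0-9]*\.?[0-9]+)(?=[:,);])")


def support_values(newick):
    """Return the support values of internal nodes, in order of appearance."""
    return [float(value) for value in SUPPORT_PATTERN.findall(newick)]


def weak_splits(newick, threshold=0.7):
    """Return the support values that fall below the chosen threshold."""
    return [value for value in support_values(newick) if value < threshold]


# Toy tree with two supported internal nodes (hypothetical values).
toy_tree = "((A:0.1,B:0.2)0.95:0.05,(C:0.3,D:0.1)0.42:0.02);"
supports = support_values(toy_tree)
weak = weak_splits(toy_tree)
```

Splits with low support deserve caution when interpreting the tree.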

---
#FastTree:  practice
.pull-left[
Use **Galaxy-FastTree** to build a maximum-likelihood tree on the aligned core genes
- input: the *Roary core genome alignment* file in fasta format
- Choose *Nucleotide alignment*
- Choose *GTR+CAT nucleotide evolution model*
]

.pull-left[
<div class="figure" style="text-align: center">
<img src="images/Galaxy-FastTree.png" alt="Galaxy-FastTree" width="1000%" />
<p class="caption">Galaxy-FastTree</p>
</div>
]

---
class: center, middle
# How can I add metadata to my tree and view results ?
# The Phandango viewer

---
# Phandango: an interactive viewer for bacterial population genomics
- runs directly in a web browser (drag files to upload data)
- many possible inputs, such as a phylogenetic tree (Newick format), pan-genome data (from Roary for instance), genome annotations (GFF3 format) or any metadata (in simple CSV format)
- a valuable resource for results interpretation

<div class="figure" style="text-align: center">
<img src="images/Phandango.png" alt="Phandango" width="60%" />
<p class="caption">Phandango</p>
</div>

https://jameshadfield.github.io/phandango/#/

Reference:<a name=cite-Hadfield2017></a>([Hadfield, Croucher, Goater, Abudahab, Aanensen, and Harris, 2017](https://doi.org/10.1093/bioinformatics/btx610))

---
# Phandango: practice
Open https://jameshadfield.github.io/phandango/#/ in a web browser on your local computer

.pull-left[
Upload 3 data files just by dragging them:
- the Roary gene presence-absence file
- the Roary phylogenetic tree (change the file extension to *.tree*)
- a metadata CSV file: DatasetSalmonella_metadata.csv

Interpret the results
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Phandango_salmonella_2.png" alt="Phandango results on the Salmonella dataset" width="80%" />
<p class="caption">Phandango results on the Salmonella dataset</p>
</div>
]

---
#FastTree results interpretation using Phandango
.pull-left[
Upload the following files:
- the FastTree phylogenetic tree (change the file extension to *.tree*)
- the metadata CSV file: DatasetSalmonella_metadata.csv

Interpret the results
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="images/Phadango_FastTree.png" alt="FastTree result" width="100%" />
<p class="caption">FastTree result</p>
</div>
]
  
---
#Take home message

- Genome comparison is still an active field of bioinformatics research

- Building the dataset and evaluating its quality and diversity is a **mandatory** first step and may be time-consuming
 
- Dataset de-replication may be helpful for some well-studied organisms

- Comparative strategy depends on the addressed question and on the genome diversity level 
 
- Phylogenomics approaches are powerful and promising 
---

#T H A N K      
  
#Y O U

---

# References
<a name=bib-Collins2012></a>[Collins, R. E. and P. G.
Higgs](#cite-Collins2012) (2012). "Testing the Infinitely Many Genes
Model for the Evolution of the Bacterial Core Genome and Pangenome".
In: _Molecular Biology and Evolution_ 29.11, pp. 3413-3425. ISSN:
0737-4038. DOI:
[10.1093/molbev/mss163](https://doi.org/10.1093%2Fmolbev%2Fmss163).
eprint:
https://academic.oup.com/mbe/article-pdf/29/11/3413/13648372/mss163.pdf.
URL:
[https://doi.org/10.1093/molbev/mss163](https://doi.org/10.1093/molbev/mss163).

<a name=bib-Darling></a>[Darling, A., B. Mau, F. Blattner, et
al.](#cite-Darling) (2004). "Mauve: multiple alignment of conserved
genomic sequence with rearrangements". In: _Genome Research_ 14(7), pp.
1394-1403. DOI:
[10.1101/gr.2289704](https://doi.org/10.1101%2Fgr.2289704).

<a name=bib-Delcher1999></a>[Delcher, A., S. Kasif, R. Fleischmann, et
al.](#cite-Delcher1999) (1999). "Alignment of whole genomes". In:
_Nucleic Acids Res_ 27(11), pp. 2369-2376. DOI:
[10.1093/nar/27.11.2369](https://doi.org/10.1093%2Fnar%2F27.11.2369).

<a name=bib-Eisen2003></a>[Eisen, J. A. and C. M.
Fraser](#cite-Eisen2003) (2003). "Phylogenomics: Intersection of
Evolution and Genomics". In: _Science_ 300.5626, pp. 1706-1707. ISSN:
0036-8075. DOI:
[10.1126/science.1086292](https://doi.org/10.1126%2Fscience.1086292).
eprint: https://science.sciencemag.org/content/300/5626/1706.full.pdf.
URL:
[https://science.sciencemag.org/content/300/5626/1706](https://science.sciencemag.org/content/300/5626/1706).

<a name=bib-Gurevich2013></a>[Gurevich, A., V. Saveliev, N. Vyahhi, et
al.](#cite-Gurevich2013) (2013). "QUAST: quality assessment tool for
genome assemblies". In: _Bioinformatics_ 29.8, pp. 1072-1075. ISSN:
1367-4803. DOI:
[10.1093/bioinformatics/btt086](https://doi.org/10.1093%2Fbioinformatics%2Fbtt086).
eprint:
https://academic.oup.com/bioinformatics/article-pdf/29/8/1072/17106244/btt086.pdf.
URL:
[https://doi.org/10.1093/bioinformatics/btt086](https://doi.org/10.1093/bioinformatics/btt086).
---

# References(2)
<a name=bib-Hadfield2017></a>[Hadfield, J., N. J. Croucher, R. J.
Goater, et al.](#cite-Hadfield2017) (2017). "Phandango: an interactive
viewer for bacterial population genomics". In: _Bioinformatics_ 34.2,
pp. 292-293. ISSN: 1367-4803. DOI:
[10.1093/bioinformatics/btx610](https://doi.org/10.1093%2Fbioinformatics%2Fbtx610).
URL:
[https://doi.org/10.1093/bioinformatics/btx610](https://doi.org/10.1093/bioinformatics/btx610).

<a name=bib-Konstantinidis2567></a>[Konstantinidis, K. T. and J. M.
Tiedje](#cite-Konstantinidis2567) (2005). "Genomic insights that
advance the species definition for prokaryotes". In: _Proceedings of
the National Academy of Sciences_ 102.7, pp. 2567-2572. ISSN:
0027-8424. DOI:
[10.1073/pnas.0409727102](https://doi.org/10.1073%2Fpnas.0409727102).
eprint: https://www.pnas.org/content/102/7/2567.full.pdf. URL:
[https://www.pnas.org/content/102/7/2567](https://www.pnas.org/content/102/7/2567).

<a name=bib-Koonin2008></a>[Koonin, E. and Y. Wolf](#cite-Koonin2008)
(2008). "Genomics of bacteria and archaea: the emerging dynamic view of
the prokaryotic world". In: _Nucleic Acids Res_ 36(21), pp. 6688-6719.
DOI: [10.1093/nar/gkn668](https://doi.org/10.1093%2Fnar%2Fgkn668).

<a name=bib-Medini2005></a>[Medini, D., C. Donati, H. Tettelin, et
al.](#cite-Medini2005) (2005). "The microbial pan-genome". In: _Current
Opinion in Genetics & Development_ 15.6. Genomes and evolution, pp.
589-594. DOI:
[10.1016/j.gde.2005.09.006](https://doi.org/10.1016%2Fj.gde.2005.09.006).
URL:
[http://www.sciencedirect.com/science/article/pii/S0959437X05001759](http://www.sciencedirect.com/science/article/pii/S0959437X05001759).
---

# References(3)
<a name=bib-Olm></a>[Olm, M. R., C. T. Brown, B. Brooks, et
al.](#cite-Olm) (2017). "dRep: a tool for fast and accurate genomic
comparisons that enables improved genome recovery from metagenomes
through de-replication". In: _The ISME Journal_ 11.12, pp. 2864-2868.
DOI:
[10.1038/ismej.2017.126](https://doi.org/10.1038%2Fismej.2017.126).
URL:
[https://doi.org/10.1038/ismej.2017.126](https://doi.org/10.1038/ismej.2017.126).

<a name=bib-Ondov></a>[Ondov, B. D., T. J. Treangen, P. Melsted, et
al.](#cite-Ondov) (2016). "Mash: fast genome and metagenome distance
estimation using MinHash". In: _Genome Biology_ 17.1, p. 132. DOI:
[10.1186/s13059-016-0997-x](https://doi.org/10.1186%2Fs13059-016-0997-x).
URL:
[https://doi.org/10.1186/s13059-016-0997-x](https://doi.org/10.1186/s13059-016-0997-x).

<a name=bib-Sleator2015></a>[Sleator, R.](#cite-Sleator2015) (2015).
"Phylogenetics, Overview". In: _Encyclopedia of Metagenomics: Genes,
Genomes and Metagenomes: Basics, Methods, Databases and Tools_. Ed. by
K. E. Nelson. Boston, MA: Springer US, pp. 577-582. ISBN:
978-1-4899-7478-5. DOI:
[10.1007/978-1-4899-7478-5_708](https://doi.org/10.1007%2F978-1-4899-7478-5_708).
URL:
[https://doi.org/10.1007/978-1-4899-7478-5_708](https://doi.org/10.1007/978-1-4899-7478-5_708).

<a name=bib-Stephens2016></a>[Stephens, T. G., D. Bhattacharya, M. A.
Ragan, et al.](#cite-Stephens2016) (2016). "PhySortR: a fast, flexible
tool for sorting phylogenetic trees in R". In: _PeerJ_ 4, p. e2038.
ISSN: 2167-8359. DOI:
[10.7717/peerj.2038](https://doi.org/10.7717%2Fpeerj.2038). URL:
[https://doi.org/10.7717/peerj.2038](https://doi.org/10.7717/peerj.2038).

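<a name=bib-Tettelin2005></a>[Tettelin, H., V. Masignani, M. J.
Cieslewicz, et al.](#cite-Tettelin2005) (2005). "Genome analysis of
multiple pathogenic isolates of Streptococcus agalactiae: Implications
for the microbial "pan-genome"". In: _Proc Natl Acad Sci U S
A_ 102(39), pp. 13950-13955. DOI:
[10.1073/pnas.0506758102](https://doi.org/10.1073%2Fpnas.0506758102).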