Analysis of "shotgun" metagenomic data

This training course is dedicated to the analysis of prokaryotic "shotgun" metagenomic data produced with Illumina sequencing technology. We will present the bioinformatics steps needed to clean the raw data, characterize them taxonomically, and compare them by their word (k-mer) content. We will then cover the different strategies for obtaining counts on predicted genes. Finally, we will present some tools for obtaining a functional annotation of the samples. By the end of the 2-day course, trainees will know the scope, advantages, and limitations of shotgun sequencing data analysis. They will be able to use the tools presented on the course datasets and to identify the tools and methods suited to their own analyses.

Olivier Rué https://migale.inrae.fr (MaIAGE - Migale - INRAE), Cédric Midoux https://migale.inrae.fr (PROSE - INRAE), Valentin Loux https://migale.inrae.fr (MaIAGE - Migale - INRAE)
2022-06-20

See the training brochure

Prerequisites: participants must know how to use a computing cluster and the command line.


Links to the materials for this training are available through the accompanying slides.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".