Overview
More and more bioinformatics analysis pipelines are available for metagenetics (i.e. targeted metagenomics) analyses. They often require advanced bioinformatics skills and computing resources, and can dishearten users unfamiliar with the diversity of existing analytical processes. Each pipeline comes with its own guidelines, configuration profiles, reference databases and recommended analytical steps. Without a formal evaluation, choosing a pipeline with a suitable set of parameters and algorithms for a given application can quickly become a brainteaser.
We introduce below the first evaluation protocol to compare metagenetics pipelines in their entirety.
This evaluation protocol includes universal comparison metrics, such as clustering and diversity indices, and both simulated and real datasets to objectively evaluate the performance of a pipeline. Within this evaluation protocol, we evaluated a total of 6 pipelines: mothur, QIIME and BMP, which implement OTU clustering approaches, and kraken, CLARK and One Codex, emerging solutions based on discriminative k-mers that were developed for shotgun metagenomics and had never been evaluated on metagenetics datasets.
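Comparison metrics such as the diversity indices mentioned above can be computed directly from per-taxon read counts. Below is a minimal sketch of the Shannon index on a hypothetical count table (all taxon names and counts are made up for illustration):

```shell
# Hypothetical two-column table: taxon name, read count (tab-separated).
counts=$(printf 'taxonA\t50\ntaxonB\t30\ntaxonC\t20\n')

# Shannon index H' = -sum(p_i * ln(p_i)) over relative abundances p_i.
shannon=$(printf '%s\n' "$counts" | awk -F'\t' '
  { c[NR] = $2; total += $2 }
  END { for (i in c) { p = c[i] / total; H -= p * log(p) } printf "%.4f", H }')
echo "Shannon index: $shannon"
```

Any pipeline's OTU or taxon abundance table, once reduced to such a two-column form, can be fed through the same computation.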
This work has been published in the following paper: Siegwald et al., Assessment of Common and Emerging Bioinformatics Pipelines for Targeted Metagenomics, PLOS ONE, 2017. It provides a unique resource to help users choose the most appropriate bioinformatics pipeline to judiciously analyse their own metagenetics datasets.
Evaluation protocol datasets
The datasets built for this evaluation protocol are openly available as compressed fastq files: 200(V3), 400(V4-V5) and real.
For the simulated metagenomes, the number of sequences per amplicon in each dataset is described in the following Excel file (one spreadsheet per target): datasets_descriptions.xls
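As a quick sanity check after downloading, the number of reads in a compressed fastq file can be counted. The sketch below builds a tiny mock two-read file standing in for one of the dataset files; any downloaded dataset would substitute for it:

```shell
# Build a tiny mock gzip-compressed fastq (two reads) standing in
# for a downloaded dataset file.
printf '@read_1\nACGT\n+\nIIII\n@read_2\nTTGA\n+\nIIII\n' | gzip > mock.fastq.gz

# Each fastq record spans exactly 4 lines, so reads = lines / 4.
n_reads=$(gzip -dc mock.fastq.gz | wc -l | awk '{ print $1 / 4 }')
echo "reads: $n_reads"
```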
Pipeline execution modes
The following text files contain all the guidelines used for each dataset (symbolized by $dataset) and each database evaluated in our study:
## BMP pipeline
## Based on : http://www.brmicrobiome.org/#!16s-profiling-ion-torrent/cpdg
## Requirements : usearch v7.0.1090, QIIME 1.9.0, uc2otutab.py (UPARSE script)
## Variables
# $file = fastq file
# $cut = trim the sequences at this length ($cut=150 for 200(V3) & real, $cut=300 for 400(V4-V5))
# $ref_db = fasta file of the reference database (97% clustered sequences)
# $ref_aligned = aligned fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
## Output files of interest
# map.uc - Attribution of an OTU (column 9) for each read (column 8)
# output/otus_tax_assignments.txt - Taxonomic assignment (column 2) for each OTU (column 1)
#01 - Quality filtering, length truncation, and conversion to FASTA
usearch -fastq_filter $file -fastq_trunclen $cut -fastaout reads.fa
#02 - Dereplication
usearch -derep_fulllength reads.fa -output derep.fa -sizeout
#03 - Abundance sort and discard singletons
usearch -sortbysize derep.fa -output sorted.fa -minsize 2
#04 - OTU clustering using UPARSE method
usearch -cluster_otus sorted.fa -otus otus.fa
#05 - Map reads back to OTU database
usearch -usearch_global reads.fa -db otus.fa -strand plus -id 0.97 -uc map.uc
#06 - Assign taxonomy to OTUs using uclust method on QIIME (use the file "otus.fa" from UPARSE as input file)
assign_taxonomy.py -i otus.fa -o output -r $ref_db -t $ref_taxo
#07 - Align sequences on QIIME, using greengenes reference sequences (use the file "otus.fa" from UPARSE as input file)
align_seqs.py -i otus.fa -o rep_set_align -t $ref_aligned
#08 - Filter alignments on QIIME
filter_alignment.py -i rep_set_align/otus_aligned.fasta -o filtered_alignment
#09 - Make the reference tree on QIIME
make_phylogeny.py -i filtered_alignment/otus_aligned_pfiltered.fasta -o rep_set.tre
#10 - Convert UC to otu_table.txt
python uc2otutab.py map.uc > otu_table.txt
#11 - Convert otu_table.txt to otu_table.biom, used by QIIME
biom convert -i otu_table.txt -o otu_table.biom --table-type="OTU table" --to-json
#12 - Add metadata (taxonomy) to OTU table
biom add-metadata -i otu_table.biom -o otu_table_tax.biom --observation-metadata-fp output/otus_tax_assignments.txt --observation-header OTUID,taxonomy,confidence --sc-separated taxonomy --float-fields confidence
#13 - Check OTU table on QIIME
biom summarize-table -i otu_table_tax.biom -o results_biom_table
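As a quick check of the BMP output, map.uc can be summarized into reads-per-OTU counts. The sketch below uses mock hit lines that follow the column layout described above (read label in column 8, OTU label in column 9); real usearch .uc output also contains other record types (filtered here on the "H" flag) and should be verified against this layout before reuse:

```shell
# Mock map.uc hit lines, following the column layout described above:
# read label in column 8, OTU label in column 9 (an assumption taken
# from the output-file notes, not from real usearch output).
uc=$(printf 'H\t0\t150\t99.3\t+\t0\t0\tread_1\tOTU_1\nH\t0\t150\t97.5\t+\t0\t0\tread_2\tOTU_1\nH\t1\t150\t98.0\t+\t0\t0\tread_3\tOTU_2\n')

# Count the reads mapped to each OTU, one "OTU=count" pair per line,
# then join the sorted pairs with commas.
otu_counts=$(printf '%s\n' "$uc" |
  awk -F'\t' '$1 == "H" { n[$9]++ } END { for (o in n) printf "%s=%d\n", o, n[o] }' |
  sort | paste -sd, -)
echo "$otu_counts"
```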
## mothur pipeline
## Based on : http://www.mothur.org/wiki/454_SOP
## Requirements - mothur 1.35.1
## Variables
# $file = fastq file
# $filename = fastq file name (no .fastq extension)
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $ref_db = fasta file of the reference database (99% clustered sequences)
# $ref_aligned = aligned fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
# $rdp_fasta = mothur-formatted fasta file of the RDP training set (in our case, $rdp_fasta=trainset9_032012.pds.fasta)
# $rdp_taxo = taxonomy associated with the RDP training set (in our case, $rdp_taxo=trainset9_032012.pds.tax)
## Output files of interest
# $filename.final.an.0.03.fasta - Fasta file of each read with its allocated OTU (read description : ">read_number \t OTU")
# $filename.final.an.0.03.cons.taxonomy - Taxonomic assignment (column 3) for each OTU (column 1)
#01 - Convert the fastq to fasta
mothur "#fastq.info(fastq=$file, fasta=T)"
#02 - Trim the reads
mothur "#trim.seqs(fasta=$filename.fasta, pdiffs=2, bdiffs=1, minlength=100, maxhomop=8, processors=$nb_threads)"
#03 - Get a summary
mothur "#summary.seqs(fasta=$filename.trim.fasta)"
#04 - Get the unique sequences
mothur "#unique.seqs(fasta=$filename.trim.fasta)"
#05 - Align against the database
mothur "#align.seqs(fasta=$filename.trim.unique.fasta, flip=T, reference=$ref_aligned, processors=$nb_threads)"
#06 - Summary of the alignment
mothur "#summary.seqs(fasta=$filename.trim.unique.align, name=$filename.trim.names)"
#07 - Trim the alignment - Warning : check that the optimize criteria are chosen correctly based on the previous alignment summary; change start and end values if needed
mothur "#screen.seqs(fasta=$filename.trim.unique.align, name=$filename.trim.names, optimize=start-end-minlength, processors=$nb_threads)"
#08 - Alignment filtering
mothur "#filter.seqs(fasta=$filename.trim.unique.good.align, trump=., vertical=T, processors=$nb_threads)"
#09 - Get the unique sequences
mothur "#unique.seqs(fasta=$filename.trim.unique.good.filter.fasta, name=$filename.trim.good.names)"
#10 - New summary
mothur "#summary.seqs(fasta=$filename.trim.unique.good.filter.unique.fasta)"
#11 - Preclustering
mothur "#pre.cluster(fasta=$filename.trim.unique.good.filter.unique.fasta, name=$filename.trim.unique.good.filter.names, diffs=2)"
#12 - Create new variables because of the long file names
$preclustered_fasta="$filename.trim.unique.good.filter.unique.precluster.fasta"
$preclustered_names="$filename.trim.unique.good.filter.unique.precluster.names"
$taxonomy_file="$filename.trim.unique.good.filter.unique.precluster.xxx.taxonomy" # xxx has to be replaced according to the name of the taxonomy file generated at this stage (depends on the database file name)
#13 - Classify all preclustered reads
mothur "#classify.seqs(fasta=$preclustered_fasta, name=$preclustered_names, template=$rdp_fasta, taxonomy=$rdp_taxo, cutoff=80, processors=$nb_threads)"
#14 - Remove non-bacteria
mothur "#remove.lineage(fasta=$preclustered_fasta, name=$preclustered_names, taxonomy=$taxonomy_file, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)"
#15 - New summary
mothur "#summary.seqs(fasta=$filename.trim.unique.good.filter.unique.precluster.pick.fasta)"
#16 - Reorganize files
mv $preclustered_names $filename.final.names
mv $preclustered_fasta $filename.final.fasta
mv $taxonomy_file $filename.final.taxonomy
#17 - Distance matrix calculation
mothur "#dist.seqs(fasta=$filename.final.fasta, cutoff=0.20, processors=$nb_threads)"
#18 - Clustering into OTUs
mothur "#cluster.split(column=$filename.final.dist, name=$filename.final.names, cutoff=0.10, processors=$nb_threads)"
#19 - Classify OTUs
mothur "#classify.otu(list=$filename.final.an.list, name=$filename.final.names, taxonomy=$filename.final.taxonomy, label=0.03)"
#20 - Generate a fasta file for all reads with their OTU attribution
mothur "#bin.seqs(list=$filename.final.an.list, fasta=$filename.fasta, name=$filename.final.names, label=0.03)"
## QIIME SortMeRNA + SUMACLUST pipeline
## Requirements - QIIME 1.9.0
## Source : http://qiime.org/scripts/pick_open_reference_otus.html
## Variables
# $file = fastq file
# $filename = fastq file name (no .fastq extension)
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $ref_db = fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
# $outdir = output directory
## Output files of interest
# outdir/final_otu_map_mc2.txt - OTUs (column 1) and the reads they contain (all other columns)
# outdir/uclust_assigned_taxonomy/rep_set_tax_assignments.txt - Taxonomic assignment (column 2) for each OTU (column 1)
#01 - Create a parameters.txt file containing the following lines:
assign_taxonomy:reference_seqs_fp $ref_db
assign_taxonomy:id_to_taxonomy_fp $ref_taxo
#02 - Generate a fasta file
convert_fastaqual_fastq.py -c fastq_to_fastaqual -f $file -o .
#03 - Execute the open reference OTU picking pipeline
pick_open_reference_otus.py -p parameters.txt -i $filename.fna -m sortmerna_sumaclust -a --jobs_to_start $nb_threads -o $outdir -r $ref_db
## QIIME UCLUST pipeline
## Requirements - QIIME 1.9.0
## Source : http://qiime.org/scripts/pick_open_reference_otus.html
## Variables
# $file = fastq file
# $filename = fastq file name (no .fastq extension)
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $ref_db = fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
# $outdir = output directory
## Output files of interest
# outdir/final_otu_map_mc2.txt - OTUs (column 1) and the reads they contain (all other columns)
# outdir/uclust_assigned_taxonomy/rep_set_tax_assignments.txt - Taxonomic assignment (column 2) for each OTU (column 1)
#01 - Create a parameters.txt file containing the following lines:
pick_otus:enable_rev_strand_match True
assign_taxonomy:reference_seqs_fp $ref_db
assign_taxonomy:id_to_taxonomy_fp $ref_taxo
#02 - Generate a fasta file
convert_fastaqual_fastq.py -c fastq_to_fastaqual -f $file -o .
#03 - Execute the open reference OTU picking pipeline
pick_open_reference_otus.py -p parameters.txt -i $filename.fna -a --jobs_to_start $nb_threads -o $outdir -r $ref_db
## kraken pipeline
## Source : https://ccb.jhu.edu/software/kraken/
## Requirements : kraken 0.10.5-beta
## Variables
# $file = fastq file
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $db = directory of the kraken k-mer database (in our case, minikraken_20141208)
## Output files of interest
# output.txt - Attribution of a taxid (column 3) for each read (column 2)
#01 - Classify the reads
kraken --db $db --fastq-input $file --threads $nb_threads > output.txt
## CLARK pipeline
## Source : http://clark.cs.ucr.edu/
## Requirements : CLARK v1.1.2
## Variables
# $file = fastq file
# $nb_threads = number of threads (in our case, $nb_threads=32)
## Output files of interest
# output.txt - Attribution of a taxid (column 3) for each read (column 1)
#01 - Build the k-mer database (Bacteria from RefSeq at the genus level)
set_targets.sh ref Bacteria --genus
#02 - Classify the reads
classify_metagenome.sh -O $file -R output.txt -n $nb_threads
## One Codex
## Source : https://www.onecodex.com/
## Requirements - Internet connection & browser
# Sign up on www.onecodex.com, then upload your data.
## Output files of interest
# Read-level data - Attribution of a taxid (column 2) for each read (column 1)
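kraken, CLARK and One Codex all produce a per-read taxid table, so their outputs can be tallied the same way. Below is a minimal sketch on mock data in the two-column One Codex layout described above; for kraken or CLARK, point awk at the taxid column noted in their sections, and note that treating taxid 0 as "unclassified" is an assumption of this sketch:

```shell
# Mock read-level output: read identifier in column 1, taxid in column 2
# (the One Codex layout noted above); taxid 0 stands in for an
# unclassified read, which is an assumption for this sketch.
reads=$(printf 'read_1\t562\nread_2\t1280\nread_3\t562\nread_4\t0\n')

# Count the classified reads per taxid, one "taxid=count" pair per line,
# then join the sorted pairs with commas.
tally=$(printf '%s\n' "$reads" |
  awk -F'\t' '$2 != 0 { n[$2]++ } END { for (t in n) printf "%s=%d\n", t, n[t] }' |
  sort | paste -sd, -)
echo "$tally"
```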
All guidelines can be downloaded as a single archive: guidelines.zip.