Overview
More and more bioinformatics analysis pipelines are available for metagenetics (i.e. targeted metagenomics) analyses. They often require advanced bioinformatics skills and computing resources, and can dishearten users unfamiliar with the diversity of existing analytical processes. Each pipeline comes with its own guidelines, configuration profiles, reference databases and recommended analytical steps. Without a formal evaluation, choosing a pipeline with a suitable set of parameters and algorithms for a given application can quickly become a brainteaser.
We introduce below the first evaluation protocol to compare metagenetics pipelines in their entirety.
This evaluation protocol includes universal comparison metrics, such as clustering and diversity indices, and both simulated and real datasets to objectively evaluate the performance of a pipeline. Within this evaluation protocol, we evaluated a total of 6 pipelines: mothur, QIIME and BMP, which implement OTU clustering approaches, and kraken, CLARK and One Codex, emerging solutions based on discriminative k-mers that were developed for shotgun metagenomics and had never been evaluated on metagenetics datasets.
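Comparison metrics such as the diversity indices mentioned above can be computed directly from per-taxon read counts. Below is a minimal sketch of the Shannon index on a hypothetical count table (all taxon names and counts are made up for illustration):

```shell
# Hypothetical two-column table: taxon name, read count (tab-separated).
counts=$(printf 'taxonA\t50\ntaxonB\t30\ntaxonC\t20\n')

# Shannon index H' = -sum(p_i * ln(p_i)) over relative abundances p_i.
shannon=$(printf '%s\n' "$counts" | awk -F'\t' '
  { c[NR] = $2; total += $2 }
  END { for (i in c) { p = c[i] / total; H -= p * log(p) } printf "%.4f", H }')
echo "Shannon index: $shannon"
```

Any pipeline's OTU or taxon abundance table, once reduced to such a two-column form, can be fed through the same computation.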
This work has been published in the following paper: Siegwald et al., Assessment of Common and Emerging Bioinformatics Pipelines for Targeted Metagenomics, PLOS ONE, 2017. It provides a unique resource to help users choose the most appropriate bioinformatics pipeline to judiciously analyse their own metagenetics datasets.
Evaluation protocol datasets
The datasets built for this evaluation protocol are openly available as compressed fastq files: 200(V3), 400(V4-V5) and real.
For the simulated metagenomes, the number of sequences per amplicon in each dataset is described in the following Excel file (one spreadsheet per target): datasets_descriptions.xls
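As a quick sanity check after downloading, the number of reads in a compressed fastq file can be counted. The sketch below builds a tiny mock two-read file standing in for one of the dataset files; any downloaded dataset would substitute for it:

```shell
# Build a tiny mock gzip-compressed fastq (two reads) standing in
# for a downloaded dataset file.
printf '@read_1\nACGT\n+\nIIII\n@read_2\nTTGA\n+\nIIII\n' | gzip > mock.fastq.gz

# Each fastq record spans exactly 4 lines, so reads = lines / 4.
n_reads=$(gzip -dc mock.fastq.gz | wc -l | awk '{ print $1 / 4 }')
echo "reads: $n_reads"
```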
Pipeline execution modes
The following text files contain all the guidelines used for each dataset (symbolized by $dataset) and each database evaluated in our study:
## BMP pipeline
## Based on : http://www.brmicrobiome.org/#!16s-profiling-ion-torrent/cpdg
## Requirements : usearch v7.0.1090, QIIME 1.9.0, uc2otutab.py (UPARSE script)
## Variables
# $file = fastq file
# $cut = trim the sequences at this length ($cut=150 for 200(V3) & real, $cut=300 for 400(V4-V5))
# $ref_db = fasta file of the reference database (97% clustered sequences)
# $ref_aligned = aligned fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
## Output files of interest
# map.uc - Attribution of an OTU (column 9) for each read (column 8)
# output/otus_tax_assignments.txt - Taxonomic assignment (column 2) for each OTU (column 1)
#01 - Quality filtering, length truncation, and conversion to FASTA
usearch -fastq_filter $file -fastq_trunclen $cut -fastaout reads.fa
#02 - Dereplication
usearch -derep_fulllength reads.fa -output derep.fa -sizeout
#03 - Abundance sort and discard singletons
usearch -sortbysize derep.fa -output sorted.fa -minsize 2
#04 - OTU clustering using UPARSE method
usearch -cluster_otus sorted.fa -otus otus.fa
#05 - Map reads back to OTU database
usearch -usearch_global reads.fa -db otus.fa -strand plus -id 0.97 -uc map.uc
#06 - Assign taxonomy to OTUs using uclust method on QIIME (use the file "otus.fa" from UPARSE as input file)
assign_taxonomy.py -i otus.fa -o output -r $ref_db -t $ref_taxo
#07 - Align sequences on QIIME, using greengenes reference sequences (use the file "otus.fa" from UPARSE as input file)
align_seqs.py -i otus.fa -o rep_set_align -t $ref_aligned
#08 - Filter alignments on QIIME
filter_alignment.py -i rep_set_align/otus_aligned.fasta -o filtered_alignment
#09 - Make the reference tree on QIIME
make_phylogeny.py -i filtered_alignment/otus_aligned_pfiltered.fasta -o rep_set.tre
#10 - Convert UC to otu_table.txt
python uc2otutab.py map.uc > otu_table.txt
#11 - Convert otu_table.txt to otu_table.biom, used by QIIME
biom convert -i otu_table.txt -o otu_table.biom --table-type="OTU table" --to-json
#12 - Add metadata (taxonomy) to OTU table
biom add-metadata -i otu_table.biom -o otu_table_tax.biom --observation-metadata-fp output/otus_tax_assignments.txt --observation-header OTUID,taxonomy,confidence --sc-separated taxonomy --float-fields confidence
#13 - Check OTU table on QIIME
biom summarize-table -i otu_table_tax.biom -o results_biom_table
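As a quick check of the BMP output, map.uc can be summarized into reads-per-OTU counts. The sketch below uses mock hit lines that follow the column layout described above (read label in column 8, OTU label in column 9); real usearch .uc output also contains other record types (filtered here on the "H" flag) and should be verified against this layout before reuse:

```shell
# Mock map.uc hit lines, following the column layout described above:
# read label in column 8, OTU label in column 9 (an assumption taken
# from the output-file notes, not from real usearch output).
uc=$(printf 'H\t0\t150\t99.3\t+\t0\t0\tread_1\tOTU_1\nH\t0\t150\t97.5\t+\t0\t0\tread_2\tOTU_1\nH\t1\t150\t98.0\t+\t0\t0\tread_3\tOTU_2\n')

# Count the reads mapped to each OTU, one "OTU=count" pair per line,
# then join the sorted pairs with commas.
otu_counts=$(printf '%s\n' "$uc" |
  awk -F'\t' '$1 == "H" { n[$9]++ } END { for (o in n) printf "%s=%d\n", o, n[o] }' |
  sort | paste -sd, -)
echo "$otu_counts"
```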
## mothur pipeline
## Based on : http://www.mothur.org/wiki/454_SOP
## Requirements - mothur 1.35.1
## Variables
# $file = fastq file
# $filename = fastq file name (no .fastq extension)
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $ref_db = fasta file of the reference database (99% clustered sequences)
# $ref_aligned = aligned fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
# $rdp_fasta = mothur-formatted fasta file of the RDP training set (in our case, $rdp_fasta=trainset9_032012.pds.fasta)
# $rdp_taxo = taxonomy associated with the RDP training set (in our case, $rdp_taxo=trainset9_032012.pds.tax)
## Output files of interest
# $filename.final.an.0.03.fasta - Fasta file of each read with its allocated OTU (read description : ">read_number \t OTU")
# $filename.final.an.0.03.cons.taxonomy - Taxonomic assignment (column 3) for each OTU (column 1)
#01 - Convert the fastq to fasta
mothur "#fastq.info(fastq=$file, fasta=T)"
#02 - Trim the reads
mothur "#trim.seqs(fasta=$filename.fasta, pdiffs=2, bdiffs=1, minlength=100, maxhomop=8, processors=$nb_threads)"
#03 - Get a summary
mothur "#summary.seqs(fasta=$filename.trim.fasta)"
#04 - Get the unique sequences
mothur "#unique.seqs(fasta=$filename.trim.fasta)"
#05 - Align against the database
mothur "#align.seqs(fasta=$filename.trim.unique.fasta, flip=T, reference=$ref_aligned, processors=$nb_threads)"
#06 - Summary of the alignment
mothur "#summary.seqs(fasta=$filename.trim.unique.align, name=$filename.trim.names)"
#07 - Trim the alignment - Warning : check that the optimize criteria are chosen correctly based on the previous alignment summary; change start and end values if needed
mothur "#screen.seqs(fasta=$filename.trim.unique.align, name=$filename.trim.names, optimize=start-end-minlength, processors=$nb_threads)"
#08 - Alignment filtering
mothur "#filter.seqs(fasta=$filename.trim.unique.good.align, trump=., vertical=T, processors=$nb_threads)"
#09 - Get the unique sequences
mothur "#unique.seqs(fasta=$filename.trim.unique.good.filter.fasta, name=$filename.trim.good.names)"
#10 - New summary
mothur "#summary.seqs(fasta=$filename.trim.unique.good.filter.unique.fasta)"
#11 - Preclustering
mothur "#pre.cluster(fasta=$filename.trim.unique.good.filter.unique.fasta, name=$filename.trim.unique.good.filter.names, diffs=2)"
#12 - Create new variables because of the long file names
$preclustered_fasta="$filename.trim.unique.good.filter.unique.precluster.fasta"
$preclustered_names="$filename.trim.unique.good.filter.unique.precluster.names"
$taxonomy_file="$filename.trim.unique.good.filter.unique.precluster.xxx.taxonomy" # xxx has to be replaced according to the name of the taxonomy file generated at this stage (depends on the database file name)
#13 - Classify all preclustered reads
mothur "#classify.seqs(fasta=$preclustered_fasta, name=$preclustered_names, template=$rdp_fasta, taxonomy=$rdp_taxo, cutoff=80, processors=$nb_threads)"
#14 - Remove non-bacteria
mothur "#remove.lineage(fasta=$preclustered_fasta, name=$preclustered_names, taxonomy=$taxonomy_file, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)"
#15 - New summary
mothur "#summary.seqs(fasta=$filename.trim.unique.good.filter.unique.precluster.pick.fasta)"
#16 - Reorganize files
mv $preclustered_names $filename.final.names
mv $preclustered_fasta $filename.final.fasta
mv $taxonomy_file $filename.final.taxonomy
#17 - Distance matrix calculation
mothur "#dist.seqs(fasta=$filename.final.fasta, cutoff=0.20, processors=$nb_threads)"
#18 - Clustering into OTUs
mothur "#cluster.split(column=$filename.final.dist, name=$filename.final.names, cutoff=0.10, processors=$nb_threads)"
#19 - Classify OTUs
mothur "#classify.otu(list=$filename.final.an.list, name=$filename.final.names, taxonomy=$filename.final.taxonomy, label=0.03)"
#20 - Generate a fasta file for all reads with their OTU attribution
mothur "#bin.seqs(list=$filename.final.an.list, fasta=$filename.fasta, name=$filename.final.names, label=0.03)"
## QIIME SortMeRNA + SUMACLUST pipeline
## Requirements - QIIME 1.9.0
## Source : http://qiime.org/scripts/pick_open_reference_otus.html
## Variables
# $file = fastq file
# $filename = fastq file name (no .fastq extension)
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $ref_db = fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
# $outdir = output directory
## Output files of interest
# outdir/final_otu_map_mc2.txt - OTUs (column 1) and the reads they contain (all other columns)
# outdir/uclust_assigned_taxonomy/rep_set_tax_assignments.txt - Taxonomic assignment (column 2) for each OTU (column 1)
#01 - Create a parameters.txt file containing the following lines:
assign_taxonomy:reference_seqs_fp $ref_db
assign_taxonomy:id_to_taxonomy_fp $ref_taxo
#02 - Generate a fasta file
convert_fastaqual_fastq.py -c fastq_to_fastaqual -f $file -o .
#03 - Execute the open reference OTU picking pipeline
pick_open_reference_otus.py -p parameters.txt -i $filename.fna -m sortmerna_sumaclust -a --jobs_to_start $nb_threads -o $outdir -r $ref_db
## QIIME UCLUST pipeline
## Requirements - QIIME 1.9.0
## Source : http://qiime.org/scripts/pick_open_reference_otus.html
## Variables
# $file = fastq file
# $filename = fastq file name (no .fastq extension)
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $ref_db = fasta file of the reference database (97% clustered sequences)
# $ref_taxo = taxonomy of the reference database
# $outdir = output directory
## Output files of interest
# outdir/final_otu_map_mc2.txt - OTUs (column 1) and the reads they contain (all other columns)
# outdir/uclust_assigned_taxonomy/rep_set_tax_assignments.txt - Taxonomic assignment (column 2) for each OTU (column 1)
#01 - Create a parameters.txt file containing the following lines:
pick_otus:enable_rev_strand_match True
assign_taxonomy:reference_seqs_fp $ref_db
assign_taxonomy:id_to_taxonomy_fp $ref_taxo
#02 - Generate a fasta file
convert_fastaqual_fastq.py -c fastq_to_fastaqual -f $file -o .
#03 - Execute the open reference OTU picking pipeline
pick_open_reference_otus.py -p parameters.txt -i $filename.fna -a --jobs_to_start $nb_threads -o $outdir -r $ref_db
## kraken pipeline
## Source : https://ccb.jhu.edu/software/kraken/
## Requirements : kraken 0.10.5-beta
## Variables
# $file = fastq file
# $nb_threads = number of threads (in our case, $nb_threads=32)
# $db = directory of the kraken k-mer database (in our case, minikraken_20141208)
## Output files of interest
# output.txt - Attribution of a taxid (column 3) for each read (column 2)
#01 - Classify the reads
kraken --db $db --fastq-input $file --threads $nb_threads > output.txt
## CLARK pipeline
## Source : http://clark.cs.ucr.edu/
## Requirements : CLARK v1.1.2
## Variables
# $file = fastq file
# $nb_threads = number of threads (in our case, $nb_threads=32)
## Output files of interest
# output.txt - Attribution of a taxid (column 3) for each read (column 1)
#01 - Build the k-mer database (Bacteria from RefSeq at the genus level)
set_targets.sh ref Bacteria --genus
#02 - Classify the reads
classify_metagenome.sh -O $file -R output.txt -n $nb_threads
## One Codex
## Source : https://www.onecodex.com/
## Requirements - Internet connection & browser
# Sign up on www.onecodex.com, then upload your data.
## Output files of interest
# Read-level data - Attribution of a taxid (column 2) for each read (column 1)
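kraken, CLARK and One Codex all produce a per-read taxid table, so their outputs can be tallied the same way. Below is a minimal sketch on mock data in the two-column One Codex layout described above; for kraken or CLARK, point awk at the taxid column noted in their sections, and note that treating taxid 0 as "unclassified" is an assumption of this sketch:

```shell
# Mock read-level output: read identifier in column 1, taxid in column 2
# (the One Codex layout noted above); taxid 0 stands in for an
# unclassified read, which is an assumption for this sketch.
reads=$(printf 'read_1\t562\nread_2\t1280\nread_3\t562\nread_4\t0\n')

# Count the classified reads per taxid, one "taxid=count" pair per line,
# then join the sorted pairs with commas.
tally=$(printf '%s\n' "$reads" |
  awk -F'\t' '$2 != 0 { n[$2]++ } END { for (t in n) printf "%s=%d\n", t, n[t] }' |
  sort | paste -sd, -)
echo "$tally"
```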
All guidelines can be downloaded as a single archive: guidelines.zip.