MICRA: User Guide

This page contains the detailed user guide for MICRA.

Minimal set of parameters

The minimal input needed to run MICRA is the FASTQ file(s) containing the reads to analyze and the sequencing technology used (Illumina or Ion Torrent).

The fields are:
  • E-mail address [not mandatory but highly recommended]: the e-mail address is not recorded; it is used only to send a link to access the results.
  • The FASTQ file(s) [mandatory]: the file must contain the reads in FASTQ format, either uncompressed or compressed (.ZIP or .GZ). The reads can be raw or pre-processed (filtered, trimmed, ...) by the user. For paired-end Illumina data, give two files, reads_1.fastq and reads_2.fastq, containing the read pairs in the same order. The maximum size allowed is 1 GB, which corresponds to a standard WGS sequencing run of a microorganism.
  • The sequencing technology [mandatory]: the user has to indicate the sequencing technology used to produce the data, because MICRA does not use the same tools for Illumina reads (BOWTIE2 and SPAdes) and for Ion Torrent reads (SHRiMP2 and MIRA).

To submit the job, press the submit button.

Advanced options

To access the advanced options, check the box "Show advanced options".
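The MICRA advanced options are organized into four parts: pre-processing options, selection of the reference sequences, general options, and options for the "Antibiotic susceptibility and resistance" module.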

Pre-processing options

This section allows the user to configure the pre-processing step, i.e. quality and adapter trimming, by adjusting the following parameters:

  • The "Performs FastQC" checkbox: when this box is checked, FastQC is run and results are saved in the result directory.
  • The "Cutadapt" checkbox: check this box to use cutadapt to perform quality and/or adapter trimming.
  • The "Performs adapter trimming" checkbox: when this box is checked, adapter trimming is performed. Two modes are available:
    • Automatic detection: adapter sequences are automatically detected from the FastQC results. By default, a sequence must be present in 20% of the reads to be detected. This value can be changed with the Threshold field.
    • Give your own list of adapters: a list of adapters can be given for reads1 and for reads2. The adapter sequences must be separated by commas.
  • Minimum quality for trimming: the threshold for quality trimming (default Q20).
  • Minimum read length: the threshold for read length (by default, reads shorter than 20 nt are discarded).
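
As an illustration of how these settings relate to a Cutadapt call, here is a minimal Python sketch that turns a comma-separated adapter list and the two thresholds into a Cutadapt argument list. The function names, the output file names and the adapter sequences are hypothetical, and the exact command line MICRA runs internally is not documented here; only the standard Cutadapt options -a, -A, -q, -m, -o and -p are used.

```python
def parse_adapter_list(text):
    """Split a comma-separated adapter field into a clean list of sequences."""
    return [adapter.strip() for adapter in text.split(",") if adapter.strip()]


def build_cutadapt_args(reads1, out1, reads2=None, out2=None,
                        adapters1=(), adapters2=(),
                        min_quality=20, min_length=20):
    """Build (but do not run) a cutadapt command as an argument list.

    Assumption: this mirrors the form fields above; it is not MICRA's own code.
    """
    if reads2 is not None and out2 is None:
        raise ValueError("paired-end input needs a second output file")
    args = ["cutadapt", "-q", str(min_quality), "-m", str(min_length)]
    for adapter in adapters1:
        args += ["-a", adapter]          # 3' adapter for the first read
    if reads2 is not None:
        for adapter in adapters2:
            args += ["-A", adapter]      # 3' adapter for the second read
    args += ["-o", out1]
    if reads2 is not None:
        args += ["-p", out2]             # output file for the second read
    args.append(reads1)
    if reads2 is not None:
        args.append(reads2)
    return args


# Illustrative single-end call with two example adapter sequences.
single_end = build_cutadapt_args(
    "reads_1.fastq", "trimmed_1.fastq",
    adapters1=parse_adapter_list("AGATCGGAAGAGC, CTGTCTCTTATA"))
print(" ".join(single_end))
```

The resulting list can be passed to subprocess.run if Cutadapt is installed locally.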

Selection of the reference sequences

This section allows the user to choose between two modes for the selection of the reference sequences.

  1. Automatic selection. This is the default mode of MICRA. In this case, MICRA automatically selects the reference sequences by blasting a subset of reads against specific databases. The user can tune several parameters:
    • Type of reference sequences: choose between the "genomes", "plasmids" and "genomes and plasmids" modes. In the last case, MICRA selects both genome and plasmid sequences.
    • Number of reads used for selection of reference genomes: the number of reads used to select the genome sequences. It must be between 100 and 10000.
    • Number of reference genomes to be selected: the number of genome sequences selected and used in the MICRA analysis. It must be between 1 and 10.
    • Number of reads used for selection of reference plasmids: the number of reads used to select the plasmid sequences. It must be between 100 and 100000.
    • Number of reference plasmids to be selected: the number of plasmid sequences selected and used in the MICRA analysis. It must be between 1 and 50.
    • Percentage of covered sequence for a plasmid to be considered in the analysis: this is the cutoff for the plasmid coverage. By default, at least 70% of a plasmid must be covered for it to be considered as probably present and included in the analysis.
  2. Give your own reference sequences. The user can give their own list of reference sequences. In this case, an ID file must be loaded. The valid format is a text file containing 3 fields per line separated by semicolons [;]:
    • accession.version NCBI identifier
    • sequence name: must not contain space or semicolon characters
    • type of sequence: must be "genome" or "plasmid"

    Example of an ID file containing 2 reference genomes and 3 plasmids:

    	CU928145.2;Escherichia-coli-55989:CU928145.2;genome
    	CU928160.2;Escherichia-coli-IAI1:CU928160.2;genome
    	HE603111.1;Escherichia-coli-pHUSEC41-2:HE603111.1;plasmid
    	JN983042.1;Salmonella-enterica-subsp-pSH111-227:JN983042.1;plasmid
    	HE616529.1;Shigella-sonnei-53G-plasmid-A:HE616529.1;plasmid
    	

    If you want to use a personal reference sequence that has no NCBI identifier, please contact us at MICRA_contact

    Once your ID file is loaded, new options are available:

    • You can change the reference ID file by loading a new file.
    • You can force MICRA to consider a given reference genome. Forcing a reference genome means that MICRA performs a complete comparative analysis for this genome, even if it is not the closest reference genome sequence, and produces variant calls, comparative annotations... (see results section for more details about MICRA results).
    • You can force MICRA to consider several reference plasmids by checking the corresponding boxes. Forcing reference plasmids means that MICRA performs a complete comparative analysis for these plasmids, even if they do not reach the minimum coverage cutoff, and produces variant calls, comparative annotations... (see results section for more details about MICRA results).
    • Clicking the "cancel" button deletes the ID file and gives you back access to the initial options for selecting the reference sequences.
    • Percentage of covered sequence for a plasmid to be considered in the analysis: this is the cutoff for the plasmid coverage. By default, at least 70% of a plasmid must be covered for it to be considered as probably present and included in the analysis.
  3. Checking the box "Performs only the selection of the reference genomes?" runs only the selection of the reference sequences, NOT the complete MICRA analysis. MICRA then returns the reference ID file it selected and the corresponding sequence files. These data can then be reused, for example to analyse several samples.
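
Before uploading an ID file (option 2 above), it can help to check it locally. The following Python sketch validates the three semicolon-separated fields described in this guide. The function names are hypothetical, and the regular expression used for the accession.version field is an assumption about what such identifiers look like, not a rule taken from MICRA.

```python
import re

# Assumption: an accession.version identifier is letters, digits or
# underscores, followed by a dot and a version number (e.g. CU928145.2).
ACCESSION_RE = re.compile(r"^[A-Za-z0-9_]+\.\d+$")
VALID_TYPES = ("genome", "plasmid")


def parse_id_line(line):
    """Return (accession, name, type) for one ID-file line, or raise ValueError."""
    fields = line.strip().split(";")
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields separated by ';': {line!r}")
    accession, name, seq_type = fields
    if not ACCESSION_RE.match(accession):
        raise ValueError(f"invalid accession.version identifier: {accession!r}")
    if not name or any(char.isspace() for char in name):
        raise ValueError(f"sequence name must not be empty or contain spaces: {name!r}")
    if seq_type not in VALID_TYPES:
        raise ValueError(f"type must be 'genome' or 'plasmid': {seq_type!r}")
    return accession, name, seq_type


def parse_id_file(text):
    """Parse the whole content of an ID file, skipping blank lines."""
    return [parse_id_line(line) for line in text.splitlines() if line.strip()]


example = (
    "CU928145.2;Escherichia-coli-55989:CU928145.2;genome\n"
    "HE603111.1;Escherichia-coli-pHUSEC41-2:HE603111.1;plasmid\n"
)
records = parse_id_file(example)
print(records)
```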

General options

This section allows the user to adjust several MICRA parameters:

  • The CGView checkbox: when this box is checked, MICRA outputs an SVG picture of the covered annotations against the closest reference genome and plasmids.
  • Low coverage threshold: corresponds to the minimum number of reads mapped at a given position in the reference sequence for which the position is considered as covered. By default, MICRA considers a position as covered when at least 5 reads have been mapped to this position, corresponding to 1/8 of the depth for a 40X sequencing run.
  • Minimum frequency: corresponds to the minimum frequency from which a variant is called. By default, this threshold is equal to 0.9.
  • The high repeat content checkbox: check this box when the sequenced genome contains a high number of repeats. In this case, the mapper parameters are tuned to deal with repeats.
  • The SAM checkbox: check this box to keep the SAM files produced during the MICRA analysis. By default, the mapping files are not saved.
  • Features to consider in GFF: the user can choose to analyse either the CDSs or the genes contained in the GFF annotation file of the reference sequences.
  • The Skip de novo and BLAST steps checkbox: check this box to skip the de novo assembly of the remaining reads (reads not mapped during the iterative mapping step).
  • Minimum size of contigs: the user can choose the minimum size of the contigs generated during the de novo assembly. By default, MICRA keeps only contigs longer than 500 nucleotides.
  • Minimum CDS coverage in BLAST steps: during the contig annotation, a BLAST against the PATRIC CDS databank is performed. By default, MICRA considers a CDS as present when at least 80% of its sequence is covered. The user can tune this threshold.
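
To make the interplay between the "Low coverage threshold" and "Minimum frequency" options concrete, here is a minimal Python sketch of the decision they describe. The function names are hypothetical, and treating both thresholds as inclusive is an assumption; MICRA's actual implementation may differ.

```python
def is_covered(depth, low_coverage_threshold=5):
    """A position counts as covered when enough reads map to it."""
    return depth >= low_coverage_threshold


def is_variant_called(counts, depth, low_coverage_threshold=5, min_frequency=0.9):
    """Decide whether a difference seen in `counts` of `depth` reads is a variant.

    Assumption: the frequency is counts / depth and both cutoffs are inclusive.
    """
    if not is_covered(depth, low_coverage_threshold):
        return False
    return counts / depth >= min_frequency


# 38 of 40 reads carry the difference: frequency 0.95, position covered.
print(is_variant_called(38, 40))
# Only 4 reads at the position: below the low coverage threshold.
print(is_variant_called(4, 4))
# Half of the reads carry the difference: below the minimum frequency.
print(is_variant_called(20, 40))
```

Lowering min_frequency makes the call more permissive, which may help with mixed samples at the cost of more false positives.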

Options for the "Antibiotic susceptibility and resistance" module

To predict the antibiotic susceptibility and resistance, the box "Performs antibiotic module" has to be checked.

This section allows the user to adjust several parameters for the "Antibiotic susceptibility and resistance" module:

  • Drugs to be tested: the user can give their own list of drugs to be tested. Drug names have to be separated with the '@' symbol.
  • Minimum sequence coverage for target genes: corresponds to the minimum coverage for a target sequence to be considered as present. By default, MICRA considers the antibiotic target gene potentially present when at least 30% of it is covered.
  • Minimum sequence coverage for antibiotic resistance genes: corresponds to the minimum coverage for a resistance gene to be considered as present. By default, MICRA considers the antibiotic resistance gene potentially present when at least 30% of it is covered.
  • Identity percentage threshold for target genes: by default, MICRA considers the antibiotic target gene potentially present when the identity percentage is greater than 97%. This threshold can be changed by the user.
  • Identity percentage threshold for antibiotic resistance genes: MICRA considers the antibiotic resistance gene potentially present when the identity percentage is greater than a given threshold. This threshold can be changed by the user. By default, MICRA runs in automatic mode, with an identity threshold that depends on the antibiotic class (see identity threshold for more details).
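
The following Python sketch shows how a drug list separated with '@' can be split, and how the default coverage and identity cutoffs for target genes combine. The function names and the drug names are illustrative only; the comparison operators follow the wording above ("at least 30%", "greater than 97%"), which is an interpretation rather than a documented rule.

```python
def parse_drug_list(text):
    """Split a drug list such as 'drugA@drugB' into individual names."""
    return [drug.strip() for drug in text.split("@") if drug.strip()]


def is_target_potentially_present(coverage_pct, identity_pct,
                                  min_coverage=30.0, min_identity=97.0):
    """Apply the default target-gene cutoffs to one BLAST hit.

    Assumption: coverage is inclusive (>=) and identity is strict (>).
    """
    return coverage_pct >= min_coverage and identity_pct > min_identity


drugs = parse_drug_list("ampicillin@ciprofloxacin@tetracycline")
print(drugs)
print(is_target_potentially_present(coverage_pct=45.0, identity_pct=99.2))
print(is_target_potentially_present(coverage_pct=25.0, identity_pct=99.2))
```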

The MICRA comparison module

MICRA allows the user to compare two types of MICRA output files:

  1. MICRA files containing the annotations obtained from mapping, and optionally the assembly annotations obtained from the de novo assembly
  2. MICRA files containing variant calls

The user can select one of these two types of comparison by switching the button at the top of the page.

Comparison of annotations

The user can compare at least 2 and up to 4 samples. Two types of comparison are available: consider only the common references (i.e. intersection mode, in which MICRA considers only the annotations from the reference sequences shared by the samples) or consider all the references (i.e. union mode, in which MICRA considers all the annotations). For each sample, the mapping annotation file (mapping_annotations.csv) must be loaded, and optionally the assembly annotation file (assembly_annotations.csv). Both are located in the "annotations" subdirectory of the MICRA result directory. The user can choose a name for each sample; it must be unique, must not begin with a digit and must not contain the "-" symbol.
Once the files have been loaded, the user can submit the form by clicking on the corresponding button. A link to a zip file containing the comparison results will appear. The results directory contains the Venn diagram picture (in .TIFF format) and the text files corresponding to the different classes of the Venn diagram (for more details see the MICRA comparison results section).
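
The naming rules for samples can be checked before filling in the form. This Python sketch is a hypothetical helper, not part of MICRA; it only encodes the constraints stated above (2 to 4 samples, unique names, no leading digit, no "-" symbol).

```python
def validate_sample_names(names):
    """Return the names unchanged if they satisfy the rules, else raise ValueError."""
    if not 2 <= len(names) <= 4:
        raise ValueError("between 2 and 4 samples can be compared")
    if len(set(names)) != len(names):
        raise ValueError("sample names must be unique")
    for name in names:
        if not name:
            raise ValueError("sample names must not be empty")
        if name[0].isdigit():
            raise ValueError(f"sample name must not begin with a digit: {name!r}")
        if "-" in name:
            raise ValueError(f"sample name must not contain '-': {name!r}")
    return list(names)


print(validate_sample_names(["strainA", "strainB", "strain_C"]))
```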

Comparison of variant calls

The user can compare at least 2 and up to 4 files containing MICRA variant calls in CSV format. These files are located in the subdirectories corresponding to the closest reference genome or to the plasmids, and their names end with "_variants.csv". The user can choose to consider both types of variants (SNVs + INDELs), only SNVs or only INDELs by selecting the corresponding mode at the top of the form. A link to a zip file containing the results will appear; the zip file contains the Venn diagram image and the corresponding text files (for more details see the MICRA comparison results section).

The MICRA results

Once your MICRA job is finished, download the zip file from the link provided by MICRA. After uncompressing the .ZIP file, you will obtain a directory containing several files and subdirectories.

The result directory contains three files:

  • The summary.html file is an HTML file that lists all the files generated during the analysis and allows easy navigation through the MICRA results.
  • The remainingReads.fastq file contains the remaining reads at the end of the MICRA iterative mapping.
  • The fastMapping_stats.csv file contains the mapping metrics obtained during the first step of the analysis to identify the closest reference genome.

Several subdirectories are produced during the analysis and are described below.

The closest_genome subdirectory

In your results, the name of this directory is the name of the closest reference genome (e.g. Escherichia-coli-55989:CU928145.2). It contains:
  • closest_genome.sam (available only if the box "keep the SAM files" is checked in the general options of MICRA): the mapping file of the reads onto the closest reference genome.
  • closest_genome_comparativeAnnot.csv: table of the comparative annotations.

    Each feature in the GFF file of the closest reference genome corresponds to a line in the table, for which the following information is given:
    • location on the reference genome (begin and end columns)
    • type of the feature (CDS, gene, region...)
    • strand: positive or negative strand
    • status of the feature: 5 statuses are used in MICRA, depending on the coverage, the complexity of the region and the distribution of the covered regions.

    • the coverage column gives the percentage of coverage of the feature
    • the succ column gives the cumulative percentage of coverage and the number of blocks. For example, (44.44444444444444,1) means that 44.44% of the given feature is covered, in a single block
    • the note column contains the description of the feature
  • closest_genome_comparativeAnnot.fa: gives the consensus sequence of each feature in FASTA format, built from the mapping. The regions that are not covered are encoded with N symbols.
  • closest_genome.svg: picture of the comparative annotations against the closest reference genome produced with CGView

  • closest_genome_un.fastq: the reads not mapped against the closest reference genome.
  • closest_genome_variants.csv: table containing the variant calls. When a position is covered (by at least n reads, where n is the "Low coverage threshold" of the general options) and a difference with the reference genome sequence occurs at this position with a minimum frequency (90% by default; this cutoff can be changed in the "general options" with the parameter "Minimum frequency"), a variant is called and shown in this table.

    This table contains 9 columns:
    1. ref. position is the position on the reference genome.
    2. type is the type of the variant: SNV for Single Nucleotide Variation and INDEL for insertion or deletion.
    3. ref. base is the corresponding base on the reference genome.
    4. variant gives the base on your sequenced genome.
    5. counts gives the number of reads containing the variant.
    6. frequency is the frequency of the variant observed in reads.
    7. depth gives the total number of reads at this position.
    8. CDS gives the description when the variant is located in a CDS.
    9. aa change gives the amino acid change induced by the variant when it is located in a CDS.

The user_genome subdirectory

When the user forces MICRA to analyse a given reference genome (see the "Selection of the reference sequences" section), a directory similar to the closest_genome directory is created.

The plasmids subdirectory

This directory contains all the files related to plasmids. The PLASMIDS.csv file gives the metrics obtained during the fast mapping step. If one or several plasmids have been identified as probably present (sequence coverage of at least 70% by default; this cutoff can be tuned in the "Selection of the reference sequences" options), a subdirectory is created for each plasmid and named after the corresponding plasmid. Each subdirectory contains files similar to the ones created for the closest reference genome:

  • plasmid1.sam
  • plasmid1.svg
  • plasmid1_comparativeAnnot.csv
  • plasmid1_comparativeAnnot.fa
  • plasmid1_consensus.fa
  • plasmid1_variants.csv

The reference_genomes subdirectory

This subdirectory contains files about the reference genomes used in the MICRA analysis:

  • The genomeList.html file is a web page showing the reference genomes automatically selected by MICRA and the corresponding number of reads matching on each of them.
  • The genomeList.txt file gives the selected reference genomes in a format that can be used as input of MICRA for further analysis (see "Give your own reference sequences" in the "Selection of the reference sequences" section of this user guide for more details).
  • The closest reference genome files: the genome sequence in FASTA format and the corresponding annotations in GFF format.
  • If the user forces a given reference genome, the corresponding FASTA and GFF files are also available in this directory.

The reference_plasmids subdirectory

This subdirectory contains data about the plasmids used in the MICRA analysis:

  • The plasmidList.html file is a web page showing the plasmids automatically selected by MICRA and the corresponding number of reads matching on each of them.
  • The plasmidList.txt file gives the selected plasmids in a format that can be used as input of MICRA for further analysis (see "Give your own reference sequences" in the "Selection of the reference sequences" section of this user guide for more details).
  • For each plasmid identified as probably present (the files are named after the plasmid):
    • the plasmid sequence in FASTA format
    • the corresponding annotations in GFF format

The sequences subdirectory

This directory contains sequences generated during the analysis:

  • mapping_consensus.fa contains the consensus sequences built from the iterative mapping step. The reads are first mapped against the closest reference genome; the unmapped reads are then mapped against the identified plasmids; the reads that are still unmapped are finally mapped iteratively against all the other reference sequences. The consensus sequences (with no more than 25 successive low coverage positions) extracted from the iterative mapping against all the reference sequences are given in this file.
  • assembly_contigs.fa contains the contigs (size > 500 nt by default; this can be changed in the MICRA general parameters) generated by de novo assembly from the remaining reads (after the iterative mapping).
  • assembly_stats.txt contains the statistics of the de novo assembly steps.
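
One possible reading of the rule "no more than 25 successive low coverage positions" used for mapping_consensus.fa is sketched below in Python: a consensus segment is extended across short low-coverage runs and cut when a run exceeds the limit. This is an interpretation, not MICRA's code; the function name, the per-position depth list and the use of the default low coverage threshold of 5 reads are assumptions.

```python
def consensus_segments(depths, low_coverage_threshold=5, max_low_run=25):
    """Return half-open (start, end) index pairs of consensus segments.

    A segment starts and ends on a covered position and never contains more
    than `max_low_run` successive low coverage positions.
    """
    segments = []
    start = None          # first covered position of the current segment
    last_covered = None   # most recent covered position seen
    for index, depth in enumerate(depths):
        if depth < low_coverage_threshold:
            continue
        if start is None:
            start = index
        elif index - last_covered - 1 > max_low_run:
            # The low coverage run was too long: close the segment before it.
            segments.append((start, last_covered + 1))
            start = index
        last_covered = index
    if start is not None:
        segments.append((start, last_covered + 1))
    return segments


# 30 covered, 10 low, 30 covered, 26 low, 5 covered positions.
depths = [10] * 30 + [0] * 10 + [10] * 30 + [0] * 26 + [10] * 5
print(consensus_segments(depths))
```

In this example the 10-position gap stays inside the first segment, while the 26-position gap splits the consensus in two.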

The annotations subdirectory

This directory contains the annotation files:

  • mapping_annotations.csv: table containing the annotations built from the iterative mapping, i.e. all the covered features considered as present (status 2 in MICRA) in all reference sequences.

    This table contains 5 columns:
    1. the begin position on the corresponding reference sequence
    2. the end position on the corresponding reference sequence
    3. the strand
    4. the percentage of coverage
    5. the note describing the feature
  • mapping_annotations.fa: sequences of the corresponding features in FASTA format.
  • assembly_annotations.csv: table corresponding to the CDS identified from the de novo assembly blasting contigs against the PATRIC CDS database.

    This table contains 6 columns:
    1. the contig ID
    2. the begin position of the CDS on the contig
    3. the end position of the CDS on the contig
    4. the description of the CDS
    5. the percentage of coverage of the CDS
    6. the corresponding PATRIC database identifier
  • assembly_annotations.fa: sequences of the contig features in FASTA format.
  • the assembly_BLAST_info folder contains files detailing the BLAST results of the contigs against the PATRIC CDS database: assembly_blast.txt contains the raw results and assembly_blast.html summarizes them in an easy-to-read HTML format showing all the CDSs covered to at least 80% (by default; this cutoff can be changed in the MICRA general parameters). When several hits occur in the same region, they are grouped and the most covered CDS is kept in the final results.
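
The grouping of hits that fall in the same region can be pictured with the following Python sketch. It assumes that "same region" means overlapping intervals on the same contig, which is an interpretation; the function name and the dictionary keys are hypothetical and do not reflect the actual file columns.

```python
def best_hit_per_region(hits):
    """Group overlapping hits per contig and keep the most covered CDS of each group.

    Each hit is a dict with the keys 'contig', 'begin', 'end', 'coverage'
    and 'description' (hypothetical names).
    """
    kept = []
    group = []
    group_end = None
    ordered = sorted(hits, key=lambda h: (h["contig"], h["begin"], h["end"]))
    for hit in ordered:
        same_region = (group
                       and hit["contig"] == group[0]["contig"]
                       and hit["begin"] <= group_end)
        if same_region:
            group.append(hit)
            group_end = max(group_end, hit["end"])
        else:
            if group:
                kept.append(max(group, key=lambda h: h["coverage"]))
            group = [hit]
            group_end = hit["end"]
    if group:
        kept.append(max(group, key=lambda h: h["coverage"]))
    return kept


hits = [
    {"contig": "contig1", "begin": 1, "end": 900, "coverage": 85.0, "description": "cdsA"},
    {"contig": "contig1", "begin": 100, "end": 950, "coverage": 95.0, "description": "cdsB"},
    {"contig": "contig1", "begin": 2000, "end": 2500, "coverage": 90.0, "description": "cdsC"},
    {"contig": "contig2", "begin": 1, "end": 500, "coverage": 82.0, "description": "cdsD"},
]
print([hit["description"] for hit in best_hit_per_region(hits)])
```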

The log subdirectory

It contains two files:

  • log.txt: this plain text file gives information about the different pipeline steps (input parameters, transitional results...). It is the text version of the summary.html file.
  • errors.txt: record of the standard output of the external tools used in MICRA.

The READS subdirectory

When the pre-processing module is performed, this subdirectory contains the reads (in FASTQ format) obtained after the trimming steps. This file is the input used in the MICRA analysis.

The FASTQC subdirectory

When the pre-processing module is performed, this subdirectory contains the results of the quality check with FastQC.

The antibio subdirectory

When the "Antibiotic susceptibility and resistance" module is performed, a directory antibio is created containing:

  • antibiogram.html: This is the summary of the antibiotic susceptibility and resistance prediction in HTML format.

    In this table, a line corresponds to a drug and contains:
    • the name of the drug
    • the DrugBank match, giving the susceptibility prediction
    • the ARDB match, giving the resistance prediction
    • the prediction summarizing the profile of the strain for the drug
    • the ARDB resistance types giving more details about the resistance profile
    The statuses for the prediction are:
    • FALSE: no match was returned from the database with the corresponding thresholds of identity and percentage of target coverage.
    • TRUE: matches were returned from the database with the corresponding thresholds of identity and percentage of target coverage.
    • NOT DEFINED: there is no entry in the database for this drug.

    The prediction summarizes the susceptibility and resistance predictions and is defined as follows:

    [Table: definition of the prediction from the susceptibility and resistance predictions]

  • blast_results.html: this file gives the detailed results of the BLAST hits against the DrugBank and ARDB databases.

    For each BLAST hit, several values are given:
    • the contig name produced by MICRA
    • the target coverage (the gene in the database)
    • the positions onto the contigs
    • the percentage of identity
    • the percentage of similarity

    The drugs considered in the analysis are shown in red, and the hits considered as significant (percentage of identity and percentage of target coverage greater than the thresholds) are highlighted in yellow.

  • log.txt: gives the results in text format.

The MICRA comparison results

The result directory of the MICRA comparison tool contains several files:

  • The Venn diagram, named output.tiff (TIFF format), showing the comparison between your samples.

  • All the text files giving the gene or variant lists from the different parts of the Venn diagram are also available. The name of each file indicates the corresponding part of the Venn diagram. The file "all.txt" contains the intersection of all samples (i.e. the variants or annotations shared by all the samples). A file named with a single sample name contains the elements identified only in this sample. A file named with two sample names contains the elements common to these two samples.
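
The relationship between the samples and the parts of the Venn diagram can be sketched in Python as follows. The function name and the sample contents are hypothetical. The sketch assumes that each part is exclusive (elements found in exactly the listed samples and in no other), which is how Venn diagram areas are usually counted but is not stated explicitly in this guide.

```python
from itertools import combinations


def venn_regions(samples):
    """Map each combination of sample names to the elements found in exactly those samples.

    `samples` is a dict from sample name to a set of elements
    (gene annotations or variants).
    """
    names = sorted(samples)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(samples[name] for name in combo))
            others = [samples[name] for name in names if name not in combo]
            elsewhere = set().union(*others)
            regions[combo] = shared - elsewhere
    return regions


samples = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3", "g4"},
    "C": {"g3", "g5"},
}
regions = venn_regions(samples)
for combo, elements in regions.items():
    print("+".join(combo), sorted(elements))
```

Here the part for A, B and C together plays the role of "all.txt", and the single-name parts hold the elements specific to one sample.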