Overview
The bioinformatics pipeline developed and implemented in the INSaFLU web platform currently consists of six core steps: (1) read quality analysis and improvement, (2) type and sub-type identification, (3) variant detection and consensus generation, (4) coverage analysis, (5) alignment/phylogeny, (6) intra-host minor variant detection (and uncovering of putative mixed infections) (Fig. 1). A summary of the INSaFLU current outputs is presented in Table 1. A link [25] to the latest documentation for each module, including software settings and current versions, is provided at the website (https://insaflu.insa.pt) (the documentation at the time this article was published can be found in the Additional file 1; notable changes in INSaFLU platform will be continuously reported in the documentation’s “change log” tab).
Read quality analysis and improvement
This module is the first step in almost all WGS bioinformatics analyses and refers to the quality control and improvement of the raw sequencing data. INSaFLU currently accepts single- and paired-end reads (fastq.gz format) generated through widely used NGS technologies, such as Illumina or Ion Torrent. Reads’ quality control in the INSaFLU pipeline is performed by using FastQC software [26], while quality improvement is achieved through Trimmomatic [27]. This tool sequentially (i) performs a trimming sliding window by cutting reads once the average quality within a base window falls below a threshold of quality score, (ii) removes very low-quality bases (or N bases) from both the start and the end of each read if their quality falls below the specified minimum quality required, (iii) excludes reads that fall below a specified length, and (iv) standardize the quality scores by converting them to Phred-33 scores. This first module is automatically run upon reads upload (i.e., no user intervention is needed) and provides the following outputs: (i) FastQC graphical reports (“html” format) of well-established statistics of the reads quality before and after Trimmomatic analysis and (ii) quality processed reads (“fastq.gz” format).
Type and sub-type identification
In the second step of the pipeline (also automatically run without user involvement), a draft de novo assembly is performed over the quality processed reads using SPAdes [28]. Subsequently, the ABRicate tool [29] is applied to query the draft assemblies against an in house database (“influenza_typing”) of a set of type- and sub-type/lineage-specific gene markers that allows the discrimination of the influenza A and B types, all currently defined influenza A subtypes (18 hemagglutinin subtypes and 11 neuraminidase sub-types) and the two influenza B lineages (Yamagata and Victoria). Using this approach, INSaFLU provides the automatic identification of the influenza virus type and sub-type/lineage just after reads upload. Of note, samples are flagged as “putative mixed infections” if more than one type, HA or NA subtype or lineage is detected, and specific alerts are also generated if an incomplete type/subtype is assigned. No incongruence was observed between the in silico determined types or HA subtypes and the result obtained by the traditional “pentaplex” real-time RT-PCR assay applied for influenza diagnosis, typing and sub-typing [30] for the tested the tested 192 A(H3N2) (dataset 1) and 78 A(H1N1pdm09) (dataset 2) viruses. Also notable is that both or either the type and/or sub-type/lineage could be determined for viruses sequenced with very low coverage (mean depth of coverage < 5-fold across the eight amplicons), launching the perspective that this key typing data can be even retrieved from clinical samples with vestigial viruses abundance and/or generating very low PCR yield. The INSaFLU “influenza_typing” database (Additional file 2: Table S1.A) includes (i) representative sequences of the gene encoding the matrix protein (MP or M1 gene) of influenza A and B viruses (to infer the influenza type A or B), (ii) representative sequences of the HA gene of each of the 18 currently defined HA sub-types, (iii) representative sequences of the neuraminidase (NA) gene of each of the 11 currently defined NA sub-types, and (iv) HA representative sequences of the influenza B lineages Yamagata and Victoria. As a proof of concept, all MP, M1, HA, and NA sequences available at Influenza Virus Resource (NCBI) - Influenza Virus Database [31], a total of 184,067 sequences (database accessed in 23–25.10.2017), were screened using the INSaFLU “influenza_typing” tool. The percentage of hits correctly assigned exceeded 99.99% for NA and HA sub-typing and reached 100% for type determination. Of note, this assay detected several types/sub-types mislabeled in the NCBI database (confirmed by BLAST analyses), so these specific mis-discrepancies were not account for specificity estimation purposes. Following the same methodological rationale as described above, draft assemblies are additionally queried against another in house database (“influenza_assign_segments2contigs”) (Additional file 2: Table S1. B) using ABRIcate, enabling the automatic assignment of assembled contigs/nodes to each corresponding viral segment and a closely related reference influenza virus (output is provided as a “.tsv” table). This feature reinforces the application of INSaFLU to (i) analyze viruses for which a closely related whole-genome sequence is not available (e.g., avian influenza) at the INSaFLU or other databases (NICBI, GISAID, etc.), (ii) disclose mixed infections (e.g., by inspecting the output to find if two contigs assigned with same viral segment are flagged with distinct reference influenza viruses), (ii) investigate reassortments (e.g., by inspecting the output to find whether different reference viruses are assigned to different viral segments). Noteworthy, as the database for segments/reference assignment is not as exhaustive as the common influenza sequence repositories (e.g., Influenza Research Database/Fludb, Nextflu, EpiFLU/GISAID), it is prudent that users query those databases or apply other tools (e.g., BEAST, Giraf or BLAST) for specific purposes, such as the detection/confirmation of reassortments or assignment of the closest publicly available sequence of each segment. Yet, the database includes, for instance, representative virus of the circulating 3C.2a and 3C.2a1 genetic sub-groups of seasonal A(H3N2) influenza (as defined by the HA sequence diversity, following ECDC guidelines) as well as representative A(H5N1) viruses from distinct H5 genetic clades, so this INSaFLU feature can promote both the rapid traditional HA genetic sub-group classification and the detection of potential inter- or intra-subtype reassortments during the WGS-based influenza surveillance.
Altogether, upon sample data submission, INSaFLU automatically provides a rapid snapshot of whole-genome backbone of each virus and robustly detects the influenza virus type and sub-type/lineage, which guides the subsequent reference-based downstream module and constitutes an optimal complement to the traditional real-time RT-PCR assays, as it discriminates any HA and NA influenza A sub-types and both influenza B lineages.
Variant detection and consensus generation
This step of the pipeline consists of mapping the quality processed reads against user-specified reference sequences, followed by SNP/indel calling and annotation, and generation of consensus nucleotide sequences. The current reference database of INSaFLU includes reference sequences of (i) post-pandemic (2009) vaccine-like/reference influenza A(H1N1)pdm2009, A(H3N2) and B viruses (from both Northern and Southern hemispheres) and (ii) representative virus of multiple combinations of HA/NA subtypes (i.e., H1N1, H2N2, H5N1, H7N9, etc.) (check the latest list at the documentation webpage). All reference sequences at INSaFLU are publicly available at NCBI (or made available under permission from the authors). The reference files, both in “.fasta” and “.gbk” (GenBank) format (annotation performed by using Prokka) [32], have been prepared to fit amplicon-based schemas capturing the whole coding sequences (CDS) of the main eight genes of influenza virus (PB2, PB1, PA, HA, NP, NA, M, and NS). Nonetheless, INSaFLU is highly flexible and allows handling NGS data collected from any amplicon-based schema, provided that users fit the reference files to their amplicon design (users just have to generate and upload a multi-fasta file containing reference sequences of the individual amplicons they use with the precise size of the target sequence). Uploaded “.fasta” files are annotated using Prokka upon submission and automatically become available at the user-restricted reference database. In this module, INSaFLU takes advantage of Snippy [33], which is a high flexible multisoftware tool for rapid read mapping (using Burrows-Wheeler Aligner—BWA [34]), SNP- and indel calling (using samtools [35] and freebayes [36]), variant annotation (using SnpEff [37]), and consensus generation (using vcftools [38]). We selected the following criteria for reads mapping and validating SNPs /indels to be annotated, listed and assumed in the consensus sequences: (i) a minimum mapping quality of ≥ 20, (ii) a minimum number of 10 quality processed reads covering the variant position, and (iii) a minimum proportion of 51% of quality processed reads at the variant position differing from the reference. As a conservative approach, for each virus, consensus sequences are exclusively generated for loci with 100% of its length covered by ≥ 10-fold (see below the “Coverage analysis” module for more details), thus avoiding the generation of incomplete sequences that would shrink the nucleotide region available for genetic diversity analyses. Nonetheless, variants that fulfill the above described criteria, but fall within loci not fully covered with ≥ 10-fold, are still included in the list of all variants per sample/project (a specific flag is provided for these cases), so that users can still retrieve valuable and reliable data (e.g., specific epitope and antiviral drug resistance mutations) from samples with borderline coverage. Users can explore all output mapping files (“.bam” format) to view and inspect all reads and variants using the easy-to-use visualization tool Integrative Genomics Viewer [39] available at INSaFLU. These output files are also used in INSaFLU pipeline to more complex downstream analyses (see below the module “Intra-host minor variant analyses”). For each run (see INSaFLU usage section), users must choose the reference sequences (in general, the vaccine-like reference sequences of the season under surveillance) and the pool of samples to be compared (viruses sharing the same type/sub-type as the reference selected, as inferred in the previous module). The option to map reads against same type and sub-type reference sequences of the vaccine reference strains not only potentiates the mapping quality but also has the clear advantage of providing the user with a list of amino acid replacements properly coded to be reported for surveillance. In fact, the amino acid substitutions (including key markers of specific clades/genetic groups) that are reported by National Reference Laboratories to supranational health authorities (e.g., reports to ECDC/WHO via TESSy) are coded against the sequence profile of vaccine-like strains. In summary, this INSaFLU module provides the key data that are actually the core first-line “genetic requests” for effective and timely monitoring of influenza virus evolution on behalf of seasonal influenza laboratory surveillance, i.e., the list of variants (assumed in consensus sequences) and their effect at protein level and also consensus sequences. The latter constitutes the whole basis for the downstream phylogenetic inferences driving the continuous tracking of influenza temporal/geographical spread.
Coverage analysis
A key standard parameter to take into account when performing NGS is the mean depth of coverage, defined as the mean number of times each base shows up in individual reads (also known as vertical coverage). When handling small amplicon-based NGS data for virus variant detection and consensus generation, it is mandatory to finely inspect the fluctuation of the depth of coverage throughout each amplicon region [6]. Such inspection of the so-called horizontal coverage may not only be highly informative about sequencing-derived artifacts (the coverage plot should typically follow an invert U shape per amplicon) but also provides important clues about the degree of relatedness between the genetic background of the “query” virus and the reference sequence chose for mapping. For instance, obtaining sufficient mean depth of coverage for a given amplicon for which its complete length was not covered at 100% may be indicative of miss-mapping due to a high genetic distance between the reference sequence for that locus and the virus under sequencing. These phenomena are typically expected for cases of antigenic shift (reassortment between viral segments from different strains) or intra-segment homologous recombination, or even, for instance, for cases of “mis-subtyping” or “mis-choice” of the reference sequences (e.g., erroneous mapping of A/H1N1pdm09 viruses against a vaccine-like A/H3N2 reference). In this context, we developed the getCoverage.py script [40], so that INSaFLU automatically provides the user with a deep analysis of the coverage. Results are provided both per sample (graphical outputs) and as batch per project (“tsv” format), by yielding the following data: mean depth of coverage per locus, % of locus size covered by at least 1-fold, and % of locus size covered by at least 10-fold. The latter statistics was chosen both to fit the minimum depth of coverage for variant calling and to guide the consensus generation (as described above), i.e., the consensus sequences are exclusively provided for amplicons fulfilling the criteria of having 100% of their size covered by at least 10-fold. In addition, INSaFLU interactively yields intuitive color-coded outputs of the coverage statistics as well as depth of coverage plots for each locus per sample, enabling users to fine-tune this important parameter towards the uncovering of eventual atypical but highly relevant genetic events, such as reassortment/homologous recombination events.
Alignment/phylogeny
This module generates harmonized sequence and phylogenetic data that can be directly applied for fine-tuned downstream analysis and visualization platforms, thus promoting the operationalization of a harmonized supranational WGS-based surveillance of influenza virus [8, 41]. First, filtered consensus nucleotide sequences are used as input to progressiveMAUVE [42] and MAFFT [43] for draft and subsequent refined sequence alignment, respectively. INSaFLU provides refined nucleotide sequence alignments (FASTA and NEXUS formats) both at locus level, i.e., for each one of amplicon targets (which are, in general, influenza CDSs), and at “whole-genome” scale (after concatenation of all amplicon targets). Amino acid alignments for annotated proteins are also built using MAFFT [43]. Subsequently, phylogenetic trees (in standard “.nwk” and “.tree” formats) are inferred for each alignment by maximum likelihood under the General Time-Reversible (GTR) model (1000 bootstraps) using double-precision mode of FastTree2 [44]. In order to fulfill the demands of the cumulative data acquisition underlying laboratory surveillance throughout each flu season, for each INSaFLU project, alignments and phylogenetic trees are automatically re-build and updated as more samples are added, making data integration completely flexible and scalable (see “Usage” section). Alignments and phylogenetic trees can be either downloaded for external exploration or explored in situ at INSaFLU website using MSAViewer [45] and PhyloCanvas [46], respectively.
In summary, INSaFLU dynamically builds ready-to-explore scalable gene- and genome-based alignments and phylogenetic trees in standardized nomenclatures and formats that are fully compatible with multiple downstream applications. These include not only other web-based “surveillance-oriented” platforms for influenza genotyping, phenotypic prediction (e.g., Influenza Research Database/Fludb and EpiFLU/GISAID), or phylogeographical/patient data integration (such as, PHYLOViZ, Phandango and Microreact) [47,48,49], but also several computationally intensive bioinformatics algorithms commonly applied for fine-tuned research of influenza evolutionary dynamics, such as inference of signatures of selection or refined phylogenetics (e.g., the widely used MEGA, DnaSP, BEAST, and RAxML).
Intra-host minor variant detection (and uncovering of putative mixed infections)
INSaFLU additionally provides the user the possibility to get insight on the influenza intra-patient sub-population dynamics through the scrutiny of minor intra-host single nucleotide variants (iSNVs), i.e., SNV displaying intra-sample frequency below 50%. This is achieved by applying freebayes software [36] over mapping files (“.bam” format) with the following criteria: (i) excludes read alignments from analysis if they have a mapping quality of less than 20, (ii) excludes alleles from iSNV analysis if their supporting base quality is less than 20, (iii) requires a minimum of 100-fold depth of coverage to process a site for iSNV analysis, and (iv) requires at least 10 reads supporting an alternate allele within a single individual to evaluate the iSNV frequency. Once fulfilling the above previous criteria, no less than 1% of intra-host frequency of the alternate allele is reported. As such, in a dynamic manner, distinct minimum iSNV frequency cut-offs are assumed depending on the depth of coverage reached at each site, i.e., the identification of iSNV sites at frequencies of 10, 2, and 1% is only allowed if the depth of coverage at a particular site exceeds 100-fold, 500-fold, and 1000-fold, respectively. For each INSaFLU project, results are compiled in a table (“tsv” format) listing all iSNVs (detected for all project’ samples) at frequencies between 1 and 50% (reported frequencies refer to the proportion of reads harboring a nucleotide that is different from the one in the reference). As above, variant annotation (using SnpEff) [37] is also provided. Of note, variants at a frequency above 50%, which correspond to variants included in the consensus sequences, are filtered out from this table since they are systematically listed and annotated upstream in the pipeline (see module “Variant detection and consensus generation”). The table can easily be scrutinized to find sites displaying inter-patient redundancy (i.e., iSNV sites found in more than one individual). These may for instance constitute the ultimate genetic clues for disclosing influenza transmission links [50] or the emergence of antiviral resistance [51, 52]. Similarly to what is outlined in the previous module, this table is automatically re-build and cumulatively updated as more samples are added to each INSaFLU project. In order to additionally enable the detection of infections with influenza viruses presenting clearly distinct genetic backgrounds (so called “mixed infections”), INSaFLU additionally plots the proportion of iSNV at frequency 1–50% (minor iSNVs) and 50–90% detected for each sample (the positional mapping of iSNVs from these two categories within each amplicon can also be explored in the “coverage plots”; see above). A cumulative high proportion of iSNVs at both frequency’ ranges is mostly likely to represent a mixed infection, in a sense that the natural intra-patient influenza diversification is expected to be very low (no more than a few tenths of variants, most of them at frequency < 10%), within the limit of detection of the currently applied NGS techniques [7, 50, 53]. INSaFLU flags samples as “putative mixed infections” based on iSNVs if the following cumulative criteria are fulfilled: the ratio of the number of iSNVs at frequency 1–50% (minor iSNVs) and 50–90% and falls within the range 0.5–2.0 and the sum of the number of these two categories of iSNVs exceeds 20. Alternatively, to account for mixed infections involving extremely different viruses (e.g., A/H3N2 and A/H1N1), the flag is also displayed when the sum of the two categories of iSNVs exceeds 100, regardless of the first criterion. These numerical indicators were empirically inferred upon multiple testing, including the independent NGS run of sample replicates constituting “true” mixed infections (Additional file 3: Figure S1; dataset 1). In order to further consolidate these criteria, an additional proof of concept was carried out by running a bona fide dataset (dataset 3) of artificial mixtures (in triplicate) of A(H3N2) viruses at various proportions previously generated by Shepard and colleagues [17]. INSaFLU was able to detect these same sub-type mixtures at relative frequency of as far as 99:1, as well as yielded match “whole-genome” consensus sequences for all mixtures with the same dominant virus for all triplicates (Additional file 3: Figure S2; dataset 3). Finally, besides this iSNV-based approach, it is also worth noting that samples are also flagged as “putative mixed infections” if more than one type, HA or NA subtype or lineage is detected (see “Type and subtype identification” module).
In summary, through this module, INSaFLU supplies public health laboratories and influenza researchers with relevant data on influenza sub-population diversification within humans that can be systematically integrated in parallel with the “classical” data on “consensus-based” inter-patient virus genetic diversity. Taking into account the recent findings on this subject [50,51,52,53,54,55], it is expected that this dual approach will strengthen not only our ability to detect the emergence of antigenic and drug resistance variants but also to decode alternative pathways of influenza evolution and to unveil intricate routes of transmission.
Pre-NGS design and full pipeline testing
The INSaFLU pipeline has been mainly tested with two NGS datasets: 192 samples from A(H3N2) viruses (dataset 1) and 78 samples from A(H1N1) viruses (dataset 2) (see details below). These were generated in an Illumina MiSeq apparatus after influenza whole-genome amplification with a modified wet-lab protocol based on a previously reported RT-PCR assay [19,20,21]. The adapted pre-NGS protocols, both for influenza A and B viruses, are provided in the INSaFLU’s documentation and can be straightforwardly used for the routine generation of amplicon template for WGS of influenza viruses (irrespective of virus sub-type/lineage). Library preparation was conducted following the Nextera XT DNA Library Prep Reference Guide and WGS runs (96 samples per run) were carried out using MiSeq Illumina flow cells to obtain 2 × 150 paired-end reads (300 cycles). Based on our experience with the described experimental design, success (i.e., 100% of the length of the eight influenza CDS covered by ≥ 10-fold) is largely potentiated if WGS runs are designed to yield > 150,000 (2 × 75,000) reads per sample. In fact, above this cut-off, a success of 92% was achieved when comparing with less than 70% obtained for samples with < 150,000 dedicated reads. As a prudent approach, users should design NGS runs to go further this cut-off (e.g., 300,000 reads per sample) in order to better account for issues arising from both the PCR (e.g., fluctuations in the percentage of influenza-specific amplicons across samples and unbalanced relative proportions of the in-sample amplicons) and the NGS run (e.g., low yield and unbalanced demultiplexing of the reads across the samples). INSaFLU modules (relying on robust and widely used software) (Fig. 1) were subjected to specific validation tests to guarantee the generation of accurate outputs, as described above. Still, in order to further attest INSaFLU robustness as a whole, we ran both datasets 1 and 2 with IRMA (v0.6.1; influenza module; default settings) [17], which is the CDC command-line bioinformatics solution for NGS-driven whole-genome assembly and variant detection for RNA viruses, including influenza. Despite using distinct methodological approaches, both platforms start from raw reads towards the generation of the main outputs for influenza surveillance. Comparative analysis of the obtained “whole-genome” consensus sequences using INSaFLU versus IRMA demonstrated similar and robust performance of both pipelines. A detailed description of this assay is presented in Additional file 4: Table S2.