Parsing ecological signal from noise in next generation amplicon sequencing
Citations Over Time: Top 1% of 2014 papers
Abstract
It is clear that the use of next generation sequencing (NGS) applied to environmental DNA is changing the way researchers conduct experiments and significantly deepening our understanding of microbial communities around the globe (Amend et al., 2010; Caporaso et al., 2011; Bik et al., 2012; Bates et al., 2013). The lower per unit cost and sheer number of sequences relative to traditional methods provide tremendous advantages in characterizing the richness and composition of highly diverse microbial systems (Bokulich et al., 2013). In a recent volume of New Phytologist, Lindahl et al. (2013) presented an excellent introduction to high-throughput sequencing of amplified gene markers for fungi, and broadly discussed field sampling and handling, DNA extraction, markers, primers, amplicon library construction, sequencing platform, bioinformatic analyses and data interpretation. We applaud their overview as an important general guide, but we have found that there are significant additional issues regarding NGS that have not been well articulated in the literature, especially when applied to fungi. Below we highlight a series of platform-independent recommendations based on our recent experiences with NGS, which we think are critical for maximizing the signal:noise ratio in molecular ecological analyses.

The inclusion of both negative and positive controls is indispensable in NGS-based studies because NGS detects sequences at much lower levels than traditional sequencing (i.e. sequences can be readily detected in controls with NGS methods even in the absence of positive PCR bands). These controls are essential at multiple steps during the experimental process: in the field, in laboratory settings where samples are processed, during DNA extraction, and before and after PCR. To be useful, the controls must be treated identically to other samples from initial processing through library preparation.
As an example, we recently conducted a field-based study using the Illumina MiSeq platform to sequence the ITS1 region of soil fungi, which included a series of negative controls. Of the total sequence pool generated, we detected 0.01% of sequences (3.17% of total operational taxonomic units, OTUs) from soil sieve controls, 0.0001% (0.2% of total OTUs) from DNA extraction controls, and 0.001% (0.67% of total OTUs) from PCR controls. Together, these controls accounted for 0.01% of total sequences (3.8% of total OTUs). While detection of fungal taxa in negative controls is key to determining which fungal taxa should be included in subsequent ecological analyses, there is currently no consensus on how to handle these sequences. One approach would be to simply delete from all samples any OTUs that appeared in negative controls (e.g. Vik et al., 2013). However, in our study, this would have deleted many of the most abundant OTUs in the experimental samples. It seems highly likely that those abundant OTUs were in fact present in the field because (1) many had been previously encountered in soil and (2) their abundance in the controls was multiple orders of magnitude lower. To avoid eliminating OTUs that appeared to be ecologically valid, we subtracted the number of sequences of each OTU present in the negative controls from the sequence abundance of that OTU in the experimental samples (after subtraction, the negative control samples contain zero sequences and the other samples have reduced abundances). In our dataset, this approach eliminated only two low abundance OTUs (each had 10 sequences).

Thus, if we were to use this particular method of OTU clustering, we would remove any OTUs that had three or fewer sequences across the whole experimental dataset. While it did not happen in our example, the fungal OTUs in the positive control could potentially appear in low abundance (i.e. singletons) in other samples due to primer contamination or tag switching (see below). If this were to happen, these low abundance OTUs should of course also be removed from all the experimental samples.

A number of molecular-based ecological datasets have shown that sequence abundance does not necessarily correlate well with tissue abundance across different species (Manter & Vivanco, 2007; Liti et al., 2009; Amend et al., 2010; Avis et al., 2010; Egge et al., 2013; Weber & Pawlowski, 2013). This was also reflected in our mock community where, despite combining equal amounts of DNA of all 27 species, we found that one OTU appeared three orders of magnitude higher in total sequence number than eight of the OTUs and two orders of magnitude higher than 16 of the OTUs, much like a rank-abundance curve of any natural community (Table 1). While we think that this issue (which could be due to factors such as unequal gene copy number in fungal genomes or taxon-specific PCR bias) is important to keep in mind, we suggest that analyses based on sequence abundance data often have ecological relevance. For example, the most abundant ECM fungal species on root tips often have the highest number of sequences in NGS datasets (Tedersoo et al., 2010; P. Kennedy et al., unpublished data; N. H. Nguyen et al., unpublished data). Similarly, in mock communities containing different concentrations of known species across different samples, Amend et al. (2010) showed that sequence abundances generally scaled well with relative DNA concentration within but not between species (i.e. ‘semi-quantitative’). Further, Smith & Peay (2014) compared results based on incidence- (i.e. presence/absence) and abundance-based data and showed that using only the former led to artificially high estimates of β-diversity when re-sequencing the same DNA extract.
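The contrast between incidence- and abundance-based β-diversity can be illustrated with a small sketch. The read counts below are hypothetical (they are not data from any of the cited studies) and represent two sequencing runs of one DNA extract in which the dominant OTUs are recovered consistently while rare OTUs appear or drop out by chance. Bray-Curtis dissimilarity is used here as one common abundance-based measure; applied to presence/absence data it is equivalent to Sørensen dissimilarity. The function and variable names are illustrative assumptions.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length lists of OTU read counts."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0


def to_incidence(counts):
    """Convert read counts to presence/absence (1 if the OTU was detected, else 0)."""
    return [1 if c > 0 else 0 for c in counts]


def sorensen(a, b):
    """Incidence-based dissimilarity: Bray-Curtis computed on presence/absence data."""
    return bray_curtis(to_incidence(a), to_incidence(b))


# Hypothetical counts for seven OTUs in two runs of the same extract.
# The three dominant OTUs are similar; the four rare OTUs are stochastic.
run_1 = [5000, 3000, 1500, 2, 0, 1, 0]
run_2 = [5200, 2900, 1400, 0, 3, 0, 1]

abundance_d = bray_curtis(run_1, run_2)
incidence_d = sorensen(run_1, run_2)

print(f"abundance-based dissimilarity: {abundance_d:.3f}")
print(f"incidence-based dissimilarity: {incidence_d:.3f}")
```

In this toy case the incidence-based value is far larger than the abundance-based one, because every rare OTU counts as much as a dominant one once the data are reduced to presence/absence.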
Based on these combined examples, we suggest that conducting ecological analyses of fungal communities using both incidence- and abundance-based sequence data is a better approach than using only one or the other, as the results from one data type can help to inform the other. If using only abundance-based data is preferred due to concerns about the amount of information lost when transforming to incidence data, we remind researchers that a variety of data transformations (e.g. log, square-root) can be used to down-weight the importance of more abundant OTUs before ecological analyses.

Primer cross-contamination in a multiplexed library is a serious issue and should be discussed openly. In a sequencing library that has primer cross-contamination, a small number of sequences can be erroneously assigned to a different sample, potentially skewing the ecological interpretation of the data. Primer cross-contamination could happen at any stage, from oligonucleotide manufacturing to PCR. Primers maintained in plates, be they from the manufacturer or aliquots, have a greater chance of being contaminated due to repeated opening and closing of the sealing mats/film. We emphasize the importance of asking explicit questions about the chance of cross-contamination and purification costs for primers from the manufacturer before ordering. In addition, we suggest that primers be ordered in individual tubes and that aliquots be made in tubes instead of plates to minimize the possibility of cross-contamination. Tag-switching (sensu Carlsen et al., 2012), where primer barcode tags from one sample may jump onto another sample during PCR, is a related issue, which can be accounted for by using primers tagged on both ends. Unfortunately, this would double the cost of primers, so it may not be practical for the majority of researchers.

Another issue is the accidental inclusion of previously amplified DNA from one project in the post-PCR sample processing of a different project.
Fortunately, for a laboratory where NGS is used for the first time, there is no chance for this kind of contamination. It will, however, become immediately relevant in laboratories that have built multiple libraries from the same primer sets (P. Kennedy et al., unpublished data). While specifically accounting for post-PCR contamination is difficult (because controls at these steps cannot be easily parsed bioinformatically due to the absence of barcodes), careful additional laboratory hygiene (e.g. doing all post-PCR reactions with pipette tips with barriers, wiping down pipettors regularly with nuclease solutions, using a flow hood) will help minimize this problem.

Although the recommendations provided here are specific to fungal molecular ecology, we think that they could be broadly applied to other study systems using amplified markers. Overall, we think researchers need to pay even greater attention to controls in NGS analyses in order to extract as much signal from their datasets as possible. We stress that positive and negative controls provide different important information about NGS-based data and that, for each dataset, the inclusion and independent assessment of both is key to determining the best parameters used to generate the OTU tables used in ecological analyses. We recognize that more experienced researchers have dealt with these issues, but we hope this will complement the guide by Lindahl et al. (2013) for researchers just beginning to sequence fungi in ecological studies.

The authors thank Tedersoo and two reviewers for comments on previous versions of this manuscript.
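The negative-control subtraction and the low-abundance cutoff described earlier can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the study: it assumes that reads of each OTU are summed across all negative controls, that the sum is subtracted from every experimental sample, and that counts are floored at zero. The sample names, OTU names, read counts and function names are hypothetical; the default cutoff of three reads follows the "three or fewer sequences" threshold mentioned in the text.

```python
from typing import Dict

# sample name -> OTU name -> read count
OtuTable = Dict[str, Dict[str, int]]


def subtract_controls(samples: OtuTable, controls: OtuTable) -> OtuTable:
    """Subtract each OTU's total negative-control reads from every sample (floor at 0)."""
    control_totals: Dict[str, int] = {}
    for counts in controls.values():
        for otu, n in counts.items():
            control_totals[otu] = control_totals.get(otu, 0) + n
    return {
        sample: {otu: max(n - control_totals.get(otu, 0), 0) for otu, n in counts.items()}
        for sample, counts in samples.items()
    }


def drop_rare_otus(samples: OtuTable, max_reads: int = 3) -> OtuTable:
    """Remove OTUs whose total across all samples is max_reads or fewer."""
    totals: Dict[str, int] = {}
    for counts in samples.values():
        for otu, n in counts.items():
            totals[otu] = totals.get(otu, 0) + n
    keep = {otu for otu, total in totals.items() if total > max_reads}
    return {
        sample: {otu: n for otu, n in counts.items() if otu in keep}
        for sample, counts in samples.items()
    }


# Hypothetical experimental samples and negative controls.
samples = {
    "plot_A": {"OTU_1": 1200, "OTU_2": 6, "OTU_3": 40},
    "plot_B": {"OTU_1": 900, "OTU_2": 3, "OTU_3": 0},
}
controls = {
    "sieve_blank": {"OTU_1": 2, "OTU_2": 5},
    "pcr_blank": {"OTU_2": 1},
}

corrected = subtract_controls(samples, controls)
filtered = drop_rare_otus(corrected, max_reads=3)
print(filtered)
```

Here the abundant OTU_1 survives with slightly reduced counts, while OTU_2, which is no more abundant in the samples than in the controls, drops to zero and is removed by the cutoff. Deleting every OTU seen in any control would instead have removed OTU_1 as well.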