0 citations0 references

Amborella gene presence/absence variation is associated with abiotic stress responses that may contribute to environmental adaptation

New Phytologist2021Vol. 233(4), pp. 1548–1555

Citations Over TimeTop 10% of 2021 papers

Haifei Hu, Armin Scheben, Brent Verpaalen, Soodeh Tirnaz, Philipp E. Bayer, Richard G.J. Hodel, Jacqueline Batley, Douglas E. Soltis, Pamela S. Soltis, David Edwards

Abstract

Amborella trichopoda (Amborellaceae) is the single living sister species of all other extant flowering plants and only occurs in rain forest habitats on the remote island of New Caledonia. These features make Amborella an important species in which to study genetic variation, including gene presence/absence variants (PAVs). Here, we apply the reference genome based iterative mapping and assembly strategy (Bayer et al., 2020) to assess gene diversity across 10 diverse individuals. The N50 of the newly assembled contigs was 1.2 kb, indicating similar contiguity to those in comparable pan-genome studies (Brassica oleracea, 0.6–1.9 kb; Brassica napus, 1.2 kb; Solanum lycopersicum L., 1.4 kb; Musaceae, 0.7–2.4 kb; and Oryza sativa L., 1.1 kb) (Golicz et al., 2016a; Hurgobin et al., 2018; Wang et al., 2018; Gao et al., 2019; Rijzaani et al., 2021) (Supporting Information Table S1). We identified 2765 additional genes not present in the reference assembly and found that Amborella may have relatively few dispensable genes (3136, 10.4%) (Table S2) compared with studies in other species (maize, 60.88%; Brassica oleracea, 18.71%; bread wheat, 42.30%; Brachypodium distachyon, 45%; and cultivated rice, 51.5%) (Hirsch et al., 2014; Golicz et al., 2016a; Gordon et al., 2017; Montenegro et al., 2017; Wang et al., 2018). Although a small set of samples was included in this study, our pan-genome modelling indicates that 11 Amborella samples were sufficient to capture the majority of PAVs within Amborella (Fig. S1). Although levels of genetic diversity in Amborella are comparable with outbreeding perennials such as Populus, population bottlenecks over the past 900 000 yr may have contributed to the relatively low number of gene PAVs (Amborella Genome Project, 2013). Studies in Brassica oleracea (e.g. cabbage, broccoli, cauliflower, kale), wheat, rice, tomato, soybean and maize have demonstrated that dispensable genes are frequently associated with both biotic and abiotic stress (Golicz et al., 2016b; Jin et al., 2016; Montenegro et al., 2017; Wang et al., 2018; Gao et al., 2019; Liu et al., 2020). By contrast, the dispensable genes in the Amborella genome are mainly enriched for functions associated with responses to abiotic stress, including salt, cadmium, zinc, cold, water deprivation and heat, with few dispensable genes associated with biotic stress (Tables S3,S4). Based on Gene Ontology (GO) annotation, 314 genes are related to cadmium ion response in Amborella, which is similar to 296 cadmium ion response genes identified in Brassica oleracea and significantly higher than the number of cadmium response genes identified in Brachypodium distachyon (44), Oryza sativa L. (44), Solanum lycopersicum L. (60) and Glycine max (28) (Tables S5,S6). Amborella has the highest proportion of cadmium-responsive genes classified as dispensable (44.6%), compared with Brassica oleracea (7.8%), Glycine max (7.8%), Brachypodium distachyon (25.0%), Oryza sativa L. (22.7%) and Solanum lycopersicum L. (23.3%) (Table S6). One-third of New Caledonia is covered by ultramafic soils that are rich in nickel, lead, silver, zinc, copper and cadmium (Lillie & Brothers, 1970). Although Amborella does not occur on ultramafic soils, it does occur on soils rich in metals (Thien et al., 2003), and genes dispensable in Amborella may be associated with adaptation to regions of the island with high metal content soils (Jaffre et al., 2013). Cadmium ion response genes can be grouped based on Pfam domain classification into 97 gene families, with 57 families only containing one single gene and 40 gene families containing multiple gene copies (Fig. S2; Table S7). Different gene families show expansion or contraction across the Amborella samples. For example, 12 cadmium ion response-associated genes have a Myb-like DNA-binding domain (PF00249); only seven copies were identified in the reference sample Santa Cruz, while all 12 copies were found in samples BA, BO, TOB and PWB (Fig. S3). We assessed more broadly the distribution of cadmium ion response-associated genes, showing that many are missing in the reference Santa Cruz, Mé Ori (MO) and PO samples (Fig. S4). However, the samples TC, BO, AM, TOB, BA and PWB experienced significant gene expansion. These samples were located near two ancient refugia (Poncet et al., 2013), suggesting that the observed difference in gene content may be associated with population changes during the last glacial maximum (c. 21 000 yr bp). We identified 152 dispensable genes predicted to be involved in the salt response, 100 related to the cold response, and 76 associated with the drought response (Fig. S2; Table S5). The geographical features and climate conditions vary significantly across New Caledonia. A central mountain range results in a rain shadow effect, with 800 mm yr−1 in the western coastal region compared with 4500 mm yr–1 on the eastern slopes of mountain areas (Teurlai et al., 2015). Moreover, as the average annual temperature decreases by c. 0.6°C with every 100 m increase of elevation (Andréfouët et al., 2004), Amborella populations occurring in elevations ranging from 110 m to 860 m (Poncet et al., 2013) grow under different temperature conditions. The diverse environmental conditions present within a relatively small and isolated island may be a driver for the retention or loss of genes for environmental adaptation across the island (Hodel et al., 2018). To also assess the diversity of resistance gene analogue (RGA) content in Amborella, we conducted genome-wide RGA discovery analysis. Over-masking repeats in a genome assembly can lead to a decreased number of RGAs identified (Bayer et al., 2018). Even without masking repeats, we found that Amborella contained relatively few RGAs (514), compared with other angiosperms of similar genome size (Solanum lycopersicum L., 1000; Manihot esculenta, 1412; Brassica napus, 1749; and Brassica oleracea, 1989) (Li et al., 2016; Bayer et al., 2019; Dolatabadian et al., 2020). Most RGAs were receptor-like kinase genes (RLK; 308 genes), followed by nucleotide binding site leucine-rich repeat genes (NLR; 140 genes) and receptor-like proteins (RLP; 68 genes) (Fig. S5; Table S8), consistent with previous reports that RLKs are the most abundant class of RGAs in plants (Shiu et al., 2004). The physical clustering of RGAs (Table S9) is also similar to observations in other species (Bayer et al., 2019). NLRs, which play an important role in disease resistance responses, are divided into subclasses based on their domain structures. Toll/interleukin-1-Nucleotide-binding site-Leucine-rich repeat (TNL) and Coiled-coil-Nucleotide-binding site-Leucine-rich repeat (CNL) are the two typical complete forms of NLRs. The remaining NLR subclasses, with missing domains and disordered domain structure, are known as atypical NLRs. Amborella contained 28 typical NLRs (CNL, 14; TNL, 14) and 112 atypical NLRs, of which the majority were NL (40) and NBS (39). Zhang et al. (2016) identified a total of 88 NLRs in Amborella, including 9 TNLs and 15 CNLs. However, Shao et al. (2016) reported a larger number of CNLs (89) and TNLs (15) in Amborella. The differences in the numbers of RGAs reported in these studies were due to the application of different RGA classification approaches (Tables S10, S11). Specifically, Shao et al. (2016) defined all the non-TNL RGAs as CNL genes, while we include the subclasses CC-NBS-LRR (CNL), CC-NBS (CN), NBS-LRR (NL), and NBS (NBS). In addition, our RGA analysis relies on the A. trichopoda v.6.0 assembly and gene annotation as well as newly annotated pan-genome genes, whereas Shao et al. (2016) and Zhang et al. (2016) rely on the lower number of genes in the A. trichopoda v.1.0 gene annotation. Overall, our analysis captures 105 of the 108 previously identified RGAs as well as detecting 35 previously unreported RGAs (Fig. S6), including 4 TNLs (Table S12). Of the 514 RGAs, we found that most RGAs (491) were core and the remaining 23 were dispensable. This result contrasts with findings in other studies in which the majority of RGAs are dispensable (Li et al., 2014; Hurgobin et al., 2018; Bayer et al., 2019). The relatively low number and lack of dispensable RGAs in Amborella may, in part, reflect the lack of diverse pathogen pressure given its isolated and limited distribution on a remote island, as well as the lack of recent WGD events, which tend to increase both the number and variation of NLRs across a genome (Seo et al., 2016). Lastly, we explored the population structure of Amborella using SNP-based and gene PAV-based population analysis. The 10 Amborella individuals in this study were sampled from across the native range in New Caledonia (Fig. 1), while the reference sample Santa Cruz was sourced from the California Santa Cruz Arboretum and Botanic Garden, with evidence suggesting that it had a similar origin as MO (Amborella Genome Project, 2013). A SNP-based phylogenetic tree and principal component analysis (PCA) plot (Fig. 2a,b) showed that Amborella individuals clustered into four clades or groups based on their geographical locations, consistent with previous studies that inferred Amborella population structure using microsatellite loci (Poncet et al., 2013) and SNPs (Amborella Genome Project, 2013). However, cluster and PCA analyses based on PAVs showed some differences in population structure (Fig. 2c,d). The two individuals Santa Cruz and MO remained closely associated, although the PCA based on PAVs suggested greater variation between these individuals than found by the SNP analyses. Ponandou (PO), with the second fewest genes, is located on the north-eastern side of the island and showed a distinct PAV pattern compared with Santa Cruz and MO (Fig. 2d). The cluster containing DOB, BA and AOC showed an intermediate pattern, while the third cluster (comprising BO, AM, TC, TOB and PWB) hosted the largest number of newly annotated genes (Figs 1, S7; Table S13). The differences in gene content between clusters may be driven by environmental factors and past ecological processes. The individuals in the cluster comprising BO, AM, TC, TOB, PWB, DOB, BA and AOC shared the majority of the newly annotated genes. These samples were collected from the northern and central areas of New Caledonia, two putative refugial areas from the last glacial maximum with higher levels of genetic diversity than other areas (Poncet et al., 2013). Although the MO sample is geographically close to AM and DOB, it shares a low number of genes with the reference sample Santa Cruz; this differentiates MO from other samples. This gene loss and low diversity may have been caused by either a bottleneck or a founder effect at the fringe of the geographical distribution (Hampe & Petit, 2005). Additionally, PAVs and SNPs are suitable for inferring population structure because they often show identity by descent as a result of single ancestral mutation events (Sudmant et al., 2015). However, PAVs may lead to different outcomes when used in phylogenetic inference compared with SNPs. For example, in Arabidopsis, phylogenies based on SNPs and PAVs found congruent relationships between major geographic clades but differed substantially in how they resolved relationships within these clades (Tan et al., 2012). Such patterns of PAVs conflicting with evolutionary distance and geography may also be caused by balancing selection and highlight the value of using PAVs in addition to SNPs in assessing species evolution. By characterising genetic diversity in the phylogenetically pivotal plant Amborella, our study showed that dispensable genes are annotated with functions associated with abiotic stress response that have the potential to contribute to adaptation. The 12 known Amborella populations, 10 of which we investigated here, occur in distinct sites distributed across New Caledonia (Amborella Genome Project, 2013; Poncet et al., 2013), and their genetic structure reflects their geographical location. Although some data on environmental conditions are available, the small number of samples hinders robust statistical inference of associations between gene PAVs and the environment. Further sampling of within-population diversity in Amborella, together with a more detailed assessment of the environment, may help to elucidate the effect of environment on genomic variation. However, Poncet et al. (2013) reported high genetic homogeneity within Amborella populations, suggesting that additional sampling would not uncover significantly more diversity. Even with limited sampling, our gene presence/absence analysis identified variation in Amborella gene family size that may be associated with local adaptation to the environment. By contrast with many other plants, particularly crops, Amborella resistance gene family sizes are relatively small and invariable. Notably, however, there was a substantial variation in cadmium-responsive genes and other genes responsive to environmental stress. This suggests that structural variation has the potential to contribute to adaptation in Amborella and families of dispensable environmentally responsive genes may be important for such adaptations. Pennsylvania State University provided whole-genome Illumina re-sequencing data for Amborella trichopoda as part of the Amborella Genome Project. One Amborella trichopoda individual from each of 10 natural populations represented the geographic and genetic diversity of Amborella on the mainland of the New Caledonia archipelago. These sites were sampled based on Poncet and colleagues’ microsatellite analysis (Poncet et al., 2013). In addition, sequence data for one plant from the University of California Santa Cruz (Santa Cruz) were included in the reference genome (Amborella Genome Project, 2013). The data published by the Amborella Genome Project are available in the NCBI Sequence Read Archive (SRA), accessions SRX337107–SRX337118 and SRX336900. The geographic locations are presented in Table S14. Here, 10 individuals with sequencing coverage of >10× were selected for novel contig assembly using a previously published assembly pipeline (Golicz et al., 2016a). An updated chromosome-level Amborella assembly was used as the starting reference. Iterating through each of the 10 samples, reads were aligned to the reference genome and unaligned reads were assembled into contigs. These contigs were then added to the reference before the next sample was aligned. The Amborella chloroplast genome (NC_005086.1) and mitochondrial genome (KF754803.1) were included in the reference assembly. Read adapters were removed using Trimmomatic (Bolger et al., 2014) v.0.36. Alignment was performed using Bowtie2 (Langmead & Salzberg, 2012) v.2.3.3.1 with parameters (--end-to-end --sensitive -I 0 -X 1000), and the de novo assembly of the unaligned reads was conducted using MaSuRca (Zimin et al., 2013) v.3.2.2. The resulting pan-genome was assessed using reads re-aligned by Bowtie2. Blast+ (Camacho et al., 2009) v.2.2.31 (-template_type coding_and_optimal -max_target_seqs 2 -e-value 1e-3) was used to perform contamination detection, by comparing the novel contigs with the NCBI nucleotide database (downloaded from NCBI, 26 June 2018). Query sequences showing best hits with over 80% query length covered and 80% identity to likely contaminants (sequences of nongreen plant species) were removed. To reduce variation in annotation caused by different annotation tools, we used the conservative annotation approach of Evidencemodeler (EVM) (Haas et al., 2008) v.1.1.1 to annotate the novel contigs, following the approach of the Amborella reference genome project (Amborella Genome Project, 2013). Three sets of evidence (gene prediction evidence, transcript evidence and protein evidence) were provided as the inputs. Gene evidence of newly assembled contigs was generated by running the Snap (Korf, 2004), Augustus (Stanke et al., 2006) and Genemark (Tang et al., 2015) gene prediction pipelines. For the transcript evidence, the published available RNA-seq data from reference sample were downloaded from NCBI (Table S15). Tophat et al., 2009) and et al., were used to perform the and assembly. In addition, the published RNA-seq assembly from the Amborella Genome Project (Amborella Genome Project, was aligned using the transcript prediction evidence was generated by the protein sequences of Amborella (downloaded from NCBI using & Moreover, repeat sequences by & 2009) were provided as an additional to annotation. Amborella samples from natural populations and one reference were used in gene PAV Gene PAV discovery was based on coverage analysis using the & 2018). from all were to the genome and novel contigs using with parameters (Li & Gene PAVs were based on the across all of the genes by We not significant of the genes present in each with sequencing (Table A gene was as missing when the coverage across of the gene was and the coverage was (Golicz et al., 2015). A gene was a core gene it was present in all 11 it was missing in one it was a dispensable The et al., was used to show the distribution of gene PAVs the 11 Amborella samples. The of pan-genome size and core genome size were using the from the et al., 2016). The pan-genome and protein sequences of sativa et al., were downloaded from the The pan-genome and protein sequences of Brachypodium distachyon et al., were downloaded from the Brachypodium The lycopersicum pan-genome et al., and its protein sequences were downloaded from the The Brassica oleracea pan-genome (Golicz et al., and its protein sequences were downloaded from the Brassica genome database The cultivated soybean pan-genome et al., 2021) and its protein sequences were downloaded from annotation was performed using & 2008) The genes were aligned to the proteins in the database using (Camacho et al., and only with were the results were to was used to for aligned genes based on the analysis of the dispensable genes was conducted by the & using with the approach used to for multiple with the to cadmium were and used to the of the dispensable genes in Amborella Brassica oleracea, Glycine Brachypodium distachyon, Oryza sativa L. and lycopersicum L. The pipeline (Li et al., was used to and genes. gene physical clusters were by comparing the genetic of RGA using the of Bayer et al. two RGAs were located within each other by 10 or 10 genes, they were into an For of RGA sets from Shao et al. (2016) and Zhang et al. (2016) using the A. trichopoda v.1.0 gene annotation with our RGA set based on the A. trichopoda v.6.0 assembly and pan-genome annotation, we used (Camacho et al., 2009) v.2.2.31 to with a sequence of The pipeline was used to RGA in these in A. trichopoda v.1.0 gene annotation (Amborella Genome Project, 2013). We 140 RGAs, including of the 105 RGAs found by Shao et al. (2016) and of the RGAs found by Zhang et al. We assessed the genes with a in classification between our results and those in Shao et al. using the prediction and Pfam protein domain (Table Based on protein we found that our were (Figs SNPs were annotated using et al., 2012). A SNP-based phylogenetic tree of 11 Amborella accessions was using et al., et al., using a maximum the parameters The was selected using based on the Information before tree SNPs were generated by SNPs with and missing using et al., The PAV-based maximum phylogenetic tree was generated with et al., et al., the parameters using the PAVs in as (gene was as and was as The was selected using based on the Information We of the Amborella Genome Project for to the updated chromosome-level assembly in this analysis. This was with the of provided at the and was by the and and the the for studies at the University of was by an by the the of the The and the and the contributed to the genetic analysis. contributed to the disease resistance analysis. and contributed to the pan-genome assembly analysis. performed all other analyses. contributed to the and of the sequencing data used in this study are available on through numbers and The assembled and other data are available at of all individuals are available at The genome can be using et al., at The PAV as in which areas for genes and areas for genes, SNPs and predicted genes. For the predicted genes, there are showing the transcript evidence, the evidence from protein and the evidence from gene The for data from each individual is also modelling of Amborella. genes in Amborella. show the number of gene copies across populations in the gene family with Myb-like DNA-binding domain PAVs distribution and the clustering of the dispensable genes annotated as cadmium ion response-associated genes. gene families in Amborella. of the RGAs reported by our study and Shao et al. (2016) and Zhang et al. plot of gene PAVs distribution 11 Amborella samples. domain analysis for the resistance gene analogue domain analysis for the resistance gene analogue the protein domains for the resistance gene Table of newly assembled contigs in Amborella pan-genome and other pan-genome Table number of core genes and dispensable genes in Amborella. Table of dispensable genes. Table of Amborella dispensable genes. Table genes in Amborella. Table and of dispensable genes in Amborella Brassica oleracea, Glycine Brachypodium distachyon, Oryza sativa L. and lycopersicum L. Table The gene family classification of cadmium responses associated genes. Table gene families in Amborella. Table clustering of RGAs in Amborella. Table of the resistance gene analogue (RGA) classification in Shao et al. (2016) and classification used in our study based on Table of the resistance gene analogue (RGA) classification in Zhang et al. (2016) and classification used in our study based on Table Sequence between the protein sequences trichopoda used in Zhang et al. (2016) and sequences in our Table Gene PAVs for different Amborella individuals. Table of the 11 Amborella samples used in this study and the of Illumina re-sequencing data Table of the RNA-seq used in this Table and sequencing reads in for each Table of the resistance gene analogue (RGA) classification in Zhang et al. Shao et al. (2016) and classification used in our study based on are not for the content or of Information by the than missing be to the New The is not for the content or of by the than missing be to the for the

Citations Over TimeTop 10% of 2021 papers

Abstract

Related Papers