High‐Throughput Assays to Assess the Functional Impact of Genetic Variants: A Road Towards Genomic‐Driven Medicine
Abstract
Genome-wide genotyping and DNA sequencing have led to the identification of large numbers of genetic variants that are associated with many clinical phenotypes. The functional impacts of most of these variants are unknown. In this article, we review high-throughput assays that have been developed to assess a variety of the functional impacts of variants. A better understanding of their functions should facilitate the implementation of many more variants in genomic-driven medicine.
A cornerstone of precision medicine is the incorporation of genetic information into healthcare decisions. This approach relies on understanding the complexity of the genome, the genetic differences that exist between individuals, and the functional consequences of genetic variants. In the personal genome era, improvements in sequencing technologies are leading to the continuous identification of new variants and further illustrating the complexity of the human genome and the genetic diversity between populations. Large-scale high-throughput sequencing studies, such as the 1000 Genomes and NHLBI GO Exome Sequencing Projects, have already identified millions of genetic variants among individuals from different populations and have established a comprehensive resource on human genetic variation.1, 2 The genetic variants are cataloged in public databases, such as dbSNP (https://www.ncbi.nlm.nih.gov/snp/) and dbVAR (https://www.ncbi.nlm.nih.gov/dbvar/) (Table 1). The current build of dbSNP (build 147, updated on 14 April 2016) contains ∼154 million single nucleotide variants (SNVs), of which about 101 million have been validated and nearly 89 million are within genes. dbVAR (updated on 28 September 2016) contains ∼5 million structural variants, of which ∼2.3 million are copy number variants, 1.3 million are short tandem repeats, and 1.2 million are insertions.
In the 1000 Genomes Project, sequencing was carried out on 2,504 individuals from 26 populations in Africa, East Asia, Europe, South Asia, and the Americas. More than 88 million variants were identified, of which 84.7 million were single nucleotide polymorphisms (SNPs), 3.6 million were short indels, and 60,000 were structural variants. Only 8 million of the identified autosomal variants were observed in more than 5% of individuals, while 64 million were rare variants (frequency of <0.5%). Eighty-six percent of variants were present in only a single continental group. Sequencing of individuals from South Asian and African populations contributed 24% and 28%, respectively, of the novel variants discovered.1 Sudmant et al.3 reported the identification of 68,818 structural variants when analyzing sequencing data from the 1000 Genomes Project. The majority of these structural variants are deletions (42,279), with a median site size of 2,455 bp and a median of 2,788 alleles per individual.
The nucleotide substitution rate is an important factor underlying the degree of genetic variation between individuals. Scally4 reported a present-day germline mutation rate of 0.5 × 10⁻⁹ bp⁻¹ year⁻¹. This mutation rate translates into ∼30 de novo variants in each offspring that are absent in the parents. The introduction of 30 new DNA variants with every meiosis event over a period of 3.7–6.6 million years (evolution of the human species), together with the rapid expansion of the human population during the last 10,000 years, resulted in the enormous diversity observed in the human genome. For most genetic variants, the impact on gene function and the effect on disease susceptibility remain unknown. High-throughput sequencing continues to produce more accurate estimates of how much genetic variation exists within and between the genomes of individuals of different ethnicities.
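The step from a per-year mutation rate to a per-offspring count can be sketched as a simple product. The text does not state which genome size or generation interval underlies the ∼30 figure, so the haploid genome size of about 3 × 10⁹ bp and the 20-year generation used below are illustrative assumptions only; other reasonable choices (a diploid genome, longer generations) give larger counts.

```python
def de_novo_per_generation(rate_per_bp_per_year, genome_size_bp, generation_years):
    """Expected number of new variants per generation: rate x sites x years."""
    return rate_per_bp_per_year * genome_size_bp * generation_years


# Rate quoted in the text (Scally): 0.5e-9 per bp per year.
RATE_PER_BP_PER_YEAR = 0.5e-9

# ASSUMPTIONS, not from the source: haploid genome size and generation time.
ASSUMED_GENOME_SIZE_BP = 3.0e9
ASSUMED_GENERATION_YEARS = 20

estimate = de_novo_per_generation(
    RATE_PER_BP_PER_YEAR, ASSUMED_GENOME_SIZE_BP, ASSUMED_GENERATION_YEARS
)
print(f"Estimated de novo variants per offspring: {estimate:.0f}")
```

Because the result scales linearly with both assumed inputs, the function is best read as a way to see how sensitive the estimate is to them, not as a derivation of the published figure.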
Typically, each genome has 4–5 million sites that differ from the reference human genome; the greatest number of variant sites was observed among individuals of African ancestry. Although SNPs and indels account for >99.9% of variants, the typical genome contains 2,100–2,500 structural variants that affect about 20 million bases of sequence. Deep sequencing allows for the identification of rare variants, and an estimated 1–4% of the variants observed in a genome (40,000–200,000) are rare (frequency of <0.5%). A typical genome reportedly contains 149–182 sites with protein truncation variants, 10,000–12,000 sites with nonsynonymous variants, and 459,000–565,000 variant sites within regulatory regions (untranslated regions, promoters, insulators, enhancers, and transcription factor binding sites). A typical genome contains 24–30 ClinVar variants (those associated with clinical phenotypes).1 Tennessen et al.2 suggested that 2.3% of SNVs per individual exome disrupt protein function, affecting about 313 of the 23,500 protein-coding genes, and that nearly 96% of SNPs predicted to affect gene function are rare. Figure 1 is a representation of functionally important regulatory and gene regions with the number of variants within these regions for a typical genome.
Genome-wide association studies (GWAS) have been used to determine which of the identified variants are associated with diseases. To date, more than 3,200 GWA studies have been conducted (http://www.ebi.ac.uk/gwas) and ∼10,000 common SNPs have been associated with human traits and diseases through GWA studies.5 Gusev et al.6 estimated that ∼80% of the phenotypic heritability of common diseases and traits is explained by variants in noncoding regulatory regions.
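The 40,000–200,000 range for rare variants follows directly from the two ranges quoted above (4–5 million variant sites, 1–4% of them rare). A minimal check of that arithmetic, using integer percentages to avoid floating-point rounding; the function name is hypothetical:

```python
def rare_variant_range(sites_low, sites_high, pct_low, pct_high):
    """Return (min, max) rare-variant counts from site and percentage ranges."""
    return sites_low * pct_low // 100, sites_high * pct_high // 100


# Ranges quoted in the text: 4-5 million variant sites, 1-4% of them rare.
low, high = rare_variant_range(4_000_000, 5_000_000, 1, 4)
print(f"Rare variants per genome: {low:,} to {high:,}")
```

The lower bound pairs the smallest site count with the smallest percentage and the upper bound pairs the largest with the largest, which is how the quoted range appears to have been formed.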
Approximately 2,000 variants per genome have been associated with complex traits in GWA studies.1 However, testing such a large number of SNPs in a GWA study requires correction for multiple testing, using very stringent significance thresholds, to decrease the number of false-positive associations. Bonferroni correction for multiple testing (0.05/number of tests) is often used, but it can overcorrect and thus miss SNPs that are truly associated with the phenotype.7 Large numbers of study participants are also needed to identify rare causal variants with GWAS.8
Genetic variants impact drug metabolism, efficacy, and adverse event risk and are especially relevant to precision medicine. Fujikura et al.9 analyzed sequencing data from the 1000 Genomes and the NHLBI GO Exome Sequencing Projects; they reported a total of 6,165 SNVs in the 57 cytochrome P450 (CYP) genes. Eighty-three percent of the 4,025 SNPs within the coding regions were very rare (frequency of <0.1%) and 65% were nonsynonymous substitutions. The calculated total number of genetic variations in CYP genes of 1 million Europeans and Africans was 3.4 × 10⁴ and 4.8 × 10⁴, respectively.9 Furthermore, every individual of European descent carries on average 94.6 SNVs in CYP genes, of which 24.6 are nonsynonymous, lie within splice sites, or affect stop codons.9 In the recent PGRN-seq study, 82 genes of pharmacogenomic relevance were sequenced among 5,639 individuals and 40,549 SNVs were identified. Of the identified variants, 8,126 were in coding regions (4,858 missense, 3,169 synonymous, and 99 stop gain variants) and 19,923 were in noncoding regions (5,231 intronic, 5,981 upstream, 3,444 downstream, 4,165 3′UTR, 903 5′UTR, and 199 other variants).
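The Bonferroni rule mentioned above (0.05/number of tests) can be sketched in a few lines. The function names, the one-million-test figure, and the p-values are illustrative assumptions, not values from any study cited here.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold: alpha divided by the number of tests."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def passes_bonferroni(p_values, n_tests, alpha=0.05):
    """Flag each p-value that is at or below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return [p <= threshold for p in p_values]


# Hypothetical example: one million SNPs tested in a single GWA study.
N_TESTS = 1_000_000
threshold = bonferroni_threshold(0.05, N_TESTS)

# Made-up p-values for four SNPs, for illustration only.
example_p_values = [1e-9, 3e-8, 1e-4, 0.03]
flags = passes_bonferroni(example_p_values, N_TESTS)
print(f"threshold = {threshold:.1e}, significant = {flags}")
```

In this example a SNP with p = 1e-4 fails the corrected threshold even though it would look convincing in a single test, which illustrates the overcorrection concern: truly associated SNPs with modest effects can be missed.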
The majority (∼96%) of individuals had one or more Clinical Pharmacogenetics Implementation Consortium Level A actionable variants, while ∼23% (n = 1,273) of individuals have a single Level A actionable variant.10 The Human Gene Mutation Database (HGMD) is a repository of mutations associated with diseases; they are based on published literature, including GWA studies, and as of June 2013, had 141,161 germline mutation entries in 5,700 unique genes. Missense substitutions, nonsense substitutions, substitutions, and within regulatory account for and of the total Exome sequencing of individuals that each of these individuals carried mutations as by The large of variation data has a for the functional of many genetic this should to variants In most association studies are to assess the of rare genetic variants, as very large are needed to with the rare variants to is to determine the functional impact of the rare variants and variants with functional into one group. This has been with studies on by an to each that is based on the variants, including rare variants, are as or The is calculated and into a predicted 14 For other the functional impact of the variants are this approach used to the and facilitate association The of this review the current of many high-throughput functional assays (Table assays should the functional from variants, which information for the implementation of genomic-driven medicine. The exome is of the human genome and contains 23,500 protein-coding genes with Large-scale sequencing studies have on genetic variants within the as these variants protein 2 The number of identified genetic variants between populations, with individuals of African descent being more African individuals nonsynonymous and variants. 
The number of nonsynonymous and variants identified among individuals of East Asian or European were (Figure the identification of variants in regions coding for only a of these variants disrupt protein function and are The average number of variants per genome from in Europeans to among Missense or nonsense in protein-coding genes of variants in Missense and nonsense nucleotide and indels the of the which can to and of can many of the such as and and, functions of the such as and it is to variants that from indels or nucleotide substitutions. have been developed to variants affect protein and and variants in regions are or and, thus, used in a clinical The used for these are by the of variant and in the data The complex between and phenotypic effect also result in false-positive and by these 20 The with can by this approach with functional of variants. data with of the phenotypic of variants can also used to or as data to studies are used to assess the effect of genetic variants on protein A approach is mutations are and genes are identified based on the that A approach the of genes, which is by functional assays to protein these are as the effect of only a number of variants are are also especially as they the use of a variety of and on the function of a protein being To the rate of functional has been developed as a high-throughput to the function of of variants This can used to the functional impact of multiple variant including indels, and structural variants. 
of a on the of protein functional that is For for genes that an For functions or the of is protein the protein The can into as protein high-throughput and A has been published by and studies have used with different and to The the of a or of that a site in a This can by with mutations or mutations through et developed a is by the introduction of the into an et the use of assays a protein is from a or or can used to to about in in or more than in The of protein to use on of the such as how the protein variant the of the to and only on the number of A is that is to the protein function have been by testing their impact on protein of or or protein DNA or protein or of a studies have suggested that of the for each High-throughput sequencing is used to identify the variants with the phenotypic the of the of a bp to each mutation is of the is for and correction for and the sequencing A of variants requires about for at least per is important to determine the of each within the of variants or of variants are calculated by the by of each one or multiple of to the is used to identify that are or in during is when is used for a protein for of by et developed a to high-throughput sequencing data into a functional for each variant and a was developed to mutation impact from mutation data by using a However, for data have been The in which mutations further the of from Deep an to the effect of a of at can when the observed effect is different from the effect of the an large in or one variant the effect of et estimated that the of phenotypic that et developed a to the estimated effect from functional to account for the effect of variant and improvements to the current approach have been of variant function is for for which the function is in to assays or rate of the gene assays used during are often to a protein and function being these assays remains a For of such as protein with high-throughput and the complexity of human sequencing and of can used to for the average 
sequencing rate of Furthermore, of variants is for and can the of et developed a to for by using a gene to gene into different sequencing to the of short sequencing of genetic variants with the use of in to association studies, is in and understanding disease risk or For variants of significance are identified in in by DNA was used in a to the of nearly 2,000 in the of on and binding to this The variant functional were used to a of variant effect on DNA This the of variants observed in the clinical sequencing of the In to understanding the function of variants of it is also important to to mutations from mutations in protein or it is also to the impact of the mutation on the effect on and The study by et is of how can used in precision medicine. with in and was used to identify variants in the site of that are in to of The of the mutation was than that of the mutation The of the mutation result in this mutation being to that other more than a in a in This also to other for to A was also developed for regions to determine the effect of every single substitution on This was to a of the factor Although the majority of are or to of the mutations This approach in the for the of that with genetic mutations or is an autosomal genetic that A genetic exists in the coding of the The functional of this variant was with the use of high-throughput Although the variant impact of the protein to the it the of the to This phenotypic effect is associated with in the and In this can with the is a it can the of the by the and of The of for use in variant a further of how high-throughput functional assays can facilitate identification of actionable drug and of gene is by a variety of regulatory regions in the regions as binding sites for and and that gene Genetic variations in these transcription factor binding sites, regions, and regions can the binding of these and leading to in gene The of gene is the of the of of these binding sites to the 1000 Genomes the median number of variants 
among continental population range from variants in enhancers, in transcription factor binding sites, in promoters, and in the per typical human genome 1). that of these variant sites are to the for this has been to use individual assays that have of a The and of the are by the of a gene and protein that is by the and In these assays are also assays for the effect of genetic in binding the large number of regulatory SNPs that to this has led to more high-throughput functional assays are one of the high-throughput functional assays that have been used to assess genetic variants in regulatory regions. For using this variations at many in the regulatory were using and into a were also into the sequence. regulatory variants were in and the of each was by the of each the of the variants that each was This was by over variants that were of a with the in the to for The of was into The are and in the and the was by of assays have been developed and by different and as as to the of the in has been used to assess the of genetic variants in the gene This gene is a of and a for the they the of gene using a of to To further assess single nucleotide and a of the in they a the the within the with and The was into a and into at to a single per expansion and were by which has been validated to by DNA was and to the genome to assess variations in the associated with the and regulatory is high-throughput functional for variant on gene This is a in which have for one regulatory to per This is through of a single into a of This was by a the of a into using A of over the was for each of the genes of and was used to the with a from the which in of to a functional that were for these genes were based on and of mutations were analyzed to assess which mutations of A was developed in to that are in the different populations using the as the and as a high-throughput functional that can to study in is regulatory sequencing of a and into the of genes. 
This allows for these to and of the when into of the which the are by This was developed and using the genome and has the to identify and in This has been to such as enhancers, as as a approach in which DNA are on a and into are multiple high-throughput technologies for the impact of genetic variants on regulatory The diversity of regulatory requires that each of has a The variations in regulatory that gene can using assays that with and of genes, such as the the can sequenced to determine which variants resulted in the in regulatory are the sites of many important genetic variants, these assays for the variants with functional The regions of most genes are of and the transcription of the DNA into a of with in the to the and the to the This of is a and the majority of human exist in multiple to can have large impacts on the functions of the by the of the the of regulatory or the regulatory within the that are in are the splice the splice and a within the determine how a splice site is of splice site is to a at the an at the and an A at the of the In is variation in the other the splice and in the that in the of the splice site (Figure 1). In to the within the and the of a splice within the and within the the of an splice site being and Genetic variations within of the or can splice site the can function than the it is that variations in splice can impact many phenotypes. of diseases are by SNPs in noncoding regions, of which are to splice Approximately of variants in are to variants that are to mutations within the within the at the and of the the binding and the have also been of diseases to variants the identification of variants in the has been by the number of The majority of the variants are in splice The in the of or the of splice variants have been in a variety of sequencing and genome large numbers of variants have been associated with clinical phenotypes. 
of these variants are in noncoding regions of the genome and are predicted to their in clinical requires understanding the function of the variants. However, functionally testing of variants is to the and Typically, are on identified variants to functional significance or of used are by et In such as are also being that the functional significance of SNPs in splice sites for of in the binding are to but also has a that result in and protein binding at sites from the site of the Although the functional variants from large data to these are important to the human in studies the in of the of the the to the other complex genetic and in In of the variants are it can to that the variants of the of the often requires to that are to in are most used to determine the impact of variants in splice assays are the for functionally testing and variants predicted to In this a a is between within an The of the to is by the in a or in into and the of in the The are by which splice and the of and variant one can determine the functional impact of the genetic variants on This can used to by the of the and variant of the sequence. Although this is for testing and of high-throughput it is to or of variants. most functional variants in result in gain or of assays that can also used to functionally variants in splice are assays that differences in binding and assays are as in assays that of with a protein or a protein by the of from Although these are in assays to determine they to to in high-throughput assays into the of they of assays more the functional significance of a variant has been established by such as the of data being through technologies and through databases, the use of assays are a in testing large numbers of variants in splice are of variants that have been predicted to of variants are in with other functional variants, it to large numbers of variants. High-throughput that individual variants and for differences in alleles are in the causal variants. 
functional variants are identified, into the of of these variants also high-throughput Although have been in that these functional variants, high-throughput have been has been to high-throughput functional assays functional assays were into high-throughput high-throughput use the that are in on were by More of are of from bp in are on an and into a to the have been used in as and for gene based assays are for testing the functional significance of SNPs and short indels the are of a In high-throughput studies variants in splice these have been used in in and in In a a large number of are into the to a of The are as a of DNA that the with the total For each a and variant of the is in the are with binding on each which are used to the by In a single the are into the to a of that are in The of is into to and and are by the to the of each in the the of and is between the and variant this the effect of the variant on the A to this is the representation of the in the of et that only of the in the was in approach is the use of of with to a that contains the for transcription and approach to testing splice sites is to use an in 89 This high-throughput a of that have been by with the and by an The are by and in the of The is in a The of the are the and from the and the and the from the To identify and each the are by The are and analyzed by to the of This only differences in between and variant but it also identify the of that are by the of testing genetic variants for also to the is to protein the for is binding to the approach can used to identify which This is using a binding to a protein binding a of are for their to The of as is in and the is with to facilitate The that are to are by the and This can with a such as based protein binding or for by The populations in the and are and analyzed by in the binding of and variant in the The impact of on a variety of human has the of that the in that affect are the for in are to splice site as a result of to splice site by the 
between sites and the and binding have been used with in and are clinical For the most common of is by the of the the is to a splice site mutation that in the of that to binding to the of and, thus, the of functional of the has been to for an period of in of that to has been observed using that This approach of using also by the of regulatory or to The of of has also been with high-throughput drug that the of of protein have been of the the between a at the such as and have been to the of splice sites in the gene in The in of these genes is to to of between the binding site and the of the are approach to are that the of and, factor is a that is observed in a variety of from to In this from different are to a single a that result in a protein are by of with an The contains the to the in a that for a functional in of the diseases such as and are being sequencing and genotyping technologies have enormous genetic variation in the human has a large number of variants, but the of most individual variants are rare. In many new germline variants are in every This of many of the variants population associations for most variants their functional impact is for the majority of the genetic variants, an important this information into information is to determine the impact of the variant on the function of the gene For genes with functions and clinical these variants can used to risk and based on their effect on the understanding the functional in genes that have already been associated with clinical this also the of and, thus, to For genes that are being in clinical association studies, the rare variants within a gene can using gene that can used for associations with the clinical phenotypes. 
In to functionally variants and to the large number of variants that have functional we more and better high-throughput functional we have in this are high-throughput assays for a variety of functional However, many of the needed to assess the large number of variants and it are also needed to and for many of as new variants are they can Furthermore, are many functions that have high-throughput For assays that and as new are to the and of data from multiple and to the functional data by improvements to the clinical of the large of to to and reference This was by for and and the The contributed to this The of