The Chromosome Counts Database (CCDB) – a community resource of plant chromosome numbers

New Phytologist, 2014, Vol. 206(1), pp. 19–26

Anna Rice, Lior Glick, Shiran Abadi, Moshe Einhorn, Naama M. Kopelman, Ayelet Salman‐Minkov, Jonathan Mayzel, Ofer Chay, Itay Mayrose

Abstract

For nearly a century, biologists, and botanists in particular, have been interested in the determination and documentation of chromosome numbers for extant taxa (reviewed in Goldblatt & Lowry, 2011) as well as extinct ones (Laane & Hoiland, 1986; Masterson, 1994). These data have been widely used to evaluate the evolutionary pattern of chromosome number change and to estimate the base chromosome number of clades of interest. Chromosome numbers have also been extensively utilized as an important phylogenetic character in the context of cytotaxonomy (Chatterjee & Kumar Sharma, 1969; Schlarbaum & Tsuchiya, 1984; Guerra, 2012). Perhaps the most influential use of chromosome number data has been in the inference of major genomic events such as whole genome duplications (polyploidy), as well as changes in single chromosome numbers (e.g. dysploidy). Early researchers analyzed the distribution of chromosome numbers within a group of interest and employed various threshold techniques to estimate ploidy levels for the analyzed taxa (Stebbins, 1938; Grant, 1963; Goldblatt, 1980). More recently, phylogenetic information was incorporated into the analyses, allowing researchers to infer transitions in chromosome numbers along branches of the tree using either the maximum parsimony principle (Schultheis, 2001; Hansen et al., 2006; Ohi-Toma et al., 2006; Wood et al., 2009) or by using a probabilistic evolutionary model within the likelihood paradigm (Mayrose et al., 2010; Cusimano et al., 2012; Glick & Mayrose, 2014). Due to their significance and the relative ease by which chromosome numbers can be obtained, it is not surprising that chromosome number is the most extensively and consistently recorded cytological property in most plant families and genera (Guerra, 2008). 
These data have been documented over the years in an array of journal manuscripts, printed books (Löve & Löve, 1948; Darlington & Wylie, 1955; Fedorov, 1969) and, more recently, in the form of online databases (Goldblatt & Johnson, 1979; Watanabe, 2002; Bennett & Leitch, 2011). To date, the most comprehensive data source is the Index to Plant Chromosome Numbers (IPCN; Goldblatt & Johnson, 1979), which provides a reference point to original chromosome counts reported in the literature. IPCN was initially established at the University of California, Berkeley in the 1950s and was later maintained by the Canada Department of Agriculture, the Missouri Botanical Garden, and currently by the International Association for Plant Taxonomy (IAPT). A large portion of the counts referenced during 1979–2006, the years during which IPCN was housed at the Missouri Botanical Garden, can be accessed and searched online. Counts reported in more recent years are currently published under the IAPT/IOPB Chromosome Data series (Marhold, 2006) but are not stored within a central, easily searched database. In addition to IPCN, several other online data sources are available, most of which are dedicated to either a specific geographical region (Slovakia – Marhold et al., 2007; Poland – Góralski et al., 2009 onwards) or to a certain taxonomic group (e.g. Hieracium – Schuhwerk, 1996; Asteraceae – Watanabe, 2002). The number of chromosome counts that exist to date is extensive, and searching the many resources that contain such information is a daunting task, particularly when a large number of taxa are examined. Consequently, many researchers search for chromosome number information only through the largest online database(s), while smaller but nonetheless valuable sources are ignored. This usually results in missing data for some of the species in question, which may lead to erroneous conclusions drawn from the analysis.
Obviously, a large accessible database that unifies all currently known databases, including both printed and online sources, would be of great value to the botanical community and would make the task of data collection much easier. In addition, such a central resource would enable researchers to add new counts as soon as they are reported, facilitating the task of data sharing. Here, we present the Chromosome Counts Database (CCDB) as a community resource of plant chromosome numbers. The database incorporates data from dozens of sources, more than doubling the amount of data available within any single resource. The online database additionally enables researchers to add new counts or to comment on existing data entries, thereby facilitating data sharing. The extensive amount of data currently available in CCDB further allowed us to analyze the patterns of chromosome number distribution among major plant groups. We estimate the percentage of plant species exhibiting intraspecific variation in chromosome numbers as well as in their ploidy levels.
Chromosome counts were collected from a large number of electronic resources, older compendia of chromosome counts in the form of printed books, and an array of miscellaneous sources such as floras, monographs and other scientific manuscripts. The full list of resources is given in Table 1. Data from these sources were collected using the following procedures. Data from several online databases were retrieved directly from the database curator via personal communication in the form of comma-separated value (CSV) files. These include data from the Plant DNA C-values database (Bennett & Leitch, 2011; obtained from Ilia Leitch) and the Chromosome number database of Polish plants (Góralski et al., 2009 onwards; obtained from Grzegorz Góralski). Other online chromosome count databases were downloaded and processed using Perl/Python scripts.
The following online sources were retrieved: IPCN (Goldblatt & Johnson, 1979–), Chilean plants cytogenetic database (Jara-Seguel & Urrutia, 2011), CHROBASE – Chromosome numbers for the Italian flora (Bedini et al., 2010 onwards), BSBI cytology database [accessed 20 June 2013] (http://rbg-web2.rbge.org.uk/BSBI/cytsearch.php), Index to chromosome numbers in Asteraceae (Watanabe, 2002), Published chromosome counts in Hieracium (Schuhwerk, 1996), ChromoPar – Paraguay chromosome counts database [accessed 12 June 2013] (http://www.ub.edu/botanica/cromopar/), Karyological database of the genus Cardamine (Kucera et al., 2005) and Chromosome number survey of the ferns and flowering plants of Slovakia (Marhold et al., 2007). In addition to the online sources already described, we obtained well-known and widely used printed books containing chromosome count indexes. The data in these books were retrieved in the following way: first, the books were scanned to generate image files. Then, using the optical character recognition (OCR) tool of Adobe Pro, the files were converted to ‘textable’ PDF files. This OCR tool was chosen because it exhibited the most accurate performance compared to five other OCR tools in an initial screen of several books. In the next step we used ‘Some PDF to Text Converter’ (available through www.somepdf.com), which converted the PDF files into plain text files that could be parsed automatically using Python scripts. Because this whole automated process suffers from some inaccuracies – particularly due to errors related to the OCR conversion (e.g. occasional confusion between ‘l’, ‘1’, and ‘!’) – thousands of counts were manually verified. In addition, our general approach in processing such sources was to maximize retrieval accuracy rather than data completeness. Consequently, not all data available through the target source were retrieved.
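The OCR confusion described above can be illustrated with a minimal sketch. It is not the pipeline used to build CCDB; the function names and the substitution table are hypothetical, and only the ‘l’/‘!’ confusions come from the text (the ‘I’ and ‘O’ entries are added assumptions). In keeping with the accuracy-over-completeness approach, a field that does not reduce to a plain number is rejected rather than guessed, and any field that needed a substitution is flagged for manual verification.

```python
import re

# Hypothetical table of characters that OCR may emit in place of digits.
# 'l' and '!' for '1' are mentioned in the text; 'I' and 'O' are assumptions.
OCR_DIGIT_FIXES = str.maketrans({"l": "1", "!": "1", "I": "1", "O": "0"})


def parse_count(raw_field):
    """Return the chromosome count as an int, or None if the field is rejected."""
    cleaned = raw_field.strip().translate(OCR_DIGIT_FIXES)
    # Accept only a short run of digits; everything else is dropped, not guessed.
    if re.fullmatch(r"\d{1,3}", cleaned):
        return int(cleaned)
    return None


def needs_manual_check(raw_field):
    """True if the field parsed only after an OCR substitution was applied."""
    return parse_count(raw_field) is not None and not raw_field.strip().isdigit()


if __name__ == "__main__":
    for field in ["24", "l4", "2!", "2n"]:
        print(repr(field), parse_count(field), needs_manual_check(field))
```

Rejecting ambiguous fields loses some data, which mirrors the trade-off stated above: not every count in the target source is retrieved.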
It should be emphasized that occasional errors may still remain (this is particularly so for the compendium published by Fedorov, 1969, for which OCR errors are more abundant due to the Cyrillic font and tables/columns included within the text) and CCDB allows users to report such cases. The following sources were retrieved this way: Chromosome numbers of northern plant species (Löve & Löve, 1948), Chromosome atlas of flowering plants (Darlington & Wylie, 1955), Cytotaxonomical atlas of the Pteridophyta (Löve et al., 1977), Chromosome numbers of flowering plants (Fedorov, 1969), Flora Europaea – checklist and chromosome index (Moore, 1982), Chromosome atlas of flowering plants of the Indian subcontinent; volumes 1 and 2 (Kumar & Subramaniam, 1987a) and Index to plant chromosome numbers for the years 1965–1974 (Ornduff, 1967, 1968; Moore, 1970, 1971, 1973, 1974, 1977). The IPCN volume for the years 1975–1978 (Goldblatt, 1981) was also parsed, but counts were inserted into the database only if the online IPCN database did not already contain them. In addition to dedicated chromosome count databases and hard copy books, a large number of other sources exist that contain information regarding the chromosome number for a given taxon. These resources include floras, monographs and an array of scientific manuscripts. However, automatic retrieval of chromosome number data from such resources is not a trivial task because the data are organized in a source-specific manner (e.g. the botanical description of a given species as it appears in its relevant flora obtained through http://www.efloras.org). Hence, the downloading and processing of each data source were performed using dedicated Perl/Python scripts written specifically for each data source, followed by a manual verification of hundreds of records. As mentioned above, we preferred to maximize data accuracy over data completeness and therefore some fraction of the data available in these sources was not used.
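As an illustration of what a source-specific parser might look like, the sketch below pulls counts out of free-text species descriptions. It assumes, purely for the example, that counts are written as "2n = 24" or "2n = 24, 48"; the pattern, the function name and the sample sentences are hypothetical and are not taken from the scripts used for CCDB. Real sources vary in notation, which is why each source needed its own script and manual verification.

```python
import re

# Assumed notation: "2n = 24" or a comma-separated list such as "2n = 24, 48".
COUNT_PATTERN = re.compile(r"\b2n\s*=\s*(\d+(?:\s*,\s*\d+)*)")


def extract_2n_counts(description):
    """Return every 2n count found in a description, in order of appearance."""
    counts = []
    for match in COUNT_PATTERN.finditer(description):
        # Split the captured list on commas and convert each token to int.
        counts.extend(int(token) for token in re.split(r"\s*,\s*", match.group(1)))
    return counts


if __name__ == "__main__":
    print(extract_2n_counts("Herbs perennial. Fl. May-Jun. 2n = 24, 48."))
    print(extract_2n_counts("Shrubs to 2 m tall; leaves alternate."))
```

A description with no recognizable count yields an empty list, so such records would simply be skipped rather than entered with a guessed value.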
Thousands of chromosome counts were acquired from online floras – eflora [accessed 20 October 2013] (http://www.efloras.org), Flora Iberica [accessed 20 June 2013] (http://www.floraiberica.es), and from the Interactive flora of NW Europe [accessed 20 June 2013] (http://wbd.etibioinformatics.nl/bis/flora.php). In addition to floras, chromosome counts that appear within several Systematic Botany Monographs were retrieved (Saunders, 2000; Bohs, 2001; Freire-Fierro, 2002; Aldasoro et al., 2004; Zuloaga et al., 2004; Thompson, 2005; Wagner et al., 2005; Meudt, 2006; Miller & Chambers, 2006). Scientific manuscripts that contain large amounts of chromosome counts were parsed in a source-specific manner and incorporated into the database. IAPT/IOPB Chromosome Data reports 1–16 (Marhold, 2006) were obtained from the International Organization of Plant Biosystematists website (http://www.iopb.org/) as PDF files, converted to text files and parsed using Perl scripts. In addition, a large number of journal manuscripts that contain counts for a given taxonomic group or geographic region were obtained and parsed in a source-specific procedure. These include data reported in a large number of Mediterranean chromosome number reports (Kamari et al., 1991), as well as large collections available for Araceae (Cusimano et al., 2012), Brassicaceae (Warwick & Al-Shehbaz, 2006), Colchicaceae (Chacón et al., 2014), Cyperaceae (Roalson, 2008), Pinguicula (Casper & Stimper, 2009), and Veroniceae (Albach et al., 2008). The full list of scientific manuscripts that were incorporated into CCDB is available through the database help pages (http://ccdb.tau.ac.il/about/). Finally, chromosome counts datasets that were compiled by individual researchers were obtained via personal communication. These include chromosome numbers of indigenous New Zealand plants obtained from Murray Dawson and chromosome numbers for a large number of Solanaceae species obtained from Emma Goldberg. 
Combining data from multiple sources required a method for standardization of the information, especially regarding the taxonomy of the records. Many plant species have been given different names by different authors. Some of these names are considered synonyms, others are recognized as accepted names, while another fraction is still unresolved. Another common problem is differences in spelling conventions between sources, or simply spelling mistakes, resulting from either manual typing errors in the original source, or incorrect processing of our automatic pipelines. To overcome these difficulties, we used Taxonome (Kluyver & Osborne, 2013), a taxonomic name resolution software that provides the ability to match synonymous taxon names to accepted names while accounting for differences in naming conventions and likely misspellings. As the underlying database for names, we used a local repository of synonymous and accepted names that was created based on The Plant List (TPL) v1.1 (http://www.theplantlist.org/) with some modifications (i.e. for Solanaceae we used Solanaceae Source (http://solanaceaesource.org/) as the primary taxonomic source supplemented with The Plant List for missing taxon names). In case a taxon name could not be matched to a recognized plant name (e.g. due to erroneous OCR processing), the corresponding data entry was excluded from the database. CCDB is available through http://ccdb.tau.ac.il/. Users can access the data by browsing through the taxonomic hierarchy or by searching for a specific genus or species. At each level, all counts can be retrieved as a CSV file. Additionally, users can access the data through the dedicated application programming interface (API), available through http://ccdb.tau.ac.il/services/. Researchers are invited to contribute to the completeness and correctness of the resource. 
This can be achieved by submitting new data originating from resources not yet incorporated into the database, as well as by reporting errors found in the database. We note that, unlike in IPCN, new data entries will not be thoroughly reviewed. Thus, data contributors are strongly encouraged to include supporting information such as a voucher specimen or an image file of the cells analyzed. CCDB encompasses a wide array of resources, the majority of which were previously unavailable in a digitized format. At present, CCDB contains 334 963 data entries, encompassing chromosome counts for 171 338 unique taxon names, including species names and infraspecific names. Following a taxonomic name resolution process that collapsed synonymous names to their accepted names, the number of unique names in CCDB is 77 958 (of these 68 146 are accepted names and 9812 are unresolved according to TPL v1.1). This represents a substantial increase in data coverage compared to IPCN – the largest online resource to date – which has information for a total of 60 167 plant names (48 829 following name resolution). Table 1 specifies the number of counts extracted from each source, as well as the number of unique names before and after name resolution. CCDB includes a total of 8750 genera from 539 families. The coverage of CCDB varies widely across the major plant groups. The current coverage for angiosperms is 19% (58 980 out of 304 419 accepted species as reported in TPL v1.1 – not including data available for infraspecific names). The exact coverage may, however, vary between 12% and 23% depending on the assumed number of angiosperm species, with estimates ranging from 261 750 (Stevens, 2012) to 500 000 if yet undiscovered species are considered (as discussed in Galbraith et al., 2011).
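The angiosperm coverage figures quoted above follow from a simple ratio, which the short sketch below reproduces. The species totals are the ones given in the text; the function and variable names are only illustrative.

```python
def coverage_percent(species_with_counts, total_species):
    """Percentage of species in a group that have at least one chromosome count."""
    return 100.0 * species_with_counts / total_species


# Angiosperm species with counts in CCDB, as stated in the text.
ANGIOSPERMS_WITH_COUNTS = 58_980

# Alternative totals for the number of angiosperm species, as stated in the text.
SPECIES_TOTALS = {
    "TPL v1.1 accepted species": 304_419,
    "Stevens (2012) estimate": 261_750,
    "including undiscovered species": 500_000,
}

if __name__ == "__main__":
    for label, total in SPECIES_TOTALS.items():
        pct = coverage_percent(ANGIOSPERMS_WITH_COUNTS, total)
        print(f"{label}: {pct:.1f}%")
```

Rounded to whole percentages, the three totals give 19%, 23% and 12%, matching the range reported above.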
The estimated coverage for pteridophytes (here and in the online database referred to as the monilophytes and lycophytes clades), bryophytes and gymnosperms is 22% (2350/10 620), 4% (1436/34 556) and 38% (427/1104), respectively. Within the 20 largest angiosperm families (Supporting Information Table S1), the is with counts available for of the taxa out of while the coverage for the largest plant the is out of the 20 largest the is with also some families chromosome data are particularly and should be particularly Some of the families in CCDB include the only out of accepted and In to estimate the completeness of the data obtained through CCDB compared to the of chromosome information (i.e. all counts reported in the we compared the coverage of CCDB relative to that obtained in five of these information in a manner for a specific plant and we as all available data for these – & Stimper, Araceae – Cusimano et al., 2012; Solanaceae – Colchicaceae – et al., – & 2014). In these we the fraction of species in the reference for which information in CCDB while data entries obtained from other resources only the data obtained from the five were already incorporated in As in Table for several such as Araceae and data completeness of CCDB is nearly that obtained by manual However, for other clades (i.e. our data retrieval was not as missing of the data that have been for the data in CCDB a major compared to is currently available through IPCN These results the for a community to the amount of chromosome number information that has been over the but appears within scientific manuscripts and is the chromosome counts data in we next the distribution of the chromosome numbers within each of the major plant groups. 
In case more than was available for a certain the was as the As has been in ferns & are more numbers than ones the whole database the chromosome number for taxa is and for it is Table resulting in a pattern As by & this pattern can be by because a genome will in an number while other changes in chromosome numbers (e.g. via can lead to both and numbers. the chromosome number distribution varies between the major plant In monilophytes a known to particularly chromosome numbers (reviewed in 2013), the most common number is followed by with at and that are exact duplications of the most common numbers. Additionally, while out of of the species an chromosome the counts than the number counts of the species have an that chromosome number are the of In are the chromosome counts originating from and a that includes counts from and and a of species. In angiosperms as is also in the distribution obtained for the number is more and is and the pattern is for chromosome numbers than of angiosperms have an the changes the major – the between and numbers 12 is (i.e. more than for and over it is As as chromosome numbers are it that plants chromosome numbers have a so that its has been by the angiosperm were to have more events compared to & the pattern for is particularly with an of the of In gymnosperms – a group in which is considered et al., – is a percentage of counts However, this is due to the of 12 of all and the pattern is not In pattern was with a between and we the by which chromosome number varies within species and infraspecific taxa (i.e. and from the corresponding that is within species and infraspecific existing in of taxa in our of taxa were reported with counts and with or more this at the species (i.e. 
by all infraspecific names to their corresponding that intraspecific variation in chromosome numbers in out of of species in our database of species were reported with counts and with or the of the of species with multiple counts is across the major and for bryophytes and These are an due to the of the database (i.e. not all reported are included in and the of many were not The multiple that exist within nearly of plant species that only the but not the genomic (e.g. chromosome and that both as a of major genomic such as As by et a fraction of such intraspecific through In many these should be as species under most used species Thus, we the to which intraspecific variation in chromosome numbers can be to using a To this for each species the ploidy index for all its was as the relative to the chromosome number found in that species (e.g. if the reported counts for a certain species were and 20 the were and As in a large fraction of the intraspecific variation is due to the most common is which to a single whole genome next are the and each corresponding to chromosome number changes due to In addition, the 1 is and could be by events as chromosome and while another is corresponding to the of In to evaluate the relative of to intraspecific chromosome number variation compared to other of a threshold of was used. 
that this threshold can be used to events to from of the intraspecific variation is due to are due to other of chromosome number In our that of plant species intraspecific variation in their ploidy levels – than the estimate by Wood et reported that of angiosperms and of species that including the multiple ploidy levels to in and in in monilophytes and in lycophytes in our The in estimates from the different used by Wood et in their estimate using a threshold of but also due to the data incorporated in Here, we the Chromosome Counts as a community resource for plant CCDB represents a step data coverage and for certain clades data completeness is still CCDB may other such as the collection of C-values et al., 2011) by out taxonomic collection could be particularly The current coverage for angiosperms in CCDB is while Bennett estimated this number to be the in these estimates may also from the number of angiosperm species are data that CCDB not For in bryophytes 4% of the species have information in while et estimated the coverage for bryophytes to be the reported by et was based on printed sources & which in the current of CCDB were not in coverage can be by the community either by directly through the CCDB website or by data in the form of a which can be automatically processed using the in the of CCDB was to an extensive, yet within which data can be by the facilitating data for a wide array of the pattern of chromosome number We Emma and Murray Dawson for us with extensive chromosome number Ilia for a CSV file of the Plant DNA database and Grzegorz Góralski for a CSV file of the chromosome number database of Polish for a scanned copy of the Cytotaxonomical atlas of the and and an for This was by a from the in & to and by the number are not for the or of any supporting information by the authors. 
than missing should be to the New The distribution of chromosome numbers in CCDB across all taxa and Table Data coverage in CCDB for the 20 largest plant families The is not for the or of any supporting information by the authors. than missing should be to the corresponding for the
