Clustering protein functional families at large scale with hierarchical approaches
Abstract
Proteins are fundamental to cellular activity, and their structure and sequence reveal their function and evolution. CATH functional families (FunFams) are coherent clusters of protein domain sequences whose members share a conserved function. The increasing volume and complexity of protein data, enabled by large-scale repositories such as MGnify and the AlphaFold Database, require more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms that build upon and address limitations of GeMMA/FunFHMMER, our original methods for classifying proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, supporting future studies of protein function and evolution.
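As a rough illustration of what forming a relationship tree from a distance matrix can look like, the sketch below runs generic average-linkage agglomerative clustering on toy vectors. It is not the CATH-eMMA implementation: the linkage criterion, the cosine distance, the function names (`cosine_distance`, `distance_matrix`, `build_tree`), and the domain labels and vectors are all assumptions chosen for illustration. Real inputs would be protein embeddings or Foldseek-derived distances.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(embeddings):
    """Map each unordered pair of names to the distance between their vectors."""
    return {
        frozenset((a, b)): cosine_distance(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


def build_tree(names, dist):
    """Average-linkage agglomerative clustering (an assumed criterion).

    Returns the merge order as a nested tuple, e.g. (("A", "B"), "C").
    """
    # Each cluster id maps to (subtree, list of leaf names in that subtree).
    clusters = {i: (name, [name]) for i, name in enumerate(sorted(names))}
    next_id = len(clusters)

    def linkage(members_a, members_b):
        # Mean pairwise distance between the leaves of two clusters.
        total = sum(dist[frozenset((x, y))] for x in members_a for y in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Merge the closest pair of clusters at each step.
        i, j = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: linkage(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        tree_i, members_i = clusters.pop(i)
        tree_j, members_j = clusters.pop(j)
        clusters[next_id] = ((tree_i, tree_j), members_i + members_j)
        next_id += 1

    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical toy "embeddings" for four domains (not real data).
toy_embeddings = {
    "dom_A": [1.0, 0.0, 0.0],
    "dom_B": [0.9, 0.1, 0.0],
    "dom_C": [0.0, 1.0, 0.0],
    "dom_D": [0.0, 0.8, 0.2],
}

tree = build_tree(list(toy_embeddings), distance_matrix(toy_embeddings))
print(tree)  # (('dom_A', 'dom_B'), ('dom_C', 'dom_D'))
```

Because `build_tree` only consumes pairwise distances, the same function would accept distances from any source, which is one plausible reading of how a distance-matrix design can handle different data types.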