Data sources used in OncoVar

Driver mutation sources
OncoKB, FASMIC, CGI and PMID25348012_GB
OncoKB 3,121 driver mutations, selected on the basis of comprehensive, curated, evidence-based information about individual somatic mutations and structural alterations present in patient tumors.
FASMIC 316 human driver mutations based on experimental evidence on the functional impacts of somatic mutations detected in human cancers.
CGI 3,805 human driver mutations retrieved by combining the data contained in the DoCM, ClinVar and OncoKB databases as well as the results of several published experimental assays and additional manual curation effort.
PMID25348012_GB 3,589 human driver mutations predicted by combining 15 mutation effect prediction algorithms. Additional file 2
Driver gene sources
2020Rule, CGC, CGCpointMut, HCD, MouseMut, Oncogene, OncoKB, FASMIC, CTAT, TSGene, Intogen, PMID_29056346, PMID_29056346_collected_known, TCGA_papers and MutPanning
2020Rule 125 cancer genes based on the characteristic mutational patterns for oncogenes and tumor suppressor genes. Table S2A
CGC 723 cancer genes from the Cancer Gene Census database (CGC), including 576 high-confidence genes (Tier=1) and 147 lower-confidence genes (Tier=2).
Tier 1, genes with documented activity relevant to cancer, along with evidence of mutations in cancer which change the activity of the gene product in a way that promotes oncogenic transformation.
Tier 2, genes with strong indications of a role in cancer but with less extensive available evidence.
Cancer Gene Census (Tier=1 or 2)
CGCpointMut A CGC subset of 118 cancer genes which act in cancer via point mutations (CGCpointMut). CGCpointMut file in Cancer_GeneSet folder of MUFFINN software package
HCD 291 high-confidence driver genes based on a rule-based approach (HCD). High Confidence Driver
MouseMut 797 human orthologs of mouse cancer genes identified by insertional mutagenesis (MouseMut).
Oncogene 803 human oncogenes from the first literature database for oncogenes.
OncoKB 1,059 driver genes, selected on the basis of comprehensive, curated, evidence-based information about individual somatic mutations and structural alterations present in patient tumors.
FASMIC 93 driver genes based on experimental evidence on the functional impacts of somatic mutations detected in human cancer.
CTAT 299 driver genes identified from the paper "Comprehensive Characterization of Cancer Driver Genes and Mutations", which are based on biological processes and pathways analysis. Table S1
TSGene 983 tumor suppressor genes from literature-based knowledgebase TSGene 2.0. TSGene
Intogen 459 human driver genes downloaded from the Intogen database.
PMID_29056346 180 driver genes identified as being under putative positive selection. Table S2
PMID_29056346_collected_known 369 high-confidence driver genes collected from the COSMIC database (174 COSMIC classic genes, version 73), Lawrence et al. (2014) (219 significantly mutated genes) and their literature search (204 significantly mutated genes). Table S3
TCGA_papers 132 significantly mutated genes collected from PMID_25109877 and PMID_25079552. supplement-03 and Table S4
MutPanning 460 driver genes selected by the MutPanning algorithm, which considers the nucleotide context around mutations in its statistical model.
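Because the driver gene sources above overlap, a common way to use them together is to count how many sources list a given gene. The following is a minimal sketch of that idea, not OncoVar's own procedure: the gene sets are tiny hypothetical examples, and the function names `count_supporting_sources` and `genes_with_min_support` are made up for illustration.

```python
from collections import Counter

# Hypothetical toy gene sets keyed by source name; the real lists are far larger.
driver_gene_sources = {
    "CGC": {"TP53", "KRAS", "BRAF"},
    "OncoKB": {"TP53", "KRAS", "EGFR"},
    "Intogen": {"TP53", "EGFR"},
    "MutPanning": {"TP53", "BRAF"},
}


def count_supporting_sources(sources):
    """Map each gene to the number of sources that list it."""
    support = Counter()
    for genes in sources.values():
        # Each source contributes at most one vote per gene because it is a set.
        support.update(genes)
    return support


def genes_with_min_support(sources, min_sources):
    """Return the sorted genes listed by at least `min_sources` sources."""
    support = count_supporting_sources(sources)
    return sorted(gene for gene, n in support.items() if n >= min_sources)


support = count_supporting_sources(driver_gene_sources)
print(support["TP53"])
print(genes_with_min_support(driver_gene_sources, 3))
```

Raising `min_sources` trades sensitivity for confidence: a gene reported by many independent sources is less likely to be an artifact of any single method.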
Two additional gene annotation sources
Therapeutic_target_in_PMID_30971826 and DriverDBv3
Therapeutic_target_in_PMID_30971826 628 priority targets classified as groups 1, 2 and 3 based on tractability for drug development. Supplementary Table 9 - b
DriverDBv3 Significantly mutated genes; the number indicates how many mutation analysis tools identify the gene as a mutation driver.
Allele frequency
dbSNP, gnomAD, ExAC, 1000Genomes, ESP, Kaviar, HRC, CG69
dbSNP The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms.
gnomAD The Genome Aggregation Database (gnomAD) is a coalition of investigators seeking to aggregate and harmonize exome and genome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided on this website spans 123,136 exomes and 15,496 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies.
ExAC The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies.
1000Genomes The 1000 Genomes Project was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. In the final phase of the project, data from 2,504 samples were combined to allow highly accurate assignment of the genotypes in each sample at all the variant sites the project discovered. The data came from 26 populations spanning African, Admixed American, East Asian, European and South Asian ancestry.
ESP The NHLBI GO Exome Sequencing Project (ESP) and its ongoing studies produced and provided exome variant calls for comparison. The current EVS data release (ESP6500SI-V2) is taken from 6,503 samples drawn from multiple ESP cohorts and represents all of the ESP exome variant data.
Kaviar Kaviar is a compilation of SNVs, indels, and complex variants observed in humans, designed to facilitate testing for the novelty and frequency of observed variants. Kaviar contains 162 million SNV sites (including 25M not in dbSNP) and incorporates data from 35 projects encompassing 77,781 individuals (13.2K whole genome, 64.6K exome).
HRC The Haplotype Reference Consortium (HRC) panel is used for genotype imputation and phasing in other cohorts, typically genome-wide association studies (GWAS), where genotypes are available from genome-wide SNP microarrays. It contains haplotypes from individuals of predominantly European ancestry, although it also includes the 1000 Genomes Project data. The first release consists of 64,976 haplotypes at 39,235,157 SNPs, all with an estimated minor allele count of greater than 5.
CG69 The database includes 69 DNA samples sequenced by Complete Genomics using its Standard Sequencing Service, which includes whole genome sequencing, mapping of the resulting reads to a human reference genome, comprehensive detection of variations, scoring, and informative annotation.
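Allele frequencies from several of the databases above are often combined by taking the highest frequency reported by any of them and comparing it with a rarity cutoff. The sketch below illustrates this under stated assumptions: the frequency values are invented, the 1% threshold is only an example, treating a variant that is absent from every database as frequency 0.0 is a simplification, and the function names are hypothetical.

```python
def max_population_af(frequencies):
    """Return the highest allele frequency across databases.

    `frequencies` maps a database name to an allele frequency, or to None
    when the variant is not reported there. A variant absent everywhere is
    treated as frequency 0.0 (an assumption of this sketch).
    """
    observed = [af for af in frequencies.values() if af is not None]
    return max(observed) if observed else 0.0


def is_rare(frequencies, threshold=0.01):
    """True if the variant is below `threshold` in every database."""
    return max_population_af(frequencies) < threshold


# Invented example values, for illustration only.
rare_variant = {"gnomAD": 0.0002, "ExAC": 0.0003, "1000Genomes": None, "ESP": 0.0}
common_variant = {"gnomAD": 0.12, "ExAC": 0.10, "1000Genomes": 0.15}

print(max_population_af(rare_variant), is_rare(rare_variant))
print(max_population_af(common_variant), is_rare(common_variant))
```

Using the maximum rather than the mean is the conservative choice for somatic analyses: a variant that is common in any one population is unlikely to be a cancer driver, even if it is rare elsewhere.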
Missense prediction
SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, MetaSVM, MetaLR, VEST, M-CAP, CADD, GERP++, DANN, fathmm-MKL, Eigen, GenoCanyon, fitCons, PhyloP, PhastCons, SiPhy, REVEL, dbNSFP
SIFT SIFT predicts whether an amino acid substitution affects protein function. SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, collected through PSI-BLAST. SIFT can be applied to naturally occurring nonsynonymous polymorphisms or laboratory-induced missense mutations.
PolyPhen2_HDIV PolyPhen-2 is a tool which predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumDiv-trained PolyPhen-2 is used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.
PolyPhen2_HVAR PolyPhen-2 is a tool which predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumVar-trained PolyPhen-2 is used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles.
LRT A likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious.
MutationTaster MutationTaster employs a Bayes classifier to eventually predict the disease potential of an alteration. The Bayes classifier is fed with the outcome of all tests and the features of the alterations and calculates probabilities for the alteration to be either a disease mutation or a harmless polymorphism.
MutationAssessor MutationAssessor predicts the functional impact of amino-acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms. The functional impact is assessed based on evolutionary conservation of the affected amino acid in protein homologs.
FATHMM Functional Analysis through Hidden Markov Models (FATHMM) is specifically designed for non-synonymous single nucleotide variants (nsSNVs).
PROVEAN Protein Variation Effect Analyzer (PROVEAN) is a software tool which predicts whether an amino acid substitution or indel has an impact on the biological function of a protein. It is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important.
MetaSVM MetaSVM is an ensemble scoring method for deleterious missense mutations. It integrates nine deleteriousness prediction scores and maximum minor allele frequency for more accurate and comprehensive evaluation of deleteriousness of missense mutations.
MetaLR MetaLR is an ensemble scoring method for deleterious missense mutations. It achieves the highest discriminative power compared to all eighteen existing deleteriousness prediction scores, which demonstrated the value of combining information from multiple orthologous approaches.
VEST 3.0 The Variant Effect Scoring Tool (VEST) 3.0 is a machine learning method that predicts the functional significance of missense mutations observed through genome sequencing, allowing mutations to be prioritized in subsequent functional studies, based on the probability that they impair protein activity.
M-CAP M-CAP is a pathogenicity classifier for rare missense variants in the human genome that is tuned to the high sensitivity required in the clinic. By combining previous pathogenicity scores (including SIFT, Polyphen-2 and CADD) with novel features and a powerful model, they attain the best classifier at all thresholds, reducing a typical exome/genome rare (<1%) missense variant (VUS) list from 300 to 120, while never mistaking 95% of known pathogenic variants as benign.
CADD Combined Annotation Dependent Depletion (CADD) is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletions variants in the human genome. It is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.
GERP++ GERP++ is a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques.
DANN DANN is a deep learning approach for annotating the pathogenicity of whole-genome genetic variants. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features.
fathmm-MKL fathmm-MKL is capable of predicting the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with "pathogenicity weights", representing the overall tolerance of the protein/domain to mutations.
Eigen Eigen is a spectral approach to the functional annotation of genetic variants in coding and noncoding regions. Eigen makes use of a variety of functional annotations in both coding and noncoding regions (such as made available by the ENCODE and Roadmap Epigenomics projects), and combines them into one single measure of functional importance.
GenoCanyon GenoCanyon is a statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data. It is a whole-genome functional annotation approach based on unsupervised statistical learning that integrates genomic conservation measures and biochemical annotation data to predict the functional potential at each nucleotide.
fitCons The fitness consequences of functional annotation (fitCons) integrates functional assays (such as ChIP-Seq) with selective pressure inferred using the INSIGHT method. The result is a score ρ in the range [0.0-1.0] that indicates the fraction of genomic positions evincing a particular pattern (or "fingerprint") of functional assay results, that are under selective pressure.
PhyloP PhyloP scores measure evolutionary conservation at individual alignment sites. They are useful for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites).
PhastCons PHAST is a freely available software package for comparative and evolutionary genomics. It consists of about half a dozen major programs, plus more than a dozen utilities for manipulating sequence alignments, phylogenetic trees, and genomic annotations.
SiPhy SiPhy is an approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection.
REVEL REVEL is a new ensemble method for predicting the pathogenicity of missense variants based on a combination of scores from 13 individual tools: MutPred, FATHMM v2.3, VEST 3.0, Polyphen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, and phastCons. REVEL was trained using recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools.
dbNSFP The purpose of the dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study.
InterVar InterVar is a bioinformatics software tool for clinical interpretation of genetic variants by the ACMG/AMP 2015 guideline. The input to InterVar is an annotated file generated from ANNOVAR, while the output of InterVar is the classification of variants into 'Benign', 'Likely benign', 'Uncertain significance', 'Likely pathogenic' and 'Pathogenic', together with detailed evidence code.
COSMIC COSMIC is designed to store and display somatic mutation information and related details and contains information relating to human cancers. There are two types of data in COSMIC: expert manual curation data and systematic screen data. The information in COSMIC is curated by expert scientists, primarily by scrutinizing large numbers of scientific publications.
ICGC The International Cancer Genome Consortium (ICGC) generates comprehensive catalogues of genomic abnormalities (somatic mutations, abnormal expression of genes, epigenetic modifications) in tumors from 50 different cancer types and/or subtypes which are of clinical and societal importance across the globe. It makes the data available to the entire research community as rapidly as possible, and with minimal restrictions, to accelerate research into the causes and control of cancer.
TCGA The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between the National Cancer Institute and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.
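Several of the missense predictors in this section report a categorical call alongside their score, and a simple way to summarize them is to count how many tools call a variant deleterious. The sketch below shows one such tally. It is an illustration, not the method used by MetaSVM, MetaLR or REVEL, which train models on the underlying scores. The single-letter codes in `DELETERIOUS_CODES` follow the convention commonly seen in dbNSFP-style annotation and are an assumption that should be checked against the release actually used; the function name and example variant are hypothetical.

```python
# Assumed single-letter codes that count as a "deleterious" call per tool.
# Verify against the annotation release in use before relying on them.
DELETERIOUS_CODES = {
    "SIFT": {"D"},
    "PolyPhen2_HDIV": {"D", "P"},
    "PolyPhen2_HVAR": {"D", "P"},
    "LRT": {"D"},
    "MutationTaster": {"A", "D"},
    "FATHMM": {"D"},
    "PROVEAN": {"D"},
    "MetaSVM": {"D"},
    "MetaLR": {"D"},
}


def deleterious_votes(predictions):
    """Return (votes, called) for one variant.

    `votes` is the number of tools calling the variant deleterious and
    `called` is the number of known tools that made any call. Missing
    calls (None or ".") and unknown tools are skipped.
    """
    votes = 0
    called = 0
    for tool, code in predictions.items():
        if tool not in DELETERIOUS_CODES or code in (None, "."):
            continue
        called += 1
        if code in DELETERIOUS_CODES[tool]:
            votes += 1
    return votes, called


# Hypothetical calls for a single missense variant.
example = {
    "SIFT": "D",
    "PolyPhen2_HDIV": "P",
    "LRT": "N",
    "MutationTaster": "D",
    "FATHMM": "T",
    "PROVEAN": ".",
    "MetaSVM": "D",
}

votes, called = deleterious_votes(example)
print(f"{votes} of {called} tools call this variant deleterious")
```

Reporting the number of tools that made a call alongside the vote count matters because coverage differs between predictors, so 4 of 6 and 4 of 9 are not equivalent evidence.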
Gene and Pathway
RefSeq, Ensembl, NCBI, InterPro, Segmental duplication, DGIdb and gene ontology
RefSeq A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.  
Ensembl Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
NCBI Gene Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.
InterPro InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. It combines signatures from multiple, diverse databases into a single searchable resource, reducing redundancy and helping users interpret their sequence analysis results.
Segmental duplication Segmental duplications are detected with a method that identifies identity between long stretches of genomic sequence despite the presence of high-copy repeats and large insertion-deletions (>90% identity and >1 kb in length).
DGIdb The Drug-Gene Interaction database (DGIdb) mines existing resources that generate hypotheses about how mutated genes might be targeted therapeutically or prioritized for drug development. It provides an interface for searching lists of genes against a compendium of drug-gene interactions and potentially 'druggable' genes. It integrates data from 13 primary sources that cover disease-relevant human genes, drugs, drug-gene interactions and potential druggability. Currently, DGIdb contains over 14,144 drug-gene interactions involving 2,611 genes and 6,307 drugs, and in addition it includes 6,761 genes belonging to one or more of 39 potentially druggable gene categories. A total of 7,668 unique genes have either known or potential druggability.
Gene Ontology The Gene Ontology (GO) project is a major bioinformatics initiative to develop a computational representation of our evolving knowledge of how genes encode biological functions at the molecular, cellular and tissue system levels. The project has developed formal ontologies that represent over 40,000 biological concepts, and are constantly being revised to reflect new discoveries. To date, these concepts have been used to "annotate" gene functions based on experiments reported in over 100,000 peer-reviewed scientific papers.