OncoVar: an integrated database and analysis platform for oncogenic driver variants in cancers

Data sources used in OncoVar

Driver mutation sources	OncoKB, FASMIC, CGI and PMID25348012_GB
OncoKB	3,121 driver mutations, selected on the basis of comprehensive, curated and evidence-based information about individual somatic mutations and structural alterations present in patient tumors.	http://oncokb.org/dataAccess
FASMIC	316 human driver mutations based on experimental evidence on the functional impacts of somatic mutations detected in human cancers.	https://ibl.mdanderson.org/fasmic/
CGI	3,805 human driver mutations retrieved by combining the data contained in the DoCM, ClinVar and OncoKB databases as well as the results of several published experimental assays and additional manual curation effort.	https://www.cancergenomeinterpreter.org/mutations/
PMID25348012_GB	3,589 human driver mutations predicted by combining 15 mutation effect prediction algorithms.	Additional file 2
Driver gene sources	2020Rule, CGC, CGCpointMut, HCD, MouseMut, Oncogene, OncoKB, FASMIC, CTAT, TSGene, Intogen, PMID_29056346, PMID_29056346_collected_known, TCGA_papers and MutPanning
2020Rule	125 cancer genes based on the characteristic mutational patterns for oncogenes and tumor suppressor genes.	Table S2A
CGC	723 cancer genes from the Cancer Gene Census database (CGC), including 576 high-confidence genes (Tier=1) and 147 lower-confidence genes (Tier=2). Tier 1: genes with documented activity relevant to cancer, along with evidence of mutations in cancer that change the activity of the gene product in a way that promotes oncogenic transformation. Tier 2: genes with strong indications of a role in cancer but with less extensive available evidence.	Cancer Gene Census (Tier=1 or 2)
CGCpointMut	A CGC subset of 118 cancer genes which act in cancer via point mutations (CGCpointMut).	CGCpointMut file in Cancer_GeneSet folder of MUFFINN software package
HCD	291 high-confidence driver genes based on a rule-based approach (HCD).	High Confidence Driver
MouseMut	797 human orthologs of mouse cancer genes identified by insertional mutagenesis (MouseMut).	http://evs.gs.washington.edu/EVS/
Oncogene	803 human oncogenes from the first literature-based database for oncogenes.	http://ongene.bioinfo-minzhao.org/
OncoKB	1,059 driver genes, selected on the basis of comprehensive, curated and evidence-based information about individual somatic mutations and structural alterations present in patient tumors.	http://oncokb.org/dataAccess
FASMIC	93 driver genes based on experimental evidence on the functional impacts of somatic mutations detected in human cancer.	https://bioinformatics.mdanderson.org/public-datasets/
CTAT	299 driver genes identified in the paper “Comprehensive Characterization of Cancer Driver Genes and Mutations”, based on analysis of biological processes and pathways.	Table S1
TSGene	983 tumor suppressor genes from literature-based knowledgebase TSGene 2.0.	TSGene
Intogen	459 human driver genes downloaded from the Intogen database.	https://www.intogen.org/download
PMID_29056346	180 driver genes identified as putatively under positive selection.	Table S2
PMID_29056346_collected_known	369 high-confidence driver genes collected from the COSMIC database (174 COSMIC classic genes, version 73), Lawrence et al. (2014) (219 significantly mutated genes) and the authors' literature search (204 significantly mutated genes).	Table S3
TCGA_papers	132 significantly mutated genes collected from PMID_25109877 and PMID_25079552.	supplement-03 and Table S4
MutPanning	460 driver genes selected by the MutPanning algorithm, which considers the nucleotide context around mutations in its statistical model.	http://www.cancer-genes.org/
Two additional gene annotation sources	Therapeutic_target_in_PMID_30971826 and DriverDBv3
Therapeutic_target_in_PMID_30971826	628 priority targets classified as groups 1, 2 and 3 based on tractability for drug development.	Supplementary Table 9 - b
DriverDBv3	Significantly mutated genes; the number indicates how many mutation tools identify the gene as a mutation driver.	http://driverdb.tms.cmu.edu.tw/
Allele frequency	dbSNP, gnomAD, ExAC, 1000Genomes, ESP, Kaviar, HRC, CG69
dbSNP	The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms.	https://www.ncbi.nlm.nih.gov/snp
gnomAD	The Genome Aggregation Database (gnomAD) is a coalition of investigators seeking to aggregate and harmonize exome and genome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided on this website spans 123,136 exomes and 15,496 genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies.	http://gnomad.broadinstitute.org/
ExAC	The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a variety of large-scale sequencing projects, and to make summary data available for the wider scientific community. The data set provided on this website spans 60,706 unrelated individuals sequenced as part of various disease-specific and population genetic studies.	http://exac.broadinstitute.org/
1000Genomes	The 1000 Genomes Project was the first project to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. In the final phase of the project, data from 2,504 samples were combined to allow highly accurate assignment of the genotypes in each sample at all the variant sites the project discovered. The data came from 26 populations, including African, Admixed American, East Asian, European and South Asian populations.	http://www.1000genomes.org/
ESP	This dataset comes from the NHLBI GO Exome Sequencing Project (ESP) and its ongoing studies, which produced and provided exome variant calls for comparison. The current EVS data release (ESP6500SI-V2) is taken from 6,503 samples drawn from multiple ESP cohorts and represents all of the ESP exome variant data.	http://evs.gs.washington.edu/EVS/
Kaviar	Kaviar is a compilation of SNVs, indels, and complex variants observed in humans, designed to facilitate testing for the novelty and frequency of observed variants. Kaviar contains 162 million SNV sites (including 25M not in dbSNP) and incorporates data from 35 projects encompassing 77,781 individuals (13.2K whole genome, 64.6K exome).	http://db.systemsbiology.net/kaviar/
HRC	The Haplotype Reference Consortium (HRC) is used for genotype imputation and phasing in other cohorts, typically genome-wide association studies (GWAS), where genotypes are available from genome-wide SNP microarrays. It contains haplotypes from individuals of predominantly European ancestry, although the HRC includes the 1000 Genomes Project data. The first release consists of 64,976 haplotypes at 39,235,157 SNPs, all with an estimated minor allele count of greater than 5.	http://www.haplotype-reference-consortium.org
CG69	The database includes 69 DNA samples sequenced using the Complete Genomics Standard Sequencing Service, which includes whole genome sequencing, mapping of the resulting reads to a human reference genome, comprehensive detection of variations, scoring, and informative annotation.	http://www.completegenomics.com/public-data/69-Genomes/
Missense prediction	SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, PROVEAN, MetaSVM, MetaLR, VEST, M-CAP, CADD, GERP++, DANN, fathmm-MKL, Eigen, GenoCanyon, fitCons, PhyloP, PhastCons, SiPhy, REVEL, dbNSFP
SIFT	SIFT predicts whether an amino acid substitution affects protein function. SIFT prediction is based on the degree of conservation of amino acid residues in sequence alignments derived from closely related sequences, collected through PSI-BLAST. SIFT can be applied to naturally occurring nonsynonymous polymorphisms or laboratory-induced missense mutations.	http://sift.jcvi.org
PolyPhen2_HDIV	PolyPhen-2 is a tool that predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumDiv-trained PolyPhen-2 is used for evaluating rare alleles at loci potentially involved in complex phenotypes, dense mapping of regions identified by genome-wide association studies, and analysis of natural selection from sequence data, where even mildly deleterious alleles must be treated as damaging.	http://genetics.bwh.harvard.edu/pph2
PolyPhen2_HVAR	PolyPhen-2 is a tool that predicts the possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations. HumVar-trained PolyPhen-2 is used for diagnostics of Mendelian diseases, which requires distinguishing mutations with drastic effects from all the remaining human variation, including abundant mildly deleterious alleles.	http://genetics.bwh.harvard.edu/pph2
LRT	A likelihood ratio test (LRT) can accurately identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein-coding sequences, which are likely to be unconditionally deleterious.	http://www.genetics.wustl.edu/jflab/lrt_query.html
MutationTaster	MutationTaster employs a Bayes classifier to eventually predict the disease potential of an alteration. The Bayes classifier is fed with the outcome of all tests and the features of the alterations and calculates probabilities for the alteration to be either a disease mutation or a harmless polymorphism.	http://www.mutationtaster.org
MutationAssessor	MutationAssessor predicts the functional impact of amino-acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms. The functional impact is assessed based on evolutionary conservation of the affected amino acid in protein homologs.	http://mutationassessor.org
FATHMM	Functional Analysis through Hidden Markov Models (FATHMM) is specifically designed for non-synonymous single nucleotide variants (nsSNVs).	http://fathmm.biocompute.org.uk
PROVEAN	Protein Variation Effect Analyzer (PROVEAN) is a software tool which predicts whether an amino acid substitution or indel has an impact on the biological function of a protein. It is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important.	http://provean.jcvi.org/
MetaSVM	MetaSVM is an ensemble scoring method for deleterious missense mutations. It integrates nine deleteriousness prediction scores and maximum minor allele frequency for more accurate and comprehensive evaluation of the deleteriousness of missense mutations.	https://www.ncbi.nlm.nih.gov/pubmed/25552646
MetaLR	MetaLR is an ensemble scoring method for deleterious missense mutations. It achieves the highest discriminative power compared to all eighteen existing deleteriousness prediction scores, which demonstrated the value of combining information from multiple orthologous approaches.	https://www.ncbi.nlm.nih.gov/pubmed/25552646
VEST 3.0	The Variant Effect Scoring Tool (VEST) 3.0 is a machine learning method that predicts the functional significance of missense mutations observed through genome sequencing, allowing mutations to be prioritized in subsequent functional studies, based on the probability that they impair protein activity.	http://wiki.chasmsoftware.org
M-CAP	M-CAP is a pathogenicity classifier for rare missense variants in the human genome that is tuned to the high sensitivity required in the clinic. By combining previous pathogenicity scores (including SIFT, Polyphen-2 and CADD) with novel features and a powerful model, they attain the best classifier at all thresholds, reducing a typical exome/genome rare (<1%) missense variant (VUS) list from 300 to 120, while never mistaking 95% of known pathogenic variants as benign.	http://bejerano.stanford.edu/MCAP
CADD	Combined Annotation Dependent Depletion (CADD) is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome. It is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.	http://cadd.gs.washington.edu/
GERP++	GERP++ is a tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques.	http://mendel.stanford.edu/SidowLab/downloads/gerp/index.html
DANN	DANN is a deep learning approach for annotating the pathogenicity of whole-genome genetic variants. DANN uses the same feature set and training data as CADD to train a deep neural network (DNN). DNNs can capture non-linear relationships among features and are better suited than SVMs for problems with a large number of samples and features.	https://cbcl.ics.uci.edu/public_data/DANN/
fathmm-MKL	fathmm-MKL is capable of predicting the functional effects of protein missense mutations by combining sequence conservation within hidden Markov models (HMMs), representing the alignment of homologous sequences and conserved protein domains, with "pathogenicity weights", representing the overall tolerance of the protein/domain to mutations.	http://fathmm.biocompute.org.uk
Eigen	Eigen is a spectral approach to the functional annotation of genetic variants in coding and noncoding regions. Eigen makes use of a variety of functional annotations in both coding and noncoding regions (such as made available by the ENCODE and Roadmap Epigenomics projects), and combines them into one single measure of functional importance.	http://www.columbia.edu/~ii2135/eigen.html
GenoCanyon	GenoCanyon is a statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data. It is a whole-genome functional annotation approach based on unsupervised statistical learning, and it integrates genomic conservation measures and biochemical annotation data to predict the functional potential at each nucleotide.	http://genocanyon.med.yale.edu/
fitCons	The fitness consequences of functional annotation (fitCons) method integrates functional assays (such as ChIP-Seq) with selective pressure inferred using the INSIGHT method. The result is a score ρ in the range [0.0-1.0] that indicates the fraction of genomic positions evincing a particular pattern (or "fingerprint") of functional assay results that are under selective pressure.	http://compgen.cshl.edu/fitCons/
PhyloP	PhyloP scores measure evolutionary conservation at individual alignment sites. They are useful for evaluating signatures of selection at particular nucleotides or classes of nucleotides (e.g., third codon positions, or first positions of miRNA target sites).	http://compgen.bscb.cornell.edu/phast
PhastCons	PHAST is a freely available software package for comparative and evolutionary genomics. It consists of about half a dozen major programs, plus more than a dozen utilities for manipulating sequence alignments, phylogenetic trees, and genomic annotations.	http://compgen.cshl.edu/phast/
SiPhy	SiPhy is an approach that takes advantage of deeply sequenced clades to identify evolutionary selection by uncovering not only signatures of rate-based conservation but also substitution patterns characteristic of sequence undergoing natural selection.	http://portals.broadinstitute.org/genome_bio/siphy/
REVEL	REVEL is an ensemble method for predicting the pathogenicity of missense variants based on a combination of scores from 13 individual tools: MutPred, FATHMM v2.3, VEST 3.0, Polyphen-2, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP++, SiPhy, phyloP, and phastCons. REVEL was trained using recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools.	https://sites.google.com/site/revelgenomics/
dbNSFP	The purpose of the dbNSFP is to provide a one-stop resource for functional predictions and annotations for human nonsynonymous single-nucleotide variants (nsSNVs) and splice-site variants (ssSNVs), and to facilitate the steps of filtering and prioritizing SNVs from a large list of SNVs discovered in an exome-sequencing study.	http://sites.google.com/site/jpopgen/dbNSFP
Disease-related	InterVar, COSMIC, ICGC, TCGA
InterVar	InterVar is a bioinformatics software tool for clinical interpretation of genetic variants by the ACMG/AMP 2015 guideline. The input to InterVar is an annotated file generated from ANNOVAR, while the output of InterVar is the classification of variants into 'Benign', 'Likely benign', 'Uncertain significance', 'Likely pathogenic' and 'Pathogenic', together with detailed evidence code.	http://wintervar.wglab.org/
COSMIC	COSMIC is designed to store and display somatic mutation information and related details, and contains information relating to human cancers. There are two types of data in COSMIC: expert manual curation data and systematic screen data. The information in COSMIC is curated by expert scientists, primarily by scrutinizing large numbers of scientific publications.	http://cancer.sanger.ac.uk/cosmic
ICGC	The International Cancer Genome Consortium (ICGC) generates comprehensive catalogues of genomic abnormalities (somatic mutations, abnormal expression of genes, epigenetic modifications) in tumors from 50 different cancer types and/or subtypes that are of clinical and societal importance across the globe, and makes the data available to the entire research community as rapidly as possible, and with minimal restrictions, to accelerate research into the causes and control of cancer.	https://icgc.org
TCGA	The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between the National Cancer Institute and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.	https://cancergenome.nih.gov/
Gene and Pathway	RefSeq, Ensembl, NCBI, InterPro, Segmental duplication, DGIdb and gene ontology
RefSeq	A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein.	https://www.ncbi.nlm.nih.gov/refseq/
Ensembl	Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.	http://asia.ensembl.org/
NCBI Gene	Gene integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.	https://www.ncbi.nlm.nih.gov/gene/
InterPro	InterPro is a resource that provides functional analysis of protein sequences by classifying them into families and predicting the presence of domains and important sites. To classify proteins in this way, InterPro uses predictive models, known as signatures, provided by several different databases (referred to as member databases) that make up the InterPro consortium. It combines signatures from multiple, diverse databases into a single searchable resource, reducing redundancy and helping users interpret their sequence analysis results.	http://www.ebi.ac.uk/interpro/
Segmental duplication	Segmental duplications are detected with a method that identifies identity between long stretches of genomic sequence despite the presence of high-copy repeats and large insertion-deletions (>90% identity and >1 kb in length).	http://humanparalogy.gs.washington.edu/
DGIdb	The Drug-Gene Interaction database (DGIdb) mines existing resources that generate hypotheses about how mutated genes might be targeted therapeutically or prioritized for drug development. It provides an interface for searching lists of genes against a compendium of drug-gene interactions and potentially 'druggable' genes. It integrates data from 13 primary sources that cover disease-relevant human genes, drugs, drug-gene interactions and potential druggability. Currently, DGIdb contains over 14,144 drug-gene interactions involving 2,611 genes and 6,307 drugs, and in addition it includes 6,761 genes belonging to one or more of 39 potentially druggable gene categories. A total of 7,668 unique genes have either known or potential druggability.	http://dgidb.genome.wustl.edu/
Gene Ontology	The Gene Ontology (GO) project is a major bioinformatics initiative to develop a computational representation of our evolving knowledge of how genes encode biological functions at the molecular, cellular and tissue system levels. The project has developed formal ontologies that represent over 40,000 biological concepts, and are constantly being revised to reflect new discoveries. To date, these concepts have been used to "annotate" gene functions based on experiments reported in over 100,000 peer-reviewed scientific papers.	http://geneontology.org/
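
The table above mixes two kinds of rows: category rows with two fields (category name, list of member sources) and source rows with three fields (source name, description, link). As a minimal sketch, assuming the table is available as tab-separated text, the rows could be grouped by category as follows. The function name `parse_sources` and the shortened sample descriptions are illustrative assumptions, not part of OncoVar.

```python
def parse_sources(text):
    """Group tab-separated rows into {category: [(name, description, link), ...]}.

    Assumption: a row with two non-empty fields opens a new category,
    and a row with three fields is a source in the current category.
    """
    categories = {}
    current = None
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) == 2 and all(fields):
            current = fields[0]
            categories[current] = []
        elif len(fields) == 3 and current is not None:
            categories[current].append(tuple(fields))
        # Any other row (title, blank line) is ignored.
    return categories


# Hypothetical sample: descriptions are shortened for illustration.
SAMPLE = (
    "Driver mutation sources\tOncoKB, FASMIC, CGI and PMID25348012_GB\n"
    "OncoKB\t3,121 driver mutations\thttp://oncokb.org/dataAccess\n"
    "FASMIC\t316 human driver mutations\thttps://ibl.mdanderson.org/fasmic/\n"
    "Disease-related\tInterVar, COSMIC, ICGC, TCGA\n"
    "ICGC\tCatalogues of genomic abnormalities\thttps://icgc.org\n"
)

table = parse_sources(SAMPLE)
for category, rows in table.items():
    print(category, [name for name, _, _ in rows])
```

Counting fields is only a heuristic for this layout; a source row whose link is missing would be mistaken for a category row.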
