The Wikidata Life Science dataset
Wikidata contains items for over 1.1 million genes and 940 thousand proteins from 201 unique taxa. Annotation data on genes and proteins come from several key databases including NCBI Gene [15], Ensembl [16], UniProt [17], InterPro [18], and the Protein Data Bank (PDB) [19]. These annotations include information on protein families, gene functions, protein domains, genomic location, and orthologs, as well as links to related compounds, diseases, and variants.
Genetic variants
Annotations on genetic variants are primarily drawn from CIViC (http://www.civicdb.org), an open and community-curated database of cancer variants [20]. Variants are annotated with their relevance to disease predisposition, diagnosis, prognosis, and drug efficacy. Wikidata currently contains 1502 items corresponding to human genetic variants, focused on those with a clear clinical or therapeutic relevance.
Chemical compounds including drugs
Wikidata has items for over 150 thousand chemical compounds, including over 3500 items which are specifically designated as medications. Compound attributes are drawn from a diverse set of databases, including PubChem [21], RxNorm [22], IUPHAR Guide to Pharmacology [23–25], NDF-RT [26], and LIPID MAPS [27]. These items typically contain statements describing chemical structure and key physicochemical properties, and links to databases with experimental data (MassBank [28,29], PDB Ligand [30], etc.) and toxicological information (EPA CompTox Dashboard [31]). Additionally, these items contain links to compound classes, disease indications, pharmaceutical products, and protein targets.
Pathways
Wikidata has items for almost three thousand human biological pathways, primarily from two established public pathway repositories: Reactome [32] and WikiPathways [33]. The full details of the different pathways remain with the respective primary sources. Our bots enter data for Wikidata properties such as pathway name, identifier, organism, and the list of component genes, proteins, and chemical compounds. Properties for contributing authors (via ORCID properties [34]), descriptions and ontology annotations are also being added for Wikidata pathway entries.
Diseases
Wikidata has items for over 16 thousand diseases, the majority of which were created based on imports from the Human Disease Ontology [35], with additional disease terms added from the Monarch Disease Ontology [3]. Disease attributes include medical classifications, symptoms, relevant drugs, as well as subclass relationships to higher-level disease categories. In instances where the Human Disease Ontology specifies a related anatomic region and/or a causative organism (for infectious diseases), corresponding statements are also added.
References
Whenever practical, the provenance of each statement added to Wikidata was also added in a structured format. References are part of the core data model for a Wikidata statement. References can either cite the primary resource from which the statement was retrieved (including details like version number of the resource), or they can link to a Wikidata item corresponding to a publication as provided by a primary resource (as an extension of the WikiCite project [36]), or both. Wikidata contains over 20 million items corresponding to publications across many domain areas, including a heavy emphasis on biomedical journal articles.

Data processing
The daily Wikidata dump (about 1.2 terabytes) was used to extract the JSON documents for six different entity types:
- Disease from the Disease Ontology (DOID)
- Gene (EntrezGene)
- Protein (UniProt)
- Compound (ChEMBL)
- Pathway (WikiPathways)
- GO components (GO)
After extraction, the JSON documents were converted to JSON-LD documents and then loaded into six different indexes available in Kibio.
The JSON-LD collection for each type was then converted to RDF N-Triples format and loaded into six different Virtuoso SPARQL endpoints.
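The steps above can be sketched in Python as follows. This is a minimal illustration, not the pipeline that was actually used: the mapping from Wikidata external-identifier properties to the six types is an assumption (the text does not say how entities were selected), the function names are hypothetical, the intermediate JSON-LD step is skipped, and only string-valued and item-valued statements are emitted. The two namespaces and the label predicate are taken from the Virtuoso query shown further down.

```python
import json

# ASSUMPTION: an entity belongs to a type if it carries the matching
# external-identifier property. The source does not state the real criterion.
TYPE_BY_PROPERTY = {
    "P699": "disease",   # Disease Ontology ID
    "P351": "gene",      # Entrez Gene ID
    "P352": "protein",   # UniProt protein ID
    "P592": "compound",  # ChEMBL ID
    "P2410": "pathway",  # WikiPathways ID
    "P686": "go",        # Gene Ontology ID
}

# Namespaces as declared by the PREFIX lines of the Virtuoso query.
WD = "http://bio2rdf.org/wikidata:"
WDT = "http://bio2rdf.org/wikidata_voc:"


def parse_dump_line(line):
    """Parse one line of the Wikidata JSON dump (one entity per line,
    wrapped in a single large array, with trailing commas)."""
    line = line.strip().rstrip(",")
    if line in ("", "[", "]"):
        return None
    return json.loads(line)


def classify(entity):
    """Return the first matching type name, or None if no property matches."""
    claims = entity.get("claims", {})
    for prop, kind in TYPE_BY_PROPERTY.items():
        if prop in claims:
            return kind
    return None


def literal(text):
    """Quote a string as an N-Triples literal. JSON escaping is used as an
    approximation of N-Triples escaping."""
    return json.dumps(text, ensure_ascii=False)


def to_ntriples(entity):
    """Emit one N-Triples line per English label and per simple statement."""
    subject = f"<{WD}{entity['id']}>"
    triples = []
    label = entity.get("labels", {}).get("en", {}).get("value")
    if label is not None:
        triples.append(f"{subject} <{WDT}label> {literal(label)} .")
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # skip "novalue" and "somevalue" statements
            value = snak["datavalue"]
            if value["type"] == "wikibase-entityid" and "id" in value["value"]:
                triples.append(f"{subject} <{WDT}{prop}> <{WD}{value['value']['id']}> .")
            elif value["type"] == "string":
                triples.append(f"{subject} <{WDT}{prop}> {literal(value['value'])} .")
    return triples
```

In use, each line of the decompressed dump would go through `parse_dump_line`, and the triples of every entity for which `classify` returns a type would be appended to the N-Triples file for that type.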
The Challenge simplified query
This is the simplified SPARQL query used for the challenge. It can be run here: https://query.wikidata.org
SELECT DISTINCT ?compound ?compoundLabel WHERE {
  # gene has genetic association with a respiratory disease
  ?gene wdt:P31 wd:Q7187 .
  ?gene wdt:P2293 ?diseaseGA .
  ?diseaseGA wdt:P279* wd:Q3286546 .
  # gene product is localized to the membrane
  ?gene wdt:P688 ?protein .
  ?protein wdt:P681 ?cc .
  ?cc wdt:P279* wd:Q14349455 .
  # gene is involved in a pathway with another gene (gene2)
  ?pathway wdt:P31 wd:Q4915012 ;
           wdt:P527 ?gene ;
           wdt:P527 ?gene2 .
  ?gene2 wdt:P31 wd:Q7187 .
  # gene2 product has a Ser/Thr protein kinase domain AND known enzyme inhibitor
  ?gene2 wdt:P688 ?protein2 .
  ?protein2 wdt:P129 ?compound ;
            wdt:P527 wd:Q24787419 .
  ?compound wdt:P31 wd:Q11173 ;
            wdt:P2868 wd:Q427492 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
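A query like this can also be submitted programmatically rather than through the web form. The sketch below uses only the Python standard library and the standard SPARQL JSON results format. The function names and the User-Agent string are illustrative, and the endpoint URL `https://query.wikidata.org/sparql` is the service address behind the page linked above rather than something stated in this document.

```python
import json
import urllib.parse
import urllib.request


def build_request(endpoint, query):
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            # Illustrative value; public endpoints usually expect a descriptive agent.
            "User-Agent": "challenge-query-sketch/0.1",
        },
    )


def rows_from_results(document, variables):
    """Flatten a SPARQL JSON results document into tuples of plain values.
    Unbound variables are returned as None."""
    rows = []
    for binding in document["results"]["bindings"]:
        rows.append(tuple(binding.get(name, {}).get("value") for name in variables))
    return rows


def run_query(endpoint, query, variables, timeout=60):
    """Send the query over the network and return the flattened rows."""
    with urllib.request.urlopen(build_request(endpoint, query), timeout=timeout) as response:
        return rows_from_results(json.load(response), variables)
```

Calling `run_query("https://query.wikidata.org/sparql", query_text, ["compound", "compoundLabel"])` with the query text above would return one `(URI, label)` tuple per compound. The result depends on the current state of Wikidata, so it may differ from the answer recorded at the time of the challenge.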
This is the same query, adapted so that it can be tested on the Virtuoso SPARQL endpoint here: http://wikidata-challenge.bio2rdf.org/sparql
PREFIX wdt: <http://bio2rdf.org/wikidata_voc:>
PREFIX wd: <http://bio2rdf.org/wikidata:>
SELECT DISTINCT ?compound ?compoundLabel WHERE {
  ?gene1 wdt:P31 wd:Q7187 .
  ?gene1 wdt:P2293 ?disease .
  ?disease wdt:P279* wd:Q3286546 .
  ?gene1 wdt:P688 ?protein1 .
  ?protein1 wdt:P681 ?component .
  ?component wdt:P279* wd:Q14349455 .
  ?pathway wdt:P31 wd:Q4915012 .
  ?pathway wdt:P527 ?gene1 .
  ?pathway wdt:P527 ?gene2 .
  ?gene2 wdt:P31 wd:Q7187 .
  ?gene2 wdt:P688 ?protein2 .
  ?protein2 wdt:P129 ?compound .
  ?protein2 wdt:P527 wd:Q24787419 .
  ?compound wdt:P31 wd:Q11173 .
  ?compound wdt:P2868 wd:Q427492 .
  ?compound <http://bio2rdf.org/wikidata_voc:label> ?compoundLabel .
}
The mandatory query answer of three molecules
URI | Label |
---|---|
wikidata:Q7376181 | ruboxistaurin |
wikidata:Q423111 | vemurafenib |
wikidata:Q5957181 | staurosporine |
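
The two endpoints return the same items under different URI schemes, so a result set is easiest to check after reducing each URI to its Q identifier. The following Python sketch does that and reports which of the three mandatory molecules are missing. The function names are hypothetical, and the check assumes that extra compounds in the result are acceptable as long as the three listed ones are present.

```python
import re

# The three mandatory answers, copied from the table above.
EXPECTED = {
    "Q7376181": "ruboxistaurin",
    "Q423111": "vemurafenib",
    "Q5957181": "staurosporine",
}


def qid(uri):
    """Extract the trailing Q identifier from a Wikidata or Bio2RDF style URI."""
    match = re.search(r"(Q\d+)$", uri)
    if match is None:
        raise ValueError(f"no Q identifier at the end of {uri!r}")
    return match.group(1)


def check_answer(rows):
    """Given (uri, label) rows, return (missing, extra) lists of Q identifiers.
    'missing' lists mandatory molecules absent from the rows; 'extra' lists
    returned items that are not among the mandatory three."""
    found = {qid(uri) for uri, _label in rows}
    missing = sorted(set(EXPECTED) - found)
    extra = sorted(found - set(EXPECTED))
    return missing, extra
```

An empty `missing` list means the mandatory answer is covered, whichever endpoint produced the rows.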