Challenge Query and Wikidata dataset

The Wikidata Life Science dataset

Wikidata contains items for over 1.1 million genes and 940 thousand proteins from 201 unique taxa. Annotation data on genes and proteins come from several key databases including NCBI Gene [15], Ensembl [16], UniProt [17], InterPro [18], and the Protein Data Bank (PDB) [19]. These annotations include information on protein families, gene functions, protein domains, genomic location, and orthologs, as well as links to related compounds, diseases, and variants.

Genetic variants

Annotations on genetic variants are primarily drawn from CIViC, an open and community-curated database of cancer variants [20]. Variants are annotated with their relevance to disease predisposition, diagnosis, prognosis, and drug efficacy. Wikidata currently contains 1502 items corresponding to human genetic variants, focused on those with a clear clinical or therapeutic relevance.

Chemical compounds including drugs

Wikidata has items for over 150 thousand chemical compounds, including over 3500 items that are specifically designated as medications. Compound attributes are drawn from a diverse set of databases, including PubChem [21], RxNorm [22], IUPHAR Guide to Pharmacology [23-25], NDF-RT [26], and LIPID MAPS [27]. These items typically contain statements describing chemical structure and key physicochemical properties, and links to databases with experimental data (MassBank [28,29], PDB Ligand [30], etc.) and toxicological information (EPA CompTox Dashboard [31]). Additionally, these items contain links to compound classes, disease indications, pharmaceutical products, and protein targets.


Pathways

Wikidata has items for almost three thousand human biological pathways, primarily from two established public pathway repositories: Reactome [32] and WikiPathways [33]. The full details of the different pathways remain with the respective primary sources. Our bots enter data for Wikidata properties such as pathway name, identifier, organism, and the list of component genes, proteins, and chemical compounds. Properties for contributing authors (via ORCID properties [34]), descriptions, and ontology annotations are also being added for Wikidata pathway entries.


Diseases

Wikidata has items for over 16 thousand diseases, the majority of which were created based on imports from the Human Disease Ontology [35], with additional disease terms added from the Monarch Disease Ontology [3]. Disease attributes include medical classifications, symptoms, and relevant drugs, as well as subclass relationships to higher-level disease categories. In instances where the Human Disease Ontology specifies a related anatomic region and/or a causative organism (for infectious diseases), corresponding statements are also added.


References

Whenever practical, the provenance of each statement added to Wikidata was also added in a structured format. References are part of the core data model for a Wikidata statement. References can either cite the primary resource from which the statement was retrieved (including details like version number of the resource), or they can link to a Wikidata item corresponding to a publication as provided by a primary resource (as an extension of the WikiCite project [36]), or both. Wikidata contains over 20 million items corresponding to publications across many domain areas, including a heavy emphasis on biomedical journal articles.
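
To make the reference model concrete, here is a minimal Python sketch that lists the reference blocks attached to one statement in Wikidata's JSON entity format. The property IDs P248 (stated in), P854 (reference URL), and P813 (retrieved) are common reference properties but are not named in the text above, and the statement and item IDs shown are hand-made placeholders, not real data.

```python
# Reference properties commonly used on Wikidata (assumed here, not taken
# from the text above).
STATED_IN = "P248"      # item for the primary resource or a publication
REFERENCE_URL = "P854"  # URL of the cited record
RETRIEVED = "P813"      # date on which the statement was retrieved


def snak_value(snak):
    """Return a simple Python value for a snak, or None if it has no value."""
    if snak.get("snaktype") != "value":
        return None
    value = snak["datavalue"]["value"]
    if isinstance(value, dict):
        # Item values carry an "id"; time values carry a "time" string.
        return value.get("id") or value.get("time")
    return value


def summarize_references(statement):
    """Return one {property: [values]} dict per reference block."""
    summaries = []
    for reference in statement.get("references", []):
        summary = {}
        for prop, snaks in reference.get("snaks", {}).items():
            summary[prop] = [snak_value(snak) for snak in snaks]
        summaries.append(summary)
    return summaries


# Hypothetical statement for illustration; "Q_EXAMPLE_DATABASE" is a placeholder.
example_statement = {
    "mainsnak": {"snaktype": "value", "property": "P2293",
                 "datavalue": {"value": {"id": "Q_EXAMPLE_DISEASE"}}},
    "references": [{
        "snaks": {
            STATED_IN: [{"snaktype": "value", "property": STATED_IN,
                         "datavalue": {"value": {"id": "Q_EXAMPLE_DATABASE"}}}],
            RETRIEVED: [{"snaktype": "value", "property": RETRIEVED,
                         "datavalue": {"value": {"time": "+2019-01-01T00:00:00Z"}}}],
        }
    }],
}

for block in summarize_references(example_statement):
    print(block)
```

A statement that cites both a primary resource and a publication would simply yield a block with two values under the same "stated in" property, or two separate blocks.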

Data processing

The 1.2-terabyte daily dump was used to extract the JSON documents for six different entity types:

  • Disease from Disease Ontology (DOID)
  • Gene (EntrezGene)
  • Protein (UniProt)
  • Compound (ChEMBL)
  • Pathway (WikiPathways)
  • GO components (GO)
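
One way to perform such an extraction, shown here only as a sketch, is to stream the dump line by line and keep entities that carry the matching external-identifier property. The mapping below (P699, P351, P352, P592, P2410, P686) is an assumption about how the six types could be recognized; the text above does not state which criterion was actually used, and the names `TYPE_PROPERTIES`, `iter_dump_entities`, and `entity_types` are illustrative.

```python
import json

# Assumed mapping from entity type to a Wikidata identifier property.
TYPE_PROPERTIES = {
    "disease": "P699",    # Disease Ontology ID
    "gene": "P351",       # Entrez Gene ID
    "protein": "P352",    # UniProt protein ID
    "compound": "P592",   # ChEMBL ID
    "pathway": "P2410",   # WikiPathways ID
    "go_term": "P686",    # Gene Ontology ID
}


def iter_dump_entities(lines):
    """Yield entities from a Wikidata JSON dump read one line at a time.

    The dump is a single JSON array with one entity per line, so each line
    is parsed on its own after stripping the trailing comma.
    """
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def entity_types(entity):
    """Return the names of all types whose identifier property is present."""
    claims = entity.get("claims", {})
    return [name for name, prop in TYPE_PROPERTIES.items() if prop in claims]


# Tiny stand-in for the dump; real use would iterate over an open file,
# for example one opened with gzip.open(path, "rt", encoding="utf-8").
sample_lines = [
    "[",
    '{"id": "Q_EXAMPLE_1", "claims": {"P351": []}},',
    '{"id": "Q_EXAMPLE_2", "claims": {}}',
    "]",
]
for entity in iter_dump_entities(sample_lines):
    print(entity["id"], entity_types(entity))
```

Streaming keeps memory use flat regardless of dump size, which matters for a file of this scale.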

After extraction, the JSON documents were converted to JSON-LD documents and then loaded into six different indexes available in Kibio.

The JSON-LD collection for each type was then converted to RDF N-Triples format and loaded into six different Virtuoso SPARQL endpoints.
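
The sketch below illustrates the general idea of turning entity claims into N-Triples lines. It is a simplification under several assumptions: it goes directly from the Wikidata JSON claims rather than through JSON-LD as described above, it uses the standard Wikidata entity and direct-property IRIs (the endpoints described here may use different ones), it ignores statement ranks, and it skips every value that is not an item or a plain string.

```python
# Standard Wikidata namespaces, assumed for this sketch.
WD = "http://www.wikidata.org/entity/"
WDT = "http://www.wikidata.org/prop/direct/"


def escape_literal(text):
    """Escape the characters that N-Triples requires escaping in literals."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def entity_to_ntriples(entity):
    """Return one N-Triples line per item-valued or string-valued claim."""
    subject = f"<{WD}{entity['id']}>"
    lines = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # "novalue" and "somevalue" snaks carry no object
            value = snak["datavalue"]["value"]
            if isinstance(value, dict) and "id" in value:
                obj = f"<{WD}{value['id']}>"          # link to another item
            elif isinstance(value, str):
                obj = f'"{escape_literal(value)}"'   # external ID or string
            else:
                continue  # quantities, times, etc. are skipped in this sketch
            lines.append(f"{subject} <{WDT}{prop}> {obj} .")
    return lines


# Hypothetical gene entity; the Q identifiers are placeholders.
example_entity = {
    "id": "Q_EXAMPLE_GENE",
    "claims": {
        "P688": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": {"id": "Q_EXAMPLE_PROTEIN"}}}}],
        "P351": [{"mainsnak": {"snaktype": "value",
                               "datavalue": {"value": "1234"}}}],
    },
}
print("\n".join(entity_to_ntriples(example_entity)))
```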

The Challenge simplified query

This is the simplified SPARQL query used for the challenge. It can be run here:

SELECT DISTINCT ?compound ?compoundLabel where {

  # gene has genetic association with a respiratory disease  
  ?gene       wdt:P31    wd:Q7187 .
  ?gene       wdt:P2293  ?diseaseGA .
  ?diseaseGA  wdt:P279*  wd:Q3286546 .

  # gene product is localized to the membrane
  ?gene     wdt:P688   ?protein .
  ?protein  wdt:P681   ?cc .
  ?cc       wdt:P279*  wd:Q14349455 .

  # gene is involved in a pathway with another gene (gene2)
  ?pathway  wdt:P31   wd:Q4915012 ;
            wdt:P527  ?gene ;
            wdt:P527  ?gene2 .
  ?gene2    wdt:P31   wd:Q7187 . 

  # gene2 product has a Ser/Thr protein kinase domain AND known enzyme inhibitor  
  ?gene2     wdt:P688  ?protein2 .
  ?protein2  wdt:P129  ?compound ;
             wdt:P527  wd:Q24787419 .
  ?compound  wdt:P31   wd:Q11173 ;
             wdt:P2868 wd:Q427492 .

  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
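
The query above can also be submitted programmatically. The following Python sketch builds an HTTP GET request for the public Wikidata Query Service at https://query.wikidata.org/sparql using only the standard library. The helper names `build_request` and `run_query` and the User-Agent string are illustrative placeholders, and `run_query` needs network access.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, endpoint=WDQS_ENDPOINT):
    """Build a GET request that asks the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder; replace with a descriptive agent and contact.
            "User-Agent": "challenge-query-example/0.1",
        },
    )


def run_query(query, endpoint=WDQS_ENDPOINT):
    """Send the query and return the decoded JSON response (needs network)."""
    request = build_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Building the request does not touch the network.
request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(request.full_url)
```

Passing the full challenge query as the `query` argument of `run_query` would return the bindings for `?compound` and `?compoundLabel`.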

This is the same query, which can be tested against the Virtuoso SPARQL endpoint here:

PREFIX wdt: <> 
PREFIX wd: <>  

SELECT DISTINCT ?compound ?compoundLabel WHERE {    
   ?gene1      wdt:P31    wd:Q7187 .   
   ?gene1      wdt:P2293  ?disease .   
   ?disease    wdt:P279*  wd:Q3286546 .    

   ?gene1      wdt:P688   ?protein1 .   
   ?protein1   wdt:P681   ?component .   
   ?component  wdt:P279*  wd:Q14349455 .   

   ?pathway    wdt:P31    wd:Q4915012 .   
   ?pathway    wdt:P527   ?gene1 .   
   ?pathway    wdt:P527   ?gene2 .   

   ?gene2      wdt:P31    wd:Q7187 .     
   ?gene2      wdt:P688   ?protein2 .   
   ?protein2   wdt:P129   ?compound .   

   ?protein2   wdt:P527   wd:Q24787419 .    
   ?compound   wdt:P31    wd:Q11173 .   
   ?compound   wdt:P2868  wd:Q427492 .   
   ?compound <> ?compoundLabel . }
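
Both endpoints can return results in the standard SPARQL JSON results format. As a sketch, the function below reduces such a response to distinct compound identifiers and labels. The sample response is hand-made, with placeholder identifiers and labels; it does not reproduce the actual answer to the challenge query.

```python
def compounds_from_results(results):
    """Return sorted, de-duplicated (identifier, label) pairs.

    Expects the SPARQL JSON results layout: results -> bindings, where each
    binding maps a variable name to a dict with a "value" key.
    """
    pairs = set()
    for binding in results["results"]["bindings"]:
        uri = binding["compound"]["value"]
        label = binding.get("compoundLabel", {}).get("value", "")
        # Keep only the last path segment of the entity IRI.
        pairs.add((uri.rsplit("/", 1)[-1], label))
    return sorted(pairs)


# Hypothetical response for illustration only.
sample_results = {
    "head": {"vars": ["compound", "compoundLabel"]},
    "results": {"bindings": [
        {"compound": {"type": "uri",
                      "value": "http://www.wikidata.org/entity/Q_EXAMPLE_B"},
         "compoundLabel": {"type": "literal", "value": "example compound B"}},
        {"compound": {"type": "uri",
                      "value": "http://www.wikidata.org/entity/Q_EXAMPLE_A"},
         "compoundLabel": {"type": "literal", "value": "example compound A"}},
        {"compound": {"type": "uri",
                      "value": "http://www.wikidata.org/entity/Q_EXAMPLE_A"},
         "compoundLabel": {"type": "literal", "value": "example compound A"}},
    ]},
}

for identifier, label in compounds_from_results(sample_results):
    print(identifier, label)
```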

The mandatory query answer of three molecules