Querying the Bgee Knowledge Graph with SPARQL

Introduction

In this tutorial, we will demonstrate how to build complex queries to retrieve gene expression information. We will build them step-by-step based on simple queries. The language used for querying the Bgee knowledge graph is SPARQL. The Bgee graph was built based on the GenEx semantic model.

Bgee has a SPARQL endpoint based on the EasyBgee database (see the documentation on the Bgee pipeline GitHub). EasyBgee is a view of the Bgee database that contains its most useful and explicit information.

Bgee SPARQL queries can be run using the Bgee SPARQL endpoint or the Bio-Query search web interface created for the BioSODA project. In Bio-Query, Bgee-specific queries are listed under the category Bgee database queries, and clicking the Show SPARQL Query Editor button displays each SPARQL query for editing. Bio-Query also supports federated queries across the UniProt, OMA, and Bgee SPARQL endpoints.

The following sections describe queries that can be run directly on our SPARQL endpoint webpage. The Bgee SPARQL endpoint can also be queried from your preferred programming language, for example with the SPARQLWrapper package for Python or the SPARQL package for R.
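As a minimal sketch of such programmatic access, the snippet below uses only the Python standard library rather than SPARQLWrapper. It assumes that the endpoint at https://www.bgee.org/sparql/ (the address given in the programmatic access section of this tutorial) follows the standard SPARQL protocol, that is, it accepts the query text in a `query` URL parameter and returns JSON when asked for `application/sparql-results+json`. The function names `build_request` and `run_select` are illustrative only.

```python
import json
import urllib.parse
import urllib.request

# Endpoint address quoted later in this tutorial; adjust it if it changes.
BGEE_ENDPOINT = "https://www.bgee.org/sparql/"


def build_request(query, endpoint=BGEE_ENDPOINT):
    """Prepare an HTTP GET request carrying a SPARQL query (no network access yet)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query, endpoint=BGEE_ENDPOINT, timeout=60):
    """Send a SELECT query and return one dict per result row (variable -> value)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=timeout) as resp:
        payload = json.load(resp)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]
```

Calling `run_select` with the text of any query from the following sections should return its rows, provided the endpoint is reachable and behaves as assumed.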

To improve readability, all reserved words of the SPARQL query language are written in capital letters. As per the SPARQL language:

  • All variables start with a question mark (?).
  • Graph patterns are stated as triples, each ended with a full stop: subject predicate object .
  • Results are projected through the variables listed in the query header, after a reserved word such as SELECT.

Querying species

Q01 is a SPARQL query to retrieve species that are present in Bgee.

Q01:

Question: What are the species present in Bgee? SPARQL query:

PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?species {
	?species a up:Taxon .
}

All species are defined as a up:Taxon where up: is the prefix for http://purl.uniprot.org/core/ (the UniProtKB core ontology). See below a graphical representation of the Q01 query:

A species in Bgee may have the following attributes (i.e., properties):

  • up:scientificName (always present): the scientific name of a species such as "Homo sapiens".
  • up:commonName (optionally stated): the common name of a species such as "human". Whether this attribute is present depends on the species.
  • up:rank (always present): the taxonomic rank, that is, the relative level of a group of organisms (a taxon) in an ancestral or hereditary hierarchy (e.g., species, kingdom, family). Currently, in Bgee, only the "species" rank is stated (i.e., up:rank is always up:Species).

Q02 is the whole SPARQL query including all direct attributes for up:Taxon along with its graphical representation.

Q02:

Question: What are the species present in Bgee and their scientific and common names? SPARQL query:

PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?species ?sci_name ?common_name {
	?species a up:Taxon .
	?species up:scientificName ?sci_name .
	?species up:rank up:Species .
	OPTIONAL { ?species up:commonName ?common_name . }
}

To run this query click here.
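Because up:commonName sits in an OPTIONAL block, a species without a common name still appears in the results, with ?common_name left unbound. In the standard SPARQL JSON results format, an unbound variable is simply omitted from the row. The Python sketch below reads such rows; the sample payload is hand-written for illustration (its second species is a made-up placeholder), not actual endpoint output.

```python
import json

# Hand-written sample in the SPARQL JSON results layout (NOT real Bgee output).
SAMPLE = json.loads("""
{"head": {"vars": ["species", "sci_name", "common_name"]},
 "results": {"bindings": [
   {"species": {"type": "uri", "value": "http://purl.uniprot.org/taxonomy/9606"},
    "sci_name": {"type": "literal", "value": "Homo sapiens"},
    "common_name": {"type": "literal", "value": "human"}},
   {"species": {"type": "uri", "value": "http://example.org/taxonomy/0"},
    "sci_name": {"type": "literal", "value": "Placeholder species"}}
 ]}}
""")


def rows(payload):
    """Yield one dict per row; unbound (OPTIONAL) variables become None."""
    variables = payload["head"]["vars"]
    for binding in payload["results"]["bindings"]:
        yield {v: binding.get(v, {}).get("value") for v in variables}


for row in rows(SAMPLE):
    print(row["sci_name"], "-", row["common_name"] or "(no common name)")
```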

Q02 graphical representation:

Querying gene expression profile

All anatomical entities and developmental stages are represented with the UBERON ontology or related ontologies as controlled vocabularies.

Where is a gene expressed?

Genes that are expressed in a tissue, organ, or cell (i.e., an anatomical entity in general) are represented with the relation genex:isExpressedIn; its corresponding relation obo:RO_0002206 is stated too. The query Q03 (see its graphical representation) retrieves the anatomical entities where the "APOC1" gene is expressed.

Q03:

Question: What are the anatomical entities where the "APOC1" gene is expressed? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?anat ?anatName {
	?seq a orth:Gene .
	?seq genex:isExpressedIn ?anat .
	?seq rdfs:label "APOC1" .
	?anat a genex:AnatomicalEntity .
	?anat rdfs:label ?anatName .
}

To run this query click here.

Q03 graphical representation:

Other vocabulary terms:

  • orth:Gene: a class representing genes.
  • genex:AnatomicalEntity: a class representing anatomical entities such as organs.
  • rdfs:label: in the Bgee KG, this relation is often used to give names for each individual of a class.

Where is a human gene expressed? (simplified)

As in the previous query, we can also specify the species the gene comes from, to avoid ambiguities among gene names in different species. The query Q04 (see its graphical representation) retrieves the anatomical entities where the "APOC1" Homo sapiens gene is expressed.

Q04:

Question: What are the anatomical entities where the "APOC1" Homo sapiens gene is expressed? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName {
	?seq a orth:Gene .
	?seq genex:isExpressedIn ?anat .
	?seq rdfs:label "APOC1" .
	?anat a genex:AnatomicalEntity .
	?anat rdfs:label ?anatName .
	### Specifying species:
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:scientificName "Homo sapiens" .
}

To run this query click here.

Q04 graphical representation:

NOTE: orth:organism (a relation to assign an organism to a gene) chained with obo:RO_0002162 (a relation to assign a taxon to an organism) indicates to which taxon a gene belongs.
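When the same question is asked for several genes or species, the query text can be generated. The following Python sketch builds a Q04-style query from a gene label and a scientific name, escaping the values so they remain valid SPARQL string literals. The helper names are hypothetical, and the rdfs: prefix is declared explicitly here for portability, whereas the queries in this tutorial rely on the endpoint resolving it without a declaration.

```python
def sparql_string(text):
    """Quote a Python string as a SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def expressed_in_query(gene_label, species_name):
    """Build a Q04-style query for a gene label and a species scientific name."""
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName {{
    ?seq a orth:Gene .
    ?seq genex:isExpressedIn ?anat .
    ?seq rdfs:label {sparql_string(gene_label)} .
    ?anat a genex:AnatomicalEntity .
    ?anat rdfs:label ?anatName .
    ?seq orth:organism ?organism .
    ?organism obo:RO_0002162 ?species . # in taxon
    ?species a up:Taxon .
    ?species up:scientificName {sparql_string(species_name)} .
}}
"""


print(expressed_in_query("APOC1", "Homo sapiens"))
```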

Where is a gene expressed? (with more details)

In addition to anatomical entities, other conditions can be specified with the genex:isExpressedIn property. This property can relate a gene to gene expression conditions defined with the genex:ExpressionCondition class, that is, the conditions (anatomical entity, developmental stage, sex, and strain) under which the gene is considered expressed.

The query Q05 (see its graphical representation) retrieves the same results as Q03, but it is more accurate because it explicitly specifies the results are independent of developmental stage, sex, strain, and cell type.

Q05:

Question: What are the anatomical entities where the "APOC1" gene is expressed independently of the developmental stage, sex, strain, and cell type? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>

SELECT DISTINCT ?anat ?anatName {
	?seq a orth:Gene .
	?seq genex:isExpressedIn ?condition .
	?seq rdfs:label "APOC1" .
	?condition genex:hasAnatomicalEntity ?anat .
	?condition genex:hasAnatomicalEntity obo:GO_0005575 .
	?anat rdfs:label ?anatName .
	?condition genex:hasDevelopmentalStage ?stage .
	?stage rdfs:label "life cycle" .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
}

To run this query click here.

Other vocabulary terms:

  • genex:hasAnatomicalEntity: this relation states in which anatomical entity a gene expression is being assessed.
  • genex:hasDevelopmentalStage: this relation states during which developmental stage a gene expression is being assessed.
  • genex:hasSex: this relation states in which sex a gene expression is being assessed.
  • genex:hasStrain: this relation states in which strain a gene expression is being assessed.

Q05 graphical representation:

These query triple patterns are more accurate because we can now precisely define the other expression conditions available instead of only an anatomical entity.

In what cell types is a gene expressed?

When defining the condition under which gene expression is assessed, the genex:hasAnatomicalEntity property is also used to state cell types, since a cell type is considered an anatomical entity too. For example, to define that a gene is expressed in the lung, the graph below is built, which we can interpret as follows: the gene is expressed in a cellular component located in the lung, in other words, in the lung. Therefore, when the cell type is unspecified, we assert the value obo:GO_0005575 (cellular_component) with the genex:hasAnatomicalEntity property. This Gene Ontology term is the root of all cell types.

Below, we show a question and its corresponding SPARQL query Q06 along with its graph representation where other gene expression conditions are specified, more precisely, the developmental stage.

Q06:

Question: What are the anatomical entities where the human gene "APOC1" is expressed in the post-juvenile stage? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName ?stage {
	?seq a orth:Gene .
	?seq genex:isExpressedIn ?condition .
	?seq rdfs:label "APOC1" .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasAnatomicalEntity obo:GO_0005575 .
	?condition genex:hasDevelopmentalStage ?stage .
	?stage rdfs:label "post-juvenile" .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:commonName "human" .
}

To run this query click here.

Q06 graphical representation:

Moreover, if there is no specific strain to declare, the strain must be defined as "wild-type", since "wild-type" represents any strain. This is because Bgee only considers wild-type experiments. As a result, we ensure the gene is expressed independently of the strain. If we do not state that the strain is "wild-type", expressed genes that are exclusive to a specific strain will be considered too. Similarly, stating sex as "any" means that the gene is expressed in any sex.

NOTE: Currently, the data accessible via the SPARQL endpoint do not specify sex and strain types. Therefore, to optimize this query, we can omit triple patterns related to sex and strain. Q07 is the optimized SPARQL query that retrieves exactly the same results as Q06.

Q07:

Question: What are the anatomical entities where the human gene "APOC1" is expressed in the post-juvenile stage? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName ?stage {
	?seq a orth:Gene .
	?seq genex:isExpressedIn ?condition .
	?seq rdfs:label "APOC1" .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasAnatomicalEntity obo:GO_0005575 .
	?condition genex:hasDevelopmentalStage ?stage .
	?stage rdfs:label "post-juvenile" .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:commonName "human" .
}

To run this query click here.

Where is a gene expressed and its expression score?

Expression of genes and their corresponding scores can also be obtained via the genex:Expression concept. We rewrite the query Q07 using the genex:Expression concept.

The query Q08 below (see its graphical representation) retrieves the anatomical entities where the human gene "APOC1" is expressed in the post-juvenile stage, along with its expression score, independently of the strain, sex, and cell type. The higher the expression score, the more highly the gene is expressed under a given experimental condition. Note that Q08 orders results by expression score with the statement ORDER BY DESC(?score), where DESC() is an ordering modifier indicating descending order.

Q08:

Question: What are the anatomical entities where the human gene "APOC1" is expressed in the post-juvenile stage along with its expression score independently of the strain, sex, and cell type? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName ?score ?stage {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq rdfs:label "APOC1" .
	?condition genex:hasAnatomicalEntity ?anat .
	?condition genex:hasAnatomicalEntity obo:GO_0005575 .
	?anat rdfs:label ?anatName .
	?condition genex:hasDevelopmentalStage ?stage .
	?stage rdfs:label "post-juvenile" .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:commonName "human" .
FILTER (?anat != obo:GO_0005575)
} ORDER BY DESC(?score)

To run this query click here.

Q08 graphical representation:

NOTE: In the query Q08, we filter out the anatomical entity obo:GO_0005575 ("cellular_component") with the expression FILTER (?anat != obo:GO_0005575) because it is not informative: "cellular_component" means any cell type.

NOTE: In Q08, we define the cell type where the gene expression is being evaluated as the most general one (i.e., the root term obo:GO_0005575) with the statement ?condition genex:hasAnatomicalEntity obo:GO_0005575 . The cost of doing this is that the query will not return cell-type-specific expression.
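The ORDER BY DESC(?score) clause sorts on the server. If rows are retrieved without it, the same ordering can be reproduced on the client side, keeping in mind that JSON results carry literal values as strings, so scores should be converted to numbers before sorting. The Python sketch below illustrates this with hypothetical rows; the entity names and scores are made up and are not Bgee results.

```python
# Hypothetical rows shaped like Q08 output; names and scores are made up.
rows = [
    {"anatName": "entity A", "score": "9.5"},
    {"anatName": "entity B", "score": "85.2"},
    {"anatName": "entity C", "score": "40.0"},
]


def by_score_desc(result_rows):
    """Sort rows from highest to lowest expression score (numeric comparison)."""
    return sorted(result_rows, key=lambda r: float(r["score"]), reverse=True)


ordered = by_score_desc(rows)
print([r["anatName"] for r in ordered])
```

Sorting the raw strings instead would place "9.5" ahead of "85.2", which is why the conversion with `float` matters.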

The query Q09 addresses genes that are stated as expressed not only in a given organ or tissue, but also in a specific cell type within that tissue or organ, though with different expression scores. Such a cell type is a subtype of "cellular_component" (obo:GO_0005575). Therefore, Q09 also retrieves gene expression calls related to specific cell types other than "cellular_component". Q09 can be read as: the anatomical entities, including cell types, if any, where the human gene "APOC1" is expressed at the post-juvenile stage, along with its expression score, independently of the strain and sex.

Q09:

Question: What are the anatomical entities including cell types, if any, where the human gene "APOC1" is expressed at the post-juvenile stage along with its expression score independently of the strain and sex? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?cellType ?anatName ?cellTypeName ?score ?stage {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq rdfs:label "APOC1" .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasAnatomicalEntity ?cellType .
	?cellType rdfs:label ?cellTypeName .
	?condition genex:hasDevelopmentalStage ?stage .
	?stage rdfs:label "post-juvenile" .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:commonName "human" .
FILTER (?anat != obo:GO_0005575)
FILTER (?anat != ?cellType)
} ORDER BY DESC(?score)

To run this query click here.

Q09 graphical representation:

NOTE: Currently, the data accessible via the SPARQL endpoint do not specify sex and strain types. Therefore, to optimize Q09 query, we can omit triple patterns related to sex and strain. Q10 is the optimized SPARQL query that retrieves exactly the same results as Q09.

Q10:

Question: What are the anatomical entities including cell types, if any, where the human gene "APOC1" is expressed at the post-juvenile stage along with its expression score independently of the strain and sex? SPARQL query (Q09 optimised):

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?cellType ?anatName ?cellTypeName ?score ?stage {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq rdfs:label "APOC1" .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasAnatomicalEntity ?cellType .
	?cellType rdfs:label ?cellTypeName .
	?condition genex:hasDevelopmentalStage ?stage .
	?stage rdfs:label "post-juvenile" .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:commonName "human" .
FILTER (?anat != obo:GO_0005575)
FILTER (?anat != ?cellType)
} ORDER BY DESC(?score)

To run this query click here.

Querying with controlled vocabularies and identifiers

Queries specifying conditions such as taxa, anatomical entities, and developmental stages can be written with their corresponding controlled vocabularies, whose terms are represented as IRIs (Internationalized Resource Identifiers, e.g., Web addresses).

Taxonomy identifiers

Taxa are based on the NCBI taxonomy identifiers. To find the NCBI ID of a species, you can choose the species at https://www.bgee.org/search/species and look for the species ID in the general information section. Alternatively, you can search for the NCBI ID of a given species directly at the NCBI taxonomy website. For example, the human ID is 9606, and its full IRI in the Bgee knowledge graph is http://purl.uniprot.org/taxonomy/9606. More precisely, the full IRI is composed of the http://purl.uniprot.org/taxonomy/ prefix followed by the NCBI ID.

Gene identifiers

Currently, Bgee mostly reuses either Ensembl or NCBI Gene database identifiers, depending on the genome source. These identifiers are stated for each gene with the dcterms:identifier relation. The genome source for each species can be verified on its Bgee species page, accessible at https://www.bgee.org/search/species; for example, the human page states Ensembl as the genome source in the "General information" section. To find the main identifier of a given gene in Bgee, we can rely on Bgee's gene search tool. For example, if we search for the APOC1 gene, the first result row refers to the human APOC1 gene and its first column shows the corresponding Ensembl ID, ENSG00000130208. Its full IRI, as defined in the Bgee knowledge graph, is http://rdf.ebi.ac.uk/resource/ensembl/ENSG00000130208, composed of the prefix http://rdf.ebi.ac.uk/resource/ensembl/ and the Ensembl ID. For an NCBI Gene ID, the IRI prefix is https://www.ncbi.nlm.nih.gov/gene/. For instance, https://www.ncbi.nlm.nih.gov/gene/118230125 corresponds to the eel apoc1 gene.

Anatomical entity vocabulary

For looking up an ontology term corresponding to a given anatomical entity, we can rely on the Ontology Look Up service at https://www.ebi.ac.uk/ols4/ontologies/uberon. For example, if we type liver in the search field we can retrieve the Uberon identifier that is UBERON:0002107 and its corresponding IRI http://purl.obolibrary.org/obo/UBERON_0002107 as defined in the Bgee knowledge graph.
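The IRI patterns described so far (taxon, gene, and ontology term) can be captured in a few small helpers. This Python sketch only concatenates the prefixes quoted in this tutorial with an identifier; the function names are illustrative. The same `obo_iri` conversion applies to developmental stage IDs, as described in the next subsection.

```python
def taxon_iri(ncbi_taxon_id):
    """NCBI taxonomy ID -> taxon IRI used in the Bgee knowledge graph."""
    return f"http://purl.uniprot.org/taxonomy/{ncbi_taxon_id}"


def ensembl_gene_iri(ensembl_id):
    """Ensembl gene ID -> gene IRI."""
    return f"http://rdf.ebi.ac.uk/resource/ensembl/{ensembl_id}"


def ncbi_gene_iri(ncbi_gene_id):
    """NCBI Gene ID -> gene IRI."""
    return f"https://www.ncbi.nlm.nih.gov/gene/{ncbi_gene_id}"


def obo_iri(term_id):
    """Ontology term ID such as 'UBERON:0002107' -> OBO IRI (':' becomes '_')."""
    prefix, local_id = term_id.split(":", 1)
    return f"http://purl.obolibrary.org/obo/{prefix}_{local_id}"


print(taxon_iri(9606))
print(ensembl_gene_iri("ENSG00000130208"))
print(ncbi_gene_iri(118230125))
print(obo_iri("UBERON:0002107"))
```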

Developmental stage vocabulary

For looking up an ontology term corresponding to a given developmental stage, we can browse the developmental stage file. In this file, by looking at the name and ID fields we can compose the corresponding IRI of a given developmental stage. For example, we can look up post-juvenile and find out that its ID is UBERON:0000113. By prefixing this ID with http://purl.obolibrary.org/obo/ and replacing : with _, we can define its IRI as stated in the Bgee knowledge graph: http://purl.obolibrary.org/obo/UBERON_0000113. Alternatively, we can retrieve all developmental stages and their IRIs in Bgee with the Q11 query.

Q11:

Question: What are the developmental stages present in Bgee? SPARQL query:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?stage ?stageName ?stageDescription {
	?stage rdf:type efo:EFO_0000399 . #developmental stage
	?stage rdfs:label ?stageName .
	?stage dcterms:description ?stageDescription .
}

To run this query click here

In the Q11 query, we can also apply a filter by adding the statement FILTER (CONTAINS(?stageName,"stage")) and replacing "stage" with the stage name, or the part of the name, we are searching for. For example, in FILTER (CONTAINS(?stageName,"post-juvenile")) the function CONTAINS checks whether the "post-juvenile" string is a substring of each stage name in Bgee. Q11-a implements this filter.

Q11-a:

Question: What is the post-juvenile stage link and description? SPARQL query:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?stage ?stageName ?stageDescription {
	?stage rdf:type efo:EFO_0000399 . #developmental stage
	?stage rdfs:label ?stageName .
	?stage dcterms:description ?stageDescription .
	FILTER (CONTAINS(?stageName,"post-juvenile"))
}

To run this query click here
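The step from Q11 to Q11-a can be sketched in Python as a query template with an optional filter line. The function name `stage_query` is hypothetical, and the rdf: and rdfs: prefixes are declared explicitly here for portability, whereas Q11 relies on the endpoint resolving them without a declaration.

```python
Q11_TEMPLATE = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?stage ?stageName ?stageDescription {
    ?stage rdf:type efo:EFO_0000399 . # developmental stage
    ?stage rdfs:label ?stageName .
    ?stage dcterms:description ?stageDescription .
%s}
"""


def stage_query(name_part=None):
    """Return Q11, or a Q11-a style query when name_part is given."""
    if name_part is None:
        return Q11_TEMPLATE % ""
    # Escape characters that would otherwise end the SPARQL string literal.
    escaped = name_part.replace("\\", "\\\\").replace('"', '\\"')
    return Q11_TEMPLATE % f'    FILTER (CONTAINS(?stageName, "{escaped}"))\n'


print(stage_query("post-juvenile"))
```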

Example 1: querying with controlled vocabularies

Let us consider the question addressed by the Q08 query: anatomical entities where the human gene "APOC1" is expressed at the post-juvenile stage along with its expression score independently of the strain, sex, and cell type. We can rewrite the Q08 query as shown in Q08-a (see its graphical representation) below, where:

  • obo:UBERON_0000113 represents the post-juvenile stage
  • the taxon up-taxon:9606 represents human.
  • the APOC1 gene is defined with ensembl:ENSG00000130208, an IRI composed of the Ensembl gene identifier.
  • all prefixes (obo:, ensembl: and up-taxon:) are defined in Q08-a query header. Moreover, the full IRI can also be provided in the query by defining it between <>, for instance, <http://purl.uniprot.org/taxonomy/9606> is the same as up-taxon:9606 as defined in the Q08-a query.

Q08-a:

Question: What are the anatomical entities where the human gene "APOC1" is expressed at the post-juvenile stage along with its expression score independently of the strain, sex, and cell type? SPARQL query (a Q08 variant):

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX up-taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl/>
PREFIX lscr: <http://purl.org/lscr#>

SELECT DISTINCT ?anat ?anatName ?score {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq lscr:xrefEnsemblGene ensembl:ENSG00000130208 .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasDevelopmentalStage obo:UBERON_0000113 .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 up-taxon:9606 .
FILTER (?anat != obo:GO_0005575)
} ORDER BY DESC(?score)

To run this query click here.

Q08-a graphical representation:

The Q08-a query can be further simplified by removing the statements about species because an Ensembl gene identifier is always associated with a unique species, hence, by stating ensembl:ENSG00000130208, we are already referring to a human gene. This simplified version is shown in Q08-b query.

Q08-b:

Question: What are the anatomical entities where the human gene "APOC1" is expressed at the post-juvenile stage along with its expression score independently of the strain, sex, and cell type? SPARQL query (a Q08 variant):

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX up-taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl/>
PREFIX lscr: <http://purl.org/lscr#>

SELECT DISTINCT ?anat ?anatName ?score {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq lscr:xrefEnsemblGene ensembl:ENSG00000130208 .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasDevelopmentalStage obo:UBERON_0000113 .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
FILTER (?anat != obo:GO_0005575)
} ORDER BY DESC(?score)

To run this query click here.

Example 2: querying with gene source identifiers

To query gene expression information for a species whose genome source is NCBI, such as the eel apoc1 gene (see Gene identifiers), one way is to use the lscr:xrefNCBIGene property. For example, to answer the Q12 question, we can start from Q08-b and edit the statement "?seq lscr:xrefEnsemblGene ensembl:ENSG00000130208 ." as follows: replace the relation lscr:xrefEnsemblGene, which relates a Bgee gene to its corresponding Ensembl gene IRI, with lscr:xrefNCBIGene, and replace ensembl:ENSG00000130208 with the eel apoc1 gene IRI from the NCBI Gene database: <https://www.ncbi.nlm.nih.gov/gene/118230125>.

Q12:

Question: What are the anatomical entities where the eel gene "apoc1" is expressed along with its expression score independently of the strain, sex, and cell type? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX up-taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX lscr: <http://purl.org/lscr#>

SELECT DISTINCT ?anat ?anatName ?stageIRI ?score {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq lscr:xrefNCBIGene <https://www.ncbi.nlm.nih.gov/gene/118230125> .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasDevelopmentalStage ?stageIRI .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
FILTER (?anat != obo:GO_0005575)
} ORDER BY DESC(?score)

To run this query click here.

Note that in the Q12 question a specific developmental stage is not declared. So, because it is unknown or not specified, we replaced obo:UBERON_0000113 in Q08-b with a variable (?stageIRI) and we projected it in the SELECT query header to properly answer the Q12 question. Moreover, if we want to generalize the Q12 query for any genome source, we could replace the property lscr:xrefNCBIGene with a variable (e.g., ?xref_property). Alternatively, we can use the dcterms:identifier property that assigns an identifier for each Bgee gene according to the gene identifier of the genome source of a given species in Bgee. See Q12-a which retrieves exactly the same results as Q12 but by using the dcterms:identifier relation. For a graphical representation of Q12-a see Q12-a graphical representation.

Q12-a:

Question: What are the anatomical entities where the eel gene "apoc1" is expressed along with its expression score independently of the strain, sex, and cell type? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX up-taxon:<http://purl.uniprot.org/taxonomy/>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?anat ?anatName ?stageIRI ?score {
	?seq a orth:Gene .
	?expression a genex:Expression .
	?expression genex:hasExpressionCondition ?condition .
	?expression genex:hasExpressionLevel ?score .
	?expression genex:hasSequenceUnit ?seq .
	?seq dcterms:identifier "118230125" .
	?condition genex:hasAnatomicalEntity ?anat .
	?anat rdfs:label ?anatName .
	?condition genex:hasDevelopmentalStage ?stageIRI .
	?condition genex:hasSex "any" .
	?condition genex:hasStrain ?strain .
	?strain rdfs:label "wild-type" .
FILTER (?anat != obo:GO_0005575)
} ORDER BY DESC(?score)

Q12-a graphical representation:

Querying with UniProtKB cross-references

To query with UniProtKB cross-references, the easiest way is to state the property lscr:xrefUniprot that is assigned to each Bgee gene. For example, the human APOC1 gene has the corresponding UniProtKB IRI up-protein:P02654, where up-protein: is a prefix replacing the URI http://purl.uniprot.org/uniprot/ as defined in the header of Q13.

Q13 is similar to Q03, but it identifies the gene by its UniProtKB accession number instead of its name or symbol; in other words, it uses the lscr:xrefUniprot property instead of rdfs:label. Because P02654 identifies the human APOC1 entry, the results concern the human gene, as in Q04.

Q13:

Question: What are the anatomical entities where the P02654 gene is expressed? Note that P02654 is a UniProtKB identifier of the APOC1 human gene. SPARQL query (see Q13 graphical representation):

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up-protein:<http://purl.uniprot.org/uniprot/>
PREFIX lscr: <http://purl.org/lscr#>

SELECT DISTINCT ?anat ?anatName {
	?seq a orth:Gene .
	?seq genex:isExpressedIn ?anat .
	?seq lscr:xrefUniprot up-protein:P02654 .
	?anat a genex:AnatomicalEntity .
	?anat rdfs:label ?anatName .
}

To run this query click here.

Q13 graphical representation:

Querying gene metadata

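The direct attributes of a gene in the Bgee knowledge graph, as retrieved by Q14, are its symbol (rdfs:label), related links (rdfs:seeAlso), description (dcterms:description), identifier (dcterms:identifier), organism (orth:organism), and cross-references to UniProtKB (lscr:xrefUniprot), Ensembl (lscr:xrefEnsemblGene), and NCBI Gene (lscr:xrefNCBIGene).

Q14 shows a SPARQL query to retrieve all metadata related to the ENSG00000130208 Ensembl gene.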

Q14:

Question: What is all the metadata related to the ENSG00000130208 gene, where ENSG00000130208 is the identifier of the "APOC1" human gene? SPARQL query (see Q14 graphical representation):

PREFIX orth: <http://purl.org/net/orth#>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT DISTINCT ?symbol ?description ?id
?links ?organism ?uniprot ?ensembl ?ncbi {
	?seq a orth:Gene .
	?seq rdfs:label ?symbol .
	?seq rdfs:seeAlso ?links .
	?seq dcterms:description ?description .
	?seq dcterms:identifier ?id .
	?seq orth:organism ?organism .
	OPTIONAL{ ?seq lscr:xrefUniprot ?uniprot . }
	OPTIONAL{ ?seq lscr:xrefEnsemblGene ?ensembl . }
	OPTIONAL{ ?seq lscr:xrefNCBIGene ?ncbi . }
	FILTER (?id = "ENSG00000130208")
}

To run this query click here.

Q14 graphical representation:

Querying Genes with Absence of Expression

To check for genes that are not expressed in some condition or tissue, we can use the genex:isAbsentIn and genex:AbsenceExpression terms instead of genex:isExpressedIn and genex:Expression, respectively. For instance, to answer the question "Where is a gene not expressed?", we can replace genex:isExpressedIn with genex:isAbsentIn in the query Q04, as shown in Q15 and its graphical representation.

Q15:

Question: What are the anatomical entities where the "APOC1" Homo sapiens gene is not expressed, that is where is "APOC1" absent? SPARQL query:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName {
	?seq a orth:Gene .
	?seq genex:isAbsentIn ?anat .
	?seq rdfs:label "APOC1" .
	?anat a genex:AnatomicalEntity .
	?anat rdfs:label ?anatName .
	?seq orth:organism ?organism .
	?organism obo:RO_0002162 ?species . #in taxon
	?species a up:Taxon .
	?species up:scientificName "Homo sapiens" .
}

To run this query click here.
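Q15 differs from Q04 only in the predicate that links the gene to the anatomical entity. The Python sketch below makes that explicit by deriving the Q15 text from the Q04 text with a single string replacement; the variable names are illustrative.

```python
# Text of Q04 from this tutorial; rdfs: is declared explicitly for portability.
Q04 = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX up: <http://purl.uniprot.org/core/>

SELECT DISTINCT ?anat ?anatName {
    ?seq a orth:Gene .
    ?seq genex:isExpressedIn ?anat .
    ?seq rdfs:label "APOC1" .
    ?anat a genex:AnatomicalEntity .
    ?anat rdfs:label ?anatName .
    ?seq orth:organism ?organism .
    ?organism obo:RO_0002162 ?species . # in taxon
    ?species a up:Taxon .
    ?species up:scientificName "Homo sapiens" .
}
"""

# Swapping one predicate turns the presence query into the absence query.
Q15 = Q04.replace("genex:isExpressedIn", "genex:isAbsentIn")
print(Q15)
```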

Q15 graphical representation:

Programmatic access to the latest version of the Bgee SPARQL endpoint

The latest version of the Bgee SPARQL endpoint is accessible by using your preferred programming language through the URL address https://www.bgee.org/sparql/.

For example, to retrieve all anatomical entities in Rattus norvegicus where the APOC1 gene is expressed, the query is:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?anatEntity ?anatName {
	?seq a             orth:Gene ;
	     orth:organism ?organism ;
	     rdfs:label    ?geneName .
	?organism obo:RO_0002162 <http://purl.uniprot.org/taxonomy/10116> . #in_taxon
	?seq genex:isExpressedIn ?anatEntity .
	?anatEntity a genex:AnatomicalEntity .
	?anatEntity rdfs:label ?anatName .
	FILTER (LCASE(?geneName) = LCASE('APOC1'))
}

It is possible to download the result of this query in the JSON or XML format.
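As a sketch of how such a download could be scripted, the Python snippet below selects the result format through the HTTP Accept header and writes the raw response to a file. It assumes the endpoint honours the standard SPARQL media types for JSON and XML results and accepts the query in a `query` URL parameter; the function names are hypothetical.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://www.bgee.org/sparql/"

# Standard media types for SPARQL SELECT results.
MIME_TYPES = {
    "json": "application/sparql-results+json",
    "xml": "application/sparql-results+xml",
}


def result_request(query, fmt="json", endpoint=ENDPOINT):
    """Prepare a GET request asking for the results in the given format."""
    if fmt not in MIME_TYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"Accept": MIME_TYPES[fmt]})


def download(query, path, fmt="json", timeout=60):
    """Run the query and save the raw response body to path."""
    request = result_request(query, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        with open(path, "wb") as out:
            out.write(resp.read())
```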

(Of note, as opposed to the example below to access an archived version, when accessing the endpoint for the latest version, it is important NOT to specify the name of a graph to target; otherwise, results will be incorrect)

Stable programmatic access to this version of the Bgee SPARQL endpoint

This version of the Bgee SPARQL endpoint is accessible in a stable manner by using your preferred programming language through the stable URL address https://www.bgee.org/sparql15_1/.

In your query, it is essential to specify the URL of the graph you want to query (https://bgee.org/rdf_v15_1) in a FROM clause placed right after the SELECT projection; otherwise you won't be using the data for this version. For example, to retrieve all anatomical entities in Rattus norvegicus where the APOC1 gene is expressed, the query is:

PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?anatEntity ?anatName FROM <https://bgee.org/rdf_v15_1> {
	?seq a             orth:Gene ;
	     orth:organism ?organism ;
	     rdfs:label    ?geneName .
	?organism obo:RO_0002162 <http://purl.uniprot.org/taxonomy/10116> . #in_taxon
	?seq genex:isExpressedIn ?anatEntity .
	?anatEntity a genex:AnatomicalEntity .
	?anatEntity rdfs:label ?anatName .
	FILTER (LCASE(?geneName) = LCASE('APOC1'))
}

Again, it is essential to specify the name of the graph of the version to target (in the example above, https://bgee.org/rdf_v15_1); otherwise, results will be incorrect.
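The difference between the two access modes comes down to whether the query carries a FROM clause. The Python sketch below builds the example query for either endpoint, adding the graph name only when one is given; the function and constant names are illustrative, and the addresses are those quoted in this tutorial.

```python
# Endpoint and graph addresses as given in this tutorial.
LATEST_ENDPOINT = "https://www.bgee.org/sparql/"
STABLE_ENDPOINT = "https://www.bgee.org/sparql15_1/"
STABLE_GRAPH = "https://bgee.org/rdf_v15_1"


def rat_apoc1_query(graph=None):
    """Build the example query; add a FROM clause only when a graph is given."""
    from_clause = f" FROM <{graph}>" if graph else ""
    # rdfs: is declared explicitly here for portability.
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX genex: <http://purl.org/genex#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?anatEntity ?anatName{from_clause} {{
    ?seq a orth:Gene ;
         orth:organism ?organism ;
         rdfs:label ?geneName .
    ?organism obo:RO_0002162 <http://purl.uniprot.org/taxonomy/10116> . # in_taxon
    ?seq genex:isExpressedIn ?anatEntity .
    ?anatEntity a genex:AnatomicalEntity .
    ?anatEntity rdfs:label ?anatName .
    FILTER (LCASE(?geneName) = LCASE('APOC1'))
}}"""


# Pair each endpoint with the query text it expects.
targets = {
    LATEST_ENDPOINT: rat_apoc1_query(),              # latest version: no graph name
    STABLE_ENDPOINT: rat_apoc1_query(STABLE_GRAPH),  # archived version: graph name required
}
```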

RDF serialisation and semantic models

The Bgee RDF data were created using an Ontology-Based Data Access (OBDA) approach with Ontop. The RDF serialisation of the 'EasyBgee' database is based on the GenEx semantic model specification and on OBDA mappings defined using the Ontop mapping language. We also inferred all implicit information based on OWL 2 QL profile reasoning over GenEx.

To cross-reference other resources, this SPARQL endpoint contains annotation property assertions defined by a first draft of the life-sciences cross-reference (LSCR) ontology, which is available for download from the Quest for Orthologs GitHub repository.

Download the latest Bgee RDF data dump here.