This documentation describes the content of the presence/absence expression calls download files, how values of each column are generated, and how to download expression calls data.
Bgee provides presence/absence of expression calls that can be retrieved in download files or by using our R package "BgeeDB". The expression calls are reported either: 1) by gene and anatomical entity or 2) by gene and i) anatomical entity, ii) developmental and life stage, iii) sex, and iv) strain/ethnicity.
Only wild-type healthy gene expression data is included in Bgee (i.e. no treatment, no disease, no gene knock-out, etc.). Bgee collects data from different experiments and data types, and provides a summary from all these data as unique calls of presence and absence of expression, per gene and condition. For each call, an FDR-corrected p-value and expression score are provided, which allows you to compare levels of expression.
Present/absent expression calls are very similar to the data that can be reported using in situ hybridization methods; Bgee applies dedicated statistical analyses to generate such calls from EST, Affymetrix, bulk RNA-Seq, and single-cell RNA-Seq, and also collects in situ hybridization calls from model organism databases. This offers the possibility to aggregate and compare these present/absent expression calls between different experiments, different data types, and different species.
For each gene and each sample in Bgee, we produce a p-value based on a null hypothesis of expression level equal to or below the background expression noise (i.e. absence of expression).
We capture information about the anatomical localization of samples, their developmental and life stage, sex, and strain or ethnicity. We either manually capture this information using ontologies and controlled vocabularies, or we map existing annotations provided by model organism databases (MODs) to these ontologies and vocabularies.
After p-values are generated from the raw data for each gene and sample, they are propagated using anatomical and life stage ontologies. For instance, the p-value obtained for a gene in a sample studying the condition 'midbrain' at 'aged stage' will be propagated to the condition 'brain' at 'adult stage'. All p-values are propagated in a similar way toward the root of the graph of conditions.
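The propagation step can be illustrated with a minimal Python sketch. It assumes a toy condition graph in which each condition (an anatomical entity paired with a stage) lists its direct parents; the `PARENTS` mapping, the function names, and the p-values are hypothetical and only serve the illustration. The sketch simply collects, for every condition, the p-values observed in it or in any of its descendants. It does not claim to reproduce the Bgee pipeline.

```python
from collections import defaultdict

# Hypothetical toy graph: each condition maps to its direct parents.
# A condition is an (anatomical entity, stage) pair; names are illustrative.
PARENTS = {
    ("midbrain", "aged stage"): [("brain", "aged stage"), ("midbrain", "adult stage")],
    ("brain", "aged stage"): [("brain", "adult stage")],
    ("midbrain", "adult stage"): [("brain", "adult stage")],
    ("brain", "adult stage"): [],
}


def ancestors(condition, parents):
    """Return the condition itself plus every condition reachable toward the root."""
    seen = set()
    stack = [condition]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return seen


def propagate(sample_pvalues, parents):
    """Collect, per condition, the p-values observed in it or in any descendant."""
    collected = defaultdict(list)
    for condition, pvalue in sample_pvalues:
        for target in ancestors(condition, parents):
            collected[target].append(pvalue)
    return dict(collected)


# Hypothetical p-values for one gene in two samples.
observed = [(("midbrain", "aged stage"), 0.001), (("brain", "adult stage"), 0.2)]
propagated = propagate(observed, PARENTS)
for condition, pvalues in sorted(propagated.items()):
    print(condition, sorted(pvalues))
```

In this toy graph, the p-value observed for 'midbrain' at 'aged stage' ends up attached to 'brain' at 'adult stage' as well, alongside the p-value observed directly in that condition.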
After all p-values have been propagated, we apply a Benjamini-Hochberg FDR correction to generate one FDR p-value per gene and condition.
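For readers unfamiliar with the Benjamini-Hochberg procedure, here is a minimal Python sketch of the generic adjustment applied to a list of p-values. The function name and the example values are hypothetical, and how p-values are grouped per gene and condition before correction is not described here, so this should be read as an illustration of the statistical procedure rather than of the Bgee implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
example = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(example)
```

Each p-value is multiplied by the number of tests divided by its rank, and the running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a bigger one.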
The files can be found on the Bgee download page for each species. To access the download page from the Bgee homepage, go to the download section on the top toolbar and click on "Gene expression calls".
Once on the download calls webpage, you can either search for a specific species in the top search bar or look through the species list at the bottom and click on the species logo to access the different download file options. These datasets can also be downloaded directly using our R package "BgeeDB".
Once a species is selected, you will need to choose if you want data only for anatomical entities or for all conditions, and if you want the summarized information (simple file) or all information (advanced file). The implications of each option are explained in further detail below.
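There are two different options for condition parameters: anatomical entities only, or all conditions (anatomical entity, developmental and life stage, sex, and strain or ethnicity).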
There are two different options when downloading the file: the simple file (summarized information) or the advanced file (all information).
Simple and advanced files contain the same expression calls (same number of lines) but advanced files contain more information on each call (more columns).
Advanced file additional information:
Below is a complete description of which data you can expect to find in each download file and a detailed description of each column.
Unique identifier of the gene.
Name of the gene defined by Gene ID (column 1).
Unique identifier of the anatomical entity, from the Uberon ontology.
Name of the anatomical entity defined by Anatomical entity ID (column 3).
Unique identifier of the developmental stage, from the Uberon ontology.
Name of the developmental stage defined by Developmental stage ID (column 5).
Sex of the sample used to generate the call.
Strain of the sample used to generate the call.
Call generated from all data types for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent.
Call quality from all data types for the selected combination of condition parameters (anatomical or all conditions). Permitted values: gold quality, silver quality.
FDR-corrected p-value of the call.
Expression score of the call. The score uses the minimum and maximum Expression Rank (column 13) of the species to normalize the expression to a value between 0 and 100.
A low score means that the gene is lowly expressed in the condition.
Rank score associated with the call. Rank scores of expression calls are normalized across genes, conditions and species.
A low score means that the gene is highly expressed in the condition.
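The relationship between a rank and a score on a 0 to 100 scale can be sketched as follows. This is a plain linear min-max rescaling written under the assumption that the lowest rank of the species maps to 100 and the highest rank maps to 0; the exact formula used to produce the values in the files is not given here and may differ. The function name and parameters are hypothetical.

```python
def expression_score(rank, min_rank, max_rank):
    """Rescale a rank to 0-100 so that low ranks give high scores.

    Assumption: simple linear min-max normalization, for illustration only.
    """
    if max_rank <= min_rank:
        raise ValueError("max_rank must be greater than min_rank")
    if not min_rank <= rank <= max_rank:
        raise ValueError("rank must lie between min_rank and max_rank")
    return 100.0 * (max_rank - rank) / (max_rank - min_rank)


# Hypothetical ranks: the best-ranked call, a middle call, the worst-ranked call.
for rank in (1.0, 51.0, 101.0):
    print(rank, expression_score(rank, min_rank=1.0, max_rank=101.0))
```

The sketch makes the inverse relationship explicit: a low rank (highly expressed gene) yields a high score, and a high rank yields a low score.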
Permitted value: yes. Only calls that were actually seen in experimental data, at least once, are in this file.
Number of observations coming from experimental data for this combination of condition parameters (anatomical or all conditions).
Number of observations coming from experimental data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.
Call generated from Affymetrix data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.
Quality associated with the call from Affymetrix data. Permitted values: gold quality, silver quality, NA.
FDR-corrected p-value of the call calculated using p-values coming from Affymetrix data.
Expression score of the call from Affymetrix data. The score uses the minimum and maximum Expression Rank (column 13) of the species to normalize the expression to a value between 0 and 100.
A low score means that the gene is lowly expressed in the condition.
Rank score associated with the call from Affymetrix data. Rank scores of expression calls are normalized across genes, conditions and species.
A low score means that the gene is highly expressed in the condition.
The weight given to Affymetrix expression ranks and scores when computing the weighted mean over several data types.
Information about the calls actually coming from experimental Affymetrix data for this combination of condition parameters (anatomical or all conditions).
Permitted values: yes or no.
Number of observations coming from experimental Affymetrix data for this combination of condition parameters (anatomical or all conditions).
Number of observations coming from experimental Affymetrix data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.
Call generated from EST data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.
Quality associated with the call from EST data. Permitted values: gold quality, silver quality, NA.
FDR-corrected p-value of the call calculated using p-values coming from EST data.
Expression score of the call from EST data. The score uses the minimum and maximum Expression Rank (column 13) of the species to normalize the expression to a value between 0 and 100.
A low score means that the gene is lowly expressed in the condition.
Rank score associated with the call from EST data. Rank scores of expression calls are normalized across genes, conditions, and species.
A low score means that the gene is highly expressed in the condition.
The weight given to EST expression ranks and scores when computing the weighted mean over several data types.
Information about the calls actually coming from experimental EST data for this combination of condition parameters (anatomical or all conditions).
Permitted values: yes or no.
Number of observations coming from experimental EST data for this combination of condition parameters (anatomical or all conditions).
Number of observations coming from experimental EST data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.
Call generated from in situ hybridization data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.
Quality associated with the call from in situ hybridization data. Permitted values: gold quality, silver quality, NA.
FDR-corrected p-value of the call calculated using p-values coming from in situ hybridization data.
Expression score of the call from in situ hybridization data. The score uses the minimum and maximum Expression Rank (column 13) of the species to normalize the expression to a value between 0 and 100.
A low score means that the gene is lowly expressed in the condition.
Rank score associated with the call from in situ hybridization data. Rank scores of expression calls are normalized across genes, conditions, and species.
A low score means that the gene is highly expressed in the condition.
The weight given to in situ hybridization expression ranks and scores when computing the weighted mean over several data types.
Information about the calls actually coming from experimental in situ hybridization data for this combination of condition parameters (anatomical or all conditions).
Permitted values: yes or no.
Number of observations coming from experimental in situ hybridization data for this combination of condition parameters (anatomical or all conditions).
Number of observations coming from experimental in situ hybridization data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.
Call generated from bulk RNA-Seq data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.
Quality associated with the call from bulk RNA-Seq data. Permitted values: gold quality, silver quality, NA.
FDR-corrected p-value of the call calculated using p-values coming from RNA-Seq data.
Expression score of the call from RNA-Seq data. The score uses the minimum and maximum Expression Rank (column 13) of the species to normalize the expression to a value between 0 and 100.
A low score means that the gene is lowly expressed in the condition.
Rank score associated with the call from RNA-Seq data. Rank scores of expression calls are normalized across genes, conditions and species.
A low score means that the gene is highly expressed in the condition.
The weight given to RNA-Seq expression ranks and scores when computing the weighted mean over several data types.
Information about the calls actually coming from experimental RNA-Seq data for this combination of condition parameters (anatomical or all conditions).
Permitted values: yes or no.
Number of observations coming from experimental RNA-Seq data for this combination of condition parameters (anatomical or all conditions).
Number of observations coming from experimental RNA-Seq data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.
Call generated from full-length single-cell RNA-Seq data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.
Quality associated with the call from full-length single-cell RNA-Seq data. Permitted values: gold quality, silver quality, NA.
FDR-corrected p-value of the call calculated using p-values coming from full-length single-cell RNA-Seq data.
Expression score of the call from full-length single-cell RNA-Seq data. The score uses the minimum and maximum Expression Rank (column 13) of the species to normalize the expression to a value between 0 and 100.
A low score means that the gene is lowly expressed in the condition.
Rank score associated with the call from full-length single-cell RNA-Seq data. Rank scores of expression calls are normalized across genes, conditions, and species.
A low score means that the gene is highly expressed in the condition.
The weight given to full-length single-cell expression ranks and scores when computing the weighted mean over several data types.
Information about the calls actually coming from experimental full-length single-cell RNA-Seq data for this combination of condition parameters (anatomical or all conditions).
Permitted values: yes or no.
Number of observations coming from experimental full-length single-cell RNA-Seq data for this combination of condition parameters (anatomical or all conditions).
Number of observations coming from experimental full-length single-cell RNA-Seq data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.
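As an illustration of how the columns described above might be used once a file is downloaded, the following Python sketch keeps only calls that are "present" with "gold quality" and sorts them by expression score. It assumes a tab-separated file with a header row; the header names used here ("Gene ID", "Anatomical entity ID", "Expression", "Call quality", "Expression score") and every value in the toy data are hypothetical placeholders, so check them against the header of the file you actually downloaded.

```python
import csv
import io

# Toy tab-separated content standing in for a downloaded file.
# Header names and all values are hypothetical placeholders.
TOY_FILE = (
    "Gene ID\tAnatomical entity ID\tExpression\tCall quality\tExpression score\n"
    "GENE_A\tANAT_1\tpresent\tgold quality\t87.5\n"
    "GENE_A\tANAT_2\tabsent\tsilver quality\t12.0\n"
    "GENE_B\tANAT_1\tpresent\tsilver quality\t40.2\n"
    "GENE_B\tANAT_2\tpresent\tgold quality\t95.1\n"
)


def confident_present_calls(handle):
    """Return present/gold-quality rows, highest expression score first."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = [
        row
        for row in reader
        if row["Expression"] == "present" and row["Call quality"] == "gold quality"
    ]
    kept.sort(key=lambda row: float(row["Expression score"]), reverse=True)
    return kept


calls = confident_present_calls(io.StringIO(TOY_FILE))
for row in calls:
    print(row["Gene ID"], row["Anatomical entity ID"], row["Expression score"])
```

To run the same filter on a real file, replace `io.StringIO(TOY_FILE)` with an opened file handle, for example `open(path, newline="")`, where `path` points to the decompressed download.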