Expression Calls File

This documentation describes the content of the presence/absence expression calls download files, how values of each column are generated, and how to download expression calls data.

Introduction
Generation of Expression Calls
Download Expression Calls
Choosing a Download File
- Condition Parameters
- Simple File vs. Advanced File
Download File Details
- File Content
- Column Descriptions

Introduction

Bgee provides calls of presence/absence of expression, which can be retrieved in download files or by using our R package "BgeeDB". The expression calls are reported either: 1) by gene and anatomical entity, or 2) by gene and the combination of i) anatomical entity, ii) developmental and life stage, iii) sex, and iv) strain/ethnicity.

Only wild-type healthy gene expression data is included in Bgee (i.e. no treatment, no disease, no gene knock-out, etc.). Bgee collects data from different experiments and data types, and provides a summary from all these data as unique calls of presence and absence of expression, per gene and condition. For each call, an FDR-corrected p-value and expression score are provided, which allows you to compare levels of expression.

Present/absent expression calls are very similar to the data that can be reported using in situ hybridization methods; Bgee applies dedicated statistical analyses to generate such calls from EST, Affymetrix, bulk RNA-Seq, and single-cell RNA-Seq data, and also collects in situ hybridization calls from model organism databases. This makes it possible to aggregate and compare these present/absent expression calls between different experiments, different data types, and different species.

Generation of expression calls

First step: computation of expression p-values per gene and sample

For each gene and each sample in Bgee, we produce a p-value based on a null hypothesis of expression level equal to or below the background expression noise (i.e. absence of expression).

bulk RNA-Seq data: we use our own method to estimate, for each RNA-Seq library independently, the TPM threshold above which a gene is considered actively transcribed, inferred from the amount of reads mapped to intergenic regions of the genome. For this, we first define a stringent set of reference intergenic regions based on available bulk RNA-Seq libraries for each species. We then call genes expressed if their level of expression is significantly higher than the background noise. For each gene in the library, we compute a Z-score in terms of standard deviations from the mean of reference intergenic regions. Then we calculate a p-value based on a null hypothesis of expression at a similar level to reference intergenic regions, modeled as a Normal distribution.
single-cell RNA-Seq data: the method used is the same as for bulk RNA-Seq data for each cell/library.
Affymetrix data: when raw CEL files are available, we use the gcRMA algorithm to normalize the signal taking into account probe sequences, and use a subset of weakly expressed probesets for estimating the background signal of expression. We then apply a Wilcoxon test to compare the normalized signal of the probesets with the background signal, as implemented in the 'mas5calls' function of the Bioconductor package 'affy', and we use the resulting p-value. When only the MAS5 files are available for an analysis, we use the flags provided by the MAS5 software with the following mapping to a p-value: 0.01 for 'present' detection flags, 0.05 for 'marginal' detection flags, 0.1 for 'absent' detection flags.
EST data: based on the number of ESTs mapped to a gene in a library, we produce a p-value based on the null hypothesis that the EST count is not different from 0, with the formula: 2^(-(est_count + 1)).
in situ hybridization data: we retrieve in situ hybridization data from Model Organism Databases that are part of the Alliance of Genome Resources. We map call qualities provided by these resources to p-values: 0.0004 for 'present high quality' calls; 0.01 for 'present low quality'; 0.1 for 'absent low quality'; 0.5 for 'absent high quality'.
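
The per-sample p-values described above can be sketched as follows. This is a minimal illustration, not Bgee's pipeline code: the function and variable names are hypothetical, the scale of the expression value (for instance log-transformed TPM) is an assumption, and the RNA-Seq p-value is taken as the upper tail of a standard Normal distribution, which is one reading of "significantly higher than the background noise".

```python
import math

def rnaseq_p_value(expression, intergenic_mean, intergenic_sd):
    """P-value for one gene in one library (hypothetical helper).

    Assumption: the Z-score is compared to a standard Normal distribution
    and the p-value is the upper-tail probability P(Z >= z).
    """
    z = (expression - intergenic_mean) / intergenic_sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def est_p_value(est_count):
    """P-value from the number of ESTs mapped to a gene: 2^(-(est_count + 1))."""
    return 2.0 ** -(est_count + 1)

# Fixed mappings quoted in the text above.
MAS5_FLAG_P_VALUES = {"present": 0.01, "marginal": 0.05, "absent": 0.1}
IN_SITU_P_VALUES = {
    "present high quality": 0.0004,
    "present low quality": 0.01,
    "absent low quality": 0.1,
    "absent high quality": 0.5,
}
```

For example, a gene two standard deviations above the mean of the reference intergenic regions gets a p-value of about 0.023 under these assumptions, and a gene with three mapped ESTs gets 2^(-4) = 0.0625.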

Second step: FDR corrected p-values per gene and condition

We capture information about the anatomical localization of samples, their developmental and life stage, sex, and strain or ethnicity. We either manually capture this information using ontologies and controlled vocabularies, or we map existing annotations provided by Model Organism Databases (MODs) to these ontologies and vocabularies.

After p-values are generated from the curated data for each gene and sample, they are propagated using anatomical and life stage ontologies. For instance, the p-value obtained for a gene in a sample studying the condition 'midbrain' at 'aged stage' will be propagated to the condition 'brain' at 'adult stage'. All p-values are propagated in a similar way toward the root of the graph of conditions.
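
The propagation can be pictured with a toy example. The sketch below is an illustration under stated assumptions: the condition graph is a hypothetical three-node chain (the real graph is built from the anatomical and life stage ontologies), the root label is invented, and the function names are not part of Bgee.

```python
from collections import defaultdict

# Hypothetical toy graph: each condition maps to its direct parent conditions.
PARENTS = {
    ("midbrain", "aged stage"): [("brain", "adult stage")],
    ("brain", "adult stage"): [("root entity", "root stage")],
    ("root entity", "root stage"): [],
}

def ancestors(condition, parents):
    """Return every condition reachable by following parent links."""
    seen = set()
    stack = list(parents.get(condition, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate(observed, parents):
    """Copy each observed p-value to its own condition and to all ancestors.

    observed: {condition: [p-values observed directly in that condition]}
    returns:  {condition: [own p-values + p-values from descendants]}
    """
    propagated = defaultdict(list)
    for condition, p_values in observed.items():
        for target in {condition} | ancestors(condition, parents):
            propagated[target].extend(p_values)
    return dict(propagated)
```

With one p-value observed in 'midbrain' at 'aged stage' and one in 'brain' at 'adult stage', the 'brain' condition ends up with both values, while 'midbrain' keeps only its own.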

After all p-values have been propagated, we apply a Benjamini-Hochberg FDR correction to generate one FDR-corrected p-value per gene and condition.
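
For reference, the standard Benjamini-Hochberg adjustment can be sketched as below. The function name is hypothetical, and which p-values are pooled together and how the adjusted values are summarized into a single value per gene and condition is not described in this section, so the sketch stops at the adjusted values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

Each raw p-value is multiplied by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.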

Final step: generation of present/absent expression calls per gene and condition

Present gold quality expression calls: when the FDR-corrected p-value for a gene in a condition is less than or equal to 0.01.
Present silver quality expression calls: when the FDR-corrected p-value for a gene in a condition is less than or equal to 0.05, and greater than 0.01.
Absent gold quality expression calls:
- when the call is supported by at least one p-value generated from data types trusted for absent calls (bulk RNA-Seq, Affymetrix, in situ hybridization)
- and the FDR-corrected p-value for a gene in a condition is greater than 0.1, taking into account all requested data types
- and the FDR-corrected p-value taking into account only data types trusted for absent calls is greater than 0.1
- and there is no FDR-corrected p-value less than or equal to 0.05 in any child condition for that gene, considering the data types trusted for absent calls.
Absent silver quality expression calls: same as absent gold quality expression calls, but using an FDR-corrected p-value threshold of 0.05.
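
The rules above can be read as a small decision function. The sketch below is one interpretation, not Bgee's implementation: the function and parameter names are hypothetical, the child-condition threshold is assumed to stay at 0.05 for absent silver calls, and returning None when no rule applies is an assumption.

```python
def classify_call(fdr_all, fdr_trusted=None, min_child_fdr_trusted=None):
    """Classify a gene/condition pair from its FDR-corrected p-values.

    fdr_all: FDR-corrected p-value over all requested data types.
    fdr_trusted: FDR-corrected p-value over data types trusted for absent
        calls only, or None if no such p-value supports the call.
    min_child_fdr_trusted: smallest FDR-corrected p-value in any child
        condition (trusted data types only), or None if there is none.
    """
    if fdr_all <= 0.01:
        return ("present", "gold quality")
    if fdr_all <= 0.05:
        return ("present", "silver quality")
    # Absent calls need support from at least one trusted data type.
    if fdr_trusted is None:
        return None
    # A child condition showing expression rules out an absent call.
    if min_child_fdr_trusted is not None and min_child_fdr_trusted <= 0.05:
        return None
    if fdr_all > 0.1 and fdr_trusted > 0.1:
        return ("absent", "gold quality")
    if fdr_trusted > 0.05:  # fdr_all > 0.05 is already known here
        return ("absent", "silver quality")
    return None
```

For instance, classify_call(0.0004) gives a present gold quality call, while classify_call(0.08, 0.09) gives an absent silver quality call under this reading.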

Download expression calls

The files can be found on the Bgee download page for each species. To access the download page from the Bgee homepage, go to the download section on the top toolbar and click on "Gene expression calls".

Once on the download calls webpage, you can either search for a specific species in the top search bar or look through the species list at the bottom and click on the species logo to access the different download file options. These datasets can also be downloaded directly using our R package "BgeeDB".

Potential download problems

If you open a file directly in a spreadsheet editor, some cell values may be converted into dates. To avoid this, import the files into the spreadsheet editor rather than opening them directly.
Download files are compressed with gzip. They have to be uncompressed before opening them in an editor.
Tarballs containing TPM values for a species contain gzip files that also need to be uncompressed before opening them in an editor.
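
As an alternative to a spreadsheet editor, a download file can be read programmatically. The sketch below uses only the Python standard library; it assumes the file is tab-separated with a header row, and the function name and any file path passed to it are hypothetical. All values are kept as plain strings, so nothing is converted into a date.

```python
import csv
import gzip

def read_calls(path):
    """Read a gzip-compressed, tab-separated calls file into a list of dicts.

    Assumption: the first line of the file is a header row.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return list(reader)
```

Because gzip.open decompresses on the fly, the file does not need to be uncompressed on disk first.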

Choosing a download file

Once a species is selected, you will need to choose whether you want data only for anatomical entities or for all conditions, and whether you want the summarized information (simple file) or all information (advanced file). The implications of each option are explained in further detail below.

Condition parameters

There are two different options for condition parameters:

anatomical entities only: files contain one expression call for each unique pair of gene and anatomical entity.
all conditions parameters: files contain one expression call for each unique combination of gene, anatomical entity, developmental stage, sex and strain.

Simple file vs. Advanced file

There are two different options when downloading the file:

simple: aimed at providing summarized information over all data types.
advanced: aimed at reporting all information, for instance allowing you to retrieve the contribution of each data type to a call.

Simple and advanced files contain the same expression calls (same number of lines) but advanced files contain more information on each call (more columns).

Advanced file additional information:

expression status generated from each data type (present, absent, no data).
number of present high quality and present low quality calls from each data type.
number of absent high quality and absent low quality calls from in situ, Affymetrix, and RNA-Seq.
data type for which calls are observed. Each call is observed in at least one data type.

Download file details

Below is a complete description of which data you can expect to find in each download file and a detailed description of each column.

File content

Column	Content	In anatomical simple files	In anatomical advanced files	In all conditions simple files	In all conditions advanced files	Example
1	Gene ID	Yes	Yes	Yes	Yes	FBgn0005427
2	Gene name	Yes	Yes	Yes	Yes	ewg
3	Anatomical entity ID	Yes	Yes	Yes	Yes	UBERON:6001060
4	Anatomical entity name	Yes	Yes	Yes	Yes	embryonic brain
5	Developmental stage ID	No	No	Yes	Yes	FBdv:00005334
6	Developmental stage name	No	No	Yes	Yes	embryonic stage 16 (fruit fly)
7	Sex	No	No	Yes	Yes	any
8	Strain	No	No	Yes	Yes	wild-type
9	Expression	Yes	Yes	Yes	Yes	present
10	Call quality	Yes	Yes	Yes	Yes	gold quality
11	FDR	Yes	Yes	Yes	Yes	0.0004
12	Expression score	Yes	Yes	Yes	Yes	49.99
13	Expression rank	Yes	Yes	Yes	Yes	8.32e3
14	Including observed data	No	Yes	No	Yes	yes
15	Self observation count	No	Yes	No	Yes	1
16	Descendant observation count	No	Yes	No	Yes	0
17	Affymetrix expression	No	Yes	No	Yes	no data
18	Affymetrix call quality	No	Yes	No	Yes	NA
19	Affymetrix FDR	No	Yes	No	Yes	NA
20	Affymetrix expression score	No	Yes	No	Yes	NA
21	Affymetrix expression rank	No	Yes	No	Yes	NA
22	Affymetrix weight for expression rank and score	No	Yes	No	Yes	NA
23	Including Affymetrix observed data	No	Yes	No	Yes	no
24	Self observation count Affymetrix	No	Yes	No	Yes	0
25	Descendant observation count Affymetrix	No	Yes	No	Yes	0
26	EST expression	No	Yes	No	Yes	no data
27	EST call quality	No	Yes	No	Yes	NA
28	EST FDR	No	Yes	No	Yes	NA
29	EST expression score	No	Yes	No	Yes	NA
30	EST expression rank	No	Yes	No	Yes	NA
31	EST weight for expression rank and score	No	Yes	No	Yes	NA
32	Including EST observed data	No	Yes	No	Yes	no
33	Self observation count EST	No	Yes	No	Yes	0
34	Descendant observation count EST	No	Yes	No	Yes	0
35	in situ hybridization expression	No	Yes	No	Yes	present
36	in situ hybridization call quality	No	Yes	No	Yes	gold quality
37	in situ hybridization FDR	No	Yes	No	Yes	0.0004
38	in situ hybridization expression score	No	Yes	No	Yes	49.99
39	in situ hybridization expression rank	No	Yes	No	Yes	8.32e3
40	in situ hybridization weight for expression rank and score	No	Yes	No	Yes	5.00
41	Including in situ hybridization observed data	No	Yes	No	Yes	yes
42	Self observation count in situ hybridization	No	Yes	No	Yes	1
43	Descendant observation count in situ hybridization	No	Yes	No	Yes	0
44	RNA-Seq expression	No	Yes	No	Yes	no data
45	RNA-Seq call quality	No	Yes	No	Yes	NA
46	RNA-Seq FDR	No	Yes	No	Yes	NA
47	RNA-Seq expression score	No	Yes	No	Yes	NA
48	RNA-Seq expression rank	No	Yes	No	Yes	NA
49	RNA-Seq weight for expression rank and score	No	Yes	No	Yes	NA
50	Including RNA-Seq observed data	No	Yes	No	Yes	no
51	Self observation count RNA-Seq	No	Yes	No	Yes	0
52	Descendant observation count RNA-Seq	No	Yes	No	Yes	0
53	full-length single-cell RNA-Seq expression	No	Yes	No	Yes	no data
54	full-length single-cell RNA-Seq call quality	No	Yes	No	Yes	NA
55	full-length single-cell RNA-Seq FDR	No	Yes	No	Yes	NA
56	full-length single-cell RNA-Seq expression score	No	Yes	No	Yes	NA
57	full-length single-cell RNA-Seq expression rank	No	Yes	No	Yes	NA
58	full-length single-cell RNA-Seq weight for expression rank and score	No	Yes	No	Yes	NA
59	Including full-length single-cell RNA-Seq observed data	No	Yes	No	Yes	no
60	Self observation count full-length single-cell RNA-Seq	No	Yes	No	Yes	0
61	Descendant observation count full-length single-cell RNA-Seq	No	Yes	No	Yes	0
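
Reading the table above, the set of columns present in each file type follows from two choices: condition parameters and simple vs. advanced. The sketch below only restates the table; the names are hypothetical, and the column numbers are those of the table, so the physical position of a column in a file that lacks some columns may differ (an assumption worth checking against the file header).

```python
# Column numbers as listed in the table above.
ALL_COLUMNS = set(range(1, 62))          # columns 1 to 61
ALL_CONDITIONS_ONLY = set(range(5, 9))   # columns 5 to 8: stage, sex, strain
ADVANCED_ONLY = set(range(14, 62))       # columns 14 to 61: per data type details

def expected_columns(all_conditions, advanced):
    """Return the sorted table column numbers present in a given file type."""
    columns = set(ALL_COLUMNS)
    if not all_conditions:
        columns -= ALL_CONDITIONS_ONLY
    if not advanced:
        columns -= ADVANCED_ONLY
    return sorted(columns)
```

Under this reading, anatomical simple files have 9 columns, all conditions simple files have 13, anatomical advanced files have 57, and all conditions advanced files have 61.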

Column descriptions

Gene ID (column 1)

Unique identifier of the gene.

Gene name (column 2)

Name of the gene defined by Gene ID (column 1).

Anatomical entity ID (column 3)

Unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 4)

Name of the anatomical entity defined by Anatomical entity ID (column 3).

Developmental stage ID (column 5)

Unique identifier of the developmental stage, from the Uberon ontology.

Developmental stage name (column 6)

Name of the developmental stage defined by Developmental stage ID (column 5).

Sex (column 7)

Sex of the sample used to generate the call.

Strain (column 8)

Strain of the sample used to generate the call.

Expression (column 9)

Call generated from all data types for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent.

Call quality (column 10)

Call quality from all data types for the selected combination of condition parameters (anatomical or all conditions). Permitted values: gold quality, silver quality.

FDR (column 11)

FDR-corrected p-value of the call.

Expression score (column 12)

Expression score of the call. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

Expression rank (column 13)

Rank score associated with the call. Rank scores of expression calls are normalized across genes, conditions and species.

A low score means that the gene is highly expressed in the condition.
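
The opposite directions of rank and score (low rank and high score both mean high expression) can be illustrated with a simple linear rescaling. The exact formula used by Bgee is not given in this document, so the function below is only an assumed min-max inversion with hypothetical names, meant to show the direction of the relation rather than to reproduce the values in the files.

```python
def expression_score(rank, min_rank, max_rank):
    """Map a rank to a 0-100 score, where the lowest rank gets the highest score.

    Assumption: linear min-max rescaling; not necessarily Bgee's formula.
    """
    if max_rank <= min_rank:
        raise ValueError("max_rank must be greater than min_rank")
    if not min_rank <= rank <= max_rank:
        raise ValueError("rank must lie between min_rank and max_rank")
    return 100.0 * (max_rank - rank) / (max_rank - min_rank)
```

With min_rank = 1 and max_rank = 101, a rank of 1 maps to a score of 100, a rank of 51 to 50, and a rank of 101 to 0.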

Including observed data (column 14)

Permitted value: yes.

Only calls that were actually seen in experimental data, at least once, are in this file.

Self observation count (column 15)

Number of observations coming from experimental data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count (column 16)

Number of observations coming from experimental data for the combination of condition parameters (anatomical or all conditions) descendant of the current one.

Affymetrix expression (column 17)

Call generated from Affymetrix data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

Affymetrix call quality (column 18)

Quality associated with the call from Affymetrix data. Permitted values: gold quality, silver quality, NA.

Affymetrix FDR (column 19)

FDR-corrected p-value of the call calculated using p-values coming from Affymetrix data.

Affymetrix expression score (column 20)

Expression score of the call from Affymetrix data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

Affymetrix expression rank (column 21)

Rank score associated with the call from Affymetrix data. Rank scores of expression calls are normalized across genes, conditions and species.

A low score means that the gene is highly expressed in the condition.

Affymetrix weight for expression rank and score (column 22)

The weight given to Affymetrix expression ranks and scores when computing the weighted mean over several data types.
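
A weighted mean over data types can be sketched as follows. The function name is hypothetical, and skipping data types whose value or weight is missing (NA in the file) is an assumption about how such entries are handled.

```python
def weighted_mean(values_and_weights):
    """Weighted mean of (value, weight) pairs, ignoring missing entries.

    Pairs where the value or the weight is None (NA in the file) are skipped.
    Returns None when no usable pair remains.
    """
    pairs = [(v, w) for v, w in values_and_weights if v is not None and w is not None]
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in pairs) / total_weight
```

In the example row of the file content table, only in situ hybridization has data (score 49.99, weight 5.00), so the weighted mean over data types equals 49.99, which matches the overall Expression score shown there.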

Including Affymetrix observed data (column 23)

Information about the calls actually coming from experimental Affymetrix data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count Affymetrix (column 24)

Number of observations coming from experimental Affymetrix data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count Affymetrix (column 25)

Number of observations coming from experimental Affymetrix data for the combination of condition parameters (anatomical or all conditions) descendant of the current one.

EST expression (column 26)

Call generated from EST data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

EST call quality (column 27)

Quality associated with the call from EST data. Permitted values: gold quality, silver quality, NA.

EST FDR (column 28)

FDR-corrected p-value of the call calculated using p-values coming from EST data.

EST expression score (column 29)

Expression score of the call from EST data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

EST expression rank (column 30)

Rank score associated with the call from EST data. Rank scores of expression calls are normalized across genes, conditions, and species.

A low score means that the gene is highly expressed in the condition.

EST weight for expression rank and score (column 31)

The weight given to EST expression ranks and scores when computing the weighted mean over several data types.

Including EST observed data (column 32)

Information about the calls actually coming from experimental EST data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count EST (column 33)

Number of observations coming from experimental EST data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count EST (column 34)

Number of observations coming from experimental EST data for the combination of condition parameters (anatomical or all conditions) descendant of the current one.

in situ hybridization expression (column 35)

Call generated from in situ hybridization data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

in situ hybridization call quality (column 36)

Quality associated with the call from in situ hybridization data. Permitted values: gold quality, silver quality, NA.

in situ hybridization FDR (column 37)

FDR-corrected p-value of the call calculated using p-values coming from in situ hybridization data.

in situ hybridization expression score (column 38)

Expression score of the call from in situ hybridization data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

in situ hybridization expression rank (column 39)

Rank score associated with the call from in situ hybridization data. Rank scores of expression calls are normalized across genes, conditions, and species.

A low score means that the gene is highly expressed in the condition.

in situ hybridization weight for expression rank and score (column 40)

The weight given to in situ hybridization expression ranks and scores when computing the weighted mean over several data types.

Including in situ hybridization observed data (column 41)

Information about the calls actually coming from experimental in situ hybridization data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count in situ hybridization (column 42)

Number of observations coming from experimental in situ hybridization data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count in situ hybridization (column 43)

Number of observations coming from experimental in situ hybridization data for the combination of condition parameters (anatomical or all conditions) descendant of the current one.

RNA-Seq expression (column 44)

Call generated from bulk RNA-Seq data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

RNA-Seq call quality (column 45)

Quality associated with the call from bulk RNA-Seq data. Permitted values: gold quality, silver quality, NA.

RNA-Seq FDR (column 46)

FDR-corrected p-value of the call calculated using p-values coming from RNA-Seq data.

RNA-Seq expression score (column 47)

Expression score of the call from RNA-Seq data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

RNA-Seq expression rank (column 48)

Rank score associated with the call from RNA-Seq data. Rank scores of expression calls are normalized across genes, conditions and species.

A low score means that the gene is highly expressed in the condition.

RNA-Seq weight for expression rank and score (column 49)

The weight given to RNA-Seq expression ranks and scores when computing the weighted mean over several data types.

Including RNA-Seq observed data (column 50)

Information about the calls actually coming from experimental RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count RNA-Seq (column 51)

Number of observations coming from experimental RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count RNA-Seq (column 52)

Number of observations coming from experimental RNA-Seq data for the combination of condition parameters (anatomical or all conditions) descendant of the current one.

full-length single-cell RNA-Seq expression (column 53)

Call generated from full-length single-cell RNA-Seq data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

full-length single-cell RNA-Seq call quality (column 54)

Quality associated with the call from full-length single-cell RNA-Seq data. Permitted values: gold quality, silver quality, NA.

full-length single-cell RNA-Seq FDR (column 55)

FDR-corrected p-value of the call calculated using p-values coming from full-length single-cell RNA-Seq data.

full-length single-cell RNA-Seq expression score (column 56)

Expression score of the call from full-length single-cell RNA-Seq data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

full-length single-cell RNA-Seq expression rank (column 57)

Rank score associated with the call from full-length single-cell RNA-Seq data. Rank scores of expression calls are normalized across genes, conditions, and species.

A low score means that the gene is highly expressed in the condition.

full-length single-cell RNA-Seq weight for expression rank and score (column 58)

The weight given to full-length single-cell RNA-Seq expression ranks and scores when computing the weighted mean over several data types.

Including full-length single-cell RNA-Seq observed data (column 59)

Information about the calls actually coming from experimental full-length single-cell RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count full-length single-cell RNA-Seq (column 60)

Number of observations coming from experimental full-length single-cell RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count full-length single-cell RNA-Seq (column 61)

Number of observations coming from experimental full-length single-cell RNA-Seq data for the combination of condition parameters (anatomical or all conditions) descendant of the current one.