Expression Calls File

This documentation describes the content of the presence/absence expression calls download files, how values of each column are generated, and how to download expression calls data.

Introduction

Bgee provides calls of presence/absence of expression, which can be retrieved in download files or by using our R package "BgeeDB". The expression calls are reported either: 1) by gene and anatomical entity, or 2) by gene and i) anatomical entity, ii) developmental and life stage, iii) sex, and iv) strain/ethnicity.

Only wild-type healthy gene expression data is included in Bgee (i.e. no treatment, no disease, no gene knock-out, etc.). Bgee collects data from different experiments and data types, and provides a summary from all these data as unique calls of presence and absence of expression, per gene and condition. For each call, an FDR-corrected p-value and expression score are provided, which allows you to compare levels of expression.

Present/absent expression calls are very similar to the data that can be reported using in situ hybridization methods; Bgee applies dedicated statistical analyses to generate such calls from EST, Affymetrix, bulk RNA-Seq, and single-cell RNA-Seq, and also collects in situ hybridization calls from model organism databases. This offers the possibility to aggregate and compare these present/absent expression calls between different experiments, different data types, and different species.

Generation of expression calls

First step: computation of expression p-values per gene and sample

For each gene and each sample in Bgee, we produce a p-value based on a null hypothesis of expression level equal to or below the background expression noise (i.e. absence of expression).

  • bulk RNA-Seq data: for each RNA-Seq library independently, we use our own method to estimate the TPM threshold above which a gene is considered actively transcribed, inferred from the amount of reads mapped to intergenic regions of the genome. For this, we first define a stringent set of reference intergenic regions based on available bulk RNA-Seq libraries for each species. We then call genes expressed if their level of expression is significantly higher than the background noise. For each gene in the library, we compute a Z-score, expressed in standard deviations from the mean of the reference intergenic regions. We then calculate a p-value based on a null hypothesis of expression at a level similar to that of the reference intergenic regions, modeled as a Normal distribution.
  • single-cell RNA-Seq data: the method used is the same as for bulk RNA-Seq data for each cell/library.
  • Affymetrix data: when raw CEL files are available, we use the gcRMA algorithm to normalize the signal taking into account probe sequences, and use a subset of weakly expressed probesets for estimating the background signal of expression. We then apply a Wilcoxon test to compare the normalized signal of the probesets with the background signal, as implemented in the 'mas5calls' function of the Bioconductor package 'affy', and we use the resulting p-value. When only the MAS5 files are available for an analysis, we use the flags provided by the MAS5 software with the following mapping to a p-value: 0.01 for 'present' detection flags, 0.05 for 'marginal' detection flags, 0.1 for 'absent' detection flags.
  • EST data: based on the number of ESTs mapped to a gene in a library, we produce a p-value based on the null hypothesis that the EST count is not different from 0, with the formula: 2^(-(est_count + 1)).
  • in situ hybridization data: we retrieve in situ hybridization data from the Model Organism Databases (MODs) that are part of the Alliance of Genome Resources. We map call qualities provided by these resources to p-values: 0.0004 for 'present high quality' calls; 0.01 for 'present low quality'; 0.1 for 'absent low quality'; 0.5 for 'absent high quality'.
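
The fixed mappings and formulas in the list above can be sketched in Python as follows. This is a minimal illustration, not Bgee's pipeline code: all function and constant names are hypothetical, and the RNA-Seq function assumes a one-sided (upper-tail) test on a scale where the reference intergenic levels are approximately Normal.

```python
import math

# Flag-to-p-value mappings quoted from the text above.
MAS5_FLAG_P_VALUES = {"present": 0.01, "marginal": 0.05, "absent": 0.1}
IN_SITU_QUALITY_P_VALUES = {
    "present high quality": 0.0004,
    "present low quality": 0.01,
    "absent low quality": 0.1,
    "absent high quality": 0.5,
}


def est_p_value(est_count: int) -> float:
    """p-value for an EST count, using 2^(-(est_count + 1))."""
    if est_count < 0:
        raise ValueError("est_count must be non-negative")
    return 2.0 ** -(est_count + 1)


def rna_seq_p_value(gene_level: float, intergenic_mean: float, intergenic_sd: float) -> float:
    """Upper-tail Normal p-value of a gene against reference intergenic regions.

    Assumption: one-sided test; the three arguments share the same scale.
    """
    z = (gene_level - intergenic_mean) / intergenic_sd
    # Survival function of the standard Normal distribution: P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

With this formula, a library with no EST mapped to a gene gives a p-value of 0.5, and each additional EST halves it.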

Second step: FDR-corrected p-values per gene and condition

We capture information about the anatomical localization of samples, their developmental and life stage, sex, and strain or ethnicity. We either manually capture this information using ontologies and controlled vocabularies, or we map existing annotations provided by MODs to these ontologies and vocabularies.

After p-values are generated from the raw data for each gene and sample, they are propagated using anatomical and life stage ontologies. For instance, the p-value obtained for a gene in a sample studying the condition 'midbrain' at 'aged stage' will be propagated to the condition 'brain' at 'adult stage'. All p-values are propagated in a similar way toward the root of the graph of conditions.
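
The propagation toward the root can be pictured with a small sketch. The condition graph below is a hypothetical miniature (only the 'midbrain'/'brain' pair comes from the example above; the root node is invented for illustration), and the function name is an assumption, not part of Bgee.

```python
from collections import defaultdict

# Hypothetical miniature condition graph: condition -> list of parent conditions.
PARENTS = {
    ("midbrain", "aged stage"): [("brain", "adult stage")],
    ("brain", "adult stage"): [("root", "life cycle")],
    ("root", "life cycle"): [],
}


def propagate(p_values, parents):
    """Collect, for every condition, its own p-values plus those of its descendants."""
    collected = defaultdict(list)
    for condition, values in p_values.items():
        seen = set()
        stack = [condition]
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # a condition reachable by several paths is counted once
            seen.add(current)
            collected[current].extend(values)
            stack.extend(parents.get(current, []))
    return dict(collected)
```

Calling `propagate({("midbrain", "aged stage"): [0.001]}, PARENTS)` attaches the p-value 0.001 to the observed condition and to each of its ancestors.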

After all p-values have been propagated, we apply a Benjamini-Hochberg FDR correction to generate one FDR p-value per gene and condition.
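
For readers unfamiliar with the procedure, the standard Benjamini-Hochberg adjustment can be sketched as below. This shows only the generic textbook adjustment of a list of p-values; how the adjusted values are summarized into the single FDR p-value per gene and condition is not detailed here, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```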

Final step: generation of present/absent expression calls per gene and condition

  • Present gold quality expression calls: when the FDR-corrected p-value for a gene in a condition is less than or equal to 0.01.
  • Present silver quality expression calls: when the FDR-corrected p-value for a gene in a condition is less than or equal to 0.05, and greater than 0.01.
  • Absent gold quality expression calls:
    • when the call is supported by at least one p-value generated from data types trusted for absent calls (bulk RNA-Seq, Affymetrix, in situ hybridization)
    • and the FDR-corrected p-value for a gene in a condition is greater than 0.1, taking into account all requested data types
    • and the FDR-corrected p-value taking into account only data types trusted for absent calls is greater than 0.1
    • and there is no FDR-corrected p-value less than or equal to 0.05 in any child condition for that gene, considering the data types trusted for absent calls.
  • Absent silver quality expression calls: same as absent gold quality expression calls, but using an FDR-corrected p-value threshold of 0.05.
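
The thresholds above can be summarized in a short decision function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the child-condition threshold is kept at 0.05 for both absent qualities, and returning None when no rule applies is an illustrative choice rather than documented behavior.

```python
from typing import Optional, Tuple


def classify_call(
    fdr_all: float,
    fdr_trusted: Optional[float],
    min_child_fdr_trusted: Optional[float],
) -> Optional[Tuple[str, str]]:
    """Return (expression, quality), or None when no rule applies.

    fdr_all: FDR-corrected p-value over all requested data types.
    fdr_trusted: same, restricted to data types trusted for absent calls;
        None when no such data type supports the call.
    min_child_fdr_trusted: smallest trusted FDR among child conditions;
        None when there is none.
    """
    if fdr_all <= 0.01:
        return ("present", "gold quality")
    if fdr_all <= 0.05:
        return ("present", "silver quality")
    if fdr_trusted is None:
        return None  # absent calls need at least one trusted p-value
    if min_child_fdr_trusted is not None and min_child_fdr_trusted <= 0.05:
        return None  # a child condition contradicts absence
    if fdr_all > 0.1 and fdr_trusted > 0.1:
        return ("absent", "gold quality")
    if fdr_trusted > 0.05:  # fdr_all > 0.05 already holds at this point
        return ("absent", "silver quality")
    return None
```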

Download expression calls

The files can be found on the Bgee download page for each species. To access the download page from the Bgee homepage, go to the download section on the top toolbar and click on "Gene expression calls".

Once on the download calls webpage, you can either search for a specific species in the top search bar or look through the species list at the bottom and click on the species logo to access the different download file options. These datasets can also be downloaded directly using our R package "BgeeDB".

Potential download problems

  • If you open a file directly with a spreadsheet editor, it may convert some cell values into dates. To avoid this problem, import the file into the spreadsheet editor instead of opening it directly.
  • Download files are compressed with gzip. They have to be uncompressed before being opened in an editor.
  • Tarballs containing TPM values for a species contain gzip files that also need to be uncompressed before being opened in an editor.
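
When processing the files with a script rather than an editor, both issues can be sidestepped by decompressing on the fly and keeping every cell as text. The sketch below assumes that the download files are tab-separated with a header row; the function name is hypothetical.

```python
import csv
import gzip


def read_calls(path):
    """Yield one dict per call from a gzip-compressed, tab-separated file."""
    # Mode "rt" decompresses on the fly and decodes the content as text.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # csv keeps every cell as a string, so no value is converted to a date.
        yield from csv.DictReader(handle, delimiter="\t")
```

Each yielded dict is keyed by the column names found in the header row of the file.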

Choosing a download file

Once a species is selected, you will need to choose if you want data only for anatomical entities or for all conditions, and if you want the summarized information (simple file) or all information (advanced file). The implications of each option are explained in further detail below.

Condition parameters

There are two different options for condition parameters:

  • anatomical entities only: files contain one expression call for each unique pair of gene and anatomical entity.
  • all conditions parameters: files contain one expression call for each unique combination of gene, anatomical entity, developmental stage, sex, and strain.

Simple file vs. Advanced file

There are two different options when downloading the file:

  • simple: aimed at providing summarized information over all data types.
  • advanced: aimed at reporting all information, for instance allowing you to retrieve the contribution of each data type to a call.

Simple and advanced files contain the same expression calls (same number of lines) but advanced files contain more information on each call (more columns).

Advanced file additional information:

  • expression status generated from each data type (present, absent, no data).
  • number of present high quality and present low quality calls from each data type.
  • number of absent high quality and absent low quality calls from in situ, Affymetrix, and RNA-Seq.
  • data type for which calls are observed. Each call is observed in at least one data type.

Download file details

Below is a complete description of which data you can expect to find in each download file and a detailed description of each column.

File content

| Column | Content | In anatomical simple files | In anatomical advanced files | In all conditions simple files | In all conditions advanced files | Example |
|---|---|---|---|---|---|---|
| 1 | Gene ID | Yes | Yes | Yes | Yes | FBgn0005427 |
| 2 | Gene name | Yes | Yes | Yes | Yes | ewg |
| 3 | Anatomical entity ID | Yes | Yes | Yes | Yes | UBERON:6001060 |
| 4 | Anatomical entity name | Yes | Yes | Yes | Yes | embryonic brain |
| 5 | Developmental stage ID | No | No | Yes | Yes | FBdv:00005334 |
| 6 | Developmental stage name | No | No | Yes | Yes | embryonic stage 16 (fruit fly) |
| 7 | Sex | No | No | Yes | Yes | any |
| 8 | Strain | No | No | Yes | Yes | wild-type |
| 9 | Expression | Yes | Yes | Yes | Yes | present |
| 10 | Call quality | Yes | Yes | Yes | Yes | gold quality |
| 11 | FDR | Yes | Yes | Yes | Yes | 0.0004 |
| 12 | Expression score | Yes | Yes | Yes | Yes | 49.99 |
| 13 | Expression rank | Yes | Yes | Yes | Yes | 8.32e3 |
| 14 | Including observed data | No | Yes | No | Yes | yes |
| 15 | Self observation count | No | Yes | No | Yes | 1 |
| 16 | Descendant observation count | No | Yes | No | Yes | 0 |
| 17 | Affymetrix expression | No | Yes | No | Yes | no data |
| 18 | Affymetrix call quality | No | Yes | No | Yes | NA |
| 19 | Affymetrix FDR | No | Yes | No | Yes | NA |
| 20 | Affymetrix expression score | No | Yes | No | Yes | NA |
| 21 | Affymetrix expression rank | No | Yes | No | Yes | NA |
| 22 | Affymetrix weight for expression rank and score | No | Yes | No | Yes | NA |
| 23 | Including Affymetrix observed data | No | Yes | No | Yes | no |
| 24 | Self observation count Affymetrix | No | Yes | No | Yes | 0 |
| 25 | Descendant observation count Affymetrix | No | Yes | No | Yes | 0 |
| 26 | EST expression | No | Yes | No | Yes | no data |
| 27 | EST call quality | No | Yes | No | Yes | NA |
| 28 | EST FDR | No | Yes | No | Yes | NA |
| 29 | EST expression score | No | Yes | No | Yes | NA |
| 30 | EST expression rank | No | Yes | No | Yes | NA |
| 31 | EST weight for expression rank and score | No | Yes | No | Yes | NA |
| 32 | Including EST observed data | No | Yes | No | Yes | no |
| 33 | Self observation count EST | No | Yes | No | Yes | 0 |
| 34 | Descendant observation count EST | No | Yes | No | Yes | 0 |
| 35 | in situ hybridization expression | No | Yes | No | Yes | present |
| 36 | in situ hybridization call quality | No | Yes | No | Yes | gold quality |
| 37 | in situ hybridization FDR | No | Yes | No | Yes | 0.0004 |
| 38 | in situ hybridization expression score | No | Yes | No | Yes | 49.99 |
| 39 | in situ hybridization expression rank | No | Yes | No | Yes | 8.32e3 |
| 40 | in situ hybridization weight for expression rank and score | No | Yes | No | Yes | 5.00 |
| 41 | Including in situ hybridization observed data | No | Yes | No | Yes | yes |
| 42 | Self observation count in situ hybridization | No | Yes | No | Yes | 1 |
| 43 | Descendant observation count in situ hybridization | No | Yes | No | Yes | 0 |
| 44 | RNA-Seq expression | No | Yes | No | Yes | no data |
| 45 | RNA-Seq call quality | No | Yes | No | Yes | NA |
| 46 | RNA-Seq FDR | No | Yes | No | Yes | NA |
| 47 | RNA-Seq expression score | No | Yes | No | Yes | NA |
| 48 | RNA-Seq expression rank | No | Yes | No | Yes | NA |
| 49 | RNA-Seq weight for expression rank and score | No | Yes | No | Yes | NA |
| 50 | Including RNA-Seq observed data | No | Yes | No | Yes | no |
| 51 | Self observation count RNA-Seq | No | Yes | No | Yes | 0 |
| 52 | Descendant observation count RNA-Seq | No | Yes | No | Yes | 0 |
| 53 | full-length single-cell RNA-Seq expression | No | Yes | No | Yes | no data |
| 54 | full-length single-cell RNA-Seq call quality | No | Yes | No | Yes | NA |
| 55 | full-length single-cell RNA-Seq FDR | No | Yes | No | Yes | NA |
| 56 | full-length single-cell RNA-Seq expression score | No | Yes | No | Yes | NA |
| 57 | full-length single-cell RNA-Seq expression rank | No | Yes | No | Yes | NA |
| 58 | full-length single-cell RNA-Seq weight for expression rank and score | No | Yes | No | Yes | NA |
| 59 | Including full-length single-cell RNA-Seq observed data | No | Yes | No | Yes | no |
| 60 | Self observation count full-length single-cell RNA-Seq | No | Yes | No | Yes | 0 |
| 61 | Descendant observation count full-length single-cell RNA-Seq | No | Yes | No | Yes | 0 |

Column descriptions

Gene ID (column 1)

Unique identifier of the gene.

Gene name (column 2)

Name of the gene defined by Gene ID (column 1).

Anatomical entity ID (column 3)

Unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 4)

Name of the anatomical entity defined by Anatomical entity ID (column 3).

Developmental stage ID (column 5)

Unique identifier of the developmental stage, from the Uberon ontology.

Developmental stage name (column 6)

Name of the developmental stage defined by Developmental stage ID (column 5).

Sex (column 7)

Sex of the sample used to generate the call.

Strain (column 8)

Strain of the sample used to generate the call.

Expression (column 9)

Call generated from all data types for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent.

Call quality (column 10)

Call quality from all data types for the selected combination of condition parameters (anatomical or all conditions). Permitted values: gold quality, silver quality.
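
As an illustration of how the Expression and Call quality columns can be combined, the sketch below keeps only the calls matching a given status and quality. It assumes each row is a dict keyed by column names identical to those listed in this documentation ("Expression", "Call quality"); the function name is hypothetical.

```python
def select_calls(rows, expression="present", quality="gold quality"):
    """Keep rows whose Expression and Call quality columns match the request."""
    return [
        row
        for row in rows
        if row["Expression"] == expression and row["Call quality"] == quality
    ]
```

By default the function returns present calls of gold quality; passing `expression="absent"` or `quality="silver quality"` selects the other permitted values.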

FDR (column 11)

FDR-corrected p-value of the call.

Expression score (column 12)

Expression score associated with the call. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

Expression rank (column 13)

Rank score associated with the call. Rank scores of expression calls are normalized across genes, conditions and species.

A low score means that the gene is highly expressed in the condition.

Including observed data (column 14)

Permitted value: yes.

Only calls that were actually seen in experimental data, at least once, are in this file.

Self observation count (column 15)

Number of observations coming from experimental data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count (column 16)

Number of observations coming from experimental data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.

Affymetrix expression (column 17)

Call generated from Affymetrix data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

Affymetrix call quality (column 18)

Quality associated with the call from Affymetrix data. Permitted values: gold quality, silver quality, NA.

Affymetrix FDR (column 19)

FDR-corrected p-value of the call calculated using p-values coming from Affymetrix data.

Affymetrix expression score (column 20)

Expression score associated with the call from Affymetrix data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

Affymetrix expression rank (column 21)

Rank score associated with the call from Affymetrix data. Rank scores of expression calls are normalized across genes, conditions and species.

A low score means that the gene is highly expressed in the condition.

Affymetrix weight for expression rank and score (column 22)

The weight given to Affymetrix expression ranks and scores when computing the weighted mean over several data types.
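
The weighted mean mentioned here is the usual one: each data type contributes its value multiplied by its weight, and the sum is divided by the total weight. A minimal sketch, assuming that data types without data are represented by None (the files themselves use NA) and using a hypothetical function name:

```python
def weighted_mean(values_and_weights):
    """Weighted mean of (value, weight) pairs; pairs with missing data are skipped."""
    pairs = [
        (value, weight)
        for value, weight in values_and_weights
        if value is not None and weight is not None
    ]
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("no data type contributes a weight")
    return sum(value * weight for value, weight in pairs) / total_weight
```

When a single data type has data, as in the example row of the table where only in situ hybridization contributes, the weighted mean is simply that data type's value.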

Including Affymetrix observed data (column 23)

Indicates whether the call actually comes from experimental Affymetrix data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count Affymetrix (column 24)

Number of observations coming from experimental Affymetrix data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count Affymetrix (column 25)

Number of observations coming from experimental Affymetrix data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.

EST expression (column 26)

Call generated from EST data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

EST call quality (column 27)

Quality associated with the call from EST data. Permitted values: gold quality, silver quality, NA.

EST FDR (column 28)

FDR-corrected p-value of the call calculated using p-values coming from EST data.

EST expression score (column 29)

Expression score associated with the call from EST data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

EST expression rank (column 30)

Rank score associated with the call from EST data. Rank scores of expression calls are normalized across genes, conditions, and species.

A low score means that the gene is highly expressed in the condition.

EST weight for expression rank and score (column 31)

The weight given to EST expression ranks and scores when computing the weighted mean over several data types.

Including EST observed data (column 32)

Indicates whether the call actually comes from experimental EST data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count EST (column 33)

Number of observations coming from experimental EST data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count EST (column 34)

Number of observations coming from experimental EST data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.

in situ hybridization expression (column 35)

Call generated from in situ hybridization data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

in situ hybridization call quality (column 36)

Quality associated with the call from in situ hybridization data. Permitted values: gold quality, silver quality, NA.

in situ hybridization FDR (column 37)

FDR-corrected p-value of the call calculated using p-values coming from in situ hybridization data.

in situ hybridization expression score (column 38)

Expression score associated with the call from in situ hybridization data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

in situ hybridization expression rank (column 39)

Rank score associated with the call from in situ hybridization data. Rank scores of expression calls are normalized across genes, conditions, and species.

A low score means that the gene is highly expressed in the condition.

in situ hybridization weight for expression rank and score (column 40)

The weight given to in situ hybridization expression ranks and scores when computing the weighted mean over several data types.

Including in situ hybridization observed data (column 41)

Indicates whether the call actually comes from experimental in situ hybridization data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count in situ hybridization (column 42)

Number of observations coming from experimental in situ hybridization data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count in situ hybridization (column 43)

Number of observations coming from experimental in situ hybridization data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.

RNA-Seq expression (column 44)

Call generated from bulk RNA-Seq data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

RNA-Seq call quality (column 45)

Quality associated with the call from bulk RNA-Seq data. Permitted values: gold quality, silver quality, NA.

RNA-Seq FDR (column 46)

FDR-corrected p-value of the call calculated using p-values coming from RNA-Seq data.

RNA-Seq expression score (column 47)

Expression score associated with the call from RNA-Seq data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

RNA-Seq expression rank (column 48)

Rank score associated with the call from RNA-Seq data. Rank scores of expression calls are normalized across genes, conditions and species.

A low score means that the gene is highly expressed in the condition.

RNA-Seq weight for expression rank and score (column 49)

The weight given to RNA-Seq expression ranks and scores when computing the weighted mean over several data types.

Including RNA-Seq observed data (column 50)

Indicates whether the call actually comes from experimental RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count RNA-Seq (column 51)

Number of observations coming from experimental RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count RNA-Seq (column 52)

Number of observations coming from experimental RNA-Seq data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.

full-length single-cell RNA-Seq expression (column 53)

Call generated from full-length single-cell RNA-Seq data for the selected combination of condition parameters (anatomical or all conditions). Permitted values: present, absent, no data.

full-length single-cell RNA-Seq call quality (column 54)

Quality associated with the call from full-length single-cell RNA-Seq data. Permitted values: gold quality, silver quality, NA.

full-length single-cell RNA-Seq FDR (column 55)

FDR-corrected p-value of the call calculated using p-values coming from full-length single-cell RNA-Seq data.

full-length single-cell RNA-Seq expression score (column 56)

Expression score associated with the call from full-length single-cell RNA-Seq data. The score uses the minimum and maximum Expression rank (column 13) of the species to normalize the expression to a value between 0 and 100.

A low score means that the gene is lowly expressed in the condition.

full-length single-cell RNA-Seq expression rank (column 57)

Rank score associated with the call from full-length single-cell RNA-Seq data. Rank scores of expression calls are normalized across genes, conditions, and species.

A low score means that the gene is highly expressed in the condition.

full-length single-cell RNA-Seq weight for expression rank and score (column 58)

The weight given to full-length single-cell RNA-Seq expression ranks and scores when computing the weighted mean over several data types.

Including full-length single-cell RNA-Seq observed data (column 59)

Indicates whether the call actually comes from experimental full-length single-cell RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Permitted values: yes, no.

Self observation count full-length single-cell RNA-Seq (column 60)

Number of observations coming from experimental full-length single-cell RNA-Seq data for this combination of condition parameters (anatomical or all conditions).

Descendant observation count full-length single-cell RNA-Seq (column 61)

Number of observations coming from experimental full-length single-cell RNA-Seq data for combinations of condition parameters (anatomical or all conditions) that are descendants of the current one.