The full Bgee database is provided as TSV files or as a MySQL dump in the download section. Here is a description of these files, with some basic information on their content. The data are linked between files using common IDs: IDs that have the same name represent the same object.
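As a minimal sketch of how the common IDs can be used, the following Python snippet joins two TSV tables on a column they share. The in-memory sample rows are hypothetical, and the exact header names ("Gene ID", "Gene name", "Stage ID") are assumptions that should be checked against the header line of the downloaded files.

```python
import csv
import io

def read_tsv(handle):
    """Read a tab-separated file with a header line into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))

def join_on(left_rows, right_rows, key):
    """Inner-join two lists of row dicts on a column name they share."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[key], []):
            merged = dict(match)
            merged.update(row)  # values from the left table win on name clashes
            joined.append(merged)
    return joined

# Hypothetical sample content standing in for genes.tsv and expression.tsv.
genes = read_tsv(io.StringIO("Gene ID\tGene name\nG1\tgeneA\nG2\tgeneB\n"))
expression = read_tsv(io.StringIO("Gene ID\tStage ID\nG1\tS1\nG1\tS2\n"))

rows = join_on(expression, genes, "Gene ID")
```

With real files, `read_tsv` would be called on a handle opened with `open(path, newline="")` instead of `io.StringIO`.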
Ontologies
Species
File: specie.tsv
Homologous Organs Groups
File: homologousOrgansGroups.tsv
Description: homology relationships between anatomical structures of different species-specific ontologies are represented as groups of homologous anatomical structures (Homologous Organs Groups, HOGs). This file contains the list of HOGs. The mapping of the anatomical structures to these HOGs is present in the anatomicalStructures.tsv file. See also the Homology relationships section for more information.
HOG relationships
File: HOGRelationships.tsv
Descent HOG ID | Relation type |
Description: represents the relationships between HOGs (part_of, is_a, ...). See the Homology relationships section for more information.
Metastages
File: metastages.tsv
Metastage name | Metastage description | Metastage left bound | Metastage right bound | Metastage level |
Description: the metastages represent the key events of development common to all bilaterian animals. The developmental stages of each species are then mapped to these metastages. This file contains the list of metastages. The mapping of the developmental stages to these metastages is present in the stages.tsv file. The is_a relationships between metastages are represented as a Nested Set Model (see [Nested Set Model in Wikipedia]), by using the columns left bound, right bound, and level.
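The Nested Set Model encoding can be queried with simple interval comparisons. The sketch below is an illustration under assumptions: the `Node` type, its field names, and the sample tree are hypothetical, and only the general Nested Set Model rule is used (a node is a descendant of another when its bounds lie strictly inside the other's bounds).

```python
from typing import NamedTuple

class Node(NamedTuple):
    node_id: str
    left: int   # "left bound" column
    right: int  # "right bound" column

def is_descendant(child, parent):
    """True if the interval of `child` lies strictly inside that of `parent`."""
    return parent.left < child.left and child.right < parent.right

def descendants(nodes, parent):
    """All nodes below `parent` in the is_a hierarchy."""
    return [n for n in nodes if is_descendant(n, parent)]

def direct_parent(nodes, child):
    """Closest ancestor: the enclosing node with the greatest left bound."""
    ancestors = [n for n in nodes if is_descendant(child, n)]
    return max(ancestors, key=lambda n: n.left, default=None)

# Hypothetical hierarchy: root contains a and b; a contains a1.
nodes = [
    Node("root", 1, 8),
    Node("a", 2, 5),
    Node("a1", 3, 4),
    Node("b", 6, 7),
]
```

The "level" column is not needed for these queries; it is mostly a convenience for selecting all nodes at a given depth.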
Metastage name synonyms
File: metastageNameSynonyms.tsv
Developmental stages
File: stages.tsv
Stage name | Stage description | Stage left bound | Stage right bound | Stage level | Species ID | Metastage ID | Too granular |
Description: the is_a relationships between developmental stages are represented as a Nested Set Model (see [Nested Set Model in Wikipedia]), by using the columns left bound, right bound, and level. These values are of course independent for different "Species ID". The developmental stages are mapped to the metastages using the "Metastage ID" column. Note that any stages not explicitly mapped to a metastage ("Metastage ID" not defined), but with a mapped parent stage, are also mapped to this metastage.
The field "Too granular" defines stages that are too granular for differential expression analyses: in some cases, annotations can be too granular for proper use in differential analyses. For instance, using our human developmental ontology, data annotated at the age of "23 year-old" would not be considered to be the same condition as "24 year-old". To solve this problem, we defined some developmental stages as being "too granular". For the differential analyses, all data mapped to such stages are transferred to their closest parent stage that is not too granular. In the example above, data annotated at "23 year-old" and "24 year-old" would be considered as being the same "young adult stage" condition. The parent stage that is actually used for the analysis can then be retrieved in the DEA chips groups table (see below).
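The transfer to the closest parent stage that is not too granular can be sketched as follows. This is an illustration, not the Bgee implementation: the dictionary keys ("left", "right", "species", "too_granular"), the stage IDs, and the bound values are all hypothetical. Bounds are compared only within one species, since they are independent across species.

```python
def analysis_stage(stage_id, stages):
    """Return the ID of the stage that would be used for differential analyses.

    `stages` maps a stage ID to a dict with the hypothetical keys
    "left", "right", "species" and "too_granular".
    """
    current = stages[stage_id]
    if not current["too_granular"]:
        return stage_id
    # Ancestors in the same species that are not too granular.
    candidates = [
        (s["left"], sid)
        for sid, s in stages.items()
        if s["species"] == current["species"]
        and not s["too_granular"]
        and s["left"] < current["left"]
        and current["right"] < s["right"]
    ]
    if not candidates:
        return None  # no usable parent stage
    # The closest ancestor is the enclosing stage with the greatest left bound.
    return max(candidates)[1]

# Hypothetical stages; the second species reuses the same bounds on purpose.
stages = {
    "adult": {"left": 1, "right": 10, "species": 9606, "too_granular": False},
    "young adult": {"left": 2, "right": 7, "species": 9606, "too_granular": False},
    "23 year-old": {"left": 3, "right": 4, "species": 9606, "too_granular": True},
    "24 year-old": {"left": 5, "right": 6, "species": 9606, "too_granular": True},
    "other adult": {"left": 2, "right": 9, "species": 10090, "too_granular": False},
}
```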
Developmental stage name synonyms
File: stageNameSynonyms.tsv
Anatomical structures
File: anatomicalStructures.tsv
Anatomical structure name | Anatomical structure description | Start stage ID | End stage ID | HOG ID | HOG confidence | HOG reference |
Description: represents the members of the anatomical ontologies. Homology relationships between anatomical structures are represented as groups of homologous structures (Homologous Organs Groups, HOGs). The anatomical structures are mapped to these HOGs using the "HOG ID" column. For each mapping, a level of confidence is provided: "obvious", "well-established" (a reference is provided), "debated" (a consensus has been chosen), "uncertain" (a reference may be provided), "Homolonto" (unreviewed automatic alignment). Note that any structures not explicitly mapped to a HOG ("HOG ID" not defined), but with a mapped parent structure, are also mapped to this HOG. See the Homology relationships section for more information.
Anatomical structure name synonyms
File: anatomicalStructureNameSynonyms.tsv
Anatomical structure name synonym |
Anatomical structure relationships
File: anatomicalStructureRelationships.tsv
Descent anatomical structure ID | Relation type |
Description: represents the relationships between anatomical structures (part_of, is_a, ...).
Genes and gene families
Gene family prediction methods
File: geneFamilyPredictionMethods.tsv
Gene family prediction method |
Description: Bgee uses several prediction methods to define gene families:
- Large families: For protein coding genes, Bgee recovers the families as defined in Ensembl ("Protein families"). These family predictions are based on the Tribe MCL clustering method, including all protein isoforms of every coding gene that Ensembl predicts, but also all fungi/metazoa proteins present in Uniprot/SWISSPROT and Uniprot/SPTREMBL.
For miRNA families, Bgee recovers the families as defined in miRBase. These families are taken from Rfam.
Note that miRNAs belong only to this type of gene family.
- Orthologs Vertebrates: Bgee reports groups of orthologs with a common ancestor gene in vertebrates or any sub-taxa (more precisely Euteleostomi or any sub-taxa), based on the Ensembl gene trees. These are based on TreeBeST, which aims to represent the evolutionary history of gene families, i.e. genes that diverged from duplication or speciation events [Gene Orthology/Paralogy prediction method].
- Orthologs Animals: Bgee reports groups of orthologs with a common ancestor gene in animals (more precisely Coelomata), based on the Ensembl gene trees. These are based on TreeBeST, which aims to represent the evolutionary history of gene families, i.e. genes that diverged from duplication or speciation events [Gene Orthology/Paralogy prediction method].
Gene families
File: geneFamilies.tsv
Gene family name | Gene family description | Gene family prediction method ID |
Gene types
File: geneBioTypes.txt
Genes
File: genes.tsv
Gene name | Gene description | Gene type ID | Species ID |
Gene to gene families
File: geneToGeneFamilies.tsv
Description: as Bgee uses several gene family prediction methods, genes can belong to several gene families. This association file is thus required.
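Because the association is many-to-many, it is convenient to load it into a mapping from each gene to the set of its families. The snippet below is a minimal sketch; the (gene ID, family ID) pairs are hypothetical sample values, not taken from the actual file.

```python
from collections import defaultdict

def families_by_gene(pairs):
    """Build a gene ID -> set of gene family IDs mapping from association rows."""
    index = defaultdict(set)
    for gene_id, family_id in pairs:
        index[gene_id].add(family_id)
    return dict(index)

# Hypothetical association rows: one gene can appear with several families,
# typically one per prediction method.
pairs = [
    ("G1", "large_family_12"),
    ("G1", "ortho_vertebrates_7"),
    ("G2", "large_family_12"),
]
gene_families = families_by_gene(pairs)
```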
Global expression data
Data sources
File: dataSources.tsv
Description: this file contains the list of primary databases used to construct the Bgee database. Expression data present in Bgee are then mapped to these data sources.
Expression
File: expression.tsv
Gene ID | Anatomical structure ID | Stage ID | Expression confidence for EST data | Expression confidence for Affymetrix data | Expression confidence for in situ hybridization data | Expression confidence for RNA-Seq data |
Description: this file recapitulates all the expression data stored in Bgee, whatever the data type. Each line represents an expression pattern: a gene, expressed in an anatomical structure, at a developmental stage. A column is then added for each data type, which can take three values: "no data", "poor quality", or "high quality". If this expression pattern has not been detected using this data type, the value is "no data". For EST data, "poor quality" and "high quality" represent the best data quality among all the data of this type that define this expression pattern. For Affymetrix, in situ hybridization, and RNA-Seq data, this value represents the overall expression summary. See the data analysis section for more information.
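As an illustration of how the per-data-type columns can be read, the sketch below lists the data types that support an expression pattern and picks the highest confidence found on a line. The column names follow the list above; the ranking helper and the "best confidence across data types" summary are conveniences introduced here, not values computed by Bgee, and the sample row is hypothetical.

```python
# Order of the three documented values, from weakest to strongest.
RANK = {"no data": 0, "poor quality": 1, "high quality": 2}

CONFIDENCE_COLUMNS = [
    "Expression confidence for EST data",
    "Expression confidence for Affymetrix data",
    "Expression confidence for in situ hybridization data",
    "Expression confidence for RNA-Seq data",
]

def supporting_data_types(row):
    """Columns whose data type detected this expression pattern."""
    return [c for c in CONFIDENCE_COLUMNS if row[c] != "no data"]

def best_confidence(row):
    """Highest confidence value found across all data types on this line."""
    return max((row[c] for c in CONFIDENCE_COLUMNS), key=RANK.__getitem__)

# Hypothetical line of expression.tsv, already parsed into a dict.
row = {
    "Gene ID": "G1",
    "Expression confidence for EST data": "no data",
    "Expression confidence for Affymetrix data": "poor quality",
    "Expression confidence for in situ hybridization data": "no data",
    "Expression confidence for RNA-Seq data": "high quality",
}
```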
No expression
File: noExpression.tsv
Gene ID | Anatomical structure ID | Stage ID | No expression confidence for Affymetrix data | No expression confidence for in situ hybridization data | No expression confidence for RNA-Seq data |
Description: this file recapitulates information of no-expression stored in Bgee, whatever the data type. Each line represents the information that a gene is NOT expressed in an anatomical structure, at a developmental stage. A column is then added for each data type, which can take two values: "no data" or "high quality" ("no-expression" data are only "high quality" data). If this information has not been detected using this data type, the value is "no data". See the data analysis section for more information.
Differential expression
File: differentialExpression.tsv
Gene ID | Anatomical structure ID | Stage ID | Differential expression direction ('over' or 'under') | Differential expression confidence for Affymetrix data |
Description: this file recapitulates information of differential expression stored in Bgee, from Affymetrix data. Each line represents the information that a gene is under- or over-expressed in an anatomical structure, at a developmental stage. The last column represents the confidence in this information, which can take three values: "no data", "poor quality", or "high quality". See the data analysis section for more information.
RNA-Seq data
RNA-Seq experiments
File: rnaSeqExperiments.tsv
RNA-Seq experiment name | RNA-Seq experiment description | Data source ID |
Description: list of RNA-Seq experiments used in Bgee. They are linked to the original data source by the column "Data source ID".
RNA-Seq platforms
File: rnaSeqPlatforms.tsv
RNA-Seq platform description |
Description: list of the RNA-Seq platforms used, e.g., Illumina Genome Analyzer IIx.
RNA-Seq libraries
File: rnaSeqLibraries.tsv
RNA-Seq secondary library ID | RNA-Seq experiment ID | RNA-Seq platform ID | Anatomical structure ID | Stage ID | log2 RPK threshold | Percentage of "present" genes | Percentage of "present" protein-coding genes | Percentage of "present" intronic regions | Percentage of "present" intergenic regions | Total reads count | Used reads count | Aligned reads count | Minimum read length | Maximum read length | Library type |
Description: list of the RNA-Seq libraries annotated and used in Bgee. They are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID".
'RNA-Seq library ID' represents the ID of the sample in the GEO database.
'RNA-Seq secondary library ID' represents the ID of the library in the SRA database.
'log2 RPK threshold' represents the value above which genes are considered as expressed (see data analysis section for more information).
'Total reads count' is the total number of reads present in the library (all runs aggregated).
'Used reads count' is the number of remaining reads after filtering by the TopHat software.
RNA-Seq runs
File: rnaSeqRuns.tsv
Description: RNA-Seq runs from the SRA database, linked to their container library.
RNA-Seq results
File: rnaSeqResults.tsv
Gene ID | log2 RPK value | Aligned reads count | Detection flag | Expression ID | No expression ID | Expression confidence for RNA-Seq data | If result excluded, reason for exclusion |
Description: list of the gene present/absent calls for every RNA-Seq library stored in Bgee.
'log2 RPK value' is the value used to define whether a gene is expressed or not, as compared to the log2 RPK threshold for the given library (see data analysis section).
'Expression confidence for RNA-Seq data': please note that for now, all RNA-Seq results are considered as high quality data.
A result can be either associated to an expression result (Expression ID not null), or a no-expression result (No Expression ID not null).
A result with a detection flag "absent" can still be associated to an expression result (Expression ID not null), when it represents a conflict (another result, for the same gene at the same developmental stage in the same anatomical structure, shows expression for this gene). This result flagged as "absent" is used to decrease the quality of the associated expression result.
If both Expression ID and No Expression ID are null, the result has been removed from the dataset. Reasons for exclusions are:
- pre filtering: genes always seen as "absent" over the whole dataset are not considered.
- noExpression conflict: a "no-expression" result has been removed because of expression of the same gene detected on other libraries in some substructures/child stages.
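The rules above on Expression ID, No Expression ID and the detection flag can be summarized as a small decision function. This is a sketch under assumptions: the status labels returned here are invented for the example, and the way null values are written in the files (empty string, `NULL`, or `\N`) is a guess that should be checked against the actual downloads.

```python
# Assumed textual encodings of a null value; verify against the real files.
NULL_MARKERS = {"", "NULL", "\\N"}

def parse_nullable(value):
    """Return None for an assumed null marker, the raw value otherwise."""
    if value is None or value.strip() in NULL_MARKERS:
        return None
    return value

def result_status(expression_id, no_expression_id, detection_flag):
    """Classify one result line following the rules described above."""
    if expression_id is not None:
        if detection_flag == "absent":
            # "absent" call attached to an expression result: a conflict that
            # lowers the quality of that expression result.
            return "conflicting absent call"
        return "expression"
    if no_expression_id is not None:
        return "no-expression"
    # Both IDs null: the result was removed from the dataset.
    return "excluded"
```

The same decision logic applies to the Affymetrix probesets and in situ spots files, which use the same pair of ID columns.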
Affymetrix data
Microarray experiments
File: microarrayExperiments.tsv
Microarray experiment name | Microarray experiment description | Data source ID |
Description: list of microarray experiments used in Bgee. They are linked to the original data source by the column "Data source ID".
Chip types
File: chipTypes.tsv
Chip type name | CDF name | Usable in Bgee | Bgee quality score threshold | Percent present |
Description: list of the different chip types used in Bgee.
'Bgee quality score threshold' and 'Percent present threshold' are the thresholds used to consider a chip as of good quality. If their value is '0', it means that no threshold was defined for this chip type.
Normalization types
File: normalizationTypes.tsv
Description: list of the different normalization methods used by Bgee to renormalize probeset signal intensities, e.g. "gcRMA". See the data analysis section for more information.
Detection types
File: detectionTypes.tsv
Description: list of the different methods used by Bgee to determine whether a gene is expressed or not, and with which confidence, based on the normalized probeset signal intensities, e.g. "Schuster et al. method". See the data analysis section for more information.
Affymetrix chips
File: affymetrixChips.tsv
Affymetrix chip ID | Microarray experiment ID | Chip type ID | Scan date | Normalization type ID | Detection type ID | Bgee quality score | Percent present | Anatomical structure ID | Stage ID |
Description: list of the Affymetrix chips annotated and used in Bgee. They are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID".
'Bgee quality score' is an Affymetrix quality score developed by Bgee (see documentation) used to remove low quality chips. It can only be used when raw CEL files are available. The percentage of probesets with a 'present' state ('Percent present') is also used to remove low quality chips, and is applied on all data, even when only MAS5 processed files are available. Thresholds on quality score and percent present are defined for each chip type independently. They can be retrieved from the table 'Chip types', using the 'Chip type ID' field. All chips present in Bgee have passed quality controls.
The field 'Affymetrix Chip ID' represents the ID of the chip in the source database. 'Bgee affymetrix chip ID' is a Bgee internal ID, needed because 'Affymetrix Chip ID' values are not unique (only the pairs 'Affymetrix Chip ID' - 'Microarray experiment ID' are). This internal ID is used to link to other tables ('Affymetrix probesets' and 'DEA chips groups to affymetrix chips').
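Since a chip is only identified unambiguously by the pair of its source chip ID and its experiment ID, a lookup from that pair to the internal ID can be built as sketched below. The dictionary keys mirror the column names used in this documentation, and the sample rows are hypothetical.

```python
def index_chips(rows):
    """Map (Affymetrix chip ID, Microarray experiment ID) to the Bgee internal ID."""
    index = {}
    for row in rows:
        key = (row["Affymetrix chip ID"], row["Microarray experiment ID"])
        if key in index:
            # The pair is expected to be unique; a repeat signals a data problem.
            raise ValueError(f"duplicate chip/experiment pair: {key}")
        index[key] = row["Bgee affymetrix chip ID"]
    return index

# Hypothetical rows: the same source chip ID appears in two experiments.
chips = [
    {"Bgee affymetrix chip ID": "1", "Affymetrix chip ID": "chipA",
     "Microarray experiment ID": "exp1"},
    {"Bgee affymetrix chip ID": "2", "Affymetrix chip ID": "chipA",
     "Microarray experiment ID": "exp2"},
]
chip_index = index_chips(chips)
```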
Affymetrix probesets
File: affymetrixProbesets.tsv
Bgee affymetrix chip ID | Gene ID | Normalized signal intensity | Detection flag | Expression ID | No Expression ID | Confidence for Affymetrix data | Reason for exclusion |
Description: list of the probesets of every Affymetrix chip stored in Bgee. A level of confidence is assigned for each probeset: "poor quality" or "high quality" (see data analysis section).
A probeset can be either associated to an expression result (Expression ID not null), or a no-expression result (No Expression ID not null).
A probeset with a detection flag "absent" can still be associated to an expression result (Expression ID not null), when it represents a conflict (another probeset, for the same gene at the same developmental stage in the same anatomical structure, is detected as expressed). This probeset flagged as "absent" is used to decrease the quality of the associated expression result.
If both Expression ID and No Expression ID are null, the probeset has been removed from the dataset. Reasons for exclusions are:
- pre filtering: probesets always seen as "absent" or "marginal" over the whole dataset are removed.
- bronze quality (quality too low): for a gene/organ/stage, mix of probesets "absent" and "marginal" (no "present", and inconsistency expression / no-expression).
- absent low quality (MAS5): a no-expression result is retrieved only using MAS5. No-expression results must be confirmed by analyses where raw data are available.
- noExpression conflict: a "no-expression" result has been removed because of expression of the same gene detected by other probesets in some substructures/child stages.
See the data analysis section for more information. Note that this file is very large.
Differential expression analysis (DEA) types
File: differentialExpressionAnalysisTypes.tsv
Description: types of analysis used to detect over-expression of genes.
Differential expression analyses (DEA)
File: differentialExpressionAnalyses.tsv
DEA type ID | Microarray experiment ID |
Description: analysis performed to detect over-expression of genes.
DEA chips groups
File: DEAChipsGroups.tsv
DEA ID | Anatomical structure ID | Stage ID |
Description: to detect over-expression of genes, Differential Expression Analyses can only be performed on groups of chips (more or less equivalent to a set of replicates), with the same conditions (anatomical structures/developmental stages), to estimate variance of gene expression. The analysis (DEA ID) will then use at least 3 different groups of chips, from the same experiment, in different conditions (anatomical structures/developmental stages), to detect over-expression in some of them.
The Stage ID does not always correspond to the Stage ID of the corresponding Affymetrix chips in the Affymetrix chips table. It differs when the chip was annotated to a developmental stage too granular for differential expression analyses. In that case, data are transferred to the closest parent stage that is not too granular, and it is the Stage ID of this parent that can be retrieved in this table. When the annotation is not too granular, the Stage ID is the same as for the corresponding chips. See the documentation of the Developmental stages table for more information.
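The grouping of chips by condition can be sketched as follows. This is only an illustration: the input tuples and IDs are hypothetical, the stage IDs are assumed to have already been replaced by their closest parent stage that is not too granular, and no minimum number of chips per group is enforced because none is stated here.

```python
from collections import defaultdict

MIN_GROUPS = 3  # an analysis uses at least 3 groups in different conditions

def chip_groups(chips):
    """Group the chips of one experiment by (structure ID, stage ID) condition."""
    groups = defaultdict(list)
    for chip_id, structure_id, stage_id in chips:
        groups[(structure_id, stage_id)].append(chip_id)
    return dict(groups)

def can_run_dea(chips):
    """True if the experiment provides enough distinct conditions."""
    return len(chip_groups(chips)) >= MIN_GROUPS

# Hypothetical chips of a single experiment: (chip ID, structure ID, stage ID).
chips = [
    ("c1", "brain", "adult"), ("c2", "brain", "adult"),
    ("c3", "liver", "adult"), ("c4", "liver", "adult"),
    ("c5", "heart", "adult"), ("c6", "heart", "adult"),
]
```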
DEA chips groups to affymetrix chips
File: DEAChipsGroupsToAffymetrixChips.tsv
Description: this file makes it possible to trace back to the original chips that were grouped to perform a Differential Expression Analysis.
DEA affymetrix probesets summaries
File: deaAffymetrixProbesetSummary.tsv
DEA chips group ID | Gene ID | Fold change | Differential Expression ID | Expression confidence for DE Affymetrix data | Raw p-value | If excluded, reason for exclusion |
Description: a line in this table is a summary of a set of probesets, used for the differential expression analysis, belonging to different individual Affymetrix chips, corresponding to one group of chips (DEA chips group ID). If Differential Expression ID is not null, then this gene is differentially expressed in these conditions.
A probeset showing under-expression can still be associated to an over-expression result (and vice versa): it means there was a conflict (another probeset or chip, for the same gene at the same developmental stage in the same anatomical structure, showed the opposite direction of differential expression). The result with the lowest p-value is considered and overall quality is set to "low quality".
If Differential Expression ID is null, the probeset has been removed from the dataset. Reasons for exclusions are:
- not expressed: the corresponding gene has been shown to be not expressed in this anatomical structure at this developmental stage (see noExpression table).
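The conflict rule described above (keep the result with the lowest p-value and downgrade the overall quality) can be sketched like this. The tuple layout and sample values are hypothetical; keeping the quality of the retained result when there is no conflict is an assumption, since that case is not specified here.

```python
def summarize(calls):
    """Combine calls for one gene in one condition.

    `calls` is a non-empty list of (direction, p_value, quality) tuples,
    where direction is "over" or "under".
    Returns (direction, p_value, overall_quality).
    """
    best = min(calls, key=lambda call: call[1])  # lowest p-value wins
    directions = {call[0] for call in calls}
    if len(directions) > 1:
        # Opposite directions were observed: conflict, quality is downgraded.
        quality = "low quality"
    else:
        # Assumption: without conflict, keep the quality of the retained call.
        quality = best[2]
    return best[0], best[1], quality

conflicting = [("over", 0.01, "high quality"), ("under", 0.001, "high quality")]
consistent = [("over", 0.02, "high quality"), ("over", 0.005, "poor quality")]
```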
In situ hybridization data
In situ experiments
File: inSituExperiments.tsv
In situ experiment name | In situ experiment description | Data source ID |
Description: list of the in situ hybridization experiments used in Bgee. They are linked to the original data source by the column "Data source ID".
In situ evidences
File: inSituEvidences.tsv
Description: an "evidence" represents a material used to define a gene expression pattern: most of the time an image of an in situ hybridization, but it can also be, for instance, a publication.
In situ spots
File: inSituSpots.tsv
In situ evidence ID | Anatomical structure ID | Stage ID | Gene ID | Detection flag | Expression ID | No Expression ID | Confidence for in situ hybridization data | Reason for exclusion |
Description: a spot represents a gene expression pattern defined by in situ hybridization data. Most of the time, it represents a labeled area of a hybridization image, but it could also represent the information found in a publication. Spots are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID". A level of confidence is assigned for each spot: "poor quality" or "high quality" (see data analysis section).
A spot can be either associated to an expression result (Expression ID not null), or a no-expression result (No Expression ID not null).
A spot with a detection flag "absent" can still be associated to an expression result (Expression ID not null), when it represents a conflict (another in situ spot, for the same gene at the same developmental stage in the same anatomical structure, is detected as expressed). This spot flagged as "absent" is used to decrease the quality of the associated expression result.
If both Expression ID and No Expression ID are null, the spot has been removed from the dataset. Reasons for exclusions are:
- bronze quality (quality too low): for a gene/organ/stage, mix of spots "absent" and "expressed low quality" (no "expressed high quality", and inconsistency expression / no-expression).
- absent low quality: a no-expression result has been retrieved using "low quality" spots only. A no-expression result must be confirmed by "high quality" data.
- noExpression conflict: a "no-expression" result has been removed because of expression of the same gene detected by other spots in some substructures/child stages.
See the data analysis section for more information.
EST libraries
File: ESTLibraries.tsv
EST library name | EST library description | Anatomical structure ID | Stage ID | Data source ID |
Description: list of EST libraries annotated and used in Bgee. They are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID". They are linked to the original data source by the column "Data source ID".
Expressed Sequence Tags
File: ESTs.tsv
EST library ID | Gene ID | Unigene cluster ID | Expression ID | Expression confidence for EST data |
Description: a level of confidence is assigned for each EST: "poor quality" or "high quality". See the data analysis section for more information.