This documentation describes the format of annotation download files (library and experiment files) as well as the processed expression values download files for single-cell RNA-Seq data. The files can be found on the Bgee download page.
The annotations are split across two files:
library file: provides detailed information for each individual sample (where each sample is a unique cell), including anatomical entity, developmental stage, cell type, sex, strain, and quality control metrics.
experiment file: provides overall information about the experiment, including the number of libraries that belong to the experiment, and the number of conditions, organs, stages, cell types, and strains.
The Experiment ID column provides the unique identifier per experiment.
The Library ID column provides the unique identifier per sample (where each sample is a unique cell) that belongs to an Experiment ID (column 1).
The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.
The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 3).
The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.
The Stage name column provides the name of the developmental stage defined by Stage ID (column 5).
The Cell type ID column provides the unique identifier of the cell type, from the Cell Ontology (CL identifiers, for example CL:0000352).
The Cell type name column provides the name of the cell type defined by Cell type ID (column 7).
The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').
The Strain column provides information about the genetic variant or subtype of an organism.
The Expression mapped anatomical entity ID column is the annotation used in the Bgee expression calls. It can differ from the Anatomical entity ID (column 3) when the original annotation is too granular to be inserted in the database.
The Expression mapped anatomical entity name column provides the name of the anatomical entity defined by Expression mapped anatomical entity ID (column 11).
The Expression mapped stage ID column is the annotation used in the Bgee expression calls. It can differ from the Stage ID (column 5) when the original annotation is too granular to be inserted in the database.
The Expression mapped stage name column provides the name of the developmental stage defined by Expression mapped stage ID (column 13).
The Expression mapped cell type ID column is the annotation used in the Bgee expression calls. It can differ from the Cell type ID (column 7) when the original annotation is too granular to be inserted in the database.
The Expression mapped cell type name column provides the name of the cell type defined by Expression mapped cell type ID (column 15).
The Expression mapped sex column provides the sex information used in the Bgee expression calls ('any', 'male', 'female', 'hermaphrodite').
The Expression mapped strain column provides the genetic variant or subtype of an organism used in the Bgee expression calls.
The Platform ID column provides the sequencing platform identifier.
The Library type column indicates the read layout of the library. This can be single (single-end) or paired-end.
The Library orientation column provides the relative orientation of the reads.
The TPM expression threshold column provides the minimum TPM value at which a gene is called expressed in the Library ID (column 2).
The Read count column provides the total number of reads in the library that are submitted for mapping to the transcriptome.
The Mapped read count column provides the number of those reads that map to (overlap) a genomic position.
The Min. read length column provides the length, in base pairs (bp), of the shortest read in the library.
The Max. read length column provides the length, in base pairs (bp), of the longest read in the library.
The All genes percent present column provides the percentage of genes called actively expressed in the Library ID (column 2).
The Protein coding genes percent present column provides the percentage of protein coding genes called actively expressed in the Library ID (column 2).
The Intergenic regions percent present column provides the percentage of intergenic regions called actively expressed in the Library ID (column 2).
The Distinct rank count column provides the number of distinct ranks in the Library ID (column 2). It is used to weight the rank information coming from this library when computing expression ranks and expression scores.
The Max rank in the expression mapped condition column provides the max rank over all libraries in this condition. It is used to normalize ranks between conditions when computing expression ranks and expression scores.
The Run IDs column lists the sequencing run(s) associated with the Library ID (column 2).
The Data source column provides the data repository from which the raw files were extracted. The repository groups all Run IDs (column 32) corresponding to a given Library ID (column 2).
The Data source URL column provides the URL of the Library ID (column 2) in the data repository.
The Bgee normalized data URL column provides the URL of the processed data in Bgee for the corresponding Experiment ID (column 1).
The Raw file URL column provides the URL of the SRA Run Selector, which gives access to the Run IDs (column 32) of the Library ID (column 2).
| Experiment ID | Library ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Cell type ID | Cell type name | Sex | Strain | Expression mapped anatomical entity ID | Expression mapped anatomical entity name | Expression mapped stage ID | Expression mapped stage name | Expression mapped cell type ID | Expression mapped cell type name | Expression mapped sex | Expression mapped strain | Platform ID | Library type | Library orientation | TPM expression threshold | Read count | Mapped read count | Min. read length | Max. read length | All genes percent present | Protein coding genes percent present | Intergenic regions percent present | Distinct rank count | Max rank in the expression mapped condition | Run IDs | Data source | Data source URL | Bgee normalized data URL | Raw file URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERP013381 | ERX1226594 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | NA | CD-1 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | not annotated | CD-1 | Illumina HiSeq 2500 | single | NA | 3.88442 | 3238518 | 1467281 | 125 | 125 | 13.79 | 31.54 | 1.55 | 10642 | NA | NA | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226594 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226594 |
| ERP013381 | ERX1226595 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | NA | CD-1 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | not annotated | CD-1 | Illumina HiSeq 2500 | single | NA | 2.32718 | 3621774 | 2049490 | 125 | 125 | 15.34 | 34.81 | 1.28 | 11014 | NA | NA | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226595 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226595 |
| ERP013381 | ERX1226596 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | NA | CD-1 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | not annotated | CD-1 | Illumina HiSeq 2500 | single | NA | 3.39165 | 3581718 | 1606871 | 125 | 125 | 13.11 | 29.75 | 1.17 | 9585 | NA | NA | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226596 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226596 |
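
As an illustration of how the library file can be consumed, the following minimal sketch counts libraries per cell type and computes a mapping rate (Mapped read count divided by Read count) per library. It assumes the downloaded file is tab-separated with a header row carrying the column names listed above; `SAMPLE` and `summarize_libraries` are hypothetical names, and `SAMPLE` contains only a subset of the columns.

```python
import csv
import io
from collections import Counter

# Stand-in for a library annotation file, reduced to a few columns.
# Assumption: the real file is tab-separated with these header names.
SAMPLE = (
    "Experiment ID\tLibrary ID\tCell type ID\tCell type name\tRead count\tMapped read count\n"
    "ERP013381\tERX1226594\tCL:0000352\tepiblast cell\t3238518\t1467281\n"
    "ERP013381\tERX1226595\tCL:0000352\tepiblast cell\t3621774\t2049490\n"
)


def summarize_libraries(handle):
    """Return (number of libraries per cell type, mapping rate per library)."""
    per_cell_type = Counter()
    mapping_rate = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        per_cell_type[row["Cell type name"]] += 1
        # Fraction of the library's reads that were mapped.
        mapping_rate[row["Library ID"]] = (
            int(row["Mapped read count"]) / int(row["Read count"])
        )
    return per_cell_type, mapping_rate


counts, rates = summarize_libraries(io.StringIO(SAMPLE))
print(dict(counts))
print({library: round(rate, 3) for library, rate in rates.items()})
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle.
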
| Column | Content | Example |
|---|---|---|
| 1 | Experiment ID | ERP013381 |
| 2 | Experiment name | Mouse embryonic RNA-seq |
| 3 | Library count | 1205 |
| 4 | Condition count | 4 |
| 5 | Organ-stage count | 4 |
| 6 | Organ count | 1 |
| 7 | Stage count | 4 |
| 8 | Cell-Type count | 2 |
| 9 | Sex count | 1 |
| 10 | Strain count | 1 |
| 11 | Data source | SRA |
| 12 | Data source URL | https://www.ncbi.nlm.nih.gov/sra/ERP013381 |
| 13 | Bgee normalized data URL | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz |
| 14 | Experiment description | The study was aimed at interrogating the early stages of blood cell development within the embryo... |
The Experiment ID column provides the unique identifier per experiment.
The Experiment name column provides the title of the Experiment ID (column 1).
The Library count column provides the total number of libraries associated with the Experiment ID (column 1).
The Condition count column provides the number of distinct conditions in the Experiment ID (column 1), where a condition is a unique combination of anatomical entity, developmental stage, cell type, sex, and strain.
The Organ-stage count column provides the total number of unique combinations of anatomical entity (see Organ count, column 6) and developmental stage (see Stage count, column 7) in the target Experiment ID (column 1).
The Organ count column provides the total number of anatomical entity ids in the target Experiment ID (column 1).
The Stage count column provides the total number of developmental stages in the target Experiment ID (column 1).
The Cell-Type count column provides the total number of cell types in the target Experiment ID (column 1).
The Sex count column provides the total number of sexes in the target Experiment ID (column 1).
The Strain count column provides the total number of genetic variants or subtypes in the target Experiment ID (column 1).
The Data source column provides the data repository from which the raw files that belong to the Experiment ID (column 1) were extracted.
The Data source URL column provides the URL of the Experiment ID (column 1) in the data repository.
The Bgee normalized data URL column provides the URL of the processed data in Bgee for the corresponding Experiment ID (column 1).
The Experiment description column provides the description given by the authors of the Experiment ID (column 1).
| Experiment ID | Experiment name | Library count | Condition count | Organ-stage count | Organ count | Stage count | Cell-Type count | Sex count | Strain count | Data source | Data source URL | Bgee normalized data URL | Experiment description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERP013381 | Mouse embryonic RNA-seq | 1205 | 4 | 4 | 1 | 4 | 2 | 1 | 1 | SRA | https://www.ncbi.nlm.nih.gov/sra/ERP013381 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | The study was aimed at interrogating the early stages of blood cell development within the embryo... |
| SRP020490 | Single-cell RNA-Seq reveals dynamic, random monoallelic gene expression in mammalian cells | 118 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | SRA | https://www.ncbi.nlm.nih.gov/sra/SRP020490 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_SRP020490.tsv.gz | In the diploid genome, genes come in two copies, which can have different DNA sequence and where one is maternal and one is paternal... |
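
The count columns described above can be read as sizes of sets of distinct value combinations. The sketch below shows this under the assumption that the counts are taken over the raw annotation columns of the library file (whether Bgee uses the raw or the expression mapped annotations is not stated here). `count_distinct`, `make_library`, and the second stage identifier are hypothetical, chosen for illustration only.

```python
CONDITION_COLUMNS = (
    "Anatomical entity ID", "Stage ID", "Cell type ID", "Sex", "Strain",
)


def count_distinct(libraries, columns):
    """Number of distinct value combinations of `columns` across library rows."""
    return len({tuple(library[col] for col in columns) for library in libraries})


def make_library(stage_id):
    """Build a hypothetical library row that varies only in its stage."""
    return {
        "Anatomical entity ID": "UBERON:0000922",
        "Stage ID": stage_id,
        "Cell type ID": "CL:0000352",
        "Sex": "NA",
        "Strain": "CD-1",
    }


# Three libraries, two of which share exactly the same annotations.
libraries = [
    make_library("MmusDv:0000014"),
    make_library("MmusDv:0000014"),
    make_library("MmusDv:0000015"),  # hypothetical second stage
]

condition_count = count_distinct(libraries, CONDITION_COLUMNS)
organ_stage_count = count_distinct(libraries, ("Anatomical entity ID", "Stage ID"))
organ_count = count_distinct(libraries, ("Anatomical entity ID",))
print(condition_count, organ_stage_count, organ_count)
```
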
The processed expression files (read counts, TPM, FPKM) can be retrieved per experiment for a given species. They are available through FTP, or through the download page by selecting the species of interest and then clicking on the button Download read counts, TPM, and FPKMs. When using the web page, all processed data for the species are downloaded. The data for each experiment are contained in a separate file named after the experiment identifier. Each experiment file includes the processed data of all samples from that experiment.
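
A per-experiment file can be streamed row by row without decompressing it to disk first. This is a minimal sketch that assumes the `.tsv.gz` file is a gzip-compressed, tab-separated file with a header row; `read_expression_rows` is a hypothetical helper, and the in-memory `demo` bytes stand in for a downloaded file with only a few of its columns.

```python
import csv
import gzip
import io


def read_expression_rows(source):
    """Yield one dict per (library, gene) row of a gzipped TSV.

    `source` may be a file path or a binary file object.
    """
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# In-memory stand-in for a downloaded per-experiment file.
demo = gzip.compress(
    b"Experiment ID\tLibrary ID\tGene ID\tTPM\n"
    b"SRP020490\tSRX259105\tENSMUSG00000000001\t54.538026\n"
)

rows = list(read_expression_rows(io.BytesIO(demo)))
print(rows[0]["Gene ID"], float(rows[0]["TPM"]))
```

Passing the path of a downloaded file instead of `io.BytesIO(demo)` works the same way, since `gzip.open` accepts both.
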
| Column | Content | Example |
|---|---|---|
| 1 | Experiment ID | SRP020490 |
| 2 | Library ID | SRX259105 |
| 3 | Library type | single |
| 4 | Gene ID | ENSMUSG00000000001 |
| 5 | Anatomical entity ID | UBERON:0000085 |
| 6 | Anatomical entity name | morula |
| 7 | Stage ID | MmusDv:0000006 |
| 8 | Stage name | Theiler stage 03 (mouse) |
| 9 | Cell type ID | CL:0000353 |
| 10 | Cell type name | blastoderm cell |
| 11 | Sex | NA |
| 12 | Strain | CAST_EiJ(mother)_x_C57BL_6J(father) |
| 13 | Read count | 2154 |
| 14 | TPM | 54.538026 |
| 15 | FPKM | 55.33224 |
| 16 | Rank | 2465 |
| 17 | Detection flag | present |
| 18 | pValue | 3.70353E-06 |
| 19 | State in Bgee | Part of a call |
The Experiment ID column provides the unique identifier per experiment.
The Library ID column provides the unique identifier per sample (where each sample is a unique cell) that belongs to an Experiment ID (column 1).
The Library type column indicates the read layout of the library. This can be single (single-end) or paired-end.
The Gene ID column provides the unique identifier of genes from Ensembl.
The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.
The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 5).
The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.
The Stage name column provides the name of the developmental stage defined by Stage ID (column 7).
The Cell type ID column provides the unique identifier of the cell type, from the Cell Ontology (CL identifiers, for example CL:0000353).
The Cell type name column provides the name of the cell type defined by Cell type ID (column 9).
The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').
The Strain column provides information about the genetic variant or subtype of an organism.
The Read count column provides the number of reads mapped to Gene ID (column 4) in the target Library ID (column 2).
The TPM column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2) in transcripts per million (TPM), a quantification measure normalized for sequencing depth and gene length.
The FPKM column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2) in fragments per kilobase of transcript per million mapped reads (FPKM), a quantification measure normalized for sequencing depth and gene length.
The Rank column provides the rank of a Gene ID (column 4) in a condition for a species. It is used to compute expression ranks and expression scores.
The Detection flag column provides an informative classification of a Gene ID (column 4). The flag is either present, which means that the gene is called actively expressed, or empty (NULL).
Note that in single-cell RNA-Seq full-length data, we do not call genes absent.
Genes are classified as present based on a cutoff applied to the pValue (column 18).
The pValue column provides the p-value used to test whether Gene ID (column 4) is actively expressed in an individual RNA-Seq Library ID (column 2).
For each individual Library ID (column 2) we map reads both to transcripts and to the reference intergenic regions, and compute TPM (column 14) per Gene ID (column 4) (summing over transcripts) and per intergenic region. Then for each Gene ID (column 4) in the Library ID (column 2), we compute a Z-score in terms of standard deviations from the mean of reference intergenic regions:
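$$
\mathrm{ZScore}_{\text{Gene ID}} = \frac{\log_2\left(\mathrm{TPM}_{\text{Gene ID}}\right) - \operatorname{mean}\left(\log_2\left(\mathrm{TPM}_{\text{RefIntergenic}}\right)\right)}{\operatorname{sd}\left(\log_2\left(\mathrm{TPM}_{\text{RefIntergenic}}\right)\right)}
$$
Here TPM refers to TPM (column 14) and Gene ID to Gene ID (column 4).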
Then, for each Gene ID (column 4) in the Library ID (column 2), we calculate a p-value under the null hypothesis that the gene is expressed at a level similar to the reference intergenic regions, which is modeled as a Normal distribution.
The library-specific TPM limit to call genes expressed is the minimum value of TPM where p-value ≤ α. In the download files, we used α = 0.05.
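
The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the Bgee pipeline: the p-value is taken as the one-sided upper tail of the standard Normal distribution, the sample standard deviation is used, and values with TPM = 0 are skipped because their log2 is undefined. All function names and the input values are hypothetical.

```python
import math
import statistics

ALPHA = 0.05  # cutoff stated for the download files


def zscore(tpm, intergenic_tpms):
    """Z-score of log2(tpm) relative to the log2 TPM of reference intergenic regions."""
    logs = [math.log2(value) for value in intergenic_tpms if value > 0]
    return (math.log2(tpm) - statistics.mean(logs)) / statistics.stdev(logs)


def p_value(z):
    """Upper-tail probability of a standard Normal (assumed one-sided test)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def tpm_threshold(gene_tpms, intergenic_tpms, alpha=ALPHA):
    """Smallest gene TPM whose p-value is <= alpha, or None if no gene passes."""
    passing = [
        tpm for tpm in gene_tpms
        if tpm > 0 and p_value(zscore(tpm, intergenic_tpms)) <= alpha
    ]
    return min(passing) if passing else None


# Hypothetical values for one library.
intergenic = [0.25, 0.5, 1.0, 2.0, 4.0]
genes = [0.0, 0.5, 4.0, 8.0, 54.538026]

for tpm in genes[1:]:
    print(tpm, p_value(zscore(tpm, intergenic)))
print("library TPM threshold:", tpm_threshold(genes, intergenic))
```

With these made-up inputs, the gene at TPM 4.0 does not pass the cutoff while the gene at TPM 8.0 does, so the threshold reported for this toy library is 8.0.
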
The State in Bgee column provides information about the usage of Gene ID (column 4) to make expression calls.
Three different labels can be retrieved in this column:
Part of a call: Gene ID (column 4) was used to make an informative expression call.
[...] Library ID (column 2). No calls will be generated for those Gene ID (column 4).
Result excluded, reason: absent call not reliable: the Library ID (column 2) does not allow Gene ID (column 4) to be considered for absent calls.

| Experiment ID | Library ID | Library type | Gene ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Cell type ID | Cell type name | Sex | Strain | Read count | TPM | FPKM | Rank | Detection flag | pValue | State in Bgee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRP020490 | SRX259105 | single | ENSMUSG00000000001 | UBERON:0000085 | morula | MmusDv:0000006 | Theiler stage 03 (mouse) | CL:0000353 | blastoderm cell | NA | CAST_EiJ(mother)_x_C57BL_6J(father) | 2154 | 54.538026 | 55.33224 | 2465 | present | 3.70353E-06 | Part of a call |
| SRP020490 | SRX259105 | single | ENSMUSG00000000003 | UBERON:0000085 | morula | MmusDv:0000006 | Theiler stage 03 (mouse) | CL:0000353 | blastoderm cell | NA | CAST_EiJ(mother)_x_C57BL_6J(father) | 0 | 0 | 0 | NA | | NA | Result excluded, reason: absent call not reliable |
| SRP020490 | SRX259105 | single | ENSMUSG00000000028 | UBERON:0000085 | morula | MmusDv:0000006 | Theiler stage 03 (mouse) | CL:0000353 | blastoderm cell | NA | CAST_EiJ(mother)_x_C57BL_6J(father) | 341.9999 | 16.191957 | 16.427753 | 4671 | present | 8.89015E-05 | Part of a call |
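
To keep only the results that contribute to expression calls, rows can be filtered on the State in Bgee column. The sketch below assumes each row is available as a dict keyed by the column names above (for example from `csv.DictReader`); `genes_part_of_call` is a hypothetical helper, and the rows reproduce only three columns of the example table.

```python
def genes_part_of_call(rows):
    """Group Gene IDs by Library ID, keeping rows whose state is 'Part of a call'."""
    by_library = {}
    for row in rows:
        if row["State in Bgee"] == "Part of a call":
            by_library.setdefault(row["Library ID"], []).append(row["Gene ID"])
    return by_library


example_rows = [
    {"Library ID": "SRX259105", "Gene ID": "ENSMUSG00000000001",
     "State in Bgee": "Part of a call"},
    {"Library ID": "SRX259105", "Gene ID": "ENSMUSG00000000003",
     "State in Bgee": "Result excluded, reason: absent call not reliable"},
    {"Library ID": "SRX259105", "Gene ID": "ENSMUSG00000000028",
     "State in Bgee": "Part of a call"},
]

kept = genes_part_of_call(example_rows)
print(kept)
```
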