Full-Length scRNA-seq Processed Expression Values Files

This documentation describes the format of annotation download files (library and experiment files) as well as the processed expression values download files for single-cell RNA-Seq data. The files can be found on the Bgee download page.

Annotation (experiments/libraries) files

The annotation download files consist of two files:

  1. library file: provides detailed information for each individual sample (where each sample is a unique cell), including anatomical entity, developmental stage, cell type, sex, strain, and the metrics used for quality control.

  2. experiment file: provides overall information about each experiment, including the number of libraries that belong to the experiment and the number of conditions, organs, stages, cell types, sexes, and strains.

Library file

File format and column descriptions

Column | Content | Example
1 | Experiment ID | ERP013381
2 | Library ID | ERX1226594
3 | Anatomical entity ID | UBERON:0000922
4 | Anatomical entity name | embryo
5 | Stage ID | MmusDv:0000014
6 | Stage name | Theiler stage 09 (mouse)
7 | Cell type ID | CL:0000352
8 | Cell type name | epiblast cell
9 | Sex | NA
10 | Strain | CD-1
11 | Expression mapped anatomical entity ID | UBERON:0000922
12 | Expression mapped anatomical entity name | embryo
13 | Expression mapped stage ID | MmusDv:0000014
14 | Expression mapped stage name | Theiler stage 09 (mouse)
15 | Expression mapped cell type ID | CL:0000352
16 | Expression mapped cell type name | epiblast cell
17 | Expression mapped sex | not annotated
18 | Expression mapped strain | CD-1
19 | Platform ID | Illumina HiSeq 2500
20 | Library type | single
21 | Library orientation | NA
22 | TPM expression threshold | 3.88442
23 | Read count | 3238518
24 | Mapped read count | 1467281
25 | Min. read length | 125
26 | Max. read length | 125
27 | All genes percent present | 13.79
28 | Protein coding genes percent present | 31.54
29 | Intergenic regions percent present | 1.55
30 | Distinct rank count | 10642
31 | Max rank in the expression mapped condition | NA
32 | Run IDs | NA
33 | Data source | SRA
34 | Data source URL | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226594
35 | Bgee normalized data URL | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz
36 | Raw file URL | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226594

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Library ID (column 2)

The Library ID column provides the unique identifier per sample (where each sample is a unique cell) that belongs to an Experiment ID (column 1).

Anatomical entity ID (column 3)

The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 4)

The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 3).

Stage ID (column 5)

The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.

Stage name (column 6)

The Stage name column provides the name of the developmental stage defined by Stage ID (column 5).

Cell type ID (column 7)

The Cell type ID column provides the unique identifier of the cell type, from the Cell Ontology (CL).

Cell type name (column 8)

The Cell type name column provides the name of the cell type defined by Cell type ID (column 7).

Sex (column 9)

The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').

Strain (column 10)

The Strain column provides information about the genetic variant or subtype of an organism.

Expression mapped anatomical entity ID (column 11)

The Expression mapped anatomical entity ID column provides the anatomical entity annotation used in the Bgee expression calls. It can differ from the Anatomical entity ID (column 3) when the original annotation is too granular to be inserted in the database.

Expression mapped anatomical entity name (column 12)

The Expression mapped anatomical entity name column provides the name of the anatomical entity defined by Expression mapped anatomical entity ID (column 11).

Expression mapped stage ID (column 13)

The Expression mapped stage ID column provides the developmental stage annotation used in the Bgee expression calls. It can differ from the Stage ID (column 5) when the original annotation is too granular to be inserted in the database.

Expression mapped stage name (column 14)

The Expression mapped stage name column provides the name of the developmental stage defined by Expression mapped stage ID (column 13).

Expression mapped cell type ID (column 15)

The Expression mapped cell type ID column provides the cell type annotation used in the Bgee expression calls. It can differ from the Cell type ID (column 7) when the original annotation is too granular to be inserted in the database.

Expression mapped cell type name (column 16)

The Expression mapped cell type name column provides the name of the cell type defined by Expression mapped cell type ID (column 15).

Expression mapped sex (column 17)

The Expression mapped sex column provides the sex information used in the Bgee expression calls ('any', 'male', 'female', 'hermaphrodite').

Expression mapped strain (column 18)

The Expression mapped strain column provides the genetic variant or subtype of an organism used in the Bgee expression calls.
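To see how the original and the expression mapped annotations relate, the following Python sketch lists, for each library, the annotation pairs whose values differ. It assumes the library file is tab-separated with the column names shown in the table above. The sample is trimmed to a few columns, and its second row (`LIB_HYPOTHETICAL`, `CL:HYPOTHETICAL`) is invented for illustration only.

```python
import csv
import io

# Pairs of (original column, expression mapped column) in the library file.
ANNOTATION_PAIRS = [
    ("Anatomical entity ID", "Expression mapped anatomical entity ID"),
    ("Stage ID", "Expression mapped stage ID"),
    ("Cell type ID", "Expression mapped cell type ID"),
]

# Trimmed sample (a real library file has 36 columns).
# The second row is hypothetical and only illustrates a remapped annotation.
HEADER = ["Library ID"] + [name for pair in ANNOTATION_PAIRS for name in pair]
ROWS = [
    ["ERX1226594", "UBERON:0000922", "UBERON:0000922",
     "MmusDv:0000014", "MmusDv:0000014", "CL:0000352", "CL:0000352"],
    ["LIB_HYPOTHETICAL", "UBERON:0000922", "UBERON:0000922",
     "MmusDv:0000014", "MmusDv:0000014", "CL:HYPOTHETICAL", "CL:0000352"],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER, *ROWS]) + "\n"


def remapped_annotations(tsv_text):
    """Map each Library ID to the (column, original, mapped) triples that differ."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    remapped = {}
    for row in reader:
        diffs = [
            (original, row[original], row[mapped])
            for original, mapped in ANNOTATION_PAIRS
            if row[original] != row[mapped]
        ]
        if diffs:
            remapped[row["Library ID"]] = diffs
    return remapped


result = remapped_annotations(SAMPLE_TSV)
print(result)
```

Libraries missing from the result use the same annotation for the download file and for the expression calls.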

Platform ID (column 19)

The Platform ID column provides the sequencing platform identifier.

Library type (column 20)

The Library type column provides the sequencing layout of the library, which can be single or paired-end.

Library orientation (column 21)

The Library orientation column provides the relative orientation of the reads.

TPM expression threshold (column 22)

The TPM expression threshold column provides the minimum TPM value at which a gene is called expressed in the Library ID (column 2).

Read count (column 23)

The Read count column provides the total number of reads in the library, that is, the reads submitted for mapping to the transcriptome.

Mapped read count (column 24)

The Mapped read count column provides the number of reads from Read count (column 23) that were successfully mapped.
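A common way to use these two columns together is to derive the fraction of reads that were mapped. The sketch below is only an illustration: the function name `mapped_fraction` is hypothetical, and this ratio is not a column of the library file. The values are taken from the example library ERX1226594.

```python
def mapped_fraction(read_count, mapped_read_count):
    """Fraction of reads in a library that were mapped (between 0 and 1)."""
    if read_count <= 0:
        raise ValueError("read_count must be positive")
    return mapped_read_count / read_count


# Read count (column 23) and Mapped read count (column 24) of ERX1226594.
fraction = mapped_fraction(3238518, 1467281)
print(f"{fraction:.1%}")
```

For this library, roughly 45% of the reads were mapped.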

Min. read length (column 25)

The Min. read length column provides the length, in base pairs (bp), of the shortest read in the library.

Max. read length (column 26)

The Max. read length column provides the length, in base pairs (bp), of the longest read in the library.

All genes percent present (column 27)

The All genes percent present column provides the percentage of genes called actively expressed in the Library ID (column 2).

Protein coding genes percent present (column 28)

The Protein coding genes percent present column provides the percentage of protein coding genes called actively expressed in the Library ID (column 2).

Intergenic regions percent present (column 29)

The Intergenic regions percent present column provides the percentage of intergenic regions called actively expressed in the Library ID (column 2).

Distinct rank count (column 30)

The Distinct rank count column provides the number of distinct ranks observed in the Library ID (column 2). It is used to weight the rank information coming from this library when computing expression ranks and expression scores.

Max rank in the expression mapped condition (column 31)

The Max rank in the expression mapped condition column provides the max rank over all libraries in this condition. It is used to normalize ranks between conditions when computing expression ranks and expression scores.

Run IDs (column 32)

The Run IDs column lists the sequencing runs associated with the Library ID (column 2).

Data source (column 33)

The Data source column provides the data repository from which the raw files were retrieved, that is, the repository hosting all Run IDs (column 32) of the Library ID (column 2).

Data source URL (column 34)

The Data source URL column provides the URL of the Library ID (column 2) in the data repository.

Bgee normalized data URL (column 35)

The Bgee normalized data URL column provides the URL of the processed data file in Bgee for the corresponding Experiment ID (column 1).

Raw file URL (column 36)

The Raw file URL column provides the URL of the SRA Run Selector, which gives access to the Run IDs (column 32) of the Library ID (column 2).

Example rows

Experiment ID | Library ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Cell type ID | Cell type name | Sex | Strain | Expression mapped anatomical entity ID | Expression mapped anatomical entity name | Expression mapped stage ID | Expression mapped stage name | Expression mapped cell type ID | Expression mapped cell type name | Expression mapped sex | Expression mapped strain | Platform ID | Library type | Library orientation | TPM expression threshold | Read count | Mapped read count | Min. read length | Max. read length | All genes percent present | Protein coding genes percent present | Intergenic regions percent present | Distinct rank count | Max rank in the expression mapped condition | Run IDs | Data source | Data source URL | Bgee normalized data URL | Raw file URL
ERP013381 | ERX1226594 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | NA | CD-1 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | not annotated | CD-1 | Illumina HiSeq 2500 | single | NA | 3.88442 | 3238518 | 1467281 | 125 | 125 | 13.79 | 31.54 | 1.55 | 10642 | NA | NA | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226594 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226594
ERP013381 | ERX1226595 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | NA | CD-1 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | not annotated | CD-1 | Illumina HiSeq 2500 | single | NA | 2.32718 | 3621774 | 2049490 | 125 | 125 | 15.34 | 34.81 | 1.28 | 11014 | NA | NA | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226595 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226595
ERP013381 | ERX1226596 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | NA | CD-1 | UBERON:0000922 | embryo | MmusDv:0000014 | Theiler stage 09 (mouse) | CL:0000352 | epiblast cell | not annotated | CD-1 | Illumina HiSeq 2500 | single | NA | 3.39165 | 3581718 | 1606871 | 125 | 125 | 13.11 | 29.75 | 1.17 | 9585 | NA | NA | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226596 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226596

Experiment file

File format and column descriptions

Column | Content | Example
1 | Experiment ID | ERP013381
2 | Experiment name | Mouse embryonic RNA-seq
3 | Library count | 1205
4 | Condition count | 4
5 | Organ-stage count | 4
6 | Organ count | 1
7 | Stage count | 4
8 | Cell-Type count | 2
9 | Sex count | 1
10 | Strain count | 1
11 | Data source | SRA
12 | Data source URL | https://www.ncbi.nlm.nih.gov/sra/ERP013381
13 | Bgee normalized data URL | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz
14 | Experiment description | The study was aimed at interrogating the early stages of blood cell development within the embryo...

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Experiment name (column 2)

The Experiment name column provides the title of the experiment identified by Experiment ID (column 1).

Library count (column 3)

The Library count column provides the total number of the libraries associated with the Experiment ID (column 1).

Condition count (column 4)

The Condition count column provides the number of distinct conditions in the experiment. In Bgee, a condition is a unique combination of anatomical entity, developmental stage, cell type, sex, and strain.

Organ-stage count (column 5)

The Organ-stage count column provides the total number of unique combinations of anatomical entity (see Organ count, column 6) and developmental stage (see Stage count, column 7) in the target Experiment ID (column 1).

Organ count (column 6)

The Organ count column provides the total number of distinct anatomical entities in the target Experiment ID (column 1).

Stage count (column 7)

The Stage count column provides the total number of developmental stages in the target Experiment ID (column 1).

Cell-Type count (column 8)

The Cell-Type count column provides the total number of cell types in the target Experiment ID (column 1).

Sex count (column 9)

The Sex count column provides the total number of sexes in the target Experiment ID (column 1).

Strain count (column 10)

The Strain count column provides the total number of genetic variants or subtypes in the target Experiment ID (column 1).
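The count columns above can be read as numbers of distinct values, or distinct combinations of values, across the libraries of one experiment. The Python sketch below reproduces them from library annotations under that reading. Whether Bgee counts the original or the expression mapped annotations is not stated here, so using the original columns is an assumption, and the three library records are illustrative rather than real file content.

```python
# Annotation columns that together define a condition (an assumption:
# the original annotation columns of the library file are used).
CONDITION_COLUMNS = ["Anatomical entity ID", "Stage ID", "Cell type ID", "Sex", "Strain"]


def count_distinct(libraries, columns):
    """Number of distinct value combinations for the given columns."""
    return len({tuple(lib[col] for col in columns) for lib in libraries})


def experiment_counts(libraries):
    """Summarize the libraries of one experiment like the experiment file does."""
    return {
        "Library count": len(libraries),
        "Condition count": count_distinct(libraries, CONDITION_COLUMNS),
        "Organ-stage count": count_distinct(libraries, ["Anatomical entity ID", "Stage ID"]),
        "Organ count": count_distinct(libraries, ["Anatomical entity ID"]),
        "Stage count": count_distinct(libraries, ["Stage ID"]),
        "Cell-Type count": count_distinct(libraries, ["Cell type ID"]),
        "Sex count": count_distinct(libraries, ["Sex"]),
        "Strain count": count_distinct(libraries, ["Strain"]),
    }


# Illustrative records: two libraries share a condition, one differs by stage.
base = {"Anatomical entity ID": "UBERON:0000922", "Stage ID": "MmusDv:0000014",
        "Cell type ID": "CL:0000352", "Sex": "NA", "Strain": "CD-1"}
libraries = [dict(base), dict(base), {**base, "Stage ID": "MmusDv:0000015"}]

counts = experiment_counts(libraries)
print(counts)
```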

Data source (column 11)

Data repository from where the raw files that belong to the Experiment ID (column 1) were extracted.

Data source URL (column 12)

The Data source URL column provides the URL of the Experiment ID (column 1) in the data repository.

Bgee normalized data URL (column 13)

The Bgee normalized data URL column provides the URL of the processed data file in Bgee for the corresponding Experiment ID (column 1).

Experiment description (column 14)

Description provided by the authors of the Experiment ID (column 1).

Example rows

Experiment ID | Experiment name | Library count | Condition count | Organ-stage count | Organ count | Stage count | Cell-Type count | Sex count | Strain count | Data source | Data source URL | Bgee normalized data URL | Experiment description
ERP013381 | Mouse embryonic RNA-seq | 1205 | 4 | 4 | 1 | 4 | 2 | 1 | 1 | SRA | https://www.ncbi.nlm.nih.gov/sra/ERP013381 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz | The study was aimed at interrogating the early stages of blood cell development within the embryo...
SRP020490 | Single-cell RNA-Seq reveals dynamic, random monoallelic gene expression in mammalian cells | 118 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | SRA | https://www.ncbi.nlm.nih.gov/sra/SRP020490 | https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_SRP020490.tsv.gz | In the diploid genome, genes come in two copies, which can have different DNA sequence and where one is maternal and one is paternal...

Processed expression (read counts, TPM, FPKM) files

The processed expression (read counts, TPM, FPKM) files can be retrieved per experiment for a specific species, accessed through FTP or through the download page by selecting the species of interest and then by clicking on the button Download read counts, TPM, and FPKMs. When using the web page, all processed data for the species are downloaded. The data for each experiment are contained in separate files named using the experiment identifier. Each experiment file includes all processed data of all samples from the experiment.
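As a minimal sketch of retrieving one experiment, the Python code below builds a per-experiment file URL and reads a gzip-compressed, tab-separated file into rows. The URL pattern is inferred from the example URLs in this documentation and may not hold for every species or release. The helper names `processed_file_url` and `read_processed_rows` are hypothetical, and the reader is shown on in-memory bytes rather than on a real download.

```python
import csv
import gzip
import io

# Base path taken from the example URLs in this documentation.
BASE_URL = "https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq"


def processed_file_url(species, experiment_id):
    """Build the per-experiment file URL (pattern inferred from the examples)."""
    file_name = f"{species}_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_{experiment_id}.tsv.gz"
    return f"{BASE_URL}/{species}/{file_name}"


def read_processed_rows(gz_bytes):
    """Yield one dict per row of a gzip-compressed TSV file with a header line."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


url = processed_file_url("Mus_musculus", "ERP013381")

# Tiny in-memory stand-in for a downloaded file (two of the 19 columns only).
sample_bytes = gzip.compress(b"Experiment ID\tLibrary ID\nSRP020490\tSRX259105\n")
rows = list(read_processed_rows(sample_bytes))
print(url)
print(rows)
```

In practice, the bytes passed to `read_processed_rows` would be the content of the file downloaded from the URL.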

File format and column descriptions

Column | Content | Example
1 | Experiment ID | SRP020490
2 | Library ID | SRX259105
3 | Library type | single
4 | Gene ID | ENSMUSG00000000001
5 | Anatomical entity ID | UBERON:0000085
6 | Anatomical entity name | morula
7 | Stage ID | MmusDv:0000006
8 | Stage name | Theiler stage 03 (mouse)
9 | Cell type ID | CL:0000353
10 | Cell type name | blastoderm cell
11 | Sex | NA
12 | Strain | CAST_EiJ(mother)_x_C57BL_6J(father)
13 | Read count | 2154
14 | TPM | 54.538026
15 | FPKM | 55.33224
16 | Rank | 2465
17 | Detection flag | present
18 | pValue | 3.70353E-06
19 | State in Bgee | Part of a call

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Library ID (column 2)

The Library ID column provides the unique identifier per sample (where each sample is a unique cell) that belongs to an Experiment ID (column 1).

Library type (column 3)

The Library type column provides the sequencing layout of the library, which can be single or paired-end.

Gene ID (column 4)

The Gene ID column provides the unique identifier of genes from Ensembl.

Anatomical entity ID (column 5)

The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 6)

The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 5).

Stage ID (column 7)

The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.

Stage name (column 8)

The Stage name column provides the name of the developmental stage defined by Stage ID (column 7).

Cell type ID (column 9)

The Cell type ID column provides the unique identifier of the cell type, from the Cell Ontology (CL).

Cell type name (column 10)

The Cell type name column provides the name of the cell type defined by Cell type ID (column 9).

Sex (column 11)

The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').

Strain (column 12)

The Strain column provides information about the genetic variant or subtype of an organism.

Read count (column 13)

The Read count column provides the number of reads mapped to the Gene ID (column 4) in the target Library ID (column 2).

TPM (column 14)

The TPM (transcripts per million) column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2), normalized for gene length and sequencing depth.

FPKM (column 15)

The FPKM (fragments per kilobase of transcript per million mapped reads) column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2), normalized for gene length and sequencing depth.
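To make the difference between the two measures concrete, the Python sketch below applies the standard textbook definitions of TPM and FPKM to toy counts. It is not the Bgee pipeline: the counts and lengths are invented, and quantification tools typically use effective transcript lengths rather than the plain lengths used here.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate * 1e6 / total_rate for rate in rates]


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_count = sum(counts)
    return [count * 1e9 / (length * total_count)
            for count, length in zip(counts, lengths_bp)]


# Toy library with three genes (invented values).
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1500]
tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)

# Within one library, TPM is FPKM rescaled so that the values sum to 1e6.
rescaled = [value * 1e6 / sum(fpkm_values) for value in fpkm_values]

# The same constant ratio shows in the two expressed genes of library SRX259105.
ratio_gene_1 = 54.538026 / 55.33224
ratio_gene_28 = 16.191957 / 16.427753
print(tpm_values, fpkm_values, ratio_gene_1, ratio_gene_28)
```

Because TPM and FPKM differ only by a library-wide factor under these definitions, the TPM/FPKM ratio is the same for every gene of a library, which is what the two example genes show.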

Rank (column 16)

The Rank column provides the rank of a Gene ID (column 4) in a condition for a species. It is used to compute expression ranks and expression scores.

Detection flag (column 17)

The Detection flag column provides the expression classification of a Gene ID (column 4). The flag is either present, which means that the gene is called actively expressed, or empty (NULL). Note that for single-cell RNA-Seq full-length data, we do not call genes absent. Genes are classified as present based on the pValue (column 18) cutoff.

pValue (column 18)

The p-value is a quantitative metric to detect if Gene ID (column 4) is actively expressed in any standalone RNA-Seq Library ID (column 2).

For each individual Library ID (column 2) we map reads both to transcripts and to the reference intergenic regions, and compute TPM (column 14) per Gene ID (column 4) (summing over transcripts) and per intergenic region. Then for each Gene ID (column 4) in the Library ID (column 2), we compute a Z-score in terms of standard deviations from the mean of reference intergenic regions:

                              log2(TPM (column 14)_{Gene ID (column 4)}) - mean(log2(TPM_{RefIntergenic}))
ZScore_{Gene ID (column 4)} = ----------------------------------------------------------------------------
                                                     sd(log2(TPM_{RefIntergenic}))

Then, for each Gene ID (column 4) in the Library ID (column 2), we calculate a p-value under the null hypothesis that the gene is expressed at a level similar to reference intergenic regions, with the null distribution modeled as a Normal distribution.

The library-specific TPM limit to call genes expressed is the minimum value of TPM where p-value ≤ α. In the download files, we used α = 0.05.
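The computation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Bgee implementation: the p-value is taken as the upper tail of a standard Normal distribution at the Z-score, the standard deviation is the sample standard deviation, genes with a TPM of 0 get no Z-score, and all TPM values and gene names are invented.

```python
import math
from statistics import NormalDist, mean, stdev


def zscore_pvalues(gene_tpm, ref_intergenic_tpm):
    """Return {gene_id: (z_score, p_value)}; genes with TPM 0 get (None, None)."""
    log_ref = [math.log2(value) for value in ref_intergenic_tpm if value > 0]
    mu, sigma = mean(log_ref), stdev(log_ref)  # sample sd: an assumption
    standard_normal = NormalDist()
    results = {}
    for gene_id, value in gene_tpm.items():
        if value <= 0:
            results[gene_id] = (None, None)  # log2(0) is undefined
            continue
        z = (math.log2(value) - mu) / sigma
        # One-sided upper-tail p-value (an assumption about the test used).
        results[gene_id] = (z, 1.0 - standard_normal.cdf(z))
    return results


def tpm_threshold(gene_tpm, results, alpha=0.05):
    """Minimum TPM among genes with p-value <= alpha, or None if there is none."""
    passing = [gene_tpm[gene_id] for gene_id, (_, p) in results.items()
               if p is not None and p <= alpha]
    return min(passing) if passing else None


# Invented values: five reference intergenic regions and four genes.
ref_intergenic_tpm = [0.5, 1.0, 2.0, 4.0, 8.0]
gene_tpm = {"gene_zero": 0.0, "gene_mid": 2.0, "gene_high": 32.0, "gene_top": 64.0}

results = zscore_pvalues(gene_tpm, ref_intergenic_tpm)
threshold = tpm_threshold(gene_tpm, results, alpha=0.05)
print(results, threshold)
```

Here `gene_mid` sits exactly at the intergenic mean, so its p-value is 0.5 and it is not called present, while the threshold is set by the lowest TPM that still passes α = 0.05.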

State in Bgee (column 19)

The State in Bgee column provides information about the usage of Gene ID (column 4) to make expression calls. Three different labels can be retrieved in this column:

  1. Part of a call --> The information from the Gene ID (column 4) was used to produce an informative expression call.
  2. Result excluded, reason: pre-filtering --> The Gene ID (column 4) was never observed as present in any Library ID (column 2) and was removed by pre-filtering. No calls are generated for such genes.
  3. Result excluded, reason: absent call not reliable --> The protocol used to generate the Library ID (column 2) does not allow the Gene ID (column 4) to be considered for absent calls.

Example rows

Experiment ID | Library ID | Library type | Gene ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Cell type ID | Cell type name | Sex | Strain | Read count | TPM | FPKM | Rank | Detection flag | pValue | State in Bgee
SRP020490 | SRX259105 | single | ENSMUSG00000000001 | UBERON:0000085 | morula | MmusDv:0000006 | Theiler stage 03 (mouse) | CL:0000353 | blastoderm cell | NA | CAST_EiJ(mother)_x_C57BL_6J(father) | 2154 | 54.538026 | 55.33224 | 2465 | present | 3.70353E-06 | Part of a call
SRP020490 | SRX259105 | single | ENSMUSG00000000003 | UBERON:0000085 | morula | MmusDv:0000006 | Theiler stage 03 (mouse) | CL:0000353 | blastoderm cell | NA | CAST_EiJ(mother)_x_C57BL_6J(father) | 0 | 0 | 0 | NA |  | NA | Result excluded, reason: absent call not reliable
SRP020490 | SRX259105 | single | ENSMUSG00000000028 | UBERON:0000085 | morula | MmusDv:0000006 | Theiler stage 03 (mouse) | CL:0000353 | blastoderm cell | NA | CAST_EiJ(mother)_x_C57BL_6J(father) | 341.9999 | 16.191957 | 16.427753 | 4671 | present | 8.89015E-05 | Part of a call
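
A typical first step with these files is to keep only the rows that Bgee used to make expression calls. The Python sketch below filters on the State in Bgee column and tallies the labels. It assumes a tab-separated file with the column names documented above; the sample is trimmed to three of the 19 columns, with values taken from the example rows.

```python
import csv
import io
from collections import Counter

# Trimmed sample: Gene ID, TPM and State in Bgee of the three example rows.
SAMPLE_TSV = (
    "Gene ID\tTPM\tState in Bgee\n"
    "ENSMUSG00000000001\t54.538026\tPart of a call\n"
    "ENSMUSG00000000003\t0\tResult excluded, reason: absent call not reliable\n"
    "ENSMUSG00000000028\t16.191957\tPart of a call\n"
)


def read_rows(tsv_text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def genes_used_in_calls(rows):
    """Map Gene ID to TPM for rows labelled 'Part of a call'."""
    return {row["Gene ID"]: float(row["TPM"])
            for row in rows if row["State in Bgee"] == "Part of a call"}


rows = read_rows(SAMPLE_TSV)
used = genes_used_in_calls(rows)
state_tally = Counter(row["State in Bgee"] for row in rows)
print(used)
print(state_tally)
```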