Full-Length scRNA-seq Processed Expression Values Files

This documentation describes the format of annotation download files (library and experiment files) as well as the processed expression values download files for single-cell RNA-Seq data. The files can be found on the Bgee download page.

Annotation (experiments/libraries) files
Processed expression (read counts, TPM, FPKM) files

Annotation (experiments/libraries) files

The annotation download files consist of two files:

library file: provides detailed information for each individual sample (where each sample is a unique cell), including anatomical entity, developmental stage, cell type, sex, strain, and the quality scores used for quality control.
experiment file: provides overall information about the experiment, including the number of libraries that belong to the experiment, and the number of conditions, organs, stages, cell types, and strains.

Library file

File format and column descriptions

Column	Content	Example
1	Experiment ID	ERP013381
2	Library ID	ERX1226594
3	Anatomical entity ID	UBERON:0000922
4	Anatomical entity name	embryo
5	Stage ID	MmusDv:0000014
6	Stage name	Theiler stage 09 (mouse)
7	Cell type ID	CL:0000352
8	Cell type name	epiblast cell
9	Sex	NA
10	Strain	CD-1
11	Expression mapped anatomical entity ID	UBERON:0000922
12	Expression mapped anatomical entity name	embryo
13	Expression mapped stage ID	MmusDv:0000014
14	Expression mapped stage name	Theiler stage 09 (mouse)
15	Expression mapped cell type ID	CL:0000352
16	Expression mapped cell type name	epiblast cell
17	Expression mapped sex	not annotated
18	Expression mapped strain	CD-1
19	Platform ID	Illumina HiSeq 2500
20	Library type	single
21	Library orientation	NA
22	TPM expression threshold	3.88442
23	Read count	3238518
24	Mapped read count	1467281
25	Min. read length	125
26	Max. read length	125
27	All genes percent present	13.79
28	Protein coding genes percent present	31.54
29	Intergenic regions percent present	1.55
30	Distinct rank count	10642
31	Max rank in the expression mapped condition	NA
32	Run IDs	NA
33	Data source	SRA
34	Data source URL	https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226594
35	Bgee normalized data URL	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz
36	Raw file URL	https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226594

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Library ID (column 2)

The Library ID column provides the unique identifier per sample (where each sample is a unique cell) that belongs to an Experiment ID (column 1).

Anatomical entity ID (column 3)

The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 4)

The anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 3).

Stage ID (column 5)

The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.

Stage name (column 6)

The Stage name column provides the name of the developmental stage defined by Stage ID (column 5).

Cell type ID (column 7)

The Cell type ID column provides the unique identifier of the cell type, from the Cell Ontology (CL).

Cell type name (column 8)

The Cell type name column provides the name of the cell type defined by Cell type ID (column 7).

Sex (column 9)

The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').

Strain (column 10)

The Strain column provides information about the genetic variant or subtype of an organism.

Expression mapped anatomical entity ID (column 11)

The Expression mapped anatomical entity ID column is the annotation used in the Bgee expression calls. It can differ from the Anatomical entity ID (column 3) when the annotated anatomical entity is too granular to be inserted in the database.

Expression mapped anatomical entity name (column 12)

The Expression mapped anatomical entity name column provides the name of the anatomical entity defined by Expression mapped anatomical entity ID (column 11).

Expression mapped stage ID (column 13)

The Expression mapped stage ID column is the annotation used in the Bgee expression calls. It can differ from the Stage ID (column 5) when the annotated stage is too granular to be inserted in the database.

Expression mapped stage name (column 14)

The Expression mapped stage name column provides the name of the developmental stage defined by Expression mapped stage ID (column 13).

Expression mapped cell type ID (column 15)

The Expression mapped cell type ID column is the annotation used in the Bgee expression calls. It can differ from the Cell type ID (column 7) when the annotated cell type is too granular to be inserted in the database.

Expression mapped cell type name (column 16)

The Expression mapped cell type name column provides the name of the cell type defined by Expression mapped cell type ID (column 15).

Expression mapped sex (column 17)

The Expression mapped sex column provides the sex information used in the Bgee expression calls ('any', 'male', 'female', 'hermaphrodite').

Expression mapped strain (column 18)

The Expression mapped strain column provides the genetic variant or subtype of an organism used in the Bgee expression calls.

Platform ID (column 19)

The Platform ID column provides the sequencing platform identifier.

Library type (column 20)

The Library type column indicates whether the library was sequenced with single-end or paired-end reads. The value can be single or paired-end.

Library orientation (column 21)

The Library orientation column provides the relative orientation of the reads.

TPM expression threshold (column 22)

The TPM expression threshold column provides the minimum TPM value to call expressed genes in the Library ID (column 2).

Read count (column 23)

The Read count column provides the total number of reads in the library, that is, the reads submitted for mapping to the transcriptome.

Mapped read count (column 24)

The Mapped read count column provides the number of reads, out of Read count (column 23), that mapped to the reference.

Min. read length (column 25)

The Min. read length column provides the length, in base pairs (bp), of the shortest read in the library.

Max. read length (column 26)

The Max. read length column provides the length, in base pairs (bp), of the longest read in the library.

All genes percent present (column 27)

The All genes percent present column provides the percentage of genes called actively expressed in the Library ID (column 2).

Protein coding genes percent present (column 28)

The Protein coding genes percent present column provides the percentage of protein coding genes called actively expressed in the Library ID (column 2).

Intergenic regions percent present (column 29)

The Intergenic regions percent present column provides the percentage of intergenic regions called actively expressed in the Library ID (column 2).

Distinct rank count (column 30)

The Distinct rank count column provides the number of distinct ranks in the Library ID (column 2). It is used to weight the rank information coming from this library when computing expression ranks and expression scores.

Max rank in the expression mapped condition (column 31)

The Max rank in the expression mapped condition column provides the max rank over all libraries in this condition. It is used to normalize ranks between conditions when computing expression ranks and expression scores.

Run IDs (column 32)

The Run IDs column lists the sequencing runs associated with the Library ID (column 2).

Data source (column 33)

Data repository from which the raw files were extracted, that is, the source of all Run IDs (column 32) corresponding to the Library ID (column 2).

Data source URL (column 34)

URL of the data repository page for the Library ID (column 2).

Bgee normalized data URL (column 35)

URL of the processed data in Bgee for the corresponding Experiment ID (column 1).

Raw file URL (column 36)

URL of the SRA Run Selector page for the Library ID (column 2). This page gives access to the Run IDs (column 32) of the library.

Example rows

Experiment ID	Library ID	Anatomical entity ID	Anatomical entity name	Stage ID	Stage name	Cell type ID	Cell type name	Sex	Strain	Expression mapped anatomical entity ID	Expression mapped anatomical entity name	Expression mapped stage ID	Expression mapped stage name	Expression mapped cell type ID	Expression mapped cell type name	Expression mapped sex	Expression mapped strain	Platform ID	Library type	Library orientation	TPM expression threshold	Read count	Mapped read count	Min. read length	Max. read length	All genes percent present	Protein coding genes percent present	Intergenic regions percent present	Distinct rank count	Max rank in the expression mapped condition	Run IDs	Data source	Data source URL	Bgee normalized data URL	Raw file URL
ERP013381	ERX1226594	UBERON:0000922	embryo	MmusDv:0000014	Theiler stage 09 (mouse)	CL:0000352	epiblast cell	NA	CD-1	UBERON:0000922	embryo	MmusDv:0000014	Theiler stage 09 (mouse)	CL:0000352	epiblast cell	not annotated	CD-1	Illumina HiSeq 2500	single	NA	3.88442	3238518	1467281	125	125	13.79	31.54	1.55	10642	NA	NA	SRA	https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226594	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz	https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226594
ERP013381	ERX1226595	UBERON:0000922	embryo	MmusDv:0000014	Theiler stage 09 (mouse)	CL:0000352	epiblast cell	NA	CD-1	UBERON:0000922	embryo	MmusDv:0000014	Theiler stage 09 (mouse)	CL:0000352	epiblast cell	not annotated	CD-1	Illumina HiSeq 2500	single	NA	2.32718	3621774	2049490	125	125	15.34	34.81	1.28	11014	NA	NA	SRA	https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226595	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz	https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226595
ERP013381	ERX1226596	UBERON:0000922	embryo	MmusDv:0000014	Theiler stage 09 (mouse)	CL:0000352	epiblast cell	NA	CD-1	UBERON:0000922	embryo	MmusDv:0000014	Theiler stage 09 (mouse)	CL:0000352	epiblast cell	not annotated	CD-1	Illumina HiSeq 2500	single	NA	3.39165	3581718	1606871	125	125	13.11	29.75	1.17	9585	NA	NA	SRA	https://www.ncbi.nlm.nih.gov/sra/?term=ERX1226596	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz	https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX1226596
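As a minimal sketch of how the library file could be read, the Python snippet below parses tab-separated rows with the standard library, turns NA into None, and converts a few numeric columns. The function name parse_library_rows and the abridged sample (only six of the 36 columns) are illustrative assumptions, not part of Bgee.

```python
import csv
import io

# Columns converted to numbers; names come from the header row documented above.
INT_COLUMNS = ("Read count", "Mapped read count", "Distinct rank count")
FLOAT_COLUMNS = ("TPM expression threshold", "All genes percent present")


def parse_library_rows(handle):
    """Yield one dict per library, mapping 'NA' to None and numbers to int/float."""
    for row in csv.DictReader(handle, delimiter="\t"):
        record = {key: (None if value == "NA" else value) for key, value in row.items()}
        for name in INT_COLUMNS:
            if record.get(name) is not None:
                record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            if record.get(name) is not None:
                record[name] = float(record[name])
        yield record


# Abridged sample (a subset of the columns) so the sketch runs on its own.
SAMPLE = (
    "Experiment ID\tLibrary ID\tSex\tTPM expression threshold\tRead count\tMapped read count\n"
    "ERP013381\tERX1226594\tNA\t3.88442\t3238518\t1467281\n"
    "ERP013381\tERX1226595\tNA\t2.32718\t3621774\t2049490\n"
)

libraries = list(parse_library_rows(io.StringIO(SAMPLE)))
for lib in libraries:
    # Fraction of reads that were mapped, derived from columns 23 and 24.
    mapped_fraction = lib["Mapped read count"] / lib["Read count"]
    print(lib["Library ID"], f"{mapped_fraction:.3f}")
```

With a real download, the io.StringIO object would be replaced by an open file handle on the library file.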

Experiment file

File format and column descriptions

Column	Content	Example
1	Experiment ID	ERP013381
2	Experiment name	Mouse embryonic RNA-seq
3	Library count	1205
4	Condition count	4
5	Organ-stage count	4
6	Organ count	1
7	Stage count	4
8	Cell-Type count	2
9	Sex count	1
10	Strain count	1
11	Data source	SRA
12	Data source URL	https://www.ncbi.nlm.nih.gov/sra/ERP013381
13	Bgee normalized data URL	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz
14	Experiment description	The study was aimed at interrogating the early stages of blood cell development within the embryo...

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Experiment name (column 2)

The Experiment name column provides the title of the experiment identified by Experiment ID (column 1).

Library count (column 3)

The Library count column provides the total number of the libraries associated with the Experiment ID (column 1).

Condition count (column 4)

The Condition count column provides the number of distinct conditions in the target Experiment ID (column 1). In Bgee, a condition is a unique combination of anatomical entity, developmental stage, cell type, sex, and strain.

Organ-stage count (column 5)

The Organ-stage count column provides the total number of unique combinations of anatomical entity (see Organ count, column 6) and developmental stage (see Stage count, column 7) in the target Experiment ID (column 1).

Organ count (column 6)

The Organ count column provides the total number of distinct anatomical entity IDs in the target Experiment ID (column 1).

Stage count (column 7)

The Stage count column provides the total number of developmental stages in the target Experiment ID (column 1).

Cell-Type count (column 8)

The Cell-Type count column provides the total number of cell types in the target Experiment ID (column 1).

Sex count (column 9)

The Sex count column provides the total number of sexes in the target Experiment ID (column 1).

Strain count (column 10)

The Strain count column provides the total number of genetic variants or subtypes in the target Experiment ID (column 1).
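The count columns above can be illustrated with a short Python sketch that counts distinct value combinations over library rows. Which library columns Bgee actually uses to define a condition (raw annotations or the expression mapped ones) is an assumption here, and the rows are made-up illustrative values rather than data from a Bgee file.

```python
def count_distinct(rows, columns):
    """Return the number of distinct value combinations of `columns` across `rows`."""
    return len({tuple(row[name] for name in columns) for row in rows})


# Assumption: a condition is defined by these five library-file columns.
CONDITION_COLUMNS = ("Anatomical entity ID", "Stage ID", "Cell type ID", "Sex", "Strain")

# Illustrative rows only; the second stage ID is a placeholder value.
base = {
    "Anatomical entity ID": "UBERON:0000922",
    "Stage ID": "MmusDv:0000014",
    "Cell type ID": "CL:0000352",
    "Sex": "NA",
    "Strain": "CD-1",
}
rows = [dict(base), dict(base), dict(base, **{"Stage ID": "STAGE:PLACEHOLDER"})]

condition_count = count_distinct(rows, CONDITION_COLUMNS)
organ_stage_count = count_distinct(rows, ("Anatomical entity ID", "Stage ID"))
organ_count = count_distinct(rows, ("Anatomical entity ID",))
stage_count = count_distinct(rows, ("Stage ID",))
print(condition_count, organ_stage_count, organ_count, stage_count)
```

Using one generic helper for every count keeps the definition of "distinct combination" identical across the Condition, Organ-stage, Organ, and Stage counts.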

Data source (column 11)

Data repository from where the raw files that belong to the Experiment ID (column 1) were extracted.

Data source URL (column 12)

URL of the data repository page for the Experiment ID (column 1).

Bgee normalized data URL (column 13)

URL of the processed data in Bgee for the corresponding Experiment ID (column 1).

Experiment description (column 14)

Description provided by the authors of the Experiment ID (column 1).

Example rows

Experiment ID	Experiment name	Library count	Condition count	Organ-stage count	Organ count	Stage count	Cell-Type count	Sex count	Strain count	Data source	Data source URL	Bgee normalized data URL	Experiment description
ERP013381	Mouse embryonic RNA-seq	1205	4	4	1	4	2	1	1	SRA	https://www.ncbi.nlm.nih.gov/sra/ERP013381	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_ERP013381.tsv.gz	The study was aimed at interrogating the early stages of blood cell development within the embryo...
SRP020490	Single-cell RNA-Seq reveals dynamic, random monoallelic gene expression in mammalian cells	118	2	2	2	2	2	2	1	SRA	https://www.ncbi.nlm.nih.gov/sra/SRP020490	https://bgee.org/ftp/current/download/processed_expr_values/sc_rnaseq/Mus_musculus/Mus_musculus_Full-Length_SC_RNA-Seq_read_counts_TPM_FPKM_SRP020490.tsv.gz	In the diploid genome, genes come in two copies, which can have different DNA sequence and where one is maternal and one is paternal...
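As a hedged illustration of how the two annotation files relate, the Python sketch below compares the Library count declared in experiment rows with the number of library rows that share the same Experiment ID. The function name and the EXP_A/EXP_B identifiers are hypothetical, and the sketch assumes both files cover the same set of libraries.

```python
from collections import Counter


def library_count_mismatches(experiments, libraries):
    """Return {experiment_id: (declared, observed)} for experiments whose counts disagree."""
    observed = Counter(lib["Experiment ID"] for lib in libraries)
    mismatches = {}
    for exp in experiments:
        exp_id = exp["Experiment ID"]
        declared = int(exp["Library count"])
        if declared != observed[exp_id]:
            mismatches[exp_id] = (declared, observed[exp_id])
    return mismatches


# Hypothetical rows: EXP_A is consistent, EXP_B declares 3 libraries but lists 1.
experiments = [
    {"Experiment ID": "EXP_A", "Library count": "2"},
    {"Experiment ID": "EXP_B", "Library count": "3"},
]
libraries = [
    {"Experiment ID": "EXP_A", "Library ID": "LIB_1"},
    {"Experiment ID": "EXP_A", "Library ID": "LIB_2"},
    {"Experiment ID": "EXP_B", "Library ID": "LIB_3"},
]

mismatches = library_count_mismatches(experiments, libraries)
print(mismatches)
```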

Processed expression (read counts, TPM, FPKM) files

The processed expression (read counts, TPM, FPKM) files can be retrieved per experiment for a given species, either through FTP or through the download page: select the species of interest, then click the button Download read counts, TPM, and FPKMs. The web page downloads all processed data for the species at once. The data for each experiment are stored in a separate file named after the experiment identifier, and each file contains the processed data of all samples from that experiment.
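The per-experiment files referenced in the annotation tables end in .tsv.gz, so one possible way to read them is to stream the gzip-compressed, tab-separated content with the Python standard library. The sketch below writes a tiny stand-in file first so that it runs without network access. The file name, the reduced column set, and the function name iter_expression_rows are assumptions for illustration.

```python
import csv
import gzip
import os
import tempfile


def iter_expression_rows(path):
    """Stream rows of a gzip-compressed, tab-separated processed expression file."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


# Build a small stand-in file (a subset of the 19 columns) to keep the sketch runnable.
header = ["Experiment ID", "Library ID", "Gene ID", "TPM"]
rows = [["SRP020490", "SRX259105", "ENSMUSG00000000001", "54.538026"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_SRP020490.tsv.gz")  # hypothetical file name
    with gzip.open(path, mode="wt", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
    # Consume the generator while the temporary file still exists.
    loaded = list(iter_expression_rows(path))

print(loaded[0]["Gene ID"], loaded[0]["TPM"])
```

Streaming row by row avoids loading a whole experiment into memory, which can matter because each file holds one row per gene per library.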

File format and column descriptions

Column	Content	Example
1	Experiment ID	SRP020490
2	Library ID	SRX259105
3	Library type	single
4	Gene ID	ENSMUSG00000000001
5	Anatomical entity ID	UBERON:0000085
6	Anatomical entity name	morula
7	Stage ID	MmusDv:0000006
8	Stage name	Theiler stage 03 (mouse)
9	Cell type ID	CL:0000353
10	Cell type name	blastoderm cell
11	Sex	NA
12	Strain	CAST_EiJ(mother)_x_C57BL_6J(father)
13	Read count	2154
14	TPM	54.538026
15	FPKM	55.33224
16	Rank	2465
17	Detection flag	present
18	pValue	3.70353E-06
19	State in Bgee	Part of a call

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Library ID (column 2)

The Library ID column provides the unique identifier per sample (where each sample is a unique cell) that belongs to an Experiment ID (column 1).

Library type (column 3)

The Library type column indicates whether the library was sequenced with single-end or paired-end reads. The value can be single or paired-end.

Gene ID (column 4)

The Gene ID column provides the unique identifier of genes from Ensembl.

Anatomical entity ID (column 5)

The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 6)

The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 5).

Stage ID (column 7)

The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.

Stage name (column 8)

The Stage name column provides the name of the developmental stage defined by Stage ID (column 7).

Cell type ID (column 9)

The Cell type ID column provides the unique identifier of the cell type, from the Cell Ontology (CL).

Cell type name (column 10)

The Cell type name column provides the name of the cell type defined by Cell type ID (column 9).

Sex (column 11)

The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').

Strain (column 12)

The Strain column provides information about the genetic variant or subtype of an organism.

Read count (column 13)

The Read count column provides the number of reads from the target Library ID (column 2) that were assigned to the Gene ID (column 4) when mapping to the transcriptome. Values can be non-integer, as in the example rows.

TPM (column 14)

The TPM (transcripts per million) column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2), normalized for sequencing depth and gene length.

FPKM (column 15)

The FPKM (fragments per kilobase of transcript per million mapped fragments) column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2), normalized for sequencing depth and gene length.
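To make the difference between the two normalized measures concrete, the Python sketch below applies the textbook definitions of TPM and FPKM to made-up counts and gene lengths. It is not a description of the Bgee pipeline, whose values come from its own quantification step, and the function names and input numbers are illustrative assumptions.

```python
def tpm(counts, lengths_bp):
    """Textbook TPM: length-normalized rates rescaled to sum to one million."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1e6 for rate in rates]


def fpkm(counts, lengths_bp):
    """Textbook FPKM: fragments per kilobase of gene per million mapped fragments."""
    total_fragments = sum(counts)
    return [count * 1e9 / (length * total_fragments)
            for count, length in zip(counts, lengths_bp)]


# Made-up example: three genes in one library.
counts = [100, 200, 700]
lengths_bp = [1000, 2000, 1000]

tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)
print([round(v, 1) for v in tpm_values])
print([round(v, 1) for v in fpkm_values])
```

Under these definitions TPM values always sum to one million within a library, whereas the sum of FPKM values depends on the length distribution of the expressed genes, which is why TPM is often preferred for comparing libraries.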

Rank (column 16)

The Rank column provides the rank of a Gene ID (column 4) in a condition for a species. It is used to compute expression ranks and expression scores.

Detection flag (column 17)

The Detection flag column classifies the Gene ID (column 4) in the library. The flag is either present, meaning that the gene is called actively expressed, or empty (NULL). Note that for full-length single-cell RNA-Seq data we do not call genes absent. Genes are classified as present based on a cutoff on pValue (column 18).

pValue (column 18)

The p-value is a quantitative metric to detect if Gene ID (column 4) is actively expressed in any standalone RNA-Seq Library ID (column 2).

For each individual Library ID (column 2) we map reads both to transcripts and to the reference intergenic regions, and compute TPM (column 14) per Gene ID (column 4) (summing over transcripts) and per intergenic region. Then for each Gene ID (column 4) in the Library ID (column 2), we compute a Z-score in terms of standard deviations from the mean of reference intergenic regions:

$$
\mathrm{ZScore}_{g} = \frac{\log_2(\mathrm{TPM}_{g}) - \mathrm{mean}\left(\log_2(\mathrm{TPM}_{\mathrm{RefIntergenic}})\right)}{\mathrm{sd}\left(\log_2(\mathrm{TPM}_{\mathrm{RefIntergenic}})\right)}
$$

Here $g$ denotes the Gene ID (column 4) and $\mathrm{TPM}_{g}$ its TPM (column 14).

Then, for each Gene ID (column 4) in the Library ID (column 2), we calculate a p-value under the null hypothesis that the gene is expressed at a level similar to the reference intergenic regions, whose distribution is estimated as a Normal distribution.

The library-specific TPM limit to call genes expressed is the minimum value of TPM where p-value ≤ α. In the download files, we used α = 0.05.
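The Z-score, p-value, and library-specific TPM threshold described above can be sketched in Python as follows. Several choices are assumptions not stated in this documentation: the p-value is taken as the one-sided upper tail of a standard Normal distribution, the standard deviation is the sample standard deviation, and values with TPM of 0 are skipped because log2 is undefined for them. The function names and intergenic TPM values are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev


def intergenic_reference(intergenic_tpms):
    """Return mean and sd of log2(TPM) over reference intergenic regions (TPM > 0 only)."""
    logs = [math.log2(value) for value in intergenic_tpms if value > 0]
    return mean(logs), stdev(logs)


def gene_p_value(gene_tpm, ref_mean, ref_sd):
    """One-sided p-value that the gene's log2(TPM) exceeds the intergenic background."""
    if gene_tpm <= 0:
        return None  # assumption: no p-value when log2(TPM) is undefined
    z_score = (math.log2(gene_tpm) - ref_mean) / ref_sd
    return 1.0 - NormalDist().cdf(z_score)


def tpm_threshold(ref_mean, ref_sd, alpha=0.05):
    """Smallest TPM whose one-sided p-value is <= alpha under the Normal null."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return 2.0 ** (ref_mean + z_alpha * ref_sd)


# Made-up intergenic TPM values; their log2 values are -2, -1, 0, 1, 2.
ref_mean, ref_sd = intergenic_reference([0.25, 0.5, 1.0, 2.0, 4.0])
threshold = tpm_threshold(ref_mean, ref_sd, alpha=0.05)

for gene_tpm in (1.0, threshold, 54.538026):
    p_value = gene_p_value(gene_tpm, ref_mean, ref_sd)
    flag = "present" if p_value <= 0.05 + 1e-12 else ""
    print(f"TPM={gene_tpm:.3f} p={p_value:.4g} flag={flag!r}")
```

Because the Z-score increases monotonically with TPM, a cutoff on the p-value is equivalent to a cutoff on TPM, which is why a single TPM threshold can be reported per library.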

State in Bgee (column 19)

The State in Bgee column provides information about the usage of Gene ID (column 4) to make expression calls. Three different labels can be retrieved in this column:

Part of a call --> The information from the Gene ID (column 4) was used to produce an informative expression call.
Result excluded, reason: pre-filtering --> The Gene ID (column 4) was never observed as present in any Library ID (column 2) and was removed by pre-filtering. No calls are generated for such genes.
Result excluded, reason: absent call not reliable --> The protocol used to generate the Library ID (column 2) does not allow the Gene ID (column 4) to be considered for absent calls.

Example rows

Experiment ID	Library ID	Library type	Gene ID	Anatomical entity ID	Anatomical entity name	Stage ID	Stage name	Cell type ID	Cell type name	Sex	Strain	Read count	TPM	FPKM	Rank	Detection flag	pValue	State in Bgee
SRP020490	SRX259105	single	ENSMUSG00000000001	UBERON:0000085	morula	MmusDv:0000006	Theiler stage 03 (mouse)	CL:0000353	blastoderm cell	NA	CAST_EiJ(mother)_x_C57BL_6J(father)	2154	54.538026	55.33224	2465	present	3.70353E-06	Part of a call
SRP020490	SRX259105	single	ENSMUSG00000000003	UBERON:0000085	morula	MmusDv:0000006	Theiler stage 03 (mouse)	CL:0000353	blastoderm cell	NA	CAST_EiJ(mother)_x_C57BL_6J(father)	0	0	0	NA		NA	Result excluded, reason: absent call not reliable
SRP020490	SRX259105	single	ENSMUSG00000000028	UBERON:0000085	morula	MmusDv:0000006	Theiler stage 03 (mouse)	CL:0000353	blastoderm cell	NA	CAST_EiJ(mother)_x_C57BL_6J(father)	341.9999	16.191957	16.427753	4671	present	8.89015E-05	Part of a call
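A common first step with these rows is to keep only genes that are flagged present and that contributed to a call. The Python sketch below does this on an abridged copy of the example rows above (only six of the 19 columns are kept). The function name present_genes_in_calls is an illustrative assumption.

```python
import csv
import io

# Abridged copy of the example rows: a subset of the columns, same values.
SAMPLE = (
    "Library ID\tGene ID\tTPM\tDetection flag\tpValue\tState in Bgee\n"
    "SRX259105\tENSMUSG00000000001\t54.538026\tpresent\t3.70353E-06\tPart of a call\n"
    "SRX259105\tENSMUSG00000000003\t0\t\tNA\tResult excluded, reason: absent call not reliable\n"
    "SRX259105\tENSMUSG00000000028\t16.191957\tpresent\t8.89015E-05\tPart of a call\n"
)


def present_genes_in_calls(handle):
    """Return Gene IDs flagged 'present' whose result was used as part of a call."""
    return [
        row["Gene ID"]
        for row in csv.DictReader(handle, delimiter="\t")
        if row["Detection flag"] == "present" and row["State in Bgee"] == "Part of a call"
    ]


selected = present_genes_in_calls(io.StringIO(SAMPLE))
print(selected)
```

Note that the Detection flag is an empty string for the excluded row, so comparing against "present" is enough to skip it without special handling of NULL values.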