RNA-seq Processed Expression Values Files

This documentation describes the format of annotation download files (library and experiment files) as well as the processed expression values download files for RNA-Seq data. The files can be found on the Bgee download page.

Annotation (experiments/libraries) files

The annotation download files consist of two files:

  1. library file: provides detailed information for each individual sample (library), including anatomical entity, developmental stage, sex, strain, and the metrics used for quality control.

  2. experiment file: provides summary information about each experiment, including the number of libraries that belong to the experiment and the number of conditions, organs, stages, sexes, and strains.

Library file

File format and column descriptions

Column | Content | Example
--- | --- | ---
1 | Experiment ID | GSE44612
2 | Library ID | SRX091570
3 | Anatomical entity ID | UBERON:0000079
4 | Anatomical entity name | male reproductive system
5 | Stage ID | DsimDv:0000007
6 | Stage name | days 5-8 of fully formed stage (Drosophila simulans)
7 | Sex | male
8 | Strain | (DSSC) 14021-0251.199
9 | Expression mapped anatomical entity ID | UBERON:0000079
10 | Expression mapped anatomical entity name | male reproductive system
11 | Expression mapped stage ID | DsimDv:0000007
12 | Expression mapped stage name | days 5-8 of fully formed stage (Drosophila simulans)
13 | Expression mapped sex | male
14 | Expression mapped strain | (DSSC) 14021-0251.199
15 | Platform ID | Illumina Genome Analyzer II
16 | Protocol | polyA
17 | Library type | paired
18 | Library orientation | NA
19 | TMM normalization factor | 0.831812
20 | TPM expression threshold | 0.410944
21 | Read count | 27021668
22 | Mapped read count | 11538462
23 | Min. read length | 101
24 | Max. read length | 101
25 | All genes percent present | 80.49
26 | Protein coding genes percent present | 82.26
27 | Intergenic regions percent present | 3.53
28 | Distinct rank count | 13871
29 | Max rank in the expression mapped condition | NA
30 | Run IDs | SRR330571
31 | Data source | SRA
32 | Data source URL | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091570
33 | Bgee normalized data URL | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz
34 | Raw file URL | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091570

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Library ID (column 2)

The Library ID column provides the unique identifier per sample that belongs to an Experiment ID (column 1).

Anatomical entity ID (column 3)

The Anatomical entity ID column provides a unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 4)

The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 3).

Stage ID (column 5)

The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.

Stage name (column 6)

The Stage name column provides the name of the developmental stage defined by Stage ID (column 5).

Sex (column 7)

The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').

Strain (column 8)

The Strain column provides information about the genetic variant or subtype of an organism.

Expression mapped anatomical entity ID (column 9)

The Expression mapped anatomical entity ID column provides the anatomical entity annotation used in the Bgee expression calls. It can differ from the Anatomical entity ID (column 3) when the original annotation is too granular to be inserted in the database.

Expression mapped anatomical entity name (column 10)

The Expression mapped anatomical entity name column provides the name of the anatomical entity defined by Expression mapped anatomical entity ID (column 9).

Expression mapped stage ID (column 11)

The Expression mapped stage ID column provides the developmental stage annotation used in the Bgee expression calls. It can differ from the Stage ID (column 5) when the original annotation is too granular to be inserted in the database.

Expression mapped stage name (column 12)

The Expression mapped stage name column provides the name of the developmental stage defined by Expression mapped stage ID (column 11).

Expression mapped sex (column 13)

The Expression mapped sex column provides the sex information used in the Bgee expression calls ('any', 'male', 'female', 'hermaphrodite').

Expression mapped strain (column 14)

The Expression mapped strain column provides the strain information (genetic variant or subtype of an organism) used in the Bgee expression calls.

Platform ID (column 15)

The Platform ID column provides the sequencing platform identifier.

Protocol (column 16)

The Protocol column provides the RNA-sequencing protocol used for library construction. At present, four types of protocol are annotated: circRNA, lncRNA, miRNA, and polyA.

Library type (column 17)

The Library type column provides the read layout of the library: single-end (single) or paired-end (paired).

Library orientation (column 18)

The Library orientation column provides the relative orientation of the reads.

TMM normalization factor (column 19)

The TMM normalization factor column provides the normalization factor estimated for the library with the TMM (trimmed mean of M-values) method, which accounts for differences in relative RNA production levels between samples. The scale factors are estimated between the libraries that belong to the same Experiment ID (column 1).

TPM expression threshold (column 20)

The TPM expression threshold column provides the minimum TPM value at which a gene is called expressed in the Library ID (column 2).

Read count (column 21)

The Read count column provides the total number of reads in the library that are submitted for mapping to the transcriptome.

Mapped read count (column 22)

The Mapped read count column provides the number of reads that were mapped.

Min. read length (column 23)

The Min. read length column provides the length, in base pairs (bp), of the shortest read in the library.

Max. read length (column 24)

The Max. read length column provides the length, in base pairs (bp), of the longest read in the library.

All genes percent present (column 25)

The All genes percent present column provides the percentage of genes called actively expressed in the Library ID (column 2).

Protein coding genes percent present (column 26)

The Protein coding genes percent present column provides the percentage of protein coding genes called actively expressed in the Library ID (column 2).

Intergenic regions percent present (column 27)

The Intergenic regions percent present column provides the percentage of intergenic regions called actively expressed in the Library ID (column 2).

Distinct rank count (column 28)

The Distinct rank count column provides the number of distinct ranks in the Library ID (column 2). It is used to weight the rank information coming from this library when computing expression ranks and expression scores.

Max rank in the expression mapped condition (column 29)

The Max rank in the expression mapped condition column provides the max rank over all libraries in this condition. It is used to normalize ranks between conditions when computing expression ranks and expression scores.

Run IDs (column 30)

The Run IDs column provides the identifiers of the sequencing runs associated with the Library ID (column 2).

Data source (column 31)

The Data source column provides the data repository from which the raw files were retrieved, that is, the repository holding all Run IDs (column 30) that correspond to the Library ID (column 2).

Data source URL (column 32)

The Data source URL column provides the URL of the Library ID (column 2) in the data repository.

Bgee normalized data URL (column 33)

The Bgee normalized data URL column provides the URL of the processed data in Bgee for the corresponding Experiment ID (column 1).

Raw file URL (column 34)

The Raw file URL column provides the URL of the SRA Run Selector, which gives access to the Run IDs (column 30) of the Library ID (column 2).

Example rows

Experiment ID | Library ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Sex | Strain | Expression mapped anatomical entity ID | Expression mapped anatomical entity name | Expression mapped stage ID | Expression mapped stage name | Expression mapped sex | Expression mapped strain | Platform ID | Protocol | Library type | Library orientation | TMM normalization factor | TPM expression threshold | Read count | Mapped read count | Min. read length | Max. read length | All genes percent present | Protein coding genes percent present | Intergenic regions percent present | Distinct rank count | Max rank in the expression mapped condition | Run IDs | Data source | Data source URL | Bgee normalized data URL | Raw file URL
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
GSE44612 | SRX091570 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | Illumina Genome Analyzer II | polyA | paired | NA | 0.831812 | 0.410944 | 27021668 | 11538462 | 101 | 101 | 80.49 | 82.26 | 3.53 | 13871 | NA | SRR330571 | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091570 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091570
GSE44612 | SRX091571 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | Illumina Genome Analyzer II | polyA | paired | NA | 0.974193 | 0.228134 | 25107578 | 14585590 | 101 | 101 | 61.09 | 63.55 | 2.49 | 11546 | NA | SRR330572 | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091571 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091571
GSE44612 | SRX091572 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | Illumina Genome Analyzer II | polyA | paired | NA | 0.877587 | 0.414407 | 20281880 | 13357213 | 101 | 101 | 81.3 | 83.24 | 3.12 | 13954 | NA | SRR330573 | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091572 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091572
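
As an illustration of how the library file can be used, the following sketch parses tab-separated text and computes the fraction of mapped reads per library from Read count (column 21) and Mapped read count (column 22). It assumes that the header names in the downloaded file match the column names documented above. The sample text is an abridged, three-column excerpt built from the example rows, and mapped_fraction is a hypothetical helper name, not part of any Bgee tool.

```python
import csv
import io

# Abridged excerpt of the library file: only three of the 34 columns are kept.
# Header names are assumed to match the column names in this documentation.
SAMPLE = (
    "Library ID\tRead count\tMapped read count\n"
    "SRX091570\t27021668\t11538462\n"
    "SRX091571\t25107578\t14585590\n"
)


def mapped_fraction(tsv_text):
    """Return {library ID: mapped reads / total reads} for each row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["Library ID"]: int(row["Mapped read count"]) / int(row["Read count"])
        for row in reader
    }


fractions = mapped_fraction(SAMPLE)
for library_id, fraction in fractions.items():
    print(f"{library_id}: {fraction:.1%} of reads mapped")
```

For a file on disk, the same reader can be fed an open text file handle instead of the io.StringIO wrapper.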

Experiment file

File format and column descriptions

Column | Content | Example
--- | --- | ---
1 | Experiment ID | GSE44612
2 | Experiment name | Comparative Validation of the D. melanogaster Encyclopedia of DNA Elements Transcript Models
3 | Library count | 13
4 | Condition count | 6
5 | Organ-stage count | 3
6 | Organ count | 3
7 | Stage count | 1
8 | Sex count | 2
9 | Strain count | 3
10 | Data source | GEO
11 | Data source URL | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44612
12 | Bgee normalized data URL | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz
13 | Experiment description | 

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Experiment name (column 2)

The Experiment name column provides the title of the experiment identified by Experiment ID (column 1).

Library count (column 3)

The Library count column provides the total number of the libraries associated with the Experiment ID (column 1).

Condition count (column 4)

The Condition count column provides the number of distinct conditions in the target Experiment ID (column 1). In Bgee, a condition is a unique combination of anatomical entity, developmental stage, sex, and strain.

Organ-stage count (column 5)

The Organ-stage count column provides the number of unique combinations of anatomical entity (see Organ count, column 6) and developmental stage (see Stage count, column 7) in the target Experiment ID (column 1).

Organ count (column 6)

The Organ count column provides the number of distinct anatomical entity IDs in the target Experiment ID (column 1).

Stage count (column 7)

The Stage count column provides the number of distinct developmental stages in the target Experiment ID (column 1).

Sex count (column 8)

The Sex count column provides the number of distinct sexes in the target Experiment ID (column 1).

Strain count (column 9)

The Strain count column provides the number of distinct genetic variants or subtypes (strains) in the target Experiment ID (column 1).
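
The count columns described above (columns 3 to 9) can be pictured as set sizes over the library annotations of one experiment. The sketch below is only an illustration under stated assumptions: the Library tuple, the experiment_counts helper, and the annotation values (including the strain names) are hypothetical, and this documentation does not say whether Bgee counts the original or the expression mapped annotations.

```python
from collections import namedtuple

# Hypothetical, minimal view of one library's annotation.
Library = namedtuple("Library", "anat_id stage_id sex strain")

# Made-up annotations for a single experiment (not real Bgee data).
libraries = [
    Library("UBERON:0000079", "DsimDv:0000007", "male", "strainA"),
    Library("UBERON:0000079", "DsimDv:0000007", "male", "strainB"),
    Library("UBERON:0000922", "DsimDv:0000007", "female", "strainA"),
]


def experiment_counts(libs):
    """Count libraries and distinct annotation combinations."""
    return {
        "Library count": len(libs),
        # A condition is a unique (organ, stage, sex, strain) combination.
        "Condition count": len({(l.anat_id, l.stage_id, l.sex, l.strain) for l in libs}),
        "Organ-stage count": len({(l.anat_id, l.stage_id) for l in libs}),
        "Organ count": len({l.anat_id for l in libs}),
        "Stage count": len({l.stage_id for l in libs}),
        "Sex count": len({l.sex for l in libs}),
        "Strain count": len({l.strain for l in libs}),
    }


counts = experiment_counts(libraries)
for name, value in counts.items():
    print(f"{name}: {value}")
```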

Data source (column 10)

The Data source column provides the data repository from which the raw files that belong to the Experiment ID (column 1) were retrieved.

Data source URL (column 11)

The Data source URL column provides the URL of the Experiment ID (column 1) in the data repository.

Bgee normalized data URL (column 12)

The Bgee normalized data URL column provides the URL of the processed data in Bgee for the corresponding Experiment ID (column 1).

Experiment description (column 13)

Description provided by the authors of the Experiment ID (column 1).

Example rows

Experiment ID | Experiment name | Library count | Condition count | Organ-stage count | Organ count | Stage count | Sex count | Strain count | Data source | Data source URL | Bgee normalized data URL | Experiment description
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
GSE44612 | Comparative Validation of the D. melanogaster Encyclopedia of DNA Elements Transcript Models | 13 | 6 | 3 | 3 | 1 | 2 | 3 | GEO | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44612 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | 
SRP099257 | Pervasive epigenetic effects of Drosophila euchromatic transposable elements impact their evolution [RNA-seq] | 2 | 1 | 1 | 1 | 1 | 1 | 1 | SRA | https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP099257 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_SRP099257.tsv.gz | We study the relatively unexplored evolutionary consequences of the epigenetic effects of transposable elements (TEs)....

Processed expression (read counts, TPM, FPKM) files

The processed expression (read counts, TPM, FPKM) files are provided per experiment for each species. They can be retrieved through FTP, or through the download page by selecting the species of interest and clicking on the button Download read counts, TPM, and FPKMs. When using the web page, all processed data for the species are downloaded. The data for each experiment are contained in a separate file named after the experiment identifier, and each file includes the processed data of all samples from that experiment.
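
The Bgee normalized data URL values shown in this documentation follow a regular pattern, which the sketch below reproduces. The pattern is inferred only from those example URLs, so treat it as an assumption rather than a documented interface; processed_values_url is a hypothetical helper name, and the URL stored in the annotation files remains the reliable source.

```python
# Base path taken from the example URLs in this documentation.
BASE = "https://bgee.org/ftp/current/download/processed_expr_values/rna_seq"


def processed_values_url(species, experiment_id):
    """Build the per-experiment file URL, following the pattern seen in the examples."""
    species_dir = species.replace(" ", "_")
    file_name = f"{species_dir}_RNA-Seq_read_counts_TPM_FPKM_{experiment_id}.tsv.gz"
    return f"{BASE}/{species_dir}/{file_name}"


url = processed_values_url("Drosophila simulans", "GSE44612")
print(url)
```

The downloaded file is gzip-compressed; in Python it can be read as text with gzip.open(path, "rt") and passed to csv.DictReader with delimiter="\t".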

File format and column descriptions

Column | Content | Example
--- | --- | ---
1 | Experiment ID | SRP099257
2 | Library ID | SRX2548614
3 | Library type | paired
4 | Gene ID | FBgn0012823
5 | Anatomical entity ID | UBERON:0000922
6 | Anatomical entity name | embryo
7 | Stage ID | UBERON:0000068
8 | Stage name | embryo stage
9 | Sex | NA
10 | Strain | W501
11 | Read count | 4
12 | TPM | 0.178537
13 | FPKM | 0.159188
14 | Rank | 10528
15 | Detection flag | absent
16 | pValue | 0.13514812
17 | State in Bgee | Part of a call

Experiment ID (column 1)

The Experiment ID column provides the unique identifier per experiment.

Library ID (column 2)

The Library ID column provides the unique identifier per sample that belongs to an Experiment ID (column 1).

Library type (column 3)

The Library type column provides the read layout of the library: single-end (single) or paired-end (paired).

Gene ID (column 4)

The Gene ID column provides the unique identifier of genes from Ensembl.

Anatomical entity ID (column 5)

The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.

Anatomical entity name (column 6)

The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 5).

Stage ID (column 7)

The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.

Stage name (column 8)

The Stage name column provides the name of the developmental stage defined by Stage ID (column 7).

Sex (column 9)

The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').

Strain (column 10)

The Strain column provides information about the genetic variant or subtype of an organism.

Read count (column 11)

The Read count column provides the number of reads from the target Library ID (column 2) that were mapped to the Gene ID (column 4).

TPM (column 12)

The TPM (transcripts per million) column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2), normalized for gene length and sequencing depth.

FPKM (column 13)

The FPKM (fragments per kilobase of transcript per million mapped fragments) column provides the expression level of Gene ID (column 4) in a specific Library ID (column 2), normalized for gene length and sequencing depth.
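
To make the difference between the two measures concrete, the sketch below applies the textbook definitions of FPKM and TPM to made-up counts. It is not a description of the Bgee pipeline: the gene names, counts, and lengths are hypothetical, and real quantification tools often use effective transcript lengths rather than the raw lengths used here.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(counts.values()) / 1e6
    return {
        gene: counts[gene] / (lengths_bp[gene] / 1e3) / total_millions
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    return {gene: rate / total_rate * 1e6 for gene, rate in rates.items()}


# Hypothetical counts and gene lengths for a three-gene library.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
for gene in counts:
    print(f"{gene}: FPKM={fpkm_values[gene]:.1f} TPM={tpm_values[gene]:.1f}")
```

In this toy library geneA and geneB receive the same TPM, because geneB has three times the reads but is also three times as long. TPM values always sum to one million within a library, which is not true of FPKM in general.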

Rank (column 14)

The Rank column provides the rank of a Gene ID (column 4) in a condition for a species. It is used to compute expression ranks and expression scores.

Detection flag (column 15)

The Detection flag column classifies each Gene ID (column 4) as present or absent. The flag present means that the gene is actively expressed, and absent means that the gene is not actively expressed. The classification is based on a cutoff applied to pValue (column 16).

pValue (column 16)

The p-value is a quantitative metric to detect if Gene ID (column 4) is actively expressed in any standalone RNA-Seq Library ID (column 2).

For each individual Library ID (column 2) we map reads both to transcripts and to the reference intergenic regions, and compute TPM (column 12) per Gene ID (column 4) (summing over transcripts) and per intergenic region. Then for each Gene ID (column 4) in the Library ID (column 2), we compute a Z-score in terms of standard deviations from the mean of reference intergenic regions:

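$$
\mathrm{ZScore}_{\mathrm{gene}} = \frac{\log_2(\mathrm{TPM}_{\mathrm{gene}}) - \mathrm{mean}(\log_2(\mathrm{TPM}_{\mathrm{RefIntergenic}}))}{\mathrm{sd}(\log_2(\mathrm{TPM}_{\mathrm{RefIntergenic}}))}
$$

where gene is the Gene ID (column 4), TPM_gene is its TPM (column 12), and TPM_RefIntergenic are the TPM values of the reference intergenic regions.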

Then, for each Gene ID (column 4) in the Library ID (column 2), we calculate a p-value under the null hypothesis that the gene is expressed at a level similar to the reference intergenic regions, whose log2(TPM) values are modeled as a Normal distribution.

The library-specific TPM threshold above which genes are called expressed is the minimum TPM value for which p-value ≤ α. In the download files, we used α = 0.05.
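
The Z-score, p-value, and TPM threshold described above can be sketched as follows. This is a minimal illustration under several assumptions that the text does not settle: the p-value is taken as the one-sided upper tail of a standard Normal distribution, the sample standard deviation is used, intergenic regions with a TPM of 0 are skipped, and a gene with a TPM of 0 is given a p-value of 1 (which agrees with the example rows). The function names and the intergenic TPM values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def intergenic_log2_stats(intergenic_tpms):
    """Mean and standard deviation of log2(TPM) over reference intergenic regions."""
    logs = [math.log2(t) for t in intergenic_tpms if t > 0]  # assumption: skip zeros
    return mean(logs), stdev(logs)


def expression_p_value(gene_tpm, mu, sigma):
    """One-sided p-value that the gene is expressed above the intergenic level."""
    if gene_tpm <= 0:
        return 1.0  # assumption: log2(0) is undefined, treat as not expressed
    z_score = (math.log2(gene_tpm) - mu) / sigma
    return 1.0 - NormalDist().cdf(z_score)


def tpm_threshold(mu, sigma, alpha=0.05):
    """Smallest TPM whose p-value is at most alpha."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return 2.0 ** (mu + z_alpha * sigma)


# Made-up reference intergenic TPM values for one library.
reference = [0.5, 1.0, 2.0, 4.0, 8.0]
mu, sigma = intergenic_log2_stats(reference)

for gene_tpm in (0.0, 2.0, 1024.0):
    p = expression_p_value(gene_tpm, mu, sigma)
    flag = "present" if p <= 0.05 else "absent"
    print(f"TPM={gene_tpm}: pValue={p:.3g} ({flag})")
print(f"TPM threshold at alpha=0.05: {tpm_threshold(mu, sigma):.3f}")
```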

State in Bgee (column 17)

The State in Bgee column indicates whether the result for the Gene ID (column 4) was used to make expression calls. Three different labels can appear in this column:

  1. Part of a call: the information from the Gene ID (column 4) was used to make an informative expression call.
  2. Result excluded, reason: pre-filtering: the Gene ID (column 4) was never observed as present in any Library ID (column 2) and was removed by pre-filtering. No calls are generated for such genes.
  3. Result excluded, reason: absent call not reliable: the protocol used to generate the Library ID (column 2) does not allow the Gene ID (column 4) to be considered absent, given its gene biotype.

Biotypes excluded from absent calls

RNASeqProtocol | biotypes_excluded_for_absent_calls
--- | ---
polyA | 3prime_overlapping_ncRNA,antisense,antisense_RNA,bidirectional_promoter_lncRNA,lincRNA,lncRNA,macro_lncRNA,miRNA,misc_RNA,Mt_tRNA,ncRNA,other,piRNA,pre_miRNA,processed_transcript,ribozyme,TEC,rRNA,rRNA_pseudogene,Mt_rRNA,snoRNA,snRNA,sRNA,sense_intronic,sense_overlapping,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,Y_RNA,scaRNA,scRNA,vault_RNA
lncRNA | IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_D_pseudogene,IG_J_gene,IG_J_pseudogene,IG_LV_gene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,miRNA,misc_RNA,Mt_rRNA,Mt_tRNA,ncRNA,other,piRNA,polymorphic_pseudogene,pre_miRNA,processed_pseudogene,protein_coding,pseudogene,ribozyme,rRNA,rRNA_pseudogene,scaRNA,scRNA,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vault_RNA,Y_RNA
miRNA | 3prime_overlapping_ncRNA,antisense,antisense_RNA,bidirectional_promoter_lncRNA,IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_D_pseudogene,IG_J_gene,IG_J_pseudogene,IG_LV_gene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,lincRNA,lncRNA,macro_lncRNA,misc_RNA,Mt_rRNA,Mt_tRNA,ncRNA,other,piRNA,polymorphic_pseudogene,processed_pseudogene,processed_transcript,protein_coding,pseudogene,ribozyme,rRNA,rRNA_pseudogene,scaRNA,scRNA,sense_intronic,sense_overlapping,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vault_RNA,Y_RNA
circRNA | 3prime_overlapping_ncRNA,antisense,antisense_RNA,bidirectional_promoter_lncRNA,IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_D_pseudogene,IG_J_gene,IG_J_pseudogene,IG_LV_gene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,lincRNA,lncRNA,macro_lncRNA,miRNA,misc_RNA,Mt_rRNA,Mt_tRNA,ncRNA,other,piRNA,polymorphic_pseudogene,pre_miRNA,processed_pseudogene,processed_transcript,protein_coding,pseudogene,ribozyme,rRNA,rRNA_pseudogene,scaRNA,scRNA,sense_intronic,sense_overlapping,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vault_RNA,Y_RNA
ribo-minus | Mt_rRNA,rRNA,rRNA_pseudogene
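
The three State in Bgee labels and the biotype table can be read together as a small decision rule, sketched below. This is a hypothetical reconstruction, not the Bgee implementation: the state_in_bgee function, its ever_present argument, and the order of the checks are assumptions. Only the ribo-minus row of the table is encoded, to keep the example short.

```python
# Biotypes for which an absent call is not reliable, per protocol.
# Only the ribo-minus row of the table above is reproduced here.
EXCLUDED_FOR_ABSENT = {
    "ribo-minus": {"Mt_rRNA", "rRNA", "rRNA_pseudogene"},
}


def state_in_bgee(protocol, biotype, detection_flag, ever_present):
    """Return the State in Bgee label for one gene in one library.

    ever_present: whether the gene was observed as present in at least
    one library (hypothetical input used to model pre-filtering).
    """
    if not ever_present:
        return "Result excluded, reason: pre-filtering"
    excluded = EXCLUDED_FOR_ABSENT.get(protocol, set())
    if detection_flag == "absent" and biotype in excluded:
        return "Result excluded, reason: absent call not reliable"
    return "Part of a call"


print(state_in_bgee("ribo-minus", "rRNA", "absent", ever_present=True))
print(state_in_bgee("ribo-minus", "protein_coding", "absent", ever_present=True))
print(state_in_bgee("ribo-minus", "rRNA", "present", ever_present=True))
```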

Example rows

Experiment ID | Library ID | Library type | Gene ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Sex | Strain | Read count | TPM | FPKM | Rank | Detection flag | pValue | State in Bgee
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
SRP099257 | SRX2548614 | paired | FBgn0012820 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 0 | 0 | 0 | 13263 | absent | 1 | Part of a call
SRP099257 | SRX2548614 | paired | FBgn0012821 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 0 | 0 | 0 | 13263 | absent | 1 | Part of a call
SRP099257 | SRX2548614 | paired | FBgn0012823 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 4 | 0.178537 | 0.159188 | 10528 | absent | 0.13514812 | Part of a call
SRP099257 | SRX2548614 | paired | FBgn0012824 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 274.911 | 16.049365 | 14.310003 | 5174 | present | 1.81226E-05 | Part of a call
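
A typical use of the processed expression file is to keep only the genes that are called present and that contribute to a call. The sketch below does this on an abridged excerpt of the example rows, keeping four of the 17 columns. It assumes that the header names in the downloaded file match the column names documented above; present_genes is a hypothetical helper name.

```python
import csv
import io

# Abridged excerpt of a processed expression file (4 of the 17 columns).
SAMPLE = (
    "Gene ID\tTPM\tDetection flag\tState in Bgee\n"
    "FBgn0012820\t0\tabsent\tPart of a call\n"
    "FBgn0012821\t0\tabsent\tPart of a call\n"
    "FBgn0012823\t0.178537\tabsent\tPart of a call\n"
    "FBgn0012824\t16.049365\tpresent\tPart of a call\n"
)


def present_genes(tsv_text):
    """Return (all rows, rows flagged present that are part of a call)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    kept = [
        row for row in rows
        if row["Detection flag"] == "present"
        and row["State in Bgee"] == "Part of a call"
    ]
    return rows, kept


rows, kept = present_genes(SAMPLE)
percent_present = 100.0 * len(kept) / len(rows)
print([row["Gene ID"] for row in kept])
print(f"{percent_present:.2f}% of genes present in this excerpt")
```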