This documentation describes the format of annotation download files (library and experiment files) as well as the processed expression values download files for RNA-Seq data. The files can be found on the Bgee download page.
The annotation download files are divided into 2 main files:
library file: provides detailed information for each individual sample, including anatomical entity, developmental stages, sex, strain, and quality scores used in quality control metrics.
experiment file: provides overall information about the experiment, including the number of libraries that belong to the experiment, and the number of conditions, organs, stages, and strains.
The Experiment ID column provides the unique identifier per experiment.
The Library ID column provides the unique identifier per sample that belongs to an Experiment ID (column 1).
The Anatomical entity ID column provides a unique identifier of the anatomical entity, from the Uberon ontology.
The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 3).
The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.
The Stage name column provides the name of the developmental stage defined by Stage ID (column 5).
The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').
The Strain column provides information about the genetic variant or subtype of an organism.
The Expression mapped anatomical entity ID column is the anatomical entity annotation used in the Bgee expression calls. It can differ from the Anatomical entity ID (column 3) when the original annotation is too granular to be inserted in the database.
The Expression mapped anatomical entity name column provides the name of the anatomical entity defined by Expression mapped anatomical entity ID (column 9).
The Expression mapped stage ID column is the developmental stage annotation used in the Bgee expression calls. It can differ from the Stage ID (column 5) when the original annotation is too granular to be inserted in the database.
The Expression mapped stage name column provides the name of the developmental stage defined by Expression mapped stage ID (column 11).
The Expression mapped sex column provides the sex information used in the Bgee expression calls ('any', 'male', 'female', 'hermaphrodite').
The Expression mapped strain column provides the strain information (genetic variant or subtype of an organism) used in the Bgee expression calls.
The Platform ID column provides the sequencing platform identifier.
The Protocol column provides information about the RNA-sequencing protocol used for library construction. For the moment four different types of protocols are annotated: circRNA, lncRNA, miRNA, and polyA.
The Library type column indicates the read layout of the library. This can be single or paired-end.
The Library orientation column provides the relative orientation of the reads.
The TMM normalization factor column provides the estimated normalization factor for the relative RNA production levels from the RNA-seq data. Using the TMM method, we estimate scale factors between samples, that is, between libraries that belong to the same target Experiment ID (column 1).
The TPM expression threshold column provides the minimum TPM value to call expressed genes in the Library ID (column 2).
The Read count column provides the total number of reads to be mapped to the transcriptome.
The Mapped read count column provides the number of reads that map to (overlap) a genomic position.
The Min. read length column provides the minimum number of base pairs (bp) sequenced from a DNA fragment.
The Max. read length column provides the maximum number of base pairs (bp) sequenced from a DNA fragment.
The All genes percent present column provides information about the proportion of genes called actively expressed in the Library ID (column 2).
The Protein coding genes percent present column provides information about the proportion of protein coding genes called actively expressed in the Library ID (column 2).
The Intergenic regions percent present column provides information about the proportion of intergenic regions called actively expressed in the Library ID (column 2).
The Distinct rank count column provides information about unique rank counts in the Library ID (column 2). It is used to weigh the rank information coming from this library when computing expression ranks and expression scores.
The Max rank in the expression mapped condition column provides the max rank over all libraries in this condition. It is used to normalize ranks between conditions when computing expression ranks and expression scores.
The Run IDs column lists the sequencing run(s) associated with a Library ID (column 2).
The Data source column provides the data repository from which the raw files were extracted. The repository holds all Run IDs (column 30) corresponding to a target Library ID (column 2).
The Data source URL column provides the URL of the data repository entry where the Library ID (column 2) is located.
The Bgee normalized data URL column provides the URL where the processed data for the corresponding Experiment ID (column 1) are located in Bgee.
The Raw file URL column provides the URL to the SRA Run Selector. This gives access to the Run IDs (column 30) through the Library ID (column 2).
| Experiment ID | Library ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Sex | Strain | Expression mapped anatomical entity ID | Expression mapped anatomical entity name | Expression mapped stage ID | Expression mapped stage name | Expression mapped sex | Expression mapped strain | Platform ID | Protocol | Library type | Library orientation | TMM normalization factor | TPM expression threshold | Read count | Mapped read count | Min. read length | Max. read length | All genes percent present | Protein coding genes percent present | Intergenic regions percent present | Distinct rank count | Max rank in the expression mapped condition | Run IDs | Data source | Data source URL | Bgee normalized data URL | Raw file URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GSE44612 | SRX091570 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | Illumina Genome Analyzer II | polyA | paired | NA | 0.831812 | 0.410944 | 27021668 | 11538462 | 101 | 101 | 80.49 | 82.26 | 3.53 | 13871 | NA | SRR330571 | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091570 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091570 |
| GSE44612 | SRX091571 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | Illumina Genome Analyzer II | polyA | paired | NA | 0.974193 | 0.228134 | 25107578 | 14585590 | 101 | 101 | 61.09 | 63.55 | 2.49 | 11546 | NA | SRR330572 | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091571 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091571 |
| GSE44612 | SRX091572 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | UBERON:0000079 | male reproductive system | DsimDv:0000007 | days 5-8 of fully formed stage (Drosophila simulans) | male | (DSSC) 14021-0251.199 | Illumina Genome Analyzer II | polyA | paired | NA | 0.877587 | 0.414407 | 20281880 | 13357213 | 101 | 101 | 81.3 | 83.24 | 3.12 | 13954 | NA | SRR330573 | SRA | https://www.ncbi.nlm.nih.gov/sra/?term=SRX091572 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRX091572 |
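
As a minimal sketch of how the library file might be consumed, the snippet below parses tab-separated text with Python's `csv` module and derives, for each library, the fraction of reads that were mapped (Mapped read count divided by Read count). The inline sample reproduces only four columns of the rows above. The helper name `mapped_fraction` is hypothetical, and the snippet assumes that the downloaded file is tab-separated and uses the header names shown above.

```python
import csv
import io

# Excerpt of the example rows above, limited to four columns.
# Assumption: the real file is tab-separated and uses these header names.
SAMPLE = (
    "Experiment ID\tLibrary ID\tRead count\tMapped read count\n"
    "GSE44612\tSRX091570\t27021668\t11538462\n"
    "GSE44612\tSRX091571\t25107578\t14585590\n"
    "GSE44612\tSRX091572\t20281880\t13357213\n"
)


def mapped_fraction(handle):
    """Return {Library ID: Mapped read count / Read count} (hypothetical helper)."""
    fractions = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        total = int(row["Read count"])
        mapped = int(row["Mapped read count"])
        # Guard against a zero read count to avoid a ZeroDivisionError.
        fractions[row["Library ID"]] = mapped / total if total else float("nan")
    return fractions


fractions = mapped_fraction(io.StringIO(SAMPLE))
for library_id, fraction in fractions.items():
    print(f"{library_id}\t{fraction:.3f}")
```

The same function accepts any text file handle, so a downloaded file could be passed in place of the `io.StringIO` wrapper.
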
| Column | Content | Example |
|---|---|---|
| 1 | Experiment ID | GSE44612 |
| 2 | Experiment name | Comparative Validation of the D. melanogaster Encyclopedia of DNA Elements Transcript Models |
| 3 | Library count | 13 |
| 4 | Condition count | 6 |
| 5 | Organ-stage count | 3 |
| 6 | Organ count | 3 |
| 7 | Stage count | 1 |
| 8 | Sex count | 2 |
| 9 | Strain count | 3 |
| 10 | Data source | GEO |
| 11 | Data source URL | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44612 |
| 12 | Bgee normalized data URL | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz |
| 13 | Experiment description | |
The Experiment ID column provides the unique identifier per experiment.
The Experiment name column provides the title of the experiment identified by Experiment ID (column 1).
The Library count column provides the total number of the libraries associated with the Experiment ID (column 1).
The Condition count column provides the number of unique conditions in the target Experiment ID (column 1). In Bgee, a condition is a unique combination of anatomical entity, developmental stage, sex, and strain.
The Organ-stage count column provides the total number of unique combinations of anatomical entity IDs (see Organ count, column 6) and developmental stages (see Stage count, column 7) in the target Experiment ID (column 1).
The Organ count column provides the total number of anatomical entity ids in the target Experiment ID (column 1).
The Stage count column provides the total number of developmental stages in the target Experiment ID (column 1).
The Sex count column provides the total number of sexes in the target Experiment ID (column 1).
The Strain count column provides the total number of genetic variants or sub-types in the target Experiment ID (column 1).
The Data source column provides the data repository from which the raw files that belong to the Experiment ID (column 1) were extracted.
The Data source URL column provides the URL of the data repository entry where the Experiment ID (column 1) is located.
The Bgee normalized data URL column provides the URL where the processed data for the corresponding Experiment ID (column 1) are located in Bgee.
The Experiment description column provides the description given by the authors of the Experiment ID (column 1).
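The count columns described above can be illustrated with a small sketch that summarises a list of library annotations. The records and their keys (`organ`, `stage`, `sex`, `strain`) are hypothetical toy values, not Bgee data, and the function name `experiment_counts` is ours. Whether Bgee counts the annotated values or the expression mapped values is not stated here, so the sketch simply counts whatever values it is given.

```python
def experiment_counts(libraries):
    """Summarise library annotations into experiment-level counts.

    `libraries` is a list of dicts with the hypothetical keys
    'organ', 'stage', 'sex', and 'strain'.
    """
    return {
        "Library count": len(libraries),
        # A condition is a unique (organ, stage, sex, strain) combination.
        "Condition count": len(
            {(l["organ"], l["stage"], l["sex"], l["strain"]) for l in libraries}
        ),
        "Organ-stage count": len({(l["organ"], l["stage"]) for l in libraries}),
        "Organ count": len({l["organ"] for l in libraries}),
        "Stage count": len({l["stage"] for l in libraries}),
        "Sex count": len({l["sex"] for l in libraries}),
        "Strain count": len({l["strain"] for l in libraries}),
    }


# Toy records for illustration only (not taken from any Bgee file).
toy_libraries = [
    {"organ": "organ_A", "stage": "stage_1", "sex": "male", "strain": "strain_X"},
    {"organ": "organ_A", "stage": "stage_1", "sex": "female", "strain": "strain_X"},
    {"organ": "organ_B", "stage": "stage_1", "sex": "male", "strain": "strain_X"},
    {"organ": "organ_B", "stage": "stage_1", "sex": "male", "strain": "strain_Y"},
    {"organ": "organ_A", "stage": "stage_1", "sex": "male", "strain": "strain_X"},
]

counts = experiment_counts(toy_libraries)
for name, value in counts.items():
    print(f"{name}: {value}")
```

The first and last toy records share the same condition, which is why the library count (5) is larger than the condition count (4).
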
| Experiment ID | Experiment name | Library count | Condition count | Organ-stage count | Organ count | Stage count | Sex count | Strain count | Data source | Data source URL | Bgee normalized data URL | Experiment description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GSE44612 | Comparative Validation of the D. melanogaster Encyclopedia of DNA Elements Transcript Models | 13 | 6 | 3 | 3 | 1 | 2 | 3 | GEO | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44612 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_GSE44612.tsv.gz | |
| SRP099257 | Pervasive epigenetic effects of Drosophila euchromatic transposable elements impact their evolution [RNA-seq] | 2 | 1 | 1 | 1 | 1 | 1 | 1 | SRA | https://trace.ncbi.nlm.nih.gov/Traces/?view=study&acc=SRP099257 | https://bgee.org/ftp/current/download/processed_expr_values/rna_seq/Drosophila_simulans/Drosophila_simulans_RNA-Seq_read_counts_TPM_FPKM_SRP099257.tsv.gz | We study the relatively unexplored evolutionary consequences of the epigenetic effects of transposable elements (TEs).... |
The processed expression files (read counts, TPM, FPKM) can be retrieved per experiment for a specific species. They are accessible through FTP or through the download page, by selecting the species of interest and then clicking the button Download read counts, TPM, and FPKMs. When using the web page, all processed data for the species are downloaded. The data for each experiment are contained in separate files named using the experiment identifier. Each experiment file includes all processed data of all samples from the experiment.
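For scripted access, the per-experiment file location can be assembled from the species name and the experiment identifier. The sketch below is only an illustration: the URL pattern is inferred from the example URLs shown in this documentation and may not hold for every species or release, and the function name `processed_values_url` is hypothetical. No network request is made.

```python
# Assumption: base path inferred from the example URLs in this documentation.
BASE_URL = "https://bgee.org/ftp/current/download/processed_expr_values/rna_seq"


def processed_values_url(species, experiment_id):
    """Build the URL of a per-experiment processed expression file (hypothetical helper)."""
    # The example URLs use the species name with underscores instead of spaces.
    species_dir = species.strip().replace(" ", "_")
    filename = f"{species_dir}_RNA-Seq_read_counts_TPM_FPKM_{experiment_id}.tsv.gz"
    return f"{BASE_URL}/{species_dir}/{filename}"


url = processed_values_url("Drosophila simulans", "GSE44612")
print(url)
```

The Bgee normalized data URL column of the annotation files already contains this location, so reading it from there is the safer option when the annotation file is at hand.
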
| Column | Content | Example |
|---|---|---|
| 1 | Experiment ID | SRP099257 |
| 2 | Library ID | SRX2548614 |
| 3 | Library type | paired |
| 4 | Gene ID | FBgn0012823 |
| 5 | Anatomical entity ID | UBERON:0000922 |
| 6 | Anatomical entity name | embryo |
| 7 | Stage ID | UBERON:0000068 |
| 8 | Stage name | embryo stage |
| 9 | Sex | NA |
| 10 | Strain | W501 |
| 11 | Read count | 4 |
| 12 | TPM | 0.178537 |
| 13 | FPKM | 0.159188 |
| 14 | Rank | 10528 |
| 15 | Detection flag | absent |
| 16 | pValue | 0.13514812 |
| 17 | State in Bgee | Part of a call |
The Experiment ID column provides the unique identifier per experiment.
The Library ID column provides the unique identifier per sample that belongs to an Experiment ID (column 1).
The Library type column indicates the read layout of the library. This can be single or paired-end.
The Gene ID column provides the unique identifier of genes from Ensembl.
The Anatomical entity ID column provides the unique identifier of the anatomical entity, from the Uberon ontology.
The Anatomical entity name column provides the name of the anatomical entity defined by Anatomical entity ID (column 5).
The Stage ID column provides the unique identifier of the developmental stage, from the Uberon ontology.
The Stage name column provides the name of the developmental stage defined by Stage ID (column 7).
The Sex column provides the sex information ('not annotated', 'NA', 'mixed', 'male', 'female', 'hermaphrodite').
The Strain column provides information about the genetic variant or subtype of an organism.
The Read count column provides the number of reads from a target Library ID (column 2) that are assigned to Gene ID (column 4) when mapping to the transcriptome.
The TPM (transcripts per million) column provides a quantification measure, normalized for sequencing depth and gene length, of Gene ID (column 4) in a specific Library ID (column 2).
The FPKM (fragments per kilobase of transcript per million mapped fragments) column provides a quantification measure, normalized for sequencing depth and gene length, of Gene ID (column 4) in a specific Library ID (column 2).
The Rank column provides the rank of a Gene ID (column 4) in a condition for a species. It is used to compute expression ranks and expression scores.
The Detection flag column classifies each Gene ID (column 4) as present or absent.
The flag present means that the gene is actively expressed, and absent means that the gene is not actively expressed.
The genes are classified as present or absent by applying a cutoff to the pValue (column 16).
The p-value is a quantitative metric to detect if Gene ID (column 4) is actively expressed in any standalone RNA-Seq Library ID (column 2).
For each individual Library ID (column 2) we map reads both to transcripts and to the reference intergenic regions, and compute TPM (column 12) per Gene ID (column 4) (summing over transcripts) and per intergenic region. Then for each Gene ID (column 4) in the Library ID (column 2), we compute a Z-score in terms of standard deviations from the mean of reference intergenic regions:
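$$
\mathrm{ZScore}_{\mathrm{Gene\ ID}} = \frac{\log_2\left(\mathrm{TPM}_{\mathrm{Gene\ ID}}\right) - \mathrm{mean}\left(\log_2\left(\mathrm{TPM}_{\mathrm{RefIntergenic}}\right)\right)}{\mathrm{sd}\left(\log_2\left(\mathrm{TPM}_{\mathrm{RefIntergenic}}\right)\right)}
$$
Here TPM_{Gene ID} is the TPM (column 12) of Gene ID (column 4), and TPM_{RefIntergenic} denotes the TPM values of the reference intergenic regions.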
Then for Gene ID (column 4) in the Library ID (column 2) we calculate a p-value based on a null hypothesis of expression at a similar level to reference intergenic, estimated as a Normal distribution.
The library-specific TPM limit to call genes expressed is the minimum value of TPM where p-value ≤ α. In the download files, we used α = 0.05.
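The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Bgee pipeline: the function names are hypothetical, the TPM values are toy numbers, the standard deviation is the sample standard deviation, the p-value is taken as the one-sided upper tail of a standard Normal distribution, and zero TPM values are handled by skipping them in the intergenic reference and by assigning a p-value of 1 to genes.

```python
import math
from statistics import mean, stdev

ALPHA = 0.05  # cutoff quoted above for the download files


def zscore(gene_tpm, log2_ref_mean, log2_ref_sd):
    """Z-score of log2(TPM) relative to the reference intergenic distribution."""
    if gene_tpm <= 0:
        # Assumption: log2(0) is treated as minus infinity (p-value of 1).
        return float("-inf")
    return (math.log2(gene_tpm) - log2_ref_mean) / log2_ref_sd


def p_value(z):
    """One-sided upper-tail probability of a standard Normal (assumption)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def call_library(gene_tpm_by_id, ref_intergenic_tpm, alpha=ALPHA):
    """Return ({gene: (p-value, flag)}, TPM threshold) for one library."""
    # Assumption: intergenic regions with zero TPM are left out of the reference.
    logs = [math.log2(t) for t in ref_intergenic_tpm if t > 0]
    mu, sd = mean(logs), stdev(logs)
    calls = {}
    for gene_id, tpm in gene_tpm_by_id.items():
        p = p_value(zscore(tpm, mu, sd))
        calls[gene_id] = (p, "present" if p <= alpha else "absent")
    # Library-specific threshold: smallest observed TPM with p-value <= alpha.
    present_tpm = [
        gene_tpm_by_id[g] for g, (_, flag) in calls.items() if flag == "present"
    ]
    threshold = min(present_tpm) if present_tpm else None
    return calls, threshold


# Toy values for illustration only.
ref_intergenic = [0.25, 0.5, 1.0, 2.0, 4.0]
genes = {"gene_a": 0.0, "gene_b": 1.0, "gene_c": 16.0, "gene_d": 64.0}

calls, threshold = call_library(genes, ref_intergenic)
for gene_id, (p, flag) in calls.items():
    print(f"{gene_id}\t{p:.4g}\t{flag}")
print("TPM threshold:", threshold)
```

With these toy numbers, a gene whose TPM equals the intergenic mean on the log2 scale gets a p-value of 0.5 and is flagged absent, which matches the intent of the null hypothesis described above.
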
The State in Bgee column provides the information about the usage of Gene ID (column 4) to make expression calls.
Three different labels can be retrieved in this column:
Gene ID (column 4) was used to make an expression informative call.
Library ID (column 2). No calls will be generated for those Gene ID (column 4).
Library ID (column 2) does not allow to consider Gene ID (column 4) absent for this gene biotype.

| RNASeqProtocol | biotypes_excluded_for_absent_calls |
|---|---|
| polyA | 3prime_overlapping_ncRNA,antisense,antisense_RNA,bidirectional_promoter_lncRNA,lincRNA,lncRNA,macro_lncRNA,miRNA,misc_RNA,Mt_tRNA,ncRNA,other,piRNA,pre_miRNA,processed_transcript,ribozyme,TEC,rRNA,rRNA_pseudogene,Mt_rRNA,snoRNA,snRNA,sRNA,sense_intronic,sense_overlapping,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,Y_RNA,scaRNA,scRNA,vault_RNA |
| lncRNA | IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_D_pseudogene,IG_J_gene,IG_J_pseudogene,IG_LV_gene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,miRNA,misc_RNA,Mt_rRNA,Mt_tRNA,ncRNA,other,piRNA,polymorphic_pseudogene,pre_miRNA,processed_pseudogene,protein_coding,pseudogene,ribozyme,rRNA,rRNA_pseudogene,scaRNA,scRNA,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vault_RNA,Y_RNA |
| miRNA | 3prime_overlapping_ncRNA,antisense,antisense_RNA,bidirectional_promoter_lncRNA,IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_D_pseudogene,IG_J_gene,IG_J_pseudogene,IG_LV_gene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,lincRNA,lncRNA,macro_lncRNA,misc_RNA,Mt_rRNA,Mt_tRNA,ncRNA,other,piRNA,polymorphic_pseudogene,processed_pseudogene,processed_transcript,protein_coding,pseudogene,ribozyme,rRNA,rRNA_pseudogene,scaRNA,scRNA,sense_intronic,sense_overlapping,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vault_RNA,Y_RNA |
| circRNA | 3prime_overlapping_ncRNA,antisense,antisense_RNA,bidirectional_promoter_lncRNA,IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_D_pseudogene,IG_J_gene,IG_J_pseudogene,IG_LV_gene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,lincRNA,lncRNA,macro_lncRNA,miRNA,misc_RNA,Mt_rRNA,Mt_tRNA,ncRNA,other,piRNA,polymorphic_pseudogene,pre_miRNA,processed_pseudogene,processed_transcript,protein_coding,pseudogene,ribozyme,rRNA,rRNA_pseudogene,scaRNA,scRNA,sense_intronic,sense_overlapping,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,translated_unprocessed_pseudogene,tRNA,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vault_RNA,Y_RNA |
| ribo-minus | Mt_rRNA,rRNA,rRNA_pseudogene |
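
The table above can be used as a lookup: for a given protocol, an absent call is not considered reliable when the gene biotype appears in the excluded list. The sketch below shows one way to encode that check. It copies only the `ribo-minus` row for brevity, so the other protocols would need their full lists added before use; the names `EXCLUDED_FOR_ABSENT_CALLS` and `absent_call_is_reliable` are hypothetical.

```python
# Only the ribo-minus row of the table above is copied here for brevity;
# the remaining protocols would need their complete lists to be usable.
EXCLUDED_FOR_ABSENT_CALLS = {
    "ribo-minus": "Mt_rRNA,rRNA,rRNA_pseudogene",
}


def absent_call_is_reliable(protocol, biotype, table=EXCLUDED_FOR_ABSENT_CALLS):
    """Return False when `biotype` is excluded from absent calls for `protocol`.

    Raises KeyError for a protocol that is missing from `table`, rather than
    silently treating an unknown protocol as reliable.
    """
    excluded = set(table[protocol].split(","))
    return biotype not in excluded


print(absent_call_is_reliable("ribo-minus", "rRNA"))
print(absent_call_is_reliable("ribo-minus", "protein_coding"))
```

Raising an error for unlisted protocols is a deliberate choice in this sketch, since a partial table would otherwise give misleading answers.
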
| Experiment ID | Library ID | Library type | Gene ID | Anatomical entity ID | Anatomical entity name | Stage ID | Stage name | Sex | Strain | Read count | TPM | FPKM | Rank | Detection flag | pValue | State in Bgee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRP099257 | SRX2548614 | paired | FBgn0012820 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 0 | 0 | 0 | 13263 | absent | 1 | Part of a call |
| SRP099257 | SRX2548614 | paired | FBgn0012821 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 0 | 0 | 0 | 13263 | absent | 1 | Part of a call |
| SRP099257 | SRX2548614 | paired | FBgn0012823 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 4 | 0.178537 | 0.159188 | 10528 | absent | 0.13514812 | Part of a call |
| SRP099257 | SRX2548614 | paired | FBgn0012824 | UBERON:0000922 | embryo | UBERON:0000068 | embryo stage | NA | W501 | 274.911 | 16.049365 | 14.310003 | 5174 | present | 1.81226E-05 | Part of a call |
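
A processed expression file can be summarised in the same way as the annotation files. The sketch below parses a tab-separated excerpt of the four example rows above (a subset of columns only), lists the genes flagged present, and computes the share of present genes among the rows read. The function name `summarise_calls` is hypothetical, and the snippet assumes a tab-separated layout with the header names shown above. Read counts are parsed as floats because the example rows contain a fractional value.

```python
import csv
import io

# Excerpt of the example rows above (subset of columns).
# Assumption: the real file is tab-separated and uses these header names.
SAMPLE = (
    "Library ID\tGene ID\tRead count\tTPM\tDetection flag\tpValue\tState in Bgee\n"
    "SRX2548614\tFBgn0012820\t0\t0\tabsent\t1\tPart of a call\n"
    "SRX2548614\tFBgn0012821\t0\t0\tabsent\t1\tPart of a call\n"
    "SRX2548614\tFBgn0012823\t4\t0.178537\tabsent\t0.13514812\tPart of a call\n"
    "SRX2548614\tFBgn0012824\t274.911\t16.049365\tpresent\t1.81226E-05\tPart of a call\n"
)


def summarise_calls(handle):
    """Return (present gene IDs, percent present, total read count)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    present = [r["Gene ID"] for r in rows if r["Detection flag"] == "present"]
    percent_present = 100.0 * len(present) / len(rows) if rows else 0.0
    # Read counts may be fractional, so they are parsed as floats.
    total_reads = sum(float(r["Read count"]) for r in rows)
    return present, percent_present, total_reads


present, percent_present, total_reads = summarise_calls(io.StringIO(SAMPLE))
print(present, percent_present, total_reads)
```

For a downloaded `.tsv.gz` file, `gzip.open(path, "rt")` from the standard library returns a text handle that can be passed to the same function.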