Genome-wide Identification and Conservation of Unannotated Transcription Start Sites within a Bacterial Genus (2013)

W. Shao, M. N. Price, A. M. Deutschbauer, M. F. Romine, A. P. Arkin
mBio.01398-14

Abstract

Recent studies have revealed unexpected complexity in bacterial transcription. Transcription start sites (TSSs) lying inside annotated genes, on the same or opposite strand, have been observed in diverse bacteria. Here, we use the metal-reducing bacterium Shewanella oneidensis MR-1 and its relatives to study the prevalence and evolutionary conservation of unexpected TSSs. Using high-resolution "tiling" microarrays and 5’-end RNA sequencing, we identified 2,531 TSSs in MR-1, of which 18% were located inside coding sequences (CDSs). Comparative transcriptome analysis with seven additional Shewanella species revealed that 30% of the sense-oriented internal TSSs (iTSSs) were conserved. In addition, sequence analysis around these iTSSs showed conserved promoter motifs, suggesting that many iTSSs are selected for and under direct regulatory control. Evolutionary conservation was also detected for "orphan" TSSs that were intergenic and far from annotated genes (nTSSs). Some of the nTSSs drove the expression of putative non-coding RNAs. Conversely, most of the antisense TSSs that were located inside CDSs (aTSSs) had neither conserved transcription nor conserved promoter sequences. Overall, our findings demonstrate that most antisense transcription is not under selection within a bacterial genus. In contrast, other "unexpected" transcription, such as that from iTSSs and nTSSs, was conserved and thus likely functional.

Supplementary Tables

Table S1. Illumina sequencing raw data and analyzed results.
Table S2. List of identified transcription start sites for annotated protein coding genes in Shewanella oneidensis MR-1.
Table S3. List of identified internal transcription start sites on the sense strand (iTSSs).
Table S4. List of identified internal transcription start sites on the antisense strand (aTSSs).
Table S5. List of identified intergenic transcription start sites far from any annotated protein coding genes (nTSSs).
Table S6. Consistency between our identified TSSs and the previous literature.
Table S7. List of identified transcription start sites in Shewanella sp MR-4.
Table S8. List of identified transcription start sites in Shewanella sp MR-7.
Table S9. List of identified transcription start sites in Shewanella sp ANA-3.
Table S10. List of identified transcription start sites in Shewanella putrefaciens CN-32.
Table S11. List of identified transcription start sites in Shewanella sp W3-18-1.
Table S12. List of identified transcription start sites in Shewanella loihica PV-4.
Table S13. List of identified transcription start sites in Shewanella amazonensis SB2B.
Table S14. Effect of different log-odds cutoffs on TSS identification.
Table S15. List of identified novel non-coding RNAs (ncRNAs).

Viewing the data in Artemis

This zip file (122 megabytes) includes:

MR1_chromosome.txt -- GenBank annotation for MR-1 main chromosome (AE014299.2); can be used as reference in Artemis
Files that can be viewed with add user plot (only for the main chromosome):
- Tiling data (the genome-wide median is 0 and potentially cross-hybridizing probes have been removed; the intensity at each point is the value for the probe whose center is closest to that point; the 1st column is the plus strand and the 2nd column is the minus strand):
  - richnorm.plus_minus.art -- normalized tiling data from rich media
  - minnorm.plus_minus.art -- normalized tiling data from lactate minimal media
  - dmsonorm.plus_minus.art -- normalized tiling data from anaerobic growth with dimethyl sulfoxide as the electron acceptor
  - fenorm.plus_minus.art -- normalized tiling data from anaerobic growth with iron(III) citrate as the electron acceptor
  - hsnorm.plus_minus.art -- normalized tiling data from 10 minutes post heat shock at 42°C
- 5'RNA-seq data for Shewanella oneidensis MR-1:
  - rich1_5RNAseq_chr.art -- 5' RNA-seq data from rich media (experiment 1)
  - rich2_5RNAseq_chr.art -- 5' RNA-seq data from rich media (experiment 2)
  - min1_5RNAseq_chr.art -- 5' RNA-seq data from lactate minimal media (experiment 1)
  - min2_5RNAseq_chr.art -- 5' RNA-seq data from lactate minimal media (experiment 2)
- 5'RNA-seq data for other Shewanella species (mapped to MR-1 main chromosome):
  - rich_5RNAseq_MR4_MR1_chr.art -- Shewanella sp MR-4 in rich media
  - rich_5RNAseq_MR7_MR1_chr.art -- Shewanella sp MR-7 in rich media
  - rich_5RNAseq_ANA_MR1_chr.art -- Shewanella sp ANA-3 in rich media
  - rich_5RNAseq_CN_MR1_chr.art -- Shewanella putrefaciens CN-32 in rich media
  - rich_5RNAseq_W3_MR1_chr.art -- Shewanella sp W3-18-1 in rich media
  - rich_5RNAseq_PV_MR1_chr.art -- Shewanella loihica PV-4 in rich media
  - rich_5RNAseq_SB_MR1_chr.art -- Shewanella amazonensis SB2B in rich media
  - min_5RNAseq_MR4_MR1_chr.art -- Shewanella sp MR-4 in lactate minimal media
  - min_5RNAseq_MR7_MR1_chr.art -- Shewanella sp MR-7 in lactate minimal media
  - min_5RNAseq_ANA_MR1_chr.art -- Shewanella sp ANA-3 in lactate minimal media
  - min_5RNAseq_CN_MR1_chr.art -- Shewanella putrefaciens CN-32 in lactate minimal media
  - min_5RNAseq_W3_MR1_chr.art -- Shewanella sp W3-18-1 in lactate minimal media
  - min_5RNAseq_PV_MR1_chr.art -- Shewanella loihica PV-4 in lactate minimal media
  - min_5RNAseq_SB_MR1_chr.art -- Shewanella amazonensis SB2B in lactate minimal media
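
For readers who want to inspect the plot files outside Artemis, the following is a minimal Python sketch of reading one of the two-column files described above (plus strand first, minus strand second). It assumes each row holds whitespace-separated values for one genome position; that layout is an assumption and should be checked against the actual files. The function name read_user_plot and the demo values are made up for illustration.

```python
import io

def read_user_plot(handle):
    """Parse a two-column plot file: column 1 = plus strand, column 2 = minus strand.

    Assumes one whitespace-separated row per genome position (an assumption).
    """
    plus, minus = [], []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and any comment/header lines
        fields = line.split()
        plus.append(float(fields[0]))
        minus.append(float(fields[1]))
    return plus, minus

# Made-up values standing in for a file such as richnorm.plus_minus.art
demo = io.StringIO("0.0 1.5\n2.0 -0.5\n3.5 0.0\n")
plus, minus = read_user_plot(demo)
```

With a real file, the same function could be called on an open file handle, for example `with open("richnorm.plus_minus.art") as fh: plus, minus = read_user_plot(fh)`.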

Artemis is available here

Tables to download

All files are tab-delimited. *.gz files are compressed with gzip.

MR1 Tiling data
- PROBE_ID -- arbitrary identifier
- scaffoldId -- "MR1_Chromosome" for the main chromosome; "MR1_plasmid" for the megaplasmid
- strand -- the strand of the probe's sequence, the same as the strand of the RNA it assays because we hybridize to 1st-strand cDNA
- begin and end -- the extent of the probe
- match2 -- non-zero if this is a potentially cross-hybridizing probe
- nA, nC, nG, nT -- nucleotide composition of the probe
- genomic -- log2(raw intensity) in the genomic control
- code -- 1 if coding, 0 if intergenic, -1 if antisense (using the original annotation)
- rich.raw -- log2(raw intensity) for rich media
- min.raw -- log2(raw intensity) for lactate minimal media
- dmso.raw -- log2(raw intensity) for anaerobic growth with dimethyl sulfoxide as the electron acceptor
- fe.raw -- log2(raw intensity) for anaerobic growth with iron(III) citrate as the electron acceptor
- hs.raw -- log2(raw intensity) for 10 minutes post heat shock at 42°C
- rich.fit, rich.norm -- fitted value from the regression and normalized log-level for rich media
- min.fit, min.norm -- similarly for lactate minimal media
- dmso.fit, dmso.norm -- similarly for anaerobic growth with dimethyl sulfoxide as the electron acceptor
- fe.fit, fe.norm -- similarly for anaerobic growth with iron(III) citrate as the electron acceptor
- hs.fit, hs.norm -- similarly for 10 minutes post heat shock at 42°C
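
As an illustration of how the columns above might be used, here is a small Python sketch that drops potentially cross-hybridizing probes (match2 non-zero) and takes the median of rich.norm over the remaining probes. It assumes the tab-delimited file has a header row with the column names listed above; the function name load_clean_probes, the demo rows, and the "+"/"-" strand encoding are made up for illustration.

```python
import csv
import io
import statistics

def load_clean_probes(handle):
    """Return probe rows whose match2 is zero (not flagged as cross-hybridizing).

    Assumes a tab-delimited table with a header row (an assumption).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["match2"]) == 0]

# Made-up rows with only a subset of the documented columns
demo = io.StringIO(
    "PROBE_ID\tscaffoldId\tstrand\tbegin\tend\tmatch2\trich.norm\n"
    "p1\tMR1_Chromosome\t+\t1\t60\t0\t1.2\n"
    "p2\tMR1_Chromosome\t+\t30\t90\t1\t5.0\n"
    "p3\tMR1_Chromosome\t-\t61\t120\t0\t-0.4\n"
    "p4\tMR1_plasmid\t+\t1\t60\t0\t0.3\n"
)
probes = load_clean_probes(demo)
median_rich = statistics.median(float(p["rich.norm"]) for p in probes)
```

For the gzip-compressed tables, `gzip.open(path, "rt")` from the standard library returns a text handle that can be passed to the same function.
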
MR-1 5'-end RNA-seq data
- scaffoldId, strand, and start show the beginning of the reads (the putative 5' end of an RNA).
- n.rich1, n.rich2, n.min1, n.min2 show the number of reads at that location in the 1st experiment of rich media, the 2nd experiment of rich media, the 1st experiment of lactate minimal media, and the 2nd experiment of lactate minimal media.
- peak is TRUE if this position is a local peak
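
A minimal Python sketch of working with this table follows: it keeps only positions flagged as local peaks and sums the read counts across the four experiments. It assumes a header row with the column names above and that peak is stored as the text TRUE or FALSE; the function name peak_totals, the demo rows, and the "+"/"-" strand encoding are made up for illustration.

```python
import csv
import io

COUNT_COLUMNS = ["n.rich1", "n.rich2", "n.min1", "n.min2"]

def peak_totals(handle):
    """Map (scaffoldId, strand, start) -> total reads, for local peaks only."""
    totals = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["peak"].strip().upper() != "TRUE":
            continue  # not a local peak
        key = (row["scaffoldId"], row["strand"], int(row["start"]))
        totals[key] = sum(int(row[col]) for col in COUNT_COLUMNS)
    return totals

# Made-up rows for illustration
demo = io.StringIO(
    "scaffoldId\tstrand\tstart\tn.rich1\tn.rich2\tn.min1\tn.min2\tpeak\n"
    "MR1_Chromosome\t+\t100\t5\t7\t2\t1\tTRUE\n"
    "MR1_Chromosome\t+\t101\t1\t0\t0\t0\tFALSE\n"
    "MR1_plasmid\t-\t250\t0\t3\t4\t4\tTRUE\n"
)
totals = peak_totals(demo)
```
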
Classification of transcript start sites (TSSs) in MR-1
- scaffoldId, start, strand, n.rich1, n.rich2, n.min1, n.min2, and peak -- as in the MR-1 5' RNA-seq data
- bitScore -- the bit score of sigma factor 70 motif hit
- lo.rich, lo.min, lo.dmso, lo.fe, lo.hs, logoddsNrich1, logoddsNrich2, logoddsNmin1, logoddsNmin2, logoddsBits -- log odds values for each individual feature, including five tiling experiments, four 5'RNA-seq experiments, and motif hit bit score
- lo -- total log odds; lo >= 10 indicates a high-confidence TSS
- lo.noBit -- total log odds without considering the bit score; lo.noBit >= 7 indicates a high-confidence "motif-free" TSS
- code -- type of TSS: 1 if internal sense (iTSS), 2 if internal antisense (aTSS), 3 if upstream of a protein-coding gene (gTSS), 4 if intergenic (nTSS)
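
The cutoffs and type codes above can be applied with a few lines of Python. The sketch below counts TSSs per type that pass a log-odds cutoff, using either lo (cutoff 10) or lo.noBit (cutoff 7). It is not the authors' MR1_TSS.R procedure, only an illustration of filtering the finished table; it assumes a header row with the column names above, and the function name count_by_type and the demo rows are made up.

```python
import csv
import io
from collections import Counter

# Type codes as documented for the "code" column
TSS_TYPES = {1: "iTSS", 2: "aTSS", 3: "gTSS", 4: "nTSS"}

def count_by_type(handle, column="lo", cutoff=10.0):
    """Count TSSs per type whose log odds in `column` is at least `cutoff`."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row[column]) >= cutoff:
            counts[TSS_TYPES[int(float(row["code"]))]] += 1
    return counts

# Made-up rows with only a subset of the documented columns
DEMO = (
    "scaffoldId\tstart\tstrand\tlo\tlo.noBit\tcode\n"
    "MR1_Chromosome\t100\t+\t12.5\t8.0\t3\n"
    "MR1_Chromosome\t500\t+\t10.0\t6.5\t1\n"
    "MR1_Chromosome\t900\t-\t4.2\t3.0\t2\n"
    "MR1_Chromosome\t1500\t-\t15.1\t9.9\t4\n"
    "MR1_Chromosome\t2000\t+\t11.0\t7.0\t1\n"
)
high = count_by_type(io.StringIO(DEMO), column="lo", cutoff=10.0)
motif_free = count_by_type(io.StringIO(DEMO), column="lo.noBit", cutoff=7.0)
```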

Analyzing the data in R

See R image for files used in the analysis
See the main code for calling high-confidence TSSs in MR1_TSS.R
See additional functions in util.R

Sources

Transcriptomics data are available from Gene Expression Omnibus (GEO) (GSE58337):
- tiling data for S. oneidensis MR-1 (GSE45217 and GSE39468)
- 5’RNA-Seq data for S. oneidensis MR-1 (GSE45313 and GSE39474)
- 5’RNA-Seq data for Shewanella sp. MR-4 (GSE45314)
- 5’RNA-Seq data for Shewanella sp. MR-7 (GSE45315)
- 5’RNA-Seq data for Shewanella sp. ANA-3 (GSE45311)
- 5’RNA-Seq data for Shewanella putrefaciens CN-32 (GSE45312)
- 5’RNA-Seq data for Shewanella sp. W3-18-1 (GSE45318)
- 5’RNA-Seq data for Shewanella loihica PV-4 (GSE45316)
- 5’RNA-Seq data for Shewanella amazonensis SB2B (GSE45317)
- dRNA-seq data for S. oneidensis MR-1 (GSE58292)

This work, conducted by ENIGMA -- Ecosystems and Networks Integrated with Genes and Molecular Assemblies -- was supported by the Office of Science, Office of Biological and Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.