The transcripts and proteins of Desulfovibrio vulgaris Hildenborough

Evidence-based annotation of transcripts and proteins in the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough (2011)

M. N. Price, A. M. Deutschbauer, J. V. Kuehl, H. Liu, H. E. Witkowska, A. P. Arkin
J. Bacteriology 193:5716-27

Abstract

We used high-resolution tiling microarrays and 5' RNA sequencing to identify transcripts in Desulfovibrio vulgaris Hildenborough, a model sulfate-reducing bacterium. We identified the first nucleotide position for 1,124 transcripts, including 54 proteins with leaderless transcripts and another 72 genes for which a major transcript initiates within the upstream protein-coding gene, which confounds measurements of the upstream gene's expression. Sequence analysis of these promoters showed that D. vulgaris sigma70 prefers a different -10 box and -35 box than Escherichia coli sigma70 does. 549 transcripts ended at intrinsic (rho-independent) terminators, but most other transcripts seemed to have variable ends. We found low-level antisense expression of most genes, and the 5' ends of these transcripts mapped to promoter-like sequences. Because antisense expression was reduced for highly expressed genes, we suspect that elongation of non-specific antisense transcripts is suppressed by transcription of the sense strand. Finally, we combined the transcripts with comparative analysis and proteomics data to make 505 revisions to the original annotation of 3,531 proteins: we removed 255 (7.5%) proteins, changed 123 (3.6%) start codons, and added 127 (3.7%) proteins that were missed. Tiling data had higher coverage than shotgun proteomics and hence led to most of the corrections, but many errors probably remain. Our data are available at http://genomics.lbl.gov/supplemental/DvHtranscripts2011/.

Freely available at JB (or see older version)

Viewing the data in Artemis

This zip file (46 megabytes, updated) includes:

DvH_mo.genbank -- MicrobesOnline annotation for DvH at the start of this project; also includes terminator predictions from TransTermHP
Files that can be viewed in Artemis with "Add User Plot":
- richnorm3.plus_minus.art -- normalized tiling data from rich media (average of two biological replicates)
  - the genome-wide median is 0 and potentially cross-hybridizing probes have been removed
  - the intensity at each point is the value for the probe whose center is closest to that point
  - the 1st column is the plus strand and the 2nd column is the minus strand
- min1norm3.plus_minus.art -- normalized tiling data from minimal media (1 chip)
- RNASeq1 -- 5' RNASeq without exonuclease treatment
- RNASeq2 -- 5' RNASeq with exonuclease treatment
- (The RNASeq files were updated on April 29, 2013. Due to a programming error, some reads were not counted in the original RNASeq Artemis files.)
Feature files with potential ORFs:
- both.feature -- ORFs predicted by both CRITICA and RAST but not in the original annotation
- critica_only.feature -- ORFs predicted only by CRITICA
- rast_only.feature -- ORFs predicted only by RAST
- blastx.feature -- potential ORFs from blastx
And similar files for the megaplasmid, named *_mega.plus_minus.art. (There were no CRITICA-only megaplasmid predictions, so that file does not exist.)
dvmain_miyazaki_mauve.art -- a conservation track showing 1 for nucleotides that are conserved in D. vulgaris Miyazaki B (according to the Mauve alignment)

Artemis is available here

Loading all the plots requires a computer with several gigabytes of memory, and you will probably need to change your Java or Artemis settings so that Artemis can use that much memory. We also recommend smoothing the tiling data with a window size of 10-15 and log-transforming the 5' RNASeq data. By default, Artemis may smooth the 5' RNASeq data, which is not desirable; you can turn this off by lowering the window size to 1.
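To make the two recommended transformations concrete, here is a minimal Python sketch of a sliding-window mean and a log transform. Artemis does its own smoothing internally; the function names, the edge handling (the window shrinks at the ends), and the pseudocount of 1 are assumptions made for illustration and do not describe how Artemis computes its plots.

```python
import math

def running_mean(values, window=11):
    """Centered sliding-window mean (hypothetical helper).

    The window extends window // 2 positions to each side, so an even
    window spans window + 1 positions. Near the ends the window shrinks
    to the positions that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def log2_counts(counts, pseudocount=1.0):
    """log2(count + pseudocount); the pseudocount is an assumption that
    keeps positions with zero reads defined."""
    return [math.log2(c + pseudocount) for c in counts]

# Made-up values: a single spike is spread over its neighbors.
print(running_mean([0, 0, 3, 0, 0], window=3))
# Made-up read counts on a log scale.
print(log2_counts([0, 1, 3]))
```

With a window of 1 the running mean returns the input unchanged, which is why lowering the window size to 1 amounts to no smoothing.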

Tables to download

All files are tab-delimited. *.gz files are compressed with gzip.

Tiling data
- PROBE_ID -- arbitrary identifier
- scaffoldId -- 1944 for the main chromosome; 1945 for the megaplasmid
- strand -- the strand of the probe's sequence, which is the same as the strand of the RNA it assays because we hybridize to 1st-strand cDNA
- X, Y, and DMD -- the layout on the array
- begin and end -- the extent of the probe
- match2 -- non-zero if this is a potentially cross-hybridizing probe
- genomic -- log2(raw intensity) in the genomic control
- code -- 1 if coding, 0 if intergenic, -1 if antisense (using the original annotation)
- raw.min -- log2(raw intensity) for minimal media
- raw.rich -- log2(raw intensity) for rich media (average of two chips)
- nA, nC, nG, nT -- nucleotide composition of the probe
- fit.min, norm.min -- fitted value from the regression and normalized log-level for minimal media
- fit.rich, norm.rich -- similarly for rich media
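As an illustration of how the tiling columns fit together, the following Python sketch drops potentially cross-hybridizing probes (match2 non-zero) and reports the median normalized level for coding, intergenic, and antisense probes. It assumes that the header row uses the column names listed above, that match2 is always numeric, and that missing values are written as NA or left empty. The sample rows and the function name are made up.

```python
import csv
from statistics import median

# Tiny made-up table with a subset of the columns described above.
TILING_SAMPLE = (
    "PROBE_ID\tmatch2\tcode\tnorm.rich\n"
    "p1\t0\t1\t2.0\n"
    "p2\t0\t1\t4.0\n"
    "p3\t1\t1\t9.0\n"
    "p4\t0\t-1\t-0.5\n"
    "p5\t0\t0\tNA\n"
)

def median_by_code(lines, value_col="norm.rich"):
    """Median of value_col for each value of code
    (1 coding, 0 intergenic, -1 antisense)."""
    groups = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        if float(row["match2"]) != 0:
            continue  # potentially cross-hybridizing probe
        if row[value_col] in ("", "NA"):
            continue  # assumed encoding of missing values
        groups.setdefault(int(row["code"]), []).append(float(row[value_col]))
    return {code: median(vals) for code, vals in groups.items()}

print(median_by_code(TILING_SAMPLE.splitlines()))
```

For a gzip-compressed table, the same function can be given a file opened with gzip.open(path, "rt"), where path is wherever you saved the download.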
5' RNASeq data
- scaffoldId, strand, and start show the beginning of the reads (the putative 5' end of an RNA).
- nf, nppp, and ntot show the number of reads at that location in the 1st library, the 2nd (exonuclease) library, and the total.
- peakNew is TRUE if this position is a local peak
- only positions with 10 or more reads (across the two libraries) are included
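A short Python sketch of one way to pull candidate 5' ends out of this table: keep positions flagged as local peaks and rank them by total read count. The header names are assumed to match the column names above, and peakNew is assumed to be written as TRUE or FALSE. The threshold of 50 reads, the sample rows, and the function name are arbitrary choices for illustration.

```python
import csv

# Made-up rows using the columns described above.
RNASEQ_SAMPLE = (
    "scaffoldId\tstrand\tstart\tnf\tnppp\tntot\tpeakNew\n"
    "1944\t+\t100\t30\t40\t70\tTRUE\n"
    "1944\t+\t101\t5\t7\t12\tFALSE\n"
    "1944\t-\t500\t4\t8\t12\tTRUE\n"
    "1945\t+\t20\t100\t150\t250\tTRUE\n"
)

def strong_peaks(lines, min_reads=50):
    """Local peaks with at least min_reads total reads, strongest first.

    Returns (scaffoldId, strand, start, ntot) tuples.
    """
    peaks = []
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["peakNew"].strip().upper() != "TRUE":
            continue  # not a local peak
        ntot = int(row["ntot"])
        if ntot < min_reads:
            continue  # below the illustrative threshold
        peaks.append((row["scaffoldId"], row["strand"], int(row["start"]), ntot))
    return sorted(peaks, key=lambda p: p[3], reverse=True)

for peak in strong_peaks(RNASEQ_SAMPLE.splitlines()):
    print(peak)
```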
Intrinsic terminators
- scaffoldId, start, stop, strand -- the terminators predicted by TransTermHP. stop is the end of the stem-loop, a few nucleotides upstream of the expected termination location.
- confirm is TRUE if this terminator was confirmed by our tiling data
Classification of transcript starts
- scaffoldId, start, strand, nf, nppp, and ntot -- as in the raw 5' RNASeq data
- midrich -- the middle of the corresponding rise in the tiling data from rich media (if any)
- atrich -- the rise, corrected by 15 nt for the typical offset between the rise and the transcript start
- rich -- the local correlation of the rise
- midmin, atmin, min -- similarly for minimal media
- startm -- the start of the corresponding hit to a promoter motif, if any
- motif -- which motif (1 and 2 for sigma 70 with different spacings; 3 for rpoN; 4 for fliA)
- bits -- the bit score of the motif hit
- logodds values -- log odds values for each individual feature
- lo -- total log odds; lo >= 4 means high confidence
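The lo >= 4 cutoff and the motif codes can be applied together, as in this Python sketch that counts high-confidence transcript starts by promoter type. It assumes the header row uses the column names above, that motif is written as an integer, and that a start without a motif hit has NA or an empty field in that column. The sample rows and the function name are made up.

```python
import csv
from collections import Counter

# Motif codes as described above; 1 and 2 are both sigma 70.
MOTIF_NAMES = {"1": "sigma70", "2": "sigma70", "3": "rpoN", "4": "fliA"}

# Made-up rows with a subset of the columns.
STARTS_SAMPLE = (
    "scaffoldId\tstart\tstrand\tmotif\tlo\n"
    "1944\t100\t+\t1\t6.2\n"
    "1944\t900\t-\t2\t4.0\n"
    "1944\t1500\t+\tNA\t5.1\n"
    "1944\t2000\t+\t3\t1.3\n"
    "1945\t40\t-\t4\t7.5\n"
)

def confident_starts_by_motif(lines, min_lo=4.0):
    """Count transcript starts with lo >= min_lo, grouped by motif type."""
    counts = Counter()
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["lo"] in ("", "NA") or float(row["lo"]) < min_lo:
            continue  # not high confidence
        counts[MOTIF_NAMES.get(row["motif"], "no motif hit")] += 1
    return dict(counts)

print(confident_starts_by_motif(STARTS_SAMPLE.splitlines()))
```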
Gene annotations and our revisions and lengths of the 5' and 3' UTRs
- sysName -- also known as locus tag; DVU or DVUA are from the original annotation.
- locusId -- the MicrobesOnline or VIMSS id (if from the original annotation)
- scaffoldId, strand, start, stop -- the location of the gene
- type -- type=1 means protein-coding gene; type=7 means pseudogene derived from a protein-coding gene; types 9 and 10 are CRISPR repeats and spacers; other types are various kinds of non-coding RNAs
- desc -- description of the gene
- removed -- if present, the reason the gene was removed
- changed -- if present, the reason the gene was changed
- start.orig -- the original start
- start.critica -- the gene start from CRITICA, if this frame was selected by CRITICA
- start.rast -- the gene start from RAST, if this frame was selected by RAST
- critId and rastId -- ids for the CRITICA and RAST predictions; these are used in some of the Artemis feature files
- UTR5 -- length of the 5' UTR for this gene, if there is a confident transcript start upstream. 0 indicates a leaderless transcript.
- UTR3 -- length of the 3' UTR for this gene, if there is a confirmed terminator downstream.
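As a sketch of how the revision and UTR columns might be summarized, the Python below tallies removed and changed protein-coding genes and lists genes with leaderless transcripts (UTR5 equal to 0). It assumes the header row uses the column names above and that an absent value is an empty field or NA. The sample rows, including the reason strings and the locus tags, are invented for illustration.

```python
import csv

# Made-up rows with a subset of the columns described above.
GENES_SAMPLE = (
    "sysName\ttype\tremoved\tchanged\tUTR5\n"
    "DVU0001\t1\t\t\t35\n"
    "DVU0002\t1\t\t\t0\n"
    "DVU0003\t1\texample reason\t\tNA\n"
    "DVU0004\t1\t\texample reason\tNA\n"
    "DVU_t01\t5\t\t\tNA\n"
)

def present(value):
    """True if the field holds a value (assumed encoding of 'absent')."""
    return value not in ("", "NA")

def summarize_genes(lines):
    """Tally revisions and leaderless transcripts among type=1 genes."""
    summary = {"protein_coding": 0, "removed": 0, "changed": 0, "leaderless": []}
    for row in csv.DictReader(lines, delimiter="\t"):
        if row["type"] != "1":
            continue  # keep protein-coding genes only
        summary["protein_coding"] += 1
        if present(row["removed"]):
            summary["removed"] += 1
            continue  # removed genes are not examined further
        if present(row["changed"]):
            summary["changed"] += 1
        if present(row["UTR5"]) and float(row["UTR5"]) == 0:
            summary["leaderless"].append(row["sysName"])
    return summary

print(summarize_genes(GENES_SAMPLE.splitlines()))
```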
Peptides detected in ENIGMA experiments with fractionation or pull-downs
- peptide -- the sequence of the peptide detected
- nFractions -- the number of different fractions or experiments that this peptide was detected in
- Also see MS spectra that were inspected by hand (in Excel format; 2.6 MB).
Peptides detected in ENIGMA complete-proteome experiments
- prot_acc -- the identifier of the protein
- prot_score -- the total score of the protein (this table includes low-confidence peptides if detection of the protein was high-confidence)
- pep_expect -- expectation value for the peptide
- pep_seq -- the peptide sequence
Gene annotations with our revisions and recent RefSeq revisions
Operon predictions
- scaffoldId, strand -- which scaffold and strand the potential operon pair is on
- upg and dng -- the locusIds for the upstream and downstream genes in the pair
- start.up, stop.up, start.dn, stop.dn -- start and stop for the upstream and downstream genes
- min5 and rich5 -- smoothed minimum expression between the genes from a tiling array (or missing if it cannot be computed because there is little space between the genes)
- ttConfirm -- non-zero if there is a confirmed terminator between the genes
- rich.c.up -- median expression of the coding strand of the upstream gene in rich media
- rich.n.up -- median expression of the non-coding (antisense) strand of the upstream gene in rich media
- min.*.up -- similarly for minimal media
- *.dn -- similarly for the downstream gene
- start and lo -- the location and log odds score of the most confident promoter between the two genes (if any)
- code -- classification of the operon pair
- ExprSim -- Pearson correlation of gene expression patterns of the two genes

Analyzing the data in R

See R image (282 MB; from 64-bit Linux and R 2.11)
See documentation and function definitions in DvHtranscripts.R
See additional function definitions in utils.R (some of these are used by DvHtranscripts.R)

Sources

Transcriptomics data: Adam M. Deutschbauer and Jennifer V. Kuehl in the Arkin group at LBL
- Also available from GEO as GSE29560
Proteomics data: Haichuan Liu, Simon Allen, Evelyn D. Szakal, H. Ewa Witkowska at UCSF; Alyssa M. Redding-Johanson and Aindrila Mukhopadhyay at JBEI; and John-Marc Chandonia at LBL; protein samples were from the ENIGMA project.
Transcriptomics analyses and reannotation: Morgan Price (Arkin group)

This work conducted by ENIGMA -- Ecosystems and Networks Integrated with Genes and Molecular Assemblies -- was supported by the Office of Science, Office of Biological and Environmental Research, of the U. S. Department of Energy under Contract No. DE-AC02-05CH11231.