Downloads for "Deep Annotation of Protein Function across Diverse Bacteria from Mutant Phenotypes"
by Morgan N. Price, Kelly M. Wetmore, R. Jordan Waters, Mark Callaghan, Jayashree Ray, Hualan Liu, Jennifer V. Kuehl, Ryan A. Melnyk, Jacob S. Lamson, Yumi Suh, Hans K. Carlson, Zuelma Esquivel, Harini Sadeeshkumar, Romy Chakraborty, Grant M. Zane, Benjamin E. Rubin, Judy D. Wall, Axel Visel, James Bristow, Matthew J. Blow, Adam P. Arkin, and Adam M. Deutschbauer
The function of nearly half of all protein-coding genes identified in bacterial genomes remains unknown. To systematically explore the functions of these proteins, we generated saturated transposon mutant libraries from 32 diverse bacteria and we assayed mutant phenotypes across hundreds of distinct conditions. From 4,870 genome-wide mutant fitness assays, we obtained 18.7 million gene phenotype measurements and we identified a mutant phenotype for 11,779 proteins with previously unknown functions. The majority of these hypothetical proteins (62%) had phenotypes that were either specific to a few conditions or were similar to that of another gene in the same bacterium, thus enabling us to make informed predictions of protein function. For 2,316 of these hypothetical proteins, the functional associations are conserved across related proteins from different bacteria, which confirms that these associations are genuine. Based on the functional associations, we identified 13 novel families of DNA repair proteins, we proposed functions for 19 other uncharacterized protein families, and we identified 444 transporters or catabolic enzymes that had been annotated incorrectly. Across all sequenced bacteria, 12% of proteins that lack detailed annotations have an ortholog with a functional association in our data set. Our study demonstrates the utility and scalability of high-throughput genetics for annotating the functions of bacterial proteins.
See bioRxiv preprint (with just 25 bacteria)
The easiest way to view the data is with the Fitness Browser. You can also download the data for each organism here:
or as a tarball for all genomes here (large! 84 GB)
You can get information about the organisms and their genomes here:
Alternatively, you can download all of the data in the Fitness Browser from doi: 10.6084/m9.figshare.5134840
Also note that for some organisms, the Fitness Browser contains additional experiments beyond those described here. For these organisms, the cofitness values will not match.
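To illustrate why the values diverge, here is a minimal sketch in R. It assumes cofitness is the Pearson correlation between the fitness profiles of two genes across experiments. The matrix, the gene and experiment names, and the `cofitness` helper are all invented for the illustration and are not taken from the data set.

```r
# Toy fitness matrix: rows are genes, columns are experiments (invented values).
fit <- rbind(geneA = c(-2.0, -1.5,  0.1,  0.0),
             geneB = c(-1.8, -1.6,  0.0,  0.2),
             geneC = c( 0.1,  0.0, -2.2, -1.9))
colnames(fit) <- paste0("exp", 1:4)

# Assumption: cofitness is the Pearson correlation between gene fitness profiles.
# cor() correlates columns, so transpose to put genes in columns.
cofitness <- function(m) cor(t(m))

cofit_paper <- cofitness(fit)

# Two extra experiments, standing in for later additions (hypothetical values).
extra <- cbind(exp5 = c(-1.0, 1.0, 0.0),
               exp6 = c( 0.5, -2.0, 0.1))
cofit_browser <- cofitness(cbind(fit, extra))

# The same gene pair gets a different value once the set of experiments changes.
cofit_paper["geneA", "geneB"]
cofit_browser["geneA", "geneB"]
```

Because every experiment contributes to the correlation, adding or removing even a few experiments shifts the cofitness of every gene pair.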
- R images with all data and analyses
- Final data set of 32 bacteria (June 30, 2017)
- This R image contains all of the per-gene fitness values and comparative genomics analyses.
- The key data structure is the orgs variable, which is a list with one entry per organism. Most of the components of orgs were generated by comb.R in the FEBA code base. Other elements were generated by the code in plotfeba.R. (This code is also included in the R image, although the plotting code there is out of date.) Each element of orgs includes:
- genes -- a data frame with metadata about genes
- exps -- a data frame of metadata about experiments (including Time 0s)
- g -- a vector of gene ids in the order that they appear in lrn and in t
- q -- a data frame of quality metrics for fitness experiments, including whether each experiment succeeded (u)
- lrn -- gene fitness values, with 1 row per entry of g and 1 column per successful experiment
- t -- similarly for t values (but includes all experiments, not just the successful ones)
- lrn_t0 -- gene fitness values for control experiments that compare one Time0 sample to another
- t_t0 -- similarly the t values for the control experiments
- specsick -- specific phenotypes [but these fitness values can be positive, so "sick" is a misnomer]
- cofit -- the most cofit hits for each gene
- pfam -- all of the Pfam hits for the protein-coding genes in this organism
- tigrfam -- all of the TIGRFAM hits for the protein-coding genes in this organism
- parahits -- the similarities between proteins in this organism
- para -- for each gene, its closest paralog (BLAST score ratios under 0.25 are ignored)
- esstable -- for each protein-coding gene that is long enough, some metrics to help estimate whether the gene might be essential
- ess -- a vector of locusIds of the likely-essential genes in this organism
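As a rough sketch of how these components fit together, the R code below builds a toy stand-in for one entry of orgs and looks up the fitness and t values of one gene. Everything in the toy object is invented: the organism name, the gene and experiment ids, the numbers, and the `name` column of q. It also assumes that the columns of lrn and t are named by experiment, which should be checked against the real image. `gene_profile` is a hypothetical helper, not part of the FEBA code base.

```r
# Toy stand-in for one entry of `orgs` (all ids and values are invented).
g <- c("geneA", "geneB", "geneC")
q <- data.frame(name = c("exp1", "exp2", "exp3"),   # `name` is an assumed column
                u    = c(TRUE, FALSE, TRUE),        # success or not
                stringsAsFactors = FALSE)
# t has one column per experiment, successful or not.
t_all <- data.frame(exp1 = c(-8.1,  0.3, 1.2),
                    exp2 = c(-0.5,  0.1, 0.4),
                    exp3 = c(-6.4, -0.2, 5.0))
# lrn has columns only for the successful experiments.
lrn <- data.frame(exp1 = c(-2.9, 0.1, 0.2),
                  exp3 = c(-2.1, 0.0, 1.4))
orgs <- list(toyOrg = list(g = g, q = q, lrn = lrn, t = t_all))

# Fitness and t values of one gene across the successful experiments.
gene_profile <- function(org, gene) {
  i <- match(gene, org$g)             # g gives the row order of lrn and t
  if (is.na(i)) stop("gene not found: ", gene)
  ok <- colnames(org$lrn)             # successful experiments only
  data.frame(exp = ok,
             fit = unlist(org$lrn[i, ok], use.names = FALSE),
             t   = unlist(org$t[i, ok], use.names = FALSE),
             stringsAsFactors = FALSE)
}

prof <- gene_profile(orgs$toyOrg, "geneA")
print(prof)
```

Indexing through g rather than through row names keeps the lookup valid whether lrn and t are stored as data frames or as matrices.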
- Some other useful data structures are:
- allprot -- a data frame with metadata about proteins and what types of phenotypes (if any) they have
- bbh -- bidirectional best hits between protein-coding genes in different organisms (i.e., potential orthologs)
- sickcmp -- conservation of phenotypes for genes with specific phenotypes. ("sick" is a misnomer: both signs are included.)
- Also see older images: a preliminary data set of 25 bacteria (July 11, 2016) and an older image from May 13, 2016, also with 25 bacteria, that was based on preliminary TnSeq data for Pseudomonas fluorescens FW300-N1B4.
- Strain usage (gzipped tarball)
- If you do your own fitness assays with these libraries and want to include the same set of strains and genes in your analyses as we did in ours, then put the strainusage.* files in the g/organism/ subdirectory before running the R analysis scripts.
- This is *not* recommended unless you recover your library from the freezer in exactly the same way as we did.
- If you use these files, you should verify that your Time0 samples are similar to ours.
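One possible way to check that two Time0 samples are similar is to correlate per-strain abundances, as in the R sketch below. This is only an illustration under stated assumptions: the page does not say how the comparison should be done, the counts are invented, and `log_frac` and `time0_similarity` are hypothetical helpers rather than part of the analysis scripts.

```r
# Toy per-strain read counts from two Time0 samples (invented numbers).
# In practice one vector would come from your own Time0 sample and the
# other from a published Time0 sample for the same library.
strains   <- paste0("strain", 1:6)
mine      <- c(120, 80, 0, 45, 300, 15)
published <- c(100, 95, 2, 50, 280, 10)
names(mine) <- names(published) <- strains

# Convert counts to log2 fractions of total reads, with a pseudocount
# so that strains with zero reads stay finite.
log_frac <- function(counts, pseudo = 1) {
  log2((counts + pseudo) / sum(counts + pseudo))
}

# Pearson correlation of log2 abundances over the strains both samples share.
time0_similarity <- function(a, b) {
  shared <- intersect(names(a), names(b))
  cor(log_frac(a[shared]), log_frac(b[shared]))
}

r <- time0_similarity(mine, published)
print(r)
```

A low correlation would suggest that the recovered library differs in composition from the one used here, in which case reusing the strainusage.* files is probably not appropriate.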
Page by Morgan N. Price, Arkin group, June 2017