Mutant Phenotypes for Thousands of Bacterial Genes of Unknown Function

by Morgan N. Price, Kelly M. Wetmore, R. Jordan Waters, Mark Callaghan, Jayashree Ray, Hualan Liu, Jennifer V. Kuehl, Ryan A. Melnyk, Jacob S. Lamson, Yumi Suh, Hans K. Carlson, Zuelma Esquivel, Harini Sadeeshkumar, Romy Chakraborty, Grant M. Zane, Benjamin E. Rubin, Judy D. Wall, Axel Visel, James Bristow, Matthew J. Blow, Adam P. Arkin, and Adam M. Deutschbauer

Abstract

One third of all protein-coding genes from bacterial genomes cannot be annotated with a function. To investigate these genes' functions, here we collected genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions each. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. 2,316 of these poorly-annotated genes had associations that are of high confidence because they are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins and we proposed specific functions for poorly-annotated enzymes and transporters and for uncharacterized protein families. Our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.

See the article (paywalled), the final author version (free), or view the data in the Fitness Browser.

Data Downloads

Note added September 10, 2021. After publishing this data, Morgan Price and Adam Deutschbauer discovered that our stock solutions for sucrose and D-mannitol were problematic. In particular, Escherichia coli BW25113 is a K-12 strain (closely related to MG1655) and should not be able to grow on sucrose. In M9 media made with our original stock solution of sucrose, E. coli BW25113 grew, but in media made with a fresh stock solution, it did not. Similarly, growth of E. coli on mannitol should require the phosphotransferase uptake protein mtlA and the dehydrogenase mtlD. In our original fitness assays for E. coli, mtlA and mtlD were not important for growth on mannitol; instead, manX and manY, which encode the mannose phosphotransferase system, were important. When we repeated these experiments with a fresh stock solution for D-mannitol, we found that mtlA and mtlD were important for fitness, and manX and manY were not.

Please disregard any of the data from this publication regarding sucrose or D-mannitol. In the Fitness Browser, the problematic fitness experiments have been removed. As of September 2021, the data for these compounds in the Fitness Browser is from fresh stock solutions of sucrose and D-mannitol. We also checked that the data from these experiments is consistent with prior knowledge of the utilization of these compounds. Finally, we checked the gene re-annotations that were related to sucrose or mannitol utilization.

You can download the data for each organism here:

or as a tarball for all genomes here (large! 84 GB)

You can get information about the organisms and their genomes here:

Metadata about the 32 bacteria (tab-delimited)
Genome sequences and gene models
- This is a tarball of genome fasta files, protein fasta files, and annotation tables, with one subdirectory per organism.
- The subdirectory is g/name, where the name for each bacterium is given in the above metadata table.
Mapping from our org/locusId to the protein accession or locus_tag in NCBI's RefSeq (tab-delimited).
- This mapping was done by sequence, requiring 97% identity and 70% coverage, and covers 97% of the proteins.
- Proteins that do not match are missing from the RefSeq annotation or are redundant (another protein in the genome has exactly the same sequence).
- Also see mapping from orgId to NCBI assembly
Mapping from our org/locusId to uniprot identifiers (tab-delimited).
- This mapping was done by sequence, requiring 98% identity.
- Unfortunately proteins from four of the bacteria are not in UniProt (as of November 2017), even though they are in RefSeq. This can happen if an assembly is deposited into Genbank without any protein annotations and is annotated by RefSeq.
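
The sequence-based mappings above rely on identity and coverage thresholds. As a rough illustration only (this is not the code used to build the mapping tables), the Python sketch below filters hypothetical alignment hits with the RefSeq thresholds of 97% identity and 70% coverage. The Hit fields, the function names, the choice to require coverage of both sequences, and the tie-breaking rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise alignment hit (hypothetical record layout)."""
    query_id: str      # our org/locusId
    subject_id: str    # accession in the target database
    identity: float    # percent identity over the aligned region (0-100)
    aligned_len: int   # number of aligned positions
    query_len: int     # length of our protein
    subject_len: int   # length of the target protein


def passes(hit, min_identity=97.0, min_coverage=0.70):
    """Apply the thresholds. Assumption: coverage must hold for both sequences."""
    cov_query = hit.aligned_len / hit.query_len
    cov_subject = hit.aligned_len / hit.subject_len
    return hit.identity >= min_identity and min(cov_query, cov_subject) >= min_coverage


def map_ids(hits, min_identity=97.0, min_coverage=0.70):
    """Return query_id -> subject_id, keeping one passing hit per query.

    Assumption: among passing hits, prefer higher identity, then longer alignment.
    """
    best = {}
    for hit in hits:
        if not passes(hit, min_identity, min_coverage):
            continue
        current = best.get(hit.query_id)
        if current is None or (hit.identity, hit.aligned_len) > (
            current.identity,
            current.aligned_len,
        ):
            best[hit.query_id] = hit
    return {query: hit.subject_id for query, hit in best.items()}
```

Proteins with no passing hit are simply absent from the returned dictionary, which mirrors the way unmatched proteins are missing from the mapping tables. For a UniProt-style mapping, min_identity=98.0 could be passed instead; the source does not state a coverage threshold for that mapping.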

Alternatively, you can download all of the data in the Fitness Browser (as of June 2017) from doi: 10.6084/m9.figshare.5134840

Also note that for some organisms, the Fitness Browser contains additional experiments beyond those described here. For these organisms, the cofitness values in the Fitness Browser will not match those computed from the data described here.

Other Downloads

Tables
- Supplementary tables (Excel format)
  - Or see Supplementary tables for the bioRxiv preprint
- Likely-essential proteins
- Conserved functional associations
- Reannotations
  - All reannotated proteins in the current Fitness Browser (this includes reannotations that are beyond the scope of this study and also a few corrections)
  - Reannotations of transporters and catabolic enzymes (for this study; as of summer 2017)
R images with all data and analyses
- Final data set of 32 bacteria (June 30, 2017)
  - This R image contains all of the per-gene fitness values and comparative genomics analyses.
  - The key data structure is the orgs variable, which is a list with one entry per organism. Most of the components of orgs were generated by comb.R in the FEBA code base. Other elements of the list were generated by the code in plotfeba.R. (This code is also in the R image, although the plotting code in the R image is out of date.) Each element of orgs includes:
    - genes -- a data frame with metadata about genes
    - exps -- a data frame of metadata about experiments (including Time 0s)
    - g -- a vector of gene ids in the order that they appear in lrn and in t
    - q -- a data frame of quality metrics for fitness experiments, including whether each experiment succeeded (u)
    - lrn -- gene fitness values, with 1 row per g and 1 column per successful experiment
    - t -- similarly for t values (but includes all experiments, not just the successful ones)
    - lrn_t0 -- gene fitness values for control experiments that compare one Time0 sample to another
    - t_t0 -- similarly the t values for the control experiments
    - specsick -- specific phenotypes [but these fitness values can be positive, so "sick" is a misnomer]
    - cofit -- the most cofit hits for each gene
    - pfam -- all of the PFam hits for the protein-coding genes in this organism
    - tigrfam -- all of the TIGRFam hits for the protein-coding genes in this organism
    - parahits -- the similarities between proteins in this organism
    - para -- for each gene, its closest paralog (BLAST score ratios under 0.25 are ignored)
    - esstable -- for each protein-coding gene that is long enough, some metrics to help estimate whether the gene might be essential
    - ess -- a vector of locusIds of the likely-essential genes in this organism
  - Some other useful data structures are:
    - allprot -- a data frame with metadata about proteins and what types of phenotypes (if any) they have
    - bbh -- bidirectional best hits between protein-coding genes in different organisms (i.e., potential orthologs)
    - sickcmp -- conservation of phenotypes for genes with specific phenotypes. ("sick" is a misnomer: both signs are included.)
- Also see older images: preliminary data set of 25 bacteria (July 11, 2016) and an older image from May 13, 2016, also with 25 bacteria, that was based on preliminary TnSeq data for Pseudomonas fluorescens FW300-N1B4.
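
The bbh table described above holds bidirectional best hits, i.e., potential orthologs. For readers unfamiliar with the idea, here is a minimal Python sketch of the general technique. It is not the code that produced bbh: the input layout (dictionaries keyed by query and subject with a similarity score as the value) and the function names are assumptions for illustration, and any score or coverage cutoffs used in the actual analysis are not reproduced here.

```python
def best_hits(scores):
    """Given {(query, subject): score}, return {query: best-scoring subject}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_in_A, gene_in_B) pairs that are each other's best hit.

    a_vs_b: scores for proteins of organism A searched against organism B.
    b_vs_a: scores for the reverse search.
    """
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

A pair is kept only when the best hit agrees in both directions, so a gene whose best hit in the other genome prefers a different partner is left without a putative ortholog.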
Strain usage (gzipped tarball)
- If you do your own fitness assays with these libraries and want to include the same set of strains and genes in your analyses as we did in ours, then put the strainusage.* files in the g/organism/ subdirectory before running the R analysis scripts.
- This is *not* recommended unless you recover your library from the freezer in exactly the same way as we did.
- If you use these files, you should verify that your Time0 samples are similar to ours.
Information about every mappable TnSeq read, as computed with MapTnSeq.pl
- acidovorax_3H11 ANA3 azobra BFirm Caulo Cola Cup4G11 Dyella79 Dino HerbieS Kang Keio Korea Koxy Marino Miya MR1 Phaeo Ponti PS pseudo1_N1B4 pseudo3_N2E3 pseudo5_N2C3_1 pseudo6_N2E2 pseudo13_GW456_L13 psRCH2 Pedo557 PV4 SB2B Smeli SynE WCS417 (large gzipped tarballs)
Strong positive fitness effects (tab-delimited)
- This table shows all strong positive fitness effects.
- It includes experiments that did not meet our quality thresholds due to strong positive selection.
- Effects are included in this table if gene fitness > 2, t > 5, estimated standard error < 1, fit >= max(fit in experiment) - 8, and mean(reads per gene) of at least 10.
- orgId and locusId specify the gene. sysName (optional) and desc describe the gene.
- expName specifies the experiment, and short, Condition_1, Concentration_1, Units_1, and Media describe the experiment
- used is TRUE if the experiment met our quality thresholds
- gMean is the average number of reads per gene. Note that this is the average, not the median. (Experiments with strong positive selection often fail to meet the quality threshold that median reads per gene be at least 50.)
- maxFit is the maximum fitness value of any gene in this experiment
- fit is the gene's fitness value
- t is the t-like test statistic and se is the estimated standard error (fit/t)
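
The inclusion criteria for this table can be restated as a small filter. The Python sketch below is illustrative only and is not the code that generated the table. It uses the column names listed above (fit, t, maxFit, gMean), derives the standard error as fit/t, and reads a tab-delimited text. The function names and the example rows in the usage note are made up for the example.

```python
import csv
import io


def is_strong_positive(fit, t, max_fit, g_mean):
    """Return True if a gene/experiment pair meets the stated thresholds."""
    if t <= 5:  # requires t > 5, which also makes the division below safe
        return False
    se = fit / t  # estimated standard error, as defined for the se column
    return fit > 2 and se < 1 and fit >= max_fit - 8 and g_mean >= 10


def filter_strong_positive(tab_text):
    """Parse tab-delimited text with a header row and keep the qualifying rows."""
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    return [
        row
        for row in reader
        if is_strong_positive(
            float(row["fit"]),
            float(row["t"]),
            float(row["maxFit"]),
            float(row["gMean"]),
        )
    ]
```

For example, a hypothetical row with fit = 3.0, t = 6.0, maxFit = 5.0, and gMean = 20 passes (the standard error is 0.5), whereas a row with fit = 6.0 and t = 5.5 fails because its standard error exceeds 1.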

Older

An earlier version (with just 25 bacteria and a different title) was posted on bioRxiv
reviews

Page by Morgan N. Price, Arkin group, June 2017