The reviewers' comments on "Deep Annotation of Protein Function across Diverse Bacteria from Mutant Phenotypes", which was later titled "Mutant Phenotypes for Thousands of Bacterial Genes of Unknown Function".

Round 1

The first submission was nearly identical to the bioRxiv pre-print. The reviewers' comments were:

Referee #1 (Remarks to the Author)

Summary

Price et al. conducted a large-scale chemical-gene screen consisting of thousands of fitness assays for transposon libraries of 25 different strains of bacteria. The importance of this study is that it is the first large-scale comparison across bacterial species, but its value is somewhat lessened because of a large focus on two genera (Pseudomonas and Shewanella) and on γ-proteobacteria. They find thousands of new phenotypes, including many phenotypes for genes of unknown function, and cite several examples of how these phenotypes may relate to gene function. They show that using conservation as a criterion for functional relationships provides robustness to their predictions. Overall, the work provides a substantial dataset with a significant number of novel phenotypes, and the web interface for accessing phenotype data is very useful and accessible. However, the manuscript lacks follow-up analysis for any new putative gene function and the findings are consistently over-hyped.

Except in rare cases, the promise that this work provides “deep annotation of protein function” is largely unsupported.

General comments

1. The authors claim to utilize a diverse set of bacteria for their studies, but all of the strains except Synechococcus are proteobacteria and 14/25 are γ-proteobacteria. Moreover, unaccountably, the γ-proteobacterial data set is predominantly composed of Shewanella (4) and Pseudomonas (7) strains, even though Pseudomonas genomes are relatively conserved across species (as compared to those of E. coli). As indicated in methods, the fitness studies for some of the strains have been published previously (Wetmore et al. 2015, mBio; Rubin et al. 2015 PNAS), and those from the Arkin laboratory (10% of total) have been incorporated in the current dataset. The Arkin group also previously published the essential gene set of Synechococcus; is that dataset used here?

2. The use of similar strains (e.g., the 5 Pseudomonas fluorescens strains) raises important questions about some of the data analysis choices and conclusions of the manuscript. The authors use conserved phenotypes across organisms as a metric for how much confidence they have in a particular phenotype. Showing conservation across very similar strains may have different consequences than showing it across those that are more distantly related. In the current presentation, I was unable to determine phenotypes that are conserved only across the same species. How many phenotypes are conserved across phylogenetic groups (e.g., between Alpha-proteobacteria and Beta-proteobacteria)?

3. Phenotype calculations in the manuscript do not consider the number of generations strains have grown, which may lead to significant false negative/positive results and makes it difficult to compare between strains and conditions. According to the manuscript, strains were inoculated at an OD of 0.02 in the selective media and then grown to saturation. The authors say this is 4-8 doublings, but it’s unclear if they actually measured the final OD values (which is key for correctly calculating fitness). Importantly, strains grown for 8 generations should show stronger quantitative phenotypes because the number of doublings in that condition is greater; some phenotypes may only be apparent at 8 generations. Therefore, comparing even the same strain at different generation numbers in the same condition would result in different significant phenotypes. The authors should clarify their experimental conditions and recalculate their phenotype results and fitness measurements by taking into account the generation time as in van Opijnen et al. 2013 Nat Rev Microbiol.

4. A potentially important aspect of the manuscript is the ability to determine the essential gene complement for a number of organisms, but it is unclear whether the density of transposon mutagenesis achieved in some organisms is high enough to clearly call genes essential. Especially concerning is the fact that the authors had a very high false discovery rate for their E. coli K-12 positive control, calling into question their ability to make this call for organisms with shallow libraries. The fact that the fraction of essential genes is as high as 23% of the genomic complement in some organisms, which is unlikely to be true, is further indication that their mutagenesis is not sufficiently saturated to provide reliable estimates of the essential gene complement. Furthermore, libraries are grown extensively prior to the start of each experiment, likely removing slow growing as well as essential genes prior to the initial measurement. The authors should call this gene set slow growing/essential.

5. The importance of this work is that it provides the first large dataset of phenotypes across organisms. Indeed the authors claim: “Our results provide high resolution genome-wide fitness maps for all 25 bacteria”. How well did they do at providing comprehensive fitness data? Variably well. First of all, between 50-75% of the genes either have no phenotype or no data. Second, they do best with genes that already have detailed TIGR annotation (52%), and significantly less well with all other classes [e.g., detailed annotation, vague (27%) or hypotheticals (19%)], meaning that a great deal of the data is validating TIGR. This is important, especially given the fact that predictions can be wrong, but is perhaps not the focus the authors wished for.

6. The authors believe that they have established metrics to predict protein function (see: “conserved phenotypes are accurate predictors of protein function” line 165). The authors use two metrics:

(1) specific phenotypes (e.g. one or very few) across conditions that are conserved among some groups of bacteria.

(2) co-variance of two or more genes across conditions that are conserved among some groups of bacteria.

These metrics work particularly well in a few restricted cases. The most successful application is identifying the substrate class of transporters. This is important information that will be useful to the community. They are also reasonably successful at finding genes involved in growth on C and N-sources (a large fraction of their dataset). This is useful when novel classes of enzymes perform a clear function. However, outside of these instances, the metrics seem to perform poorly. For example, in the case of DNA damage, the authors use sensitivity to cisplatin as a metric to identify 11 new genes involved in DNA repair, with secondary criteria of a “DNA-related domain” or presence in a LexA regulon. However, the secondary criteria do not exclude the possibility of the proteins encoding other functions, showing the essentiality of follow-up work. Finally, of the 3 present in E. coli, 2 had already been shown to be involved in resistance to radiation, in studies that go beyond those reported here, and the third, a well-characterized endonuclease, is reportedly present in the periplasm, suggesting that it does not directly participate in repair. Moreover, these phenotypes are not indicative of protein function.

To show how poorly this metric functions to provide predicted protein function, consider the 77 domains of unknown function that have conserved phenotypes. Of these, the authors claim to have specific molecular functions for 8. Of the 4 examples they discuss, I think it is likely that the authors have guessed correctly for two (thallium + sulfite reductase), the glycine transporter hypothesis is not convincing, and the discussion of conserved co-association of three uncharacterized proteins is somewhat disingenuous as these three are highlighted as interactors in the EcID database (Andres Leon et al., Nucleic Acids Res. 2009), and moreover, from the data presented, have different phenotypes in different organisms.

The data that the authors have collected is very important for the bacterial community, but in general, it provides phenotypes for further dissection, rather than annotation of protein function.
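
General comment 3 above asks that fitness values account for the number of generations grown. As a minimal sketch of that idea, and not the formula from van Opijnen et al. or the authors' own pipeline, the snippet below divides a log2 abundance ratio by the number of doublings estimated from the starting and final OD. The function names and the OD values are hypothetical, and the approach assumes OD is proportional to cell number over the range used.

```python
import math


def doublings(od_start, od_end):
    """Estimate population doublings from starting and final optical density."""
    if od_start <= 0 or od_end <= 0:
        raise ValueError("optical densities must be positive")
    return math.log2(od_end / od_start)


def per_generation_fitness(log2_ratio, od_start, od_end):
    """Scale a log2 abundance ratio by the number of doublings in the assay.

    log2_ratio is the log2 change in a mutant's relative abundance between
    the start and end of the experiment (an assumed input, not computed here).
    """
    g = doublings(od_start, od_end)
    if g <= 0:
        raise ValueError("the culture did not grow; cannot normalize")
    return log2_ratio / g


# The same per-generation defect gives different raw values at 4 vs 8 doublings.
four_gen = per_generation_fitness(-2.0, 0.02, 0.32)   # 4 doublings
eight_gen = per_generation_fitness(-4.0, 0.02, 5.12)  # 8 doublings
```

In this illustration a raw value of -2 after 4 doublings and a raw value of -4 after 8 doublings correspond to the same per-generation defect, which is the comparison problem the comment describes.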

Specific comments

1. The label “Mutants per protein” in Fig. 1B is very confusing. Is this supposed to indicate the “number of mutant strains used to estimate fitness for a given protein”—or insertions per gene? These numbers seem inconsistent with the “5 to 66 insertions for the median protein coding gene” mentioned in the text.

2. In Fig. 1D, most strains with detailed annotation do not have a phenotype. Please explain why that is the case. Are the conditions used simply not expected to pick up phenotypes for these genes?

3. In addition, the authors need to describe how they selected species to annotate more variable (or deeper) protein function in these groups of bacteria, i.e., where are these species in the tree of life? Which method did they use to create the phylogenetic tree: 16S based, or using a set of universal marker proteins (Mende, D.R., Sunagawa, S., Zeller, G., and Bork, P. (2013). Accurate and universal delineation of prokaryotic species. Nat Methods 10, 881-884)?

Referee #2 (Remarks to the Author):

This manuscript describes the use of transposon mutagenesis sequencing (TnSeq) to identify genes involved in a large number of phenotypes from across a range of bacteria. They use the resulting dataset to identify conserved genes involved in specific phenotypes, and ascribe function to many previously uncharacterised genes. The manuscript is the first large-scale comparative use of this technology, and the results should be of general interest, as well as being valuable to a large number of microbiologists.

The experiments seem sound and well-described, and the analyses appropriate. Given the size of the data set, the number of experiments and analyses performed and the necessarily limited description of these in the format of this paper, it is difficult to be certain of the robustness of all of these. However, I didn't identify anything that would concern me.

I have a few concerns about the basis of the experiments, and their write up that would improve the manuscript if addressed:

Line 89. The authors describe this as "a phylogenetically diverse set of bacteria". I'm not sure I agree. Of the set, all but one are proteobacteria, leaving the large majority of bacteria untouched. Even within that, the majority are gamma-proteobacteria, and of those eight are Pseudomonads, and six are P. fluorescens. This is actually a fairly poor representation of bacterial diversity, and heavily weighted towards already well-studied organisms. The authors should make this clearer, discuss the limitations of this choice, and be more circumspect about the breadth of applicability of the data.

The isolates used in this study should be submitted to a public culture collection, and the accession numbers given in the manuscript.

Data availability: The authors describe a web server they have set up to provide this data to the scientific community. However, this is of limited utility in the long term. They do not describe any other attempts to propagate this data. They claim to have ascribed functions to many previously unknown proteins. Have they included this information in the annotation of these genomes submitted to the public databases, or did they submit the original RAST annotation? They have ascribed functions to many PFAM domains of unknown function. Have they tried to update PFAM, so that future users of that database benefit from this information?

In many parts of the manuscript they complain about the current mismatch between sequence and function. This will not be fixed by keeping newly described functions on a stand-alone webserver with a limited lifespan. To be of long-term use, this data needs to be in the public domain, and propagated through resources that all biologists use.

Referee #3 (Remarks to the Author):

Overview:

Price and colleagues present a large-scale effort to systematically explore bacterial protein function using a high-throughput genetic fitness assay. They created libraries of randomly inserted transposons within the whole genome of 25 bacterial species across the phylogenetic tree. The authors had to sequence the genomes of some of the previously uncharacterized bacteria in this study, adding to the data created. Mutant clones were identified through sequencing of flanking regions of the inserted transposons, and unique barcode sequences included within the transposon library allowed for simplification of mutant identification after one round of transposon-barcode assignment. The bacterial libraries were then grown in a large number of different conditions, including different carbon, sulfur and nitrogen sources, as well as several chemical inhibitors and other stresses such as low temperature and pH. Fitness of mutant clones was assessed using a technique previously published by members of the same laboratories, where relative abundance of mutant clones can be assessed by sequencing the barcode regions only, thus determining enrichment and de-enrichment of certain mutants over others.

The amount of data recorded in this study is astonishing, and will definitely impact the field going forward. Further, the rationale of finding novel ways to bridge the gap between sequence and protein function in a wide array of bacterial species in high-throughput systems should be of great interest to a wide audience, as the gap prevails despite several attempts over the years, using different approaches, to close it. The techniques used in this study are interesting and reasonably novel – the same group of authors has published smaller studies using the same RB-TnSeq technique only last year. In addition, by creating a publicly available database in the form of an interactive analysis platform that is relatively straightforward in navigation, the authors have made an effort to allow future researchers of all trades to make use of the vast information that might be found within their data.

While all these factors speak for publication in Nature, I had serious trouble with the clarity and conciseness of the writing of the manuscript at hand, and do not think that the article in its current form would easily be understandable by a wide readership. This also leads to many inaccuracies (of varying degrees) that should be addressed before publication.

Major comments:

- Overall, the writing needs to be improved for clarity and conciseness. Not all results are comprehensibly described, and understanding without reading the Methods section in detail is hampered greatly (see minor comments for examples on a per line basis).

- Conserved cofitness, though a very interesting and seemingly useful idea, needs to be described more carefully. The introduction of this concept, and its usefulness in predicting functional annotations for genes that currently have no or vague annotations in databases presents the most novel concept in this study. However, the presentation of this concept, as well as the proof of its utility, is rather weak.

First of all, the authors need to add a clear definition of the method of how conserved cofitness is determined. I assume they simply use a similar correlation between fitness profiles of genes from two distinct organisms, as opposed to genes from the same organism when determining cofitness, but I did not find this information in the manuscript. To illustrate the concept better, the authors should include a plot showing conserved cofitness, instead of two cofitness plots of different bacterial species next to each other (as currently shown in Fig. 2D). In addition, it is never actually mentioned which statistic is used; I am assuming it is a simple Pearson’s correlation, but that should be included in the manuscript. In addition, a more careful analysis of the validity of the authors’ claim that conserved cofitness predicts functional roles should be carried out. For example, the authors don’t discuss how they arrived at the respective cutoffs for both cofitness and conserved cofitness. To this end, I suggest showing a distribution of correlation scores for both cofitness and conserved cofitness between pairs of genes. Comparing orthologs (or paralogs for conserved cofitness) with random pairs of non-related genes should guide the decision for a cutoff. A statistical test, such as a simple permutation test, could then be used to prove the validity of the association of the correlation score with gene function.

- Diversity of library in small sample size – The authors need to show that library diversity is maintained when the sample size per fitness assay is decreased from the input sample (input sample: undefined volume taken from a culture of 25 ml total; split into undefined ‘multiple’ fitness assays). A simple way of doing so would be to compare the diversity of the input library to a control fitness assay without any modifications to the media. This should be included in the supplementary material.

- Computation of fitness values – The authors describe that the fitness value for each gene is calculated as a weighted average of all strains that have insertions in this gene. This assumes a similar phenotype for insertions in all parts of a gene. While it has previously been shown that this method is valid (van Opijnen et al., Nat. Methods 2009), I am very concerned about the difference in the number of insertions per gene between the citation and this study: While van Opijnen et al. report on average 73.8 insertions per gene, this study’s numbers range from 3-23 across the different bacteria, with an average of 9.64 insertions per gene. Because of these low numbers, the authors should show the validity of using the average per gene.

- Little functional follow-up – The manuscript loses some of its strength by not showing sufficient functional follow-up. One possible suggestion would be a simple Glycine uptake assay accompanying Fig. 4a, comparing radiolabelled Glycine uptake in wild-type vs UPF0126 mutants to prove the hypothesis presented in the text.
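
The cofitness comment above assumes a Pearson correlation between fitness profiles and suggests a permutation test. A minimal sketch of both follows, under the assumption that each gene's fitness profile is a list of fitness values over the same ordered set of conditions. The function names, the one-sided test, and the example profiles are hypothetical, and this is not a reconstruction of the manuscript's method, which the referee notes is not stated.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length fitness profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """One-sided permutation test: how often does shuffling the conditions
    of y give a correlation at least as high as the observed one?"""
    rng = random.Random(seed)
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(x, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical fitness profiles of two genes across eight conditions.
gene_a = [-0.1, -2.3, 0.2, -1.8, 0.0, -2.9, 0.1, -0.4]
gene_b = [0.0, -2.0, 0.1, -1.5, -0.2, -3.1, 0.3, -0.6]
cofitness = pearson(gene_a, gene_b)
p_value = permutation_p_value(gene_a, gene_b)
```

Shuffling conditions within one gene gives a null distribution for a single pair; the referee's suggestion of comparing against random pairs of unrelated genes would need a different null built across many gene pairs.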

Minor comments:

- The authors use the term ‘bacteria’ very non-specifically. They should define the operational taxonomic units (OTU) they are working with, especially given their use of 6 pseudomonads of close phylogenetic proximity.

- The examples chosen to illustrate the usefulness of the data set seem like somewhat random choices. For example, the comparison of phenotypes chosen for Figure 1C (and Sup Fig. 2) does not necessarily convince me. I would prefer to see a higher level analysis at this point, for example a comparison of fitness across all carbon sources. Other possible questions to explore: Do phenotypes for genes with similar functional annotations cluster across operational taxonomic units? Is there an enrichment of expected classes of annotations across OTUs?

- Use of the term ‘significant’ is inconsistent throughout the paper; in some instances it refers to one of the fitness classifications, whereas in other instances it is unclear if it is used as a description of statistical significance or not. I would recommend choosing a different term for the fitness classification to avoid this confusion, or making clear when the term is used as statistical significance. (I understand that the classified fitness only includes phenotypes that are statistically significant, my comment still remains.)

- Use of the terms gene vs. protein is inconsistent throughout the manuscript and should be evaluated more carefully.

- Line 91: I would be interested in a rationale on why these 25 bacteria were chosen. This does not have to be extensive, just giving an insight to readers from diverse fields.

- Line 102: What percentage of genes could be annotated using the chosen databases?

- Line 105: Identify TIGRFAMs as a database for protein families (for the wider readership of Nature)

- Line 114: define ‘utilization or inhibition’

- Lines 120-128: make more concise

- Line 132-134: Here, the authors introduce their phenotype classifications. This needs to be described more carefully. For example, I oppose the use of the term ‘no phenotype’ and would strongly prefer ‘neutral phenotype’, as the phenotype of mutant strains with fitness values of 0 (or ‘near’, as the authors describe it here somewhat insufficiently) is similar to the wild-type. In addition, the remaining phenotype classifications (‘strong’ and ‘significant’) need to be defined more carefully when they first appear.

- Line 134-145: The authors should comment on why they chose these OTUs as examples here. Do the observations described hold across genera/divisions? Additionally, the whole paragraph was hard to follow, and a re-structuring would help clarity.

- Line 147: define ‘significant phenotype’

- Line 140: insert ‘concentration’ after ‘zinc’

- Line 161: what is the percentage of non-annotated genes that could be annotated in this study?

- Line 175: “significant and notable” is a vague description – or is it an additional classification for fitness data, distinct from the classification ‘significant’ used in Fig. 1 and below? The authors should clarify.

- Line 177: In what percentage of bacterial species?

- Line 191: The sentence seems to end half way through.

- Line 197-199: tighten

- Line 210: Define cofitness in a more straight-forward way; The authors should also include the threshold that is considered significant, and how this cutoff was decided upon (see major comments).

- Line 216: Insert “, SO1332 and SO3965,” after “PtsP”

- Line 217: Insert a transition between the two paragraphs for easier reading.

- Line 224: What do the authors refer to when they use the term ‘nearby’? I assume distance in the genome, but it should be clarified. Also, please note why you exclude nearby genes in this prediction (I assume to exclude a bias through genes in operons, which are often functionally related?).

- Line 228: The authors should define the significance they are referring to here – how is it measured? Which statistical test was used? (see major comments)

- Line 231: should read “Fisher’s exact test”

- Lines 228-238: The whole paragraph is very hard to read. Please clarify, since I think this is a central point of this manuscript.

- Line 242: Give percentages for successful annotations.

- Line 252: clarify ‘important’

- Line 260-261: I couldn’t follow the authors’ argument here. Please clarify.

- Lines 282-289: The authors should describe the results shown in Figure 3b more clearly.

- Line 283: “Pseudomonads” should not be italicized.

- Line 285: From my quick check of the citation, it seems that both publications only deal with P. inhibens; furthermore, I couldn’t find the connection between the D-xylose pathway and the lacI-like regulator. Please comment.

- Line 303: Define “statistically significant”

- Line 312: “Pseudomonads” should not be italicized.

- Line 319: Please clarify “but the association to a specific condition was misleading”

- Line 333: Why does Fig. 4a only show 4 of the 5 OTUs mentioned in the text?

- Line 339: Please clarify “adjacent” (in the genome? – see above)

- Lines 340-344: I found this paragraph to be very confusing, the authors should consider restructuring it. I would first mention that in all defined media, sulfate is the only source for sulfur, followed by the hypothesis that cysI’s function as a sulfite reductase could explain the phenotype.

- Line 349: I would suggest first explaining the YeaG pathway and involvement in triggering correct response to nitrogen starvation. This would greatly enhance the understanding of the following paragraph.

- Line 356: insert ‘sole’ before “nitrogen source”

- Line 358: Why did the authors test a different pathway for S. oneidensis? The authors should also discuss the dual phenotype that is observed for this group of genes in S. oneidensis as compared to E. coli: positive fitness values in some nitrogen sources, yet negative in others, plus negative fitness across a variety of carbon sources. In comparison to this, E. coli as well as P. fluorescens FW300-N2E2 show strong phenotypes only for a single defined condition. This should be discussed.

- While I understand the point the authors are trying to make with Figure 5 and the last paragraph of the Results section (lines 364-390), I am not convinced this figure is the best possible way to achieve this.

- Lines 372 and 374: Fig. 5 has only 1 panel; remove ‘a’ from reference.

- Line 570: Did the different growth formats affect controls? (simple heatmap in sup. materials would be fine)

- Lines 586-599: I found the writing in this paragraph to be quite sloppy and would ask the authors to make their description more concise.

- Lines 597&598: For concentration of different drugs, methods detail “multiple mutant fitness assays with different concentrations”. This raises the question if the authors expected strong differences, and what that in turn would indicate. (It could well be that it is a problem of describing the method of determining a useful concentration that would give reliable fitness data across wt and all mutant clones; the description should then be changed accordingly. Please comment and adjust description accordingly.)

- Line 622: I didn’t find the extended stationary phase as a condition in the Fitnessbrowser when searching for ‘survival’ (only low temperature experiments came up).

- Lines 683-685: Please restructure for clarity and conciseness.

- Line 692: Are the final thresholds for each organism studied listed somewhere? I am wondering what the different thresholds tell us, and if there are patterns across the phylogenetic tree?

- Lines 693-698: I would suggest dedicating a section of the supplementary material to this issue. Can the authors rule out an effect of the specific organism’s ‘stickiness’ on the correct determination of fitness values for genes?

- Line 735-737: The authors should clarify what they consider as an ‘important’ phenotype.

- Lines 744-745: clarify

- Lines 758-780: This information would better be presented in a supplementary table or figure. The complete paragraph is just a list of annotations, but the actual method is described pretty poorly.

- Line 783 & 785: insert ‘entry’ after Pfam and TIGRFAM.

- Line 787: Remove reference to Fig. 5a

Comments to figures:

- Figure formatting needs work. For example, the use of panel headlines is unorthodox and in some instances misleading (see below). Furthermore, it leads to crowded figures (esp. figure 2).

Fig. 1)

b) Phenotype classifications are not defined properly (cutoffs for phenotype classification ‘strong’ vs. ‘significant’ need to appear in figure legend). What does ‘no data’ refer to (and how is it different from ‘essential’)? x-axis label should include the notion that the protein-coding genes that are examined here are predicted to be such

d) See above for x-axis labeling

Fig. 2)

a) Show E. coli CrcB, as a comparison

b) Panel headline is confusing

c) and f) See Fig. 1b) for x-axis labeling

Fig. 3)

a) Define difference between ‘essential’ and ‘no data’

b) legend: Lines 974-977: The observed phenotype of xylD in S. koreensis should be discussed in the main text of the manuscript. According to this phenotype, the description in the legend should also be adjusted: ‘In addition, putative … if they have not previously been implicated in D-xylose utilization.’ The next sentence should be removed from the legend.

Fig. 4)

a) While this panel lists the 4 bacteria Phaeobacter inhibens BS107, Pseudomonas fluorescens FW300-N2E2, Pseudomonas simiae WCS417 and Shewanella oneidensis MR-1, the corresponding Supplementary Table 11 lists a specific phenotype in relation to Glycine for a different set of bacteria (with two overlaps): Pseudomonas fluorescens FW300-N2E2, Pseudomonas fluorescens GW456-L13, Pseudomonas simiae WCS417, Pseudomonas stutzeri RCH2 and Shewanella amazonensis SB2B. Please explain this discrepancy.

Phaeobacter inhibens BS107 does not appear in the list of bacteria containing UPF0126, whereas Shewanella oneidensis MR-1 is listed with a specific phenotype in relation to L-Alanine.

c) and d) I would suggest clustering the 3 different organisms together – with this, similar conditions will be more comparable across species.

d) I see different patterns for the three organisms. It would thus be interesting to see a more detailed comparison of the different types of fitness vectors. The authors should also address if similar patterns exist across the remaining 3 organisms mentioned in the main text. Can we learn more?

Fig. 5)

Use same annotations as throughout the manuscript. Does ‘any phenotype’ only include significant phenotypes?

Term ‘hypotheticals’ in y-axis is unclear

Legend does not sufficiently explain what is shown.

Supplementary material:

Table S3: Clarify ‘somewhat conservative threshold for growth’

Table S4: list final concentrations during experiment

Table S5: add fitness scores for both conditions

Table S10: add ‘vague annotations’ (referred to in line 304) to table

Figure S3 legend: Is ‘strong phenotype’ only defined for negative fitness values <-2 or in both directions? In line 3 of the legend, remove ‘ between “ and paralogs.

Figure S5: A stronger phenotype might be visible if recovery growth after N starvation was measured.

Supplementary Note 2:

- ‘polar effect’ is used several times, the term should briefly be explained and a reference cited.

- There is some confusion between the Sup. Note description, Sup. Tab. 11, and Fig. 4a about the OTUs that are listed. Please review.

Round 2

For the second submission, the biggest changes were (1) the addition of fitness data for 7 additional bacteria; (2) growth curves for individual mutant strains (in the supplementary figures); (3) reannotation of hundreds of metabolic enzymes (in the supplementary material). The reviewer's comments on this version were:

Referee #1 (Remarks to the Author):

Deep Annotation of Protein Function across Diverse Bacteria from Mutant Phenotypes

This manuscript describes the collection of a large number of phenotypes measured using Tn-seq on a number of bacterial species, now collected in an online database. The major advantage of this work is the excellent online database that will be a resource for the community. The major flaws that will limit interest are: the quality of the data; and the lack of broader insights from a synthesis of the collected data.

I. Data quality limits interpretation

A. One of the most interesting analyses of parallel Tn-seq experiments in diverse bacteria would be of the essential gene set in each organism. Assignment of essentiality is not supported by the quality of data in this study, and no claims of essentiality that the authors make are valid. Comparing the presented E. coli Tn-seq dataset and published gold-standard single-gene deletion experiments (Baba et al. 2006, Yamamoto et al. 2009), the authors find the FDR of essential genes to be 36%, which we find unacceptable. Interpretation of the authors’ data for other species’ essential gene sets is not supported by data of this quality. Further, in their response to our review, the authors make an illogical argument, citing an early Tn-seq report with FDR over 50% (Gerdes et al. 2003) to support the claim that their data is sufficient for the analysis they present. There are currently many better Tn-seq datasets available [e.g., Santiago et al., BMC Genomics 16 (2015)].

B. This work presents two central analyses/insights related to uncharacterized genes: (1) many of them have condition-specific growth phenotypes suggesting their functions, and (2) highly correlated patterns of phenotypes (within and between genomes) suggest shared/related functions. Both the significant phenotypes and correlated fitness (co-fitness) measures are affected by the authors’ ability to accurately measure strain fitness; this is questionable for two reasons. First, the use of technical replicates for the growth experiments is not rigorously reported (for which experiments are there replicates?). Secondly, comparison of strain fitness between experimental conditions requires normalization for number of generations. For experiments in which the population grew more generations, fitness values will have greater magnitude. In the correlation (cofitness) analyses, these conditions will bias the results. Again, we refer to van Opijnen et al. 2013 (Nat Rev Microbiol) for the normalization calculation.

C. The authors did include 7 additional species for a total of 32 species to address our previous comments about the diversity of strains. However, we are still concerned that biases in both the organisms utilized and the phenotypes tested limit their ability to make conclusions. The benefit of assaying phenotypes across a diverse set of bacteria is to identify (1) phenotype-gene relationships that are broadly conserved, and (2) patterns of correlated fitness (cofitness) that are broadly conserved. The former suggests conserved functions of genes in diverse species, and the latter suggests conserved functional modules of gene pairs.

a. Overall the strains used have poorly distributed diversity: majority gamma-proteobacteria, with 5 Pseudomonas fluorescens species.

Are the specific phenotypes the authors identify more likely to be conserved within closely related species?

Are the correlated fitness (cofit) gene pairs more likely to be conserved within closely related species? The authors respond to our question about this in the rebuttal text but the manuscript text is less clear.

b. How are the analyses of conserved phenotypes and/or cofitness affected by the discrepancies in numbers of conditions tested between species (26-129 conditions)? Were closely related species more likely to have been screened under identical conditions and/or similar numbers of conditions?

II. A conceptual issue that should be addressed

Throughout the manuscript text there is a persistent conceptual misunderstanding about phenotypes vs. function. What the authors do is to catalogue phenotypes for transposon mutants. The identification of a phenotype is not sufficient on its own for assigning function to the disrupted gene. Additionally, the correct usage is that genes have phenotypes, not proteins (Lines 50, 51, 76, 88-89, 173, 178, 180, 181, 184, 200, 227, 232, 233, 235, 372-373, 390, 392, 463, etc.).

Example: One result the authors highlight in their dataset is the “13 novel families of DNA repair proteins” (Lines 55-56). Importantly, the genes/families identified are only associated with DNA repair by their sensitivity phenotype in cisplatin: this is not sufficient to support the claim that they are involved in DNA repair.

Referee #2 (Remarks to the Author):

I am happy that the authors have addressed my comments. I hope they will continue to attempt to get this information into public databases once the paper is published.

Referee #4 (Remarks to the Author):

Price et al. present a chemical-genetic screen of unprecedented scale, in which genome-wide libraries of 32 different bacteria are profiled in an average of 76 environmental or chemical conditions. The authors make a concerted effort to harvest the power of the data for inferring protein function, and provide a roadmap on how this process can even lead to functional annotation using other available information and manual curation. This is a colossal amount of work, and the authors have put significant effort into trying to improve the manuscript and address the reviewers’ constructive criticism.

Already from its size (~10 rather than 19 million gene-phenotype measurements – comment 4) this work plays in a league of its own, no matter if comparing to published work in prokaryotes or eukaryotes. More importantly, it illustrates for the first time how broadly such approaches can be used to infer function in non-model organisms, spanning large segments of the evolutionary tree. Although there are many things that could be done better/differently in screen design, data analysis (metrics used – see comment 4), quality control and benchmarking of the screen (comparing to other datasets), data integration with other omics info, defining protein families, and assessing essentials, none of these can hamper my enthusiasm for the broad utility of the information that this study contains and the messages it conveys. Since my role as an extra reviewer is also to assess whether the authors complied with what was asked in the first review rather than to provide a de novo review, I split my comments into 2 levels. The first part echoes criticisms raised in the first review, which I find not fully addressed – these are about presentation and easily fixable. The second part has additional comments, which I find important for improving clarity, but I leave it to the discretion of the authors and editor whether to incorporate them. All of them involve explaining better, clarifying points and avoiding overstatements.

A. Comments pertaining 1st review

1. There is a plethora of terms/words with a specific quantitative connotation, which is not intuitive (some terms are used interchangeably by others – e.g. important and notable; other terms have study-dependent definitions– e.g. orthologues and paralogues) and reader has to search throughout the manuscript to understand what the authors mean. I suggest that the authors include a glossary for all of them as part of the main manuscript (Table), so reader can easily assess the information. Here are some of the terms I gathered: significant, important, notable, neutral, strong, detrimental, specific, co-fit, conserved specific, conserved co-fit, paralogue, orthologue, no data.

2. Although the authors have improved the clarity and conciseness of the manuscript, there is still room for doing a better job on this front. Previous reviewers have spent a lot of effort in this direction, so the authors should ensure they at least address all previous comments raised (not only in responding to reviewers but also in the text). There are also a number of additional points that could improve clarity/readability – for the most part, fewer numbers (when they are not needed or may convey wrong impressions). Without going into much detail, and since text editing is not my job, here are two examples to make the point (there are many more in the text):

a) lines 409-415 (issues identified by reviewer 3, nothing done). Suggested rephrasing: “As a third example, we found that in three bacteria, DUF2849-containing proteins are cofit with an adjacent (adjacent to what? – replace with "the adjacently encoded") sulfite reductase, CysI (Fig. 4c). Sulfite reductases are required for sulfate assimilation, which was the only source of sulfur in our defined media conditions. Thus, DUF2849 is also presumably involved in the same process. The three bacteria containing DUF2849 lack CysJ, which usually provides the electron source for CysI. Since other bacterial genomes that contain DUF2849 also contain cysI but not cysJ, we propose that DUF2849 encodes an alternate electron source for CysI.”

b) the manuscript is full of numbers. Although many are required/important, there is still an overuse. I would urge the authors to recheck if they really need all of them- e.g. do we need ranges for the total number of insertions per genome (line 110), per gene (line 111), and per gene with sufficient data (line 134)? Wouldn’t the last one suffice for the main text and avoid misunderstandings that were raised in 1st review? Does the total number of “likely-essential proteins” (line 114) add something? Especially when we don’t know if same protein across 32 species or species-specific… In other cases, numbers stated may be misleading (see comments 4 and 6).

3. Authors provide reasons on why they could not provide strains to public culture collections or update protein family information. Although I understand the limitations, I would like to point out that there are other big national strain collections besides ATCC (numerous in Europe) and that it would be important for many fields (and for this work to reach its full potential) to follow up with Pfam/InterPro after publication of this paper.

B. Additional main comments

4. Data analysis has issues. If the authors do not want to fix them at this point (points will remain similar, numbers will change), I think it is important that the caveats are spelled out so the reader/user appreciates them – especially for people who will use the data:

a) The fitness score metric does not correct for fitness defects in the unperturbed condition. This means that mutants with fitness defects in rich media will (most likely) carry over this defect in all inhibitors tested in rich media. For such genes, conditional phenotypes are misleading (having a strong negative score in condition X is not because the gene is required specifically for this condition, but more because it is required in general). Judging from ED Fig 2b, there may be more phenotypes coming from such generally-required genes rather than from specific conditional phenotypes. Although the authors do not use such genes in their analysis, they are part of the hits in each condition and they may lead to misleading cofitness associations. Therefore it would be good if the reader is warned about this.

b) Biological replicates are never merged (apart from when calculating genes with significant phenotypes), but reported separately across data and manuscript. The effects of this range from harmless (over-reporting of fitness assays or numbers of gene-phenotype measurements in the abstract – nearly half are from replicates) to more serious: for each cofitness measure, conditions that have more replicates weigh more than conditions with fewer replicates.

I am also surprised that authors discuss little about replicates in the main manuscript and one has to dig in methods to find more about them. At least some graph (distribution?) for reproducibility should be there; from numbers stated in methods reproducibility sounds excellent.
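A small sketch of the replicate-weighting concern in comment b) follows. The fitness values, condition labels, and function names are hypothetical, and averaging replicates is just one way of merging them. It compares the Pearson correlation between two genes when replicates are kept as separate experiments with the correlation after replicates of the same condition are averaged.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def merge_replicates(conditions, values):
    """Average values that share a condition label (first-seen order)."""
    grouped = {}
    for cond, val in zip(conditions, values):
        grouped.setdefault(cond, []).append(val)
    return [sum(v) / len(v) for v in grouped.values()]

# Hypothetical data: one condition with 4 replicates, three without.
conditions = ["cisplatin"] * 4 + ["cond_x", "cond_y", "cond_z"]
gene_a = [-2.0, -2.0, -2.0, -2.0, 1.0, -1.0, 2.0]
gene_b = [-2.0, -2.0, -2.0, -2.0, -1.0, 1.0, 0.0]

r_separate = pearson(gene_a, gene_b)
r_merged = pearson(merge_replicates(conditions, gene_a),
                   merge_replicates(conditions, gene_b))
```

In this toy case the two genes agree only in the replicated condition, so the correlation with replicates kept separate is noticeably higher than the correlation after merging.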

c) Although I mostly agree that normalizing fitness scores by the number of generations should not change drastically the report of statistically significant fitness scores (though keep in mind that variance may not scale linearly, and resolution for low reads will be different), not normalizing the variance of fitness scores between conditions has stronger repercussions in the cofitness calculation. Conditions grown for longer will have larger variance (up to 2x) and thus will contribute much more in the Pearson correlation measure.

5. I find the emphasis on the specific phenotypes for assigning functions somewhat dangerous. First, the current analysis is sensitive to the conditions screened and to a rather coarse definition. If there are similar conditions screened (e.g. there are several aminoglycosides, tetracyclines, beta-lactams or detergents), it is almost impossible for a mutant to be defined as specific even if it has phenotypes only in these conditions. The current definition (no strong phenotype in >95% of experiments) is sensitive not only to the conditions screened but also to the replicates done in each condition! To make it simple, if more DNA-damaging agents had been included in the screen, this would have precluded the authors from defining the phenotypes as specific and doing all the downstream analysis. Maybe this is why this approach is used only for cisplatin and DNA-repair related genes. Second, and more broadly speaking, I think it is an overstatement to say that one can assign function based on the phenotype. Phenotypes provide an inference, but for understanding/assigning biological function more is required, or at the minimum there is a need for function prediction information (e.g. that the protein encodes a transporter) - please see also comment 20 for this.
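The sensitivity of the "specific" label to the set of experiments can be sketched as follows. This uses only the rule as paraphrased in the comment above (no strong phenotype in more than 95% of experiments). The function name, the counts, and the boolean encoding of "strong" are hypothetical, and the manuscript's full definition may include further criteria.

```python
def is_specific(strong_flags, min_fraction_without_strong=0.95):
    """True if the gene lacks a strong phenotype in more than the given
    fraction of experiments (rule as paraphrased in the review)."""
    n_without = sum(1 for strong in strong_flags if not strong)
    return n_without / len(strong_flags) > min_fraction_without_strong

# Hypothetical gene: strong phenotype in 3 of 76 experiments.
baseline = [True] * 3 + [False] * 73

# Same gene if 2 more similar agents had been screened (strong in both).
more_agents = baseline + [True] * 2

# Same gene if the 3 strong experiments had each been replicated once.
more_replicates = baseline + [True] * 3

specific_baseline = is_specific(baseline)
specific_more_agents = is_specific(more_agents)
specific_more_replicates = is_specific(more_replicates)
```

With these counts the gene qualifies as specific at baseline (73 of 76 experiments without a strong phenotype) but loses the label once related conditions or replicates are added, although its biology is unchanged.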

6. Abstract

• first sentence needs reference, it’s not common/textbook knowledge

• instead of giving number of experiments, it would be more informative to give range or average of conditions tested in 32 species to give reader an impression of real dimensions

• provide unique number of gene-phenotype scores (after merging replicates)

• line 52-53: sentence can be moved one sentence down, when talking about protein families

• are these protein families as defined by Pfam or eggNOG? Maybe better to remove families – “13 novel putative DNA repair proteins”

C. Minor comments – pertaining 1st review & additional

7. line 67-69; provide reference

8. line 89: provide average or range (instead of dozens of conditions).

9. lines 133-136: what is sufficient abundance? How does this end up to 3-26 insertions per mutant from 5-66 insertions per mutant?

10. Fig 1b: I would find the right bar plot more informative if X axis (protein-coding genes) was presented as %, and n (number of genes) were given for each organism. This would facilitate comparisons between organisms.

Also change “Mutants per protein” to “Average quantifiable transposon insertions per protein”

11. line 200: define “very few.”

12. Fig 2b: are there 61 proteins or “67% of 101” as stated in text? Is this number high enough to be robust/representative? How is confidence calculated? Define better Y axis.

13. lines 243-243: the authors are surely not the first to use correlation (co-fitness) to functionally associate genes, so they should (also) cite other bacterial or yeast papers that have done this first.

14. Fig 2e: explain better in legend; especially what Y axis means.

15. line 262: “not nearby in the genome” is still vague. Is there a Kb, # of genes cutoff or is it operon structure taken into account?

16. lines 265-269/Fig 2e: slower decay for conserved cofitness – what do the authors think this is based on? More data increases significance? Or because by looking into conserved phenotypes, they enrich for conserved genes that have more phenotypes (so better correlations) and are more likely to have TIGR roles? It may be useful to explore this in the manuscript.

17. line 295: “67 protein families”. Obviously the family definition is suffering if there are 4 families of RecO proteins.

18. line 300; is it relevant if previous studies did not test cisplatin, but identified proteins as important in DNA repair?

19. line 306: it is curious that such central cell cycle/cell division genes (minCD, mrcB) have phenotypes only with cisplatin, and not with stresses much closer to these processes (e.g. A22 or beta-lactams)

20. line 309: I would be careful with predictions. Many of the 13 protein families have homology to cell-division and chromosome partitioning rather than DNA repair proteins (Table S9).

21. Fig. 3a: XerD is not a cell division gene, but a subunit of the Xer recombinase.

22. Fig. 3b: is the term “specific” sensitive to genes that are nearly essential in most organisms, which makes it difficult in getting data in most conditions, and then in few that is possible, they are significant?

23. line 346-47; better state the independent transporters rather than subunits. Same ABC can have 1,2, or more membrane subunits.

24. lines 417-433: YeaH-G-YcgB are very strongly associated by STRING based on genomic context, text-mining and co-expression data. Current info provided is not much more than existing – just makes it more likely that they constitute a functional unit.

25. lines 740-46: what are size of scaffolds, and what do authors mean by copy number? Do they refer to chromosomally versus plasmid encoded genes? Or do they need different normalization across the genome?

Round 3

The third submission had many minor changes, including a more detailed classification of the cisplatin resistance genes; comparisons to alternative ways of computing cofitness; and a more careful analysis of essential genes in E. coli (in a supplementary note). It also had a revised title. The reviewers' comments on this version were:

Referee #1 (Remarks to the Author):

We thank the authors for their thorough and thoughtful revision of this manuscript, and feel that it is now much more accessible and a major contribution to using high throughput approaches to decipher bacterial phenotypes. Our one further request is that we feel that adding the false positive rate to the discussion of essentials in the text would be extremely helpful for downstream use of these data.

Referee #4 (Remarks to the Author):

As mentioned before, the study by Price et al. is impressive, harvesting reverse genetics at an unprecedented scale, constructing and profiling genome-wide libraries for 32 different bacteria in tens of environmental or chemical conditions. The authors have put significant effort in revising the manuscript, addressing many of the reviewers’ comments. Here are some of the points that in my opinion need further resolving.

1. Authors mention the following in response to the comment that their fitness score metric does not correct for fitness defects in the unperturbed condition and that at the least the reader should be warned about this and its possible repercussions (this text is almost identical to what was added in the manuscript):

“We manually examined the phenotypes of the 26 Shewanella loihica PV-4 genes with fitness defects in both experiments from Extended Data Figure 2b. We found that half (13) of these genes are actually detrimental to fitness (fitness > 0 and statistically significant) in at least one other condition and that none of the remaining 13 genes has strong cofitness (r > 0.8) with another Shewanella loihica PV-4 gene.”

I have trouble understanding both the logic and the specifics of the argumentation.

First, this goes beyond this particular organism; I suspect that there are mutants that have growth defects already in basic media (LB) for all 32 strains tested. If you don’t want to correct for this, this is absolutely fine with me, but please warn the reader that in these cases phenotypes are not due to particular stress, but due to growth defects in basic media.

Second, I don’t see how this particular argument (and this weird splitting in 13 + 13 genes) proves that these genes do not have cofitness associations that are unrelated to their function. Do you want to say that genes that have low fitness in LB, and thus in nearly all stresses tested in LB, do not have strong co-fitness correlation at least among them (those that grow better in other background should for sure)? And if they do, would that then mean that these genes are functionally associated, or just required for growth in LB?

2. Authors justify keeping replicates separate in a convoluted manner – which basically boils down to dissimilar genes remaining dissimilar, and top similar genes remaining top similar no matter if they average replicates or not. It would be much easier to show scatter plots of correlations before and after merging (as they do in ED Fig. 10).

In my mind, we do replicates to quality assess an experiment, and to increase the statistical significance of our claims. First part is now covered. Second, is brushed off by saying gene-gene associations are more or less the same no matter if we do it or not. Replicates give the authors the possibility to calculate fitness scores and do their t-test across replicates, providing reader with one number per gene and avoiding confusion for example when fitness scores disagree between replicates, but p-value is good in both replicates. As far as I understand, authors do this anyways for reporting significant phenotypes, so I don’t really understand why not doing for all data and reporting one fitness score and t-statistic per gene-condition.

About cofitness calculation: I understand that the authors want to keep the calculation as is, and it is up to them what to do if they show that correlations stay the same (scatter plots) after merging. Nevertheless, conceptually, I still find keeping replicates separate for calculating similarity of responses (cofitness) wrong. Indeed, there is no perfect representative set of conditions to test, but keeping obviously redundant information (exact same condition) when calculating similarity metrics is plain wrong: it can bias correlations and increases their significance artificially. Although one can get similarly high and significant correlations if one screens 20 times cisplatin and 20 different DNA damaging agents, the amount of false positive biological correlations in the 2 datasets would obviously be very different. In many other phenotyping datasets (e.g. imaging), researchers spend a lot of time looking for such redundant information (features) to remove and reduce the dimensionality of the data, although this is not obvious from the start.

3. Normalizing variance of fitness scores: there was obviously a misunderstanding here. I did not ask to normalize with number of generations, as I agree that this has difficulties and is not applicable to all conditions. I suggested normalizing the variance of the fitness scores in all conditions, which is straightforward and is standard for comparing between conditions and calculating similarity metrics. Though again this is up to the authors. From ED Fig. 10 it seems that this may have little effect in at least the 6 organisms shown; variance may be very similar from the start in these cases. In our experience this improves data, and lowers the threshold for significant correlations.
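As an illustration of what normalizing the variance per condition could look like, here is a minimal sketch. It assumes a genes-by-conditions table of fitness scores and rescales each condition to unit standard deviation across genes. The function name, the data, and the choice of the population standard deviation are assumptions made for the example, not the reviewer's or the authors' procedure.

```python
import statistics

def scale_conditions_to_unit_variance(fitness):
    """Divide each condition (column) by its standard deviation across genes.

    fitness: list of rows, one per gene, each a list of scores per condition.
    """
    n_conditions = len(fitness[0])
    sds = [statistics.pstdev([row[j] for row in fitness])
           for j in range(n_conditions)]
    return [[row[j] / sds[j] for j in range(n_conditions)] for row in fitness]

# Hypothetical scores for 4 genes in 3 conditions; the third condition
# has twice the spread of the first (e.g. it was grown for longer).
fitness = [
    [-1.0, 0.5, -2.0],
    [0.0, -0.5, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
]

scaled = scale_conditions_to_unit_variance(fitness)
scaled_sds = [statistics.pstdev([row[j] for row in scaled]) for j in range(3)]
```

After scaling, every condition has the same spread, so no single condition dominates a gene-gene Pearson correlation merely because its scores vary more.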

4. Line 45 (abstract), line 66 (intro): is it half or one third of the bacterial genes that cannot be annotated?

5. Fig 3a/lines 306-358: this part has become longer but more confusing.

First, numbers don’t add up, and often what’s in text does not agree with Fig – e.g. there are 13 families that are probably not involved in DNA repair shown in Fig. 3a, but 15 stated in text (line 353). Families that are reasoned not to be involved in DNA repair in text (DUF3584 and yghF – lines 335 and 351) are dubbed as “possible DNA repair” in Fig. 3A. More confusion arises because authors sometimes talk about one family in text (e.g. DUF1654 – line 339), but this counts as “multiple families” in Fig. 3A (3 DUF1654 families). Part of the problem is the family definition and that bidirectional BLAST is a crude way of defining protein families (that’s why I said definition suffered in previous round). For example, Uniprot has one RecO family instead of 3 (http://www.uniprot.org/uniprot/?query=family:%22RecO+family%22), and has proteins from organisms in all 3 groups defined here. The authors do not have to change their family definition (I understand that perfecting that does not add to any message they make), but please stay consistent in text and figures.

Second, there are some simplifications or misstatements in this part. LexA regulation is not enough to predict that something is involved in DNA repair – just that it is required during DNA damage (e.g. one of the strongest LexA regulated proteins is SulA, which inhibits cell division). DNA damage leads to filamentation because it directly leads to cell division inhibition, not because it stops DNA replication (lines 323-24). Last, regarding the 3 protein families which have been previously shown to be involved in DNA repair and found as part of cisplatin resistance here: I would still treat the new info as validation, as all 3 studies cited have more conclusive evidence on their DNA repair function, and actually UV and mitomycin C cause very similar DNA damage as cisplatin, and require largely the same DNA repair pathways for fixing it.

6. YeaGH-YcgB: I still stand by my comment that a big part of what the authors conclude about these proteins (act as a signaling functional unit) could be made without their data, from STRING – based on both genomic context and experimental data (co-expression and textmining). The new info is that function is possibly different across bacteria. This is simply addressable by the authors acknowledging previous info. Actually, I don’t see this as a minus, and I would take it as further validation that their screen and analysis is valid.

Round 4

The fourth submission had additional information on alternative ways to compute cofitness (in the supplementary figures); more detailed information on the availability of the data and the mutant libraries; and a shorter title and abstract. It was provisionally accepted, subject to a reduction in length from ~5,400 words to under 3,000 words.

Please address any comments to Morgan Price