Analysis Procedure


Data Analysis
Data analysis for the sorbitol, pH 8, 1 M salt, Nystatin, minimal medium and galactose conditions was performed as follows. Between 10 and 15 hybridizations were collected for the control condition (YPD at 30°C) at each generation time (5 and 15 generations). The data were centered using the mean intensity across the chip after censoring poor readings, as described in the preprocessing section. For each generation time, a Gaussian distribution was fit to the base 10 logarithm of the signal intensity of each tag across all control hybridizations. In any particular experiment (defined as a condition measured at 5 or 15 generations), the likelihood of observing each tag's intensity under the control distribution was calculated. The fitness of a strain was then found by averaging the likelihoods of the 4 tags associated with that ORF (see below). For computational and data presentation purposes, the negative natural log of this value was taken and dubbed the 'fitness defect score'. Therefore, the larger the fitness defect score, the greater the probability that a strain has a significant growth defect in the condition tested. In a statistical sense, this is a test of the hypothesis that the intensity observed in the condition experiment could have arisen under the control condition (YPD at 30°C). For complete data sets see: http://genomics.lbl.gov/YeastFitnessData/websitefiles/cel_index.html. Based on comparisons with known and observed biological significance, our cutoff for significance included strains with fitness defect scores greater than 20 in both 5-generation experiments and greater than 100 in both 15-generation experiments. This cutoff is stringent enough that we are confident these strains exhibited a reproducible decrease in fitness.
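
As an illustration only, the scoring steps described above can be sketched in Python. The sketch assumes that 'likelihood' means the Gaussian probability density evaluated at the observed log10 intensity and that the inputs are already centered and log10-transformed; the original analysis may have used a different definition. All function names are hypothetical. The average is taken in log space so that very small likelihoods do not underflow.

```python
import math
import statistics

def fit_control(log10_intensities):
    """Fit a Gaussian (mean, sample standard deviation) to one tag's control values."""
    return statistics.mean(log10_intensities), statistics.stdev(log10_intensities)

def gaussian_log_density(x, mu, sigma):
    """Natural log of the N(mu, sigma^2) density at x (assumed meaning of 'likelihood')."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def fitness_defect_score(observed_log10, control_fits):
    """Negative natural log of the mean likelihood over a strain's tags."""
    logs = [gaussian_log_density(x, mu, sigma)
            for x, (mu, sigma) in zip(observed_log10, control_fits)]
    peak = max(logs)
    # log of the arithmetic mean of the likelihoods, computed stably
    log_mean = peak + math.log(sum(math.exp(v - peak) for v in logs) / len(logs))
    return -log_mean

def is_significant(scores_5gen, scores_15gen):
    """Cutoff from the text: > 20 in both 5-generation and > 100 in both 15-generation experiments."""
    return all(s > 20 for s in scores_5gen) and all(s > 100 for s in scores_15gen)

# Hypothetical control data: 4 tags, each with 4 control hybridizations.
control_fits = [fit_control([3.0, 3.1, 2.9, 3.0]) for _ in range(4)]
score_near = fitness_defect_score([3.0, 3.0, 3.0, 3.0], control_fits)
score_far = fitness_defect_score([2.0, 2.0, 2.0, 2.0], control_fits)
```

Because a density can exceed 1, a strain whose intensities sit at the control mean can receive a slightly negative score under this assumption; only large positive scores are of interest.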

Because the genes required for growth in the lys-, trp- and thr- media are confounded with those required for growth in minimal medium, a filter was applied to these data. A likelihood ratio test between the amino acid dropout condition and the minimal medium condition accomplishes this filtering. If the likelihood ratio was greater than 15, we considered the strain to have a significant fitness defect in the dropout medium only.
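
The text does not spell out how the likelihood ratio was formed, so the following Python sketch is one possible reading rather than the procedure actually used. It assumes the ratio is the strain's likelihood in minimal medium divided by its likelihood in the dropout medium, both derived from fitness defect scores (negative natural logs of likelihoods). The function name and the direction of the ratio are assumptions.

```python
import math

LR_CUTOFF = 15.0  # threshold quoted in the text

def dropout_specific(score_dropout, score_minimal, cutoff=LR_CUTOFF):
    """True when L_minimal / L_dropout exceeds `cutoff` (assumed form of the ratio).

    Scores are negative natural logs of likelihoods, so
    L_minimal / L_dropout = exp(score_dropout - score_minimal).
    The comparison is done in log space to avoid underflow.
    """
    return (score_dropout - score_minimal) > math.log(cutoff)
```

Under this reading, a strain that scores 30 in a dropout medium and 5 in minimal medium passes the filter, whereas one that scores 30 and 29 does not.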

To identify slow-growing strains in the control condition (YPD at 30°C), we calculated the ratio of the average intensity of a given strain taken directly from the -80°C thaw to its average intensity after ~10 generations of growth. The data for these experiments were generated from 10 independent experiments. The strains with the highest ratios at 10 generations are those that grow most slowly in YPD. By comparing these intensity ratios to individual strain growth curves, we can estimate the growth rate of each strain relative to wild type (see Figure S1).
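
The ratio described above is simple arithmetic and can be sketched as follows. The strain names and intensity values are hypothetical, chosen only to show the calculation.

```python
import statistics

def slow_growth_ratio(thaw_intensities, grown_intensities):
    """Mean post-thaw intensity divided by mean intensity after ~10 generations."""
    return statistics.mean(thaw_intensities) / statistics.mean(grown_intensities)

def rank_slow_growers(strains):
    """Return strain names sorted from highest ratio (slowest growth) to lowest."""
    ratios = {name: slow_growth_ratio(thaw, grown)
              for name, (thaw, grown) in strains.items()}
    return sorted(ratios, key=ratios.get, reverse=True)

# Hypothetical strains: (post-thaw intensities, intensities after ~10 generations).
example = {
    "strain_A": ([2000.0, 2200.0], [1900.0, 2100.0]),
    "strain_B": ([2000.0, 1800.0], [400.0, 600.0]),
}
ranking = rank_slow_growers(example)
```

Here strain_B loses most of its signal during growth, so it has the higher ratio and ranks first.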

Tag and microarray Preprocessing
Each deletion strain is associated with 4 hybridization signals on the high-density oligonucleotide array: UPTAG sense, UPTAG antisense, DNTAG sense and DNTAG antisense. To identify tags that do not hybridize to the array well enough to make valuable predictions, a distribution describing background tag behavior was generated using ~8,000 tags on the array that are not associated with any strain. These oligonucleotides represent the background intensity of the array. A small fraction of these background oligonucleotides was found to cross-hybridize to tags in the strains; these were eliminated from further analysis. For each tag distribution (generated from 10 time-zero hybridizations), a Kolmogorov-Smirnov test of distributional similarity was applied (1). The null hypothesis is that the two samples (background and tag) come from the same underlying distribution. The 1663 tags with a p-value greater than 0.05 by this test were considered too similar to background to yield predictive results and were eliminated from the analysis. An additional 77 tags were discarded because they hybridized with signal levels below that of the background distribution (signal intensity less than 400). Overall, 95.3% of the tags were included in the analysis.
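
The tag filter described above can be illustrated with a small, self-contained Python sketch. It uses a two-sample Kolmogorov-Smirnov statistic with the asymptotic Kolmogorov series for the p-value, which is only an approximation for samples as small as 10 hybridizations; a real analysis would more likely call a statistics library. Applying the 400 intensity floor to the mean signal of a tag is an assumption, and all function names are hypothetical.

```python
import bisect
import math

P_CUTOFF = 0.05     # tags with a p-value above this resemble background
MIN_SIGNAL = 400.0  # intensity floor quoted in the text

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def ks_pvalue(d, n_a, n_b, terms=100):
    """Asymptotic two-sided p-value: 2 * sum_j (-1)^(j-1) * exp(-2 j^2 lambda^2)."""
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if lam < 0.2:
        return 1.0  # the series equals 1 to within about 1e-12 in this range
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def keep_tag(tag_intensities, background_intensities):
    """Keep a tag that differs from background and clears the intensity floor."""
    d = ks_statistic(tag_intensities, background_intensities)
    p = ks_pvalue(d, len(tag_intensities), len(background_intensities))
    mean_signal = sum(tag_intensities) / len(tag_intensities)  # assumed summary
    return p <= P_CUTOFF and mean_signal >= MIN_SIGNAL

# Hypothetical intensities, for illustration only.
background = [float(v) for v in range(100, 500, 2)]
strong_tag = [3000.0 + 50.0 * i for i in range(10)]
weak_tag = [120.0, 160.0, 210.0, 250.0, 290.0, 330.0, 370.0, 410.0, 450.0, 480.0]
```

With these values, `keep_tag(strong_tag, background)` retains the tag, while `keep_tag(weak_tag, background)` rejects it because its distribution is indistinguishable from background.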

Comparison of expression data to fitness data
A Spearman's rank correlation test was performed comparing the expression data of the genes that exhibited a significant increase in expression (as defined by the authors) with the fitness data of the genes that exhibited a significant fitness defect; no correlation was found. However, because the cutoff points for significance in both datasets are somewhat arbitrary, it is difficult to draw strong conclusions from statistical tests alone, and it is possible that a subset of the data exhibits such a correlation. For these reasons, we opted not to include any further statistical tests of association, as the results could lead to a false conclusion. Instead, we report the percentage of significantly up-regulated genes that also have a significant fitness defect score (see Expression Comparison Table).
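
The reported percentage is a set overlap, which can be sketched as follows. The gene identifiers are hypothetical placeholders, not values from the study.

```python
def percent_with_fitness_defect(upregulated, fitness_defect):
    """Percent of up-regulated genes that also show a significant fitness defect."""
    upregulated = set(upregulated)
    if not upregulated:
        return 0.0
    shared = upregulated & set(fitness_defect)
    return 100.0 * len(shared) / len(upregulated)

# Hypothetical gene identifiers, for illustration only.
up = ["gene_1", "gene_2", "gene_3", "gene_4"]
defect = ["gene_2", "gene_4", "gene_9"]
overlap = percent_with_fitness_defect(up, defect)
```

Two of the four up-regulated genes in this example also appear in the fitness defect list, giving an overlap of 50 percent.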

1. Chakravarti, I. M., Laha, R. G. & Roy, J. Handbook of Methods of Applied Statistics, Volume I (John Wiley and Sons, 1967).