WCS417

Fitness Data for WCS417

188 condition samples (148 successful), Thu Apr 7 13:22:24 2016, statistics version 1.0.3

Plots

Quality of experiments
- Experiment quality
  - Grey: Time0, i.e. negative control; red: failed; green: OK; blue: OK, but maxFit > 5
  - After the main page there are separate pages for subsets of experiments by date and lane.
- Rho12 for each experiment (genes with |t| > 3 are in red)
Clustering of experiments (rotate clockwise to view)
- Clustering of log-ratios
- Clustering of log-counts
Distribution of cofitness
Chromosomal bias for each experiment

Tables

Experiments
- Experiments and quality scores
- Detailed metadata for experiments
Genes
- Gene annotations
- Gene fitness (normalized log2 ratios for successful experiments)
  - or for all experiments
  - (To see why experiments were not successful or were "ignored" and not analyzed, check the log.)
  - Unnormalized logratios
  - Naive unnormalized logratios
- t-like test statistic (based on consistency of measurements for the gene)
  - Or estimated standard error (based on variation across strains)
  - Or naive standard error (based on total counts)
- Top cofitness hits
- Specific phenotypes
- Counts per gene per sample (#reads, summed over strains)
Strains
- BarSeq count for each strain by each sample (large)
- Strain fitness (log2 ratios, normalized, large)

R image (large)
Log file from processing in R
- At the top, "Ignoring" lists the samples that were not analyzed at all, either because Drop=TRUE in the metadata or because there were too few reads for the sample. These samples are not included in any of the tables.
- At the bottom, the log lists which experiments were considered not successful and why:
  - "low_count" -- there were not enough reads for the median gene (gMed < 50). Either there were too few cells or something went wrong during DNA extraction or PCR. Alternatively, there might be strong positive selection, i.e. gMean > 50, gMean >> gMed and maxFit >> 1. If there was strong positive selection, then the genes with strongly positive fitness values might carry useful biological information. Check the quality metrics table and then the table of gene fitness for all experiments (not just successful ones).
  - "high_mad12" -- the fitness estimates from the two halves of a gene were not very consistent (mad12 > 0.5). This could result from variable lag or if only a small subset of cells were able to grow.
  - "low_cor12" -- the rank correlation of the fitness values for the two halves of a gene was too low (cor12 < 0.1). In other words, there wasn't much biological signal. This could be a sign that the cells did not grow enough or that the condition is too similar to the condition that the pool was made under (so that all sick strains are gone). If you have replicates then you might still be able to get high quality results by averaging them.
  - "high_adj_gc_cor" -- Suspicious correlations indicate a problem with GC bias or with normalization (|gccor| > 0.2 or |adjcor| > 0.25).
Source code: see bin/RunFEBA.R and lib/FEBA.R in the dev branch of the FEBA BitBucket repository
Input files for RunFEBA.R: genes, exps, all.poolcount (large!)
Also see the pool

Documentation

Gene Fitness

Gene fitness is a log₂ ratio. It is normalized so that genes with no phenotype should have values near zero. Ideally, genes that are very sick (incapable of growth in the condition) should have values around -6 if the experiment ran for 6 generations. In practice, values below -2 or -3 indicate that mutants in the gene are very sick, and values around -1 indicate a mildly deleterious phenotype for mutants in that gene. On the other hand, if a gene's activity is deleterious, then the fitness values will be positive.

Gene fitness is calculated from strain fitness:

Strain fitness = log₂ ( C₁ + #Reads in Sample ) - log₂( C₀ + #Reads in Time0 ), where C₁ and C₀ are small constants to avoid taking the logarithm of 0.
Unnormalized gene fitness = C + weighted average of strain fitness, where C is chosen so that the median gene's unnormalized fitness is zero.

The method for averaging gene fitness across strains may change in the future.
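
The two formulas above can be sketched in Python as follows. This is a minimal illustration, not the FEBA R code: the function names are hypothetical, the constants C₁ and C₀ default to 1 only for the sake of the example, and the per-strain weights are assumed to be supplied by the caller because the weighting scheme is not described here.

```python
import math
from statistics import median

def strain_fitness(reads_sample, reads_time0, c1=1.0, c0=1.0):
    """log2(c1 + reads in sample) - log2(c0 + reads in Time0).

    c1 and c0 are the small constants that avoid log2(0); the value 1.0
    is an assumption made for this example.
    """
    return math.log2(c1 + reads_sample) - math.log2(c0 + reads_time0)

def unnormalized_gene_fitness(strain_fit_by_gene, weights_by_gene):
    """Weighted average of strain fitness per gene, shifted by a constant C
    so that the median gene has an unnormalized fitness of zero."""
    averages = {}
    for gene, fits in strain_fit_by_gene.items():
        weights = weights_by_gene[gene]  # hypothetical caller-supplied weights
        averages[gene] = sum(f * w for f, w in zip(fits, weights)) / sum(weights)
    shift = -median(averages.values())  # the constant C
    return {gene: value + shift for gene, value in averages.items()}
```

For example, a strain with 3 reads in the sample and 15 reads at Time0 gets log₂(4) - log₂(16) = -2 under these assumed constants.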

Normalization for Chromosomal Bias

Depending on the growth phase of the sample, the copy number of the chromosome may be higher near the origin than near the terminus. If the treated and Time0 samples were growing at different rates, then there will be variable recovery of barcodes near the origin that does not relate to the fitness of the genes. This is plotted for each experiment in the chromosome bias plots (note that the y axis is the unnormalized fitness). To remove this effect, we subtract the running median of fitness (the line in those plots). Then, a constant is added so that the mode (the peak of the distribution of gene fitness values) is zero.
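
As a rough sketch of this normalization (not the FEBA R implementation), the Python below subtracts a centered running median from fitness values listed in genome order and then shifts the result so that a crude histogram-based estimate of the mode is zero. The window size, the bin width, the handling of the ends of the scaffold, and the way the mode is estimated are all assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import median

def running_median(values, window=5):
    """Median of a centered window around each position (truncated at the ends)."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def estimate_mode(values, bin_width=0.1):
    """Crude mode estimate: the center of the most populated histogram bin."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    best_bin = max(bins, key=bins.get)
    return (best_bin + 0.5) * bin_width

def normalize_chromosomal_bias(fitness_in_genome_order, window=5, bin_width=0.1):
    """Subtract the running median, then shift so the estimated mode is zero."""
    trend = running_median(fitness_in_genome_order, window)
    detrended = [f - t for f, t in zip(fitness_in_genome_order, trend)]
    mode = estimate_mode(detrended, bin_width)
    return [d - mode for d in detrended]
```

Because the mode is taken from histogram bins, it is only accurate to about half a bin width; a smoother density estimate would be a reasonable substitute.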

Also, in different preparations of genomic DNA, the efficiency of recovering plasmids can vary. So, for each plasmid (if this organism has any), the median fitness of the genes on the plasmid is set to zero. Plasmids with very few genes cannot be normalized and so their genes are excluded.
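
The per-plasmid step can be sketched as below. This is a hedged illustration only: the function name, the dictionary-based inputs, and the min_genes cutoff are hypothetical, since the number of genes that counts as "very few" is not stated here.

```python
from statistics import median

def center_plasmid_genes(fitness_by_gene, scaffold_by_gene, plasmids, min_genes=10):
    """Set the median fitness of each plasmid's genes to zero.

    Genes on plasmids with fewer than min_genes genes are dropped.
    min_genes is an assumed cutoff, not a documented FEBA value.
    """
    result = dict(fitness_by_gene)
    for plasmid in plasmids:
        genes = [g for g in fitness_by_gene if scaffold_by_gene[g] == plasmid]
        if len(genes) < min_genes:
            for g in genes:
                del result[g]  # too few genes to normalize, so exclude them
            continue
        plasmid_median = median(fitness_by_gene[g] for g in genes)
        for g in genes:
            result[g] = fitness_by_gene[g] - plasmid_median
    return result
```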

If the genome sequence is not complete and is in many fragments, then it is not easy to tell if there is an effect of proximity to the terminus. Assuming that each scaffold is small, there will not be a significant variation in copy number across the scaffold. Since each scaffold is normalized separately, this should correct for the varying copy numbers of the scaffolds (but this has not been verified). Also, since it is difficult to distinguish a small scaffold from a plasmid, genes from small scaffolds are excluded.

The quality scores in the table of experiments

name -- set1H1 means index H1 in set1 for that organism. These experiment names are the column names in most of the other tables.
short -- shortened description. Samples with the same value are replicates.
t0set -- which set of Time0 samples this was compared to. If this is a Time0 sample, then the self-counts were subtracted out of the other side of the comparison as a negative control.
num -- another unique identifier, the row number in the input exps file. Used in the plots for compactness and shown in the cluster diagrams with "#", i.e. "set1H1 #3".
nMapped -- #reads for that sample that corresponded to a known strain
nPastEnd -- #reads that corresponded to a strain that has an insertion within the suicide vector instead of within the genome.
nGenic -- #reads that correspond to insertions within genes
nUsed -- #reads that lie within the central 10-90% of a gene. Only these reads are used to estimate gene fitness.
gMed -- median reads per gene in the sample. Values under 50 suggest that the experiment failed to generate useful information about most genes.
gMedt0 -- median reads per gene in the corresponding t0 sample(s). Values under 50 suggest that the experiment might have failed.
gMean -- average number of reads per gene in the sample. This can be far higher than gMed because of a strong skew in the sampling of gene mutants. This usually indicates strong positive selection for mutants in a few genes.
cor12 -- (called rho12 in the plots) -- A measure of how consistent the fitness data is for each gene. For each gene, we estimate the (normalized) fitness using only insertions within the first half of each gene (10-50%) or only within the second half of each gene (50-90%). cor12 is the Spearman rank correlation of those two sets of values. Values below 0.2 suggest that the experiment might have failed.
mad12 -- Another measure of consistency: for each experiment, the median absolute difference (m.a.d.) between the fitness according to the first half of the gene and the fitness according to the second half of the gene. Values above 0.5 suggest that the experiment might have failed.
mad12c, mad12c_t0 -- The m.a.d. of the log2 counts for the 1st and 2nd half of the gene, in the treated sample or the Time0 sample.
opcor -- A measure of how consistent the fitness data is for each operon. For each pair of adjacent genes that are predicted to be in the same operon, we take the fitness values for the upstream and downstream gene. This is the Spearman rank correlation of those two sets of values. Genes in the same operon often have related functions, so opcor < 0.2 suggests that the experiment might have failed.
adjcor -- A measure of how consistent the fitness data is for nearby genes that probably do *not* have related functions. This is like opcor but is computed using adjacent genes that are not on the same strand. Values above 0.1 may be an indication of problems with the BarSeq PCR (i.e. GC bias) or with the normalization.
gccor -- A test of GC bias, this reports the Pearson correlation between a gene's GC content and its fitness. Values above 0.1 or below -0.1 may be an indication of problems with the BarSeq PCR.
maxFit -- The highest fitness value of any gene. Values above 5 indicate strong positive selection in the sample. This often explains why an experiment failed, either because most genes have few reads (low gMed) or because only a few strains grow.
u -- Whether the experiment was successful and is used in global analyses.
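
To make the consistency metrics concrete, here is a minimal Python sketch of cor12, mad12 and gccor as they are described in the list above. It is not the FEBA R code. The inputs are assumed to be parallel lists with one entry per gene (first-half fitness, second-half fitness, GC content, fitness), and ties are handled with average ranks, which is an assumption.

```python
from statistics import median

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def cor12(fit_half1, fit_half2):
    """Spearman rank correlation between the two half-gene fitness estimates."""
    return pearson(average_ranks(fit_half1), average_ranks(fit_half2))

def mad12(fit_half1, fit_half2):
    """Median absolute difference between the two half-gene fitness estimates."""
    return median(abs(a - b) for a, b in zip(fit_half1, fit_half2))

def gccor(gc_content, fitness):
    """Pearson correlation between each gene's GC content and its fitness."""
    return pearson(gc_content, fitness)
```

opcor and adjcor would use the same cor12-style rank correlation, applied to pairs of adjacent genes rather than to the two halves of one gene.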

As of April 9, 2014, the requirements for a successful experiment are:

not a Time0 (Time0 "fitness" values are computed as a negative control)
gMed ≥ 50
mad12 ≤ 0.5
cor12 ≥ 0.1
|gccor| ≤ 0.2 and |adjcor| ≤ 0.25

(The rules are implemented in FEBA_Exp_Status() in FEBA.R)
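
The rules above can be restated as a short Python sketch. This is not the FEBA_Exp_Status() function itself: the function name, the dictionary input, the order of the checks, and the "Time0" and "OK" return labels are assumptions, while the failure labels are the ones that appear in the log.

```python
def experiment_status(q, is_time0=False):
    """Classify one experiment from its quality metrics.

    q is a dict with keys gMed, mad12, cor12, gccor and adjcor.
    Returns the first failure reason that applies, or "OK".
    """
    if is_time0:
        return "Time0"  # negative control, never counted as successful
    if q["gMed"] < 50:
        return "low_count"
    if q["mad12"] > 0.5:
        return "high_mad12"
    if q["cor12"] < 0.1:
        return "low_cor12"
    if abs(q["gccor"]) > 0.2 or abs(q["adjcor"]) > 0.25:
        return "high_adj_gc_cor"
    return "OK"
```

An experiment with gMed = 120, mad12 = 0.3, cor12 = 0.4, gccor = 0.05 and adjcor = 0.1 passes every check under this sketch.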

The R image

The R image includes:

genes -- a table of genes, including
- locusId, an arbitrary identifier used in data tables
- sysName, the systematic name or locus tag
- type, the type of gene, including 1 for protein-coding genes, 2:4 for various kinds of ribosomal RNAs, 5 for tRNAs, and 7 for pseudogenes of protein-coding genes.
- scaffold, begin, end, strand -- the gene's location. begin < end regardless of strand.
- GC -- the GC content of the gene's sequence
- nTA -- the number of TA dinucleotides in the gene's sequence (the Mariner transposase usually inserts at TA).
fit includes many different tables:
- q -- quality scores for each experiment and some metadata and quality metrics (described above)
- per-gene information:
  - g -- the corresponding locusId for each row in the per-gene tables. Genes that are not in this list might lack mapped insertions (e.g. because they are short, duplicated, or essential). Or, they might have insertions but not enough reads at Time0 to compute a fitness value (perhaps because the mutants are very sick).
  - lrn -- normalized gene fitness (in the same order as in g)
  - lr -- unnormalized gene fitness
  - lrn1, lrn2 -- normalized gene fitness using only strains within the 1st or 2nd half of each gene. These were used to compute the cor12 (or rho12) quality metric. These should also be useful for checking whether a specific gene's fitness measurement is reliable. Also see lr1 and lr2 for the corresponding unnormalized values.
  - t -- a t-like test statistic that gives an estimate of how significant this gene's measurement is
  - se -- an estimate of how noisy this gene's measurement is (se is short for standard error)
  - sdNaive -- the best-case of how noisy this gene's measurement would be, based on the total counts
  - n -- number of usable strains for each gene
  - nEff -- effective number of strains that were used to compute each gene's fitness value. The effective number is less than the actual number because of uneven weighting of the strains. nEff = sum(weight) / max(weight).
  - tot, tot0 -- total counts for the central part of this gene in the treatment and Time0 samples.
  - tot1, tot2, tot1_0, tot2_0 -- similarly for only the 1st and 2nd half of each gene.
- per-strain information:
  - strains -- a table of all mapped strains. This describes the barcode (as seen in TnSeq) and its reverse complement (as seen in BarSeq), where the insertion is, and what orientation it is in. If the insertion mapped to the delivery vector then scaffold = "pastEnd". If the strain is within the central 10-90% of a gene, then the locusId is indicated. If the strain was within 10-90% of a gene and had enough reads to be included in the gene fitness calculation, then used=TRUE. "f" is the fractional position in the gene (in the gene's orientation), i.e. 0.5 for an insertion in the middle of the gene and near 0 for an insertion at the very 5' end.
  - strain_lrn -- fitness values for each strain (in the same order as in strains). These will often be noisy due to low counts. Also see non-normalized fitness values in strain_lr.
  - strain_se -- a rough estimate of how noisy the per-strain fitness values are
- simple gene-wise analyses:
  - cofit -- top cofitness hits for each gene
  - specphe -- specific phenotypes. A gene's phenotype is considered specific for a condition if abs(fit) > 1 and abs(fit) > percentileFit + 0.5 and abs(t) > 5, where percentileFit is the 95th percentile of abs(fit) for that gene. A gene may have more than one specific phenotype.
expsUsed -- a table of experiments
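
The specific-phenotype rule listed under specphe can be illustrated for a single gene with the Python sketch below. It is not the FEBA R code: the function names and the dictionary inputs (fitness and t values keyed by experiment name) are hypothetical, and the percentile uses linear interpolation between order statistics, which is an assumption about how the 95th percentile is computed.

```python
def percentile(values, fraction):
    """Percentile by linear interpolation between sorted values (assumed method)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (position - low) * (ordered[high] - ordered[low])

def specific_phenotypes(fit_by_exp, t_by_exp):
    """Experiments in which this gene has a specific phenotype.

    Applies abs(fit) > 1, abs(fit) > percentileFit + 0.5 and abs(t) > 5,
    where percentileFit is the 95th percentile of abs(fit) for the gene.
    """
    percentile_fit = percentile([abs(f) for f in fit_by_exp.values()], 0.95)
    return [exp for exp, fit in fit_by_exp.items()
            if abs(fit) > 1
            and abs(fit) > percentile_fit + 0.5
            and abs(t_by_exp[exp]) > 5]
```

With this rule, a gene that is strongly affected in most conditions has a high percentileFit and so rarely qualifies, whereas a gene with one outlying condition does.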

The pool of mutants

The pool file is used to assign barcodes to strains while assembling all.poolcount. It includes a separate row for each strain:

barcode -- The barcode identified by TnSeq
rcbarcode -- The reverse complement of barcode, which is what is seen during BarSeq
nTot -- The total number of TnSeq reads that mapped to either the genome or the delivery vector and contained this barcode.
n -- #mapped TnSeq reads for this barcode at its primary location. n will often be a bit less than nTot because of chimeric reads that arise during the PCR step of the TnSeq protocol.
scaffold, strand, pos -- The primary location that this barcode maps to. If it maps to the delivery vector, then scaffold="pastEnd" and strand and pos are blank. Otherwise, pos is 1-based.
n2 -- #mapped TnSeq reads for this barcode at its second-most-popular location. n2 << n because other barcodes are filtered out when "designing" the pool.
scaffold2, strand2, pos2 -- the secondary location
pastEnd -- #TnSeq reads for this barcode that map to the delivery vector.

(Mutant libraries made with transpososomes instead of a suicide plasmid do not have a delivery vector and will not have any "pastEnd" reads.)
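
As an illustration of how the barcode and rcbarcode columns relate, and of the count relationships described above, here is a small Python sketch. The function names and the representation of a pool row as a dict are hypothetical, and the checks are only sanity checks inferred from the column descriptions, not part of the FEBA code.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(barcode):
    """Reverse complement of a DNA barcode (A<->T, C<->G, N unchanged)."""
    return barcode.upper().translate(_COMPLEMENT)[::-1]

def check_pool_row(row):
    """Return a list of inconsistencies found in one pool row (a dict)."""
    problems = []
    if reverse_complement(row["barcode"]) != row["rcbarcode"].upper():
        problems.append("rcbarcode is not the reverse complement of barcode")
    if row["n"] > row["nTot"]:
        problems.append("n exceeds nTot")
    if row["n2"] > row["n"]:
        problems.append("n2 exceeds n")
    return problems
```

For instance, reverse_complement("AACG") gives "CGTT".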

Written by Morgan Price, Arkin group, Lawrence Berkeley Lab, January 2014