Fitness Data for Korea
155 condition samples (148 successful), Thu Feb 18 20:48:20 2016, statistics version 1.0.3
- R image (large)
- Log file from processing in R
- At the top, "Ignoring" lists which samples were not analyzed at all. Either Drop=TRUE in the metadata or there were too few reads for the sample. These samples are not included in any of the tables.
- At the bottom, the log lists which experiments were considered not successful and why:
- "low_count" -- there were not enough reads for the median gene (gMed < 50). There were too few cells or something went wrong during DNA extraction or PCR. Or, there might be strong positive selection, i.e. gMean > 50, gMean >> gMed and maxFit >> 1. If there was strong positive selection then there might be useful biological information in the genes with the strong positive fitness values. Check the quality metrics table and then the table of gene fitness for all experiments (not just successful ones).
- "high_mad12" -- the fitness estimates from the two halves of a gene were not very consistent (mad12 > 0.5). This could result from variable lag or if only a small subset of cells were able to grow.
- "low_cor12" -- the rank correlation of the fitness values for the two halves of a gene was too low (cor12 < 0.1). In other words, there wasn't much biological signal. This could be a sign that the cells did not grow enough or that the condition is too similar to the condition that the pool was made under (so that all sick strains are gone). If you have replicates then you might still be able to get high quality results by averaging them.
- "high_adj_gc_cor" -- Suspicious correlations indicate a problem with GC bias or with normalization (|gccor| > 0.2 or |adjcor| > 0.25).
- Source code: see bin/RunFEBA.R and lib/FEBA.R in the dev branch of the FEBA BitBucket repository
- Input files for RunFEBA.R: genes, exps, all.poolcount (large!)
- Also see the pool
Gene fitness is a log2 ratio. It is normalized so that
genes with no phenotype should have values near zero. Ideally, genes
that are very sick (incapable of growth in the condition) should have
values around -6 if the experiment ran for 6 generations. In practice,
values below -2 or -3 indicate that mutants in the gene are very sick,
and values around -1 indicate a mildly deleterious phenotype for
mutants in that gene. On the other hand, if a gene's activity is
deleterious, then the fitness values will be positive.
Gene fitness is calculated from strain fitness:
- Strain fitness = log2( C1 + #Reads in Sample ) - log2( C0 + #Reads in Time0 ), where C1 and C0 are small constants to avoid taking the logarithm of 0.
- Unnormalized gene fitness = C + weighted average of strain fitness, where C is chosen so that the median gene's unnormalized fitness is zero.
The method for averaging gene fitness across strains may change in the future.
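The two formulas above can be sketched in a few lines of Python. This is only an illustration, not the code in FEBA.R: the values of C1 and C0 (both 1 here), the per-strain weights, and the function names are all assumptions made for the example, since this page does not specify them.

```python
import math
from statistics import median


def strain_fitness(reads_sample, reads_time0, c1=1.0, c0=1.0):
    """log2(C1 + sample reads) - log2(C0 + Time0 reads).

    c1 and c0 are placeholder pseudocounts (assumed values).
    """
    return math.log2(c1 + reads_sample) - math.log2(c0 + reads_time0)


def unnormalized_gene_fitness(strain_fits_by_gene, weights_by_gene):
    """Weighted average of strain fitness per gene, plus a constant C
    chosen so that the median gene ends up at zero.

    Both arguments map a gene identifier to a list (one entry per strain).
    How the weights are derived is not covered here; they are inputs.
    """
    averages = {}
    for gene, fits in strain_fits_by_gene.items():
        weights = weights_by_gene[gene]
        weighted_sum = sum(w * f for w, f in zip(weights, fits))
        averages[gene] = weighted_sum / sum(weights)
    c = -median(averages.values())  # shift so the median gene is zero
    return {gene: avg + c for gene, avg in averages.items()}
```

With equal weights, a gene whose strains are unchanged relative to Time0 sits at the median and therefore gets a fitness of zero after the shift.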
Normalization for Chromosomal Bias
Depending on the growth phase of the sample, the copy number of the
chromosome may be higher near the origin than near the terminus.
If the treated and Time0 samples were growing at different rates,
then there will be variable recovery of barcodes near the origin, which does
not relate to the fitness of the genes. This is plotted for each
experiment in the chromosome bias plots (note that the y axis is the
unnormalized fitness). To remove this effect, we subtract the running median
of fitness (the line in those plots). Then, a constant is added so
that the mode (the peak of the distribution of gene fitness values) is zero.
Also, in different preparations of genomic DNA, the efficiency of
recovering plasmids can vary. So, for each plasmid (if this organism
has any), the median fitness of the genes on the plasmid is set to
zero. Plasmids with very few genes cannot be normalized and so their genes are excluded.
If the genome sequence is not complete and is in many fragments, then it is not easy to tell if there is an effect of proximity to the terminus. Assuming that each scaffold is small, there will not be a significant variation in copy number across the scaffold. Since each scaffold is normalized separately, this should correct for the varying copy numbers of the scaffolds (but this has not been verified). Also, since it is difficult to distinguish a small scaffold from a plasmid, genes from small scaffolds are excluded.
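As a rough illustration of the chromosomal-bias correction, the sketch below subtracts a running median from gene fitness values ordered by position, and separately centers the genes of one plasmid on their median. It is not the FEBA.R implementation: the window size, the handling of the ends of the scaffold (windows are simply truncated), and the function names are assumptions, and the final step of shifting the mode is not shown.

```python
from statistics import median


def subtract_running_median(fitness_in_order, window=251):
    """Subtract a running median from unnormalized gene fitness.

    fitness_in_order: fitness values for one scaffold, sorted by
    chromosomal position. window is a hypothetical number of genes.
    Windows are truncated at the ends (an assumption).
    """
    n = len(fitness_in_order)
    half = window // 2
    corrected = []
    for i, value in enumerate(fitness_in_order):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        corrected.append(value - median(fitness_in_order[lo:hi]))
    return corrected


def center_on_median(fitness_values):
    """Set the median fitness of a group of genes (e.g. one plasmid) to zero."""
    m = median(fitness_values)
    return [v - m for v in fitness_values]
```

A smooth trend along the chromosome is removed by the first function, while a single gene with a strong phenotype barely moves the local median and so keeps its signal.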
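The quality metrics table has one row per experiment, with the fields below. The requirements for a successful experiment, as of April 9, 2014, are listed under u at the end.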
- name -- set1H1 means index H1 in set1 for that organism. These experiment names are the column names in most of the other tables.
- short -- shortened description. Samples with the same value are replicates.
- t0set -- which set of Time0 samples this was compared to. If this is a Time0 sample, then the self-counts were subtracted out of the other side of the comparison as a negative control.
- num -- another unique identifier, the row number in the input exps file. Used in the plots for compactness and shown in the cluster diagrams with "#", i.e. "set1H1 #3".
- nMapped -- #reads for that sample that corresponded to a known strain
- nPastEnd -- #reads that corresponded to a strain that has an insertion within the suicide vector instead of within the genome.
- nGenic -- #reads that correspond to insertions within genes
- nUsed -- #reads that lie within the central 10-90% of a gene. Only
these reads are used to estimate gene fitness.
- gMed -- median reads per gene in the sample. Values under 50
suggest that the experiment failed to generate useful information
about most genes.
- gMedt0 -- median reads per gene in the corresponding t0
sample(s). Values under 50 suggest that the experiment might have
- gMean -- average number of reads per gene in the sample. This can
be far higher than gMed because of a strong skew in the sampling of
gene mutants. This usually indicates strong positive selection for
mutants in a few genes.
- cor12 -- (called rho12 in the plots) -- A measure of how consistent the fitness data is for each
gene. For each gene, we estimate the (normalized) fitness using only insertions
within the first half of each gene (10-50%) or only within the second
half of each gene (50-90%). cor12 is the Spearman rank correlation of
those two sets of values. Values below 0.2 suggest that the experiment
might have failed.
- mad12 -- Another measure of the consistency. For each experiment,
this is the median absolute difference (m.a.d.) between the fitness according
to the first half of the gene and the fitness according to the second
half of the gene. Values above 0.5 suggest that the experiment might have failed.
- mad12c, mad12c_t0 -- The m.a.d. of the log2 counts
for the 1st and 2nd half of the gene, in the treated sample or the
Time0 sample(s).
- opcor -- A measure of how consistent the fitness data is
for each operon. For each pair of adjacent genes that are predicted to
be in the same operon, we take the fitness values for the upstream and
downstream gene. This is the Spearman rank correlation of those two
sets of values. Genes in the same operon often have related functions,
so opcor < 0.2 suggests that the experiment might have failed.
- adjcor -- A measure of how consistent the fitness data is
for nearby genes that probably do *not* have related functions. This
is like opcor but is computed using adjacent genes that are not on the
same strand. Values above 0.1 may be an indication of problems with
the BarSeq PCR (i.e. GC bias) or with the normalization.
- gccor -- A test of GC bias, this reports the Pearson correlation between a gene's GC content and its fitness. Values above 0.1 or below -0.1 may be an indication of problems with the BarSeq PCR.
- maxFit -- The highest fitness value of any gene. Values above 5
indicate strong positive selection in the sample. This often explains why
an experiment failed, either because most genes have few reads (low
gMed) or because only a few strains grow.
- u -- Whether the experiment was successful and is used in global analyses. The rules are implemented in FEBA_Exp_Status() in FEBA.R. A successful experiment meets all of these requirements:
- not a Time0 (Time0 "fitness" values are computed as a negative control)
- gMed ≥ 50
- mad12 ≤ 0.5
- cor12 ≥ 0.1
- |gccor| ≤ 0.2 and |adjcor| ≤ 0.25
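To make the consistency metrics and the success rules concrete, here is a small Python sketch that computes cor12 and mad12 from per-gene half-fitness values and then lists which thresholds an experiment fails. It is an illustration only, not a port of FEBA_Exp_Status(): the function names and the dictionary layout are assumptions, and all failed checks are reported because the order in which the real code applies them is not stated here.

```python
from statistics import median


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def cor12(first_half, second_half):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(first_half), ranks(second_half))


def mad12(first_half, second_half):
    """Median absolute difference between the two half-gene estimates."""
    return median(abs(a - b) for a, b in zip(first_half, second_half))


def failure_reasons(q):
    """q: dict with gMed, mad12, cor12, gccor, adjcor for one experiment."""
    reasons = []
    if q["gMed"] < 50:
        reasons.append("low_count")
    if q["mad12"] > 0.5:
        reasons.append("high_mad12")
    if q["cor12"] < 0.1:
        reasons.append("low_cor12")
    if abs(q["gccor"]) > 0.2 or abs(q["adjcor"]) > 0.25:
        reasons.append("high_adj_gc_cor")
    return reasons


def is_successful(q, is_time0=False):
    """Time0 samples are never counted as successful experiments."""
    return not is_time0 and not failure_reasons(q)
```

Returning every failed check, rather than only the first, makes it easier to see when an experiment has more than one problem.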
The R image
The R image includes:
- genes -- a table of genes, including
- locusId, an arbitrary identifier used in data tables
- sysName, the systematic name or locus tag
- type, the type of gene, including 1 for protein-coding genes, 2:4 for various kinds of ribosomal RNAs, 5 for tRNAs, and 7 for pseudogenes of protein-coding genes.
- scaffold, begin, end, strand -- the gene's location. begin < end regardless of strand.
- GC -- the GC content of the gene's sequence
- nTA -- the number of TA dinucleotides in the gene's sequence (the Mariner transposase usually inserts at TA).
- fit includes many different tables:
- q -- quality scores for each experiment and some metadata and quality metrics (described above)
- per-gene information:
- g -- the corresponding locusId for each row in the per-gene tables. Genes that are not in this list might lack mapped insertions (e.g. because they are short, duplicated, or essential). Or, they might have insertions but not enough reads at Time0 to compute a fitness value (perhaps because the mutants are very sick).
- lrn -- normalized gene fitness (in the same order as in g)
- lr -- unnormalized gene fitness
- lrn1, lrn2 -- normalized gene fitness using only strains within the 1st or 2nd half of each gene. These were used to compute the cor12 (or rho12) quality metric. These should also be useful for checking whether a specific gene's fitness measurement is reliable. Also see lr1 and lr2 for the corresponding unnormalized values.
- t -- a t-like test statistic that gives an estimate of how significant this gene's measurement is
- se -- an estimate of how noisy this gene's measurement is (se is short for standard error)
- sdNaive -- the best-case of how noisy this gene's measurement would be, based on the total counts
- n -- number of usable strains for each gene
- nEff -- effective number of strains that were used to compute each gene's fitness value. The effective number is less than the actual number because of uneven weighting of the strains. nEff = sum(weight) / max(weight).
- tot, tot0 -- total counts for the central part of this gene in the treatment and Time0 samples.
- tot1, tot2, tot1_0, tot2_0 -- similarly for only the 1st and 2nd half of each gene.
- per-strain information:
- strains -- a table of all mapped strains. This describes the barcode (as seen in TnSeq) and its reverse complement (as seen in BarSeq), where the insertion is, and what orientation it is in. If the insertion mapped to the delivery vector then scaffold = "pastEnd". If the strain is within the central 10-90% of a gene, then the locusId is indicated. If the strain was within 10-90% of a gene and had enough reads to be included in the gene fitness calculation, then used=TRUE. "f" is the fractional position in the gene (in the gene's orientation), i.e. 0.5 for an insertion in the middle of the gene and near 0 for an insertion at the very 5' end.
- strain_lrn -- fitness values for each strain (in the same order as in strains). These will often be noisy due to low counts. Also see non-normalized fitness values in strain_lr.
- strain_se -- a rough estimate of how noisy the per-strain fitness values are
- simple gene-wise analyses:
- cofit -- top cofitness hits for each gene
- specphe -- specific phenotypes. A gene's phenotype is considered specific for a condition if abs(fit) > 1 and abs(fit) > percentileFit + 0.5 and abs(t) > 5, where percentileFit is the 95th percentile of abs(fit) for that gene. A gene may have more than one specific phenotype.
- expsUsed -- a table of experiments
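Three of the quantities described in this list are simple enough to sketch in Python: the effective number of strains, the fractional position f, and the specific-phenotype test. These are illustrations under stated assumptions, not the R code. nEff is taken as sum(weight) / max(weight), the reading consistent with it being at most the actual number of strains. The percentile uses linear interpolation, and f is computed from begin, end and strand in the way shown; both are assumptions, as are the function names.

```python
def effective_strains(weights):
    """Effective number of strains: equals len(weights) when all weights match."""
    return sum(weights) / max(weights)


def fractional_position(pos, begin, end, strand):
    """Position of an insertion within a gene, in the gene's orientation.

    Assumes begin < end regardless of strand, as in the genes table.
    """
    if strand == "+":
        return (pos - begin) / (end - begin)
    return (end - pos) / (end - begin)


def in_central_part(f):
    """True for insertions within the central 10-90% of the gene."""
    return 0.1 <= f <= 0.9


def percentile(values, fraction):
    """Percentile by linear interpolation between order statistics (assumed method)."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def is_specific(fit, t, all_fits_for_gene):
    """Apply the three specific-phenotype conditions to one gene in one experiment."""
    percentile_fit = percentile([abs(v) for v in all_fits_for_gene], 0.95)
    return abs(fit) > 1 and abs(fit) > percentile_fit + 0.5 and abs(t) > 5
```

The percentile condition is what makes a phenotype "specific": a gene that is sick in many conditions has a high percentileFit, so a strong value in any one of them does not qualify.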
The pool of mutants
The pool file is used to assign barcodes to strains while assembling all.poolcount. (Mutant libraries made with transpososomes instead of a suicide plasmid do not have a delivery vector and will not have any "pastEnd" reads.) It includes a separate row for each strain:
- barcode -- The barcode identified by TnSeq
- rcbarcode -- The reverse complement of barcode, which is what is seen during BarSeq
- nTot -- The total number of TnSeq reads that mapped to either the genome or the delivery vector and contained this barcode.
- n -- #mapped TnSeq reads for this barcode at its primary location. n will often be a bit less than nTot because of chimeric reads that arise during the PCR step of the TnSeq protocol.
- scaffold, strand, pos -- The primary location that this barcode maps to. If it maps to the delivery vector, then scaffold="pastEnd" and strand and pos are blank. Otherwise, pos is 1-based.
- n2 -- #mapped TnSeq reads for this barcode at its second-most-popular location. n2 << n because other barcodes are filtered out when "designing" the pool.
- scaffold2, strand2, pos2 -- the secondary location
- pastEnd -- #TnSeq reads for this barcode that map to the delivery vector.
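The relationship between barcode and rcbarcode, and between n and nTot, can be checked with a short Python sketch. The row is represented as a plain dictionary keyed by the field names above; how the pool file is actually parsed is not shown, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(barcode):
    """Reverse complement of a DNA barcode (A<->T, C<->G, then reversed)."""
    return barcode.translate(_COMPLEMENT)[::-1]


def primary_fraction(row):
    """Fraction of a barcode's TnSeq reads that map to its primary location."""
    return row["n"] / row["nTot"]


def maps_to_vector(row):
    """True if the barcode's primary location is the delivery vector."""
    return row["scaffold"] == "pastEnd"
```

A primary fraction well below 1 means that many of the reads for this barcode mapped somewhere other than its primary location.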
Written by Morgan Price, Arkin group, Lawrence Berkeley Lab, January 2014