MultiNome Version Beta 0.0.0000001: Multiple Genome Compiler

A project submitted for BioEngineering 190C, Professor Adam Arkin, Spring 2002, by Gene Tyson, Hoenie Luk, and David Quan (University of California, Berkeley)

Welcome! Most likely you're someone we know. (Hi!) If this happens not to be the case: WOW! Let me re-welcome you: Welcome! In either case, we hope your visit will be at least marginally enjoyable. So here's to hoping!

Despite the prior goofiness, please be assured, this is a potentially useful program, with the capacity to take multiple genomes and display multiple global genomes. While this may sound uber-cool, please restrain your enthusiasm. This program is limited to didactic proportions, and therefore is not robust, such that it may not work for trickier sequencing problems (high global homologies). In theory though, this program is useful for obtaining global sequences for things like bacteria, which may not be easily isolated through conventional techniques. Interested? Yes? Well, read on! No? Well, you will be. Read on!

A general review of multinome
The parts of the program
Feeding in your file, file format
Into the concatenation
To the display. The GUI!
A relatively simple "how to"
Downloading Multinome
If you've come this far...
References and other information
Registration, in case you haven't yet 


A general review of multinome

Inspired by the vast multitude of bacteria that cannot be cultured by conventional means, and specifically by the problem of identifying the bacteria at an acid mine drainage site, Multinome takes DNA fragments sequenced from a community of bacteria and (in theory) returns global sequences. The program is not necessarily restricted to bacteria, but that is its most obvious application.

 

The parts of the program

The program is split up into three parts: a modified Smith-Waterman search (David), a concatenation algorithm (Hoenie), and a GUI (Gene). Note: the name in parentheses indicates who did what.

 

Feeding in your file AND file format (!)

Code: the core of the algorithm is adapted from http://lectures.molgen.mpg.de/PracticalSection/DynProgInJava/, which is written for amino acids. Using the class dataFileIn and its method "getData()", a string is returned that contains both the sequence tag and the actual sequence concatenated together.

    dataFileIn in = new dataFileIn();
    String s;
    s = in.getData();

This information is then parsed into the fields of a sequence data type. Note: The tag must be 7 characters long in order for the parsing to be correct.

    sequence seq2 = new sequence();
    unparsed = data.getData(i);
    seq2.id = unparsed.substring(0,8);
    String subby3 = unparsed.substring(8);
    seq2.letters = subby3.toCharArray(); //etc.

Each sequence is then iteratively compared with every other sequence by the class Alignment, which runs the Smith-Waterman algorithm and produces a number. That number is passed through a threshold tester into an int array, which is essentially a boolean array, where 1 represents true and 0 represents false. This process may be termed a first pass.

    alignment = new Alignment(seqAbbrev, seq2);
    alignment.fillEditMatrix (openGapPenalty, extensionGapPenalty, seqAbbrev, seq2);
    maybe = alignment.maxToArray();
    if (maybe >= 87.0) {
        maxArray[i-1] = 1;
    }

The indices of the int array that are equal to 1 are then iteratively passed back through the Smith-Waterman algorithm, this time through to the backtracking step. The resulting data is passed along to the output file. This process may be termed a second pass.

    if (firstPass[p] == 1) {
        int[] corresponds;
        alignment = new Alignment(seq1, seq2);
        alignment.fillEditMatrix (openGapPenalty, extensionGapPenalty, seq1, seq2);
        corresponds = alignment.backTracking();
        String toHash = new String((corresponds[2]+1) + "-" + (corresponds[0]+1) + "::" + (corresponds[3]+1) + "-" + (corresponds[1]+1));
        . . .

    String salsa = " ";
    out.writeChars(salsa.valueOf(count));
    out.writeChar('\n');
    for (int s = 0; s <= (count-1); s++) {
        out.writeChars(seq1.id);
        out.writeChar(':');
        out.writeChar(':');
        out.writeChars(hashNames[s]);
        out.writeChar('=');
        out.writeChars(((String)sink.get(hashNames[s])));
        out.writeChar('\n');
    }

The output file is a text file, "pass.swr".

Now some pertinent facts on operation.

File format:
(i) (txt.) The file must have a number on the first line, indicating half the number of sequences in the file.
(ii) The file must then have nine lines of non-relevant text.
(iii) The file must then have the sequence tag/name 7 spaces in, a space, and then the sequence itself. Example: "Name = seq001A CTAATATCT....."
(iv) There must then be three lines of non-relevant text.
(v) (iii) and (iv) should be repeated for every sequence.

To run this section of the program, write in a command line:

    jre -cp app.jar TestIT

or optionally,

    jre -cp app.jar Work

TestIT provides its own file, whereas using Work requires you to provide your own file.

Additional note: You must have a directory c:\jdk1.3.1_02\bin\ . This is where the output is written.
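
For readers who want to see what the first pass computes, here is a minimal sketch of a Smith-Waterman local alignment score with separate gap-opening and gap-extension penalties, followed by the threshold test. It is not the Alignment class from this project: the class name LocalAlignmentSketch, the MATCH and MISMATCH values, and the firstPass helper are assumptions made for illustration, and the real scoring values are not given on this page.

```java
import java.util.Arrays;

/** Sketch of Smith-Waterman local alignment scoring with affine gap penalties. */
class LocalAlignmentSketch {
    // Hypothetical scoring values; the original program's values are not stated.
    static final int MATCH = 2;
    static final int MISMATCH = -1;

    /** Returns the best local alignment score between a and b. */
    static int score(String a, String b, int openGapPenalty, int extensionGapPenalty) {
        int n = a.length();
        int m = b.length();
        int negInf = Integer.MIN_VALUE / 2; // large negative value that cannot overflow here
        int[][] h = new int[n + 1][m + 1];  // best score of an alignment ending at (i, j)
        int[][] e = new int[n + 1][m + 1];  // same, but ending with a gap in a
        int[][] f = new int[n + 1][m + 1];  // same, but ending with a gap in b
        for (int i = 0; i <= n; i++) {
            Arrays.fill(e[i], negInf);
            Arrays.fill(f[i], negInf);
        }
        int best = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                // Either open a new gap or extend an existing one.
                e[i][j] = Math.max(h[i][j - 1] - openGapPenalty, e[i][j - 1] - extensionGapPenalty);
                f[i][j] = Math.max(h[i - 1][j] - openGapPenalty, f[i - 1][j] - extensionGapPenalty);
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                // Local alignment: a score never drops below zero.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(e[i][j], f[i][j])));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    /** First pass: store 1 for each candidate whose score reaches the threshold, else 0. */
    static int[] firstPass(String query, String[] candidates,
                           int openGapPenalty, int extensionGapPenalty, double threshold) {
        int[] flags = new int[candidates.length];
        for (int i = 0; i < candidates.length; i++) {
            int s = score(query, candidates[i], openGapPenalty, extensionGapPenalty);
            flags[i] = (s >= threshold) ? 1 : 0;
        }
        return flags;
    }
}
```

With these assumed values, LocalAlignmentSketch.score("ACGT", "ACGT", 3, 1) returns 8 (four matches at 2 each). Only the score is computed here; the second pass described above would also need the backtracking step to recover the overlap coordinates.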

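The file format rules listed in this section can be read with a few lines of Java. The following is only a sketch under the stated rules (a count line, nine skipped lines, then one sequence line followed by three skipped lines, repeated). The class FragmentFileSketch and its Entry type are hypothetical names, not part of the Multinome code, and the sketch takes the lines as a list rather than opening a file.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a reader for the input file layout described in this section. */
class FragmentFileSketch {
    static final int HEADER_LINES = 9;  // non-relevant lines after the count line
    static final int TRAILER_LINES = 3; // non-relevant lines after each sequence line
    static final int TAG_OFFSET = 7;    // the tag starts 7 characters in ("Name = ")
    static final int TAG_LENGTH = 7;    // the tag itself is 7 characters long

    /** Hypothetical holder for one parsed sequence. */
    static final class Entry {
        final String id;
        final String letters;

        Entry(String id, String letters) {
            this.id = id;
            this.letters = letters;
        }
    }

    static List<Entry> parse(List<String> lines) {
        // The first line holds HALF the number of sequences in the file.
        int expected = Integer.parseInt(lines.get(0).trim()) * 2;
        List<Entry> entries = new ArrayList<>();
        int i = 1 + HEADER_LINES;
        while (entries.size() < expected && i < lines.size()) {
            String line = lines.get(i);
            if (line.length() < TAG_OFFSET + TAG_LENGTH) {
                throw new IllegalArgumentException("Line " + (i + 1) + " is too short for a tag");
            }
            String id = line.substring(TAG_OFFSET, TAG_OFFSET + TAG_LENGTH);
            String letters = line.substring(TAG_OFFSET + TAG_LENGTH).trim();
            entries.add(new Entry(id, letters));
            i += 1 + TRAILER_LINES; // skip the sequence line and its three filler lines
        }
        return entries;
    }
}
```

For the example line "Name = seq001A CTAATATCT", this sketch yields the id "seq001A" and the letters "CTAATATCT".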
 

Into the concatenation

Documentation for Concatenation Module (part 2 of Multinome)
by Hoenie Luk (5/22/02) (airedale@theearth.com)

PURPOSE OF MODULE

This module reads in the Smith-Waterman output from David's Overlap Search Module and reassembles the actual global sequences by concatenating the DNA fragments in the correct order. The program is written entirely in Perl and lives in one file (globmap2.pl).

INPUT

*.swr (Smith-Waterman report of fragment overlaps)
*.fra (DNA fragment list; both files must have the same name)

OUTPUT

*.con (Concatenation report) and a bunch of other intermediate reports (*.250, *.aln)

REQUIREMENT (use files in concat_i.zip)

1. Perl 5 must be installed.
2. The following files must be in one directory:
   globmap2.pl (Global mapping version 2, written in Perl)
   *.swr (Smith-Waterman report of fragment overlaps)
   *.fra (DNA fragment list)
3. t_coffee.exe and t_coflal.exe (T-Coffee multiple-sequence alignment; in the path or in the same directory)

T-Coffee is a multiple-sequence alignment program written by Cedric Notredame and his team. It is an improved algorithm based on the older clustalw algorithm (which, unfortunately, is not accurate enough for our Concatenation Module). The t_cof*.exe included in this zip package is a binary executable for Win32 systems. For versions compiled for other operating systems, visit: http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html

USAGE

Type "perl globmap2.pl" (no argument will be accepted). When prompted for the name of the .fra and .swr files, type the name without the extension.

SAMPLE INPUT AND OUTPUT (files in concat_i.zip and concat_o.zip)

There are two sets of sample input files: (1) randfr10.fra/.swr and (2) pass10.fra/.swr. (randfr10.fra and pass10.fra are identical.)

randfr10.fra and pass10.fra contain about 60 random fragments generated from a 3000bp chunk of the E. coli genome and about 60 random fragments generated from a 3000bp chunk of the Clostridium perfringens genome. Each fragment is about 500bp in size and contains 10 random mutations (missense, deletions or insertions of <3 bases) to simulate sequencing errors. The two files are exactly identical, but both exist in the package so that they match the names of the two .swr files.

randfr10.swr is a 100% accurate Overlap Search report generated at the time the .fra file was generated. This represents the result of the Smith-Waterman Overlap Search Module when it works perfectly.

pass10.swr is an actual report generated by running randfr10.fra (=pass10.fra) through the Overlap Search Module. You will notice that it misses many actual overlaps while also including a few false overlaps.

To test our Concatenation Module, we fed randfr10.fra/.swr to globmap2.pl. The generated output is under the directory named \randfr10. We also fed pass10.fra/.swr to globmap2.pl and placed the output under the \pass10 directory.

The Concatenation Module took about one hour to run. Most of the time is spent on T-Coffee multiple-sequence alignment.

The *.fa files are collections of fragments (in FASTA format) that the Concatenation Module sends to T-Coffee for alignment. In return, T-Coffee spits out a lot of .aln files and .dnd files. fa*.aln are alignments of DNA fragments within a 250bp region. pa*.aln are alignments of consensus sequences in adjacent 250bp regions. The .dnd files are dendrograms produced during the alignment process. You can ignore these files, since they are mainly for debugging purposes.

There is also a *.250 file (which reports the computed consensus sequence within each 250 base-pair region) and a *.cat file (which reports just the computed concatenated sequences). These are intermediate report files used by the Concatenation Module. You can also ignore them.

The real outputs are randfr10.con and pass10.con (the concatenation reports). Inside, you can find the final global sequences (listed as mn_global*; mn stands for multinome) and the position within the global sequences to which each DNA fragment belongs.

You'll notice that in randfr10.con, there are two global sequences (2686 and 2610 bases long). We did a T-Coffee alignment of these two multinome-computed sequences against the two original global sequences (see mng1_g2.aln and mng2_g1.aln). You will find that the multinome-computed sequences almost perfectly match the original sequences except at the two ends, where the fold coverage is low. This indicates that our Concatenation Module works in the ideal situation where the Overlap Search report is 100% accurate.

When an actual Overlap Search report is used (pass10.swr), however, the result requires extra interpretation. In pass10.con, you'll find 11 multinome-computed global sequences. Some are longer and most are very short. The short ones are the result of too many unreported overlaps, such that certain fragments cannot be mapped back to the larger global sequences. But if we take the two largest multinome-computed global sequences (mn_global001 and mn_global004) and match them against the original global sequences, we find that they do match, but with a relatively high rate of error. The errors are the result of low fold coverage where too few fragments could be mapped to a given region of the global sequence.

The results from pass10.con indicate that the Concatenation Module can reassemble the global sequences from the DNA fragments to an acceptable shape. However, further improvement will depend on improving the accuracy of the Overlap Search report.

The *.con file is the one file handed to the third module (GUI Display Module) for graphical display of the results.

OVERVIEW OF ALGORITHM

First, the Concatenation Module loads the .swr Overlap Search report into a matrix. The column and row headers contain all the fragments mentioned in the report. Where bases 1 to 100 of segment X overlap with bases N to N+100 of segment Y, the number N is entered into the box (X,Y).

Second, the Concatenation Module checks for those sequences that have empty columns. These are sequences that map to the extreme 5'-end of the final global sequences, since their 5'-ends do not overlap with any other fragments. Each of these sequences represents the beginning of a separate global species.

Third, from one particular sequence (call it A) identified above, the program "crawls" through the row of that sequence to find the fragment (call it B) that begins just 3' to the starting point of sequence A. Then the next sequence following B is identified by similarly crawling the row of sequence B. This continues until the program encounters an empty row, indicating that we have crawled to the extreme 3'-end of the global sequence.

Fourth, this fragment crawling is repeated for all the separate species.

Fifth, with all the fragments lined up in sequential order, the program starts at the boundary line of base 1 and collects all the fragments that begin or end within bases 1 to 100. These fragments are sent to T-Coffee for multiple-sequence alignment.

Sixth, the program examines the T-Coffee alignment result column by column and decides what the consensus sequence is for the section starting from base 1.

Seventh, the program jumps 250 bases to base 251 and repeats the same process (steps five and six) until the sectional consensus sequences starting from bases 1, 251, 501, 751 ....... 2751 are all determined. These sectional consensus sequences are recorded in the *.250 file.

Eighth, the program takes two adjacent sectional consensus sequences at a time (e.g. the one for base 1 (C) and the one for base 251 (D)) and sends them to T-Coffee for alignment. The alignment report is examined to find a 25-bp region near the beginning of sequence D where both sequences C and D are identical. This is used as a join point where the two adjacent sequences, C and D, are joined together.

Ninth, step eight is repeated until all the fragments belonging to one species are joined together. This is reported to the *.cat file.

Tenth, finally, the *.con output file is generated for the third module (Gene's GUI Display Module).

PLEASE SEND QUESTIONS AND SUGGESTIONS TO THE AUTHOR
Hoenie Luk, airedale@theearth.com
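
The second through fourth steps of the algorithm overview (finding the 5'-most fragments and crawling 3' from each) can be pictured with a short sketch. The real module is written in Perl (globmap2.pl); this Java version is only an illustration. It assumes the overlap matrix is held as a nested map where overlaps.get(y).get(x) = n means that fragment x begins at base n of fragment y, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the fragment "crawl" that orders fragments from 5' to 3'. */
class FragmentCrawlSketch {

    /** Returns one ordered chain of fragment names per global sequence. */
    static List<List<String>> orderFragments(Map<String, Map<String, Integer>> overlaps) {
        Set<String> all = new TreeSet<>();          // every fragment mentioned in the report
        Set<String> hasUpstream = new HashSet<>();  // fragments whose 5'-end overlaps something
        for (Map.Entry<String, Map<String, Integer>> row : overlaps.entrySet()) {
            all.add(row.getKey());
            all.addAll(row.getValue().keySet());
            hasUpstream.addAll(row.getValue().keySet());
        }
        List<List<String>> chains = new ArrayList<>();
        for (String start : all) {
            if (hasUpstream.contains(start)) {
                continue; // not an extreme 5' fragment (its column is not empty)
            }
            List<String> chain = new ArrayList<>();
            Set<String> seen = new HashSet<>();
            String current = start;
            // Crawl until a fragment has no unvisited downstream neighbor (empty row).
            while (current != null && seen.add(current)) {
                chain.add(current);
                current = nearestDownstream(overlaps.get(current), seen);
            }
            chains.add(chain);
        }
        return chains;
    }

    /** Picks the unvisited fragment that begins closest to the start of the current one. */
    static String nearestDownstream(Map<String, Integer> row, Set<String> seen) {
        if (row == null) {
            return null;
        }
        String best = null;
        int bestPos = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : row.entrySet()) {
            if (!seen.contains(e.getKey()) && e.getValue() < bestPos) {
                best = e.getKey();
                bestPos = e.getValue();
            }
        }
        return best;
    }
}
```

The visited set is a guard added for this sketch so that a false overlap in the report cannot send the crawl around in a loop; whether globmap2.pl does something similar is not stated here.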

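The sixth step, reading the alignment column by column to decide a consensus, might look something like the sketch below. The page does not say which rule globmap2.pl uses, so this assumes a simple majority vote per column, with ties broken in the order A, C, G, T, gap, and with a column dropped when a gap wins. The class name ConsensusSketch is hypothetical.

```java
/** Sketch of a column-by-column majority consensus over aligned fragments. */
class ConsensusSketch {

    /** rows are aligned sequences of equal length, with '-' marking a gap. */
    static String consensus(String[] rows) {
        if (rows.length == 0) {
            return "";
        }
        int width = rows[0].length();
        for (String row : rows) {
            if (row.length() != width) {
                throw new IllegalArgumentException("Aligned rows must have equal length");
            }
        }
        String alphabet = "ACGT-"; // assumed tie-break order
        StringBuilder out = new StringBuilder();
        for (int col = 0; col < width; col++) {
            int[] counts = new int[alphabet.length()];
            for (String row : rows) {
                int k = alphabet.indexOf(Character.toUpperCase(row.charAt(col)));
                if (k >= 0) {
                    counts[k]++;
                }
            }
            int best = 0;
            for (int k = 1; k < counts.length; k++) {
                if (counts[k] > counts[best]) {
                    best = k;
                }
            }
            if (counts[best] == 0) {
                out.append('N'); // no recognizable base in this column
            } else if (alphabet.charAt(best) != '-') {
                out.append(alphabet.charAt(best));
            } // a column where the gap wins contributes nothing
        }
        return out.toString();
    }
}
```

For the three aligned rows "ACG-T", "ACGAT" and "AC--T", this sketch returns "ACGT": the fourth column is dropped because two of the three rows have a gap there.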
 
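The eighth step, joining two adjacent sectional consensus sequences at a stretch where they agree, can be sketched as follows. This is an illustration only: it assumes the two sequences arrive already aligned (equal length, '-' for gaps), takes the first run of identical, ungapped columns of the requested length as the join point, and returns null when no such run exists. The class ConsensusJoinSketch and its join method are hypothetical names, and the window would be 25 in the setting described above.

```java
/** Sketch of joining two aligned consensus sequences at an identical region. */
class ConsensusJoinSketch {

    /**
     * alignedC and alignedD are the two rows of a pairwise alignment.
     * Returns C up to the end of the first identical window followed by
     * the rest of D, or null if no window of that length is found.
     */
    static String join(String alignedC, String alignedD, int window) {
        if (alignedC.length() != alignedD.length()) {
            throw new IllegalArgumentException("Aligned rows must have equal length");
        }
        if (window <= 0) {
            throw new IllegalArgumentException("Window must be positive");
        }
        int run = 0; // length of the current run of identical, ungapped columns
        for (int col = 0; col < alignedC.length(); col++) {
            char c = alignedC.charAt(col);
            char d = alignedD.charAt(col);
            run = (c != '-' && c == d) ? run + 1 : 0;
            if (run == window) {
                int end = col + 1; // first column after the identical window
                String left = alignedC.substring(0, end).replace("-", "");
                String right = alignedD.substring(end).replace("-", "");
                return left + right;
            }
        }
        return null;
    }
}
```

With a window of 4, joining "AAACCGGTTAC---" and "-----GGTTACGTA" gives "AAACCGGTTACGTA": everything from C through the shared "GGTT", then the remainder of D.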

To the display. The GUI!

GUI documentation by Gene Tyson (20/5/02).

Usage and Purpose of Module:
----------------------------
java -jar ContigViewer.jar will give you the usage info for the program.

Following is a brief explanation of the major classes in the program. For each class we will discuss its primary function and how it interacts with the other classes.

This module reads in the *.con (Concatenation report) output from Hoenie's module and creates a visual display of the results for each of the global sequences. The program is written entirely in Java.

CLASSES:

GlobalSeq:
-----------
Wrapper around the (name,size,seq) triple that is read from the output file.

ConcatFileParser:
----------------
Responsible for parsing and providing access to the information in the output file.

Fragment:
----------
Wrapper for selected information (name,hitID,start,end,sequence) parsed from the output file.

ContigGlyph:
---------
Wrapper around the (sx,sy,ex,ey,name) that is used for constructing graphics for contigs.

ContigViewer:
------------
The Main class; this is the entry point for the program.

The remaining classes are Swing classes that were subclassed, adapted, or otherwise manipulated to serve our purposes. Following is an enumeration and brief description of our UI classes and their utility helpers.

MainFrame (extends javax.swing.JFrame)
--------------------------------------------
Constructs and lays out all the major user interface components for the program. The essential strategy is a JTabbedPane with one tab for each global sequence (species) image. We construct separate JPanels that use the BorderLayout manager to lay out the Components on the tabbed panes. On each panel we place a JLabel at BorderLayout.CENTER that holds an ImageIcon displaying the appropriate ContigGlyphs, and a JTable at BorderLayout.SOUTH that displays the appropriate annotation information when a user mouses over a spot. We instantiate two anonymous MouseMotionListeners and add them to the JLabels displaying the images. These classes manage the display of sequence information by sending a message to the appropriate JTable, which takes care of updating itself via its AbstractTableModel. JTextAreas are also created when contigs are clicked on, and display the sequence in FASTA format.

ContigTable (extends javax.swing.JTable):
---------------------------------------------------------
Displays data corresponding to the user's mouse position on the displayed ContigGlyph.

ContigInfoTableModel (extends javax.swing.table.AbstractTableModel):
-------------------------------------
The underlying data models that are queried by the tables for appropriate data to display. See the source code and the Sun Java Swing tutorial for more information.

SplashScreen (extends javax.swing.JWindow):
-------------------------------------------
This program parses a file and creates large images at startup. This startup SplashScreen simply lets the user know that something is actually going on while he/she waits. The progress bar is completely unnecessary but is a nice finishing touch to the program.
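
As an illustration of the table-model idea described above (a mouse listener tells the table's model which fragment is under the cursor, and the model notifies the table), here is a minimal sketch built on javax.swing.table.AbstractTableModel. It is not the ContigInfoTableModel from the source code: the class name FragmentInfoTableModelSketch, the single-row layout, and the showFragment method are assumptions. The column names follow the Fragment fields listed above.

```java
import javax.swing.table.AbstractTableModel;

/** Sketch of a one-row table model that shows the fragment under the mouse. */
class FragmentInfoTableModelSketch extends AbstractTableModel {
    private static final String[] COLUMNS = {"name", "hitID", "start", "end"};
    private Object[] current = new Object[COLUMNS.length]; // empty until a fragment is shown

    @Override
    public int getRowCount() {
        return 1;
    }

    @Override
    public int getColumnCount() {
        return COLUMNS.length;
    }

    @Override
    public String getColumnName(int column) {
        return COLUMNS[column];
    }

    @Override
    public Object getValueAt(int row, int column) {
        return current[column];
    }

    /** Hypothetical hook that a MouseMotionListener could call on mouse-over. */
    void showFragment(String name, String hitId, int start, int end) {
        current = new Object[] {name, hitId, start, end};
        fireTableDataChanged(); // tells any attached JTable to repaint its cells
    }
}
```

A JTable constructed with new JTable(model) would pick up each showFragment call through the change event, so the listener never has to touch the table directly.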

 

A relatively simple "How To"

Download the zip file. Create a directory: C:\jdk1.3.1_02\bin\. Follow the instructions contained in the preceding sections.

 

Downloading Multinome and the Code

If you would like to download Multinome and/or its code, please click HERE. If you have not yet registered, please scroll to the bottom of the page. You must register in order to obtain the username and password.

Please note: this program is open source, as protected under the GPL.

 

If you've come this far...

You're still here? In this case, you must be either: (i) someone from class, most likely Professor Arkin, (ii) someone we know personally, (iii) some Berkeley student wondering if you should take this class, or (iv) a fairly odd type, with an obsessive-compulsive need to finish what you've started: please seek the proper mental health authorities. Whichever category applies to you, be grateful. You are near the end. Congratulations.

At this point, we'd like to give thanks: This project has been dedicated to dinosaurs in trees that look like monkeys, the world over.

 

References and other information

A Review of the Smith-Waterman algorithm
Gene Tyson
Hoenie Luk
David Quan

 

Registration, in case you haven't yet

Multinome Registration 
 
For a free Multinome registration, please fill out the fields below. Thank you.
Age: 
Name: 
E-mail: 
Address:
Country: 
Company: 
Other comments:

 

Created by David Nathan Quan

 

May 2002