Posted on Jul 2, 2011

Hello!

This is my first post about my work on my DNAnalyser!

It is nothing fancy, so don’t get your hopes up (just yet). Here is a quick description of what it involves and the objectives:

  • I have some 6,670 real human DNA sequences.
  • They come with some 100-odd binding sites, each with its occurrence count on that specific sequence.
  • I have to find the relative similarity between any two given sequences based on their binding sites.
  • The data is huge. About 17 million records were computed by my algorithm. It took about 7 hours to finish (on my MBP). I know, I know. Why the MBP?!
  • Now I have to find a few different clustering algorithms to cluster this data and prep it for analysis.
  • Once the clustering part is done, I have to feed this data into this convertor (that I’ll have to write), to prep it to be fed to this visualizer (I have to figure that out too).
  • Here is the best part: I only have about 2 months to do all this. Plus an 80-page report!

Up to now (just to point out, it has been about 4 days since I actually started the implementation), I have done the following:

  • I have written a CSV-to-MySQL convertor. I know that there are CSV convertors out there, but due to the amount of data, I couldn’t use them (e.g. OpenCSV). I had to write the convertor myself.
  • I have written the SimilarityMatrix calculator, which uses the Jensen–Shannon Divergence (JSD). This part took 7 hours on my MBP. This is a one-off event and won’t need to be redone unless the DNA data has to be refreshed.
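
For anyone wondering what the JSD step looks like, here is a minimal sketch in Python. It is not the actual SimilarityMatrix code. It assumes each sequence is represented as a dictionary mapping a binding-site name to its occurrence count, and that the counts are normalised into probabilities before comparison. The function name and the site names ("SP1", "TATA", "GATA1") are made up for illustration.

```python
import math


def jensen_shannon_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence (log base 2) between two count profiles.

    counts_a, counts_b: dicts mapping a binding-site name to its occurrence
    count on one sequence (an assumed representation, not the real schema).
    Returns a value in [0, 1]: 0 for identical profiles, 1 for disjoint ones.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each sequence needs at least one binding-site occurrence")

    jsd = 0.0
    for site in set(counts_a) | set(counts_b):
        p = counts_a.get(site, 0) / total_a  # probability of site in sequence A
        q = counts_b.get(site, 0) / total_b  # probability of site in sequence B
        m = 0.5 * (p + q)                    # mixture distribution
        # Terms with zero probability contribute nothing (0 * log 0 := 0).
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


# Hypothetical binding-site counts for two sequences.
seq_a = {"SP1": 4, "TATA": 1, "GATA1": 3}
seq_b = {"SP1": 2, "TATA": 5}
divergence = jensen_shannon_divergence(seq_a, seq_b)
```

A lower value means the two sequences have more similar binding-site profiles. Because the measure is symmetric, only one value per unordered pair of sequences needs to be computed and stored.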

Oh, I almost forgot. Here is the link to the development images. Be aware. Scary. Trust me. Scary.

Now let’s go partying!

Mo.
