Hello!
This is my first post about my work on my DNAnalyser!
It is nothing fancy, so don’t get your hopes up (just yet). Here is a quick description of what it involves and the objectives:
- I have some 6,670 real human DNA sequences.
- Each sequence has 100-odd binding sites, each with its occurrence count on that specific sequence.
- I have to find the relative similarities between any given two sequences based on their binding sites.
- The data is huge. About 17 million records were computed by my algorithm. It took about 7 hours to finish (on my MBP). I know, I know. Why the MBP?!
- Now I have to find a few different clustering algorithms to cluster this data and prep it for analysis.
- Once the clustering part is done, I have to feed this data into a converter (that I’ll have to write), to prep it to be fed to a visualizer (I have to figure that out too).
- Here is the best part: I only have about 2 months to do all this. Plus an 80-page report!
Up to now (just to point out, it has been about 4 days since I actually started the implementation), I have done the following:
- I have written a CSV-to-MySQL converter. I know that there are CSV converters out there, but due to the amount of data, I couldn’t use them (e.g. OpenCSV). I had to write the converter myself.
- I have written the SimilarityMatrix calculator, which uses the Jensen–Shannon Divergence (JSD). This part took 7 hours on my MBP. This is a one-off event and won’t need to be repeated unless the DNA data has to be refreshed.
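For the curious, here is a minimal sketch of how a JSD score between two sequences could be computed from their binding-site counts. This is not the actual SimilarityMatrix calculator: the language (Python), the function names, and the choice of base-2 logarithms are all assumptions made for illustration. It treats each sequence's occurrence counts as a probability distribution over the binding sites and compares the two distributions.

```python
import math


def normalize(counts):
    """Turn raw occurrence counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence has no binding-site occurrences")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(counts_a, counts_b):
    """JSD between two sequences' binding-site count vectors (hypothetical helper).

    Both vectors must list the same binding sites in the same order.
    With base-2 logs the result lies between 0 (identical) and 1 (disjoint).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("count vectors must cover the same binding sites")
    p = normalize(counts_a)
    q = normalize(counts_b)
    # m is the average distribution; it is non-zero wherever p or q is non-zero,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A lower score means the two sequences use their binding sites in more similar proportions. Since the measure is symmetric, each pair only needs to be computed once, which roughly halves the work for a full similarity matrix.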
Oh, I almost forgot. Here is the link to the development images. Beware. Scary. Trust me. Scary.
Now let’s go partying!
Mo.
