DNA Sequence Pattern Match/ String Search Problem

Question: The DNA double helix consists of a long strand of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Thus, it can be represented as a long string containing the characters A, C, G, and T. The field of bio-informatics involves storing and searching of DNA sequence data. What is a good data structure to facilitate storage and string match/search operation on this kind of data? Write a class that stores the DNA sequence and implement a method which takes another DNA sub-sequence and returns the position of this sub-sequence in the first DNA sequence.

Solution: There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
I still have some bugs in my class implementation, so I will post the code in the next couple of days.

No comments:

Post a Comment