Showing posts with label Data Structures. Show all posts
Showing posts with label Data Structures. Show all posts

DNA Sequence Pattern Match/ String Search Problem

Question: The DNA double helix consists of a long strand of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Thus, it can be represented as a long string containing the characters A, C, G, and T. The field of bio-informatics involves storing and searching of DNA sequence data. What is a good data structure to facilitate storage and string match/search operation on this kind of data? Write a class that stores the DNA sequence and implement a method which takes another DNA sub-sequence and returns the position of this sub-sequence in the first DNA sequence.

Solution: There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
I still have some bugs in my class implementation, so I will post the code in the next couple of days.

Store a stream of numbers along with the counts

Problem: Given a stream of numbers, suggest a data structure to store the numbers in the stream along with their number of occurrences.

Solution: Before solving any problem make sure you lay down all the assumptions you are making and validate them with the interviewer.

Assumptions here:
  • The numbers are 32 bit integers.
  • The size of the set (cardinality) is very small compared to the possible number of integers (lets say 100 numbers)
  • Design for faster lookup
Since the size of the set is small, a data structure like an array of integers 2^32 in size will be a waste of space, even though the insert and lookup is going to be O(1). A hashtable would have a similar issue. A linked list would solve the memory issue by storing only the numbers that were present in the set but the look up suffers ( O(n) ). We would like the look up to be better than O(n).

So we've looked at an array, linked list, hashtable and none seem to solve it. Linked list looked promising but look up was O(n). Can we improve the lookup in a linked list? First of all, whats better than O(n)? O(log n), right? What data structure does that remind you of? A Binary Search Tree!!

Lets explore if we can utilize a Binary Search Tree here. So, if we started inserting numbers in a Binary Search Tree, each insertion would cost us O(log n). There are n number of insertions, so inserting all the numbers is going to be O(n log n). When we insert each number we also store the count at the node. So if a number is inserted for the first time we create a node in the tree and store the number itself and also the count (1 for the 1st occurrence). Any subsequent occurrence will not create a new node. Instead, we'll find the corresponding node and just update the count. So by the time we are done processing all the numbers we'll have a nice binary search tree with all the numbers and their occurrences.

Now the comes the look up part. Look up in a Binary Search Tree is O(log n). Any time we are presented a number and asked for the number of occurrences we can look up (or not ) the number and get its count.

If we were to design for faster insert we'd be better off using an unsorted linked list. Insert would have been O(1) (just insert at the end) and look up would be O(n). So for a small n this wouldn't be terribly bad.

Note: There is another data structure called the skip list which is a layered linked list type data structure which has O(log n) look up and insert. The linked list in the skip list is a sorted list and that would make our insert O(n log n).

Implement a Queue using two Stacks

Problem: Implement a queue using 2 stacks

Solution: The solution involves using one of the stacks as inbox stack. Incoming items are pushed onto this stack. The other stack is used as an outbox. When items need to be dequeued from the Queue and the outbox stack is empty, all the items from the inbox stack are popped and pushed on to the outbox stack. From there they can be popped until the outbox stack is empty. If the outbox is not empty then Dequeue operation is just a simple Pop() on the outbox stack.

The Enqueue and Dequeue methods for the Queue are as follows:
void Enqueue(int item)
{ 
  // all incoming items go on to the inboxStack
  inboxStack.push(item);
}

int Dequeue()
{
  //if the outbox stack has items in it just pop it from there and return
  if(outboxStack.Count > 0)
  {
     return outboxStack.Pop();
  }
  else
  {
     // move all items from the inbox stack to the outbox stack
     while(inboxStack.Count > 0)
     {
        outboxStack.Push(inboxStack.Pop());
     }
     if(outboxStack.Count > 0)
     {
        return outboxStack.Pop();
     }
  }
}

Tree: Breadth First Search

Problem: Write code for doing a breadth first search in a Tree data structure.

Solution: Breadth first search inspects items one level at a time starting with the root node. This is unlike the Depth First Search where the nodes in the left most branch are visited until there are no more nodes and then backtracks.

The problem with implementing a breadth first search is that we have to remember the list of nodes at the same level before moving to the next level. This can be achieved by using a queue data structure. See this post on details of how to implement a Queue using an array. Every time we visit a node, we inspect the data contained in the node before Enqueueing both children to the Queue. If the node contains the data we were searching for we return the pointer to that node.
//Breadth First Search (BFS) method searches the tree one level at a time

Node * Breadth-First-Search(Node *root, int searchValue)
{
    Queue queue;
    queue.Enqueue(root);
    Node * currentNode;
    while(currentNode = queue.Dequeue())
    {
       if(currentNode->data == searchVal)
           return currentNode;

       queue.Enqueue(currentNode->left);
       queue.Enqueue(currentNode->right);
    }
}

Implement a Queue using an Array

Problem: Implement a queue using an Array. Make efficient use of the space in the array.

Solution: Queue is a data structure that supports Enqueue and Dequeue methods. The Enqueue method adds an item to the end of the list and Dequeue method removes an item from the beginning of the list. This means that the Queue is a FIFO (First In First Out) data structure. Theoretically speaking a Queue doesn't have a fixed size. It grows as items are added to the end of the list. The implementation below has a fixed sized determined at the time of instantiating the Queue object. Ideally, the array should be reallocated to accommodate more items. The implementations makes use of the space efficiently by reusing space that is released when items are dequeued.

/************ Queue.h *********************/
// Queue class implements the Queue ADT(abstract data type) or Abstract Data Structure
// using an array as the underlying data structure
class Queue
{
    private:
        int *itemsArray;    // array used for storing items in the queue
        int startIndex;     // index in the array where the queue starts
        int size;           // size of the array holding the queue
        int count;          // number of items in queue

    public:
        Queue(int size);
        ~Queue();
        void Enqueue(int i);
        int Dequeue();
        int GetCount();
        void Print();
};
/************ Queue.cpp *********************/
#include "stdafx.h"
#include "malloc.h"
#include "Queue.h"

Queue::Queue(int size)
{
    this->size = size;
    itemsArray = (int *)malloc(sizeof(int) * size);
    startIndex = 0;
    count = 0;
}

Queue::~Queue()
{
    free(itemsArray);
}

// Enqueue method adds an item to the end of the queue
void Queue::Enqueue(int i)
{
    if(count < size) // queue is not full
    {
        int destinationIndex =  (startIndex + count) % size;
        itemsArray[destinationIndex] = i;
        count++;
    }
    else
    {
        printf("Queue is full...\n");
    }
}

// Dequeue method removes an item from the front of the queue
int Queue::Dequeue()
{
    if(count == 0)
    {
        printf("queue is empty!\n");
        throw "queue is empty";
    }
    int ret = itemsArray[startIndex];
    count--;
    startIndex = (startIndex + 1)%size;
    return ret;
}

int Queue::GetCount()
{
    return count;
}

// Print the current contents of the queue
void Queue::Print()
{
    printf("Queue contents: ");
    
   for(int i = 0; i<count;i++)
   {
        printf("%d, ", itemsArray[(startIndex + i)%size]);
   }
   printf("\n");
}