Determine and display all anagrams in a string array

Question: Given an array of strings (string[] inputStrings). Devise a way and write the code to determine and display all the anagrams in the array.

Solution: With any interview question one should get the specifics of the problem and the input data before designing the solution. In this particular case, we should ask about the approximate number of strings in the array, average length of each string, memory constraints, CPU usage constraints, etc. Each factor can influence the design and selection of algorithm and the data structures used.

The solution I am presenting here optimizes for speed at the cost of extra memory. The basic idea is to come up with simple hash or signature for each string in the input string array. This signature is nothing but the characters in the string sorted alphabetically. So, the signature for "cat" will be "act" and for "tac" it will be "act" too. As you can see the signature is the same for 2 strings that are anagrams. We can use this property to identify and collect the anagrams in the inputString array together. We can use a hash table for fast insert and retrieve these anagrams, but will cost us extra memory.

Note: There are other solutions that don't use a hashtable, but involves CPU cost is higher and involves sorting of the signature. I will post this alternative solution soon. If you have a better method, please write a comment.

Pseudo-Code:

Loop for each string (S) in the array:
  1. Get signature (K) for S (sort the characters of S to get its signature (K))
  2. Insert S in hash table with signature K as the key and append S to the array list contained in the value of the hash table key.
  3. If K already exists, don't add a new key (since it will cause collision), but just append S to the end of the list in that hash table slot.
  4. Goto step 1 with next available string.
  5. In the end just print all the keys in the hash table along with the list of strings in each slot.


Code (C#): I am avoiding use of generics here for convenience of users of other languages.

using System;
using System.Text;
using System.Collections;

namespace anagram_problem
{
   class Program
   {
       static void Main(string[] args)
       {
           string[] inputStrings = new string[] { "cat", "tac", "army", "pam", "act", "map", "mary", "self" };
           PrintAnagrams(inputStrings);
       }

       //
       // given a set of strings group all the strings that are anagrams
       // "cat", "tac", "army", "pam", "act", "map", "mary", "self"
       // "act": "act", "cat", "tac"
       // "army": "army", "mary"
       // "map": "map", "pam"
       // "self": "self"
       
       static void PrintAnagrams(string[] inputStrings)
       {
           Hashtable hashTable = new Hashtable();
           foreach (string str in inputStrings)
           {
               string signature = GetAnagramSignature(str);
               if (!hashTable.ContainsKey(signature))
               {
                   // this will be used to store our anagrams for this particular signature
                   ArrayList newStringList = new ArrayList();
                   newStringList.Add(str);

                   hashTable.Add(signature, newStringList);
               }
               else
               {
                   // we have already encountered this signature so we just
                   // need to append the current str to the list
                   ArrayList list = (ArrayList)hashTable[signature];
                   list.Add(str);
               }
           }

           // now we print all the anagrams for each hash key
           foreach (string key in hashTable.Keys)
           {
               Console.Write("{0}: ", key);
             
               ArrayList list = (ArrayList)hashTable[key];
               foreach (string str in list)
               {
                   Console.Write("{0},", str);
               }
               Console.Write(Environment.NewLine);
           }
       }

       //
       // this method will sort the characters in the input string alphabetically
       // this sorted string will be out hash key
       //
       static string GetAnagramSignature(string input)
       {
           char[] charArray = input.ToLower().ToCharArray();
           Array.Sort<char>(charArray, new Comparison<char>(CompareCharacters));
           return new string(charArray);
       }

       static int CompareCharacters(char a, char b)
       {
           if (a > b)
               return 1;
           else if (b > a)
               return -1;
           else
               return 0;
       }
   }
}


Print 2D Array Matrix Spiral Order

Question: Given a 2D array / matrix of integers. Write a program to print the elements in spiral order. Consider a matrix as show in the diagram to the right. The desired output of the program should be as: 1,2,3,4,8,12,16,20,19,18,17,13,9,5,6, 7,11,15,14,10.

matrix-spiral-print

Solution: There are several ways to solve this problem, but I am mentioning a method that is intuitive to understand and easy to implement. The idea is to consider the matrix similar to onion which can be peeled layer after layer. We can use the same approach to print the outer layer of the matrix and keep doing it recursively on a smaller matrix (with 1 less row and 1 less column).

Refer to the image below for a visual explanation. We start by printing the top-right layer of the matrix by calling the print_layer_top_right. It will print 1,2,3,4,8,12,16,20. The print_layer_top_right method then calls the print_layer_bottom_left method which will print 19,18,17,13,9,5. If you observe the size of the target matrix is reducing after each call. Each method basically calls the other method and passes the matrix indexes for the reduced matrix. Both methods take 4 index parameters which represent the target matrix. When the target matrix size is such that there is only one layer left the recursion terminates and by this time the code has printed all the numbers in the full matrix.

matrix-spiral-print-after-1st-iteration

Code (C language):
#include <stdio.h>
void print_layer_top_right(int a[][4], int x1, int y1, int x2, int y2);
void print_layer_bottom_left(int a[][4], int x1, int y1, int x2, int y2);

int main(void)
{
       int a[5][4] = {
                       {1,2,3,4},
                       {5,6,7,8},
                       {9,10,11,12},
                       {13,14,15,16},
                       {17,18,19,20}
                   };

       print_layer_top_right(a,0,0,3,4);
}

//
// prints the top and right shells of the matrix
//
void print_layer_top_right(int a[][4], int x1, int y1, int x2, int y2)
{
      int i = 0, j = 0;

      // print the row
      for(i = x1; i<=x2; i++)
      {
         printf("%d,", a[y1][i]);
      }
 
      //print the column         
      for(j = y1 + 1; j <= y2; j++)         
      {
         printf("%d,", a[j][x2]);
      }

      // see if  we have more cells left         
      if(x2-x1 > 0)
      {
          // 'recursively' call the function to print the bottom-left layer
          print_layer_bottom_left(a, x1, y1 + 1, x2-1, y2);
      }
}

//
// prints the bottom and left shells of the matrix
//
void print_layer_bottom_left(int a[][4], int x1, int y1, int x2, int y2)
{
       int i = 0, j = 0;

       //print the row of the matrix in reverse
       for(i = x2; i>=x1; i--)
       {
               printf("%d,", a[y2][i]);
       }

       // print the last column of the matrix in reverse
       for(j = y2 - 1; j >= y1; j--)
       {
               printf("%d,", a[j][x1]);
       }

       if(x2-x1 > 0)
       {
               // 'recursively' call the function to print the top-right layer
               print_layer_top_right(a, x1+1, y1, x2, y2-1);
       }
}

DNA Sequence Pattern Match/ String Search Problem

Question: The DNA double helix consists of a long strand of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine (T). Thus, it can be represented as a long string containing the characters A, C, G, and T. The field of bio-informatics involves storing and searching of DNA sequence data. What is a good data structure to facilitate storage and string match/search operation on this kind of data? Write a class that stores the DNA sequence and implement a method which takes another DNA sub-sequence and returns the position of this sub-sequence in the first DNA sequence.

Solution: There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
I still have some bugs in my class implementation, so I will post the code in the next couple of days.