Determine and display all anagrams in a string array

Question: Given an array of strings (string[] inputStrings). Devise a way and write the code to determine and display all the anagrams in the array.

Solution: With any interview question one should get the specifics of the problem and the input data before designing the solution. In this particular case, we should ask about the approximate number of strings in the array, average length of each string, memory constraints, CPU usage constraints, etc. Each factor can influence the design and selection of algorithm and the data structures used.

The solution I am presenting here optimizes for speed at the cost of extra memory. The basic idea is to come up with simple hash or signature for each string in the input string array. This signature is nothing but the characters in the string sorted alphabetically. So, the signature for "cat" will be "act" and for "tac" it will be "act" too. As you can see the signature is the same for 2 strings that are anagrams. We can use this property to identify and collect the anagrams in the inputString array together. We can use a hash table for fast insert and retrieve these anagrams, but will cost us extra memory.

Note: There are other solutions that don't use a hashtable, but involves CPU cost is higher and involves sorting of the signature. I will post this alternative solution soon. If you have a better method, please write a comment.

Pseudo-Code:

Loop for each string (S) in the array:
  1. Get signature (K) for S (sort the characters of S to get its signature (K))
  2. Insert S in hash table with signature K as the key and append S to the array list contained in the value of the hash table key.
  3. If K already exists, don't add a new key (since it will cause collision), but just append S to the end of the list in that hash table slot.
  4. Goto step 1 with next available string.
  5. In the end just print all the keys in the hash table along with the list of strings in each slot.


Code (C#): I am avoiding use of generics here for convenience of users of other languages.

using System;
using System.Text;
using System.Collections;

namespace anagram_problem
{
   class Program
   {
       static void Main(string[] args)
       {
           string[] inputStrings = new string[] { "cat", "tac", "army", "pam", "act", "map", "mary", "self" };
           PrintAnagrams(inputStrings);
       }

       //
       // given a set of strings group all the strings that are anagrams
       // "cat", "tac", "army", "pam", "act", "map", "mary", "self"
       // "act": "act", "cat", "tac"
       // "army": "army", "mary"
       // "map": "map", "pam"
       // "self": "self"
       
       static void PrintAnagrams(string[] inputStrings)
       {
           Hashtable hashTable = new Hashtable();
           foreach (string str in inputStrings)
           {
               string signature = GetAnagramSignature(str);
               if (!hashTable.ContainsKey(signature))
               {
                   // this will be used to store our anagrams for this particular signature
                   ArrayList newStringList = new ArrayList();
                   newStringList.Add(str);

                   hashTable.Add(signature, newStringList);
               }
               else
               {
                   // we have already encountered this signature so we just
                   // need to append the current str to the list
                   ArrayList list = (ArrayList)hashTable[signature];
                   list.Add(str);
               }
           }

           // now we print all the anagrams for each hash key
           foreach (string key in hashTable.Keys)
           {
               Console.Write("{0}: ", key);
             
               ArrayList list = (ArrayList)hashTable[key];
               foreach (string str in list)
               {
                   Console.Write("{0},", str);
               }
               Console.Write(Environment.NewLine);
           }
       }

       //
       // this method will sort the characters in the input string alphabetically
       // this sorted string will be out hash key
       //
       static string GetAnagramSignature(string input)
       {
           char[] charArray = input.ToLower().ToCharArray();
           Array.Sort<char>(charArray, new Comparison<char>(CompareCharacters));
           return new string(charArray);
       }

       static int CompareCharacters(char a, char b)
       {
           if (a > b)
               return 1;
           else if (b > a)
               return -1;
           else
               return 0;
       }
   }
}


5 comments:

  1. I think for signature, we can use the XOR value of string i.e. for cat, XOR value will be (c^a^t). It will O(L) time where L is length of string.

    ReplyDelete
  2. I dont think we can use XOR for key.
    char c1='a'^'a';
    char c2='c'^'c';
    if(c1==c2)
    System.out.println("both same");
    else
    System.out.println("both not same");
    Here the o/p is "both same"

    ReplyDelete
  3. here's a rather funny and humors way to explain anagrams check out what's donald knuth, alan tourin'g anagrams are on FULLBC webcomic
    http://fullbc.in/?comicid=145

    ReplyDelete
  4. C++ implementation for grouping strings by anagrams
    http://yourbitsandbytes.com/viewtopic.php?f=10&t=42&p=3219

    ReplyDelete
  5. why not use prime number approach?
    Assign a unique prime number to each letter in the alphabet and obtain the product of numbers that represent the letters in the given words. If they are the same then they are anagrams. The order is O(n) to parse through the length of the word. Perhaps the only weakness is the multiplication of large numbers but i don't think that could be a big problem.

    ReplyDelete