Anagram Search program | BASeCamp Programming Blog

And now, for something completely different- a post that actually is about programming. A short overview of anagrams and a breakdown of a common algorithm that is used for discovering them.

Breakdown

An anagram is one word that is another word when you rearrange the letters; for example, “deposit” and “posited” are anagrams of one another; as are “dare” and “dear”. The idea behind the algorithm is that, given a list of words, to find out which words are anagrams of one another. This sounds fairly simple- one could conceivably create a function That determines if two words are anagrams of one another (by looping through the characters, comparing length, etc) and then use that function on every single possible pairing of the words in a dictionary.

This approach is simple-minded, and will take an intractably long time based on the number of words. For example, Linux systems generally have a word dictionary located at /user/share/dict/words not unlike the one you can download here; in this case, the dictionary contains over 235,923 different words. Simply iterating through every word and comparing it to every other word will take 235,923 squared calls to the aforementioned “isAnagram” function; that’s over 55,659,661,929 separate comparisons! Even if the IsAnagram Function only took a millisecond to execute, that’s still over 55,656,661 seconds, or over two years. Clearly this is not a good solution.

One method that problems like these can be solved is to realize that the best way of going about it is to change the problem into something that will “sort itself out” so to speak. For example, rather then searching through every word and comparing it to every other word to find a match, one would be better off finding out what the words that are anagrams have in common, and how we can use that to our advantage. In the case of anagrams, what they have in common is rather simple; they have the same letters. The question is, how can we leverage this in a way that makes it easier to find groups of words that are anagrams of one another?

If we were able to create a unique value that all anagrams share that can be used to correlate them, we can create a data structure and only iterate through the list of words a single time. This value is the sorted letters of the word.

For example, both posited and deposit, when their letters are sorted create deiopst; in fact, it seems quite logical to conclude that all words that are anagrams of each other will create the same sorted sequence. This makes the entire thing almost crystal clear.

Instead of comparing every word to every other word in a brute-force approach, we can instead use a List indexed by the sorted strings, where each item contains a list of the words found that “sort” to that set of letters. For each word, we sort the letters, and then we look in the dictionary; if there is a existing item for that sorted set, we add it to the list that is already there, otherwise, we add a new one. Rather straightforward.

This is in fact a very old algorithm, conceived as early as the 60’s. The original concept was merely to bring together those words that were anagrams; the list was sorted based on the sorted letter set of each word, meaning that words whose component letters sorted the same would be contiguous in the list.

Here is the implementation of the Program in C#:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

namespace anagrammer
{
    class Program
    {
        //anagrammer: takes a word list on the command line and finds all words that are anagrams of one another.
        static void Main(string[] args)
        {
            Console.WriteLine(@"""Anagrammer"" program for loading and finding anagrams in a list of words.
Nov. 13 2010 Written by Michael Burgwin (http://bc-programming.com)");
            foreach (String loopargument in args)
            {
                EmitAnagrams(loopargument);
            }
        }
        public static void EmitAnagrams(String wordfile)
        {
            StreamReader wordreader = File.OpenText(wordfile);
            Console.WriteLine("Processing file..." + wordfile);
            String alltext = wordreader.ReadToEnd();
            String[] words = alltext.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);

            Console.WriteLine("Total words read:" + words.Length +"\n");

            EmitAnagrams(words);
        }
        /// <summary>
        /// sorts the characters in the string, and returns the resulting "sorted" string.
        /// </summary>
        /// <param name="sumstring">String to sort.</param>
        /// <returns>Sorted string of the characters in SumString.</returns>
        static String sortchars(String sumstring)
        {
            List<char> sortme = sumstring.ToList();
            sortme.Sort((a, b) => a.CompareTo(b));
            return new string(sortme.ToArray());
            //return sumstring.ToList().Sort((a, b) => a.CompareTo(b)));
        }
       public static void EmitAnagrams(String[] words)
        {
            //algorithm: create a Dictionary; the keys are ints which are the sum of the character values
            //of the strings within, and the element is a list of strings.
            Dictionary<string , List<String>> anagramdict = new Dictionary<string , List<string>>();
            Console.WriteLine("creating Dictionary lookup tables...");
            Console.WriteLine("  Only considering words > 4 characters.");
            foreach (String loopword in words)
            {
                if (loopword.Length > 4)
                {
                    String hashcode = sortchars(loopword.ToLower());
                    List<string> listuse = null;
                    if (!(anagramdict.TryGetValue(hashcode, out listuse)))
                    {
                        listuse = new List<string>();
                       
                            listuse.Add(loopword.ToLower());
                        anagramdict.Add(hashcode, listuse);
                    }
                    else
                    {
                        if (!listuse.Exists((w) => (w.Equals(loopword.ToLower()))))
                            listuse.Add(loopword);
                    }
                }
            }
            Console.WriteLine("Finished! results:\n\n");
            List<keyvaluepair <String, List<String>>> sortedanagrams = anagramdict.ToList();
            //sort the list based on the size of each List therein...
            sortedanagrams.Sort((a, b) => a.Value.Count.CompareTo(b.Value.Count));
            foreach (KeyValuePair<string , List<String>> looplists in sortedanagrams)
            {
                if (looplists.Value.Count > 1)
                {
                    foreach (String loopstring in looplists.Value)
                    {
                        Console.Write(loopstring + ",");
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}</keyvaluepair></string></char>

using System;

using System.Collections.Generic;

using System.IO;

using System.Linq;

using System.Text;

namespace anagrammer

{

class Program

{

//anagrammer: takes a word list on the command line and finds all words that are anagrams of one another.

static void Main(string[] args)

{

Console.WriteLine(@"""Anagrammer"" program for loading and finding anagrams in a list of words.

Nov. 13 2010 Written by Michael Burgwin (http://bc-programming.com)");

foreach (String loopargument in args)

{

EmitAnagrams(loopargument);

}

public static void EmitAnagrams(String wordfile)

{

StreamReader wordreader = File.OpenText(wordfile);

Console.WriteLine("Processing file..." + wordfile);

String alltext = wordreader.ReadToEnd();

String[] words = alltext.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);

Console.WriteLine("Total words read:" + words.Length +"\n");

EmitAnagrams(words);

}

/// <summary>

/// sorts the characters in the string, and returns the resulting "sorted" string.

/// </summary>

/// <param name="sumstring">String to sort.</param>

/// <returns>Sorted string of the characters in SumString.</returns>

static String sortchars(String sumstring)

{

List<char> sortme = sumstring.ToList();

sortme.Sort((a, b) => a.CompareTo(b));

return new string(sortme.ToArray());

//return sumstring.ToList().Sort((a, b) => a.CompareTo(b)));

}

public static void EmitAnagrams(String[] words)

{

//algorithm: create a Dictionary; the keys are ints which are the sum of the character values

//of the strings within, and the element is a list of strings.

Dictionary<string , List<String>> anagramdict = new Dictionary<string , List<string>>();

Console.WriteLine("creating Dictionary lookup tables...");

Console.WriteLine(" Only considering words > 4 characters.");

foreach (String loopword in words)

{

if (loopword.Length > 4)

{

String hashcode = sortchars(loopword.ToLower());

List<string> listuse = null;

if (!(anagramdict.TryGetValue(hashcode, out listuse)))

{

listuse = new List<string>();

listuse.Add(loopword.ToLower());

anagramdict.Add(hashcode, listuse);

}

else

{

if (!listuse.Exists((w) => (w.Equals(loopword.ToLower()))))

listuse.Add(loopword);

}

Console.WriteLine("Finished! results:\n\n");

List<keyvaluepair <String, List<String>>> sortedanagrams = anagramdict.ToList();

//sort the list based on the size of each List therein...

sortedanagrams.Sort((a, b) => a.Value.Count.CompareTo(b.Value.Count));

foreach (KeyValuePair<string , List<String>> looplists in sortedanagrams)

{

if (looplists.Value.Count > 1)

{

foreach (String loopstring in looplists.Value)

{

Console.Write(loopstring + ",");

}

Console.WriteLine();

}

}</keyvaluepair></string></char>

Here, I leverage Generics, and also nest Generic Classes within one another; note the Dictionary> type, which creates a Dictionary that indexes lists of Strings via a string; in this instance, it indexes the lists of anagramatic words by the sorted letters of that word (which is merely a string, since a string is by definition a list of letters). On the aforementioned dictionary file it took about half a second to process the entire file and create the output. The results can be found here. The dictionary appears to have a few esoteric terms; also, as can be seen in the source, it ignores all words less then 5 characters in length.

Have something to say about this post? Comment!

Tags: .NET, Algorithms, C#, Functional Programming