Menu

Anagram Search program

November 15, 2010 - Programming

And now, for something  completely different- a post that actually is about programming. A short overview of anagrams and a breakdown of a common algorithm that is used for discovering them.

Breakdown

An anagram is one word that is another word when you rearrange the letters; for example, “deposit” and “posited” are anagrams of one another; as are “dare” and “dear”. The idea behind the algorithm is that, given a list of words, to find out which words are anagrams of one another. This sounds fairly simple- one could conceivably create a function That determines if two words are anagrams of one another (by looping through the characters, comparing length, etc) and then use that function on every single possible pairing of the words in a dictionary.

This approach is simple-minded, and will take an intractably long time based on the number of words. For example, Linux systems generally have a word dictionary located at /user/share/dict/words not unlike the one you can download here; in this case, the dictionary contains over 235,923 different words. Simply iterating through every word and comparing it to every other word will take 235,923 squared calls to the aforementioned “isAnagram” function; that’s over 55,659,661,929 separate comparisons! Even if the IsAnagram Function only took a millisecond to execute, that’s still over 55,656,661 seconds, or over two years. Clearly this is not a good solution.

One method that problems like these can be solved is to realize that the best way of going about it is to change the problem into something that will “sort itself out” so to speak. For example, rather then searching through every word and comparing it to every other word to find a match, one would be better off finding out what the words that are anagrams have in common, and how we can use that to our advantage. In the case of anagrams, what they have in common is rather simple; they have the same letters. The question is, how can we leverage this in a way that makes it easier to find groups of words that are anagrams of one another?

If we were able to create a unique value that all anagrams share that can be used to correlate them, we can create a data structure and only iterate through the list of words a single time. This value is the sorted letters of the word.

For example, both posited and deposit, when their letters are sorted create deiopst; in fact, it seems quite logical to conclude that all words that are anagrams of each other will create the same sorted sequence. This makes the entire thing almost crystal clear.

Instead of comparing every word to every other word in a brute-force approach, we can instead use a List indexed by the sorted strings, where each item contains a list of the words found that “sort” to that set of letters. For each word, we sort the letters, and then we look in the dictionary; if there is a existing item for that sorted set, we add it to the list that is already there, otherwise, we add a new one. Rather straightforward.

This is in fact a very old algorithm, conceived as early as the 60’s. The original concept was merely to bring together those words that were anagrams; the list was sorted based on the sorted letter set of each word, meaning that words whose component letters sorted the same would be contiguous in the list.

Here is the implementation of the Program in C#:

Here, I leverage Generics, and also nest Generic Classes within one another; note the Dictionary> type, which creates a Dictionary that indexes lists of Strings via a string; in this instance, it indexes the lists of anagramatic words by the sorted letters of that word (which is merely a string, since a string is by definition a list of letters). On the aforementioned dictionary file it took about half a second to process the entire file and create the output. The results can be found here. The dictionary appears to have a few esoteric terms; also, as can be seen in the source, it ignores all words less then 5 characters in length.

Have something to say about this post? Comment!