Menu

Anagram Search Program Part XVI: QuickBASIC

June 9, 2017 - Programming

This is part of a occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are Articles written regarding Python,C#, Java and VB6 (merged for some reason),Scala,F# & Ruby,Perl, Delphi, PHP,C++,Haskell,D, VB.NET, QuickBASIC, and Rust

QuickBASIC is an out of place choice when compared to most other languages that I’ve written in this series. Why would I jump so far backwards to QuickBASIC?

There are actually an umber of reasons. The first is that QuickBASIC actually imposes a number of limitations. Aside from the more limited programming language compared to, say C#, it also means any solution needs to appropriately contend with issues such as Memory usage and Open File Handles on MS-DOS. At the same time, a lot of the development task is actually more simple; one doesn’t need to fiddle with designers, or property pages or configuration tabs, or anything of that sort. You open a text file and start writing the program.

The first task is to determine an algorithm. Of course, we know the Algorithm- it’s been described previously- However, in this instance, we don’t have hashmaps available; furthermore, even if we want to implement that ourself, we cannot even keep all the information in memory. As a result, one compromise is to instead keep an array of index information in memory; that array can contain the sorted word as well as a record index into another random-access file, so, to start, we have these two TYPE structures:

By writing and reading directly from a scratch file when we need to add a new file to the “hash” we can avoid having any of the SORTRECORD structures in memory except the one we are working with. This drastically reduces our memory usage. As did determining that the longest word in SORTINDEX is 28 characters/bytes. The algorithm thus becomes similar- basically, with a word, we sort the words letters, and then we consult the array of SORTINDEX types. If we find one with the sorted word, we take the OFFSET and we read in the SORTRECORD at that offset, increment wordcount, and add the word to the SORTWORDS array, then PUT it back into the scratch file. And if it isn’t found in the SORTINDEX, we create a new entry- saving a new record with the word to the scratch file and recording the offset and sorted text in the index for that record.

Of course this does have several inefficiencies that I won’t address; The first is that the search for the matching sorted word is effectively a sequential search. Ideally, the in-memory index would be kept sorted and searches could use a Binary Search. I guess if somebody is interested I “left it as an exercise for the reader”.

Otherwise all seems well. But not so fast- the dict.txt file has 45402 words. Our type definition is 32 bytes, which means for all words to be stored in the index, we would need 1,452,864 bytes, which is far beyond the conventional memory limits that we are under. So we need to drastically reduce the memory usage of our algorithm. And we had something so promising! Seems like it’s back to the drawing board.

Or is it? instead of trying to reduce how much our algorithm uses, we could reduce how much data it is working with. At a time. We can split the original dictionary file into chunks, and as it happens since words of different lengths cannot be anagrams of each other, we can merely split the file into separate file organized by length. Then we perform the earlier algorithm on each of those files and output the resulting anagram list of each to one file. That would give us one file listing all anagrams without exceeding memory limitations!

Before we get too excited, let’s make sure that the largest “chunk” would be small enough. using another QuickBASIC program (because, what the hell, right?) I checked over the count of files of particular lengths. In this case, the chunk with the most is words of 7 letters in length, of which there are 7371 in the test dictionary file. This would require 235,872 Bytes of storage, which is well within our 640K conventional memory limit.

Of course, there is a minor caveat; we do need to start QuickBASIC with certain command line arguments, as, by default, the dynamic array maximum is actually 64K. We do this by launching it with the /Ah command line parameter. Otherwise, we might find that it encounters Subscript out of range errors once we get beyond around the 2000 mark for our 32-byte index record type.

Another consideration I encountered was open files. I had it opening all the dictionary output files at once, but it maxed out at 16 files, so I had to refactor it to be much slower by reading a line, determining the file to open, writing the line, and them closing the file. Again, there may be a better technique here to increase performance. For reference, I wasn’t able to find how to increase the limit, either (adjusting config.sys didn’t help).

After that, it worked a treat- the primary algorithm runs on each length subset, and writes the results to an output file.

Without further Adieu- The Full source of this “solution”:

And there you have it. an Anagram search program written in QuickBASIC. Of course, it is arather basic and is a bit picky about preconditions (hard-coded for a specific file, for example) but it was largely written against my test VM.

Have something to say about this post? Comment!