This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various languages and how they might be used for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, VB.NET, QuickBASIC, and Rust.
QuickBASIC is an out-of-place choice compared to most of the other languages I’ve written about in this series. Why would I jump so far backwards to QuickBASIC?
There are actually a number of reasons. The first is that QuickBASIC imposes a number of limitations. Aside from the more limited programming language compared to, say, C#, it also means any solution needs to appropriately contend with issues such as memory usage and open file handles on MS-DOS. At the same time, a lot of the development task is actually simpler; one doesn’t need to fiddle with designers, or property pages, or configuration tabs, or anything of that sort. You open a text file and start writing the program.
The first task is to determine an algorithm. Of course, we know the algorithm- it’s been described previously. However, in this instance, we don’t have hashmaps available; furthermore, even if we wanted to implement one ourselves, we couldn’t keep all the information in memory. As a result, one compromise is to instead keep an array of index information in memory; that array can contain the sorted word as well as a record index into another random-access file. So, to start, we have these two TYPE structures:
TYPE SORTINDEX
    SORTED AS STRING * 28
    OFFSET AS LONG
END TYPE

TYPE SORTRECORD
    WORDCOUNT AS INTEGER
    SORTWORDS(1 TO 20) AS STRING * 28
END TYPE
By writing and reading directly from a scratch file when we need to add a new word to the “hash”, we can avoid having any of the SORTRECORD structures in memory except the one we are working with. This drastically reduces our memory usage, as did determining that the longest word in the dictionary is 28 characters/bytes. (The fixed SORTRECORD layout also gives us the record length for the random-access scratch file: 2 bytes for WORDCOUNT plus 20 × 28 bytes for SORTWORDS, or 562 bytes- the LEN we pass to OPEN.) The algorithm thus becomes: given a word, we sort the word’s letters, then consult the array of SORTINDEX types. If we find one with the sorted word, we take the OFFSET, read in the SORTRECORD at that offset, increment WORDCOUNT, add the word to the SORTWORDS array, then PUT it back into the scratch file. And if it isn’t found in the SORTINDEX, we create a new entry- saving a new record with the word to the scratch file and recording the offset and sorted text in the index for that record.
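In code, the heart of that loop ends up looking like this (condensed from the full listing further down):

SORTEDWORD = SORTCHARS(READWORD)
RECORDNUM = GETOFFSET(SORTEDWORD)
IF RECORDNUM = -1 THEN
    'New sorted form: grow the index and append a fresh record.
    INDEXCOUNT = INDEXCOUNT + 1
    REDIM PRESERVE SHARED INDEXDATA(1 TO INDEXCOUNT) AS SORTINDEX
    INDEXDATA(INDEXCOUNT).SORTED = SORTEDWORD
    INDEXDATA(INDEXCOUNT).OFFSET = INDEXCOUNT
    BUILDRECORD.WORDCOUNT = 1
    BUILDRECORD.SORTWORDS(1) = READWORD
    PUT #INDEXHANDLE, INDEXCOUNT, BUILDRECORD
ELSE
    'Existing record: read it in, add the word, write it back.
    GET #INDEXHANDLE, RECORDNUM, BUILDRECORD
    BUILDRECORD.WORDCOUNT = BUILDRECORD.WORDCOUNT + 1
    BUILDRECORD.SORTWORDS(BUILDRECORD.WORDCOUNT) = READWORD
    PUT #INDEXHANDLE, RECORDNUM, BUILDRECORD
END IF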
Of course, this does have several inefficiencies that I won’t address. The first is that the search for the matching sorted word is effectively a sequential search; ideally, the in-memory index would be kept sorted and searches could use a binary search. I guess if somebody is interested, I’ve “left it as an exercise for the reader”.
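For anyone who does want it, a minimal sketch of that search might look like the following- assuming the in-memory index were kept sorted by its SORTED field on insertion, which the program below does not do. BINOFFSET here is a hypothetical stand-in for the GETOFFSET function in the full source:

FUNCTION BINOFFSET% (SORTEDWORD AS STRING)
    'Binary search over INDEXDATA(), assumed sorted by the SORTED field.
    DIM LO AS INTEGER
    DIM HI AS INTEGER
    DIM MIDPT AS INTEGER
    DIM PROBE AS STRING
    LO = 1
    HI = INDEXCOUNT
    DO WHILE LO <= HI
        MIDPT = (LO + HI) \ 2
        'SORTED is space-padded to 28 bytes, so compare on the word's length.
        PROBE = MID$(INDEXDATA(MIDPT).SORTED, 1, LEN(SORTEDWORD))
        IF PROBE = SORTEDWORD THEN
            BINOFFSET = INDEXDATA(MIDPT).OFFSET
            EXIT FUNCTION
        ELSEIF PROBE < SORTEDWORD THEN
            LO = MIDPT + 1
        ELSE
            HI = MIDPT - 1
        END IF
    LOOP
    BINOFFSET = -1
END FUNCTION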
Otherwise all seems well. But not so fast- the dict.txt file has 45,402 words. Our SORTINDEX type is 32 bytes (28 for the sorted word plus 4 for the LONG offset), which means storing all words in the index would need 1,452,864 bytes, far beyond the conventional memory limits we are under. So we need to drastically reduce the memory usage of our algorithm. And we had something so promising! Seems like it’s back to the drawing board.
Or is it? Instead of trying to reduce how much memory our algorithm uses, we could reduce how much data it works with at a time. We can split the original dictionary file into chunks- and, as it happens, since words of different lengths cannot be anagrams of each other, we can simply split the file into separate files organized by length. Then we perform the earlier algorithm on each of those files and append the resulting anagram list of each to one file. That would give us one file listing all anagrams without exceeding memory limitations!
Before we get too excited, let’s make sure that the largest “chunk” would be small enough. Using another QuickBASIC program (because, what the hell, right?) I counted the words of each particular length. In this case, the biggest chunk is words of 7 letters, of which there are 7,371 in the test dictionary file. This would require 235,872 bytes of index storage, which is well within our 640K conventional memory limit.
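That counting program isn’t reproduced here, but it would have been something along these lines (a sketch, assuming the same C:\DICT\DICT.TXT input path the main program uses):

'Tally how many words of each length the dictionary contains,
'to confirm the largest chunk fits in conventional memory.
DIM COUNTS(1 TO 32) AS LONG
DIM FH AS INTEGER
DIM LINEREAD AS STRING
DIM L AS INTEGER
FH = FREEFILE
OPEN "C:\DICT\DICT.TXT" FOR INPUT AS #FH
DO WHILE NOT EOF(FH)
    LINE INPUT #FH, LINEREAD
    L = LEN(LINEREAD)
    IF L >= 1 AND L <= 32 THEN COUNTS(L) = COUNTS(L) + 1
LOOP
CLOSE #FH
FOR L = 1 TO 32
    IF COUNTS(L) > 0 THEN PRINT L; "letters:"; COUNTS(L); "words"
NEXT L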
Of course, there is a minor caveat: we do need to start QuickBASIC with a certain command-line argument, as, by default, the maximum size of a dynamic array is 64K. We do this by launching it with the /Ah command-line parameter. Otherwise, we might find that it encounters “Subscript out of range” errors once we get beyond around the 2,000 mark for our 32-byte index record type (64K divided by 32 bytes is 2,048 elements).
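For example, assuming the source were saved as ANAGRAM.BAS (the filename here is just illustrative):

QB /Ah ANAGRAM.BAS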
Another consideration I encountered was open files. I originally had it opening all the dictionary output files at once, but it maxed out at 16 files, so I had to refactor it to be much slower: reading a line, opening the appropriate file, writing the line, and then closing the file. Again, there may be a better technique here to increase performance. For reference, I wasn’t able to find a way to increase the limit, either (adjusting CONFIG.SYS didn’t help).
After that, it worked a treat- the primary algorithm runs on each length subset and writes the results to an output file.
Without further ado- the full source of this “solution”:
DECLARE SUB SPLITDICT ()
DECLARE SUB ADDWORD (WORDADD AS STRING)
DECLARE FUNCTION GETOFFSET% (SORTEDWORD AS STRING)
DECLARE FUNCTION GETBYTEOFFSET% (SORTEDWORD AS STRING)
DECLARE FUNCTION SORTCHARS$ (ST AS STRING)

TYPE SORTINDEX
    SORTED AS STRING * 28
    OFFSET AS LONG
END TYPE

TYPE SORTRECORD
    WORDCOUNT AS INTEGER
    SORTWORDS(1 TO 20) AS STRING * 28
END TYPE

DIM x AS INTEGER
COMMON SHARED INDEXDATA() AS SORTINDEX
COMMON SHARED INDEXFILE AS STRING
COMMON SHARED INDEXCOUNT AS INTEGER
COMMON SHARED INDEXHANDLE AS INTEGER

DIM USEFILE AS INTEGER
DIM READWORD AS STRING
DIM READINDEX AS LONG
DIM ANAGRAMOUT AS INTEGER

SPLITDICT
ANAGRAMOUT = FREEFILE
OPEN "C:\ANAGRAMS.TXT" FOR OUTPUT AS ANAGRAMOUT

DIM LETTERCOUNT AS INTEGER
FOR LETTERCOUNT = 2 TO 32
    INDEXFILE = "C:\DICT.IDX"
    INDEXHANDLE = FREEFILE
    INDEXCOUNT = 0
    'Remove any stale index from a previous pass; KILL errors if the file is missing.
    IF DIR$(INDEXFILE) <> "" THEN KILL INDEXFILE
    'LEN = 562 is the size of one SORTRECORD: 2 bytes + 20 * 28 bytes.
    OPEN INDEXFILE FOR RANDOM AS INDEXHANDLE LEN = 562
    ERASE INDEXDATA
    DIM DICTFILE AS STRING
    DICTFILE = "C:\DICT\DICT" + RTRIM$(LTRIM$(STR$(LETTERCOUNT))) + ".TXT"
    USEFILE = FREEFILE
    IF DIR$(DICTFILE) <> "" THEN
        OPEN DICTFILE FOR INPUT AS USEFILE
        PRINT USEFILE
        PRINT INDEXHANDLE
        'each line is a word. However, as we have limited memory
        'compared to, say, C# and other languages, we need to come up with a clever workaround.
        'that workaround? we use the file system. Instead of using a hashmap, we instead create
        'an in-memory index of sorted words pointing at records in a random-access scratch file.
        IF DIR$("C:\DSCRATCH\NUL") = "" THEN
            MKDIR "C:\DSCRATCH"
        END IF
        DO WHILE NOT EOF(USEFILE)
            LINE INPUT #USEFILE, READWORD
            READWORD = LCASE$(READWORD)
            DIM SORTEDWORD AS STRING
            DIM RECORDNUM AS INTEGER
            DIM BUILDRECORD AS SORTRECORD
            SORTEDWORD = SORTCHARS(READWORD)
            RECORDNUM = GETOFFSET(SORTEDWORD)
            IF RECORDNUM = -1 THEN
                'if no record num, then assume it isn't present.
                'increment INDEXCOUNT, redim INDEXDATA to the new size and set,
                'then write a new record at the end of the scratch file.
                INDEXCOUNT = INDEXCOUNT + 1
                REDIM PRESERVE SHARED INDEXDATA(1 TO INDEXCOUNT) AS SORTINDEX
                INDEXDATA(INDEXCOUNT).SORTED = SORTEDWORD
                INDEXDATA(INDEXCOUNT).OFFSET = INDEXCOUNT
                BUILDRECORD.WORDCOUNT = 1
                BUILDRECORD.SORTWORDS(1) = READWORD
                PUT #INDEXHANDLE, INDEXCOUNT, BUILDRECORD
            ELSE
                GET #INDEXHANDLE, RECORDNUM, BUILDRECORD
                BUILDRECORD.WORDCOUNT = BUILDRECORD.WORDCOUNT + 1
                BUILDRECORD.SORTWORDS(BUILDRECORD.WORDCOUNT) = READWORD
                PUT #INDEXHANDLE, RECORDNUM, BUILDRECORD
            END IF
            PRINT "ADDED " + READWORD + " To Index. " + STR$(INDEXCOUNT)
        LOOP
        CLOSE USEFILE
        DIM TAKEWORD AS INTEGER
        'READ DICT.IDX FROM START TO FINISH; write any record holding more
        'than one word to the ANAGRAMOUT file as a line of anagrams.
        FOR READINDEX = 1 TO INDEXCOUNT
            GET #INDEXHANDLE, READINDEX, BUILDRECORD
            IF BUILDRECORD.WORDCOUNT > 1 THEN
                FOR TAKEWORD = 1 TO BUILDRECORD.WORDCOUNT
                    PRINT #ANAGRAMOUT, LTRIM$(RTRIM$(BUILDRECORD.SORTWORDS(TAKEWORD))) + " ";
                NEXT TAKEWORD
                PRINT #ANAGRAMOUT, ""
            END IF
        NEXT READINDEX
        CLOSE INDEXHANDLE
    END IF
NEXT LETTERCOUNT
CLOSE #ANAGRAMOUT

FUNCTION GETOFFSET% (SORTEDWORD AS STRING)
    'Sequential search of the in-memory index for a sorted word;
    'the ON ERROR guard handles UBOUND on a not-yet-dimensioned array.
    DIM I AS INTEGER
    ON ERROR RESUME NEXT
    FOR I = 1 TO UBOUND(INDEXDATA)
        IF ERR <> 0 THEN
            ON ERROR GOTO 0
            EXIT FOR
        END IF
        IF MID$(INDEXDATA(I).SORTED, 1, LEN(SORTEDWORD)) = SORTEDWORD THEN
            GETOFFSET = INDEXDATA(I).OFFSET
            ON ERROR GOTO 0
            EXIT FUNCTION
        END IF
    NEXT I
34534 GETOFFSET = -1
END FUNCTION

FUNCTION SORTCHARS$ (ST AS STRING)
    'Return the characters of ST in sorted order via a simple exchange sort.
    DIM I AS INTEGER
    DIM J AS INTEGER
    DIM ICHAR AS STRING
    DIM JCHAR AS STRING
    DIM DS AS STRING
    DS = ST
    FOR I = 1 TO LEN(DS)
        FOR J = 1 TO LEN(DS)
            ICHAR = MID$(DS, I, 1)
            JCHAR = MID$(DS, J, 1)
            IF ICHAR < JCHAR THEN
                MID$(DS, I, 1) = JCHAR
                MID$(DS, J, 1) = ICHAR
            END IF
        NEXT J
    NEXT I
    SORTCHARS = DS
END FUNCTION

SUB SPLITDICT
    'task: open C:\DICT\DICT.TXT and split the contents into separate files;
    'C:\DICT\DICT7.TXT would contain the words that have 7 letters, for example.
    DIM FILENAMES(2 TO 28) AS STRING
    DIM TOTALCOUNT(2 TO 28) AS LONG
    DIM LCOUNT AS INTEGER
    DIM USEOUTFILE AS STRING
    DIM DICTREADER AS INTEGER
    DIM LINEREAD AS STRING
    DIM CHECKLEN AS INTEGER
    DIM FSORT AS INTEGER
    'Create (or truncate) each per-length output file.
    FOR LCOUNT = 2 TO 28
        USEOUTFILE = "C:\DICT\DICT" + RTRIM$(LTRIM$(STR$(LCOUNT))) + ".TXT"
        FILENAMES(LCOUNT) = USEOUTFILE
        FSORT = FREEFILE
        OPEN USEOUTFILE FOR OUTPUT AS #FSORT
        CLOSE #FSORT
    NEXT LCOUNT
    DICTREADER = FREEFILE
    OPEN "C:\DICT\DICT.TXT" FOR INPUT AS #DICTREADER
    PRINT "Reading Dictionary File Data..."
    'Open, append, and close for each word to stay under the open-file limit.
    DO WHILE NOT EOF(DICTREADER)
        LINE INPUT #DICTREADER, LINEREAD
        CHECKLEN = LEN(LINEREAD)
        TOTALCOUNT(CHECKLEN) = TOTALCOUNT(CHECKLEN) + 1
        PRINT LINEREAD
        FSORT = FREEFILE
        OPEN FILENAMES(CHECKLEN) FOR APPEND AS FSORT
        PRINT #FSORT, LINEREAD
        CLOSE #FSORT
    LOOP
    CLOSE #DICTREADER
END SUB
And there you have it: an anagram search program written in QuickBASIC. Of course, it is rather basic and is a bit picky about preconditions (hard-coded for specific file paths, for example), but it was largely written against my test VM.
Have something to say about this post? Comment!