This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various languages and how they might be used for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, VB.NET, QuickBASIC, and Rust.
QuickBASIC is an out-of-place choice compared to most of the other languages I’ve written about in this series. Why would I jump so far backwards to QuickBASIC?
There are actually a number of reasons. The first is that QuickBASIC imposes a number of limitations. Aside from the more limited programming language compared to, say, C#, it also means any solution needs to appropriately contend with issues such as memory usage and open file handles on MS-DOS. At the same time, a lot of the development task is actually simpler; one doesn’t need to fiddle with designers, or property pages, or configuration tabs, or anything of that sort. You open a text file and start writing the program.
The first task is to determine an algorithm. Of course, we know the algorithm- it’s been described previously. However, in this instance, we don’t have hashmaps available; furthermore, even if we wanted to implement one ourselves, we couldn’t keep all the information in memory. As a result, one compromise is to instead keep an array of index information in memory; that array can contain the sorted word as well as a record index into another random-access file. So, to start, we have these two TYPE structures:
TYPE SORTINDEX
    SORTED AS STRING * 28
    OFFSET AS LONG
END TYPE

TYPE SORTRECORD
    WORDCOUNT AS INTEGER
    SORTWORDS(1 TO 20) AS STRING * 28
END TYPE
By writing and reading directly from a scratch file when we need to add a new word to the “hash”, we can avoid having any of the SORTRECORD structures in memory except the one we are working with. This drastically reduces our memory usage, as did determining that the longest word in the dictionary is 28 characters/bytes. (The fixed SORTRECORD layout also gives us the record length for the random-access scratch file: 2 bytes for WORDCOUNT plus 20 × 28 bytes for SORTWORDS, or 562 bytes- the LEN we pass to OPEN.) The algorithm thus becomes: given a word, we sort the word’s letters, then consult the array of SORTINDEX types. If we find one with the sorted word, we take the OFFSET, read in the SORTRECORD at that offset, increment WORDCOUNT, add the word to the SORTWORDS array, then PUT it back into the scratch file. And if it isn’t found in the SORTINDEX, we create a new entry- saving a new record with the word to the scratch file and recording the offset and sorted text in the index for that record.
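In code, the heart of that loop ends up looking like this (condensed from the full listing further down):

SORTEDWORD = SORTCHARS(READWORD)
RECORDNUM = GETOFFSET(SORTEDWORD)
IF RECORDNUM = -1 THEN
    'New sorted form: grow the index and append a fresh record.
    INDEXCOUNT = INDEXCOUNT + 1
    REDIM PRESERVE SHARED INDEXDATA(1 TO INDEXCOUNT) AS SORTINDEX
    INDEXDATA(INDEXCOUNT).SORTED = SORTEDWORD
    INDEXDATA(INDEXCOUNT).OFFSET = INDEXCOUNT
    BUILDRECORD.WORDCOUNT = 1
    BUILDRECORD.SORTWORDS(1) = READWORD
    PUT #INDEXHANDLE, INDEXCOUNT, BUILDRECORD
ELSE
    'Existing record: read it in, add the word, write it back.
    GET #INDEXHANDLE, RECORDNUM, BUILDRECORD
    BUILDRECORD.WORDCOUNT = BUILDRECORD.WORDCOUNT + 1
    BUILDRECORD.SORTWORDS(BUILDRECORD.WORDCOUNT) = READWORD
    PUT #INDEXHANDLE, RECORDNUM, BUILDRECORD
END IF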
Of course, this does have several inefficiencies that I won’t address. The first is that the search for the matching sorted word is effectively a sequential search; ideally, the in-memory index would be kept sorted and searches could use a binary search. I guess if somebody is interested, I’ve “left it as an exercise for the reader”.
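For anyone who does want it, a minimal sketch of that search might look like the following- assuming the in-memory index were kept sorted by its SORTED field on insertion, which the program below does not do. BINOFFSET here is a hypothetical stand-in for the GETOFFSET function in the full source:

FUNCTION BINOFFSET% (SORTEDWORD AS STRING)
    'Binary search over INDEXDATA(), assumed sorted by the SORTED field.
    DIM LO AS INTEGER
    DIM HI AS INTEGER
    DIM MIDPT AS INTEGER
    DIM PROBE AS STRING
    LO = 1
    HI = INDEXCOUNT
    DO WHILE LO <= HI
        MIDPT = (LO + HI) \ 2
        'SORTED is space-padded to 28 bytes, so compare on the word's length.
        PROBE = MID$(INDEXDATA(MIDPT).SORTED, 1, LEN(SORTEDWORD))
        IF PROBE = SORTEDWORD THEN
            BINOFFSET = INDEXDATA(MIDPT).OFFSET
            EXIT FUNCTION
        ELSEIF PROBE < SORTEDWORD THEN
            LO = MIDPT + 1
        ELSE
            HI = MIDPT - 1
        END IF
    LOOP
    BINOFFSET = -1
END FUNCTION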
Otherwise all seems well. But not so fast- the dict.txt file has 45,402 words. Our SORTINDEX type is 32 bytes (28 for the sorted word plus 4 for the LONG offset), which means storing all words in the index would need 1,452,864 bytes, far beyond the conventional memory limits we are under. So we need to drastically reduce the memory usage of our algorithm. And we had something so promising! Seems like it’s back to the drawing board.
Or is it? Instead of trying to reduce how much memory our algorithm uses, we could reduce how much data it works with at a time. We can split the original dictionary file into chunks- and, as it happens, since words of different lengths cannot be anagrams of each other, we can simply split the file into separate files organized by length. Then we perform the earlier algorithm on each of those files and append the resulting anagram list of each to one file. That would give us one file listing all anagrams without exceeding memory limitations!
Before we get too excited, let’s make sure that the largest “chunk” would be small enough. Using another QuickBASIC program (because, what the hell, right?) I counted the words of each particular length. In this case, the biggest chunk is words of 7 letters, of which there are 7,371 in the test dictionary file. This would require 235,872 bytes of index storage, which is well within our 640K conventional memory limit.
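That counting program isn’t reproduced here, but it would have been something along these lines (a sketch, assuming the same C:\DICT\DICT.TXT input path the main program uses):

'Tally how many words of each length the dictionary contains,
'to confirm the largest chunk fits in conventional memory.
DIM COUNTS(1 TO 32) AS LONG
DIM FH AS INTEGER
DIM LINEREAD AS STRING
DIM L AS INTEGER
FH = FREEFILE
OPEN "C:\DICT\DICT.TXT" FOR INPUT AS #FH
DO WHILE NOT EOF(FH)
    LINE INPUT #FH, LINEREAD
    L = LEN(LINEREAD)
    IF L >= 1 AND L <= 32 THEN COUNTS(L) = COUNTS(L) + 1
LOOP
CLOSE #FH
FOR L = 1 TO 32
    IF COUNTS(L) > 0 THEN PRINT L; "letters:"; COUNTS(L); "words"
NEXT L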
Of course, there is a minor caveat: we do need to start QuickBASIC with a certain command-line argument, as, by default, the maximum size of a dynamic array is 64K. We do this by launching it with the /Ah command-line parameter. Otherwise, we might find that it encounters “Subscript out of range” errors once we get beyond around the 2,000 mark for our 32-byte index record type (64K divided by 32 bytes is 2,048 elements).
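For example, assuming the source were saved as ANAGRAM.BAS (the filename here is just illustrative):

QB /Ah ANAGRAM.BAS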
Another consideration I encountered was open files. I originally had it opening all the dictionary output files at once, but it maxed out at 16 files, so I had to refactor it to be much slower: reading a line, opening the appropriate file, writing the line, and then closing the file. Again, there may be a better technique here to increase performance. For reference, I wasn’t able to find a way to increase the limit, either (adjusting CONFIG.SYS didn’t help).
After that, it worked a treat- the primary algorithm runs on each length subset and writes the results to an output file.
Without further ado- the full source of this “solution”:
DECLARE SUB SPLITDICT ()
DECLARE SUB ADDWORD (WORDADD AS STRING)
DECLARE FUNCTION GETOFFSET% (SORTEDWORD AS STRING)
DECLARE FUNCTION GETBYTEOFFSET% (SORTEDWORD AS STRING)
DECLARE FUNCTION SORTCHARS$ (ST AS STRING)

TYPE SORTINDEX
    SORTED AS STRING * 28
    OFFSET AS LONG
END TYPE

TYPE SORTRECORD
    WORDCOUNT AS INTEGER
    SORTWORDS(1 TO 20) AS STRING * 28
END TYPE

DIM x AS INTEGER
COMMON SHARED INDEXDATA() AS SORTINDEX
COMMON SHARED INDEXFILE AS STRING
COMMON SHARED INDEXCOUNT AS INTEGER
COMMON SHARED INDEXHANDLE AS INTEGER

DIM USEFILE AS INTEGER
DIM READWORD AS STRING
DIM READINDEX AS LONG
DIM ANAGRAMOUT AS INTEGER

SPLITDICT
ANAGRAMOUT = FREEFILE
OPEN "C:\ANAGRAMS.TXT" FOR OUTPUT AS ANAGRAMOUT

DIM LETTERCOUNT AS INTEGER
FOR LETTERCOUNT = 2 TO 32
    INDEXFILE = "C:\DICT.IDX"
    INDEXHANDLE = FREEFILE
    INDEXCOUNT = 0
    'Remove any stale index from a previous pass; KILL errors if the file is missing.
    IF DIR$(INDEXFILE) <> "" THEN KILL INDEXFILE
    'LEN = 562 is the size of one SORTRECORD: 2 bytes + 20 * 28 bytes.
    OPEN INDEXFILE FOR RANDOM AS INDEXHANDLE LEN = 562
    ERASE INDEXDATA
    DIM DICTFILE AS STRING
    DICTFILE = "C:\DICT\DICT" + RTRIM$(LTRIM$(STR$(LETTERCOUNT))) + ".TXT"
    USEFILE = FREEFILE
    IF DIR$(DICTFILE) <> "" THEN
        OPEN DICTFILE FOR INPUT AS USEFILE
        PRINT USEFILE
        PRINT INDEXHANDLE
        'each line is a word. However, as we have limited memory
        'compared to, say, C# and other languages, we need to come up with a clever workaround.
        'that workaround? we use the file system. Instead of using a hashmap, we instead create
        'an in-memory index of sorted words pointing at records in a random-access scratch file.
        IF DIR$("C:\DSCRATCH\NUL") = "" THEN
            MKDIR "C:\DSCRATCH"
        END IF
        DO WHILE NOT EOF(USEFILE)
            LINE INPUT #USEFILE, READWORD
            READWORD = LCASE$(READWORD)
            DIM SORTEDWORD AS STRING
            DIM RECORDNUM AS INTEGER
            DIM BUILDRECORD AS SORTRECORD
            SORTEDWORD = SORTCHARS(READWORD)
            RECORDNUM = GETOFFSET(SORTEDWORD)
            IF RECORDNUM = -1 THEN
                'if no record num, then assume it isn't present.
                'increment INDEXCOUNT, redim INDEXDATA to the new size and set,
                'then write a new record at the end of the scratch file.
                INDEXCOUNT = INDEXCOUNT + 1
                REDIM PRESERVE SHARED INDEXDATA(1 TO INDEXCOUNT) AS SORTINDEX
                INDEXDATA(INDEXCOUNT).SORTED = SORTEDWORD
                INDEXDATA(INDEXCOUNT).OFFSET = INDEXCOUNT
                BUILDRECORD.WORDCOUNT = 1
                BUILDRECORD.SORTWORDS(1) = READWORD
                PUT #INDEXHANDLE, INDEXCOUNT, BUILDRECORD
            ELSE
                GET #INDEXHANDLE, RECORDNUM, BUILDRECORD
                BUILDRECORD.WORDCOUNT = BUILDRECORD.WORDCOUNT + 1
                BUILDRECORD.SORTWORDS(BUILDRECORD.WORDCOUNT) = READWORD
                PUT #INDEXHANDLE, RECORDNUM, BUILDRECORD
            END IF
            PRINT "ADDED " + READWORD + " To Index. " + STR$(INDEXCOUNT)
        LOOP
        CLOSE USEFILE
        DIM TAKEWORD AS INTEGER
        'READ DICT.IDX FROM START TO FINISH; write any record holding more
        'than one word to the ANAGRAMOUT file as a line of anagrams.
        FOR READINDEX = 1 TO INDEXCOUNT
            GET #INDEXHANDLE, READINDEX, BUILDRECORD
            IF BUILDRECORD.WORDCOUNT > 1 THEN
                FOR TAKEWORD = 1 TO BUILDRECORD.WORDCOUNT
                    PRINT #ANAGRAMOUT, LTRIM$(RTRIM$(BUILDRECORD.SORTWORDS(TAKEWORD))) + " ";
                NEXT TAKEWORD
                PRINT #ANAGRAMOUT, ""
            END IF
        NEXT READINDEX
        CLOSE INDEXHANDLE
    END IF
NEXT LETTERCOUNT
CLOSE #ANAGRAMOUT

FUNCTION GETOFFSET% (SORTEDWORD AS STRING)
    'Sequential search of the in-memory index for a sorted word;
    'the ON ERROR guard handles UBOUND on a not-yet-dimensioned array.
    DIM I AS INTEGER
    ON ERROR RESUME NEXT
    FOR I = 1 TO UBOUND(INDEXDATA)
        IF ERR <> 0 THEN
            ON ERROR GOTO 0
            EXIT FOR
        END IF
        IF MID$(INDEXDATA(I).SORTED, 1, LEN(SORTEDWORD)) = SORTEDWORD THEN
            GETOFFSET = INDEXDATA(I).OFFSET
            ON ERROR GOTO 0
            EXIT FUNCTION
        END IF
    NEXT I
34534 GETOFFSET = -1
END FUNCTION

FUNCTION SORTCHARS$ (ST AS STRING)
    'Return the characters of ST in sorted order via a simple exchange sort.
    DIM I AS INTEGER
    DIM J AS INTEGER
    DIM ICHAR AS STRING
    DIM JCHAR AS STRING
    DIM DS AS STRING
    DS = ST
    FOR I = 1 TO LEN(DS)
        FOR J = 1 TO LEN(DS)
            ICHAR = MID$(DS, I, 1)
            JCHAR = MID$(DS, J, 1)
            IF ICHAR < JCHAR THEN
                MID$(DS, I, 1) = JCHAR
                MID$(DS, J, 1) = ICHAR
            END IF
        NEXT J
    NEXT I
    SORTCHARS = DS
END FUNCTION

SUB SPLITDICT
    'task: open C:\DICT\DICT.TXT and split the contents into separate files;
    'C:\DICT\DICT7.TXT would contain the words that have 7 letters, for example.
    DIM FILENAMES(2 TO 28) AS STRING
    DIM TOTALCOUNT(2 TO 28) AS LONG
    DIM LCOUNT AS INTEGER
    DIM USEOUTFILE AS STRING
    DIM DICTREADER AS INTEGER
    DIM LINEREAD AS STRING
    DIM CHECKLEN AS INTEGER
    DIM FSORT AS INTEGER
    'Create (or truncate) each per-length output file.
    FOR LCOUNT = 2 TO 28
        USEOUTFILE = "C:\DICT\DICT" + RTRIM$(LTRIM$(STR$(LCOUNT))) + ".TXT"
        FILENAMES(LCOUNT) = USEOUTFILE
        FSORT = FREEFILE
        OPEN USEOUTFILE FOR OUTPUT AS #FSORT
        CLOSE #FSORT
    NEXT LCOUNT
    DICTREADER = FREEFILE
    OPEN "C:\DICT\DICT.TXT" FOR INPUT AS #DICTREADER
    PRINT "Reading Dictionary File Data..."
    'Open, append, and close for each word to stay under the open-file limit.
    DO WHILE NOT EOF(DICTREADER)
        LINE INPUT #DICTREADER, LINEREAD
        CHECKLEN = LEN(LINEREAD)
        TOTALCOUNT(CHECKLEN) = TOTALCOUNT(CHECKLEN) + 1
        PRINT LINEREAD
        FSORT = FREEFILE
        OPEN FILENAMES(CHECKLEN) FOR APPEND AS FSORT
        PRINT #FSORT, LINEREAD
        CLOSE #FSORT
    LOOP
    CLOSE #DICTREADER
END SUB
And there you have it: an anagram search program written in QuickBASIC. Of course, it is rather basic and is a bit picky about preconditions (hard-coded for specific file paths, for example), but it was largely written against my test VM.
Have something to say about this post? Comment!