This is part of a occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are Articles written regarding Python,C#, Java and VB6 (merged for some reason),Scala,F# & Ruby,Perl, Delphi, PHP,C++,Haskell,D, VB.NET, QuickBASIC, and Rust
Somehow while creating the various anagram implementations for the previous entries in this series, I neglected to create one for PHP. This was primarily because I hadn’t really even thought about it, in much the same way as I didn’t create a Javascript version of the script. Nonetheless, it seems like a worthy endeavour; PHP is a very popular and well-used language, and there is no reason not to see how it stacks up against other languages in this sort of situation.
One of the main things when creating a PHP implementation is to get an environment set up. If I was to implement the script and run it on my webhost, the results would no longer correspond properly; what has to happen is the PHP has to be interpreted on my machine for the results to be something we can compare. PHP is designed more as a web-based language- as opposed to a local scripting environment. By this I mean it’s prevalent- and almost universally used on the server side; whereas other languages like Python, Perl, etc. find a lot of use client-side as well. Because I hadn’t ever actually run PHP client-side I had to do a little searching. But the search was relatively short. Since I plan to run these tests on my Windows machine, I needed the PHP interpreter for Windows. What I didn’t need, however, was Apache; and while I have IIS installed and could probably run PHP that way, I am concerned that the web server components might add some overhead to the running of the script, and generally make the timing “off” in comparison to the other languages I used. As I write this I am installing the file from here, which ought to give me the tools to write a PHP equivalent of the Anagrams Algorithm. This entry (evidently) describes the PHP entrant into that discussion. A naive anagram search program will typically have dismal performance, because the “obvious” brute force approach of looking at every word and comparing it to every other word is a dismal one. The more accepted algorithm is to use a Dictionary or Hashmap, and map lists of words to their sorted versions, so ‘mead’ and ‘made’ both sort to ‘adem’ and will be present in a list in the Hashmap with that key. A psuedocode representation:
- Loop through all words. For each word:
- create a new word by sorting the letters of the word.
- use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.
- When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.
The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.
PHP is one of those languages that have a lot of fiobles and historical issues that are maintained to attempt to make sure things are compatible. In some ways, it reminds me of the story of Visual Basic 6. after several language iterations, it still holds to it’s roots. So on the one hand, you have very old code that still works well in the latest version as it used to, and on the other you have some tiny incompatibilities that might require changes to certain kinds of code. One of the most popular PHP-detracting on the web today is PHP: A Fractal of Bad Design Which makes some good points, but also ignores the fact that PHP was really never designed- it ‘evolved’. It was never really designed as a programming language, but rather as a way to get what needed to be done done in the fastest possible way and with the fewest headaches. The inevitable result is what we have now, which has some crusty deposits, but those crusty deposits are what makes PHP so useful.
As far as the anagrams concept goes, I figure the best way to accomplish the goal is to use associative arrays. PHP has these built in, and this is arguably an extremely awesome feature that is frequently overlooked. This is the resulting implementation:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
<?php error_reporting(-1); //timing start... $mtime = microtime(); $mtime = explode(" ",$mtime); $mtime = $mtime[1] + $mtime[0]; $starttime = $mtime; $anagrams = array(); $processedcount=0; //open dictionary file... $file_handle = fopen("D:\\dict.txt", "r"); //iterate through each line. while (!feof($file_handle)) { $processcount++; $word = fgets($file_handle); //split the word into each character. $letters = str_split($word); //sort the letters... sort($letters); //implode to a new string, and use that as a index into the associative array. $sorted = implode('',$letters); //determine if this sorted item already exists. if($anagrams[$sorted]==null){ $anagrams[$sorted] = array($sorted=>$word); }else{ //array_push($anagrams[$sorted],$sorted=>$word); $anagrams[$sorted] = array_merge($anagrams[$sorted],array($sorted=>$word)); } } $mtime = microtime(); $mtime = explode(" ",$mtime); $mtime = $mtime[1] + $mtime[0]; $endtime = $mtime; $totaltime = ($endtime - $starttime); echo $processedcount + " processed in ". $totaltime . " seconds\n"; ?> |
A run of this took 116 seconds. Now, PHP being slower than many other languages, directly, is not amazing. However this is quite a bit slower than any other implementation by a factor of ten. Therefore I think it is fair to PHP to see if perhaps I did something wrong in the above implementation.
I am fairly certain that the issue is with my use of array_merge, since it rather re-creates the array every time, which can be costly given the dictionary has over 200K words. I changed it to use array_push:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
<?php error_reporting(-1); //timing start... $mtime = microtime(); $mtime = explode(" ",$mtime); $mtime = $mtime[1] + $mtime[0]; $starttime = $mtime; $anagrams = array(); $processedcount=0; //open dictionary file... $file_handle = fopen("D:\\dict.txt", "r"); //iterate through each line. while (!feof($file_handle)) { $processcount++; $word = fgets($file_handle); //split the word into each character. $letters = str_split($word); //sort the letters... sort($letters); //implode to a new string, and use that as a index into the associative array. $sorted = implode('',$letters); //determine if this sorted item already exists. if($anagrams[$sorted]==null){ $anagrams[$sorted] = array($word); }else{ array_push($anagrams[$sorted],$word); } } $mtime = microtime(); $mtime = explode(" ",$mtime); $mtime = $mtime[1] + $mtime[0]; $endtime = $mtime; $totaltime = ($endtime - $starttime); echo $processedcount + " processed in ". $totaltime . "seconds\n"; echo $anagrams[50]; ?> |
I also “fixed” a minor issue from before. We are basically working with a top level associative array which uses the sorted words as their key, each one of which stores another array of the words that sort to that sorted word. Each of which would be anagrams of one another, naturally.
I also added a check at the bottom to make sure that I was getting some sort of result. This one clocked in at 114 seconds, which is still rather dismal. At this point I’m still convinced I’m just not using it “properly” but it is also possible that PHP really has this issue regarding speed. That said, I could probably find a Dictionary/HashTable class implementation on-line that would work better than a nested associative array. Another thing might be that I am simply using a slower PHP implementation. I’m using the installation from “php-5.3.20-nts-Win32-VC9-x86.msi” which is accessible from here. I went with the “Non thread-safe” version since that is usually faster. It’s also possible that running this on a Linux machine will yield faster results, since that is PHP’s ‘native’ environment. I considered the stumbling blocks with other languages, particularly Java and C#, with which I only got maximum performance when disabling debug mode. With that in mind, I sought to make tweaks to the php.ini file that came with the PHP interpreter, in the interest of increasing speed. I wasn’t able to make any significant changes to increase speed, though I did get it down to 109 seconds.
I think a fair assessment here has to do with the File I/O being done only a tiny bit at a time. This is one of the issues I had with some of my other implementations as well. In this case it creates and allocates more memory (possibly, depending on the specific PHP implementation of adding elements). So it makes sense to try to do it in one fell swoop, and then iterate over those elements. With that in mind, I made some changes:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
<?php error_reporting(-1); //timing start... $mtime = microtime(); $mtime = explode(" ",$mtime); $mtime = $mtime[1] + $mtime[0]; $starttime = $mtime; $anagrams = array(); $processedcount=0; //open dictionary file... $lines = file("D:\\dict.txt", FILE_IGNORE_NEW_LINES); //iterate through each line. foreach($lines as $line){ $processcount++; $word = strtolower($line); //split the word into each character. $letters = str_split($word); //sort the letters... sort($letters); //implode to a new string, and use that as a index into the associative array. $sorted = implode('',$letters); //determine if this sorted item already exists. if($anagrams[$sorted]==null){ $anagrams[$sorted] = array($word); }else{ array_push($anagrams[$sorted],$word); } } $mtime = microtime(); $mtime = explode(" ",$mtime); $mtime = $mtime[1] + $mtime[0]; $endtime = $mtime; $totaltime = ($endtime - $starttime); echo $processedcount + " processed in ". $totaltime . "seconds\n"; echo $anagrams[50]; ?> |
Which still didn’t fix the speed problem (113 seconds this time). I did notice that when I had error display on, I got a lot of notices- as the loop ran- about undefined indices when setting $anagrams[$sorted]. So if I had to hazard a guess I’d say that the speed issue is a combination of a very long loop (200K words) combined with it being entirely interpreted that results in the rather slow run-time. The language itself is a bit messy, though it’s certainly not as bad as Perl as far as special punctuation marks go. Of course with with the loss of that special punctuation ($ for scalars, @ for arrays and % for hashmaps in Perl, but PHP only keeps $ and lacks the multiple contexts) we get a simplified language that is a bit easier to work with cognitively but lacks some power and capability. Nonetheless, it is still a very workable language, and most of it’s capability can be leverages by using existing frameworks and code; PEAR libraries can be found for pretty much every purpose. I hazard to wonder how it can be considered scalable, given the speed issues I’ve encountered, but I’m still quite certain I’m simply using it incorrectly. I have to say I was surprised that I remembered the foreach syntax off the top of my head- (foreach($array as $value)). Not because it’s complex or esoteric but because I have been messing with C#, java, and other languages a lot in the meantime, many of which use very slight differences- Java uses it’s for loop but with a special syntax for(value:array) C# uses a foreach loop foreach(value in array), Scala uses for(value < - array), and Python doesn't technically have a standard for loop and everything is really a foreach for value in array:. F# uses a similiar syntax, replacing the colon with it’s own equivalent keyword for value in array do. I was convinced when I used as that I was remembering some other language implementation and would have to look it up, but to my surprise I remembered it correctly. probably just luck though, to be fair. Even though C# is my “main” language for desktop development I do in fact find myself using the Java for syntax for foreach loops, too. (or even trying to use for(value in array) and wondering why the IDE yells at me on that line until I facepalm that I used the wrong keyword).
One might thing this particular example proves PHP is “slow”, but I disagree. I think it just shows that PHP is not well suited to iterating over hundreds of thousands of elements. But given it’s particular domain- web pages- you shouldn’t be presenting that much information to the user anyway, so that sort of falls out of the languages intended use cases’ scope.
In the works: Part XI, Haskell. I had a nice long one almost finished but I appear to have either forgotten to save it when I restarted my PC or something along those lines, so I’ve had to rewrite it. Thankfully I still have the actual Haskell code, which is arguably the most intensive part of each entry in the series.
Have something to say about this post? Comment!