Programming Language Performance Part I: Overview

This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, VB.NET, QuickBASIC, and Rust.

Much like my previous posts on the subject of programming language polarization, this one comes as a response to what I feel are flawed perspectives on programming languages. In particular, this comes in response to some claims that C# is a “beginners” language; the poster also claimed to have tested older languages as well. This is doubtful, however, because that claim was backed up with nothing reproducible, such as source code and timings. Not to mention the notion that “Even OLD Basic loops and functions are faster than C#”, which seems hard to prove since “OLD” Basic (BASICA and GW-BASIC) had only rudimentary loops (FOR...NEXT and WHILE...WEND) and no real functions (with the exception of DEF FN statements, which aren’t quite the same idea and are more similar to macros). Additionally, empty loops and functions aren’t typical constituents of a program. That is, a test showing that some specific structure is faster in Language X than in Language Y is focusing on a small area of a much larger picture.

The purpose of this short series of posts is to take a relatively simple, but not entirely trivial, algorithm, implement it in several languages, and then compare performance results as well as, of course, the resulting code itself. The point of this experiment is to provide actual results. Many claim that “Language X is faster than Language Y” but, as described above, they typically focus far too narrowly; for example, I have seen claims that loops are faster in C++ than in, say, C#. The actual test code was not provided, of course, so it’s impossible to show what they were really measuring. (Hint: It wasn’t loops.) More importantly, languages are designed for solving problems, not for looping or repeating simple if instructions.

I will not, of course, claim this to be unbiased. I feel that the performance of a language is not nearly as important as that language’s succinctness and expressiveness, because those two qualities contribute more to code quality, code quality is directly proportional to how maintainable a codebase is, and the maintainability of a codebase should be a more important concern than how many milliseconds a contrived example takes to execute. Many of the “swings” being taken are by C/C++ users, usually claiming that garbage collected languages aren’t “real programming” or that without deterministic memory management all programs will be really slow and unusable. Perhaps ironically, while C/C++ users are making these claims and creating their derived tests and contrived scenarios where garbage collection fails, those using those “not real” programming languages are writing programs and getting work done. It’s arguable whether the members of the C/C++ userbase who make these attacks are making them because they feel that newer garbage collected languages are a true detriment to programming or whether they, either consciously or not, realize that these languages relieve a burden of responsibility from the programmer and that, in making programming more accessible, they make it far less of an elite skillset. Therefore, they need to attempt to trivialize the abilities of those using those new technologies in order to restore their own egos. This is highly speculative, of course, but from what I’ve seen, a lot of those who push C/C++ and criticize various managed frameworks and virtual machines really don’t have the first clue about that of which they complain. Many of their arguments are so easily refuted as to be silly.

This series of posts is not attempting to discredit C/C++; in fact I’m doubtful either one of those languages will show up in the tests at all; rather, it is designed as a comparison of the same thing implemented in different languages. I have tried to do so with as little bias as possible but I must admit not being familiar with many of the languages I have used for these tests, so my lack of ability may impair the performance of some implementations. Corrections or improvements to these implementations are both welcome and encouraged.

I hold the opinion that, overall, language performance only matters in some very specific cases. For example, deterministic memory management is most useful when you have a small, finite amount of memory; if you are writing a program to run on a small embedded system with a limited feature set and only a few hundred K of memory, it makes sense to use a language like C. In fact, the 640K limitation of DOS could almost single-handedly be pointed to as the reason that most earlier garbage collection implementations did not get critical acclaim. The languages’ architecture and design were usually very powerful and they did have garbage collectors that tried to be as conservative as possible (or used XMS for their heaps), but there was still the issue that DOS itself only had 640K to deal with. Additionally, programmers had become used to allocating and freeing all the memory they used.

This resulted in a powerful backlash to the implementation and use of garbage collected languages, or even any language that claimed to be “easy to use”. This is because programmers are, to be quite honest, egotistical douchebags when it comes to programming. They like to think of themselves as “better” than the unwashed masses that cannot program or have no desire to learn to do so; they feel smarter. When C was gaining popularity, Assembly programmers cried out that it wasn’t “real” programming. Eventually, however, even the last Assembly die-hards acknowledged that C had a very important role to play in programming. With the introduction of languages like Pascal and Modula, C programmers cried out that using Pascal wasn’t “real” programming. Of course, without actually defining what “real programming” was, they were just being, for lack of a better word, whiny bitches. What they didn’t like wasn’t that the language wasn’t similar to C, but rather that the language was easier to use. This trivialized the effort that they had put into learning C. In order to prove to themselves that they had made the correct choice in learning C, they would research and find myriad flaws in the language in question, declare it to be “beyond repair”, and then fall back on C again.

As time progressed, C++ started to come to the fore. C++ got exactly the same criticisms as C: mostly ethereal claims based on little actual evidence, usually suggesting that using C++ wasn’t “real programming” or pointing to various other supposed shortcomings.

C++ is popular now, and now other languages, such as Java and C#, have borne a lot of criticism from people who learned C++. The issue is not whether C++ is the better language at all; what it boils down to is that the C++ programmers are trying to convince themselves that they made the right choice learning C++, or that they made the right choice not learning Java, or not learning C#. I find this sort of logic somewhat silly, since there is nothing mutually exclusive about programming languages.

I suppose an important term to define is what it means for a language to be “powerful”. In my definition, the power of a programming language comes from its expressiveness: how easy it is to represent abstract concepts in said language. Take even a relatively basic concept such as an array. In C, there is barely such a thing as an array; what you are able to do is use the subscript operator for pointer arithmetic as well as for declarations and allocations, but there is no actual “array” object that you access. Once an array has been passed to a function, for example, you cannot determine its length, because, fundamentally, what you hold is not an array but a pointer to a block of bytes that you happen to be able to access using subscripts. Compare this to higher level languages, where arrays are first-class citizens with a large variety of capabilities and methods. The C implementation is of course a lot faster, because you are really just doing pointer math, and there is no extra data to keep track of. However, the latter is more powerful because it takes the burden of tracking that data off the hands of the programmer. A C program might declare an array, add values into the array, sort the array, and display the array as separate operations. A higher level language such as Ruby, however, can do this in a single statement. The Ruby solution, while many tens of lines shorter, will often be slower, because of Ruby’s interpreted nature. However, while a C++ implementation will be faster and longer, a hand-tuned assembly variant will be even faster and longer. I guess the question is why C/C++ purists seem to think that the performance difference between C++ and higher level languages is so important, but the same gains accessible via the use of hand-tuned assembly are not. And I think it boils down to a simple fact: many C/C++ purists do not know assembly. Therefore advocating its use would be like a bricklayer encouraging the use of wood siding.
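To make the “single statement” point a little more concrete, here is a minimal sketch. It is in Python rather than Ruby (simply the language I am using for the illustrations in this overview), and the sample values are made up.

```python
# Hypothetical sample data; any list of comparable values would do.
numbers = [42, 7, 19, 3]

# Declare, add a value, sort, and display in a single statement.
print(sorted(numbers + [11]))

# The list carries its own length; nothing has to be tracked by hand.
print(len(numbers))
```

The equivalent C would need a buffer, a separately tracked element count, a call to something like qsort with a comparison function, and a loop to print the result.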

Given the above, the power of languages differs, and the fact is, C# is a more powerful language than C++. A lot of C#-like behaviour, things like iterators, lambda expressions, etcetera, can either be mocked up using C++ templates or macros or is part of the C++11 specification, but C# has had many of these features for a while now, and they are first-class citizens. Many C++ purists point to the fact that C# compiles to IL code, as opposed to native code, as if this were some sort of Achilles heel. Much the same is said for Java bytecode. What such arguments fail to take into consideration is the wealth of information and metadata that IL code has. No C++ program can dynamically construct class instances from some other C++ program, because, fundamentally, a C++ program, once compiled, is just flat code; there are no classes or objects or instances or methods. With C# and Java, the class and function information remains intact. A C# program could run another C# program directly in its own process, or construct and use objects from some other C# program. This is very much the same for Java. Introspection and rich run-time type information are very powerful language features; C++’s built-in RTTI (Run-Time Type Information) is limited to typeid and dynamic_cast, and anything more can only, at best, be emulated in C++ code using strict coding practices and templates.
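As a rough idea of what this kind of introspection looks like in practice, here is a small sketch in Python rather than C# or Java. The helper name make_instance is my own invention for illustration; importlib, getattr and dir are standard Python, and fractions.Fraction is just a convenient standard-library class to aim at.

```python
import importlib


def make_instance(module_name, class_name, *args):
    """Look a class up by name at run time and construct an instance of it."""
    module = importlib.import_module(module_name)  # load code chosen at run time
    cls = getattr(module, class_name)              # find the class by its name
    return cls(*args)


# The module and class names are plain strings; they could just as easily
# come from a config file or from user input.
frac = make_instance("fractions", "Fraction", 3, 4)
print(type(frac).__name__, frac)

# The object can also be asked what public members it has.
print([name for name in dir(frac) if not name.startswith("_")])
```

The C# and Java equivalents go through their own reflection APIs, but the idea is the same: type information survives into the running program and can be queried and acted on.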

Whether C++ is faster than C# is a point of debate, but whether it is faster to develop in C# or C++ is a no-brainer. C++ takes a significant investment of programmer time at each step. Design takes longer because C++’s implementation of OOP is incomplete and fractured; differences in the implementation of templates between compilers can make code compile and run fine on one compiler, compile and crash on another, and not even compile on yet another. This can be traced back to the complex standard of the language, which makes parsers extremely difficult to write and complicates static analysis. This brings us, yet again, back to introspection; if all types and data of a structure are preserved, static analysis becomes a lot easier, if still far from trivial.

Essentially, an argument that language A is faster than language B only matters if language A and language B are equally easy for the programmer to use. Arguing that C++ is faster than C# seems reasonable to people, but those same people jump to defend C++ against the equally solid argument that C is better than C++, or that Assembly is better than C. It used to be impossible to write and publish a game with reasonable performance without learning assembly. Over time, it became possible to write those programs in C, and later, in higher level languages. As computers have gotten more and more memory, faster and more processors and cores, and faster hard drives, micromanaging those resources within your program like a nanny becomes a game of diminishing returns. Even today many AAA games are no longer entirely written in C++ or C, instead opting for Python or another scripting language for some of the scripting. Not every piece of code ever written needs to run as fast as humanly possible; in fact, some accepted “fastest” algorithms have worst cases that can be slower than the accepted O(n²) slow algorithm. For example, with certain array configurations and sizes, a QuickSort can run slower than a bubble sort. It all depends on the data and what is being done to it. Benchmarking how fast empty loops run in different languages tells you nothing about which language is faster, and as argued above the end-result performance is no more important than the time required to make that end product.
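The QuickSort remark is easy to check for yourself. The sketch below is my own illustration, not a benchmark: it counts comparisons rather than measuring time, and it assumes a deliberately naive QuickSort that always picks the first element as its pivot, plus a bubble sort that stops as soon as a pass makes no swaps. The function names are made up for this example.

```python
def quicksort_comparisons(items):
    """Naive first-element-pivot quicksort; returns (sorted list, comparisons)."""
    data = list(items)
    comparisons = 0
    # Explicit stack of (low, high) ranges instead of recursion, so the
    # worst case does not run into Python's recursion limit.
    stack = [(0, len(data) - 1)]
    while stack:
        low, high = stack.pop()
        if low >= high:
            continue
        pivot = data[low]
        boundary = low  # last index holding a value smaller than the pivot
        for i in range(low + 1, high + 1):
            comparisons += 1
            if data[i] < pivot:
                boundary += 1
                data[boundary], data[i] = data[i], data[boundary]
        # Move the pivot between the smaller and the larger values.
        data[low], data[boundary] = data[boundary], data[low]
        stack.append((low, boundary - 1))
        stack.append((boundary + 1, high))
    return data, comparisons


def bubble_sort_comparisons(items):
    """Bubble sort with early exit; returns (sorted list, comparisons)."""
    data = list(items)
    comparisons = 0
    n = len(data)
    swapped = True
    while swapped and n > 1:
        swapped = False
        for i in range(n - 1):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        n -= 1  # the largest remaining value is now in its final place
    return data, comparisons


already_sorted = list(range(1000))
print(quicksort_comparisons(already_sorted)[1])    # 499500
print(bubble_sort_comparisons(already_sorted)[1])  # 999
```

On input that is already sorted, this particular QuickSort degrades to roughly n²/2 comparisons while the bubble sort finishes after a single pass. Real library sorts choose pivots more carefully, so treat this as a demonstration of the worst case and nothing more.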

That said, one could argue that I was making this argument to skirt the “obvious” fact that C# is slower than, say, Python, as evidenced by the above. I decided to give this a try with a nontrivial example: in this case, a program to find all anagrams in a given list of words. An anagram, for those not in the know, is a word formable by rearranging the letters of another word.

Those who don’t look before they leap may find performance dismal; the brute force approach of comparing every word to every other word performs poorly at best. The accepted algorithm is to use a Dictionary structure, with the sorted letters of a word as the key into that dictionary, and essentially add each word to a list stored under that key. A pseudocode representation:

Loop through all words. For each word:
1. Create a new word by sorting the letters of the word.
2. Use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.
When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.
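
The steps above can be sketched in a few lines. This is a minimal Python illustration only, not the implementation measured in any later entry of the series; the function name find_anagrams and the sample word list are assumptions made for this example.

```python
from collections import defaultdict


def find_anagrams(words):
    """Group words by their sorted letters; keep only groups of two or more."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # step 1: sort the letters of the word
        groups[key].append(word)     # step 2: file the word under that key
    # Entries with more than one word are anagrams of one another.
    return {key: group for key, group in groups.items() if len(group) > 1}


# Hypothetical sample data; a real run would read a full word list from a file.
sample = ["listen", "silent", "enlist", "google", "banana", "tinsel", "inlets"]
for key, group in find_anagrams(sample).items():
    print(key, group)
```

Each word is sorted once and looked up once, so the work grows roughly in line with the number of words instead of with its square, which is what makes this so much better than the brute force approach.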

The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.


Tags: Anagrams, C#, Dictionary, HashMap, Java, Perl, Programming, Python, Scala

BASeCamp Programming Blog

Multi-lingual when he wants to be.
