11 Mar 2016 @ 2:24 AM 

Previously, I wrote about Counting Lines of Code in a given directory. Recently, I had a similar curiosity: “how long, in total, are all the video files in this folder?” This seemed to be a good fit for a Python script.

The main question, then, is how do we determine the length of a video file in Python? Once we have that problem solved, we can effectively reuse the structure of the Code Count script and “sum” all the lengths together in much the same way that it sums all the line counts.

The most straightforward approach I discovered was using ffmpeg to retrieve video information, redirecting the output to a temp file, and then finding the listed Duration in that file. We then parse that duration and create a Python timedelta. Summing the timedeltas of all the files in question gives us the total run time of all the videos in the specified folders.

A bit of checking code to allow for the use of *.* to total all ffmpeg-compatible video files rounds this out. The syntax is the same as the Code Counter- VidLength.py <FileMask> <directory>- and it will display all the files it analyzes and then a total duration.
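A minimal sketch of the core idea (assuming ffmpeg is on the PATH; here the Duration line is pulled from ffmpeg’s stderr directly, rather than going through an actual temp file):

    import re
    import subprocess
    from datetime import timedelta

    def video_length(path):
        # ffmpeg prints "Duration: HH:MM:SS.hh" to stderr when probing a file
        result = subprocess.run(["ffmpeg", "-i", path],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.PIPE,
                                universal_newlines=True)
        match = re.search(r"Duration: (\d+):(\d+):(\d+)\.(\d+)", result.stderr)
        if match is None:
            return None  # not something ffmpeg understands
        hours, minutes, seconds, hundredths = (int(g) for g in match.groups())
        return timedelta(hours=hours, minutes=minutes,
                         seconds=seconds, milliseconds=hundredths * 10)

    # summing is then just like the line-count script:
    # total = sum((d for d in (video_length(f) for f in files) if d), timedelta())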

Posted By: BC_Programming
Last Edit: 11 Mar 2016 @ 02:24 AM

Categories: Programming, Python, Scripting
 23 Apr 2013 @ 1:53 PM 

So lately I’ve been doing quite a bit of Java-related stuff. I’ve learned a lot- mostly about how much better C# is, to be honest. But this isn’t about that.

Most Java work gets packaged into a Jar file. A Java .jar file is really just a .zip file with a .jar extension- so points to Sun Microsystems for coming up with a package format that isn’t something completely new. I found myself having a bit of an annoying time of it, however- from what I can tell there is no “package this stuff into a jar” option in Eclipse. Since the zip format is so widely workable, I decided to create a quick script to package all the appropriate files into a .jar. This saves me selecting everything in the appropriate bin folder, unselecting the hidden history folder created by my text editor and anything else I don’t want included, and so on.

I’m aiming for simplicity here. A script in the bin folder is good enough for me. I encountered quite a few problems creating the script and ended up giving up on it for a while, but I got back to it today and decided to get it working. Here is the script:
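Something along these lines (a sketch of the approach described below, not the original listing; the output name and starting folder are placeholders):

    import os
    import sys
    import zipfile

    scriptname = os.path.basename(sys.argv[0])

    def addfolder(zf, folder, base):
        for root, dirs, files in os.walk(folder, topdown=True):
            # prune dot- and underscore-prefixed folders in place;
            # with topdown=True, os.walk will then skip them entirely
            dirs[:] = [d for d in dirs
                       if not d.startswith('.') and not d.startswith('_')]
            for d in dirs:
                addfolder(zf, os.path.join(root, d), base)
            for f in files:
                if f.startswith('.') or f.startswith('_') or f == scriptname:
                    continue
                full = os.path.join(root, f)
                zf.write(full, os.path.relpath(full, base))
            break  # only this level; the recursive calls handle subfolders

    with zipfile.ZipFile("packaged.jar", "w", zipfile.ZIP_DEFLATED) as zf:
        addfolder(zf, ".", ".")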

Python is quite a powerful language. I particularly like the use of tuples directly in the language; the os.walk() function is a “generator”- each iteration, root is set to the folder currently being walked, dirs is filled with all the directories in that folder, and files is filled with the files. I change the dirs value here to filter out entries that start with “.” (which could be, say, a git repo) as well as those starting with an underscore. The cool thing is that since I specified topdown=True in the call to os.walk, removing directories from dirs means os.walk will not descend into them, so I don’t need additional checks to make sure we aren’t in one of the excluded folders. Then the function calls itself for each directory, and finally it adds all the files in the currently walked folder to the zipfile. It performs similar checks to avoid adding undesired files: it skips files that start with a period or underscore, or that have the same name as the script file itself (which could cause issues, arguably, but with Java I’m unlikely to have a file called “prepare.py” anywhere in the tree that I want included). Then it breaks. Works quite well; the zip is created with only the files I want and everybody is happy. I was considering adding extra functionality to it, but decided that it’s possible to over-think it; I just want something simple, and I don’t need- for example- special handling of .gitignore files to actually exclude files that are specified from the package. Not yet, anyway.

Naturally this script can really be run anywhere; just paste it in a folder and run it, and you’ve got a zip with everything in that folder compressed. Could be useful for compressing source trees. I thought I’d share it here because I found it useful.

Posted By: BC_Programming
Last Edit: 23 Apr 2013 @ 01:55 PM

Categories: Python, Scripting
 21 Mar 2013 @ 9:08 AM 

Most application frameworks/languages provide access to the Command Line parameters passed to your application. Generally these are passed to your application as either a string or an array of strings. What you do not get automatically is the functionality to parse out switches.

Command-line parameters used to be the only way to communicate with a program; fundamentally, the command line was your UI into the program. Different platforms took different approaches. Unix-like systems typically take the “invasive” route: the shell expands wildcards and passes the resulting command line to the application. This means that you don’t have to do any “shell expansion” of wildcards (as it is known) yourself, but you have to account for the fact that your command line could include a lot of files. It’s a trade-off, really. Either way, I figured for the purposes of this library we could stick to the platform- if the program is run with a wildcard, you’ll see the wildcard on Windows, but it will have been expanded if you run the same program on Linux. It might be worth adding an option to “auto-expand” wildcards, just for consistency’s sake, but that seems like a post for another day.

Either way, most applications also include flags and switches. This is more a de facto standard that has cropped up- there is no hard and fast rulebook about what flags and switches are or how you are supposed to pass arguments, which can cause no end of confusion when it comes to reading application documentation. .NET just gives you the string, and leaves it up to you to decide how to interpret it. Some language libraries provide functionality to parse the command line appropriately, such as Python’s. C# doesn’t come with such a class… so let’s make one!

First we need to determine what exactly can exist in a command line. My method allows for two things: switches and arguments. A Switch can include an argument, separated from the switch with a colon. For example:
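Something along these lines (a hypothetical invocation- the program and switch names here are just placeholders):

    testprogram /switch:value -sw:"an argument" --doall "loose argument"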

In this case, we have three switches- switch, sw, and doall. The first two include an argument. My “syntax” allows for quotes in the arguments of switches as well as in the “loose” arguments. We will evidently need a class to represent and parse Arguments, and another one for Switches. The parsing can be done sequentially. Although it’s not a recommended best practice, I chose to use by-reference parameters in the class constructors. In order to keep things generic and accessible, both Switches and Arguments will derive from a CommandLineElement abstract class, which will force each derived class to implement ToString(). The ArgumentItem class will be used for parsing both “loose” arguments and arguments found after a switch.

Arguments

Arguments are simple- if the first character at the current position is a quote, we look for the next quote that isn’t doubled up. Otherwise, we look for either the next whitespace or the end of the string. Each argument only needs the actual argument value.

The constructor is where the important stuff happens. The by-reference parameter is used to pass in the starting position, and we update it before the constructor returns to point at the character after the argument. The class also defines static operators for implicit conversions to and from a string.
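A sketch of how this might look (not the original listing; everything beyond the names mentioned above is an assumption):

    public abstract class CommandLineElement
    {
        public abstract override string ToString();
    }

    public class ArgumentItem : CommandLineElement
    {
        public string Value { get; private set; }

        // pos is by-reference: it starts at the first character of the
        // argument, and points just past the argument when we return.
        public ArgumentItem(string commandline, ref int pos)
        {
            if (commandline[pos] == '"')
            {
                int end = commandline.IndexOf('"', pos + 1);
                // a doubled-up quote escapes a literal quote; keep scanning
                while (end + 1 < commandline.Length && commandline[end + 1] == '"')
                    end = commandline.IndexOf('"', end + 2);
                Value = commandline.Substring(pos + 1, end - pos - 1).Replace("\"\"", "\"");
                pos = end + 1;
            }
            else
            {
                int end = pos;
                while (end < commandline.Length && !char.IsWhiteSpace(commandline[end]))
                    end++;
                Value = commandline.Substring(pos, end - pos);
                pos = end;
            }
        }

        public override string ToString() { return Value; }

        public static implicit operator string(ArgumentItem a) { return a.Value; }
        public static implicit operator ArgumentItem(string s)
        {
            int p = 0;
            return new ArgumentItem(s, ref p);
        }
    }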

Now that we have the Argument class, we can define the Switch class. The actual syntax of switches often depends on the application, but also seems to depend on the platform. For example, Linux tools favour the hyphen for single-letter flags and double hyphens for multi-character flags (switches are also called flags); the forward slash is not generally used as a switch indicator there. Windows platforms prefer the forward slash, but generally allow single hyphens as well. We aim to support all three syntaxes, and make the client application not have to worry about which one is used. We also add support for arguments- a switch can be specified as such:
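For example (a hypothetical switch with an attached, quoted argument):

    /switch:"C:\Program Files\somefile.txt"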

The element after the colon will be parsed as an argument and attached to the switch itself. But enough waffling- on to the Switch:
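A sketch of the Switch class (again an assumption-laden reconstruction, reusing the ArgumentItem sketch above):

    public class SwitchItem : CommandLineElement
    {
        public string Name { get; private set; }
        public ArgumentItem Argument { get; private set; }

        public SwitchItem(string commandline, ref int pos)
        {
            // accept "/", "-", or "--" as the switch indicator
            if (commandline[pos] == '/') pos++;
            else if (commandline[pos] == '-')
            {
                pos++;
                if (pos < commandline.Length && commandline[pos] == '-') pos++;
            }
            int start = pos;
            while (pos < commandline.Length && !char.IsWhiteSpace(commandline[pos])
                   && commandline[pos] != ':')
                pos++;
            Name = commandline.Substring(start, pos - start);
            if (pos < commandline.Length && commandline[pos] == ':')
            {
                pos++; // skip the colon; what follows is this switch's argument
                Argument = new ArgumentItem(commandline, ref pos);
            }
        }

        public override string ToString()
        {
            return Argument == null ? "/" + Name : "/" + Name + ":" + Argument;
        }
    }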

With the basic parsing logic completed, we need to consider how we want this to be used. The best way is to think of how we would like to use it:
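Something like this, perhaps (a sketch- CommandLineParser is a name I’m assuming for the core parser object):

    var parser = new CommandLineParser(Environment.CommandLine);
    if (parser["doall"] != null)
        Console.WriteLine("doall was specified");
    string file = parser["switch"].Argument; // implicit conversion to string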

Some basic take-aways from this. First, the core parser object needs to provide an indexer. In the above example, we see it accessing Switches by passing in the switch name. Other possibilities include using direct numeric indexes to refer to any argument- much like you would access elements in the framework-provided args[] String array. Another possibility is to have the Argument of a switch auto-populate, rather than be null, when accessed:
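For example (again using the assumed names from the sketches above):

    // even if /sw was passed with no argument, Argument comes back as an
    // empty ArgumentItem rather than null, so no null check is needed:
    string value = parser["sw"].Argument;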

Posted By: BC_Programming
Last Edit: 21 Mar 2013 @ 09:10 AM

 01 Feb 2013 @ 10:23 AM 

Randomization is something that is pretty much a staple in games. But the tricky part is randomizing in an expected way.

Some kinds of randomization you don’t want to be truly random- just mostly random- and you do in fact want it to go through all of the possibilities regularly, though in a random way.

The best example I can think up is to imagine you have an Enemy that can shoot forwards, backwards, up, and down. Now, targeting the player notwithstanding, let’s assume you want it to shoot randomly. If you go with the approach of just choosing a direction completely at random, you will find that the code will often choose the same direction many times in a row. This is more or less the problem with trying to randomize small sets of data: each element is chosen without regard to those chosen before. This fits probability, of course, but can be a bit weird gameplay-wise.

Enter ItemBucket

The solution is fundamentally to make sure that items are chosen at random, but also that every item gets used. The best approach I could think of was one where you choose elements at random, but you make sure you don’t choose the same element twice during any one run. The class I created for this purpose I called “CItemBucket”. Its functionality is fairly simple: it is constructed with the items it will contain, which initializes two lists with those elements. The ItemBucket can “dispense” items; when an item is dispensed, it is removed from the list. If the list is empty, it is reset from the original list. This makes sure that for any set of N elements, each element will occur exactly once during any N dispenses. My original idea was to return a randomly indexed element each call, but then I got to thinking that, in many ways, we are simply dealing with a “shuffled deck” of elements, and returning items from that “deck”. So, at the cost of making adding a bit more expensive, I think I was able to make retrieval O(n) across the number of present elements (based on the Queue implementation, that is).

Here, then, is the C# implementation of this class:
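A sketch of the approach described above (not the original listing- construct, shuffle into a Queue, dequeue to dispense; the member names are assumptions):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class CItemBucket<T>
    {
        private readonly List<T> _original = new List<T>();
        private Queue<T> _current = new Queue<T>();
        private static readonly Random _rand = new Random();

        public CItemBucket(params T[] items)
        {
            _original.AddRange(items);
            Refill();
        }

        // adding re-shuffles, which is what makes Add the expensive operation
        public void Add(T item)
        {
            _original.Add(item);
            Refill();
        }

        private void Refill()
        {
            // a "pure" shuffle: sort the originals on random keys
            _current = new Queue<T>(_original.OrderBy(x => _rand.Next()));
        }

        public T Dispense()
        {
            if (_current.Count == 0) Refill(); // exhausted: reshuffle the "deck"
            return _current.Dequeue();
        }
    }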

Here is some code that consumes this implementation:
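Along these lines (a sketch):

    var bucket = new CItemBucket<string>("Up", "Down", "Left", "Right");
    for (int i = 0; i < 12; i++)
        Console.Write(bucket.Dispense() + " ");
    // every direction appears exactly once in each run of four dispenses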

Works quite well, as far as I can tell.

But wait, there’s more!

Perhaps because I like to think of myself as something of a polyglot, let’s go ahead and come up with an implementation of this for Java. Doing this in Java is a bit more tricky- that is, directly. Unlike the C# implementation, we don’t have the convenience of lambdas to make a ‘pure’ shuffle; we would have to write our own implementation. Instead of that, I went with the original concept- that is, to choose, remove, and return a random index, and re-create the Bucket contents when we’ve emptied it. Here is the Java version of this class:
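A simplified sketch of that approach (the original, as described below, also dealt with building generic arrays via reflection; this reconstruction sticks to Lists):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class ItemBucket<T> {
        private final List<T> original = new ArrayList<T>();
        private final List<T> current = new ArrayList<T>();
        private final Random rand = new Random();

        public ItemBucket(List<T> items) {
            original.addAll(items);
        }

        public void add(T item) {
            original.add(item);
        }

        public T dispense() {
            if (current.isEmpty())
                current.addAll(original); // emptied: re-create the contents
            // choose, remove, and return a random index
            return current.remove(rand.nextInt(current.size()));
        }
    }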

Test code for the above Java implementation:
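Something like this (a sketch):

    ItemBucket<String> bucket = new ItemBucket<String>(
            java.util.Arrays.asList("Up", "Down", "Left", "Right"));
    for (int i = 0; i < 12; i++)
        System.out.println(bucket.dispense());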

As far as I’m concerned, the most interesting thing in the Java implementation is the “RepeatArray” method. Java’s generics limitations prevent us from creating arrays of the generic type parameter directly, so instead we need to use some reflection magic. Unfortunately, as far as I can tell there is no way to get the class of the actual type parameter in Java, so I had to resort to accepting that class as a parameter. Also, because of the lack of generators/Enumerable methods, I ended up simply having that method return a List of the generic type, rather than an Enumerable.

It’s not over yet!

You might think, “OK, he’s covered C# and Java implementations- can I please go home?” Well, yes, you can. But then you’ll miss my Python implementation! I went with the same approach I chose for Java- that is, retrieving items randomly and removing them. It’s fully possible to use the same methodology I used for the C# approach- using Queues and sorting on randomized keys- but I found this approach a bit less verbose and more straightforward. And it is my opinion that ‘straightforward’ is sort of the “Python way”, so I went with that.
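A sketch of that Python version (DispenseSingle and DispenseMulti are the method names discussed below; the rest is assumed):

    import random

    class CItemBucket:
        def __init__(self, items):
            self._original = list(items)
            self._current = list(items)

        def _take(self):
            if not self._current:
                self._current = list(self._original)  # reset from the originals
            return self._current.pop(random.randrange(len(self._current)))

        def DispenseSingle(self):
            return self._take()

        def DispenseMulti(self, count):
            # a generator method: yields 'count' dispensed items
            for _ in range(count):
                yield self._take()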

And, of course, the Python code that uses it:
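For example (a sketch):

    bucket = CItemBucket(["Up", "Down", "Left", "Right"])
    for direction in bucket.DispenseMulti(12):
        print(direction)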

The first thing that stands out is just how much shorter this one is than either the Java or C# implementation. This is probably mostly due to the dynamic typing of Python, as well as a few syntactic shortcuts. The only “gotcha” I found was that Python doesn’t really support overloaded methods. Technically, I could create one method and then verify what parameters were actually passed, but in this case I wanted one to be a generator method and the other to simply return a single element. The only way I could figure out how to do that was to have separate DispenseSingle() and DispenseMulti() methods. It also doesn’t expose the innards like the other implementations, in that you cannot add elements to an already constructed instance. In retrospect, I’m not really sure it’s a good idea to make the design mutable at all. The most typical use-case is to construct it with all the elements it will use and not change it afterwards- that’s the use case I have in BASeBlock, and the use-case I’m considering for my Java implementation. So having extra methods to add elements and keep fields “fresh” and consistent adds implementation complexity for little gain.

Posted By: BC_Programming
Last Edit: 01 Feb 2013 @ 11:46 AM

 03 Jan 2013 @ 10:06 PM 

This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, and VB.NET.

Somehow, while creating the various anagram implementations for the previous entries in this series, I neglected to create one for PHP. This was primarily because I hadn’t really even thought about it, in much the same way as I didn’t create a JavaScript version of the script. Nonetheless, it seems like a worthy endeavour; PHP is a very popular and well-used language, and there is no reason not to see how it stacks up against other languages in this sort of situation.

One of the main things when creating a PHP implementation is to get an environment set up. If I were to implement the script and run it on my webhost, the results would no longer correspond properly; the PHP has to be interpreted on my machine for the results to be something we can compare. PHP is designed more as a web-based language, as opposed to a local scripting environment. By this I mean it’s prevalent- and almost universally used- on the server side, whereas other languages like Python, Perl, etc. find a lot of use client-side as well. Because I hadn’t ever actually run PHP client-side, I had to do a little searching, but the search was relatively short. Since I plan to run these tests on my Windows machine, I needed the PHP interpreter for Windows. What I didn’t need, however, was Apache; and while I have IIS installed and could probably run PHP that way, I am concerned that the web server components might add some overhead to the running of the script, and generally make the timing “off” in comparison to the other languages I used. As I write this I am installing the file from here, which ought to give me the tools to write a PHP equivalent of the Anagrams Algorithm. This entry (evidently) describes the PHP entrant into that discussion.

A naive anagram search program will typically have dismal performance, because the “obvious” brute force approach of looking at every word and comparing it to every other word is a dismal one. The more accepted algorithm is to use a Dictionary or Hashmap, and map lists of words to their sorted versions, so ‘mead’ and ‘made’ both sort to ‘adem’ and will be present in a list in the Hashmap with that key. A pseudocode representation:

  1. Loop through all words. For each word:
    1. create a new word by sorting the letters of the word.
    2. use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.
  2. When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.

The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.

PHP is one of those languages that have a lot of foibles and historical issues that are maintained in an attempt to keep things compatible. In some ways, it reminds me of the story of Visual Basic 6: after several language iterations, it still holds to its roots. So on the one hand, you have very old code that still works in the latest version as it used to, and on the other you have some tiny incompatibilities that might require changes to certain kinds of code. One of the most popular PHP-detracting articles on the web today is PHP: A Fractal of Bad Design, which makes some good points, but also ignores the fact that PHP was really never designed- it ‘evolved’. It was never really designed as a programming language, but rather as a way to get what needed to be done done in the fastest possible way and with the fewest headaches. The inevitable result is what we have now, which has some crusty deposits, but those crusty deposits are part of what makes PHP so useful.

As far as the anagrams concept goes, I figure the best way to accomplish the goal is to use associative arrays. PHP has these built in, and this is arguably an extremely awesome feature that is frequently overlooked. This is the resulting implementation:
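A sketch of that first attempt (a reconstruction, not the original listing- the dictionary filename is a placeholder; as discussed below, it reads a line at a time and merges with array_merge):

    <?php
    $anagrams = array();
    $fp = fopen("dict.txt", "r");
    while (($line = fgets($fp)) !== false) {
        $word = trim($line);
        $chars = str_split($word);      // sort the letters to build the key
        sort($chars);
        $sorted = implode('', $chars);
        if (!isset($anagrams[$sorted])) $anagrams[$sorted] = array();
        // array_merge re-creates the array on every call
        $anagrams[$sorted] = array_merge($anagrams[$sorted], array($word));
    }
    fclose($fp);
    ?>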

A run of this took 116 seconds. Now, PHP being somewhat slower than many other languages is not, by itself, amazing. However, this is slower than any other implementation by a factor of ten. Therefore I think it is only fair to PHP to see if perhaps I did something wrong in the above implementation.

I am fairly certain that the issue is with my use of array_merge, since it rather re-creates the array every time, which can be costly given the dictionary has over 200K words. I changed it to use array_push:
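That is, the merge line becomes a push (sketch):

    if (!isset($anagrams[$sorted])) $anagrams[$sorted] = array();
    array_push($anagrams[$sorted], $word);  // appends in place, no re-creation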

I also “fixed” a minor issue from before. We are basically working with a top-level associative array which uses the sorted words as keys, each of which stores another array of the words that sort to that sorted word- each of which would be anagrams of one another, naturally.
I also added a check at the bottom to make sure that I was getting some sort of result. This one clocked in at 114 seconds, which is still rather dismal. At this point I’m still convinced I’m just not using the language “properly”, but it is also possible that PHP really does have this issue regarding speed. That said, I could probably find a Dictionary/HashTable class implementation on-line that would work better than a nested associative array. Another thing might be that I am simply using a slower PHP implementation. I’m using the installation from “php-5.3.20-nts-Win32-VC9-x86.msi”, which is accessible from here. I went with the “Non thread-safe” version since that is usually faster. It’s also possible that running this on a Linux machine would yield faster results, since that is PHP’s ‘native’ environment. I considered the stumbling blocks with other languages, particularly Java and C#, with which I only got maximum performance after disabling debug mode. With that in mind, I made some tweaks to the php.ini file that came with the PHP interpreter, in the interest of increasing speed. None of the tweaks made a significant difference, though I did get it down to 109 seconds.

I think a fair assessment here has to do with the File I/O being done only a tiny bit at a time. This is one of the issues I had with some of my other implementations as well. In this case it creates and allocates more memory (possibly, depending on how the specific PHP implementation adds elements). So it makes sense to try to read the file in one fell swoop, and then iterate over those elements. With that in mind, I made some changes:
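Along these lines (a sketch using file() to pull the whole dictionary in at once):

    <?php
    $anagrams = array();
    // read the entire dictionary into an array in one call
    $words = file("dict.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($words as $word) {
        $chars = str_split($word);
        sort($chars);
        $sorted = implode('', $chars);
        if (!isset($anagrams[$sorted])) $anagrams[$sorted] = array();
        array_push($anagrams[$sorted], $word);
    }
    ?>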

Which still didn’t fix the speed problem (113 seconds this time). I did notice that when I had error display on, I got a lot of notices- as the loop ran- about undefined indices when setting $anagrams[$sorted]. So if I had to hazard a guess, I’d say that the speed issue is a combination of a very long loop (200K words) and the fact that it is entirely interpreted, which together result in the rather slow run-time.

The language itself is a bit messy, though it’s certainly not as bad as Perl as far as special punctuation marks go. Of course, with the loss of that special punctuation ($ for scalars, @ for arrays and % for hashmaps in Perl, where PHP only keeps $ and lacks the multiple contexts) we get a simplified language that is a bit easier to work with cognitively but lacks some power and capability. Nonetheless, it is still a very workable language, and most of its capability can be leveraged by using existing frameworks and code; PEAR libraries can be found for pretty much every purpose. I hazard to wonder how it can be considered scalable, given the speed issues I’ve encountered, but I’m still quite certain I’m simply using it incorrectly. I have to say I was surprised that I remembered the foreach syntax off the top of my head- foreach($array as $value). Not because it’s complex or esoteric, but because I have been messing with C#, Java, and other languages a lot in the meantime, many of which use very slight variations: Java uses its for loop but with a special syntax, for(value : array); C# uses a foreach loop, foreach(value in array); Scala uses for(value <- array); and Python doesn’t technically have a standard for loop- everything is really a foreach, for value in array:. F# uses a similar syntax, replacing the colon with its own equivalent keyword, for value in array do. I was convinced when I used as that I was remembering some other language’s implementation and would have to look it up, but to my surprise I remembered it correctly. Probably just luck, though, to be fair. Even though C# is my “main” language for desktop development, I do in fact find myself using the Java for syntax for foreach loops, too (or even trying to use for(value in array) and wondering why the IDE yells at me on that line until I facepalm that I used the wrong keyword).

One might think this particular example proves PHP is “slow”, but I disagree. I think it just shows that PHP is not well suited to iterating over hundreds of thousands of elements. But given its particular domain- web pages- you shouldn’t be presenting that much information to the user anyway, so that sort of thing falls outside the scope of the language’s intended use cases.

In the works: Part XI, Haskell. I had a nice long one almost finished but I appear to have either forgotten to save it when I restarted my PC or something along those lines, so I’ve had to rewrite it. Thankfully I still have the actual Haskell code, which is arguably the most intensive part of each entry in the series.

Posted By: BC_Programming
Last Edit: 27 Jan 2013 @ 09:31 AM

 08 Oct 2012 @ 2:10 AM 

This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, and VB.NET.


In order to compare various languages, I will be implementing an anagram search program in a variety of languages and testing and analyzing the resulting code. This entry provides some conclusions on what I discovered.

Returning to my previous attempts, I looked to my C# implementation. Second fastest so far, but can it be made faster? Probably. My attempt to do so was to upgrade it to C# 5.0 and use the Task Parallel Library to get some measure of parallel execution in the core loop. I had to essentially put a lock around the Dictionary modification code, meaning that, fundamentally, the only thing that happened concurrently was the sorting of the words- which is still better than nothing. I managed to get it under half a second using this approach. This is the ‘new’ loop used for this:
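In sketch form (a reconstruction- SortChars and the surrounding declarations are assumed from the earlier C# entry):

    // foreach (string word in words) AddWord(word);   // the old, serial loop
    var anagrams = new Dictionary<string, List<string>>();
    object locker = new object();
    Parallel.ForEach(words, word =>
    {
        string sorted = SortChars(word); // the sorting runs concurrently
        lock (locker) // Dictionary isn't thread-safe; serialize the updates
        {
            List<string> group;
            if (!anagrams.TryGetValue(sorted, out group))
                anagrams[sorted] = group = new List<string>();
            group.Add(word);
        }
    });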

I only changed the foreach loop- here commented out- and replaced it with a Parallel.ForEach call. Most of this code is just adding the items to a Dictionary, which isn’t easily parallelizable due to race conditions. The best I could think was to try using the ConcurrentDictionary and ConcurrentBag in place of Dictionary and List, respectively, but this made performance drop significantly, taking several dozen seconds.

Performance results

Here is a listing of the languages, with the execution time and length of each solution:

Language    Execution Time (ms)    Length (characters)
Python                    1,200                    399
C#                          700                  1,333
Java                      1,050                  1,471
VB6                       9,000                  2,202
Scala                     1,400                  1,055
F#                          500                    899
Ruby                     10,000                    320
Perl                      3,100                    394
Delphi                    1,000                  1,848
PHP                     114,000                  1,071
C++                         400                    984
D                         7,520                    173
VB.NET                      610                  1,436

Edit: After researching some information on the Java VM, I re-ran that test using the Server profile (java -server anagram). This sped it up from 1.8s to just a smidgen over 1 second.

What can we conclude from this table? Well, nothing, really. Like most statistical analyses of programming language performance, it’s almost meaningless. A high-level view, however, tells us there are two kinds of languages: those that require the programmer to do a lot of work, and those that don’t. Because many high-level constructs come built into the latter sort of language, they get used for all manner of purposes, and as a result it’s difficult for the implementation to be optimized for every case. In the former sort of language, you often implement such structures yourself (and can therefore decide on the best design based on how you intend to use the data structure), or, if you are lucky, you get to choose from a set of generic implementations. For example, the .NET framework has several classes implementing the basic collection interfaces (LinkedList, Queue, Stack, etc.) and a number of other implementations that use various algorithms.

An omission one could note in the .NET framework is that it lacks a built-in, generic Tree structure. However, this is not really an omission, but more a design choice. While a List interface and assorted implementations have generic uses, a similarly generic tree is mostly academic: when creating a data structure, sometimes you don’t even fully realize that the structure you are creating has some tree properties, and even when you do, the various parts of that tree usually require extra, domain-specific information.

Some make the argument that programming languages that do things for the programmer are “for beginners”; for example, C#- and by relation the .NET framework- provides a myriad of data structures, such as LinkedLists, Lists, Dictionaries (Hashmaps), Queues, etc. Some argue that these are basic concepts that a programmer should know how to write. I agree 100%. However, what I don’t think we need is for every programmer to have their own personal implementation of various data structures; while usage over time will iron out problems, those are the sorts of things that ought to be directly in the language library. Most popular programming languages have those or similar constructs available via their built-in libraries. Of course this does not prevent you from writing your own implementations, and could even be seen to help, since you can compare your own implementation to the well-tested library code.

I suppose what it sort of boils down to is an argument of speed of development versus speed of execution; languages like Ruby offer very quick development, easy reading, and a quick, concise syntax to do what you need. However, one pays for this in a higher execution time compared to other languages. Personally, I err on the side of programming languages being geared towards speed of development: it doesn’t really matter how fast your program runs if it never exists. Additionally, computers can always be made faster, as can language implementations. For example, the concise Ruby version I provided might run faster in the future because of enhancements to the interpreter, as well as being run on a faster computer. However, adding the language features that I exploit there to other languages won’t magically make my implementations in those languages faster, because it would require the code to be revised, or even rethought.

Add to this that most people who cite evidence against their least favourite language could very well simply not be very good at it; for all I know my Python implementation is slower than my C# version simply because I suck at Python. However, the fact is that most people who state “I tested several languages…” and then state a conclusion that coincidentally matches their original stance are oddly tight-lipped when it comes to actually sharing the code they used to test.

This write-up is not designed to point out the relative performance of different languages, but rather to point out something that should have been obvious from the beginning: all programming languages are different, and they differ in expressive power as well as execution speed- but the latter is a less important concern than the former, all things considered.

One thing that comes up often when you have programmers coming from different language and platform backgrounds is suggestions that come from those backgrounds. This is a mixed bag- a person who is familiar with functional programming languages may give you new insight into the problem you are working on. Similarly, somebody may bring with them, from those other platforms and systems, a way of doing things that you can apply to your language or technology. On the other hand, you sometimes see suggestions or criticisms that are based on how the person understands their own language, rather than the one being dealt with. A good example of this I’ve witnessed first-hand (as the recipient of the critique) was when I wrote a small Python program that used dictionaries to make a rather short and concise Rock/Paper/Scissors game. The criticism was that I was using “Rock”, “Paper” and “Scissors” as string indices into the Python dictionary, and they said that if I used numbered constants, it would be faster. However, this was not the case- as I demonstrated shortly afterwards, the changed code was in fact the same speed but much harder to read. This was because they were assuming that strings would be slower- but Python dictionaries are actually heavily optimized for string indexing, and additionally, due to string interning, the key comparison is often just a pointer comparison, so the two versions were functionally identical.

I guess my overall point is that a language itself is never going to make a big enough performance difference over another to make it worth switching, or choosing a language, purely on that basis; programming languages ought to be judged on their merits as a translation layer between humans and the machine, including their different paradigms and idioms. Additionally, if the speed of a language- the most common example being C or C++- really matters that much compared to the myriad other languages, then surely hand-tuned Assembly code matters as well; compared to Python, Ruby, Java, or C#, C and C++ typically require a lot more design, planning, experience, and development and maintenance time. The counter-argument is always that C/C++ require that because they are low-level (they aren’t- they are high-level languages) and are faster than the other languages. However, that same trade-off can be found between C/C++ and hand-tuned assembly, making the argument more or less silly. The counter to this is that Assembly is a lot harder than C/C++. But that just reinforces my point: writing proper C/C++ code for relatively simple tasks can be exponentially harder than doing so in Python, Ruby, Java, or C#- and that comparison completely ignores the additional development and maintenance time that using C/C++ requires.

Posted By: BC_Programming
Last Edit: 15 Apr 2013 @ 10:17 PM

 22 Sep 2012 @ 12:13 AM 

This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, and VB.NET.


In order to compare various languages, I will be implementing an anagram search program in a variety of languages and testing and analyzing the resulting code. This entry describes the C# entrant into that discussion. A naive anagram search program will typically have dismal performance, because the “obvious” brute force approach of looking at every word and comparing it to every other word is a dismal one. The more accepted algorithm is to use a Dictionary or Hashmap, and map lists of words to their sorted versions, so ‘mead’ and ‘made’ both sort to ‘adem’ and will be present in a list in the Hashmap with that key. A pseudocode representation:

  1. Loop through all words. For each word:
    1. create a new word by sorting the letters of the word.
    2. use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.
  2. When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.

The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.

Implemented in C#, the described algorithm might look like this:
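A sketch along those lines (the dictionary filename and the helper method names are assumptions):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class Anagrams
    {
        // sorting a word's characters gives the key all its anagrams share
        static string SortChars(string word)
        {
            char[] chars = word.ToCharArray();
            Array.Sort(chars);
            return new string(chars);
        }

        // iterator method: makes the lines of the file usable with foreach()
        static IEnumerable<string> ReadWords(string path)
        {
            // reading the whole file at once proved much faster than line-at-a-time
            foreach (string line in File.ReadAllLines(path))
                yield return line.Trim();
        }

        static void Main()
        {
            var anagrams = new Dictionary<string, List<string>>();
            foreach (string word in ReadWords("dict.txt"))
            {
                string key = SortChars(word);
                List<string> group;
                if (!anagrams.TryGetValue(key, out group))
                    anagrams[key] = group = new List<string>();
                group.Add(word);
            }
            // values with more than one entry are anagrams of one another
            foreach (var pair in anagrams.Where(p => p.Value.Count > 1))
                Console.WriteLine(string.Join(",", pair.Value));
        }
    }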

One of the contributing factors to the amount of code here is the creation of a method to sort the characters in a string, as well as the iterator method that makes the lines of a stream available to foreach(). Much of the size (compared to Python) can be attributed to the static typing of the language, too.

In Visual Studio’s default debug mode, all optimizations are disabled and the IDE hooks a lot of the activity of the program itself. My initial run took over 20 seconds to complete. I disabled the Visual Studio hosting process and activated the Release build, as well as changing the target platform from x86 to “Any CPU”. This reduced the time from 20 seconds to around 10 seconds; changing the implementation of the iterator method to read the entire file at once rather than one line at a time reduced it to ~0.8 seconds.

This might actually provide some insight into why so many detractors seem to have such an easy time thinking C# is slow. They will usually start up Visual Studio, make a quick test program, run it, find it very slow, and then write off the language as slow, almost dismissively- without realizing that the language itself is not slow, but that debugging a program tends to put a damper on things, what with the lack of optimizations in order to allow for better RTTI for the debugger.

Posted By: BC_Programming
Last Edit: 15 Apr 2013 @ 10:09 PM

 22 Sep 2012 @ 12:12 AM 

This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, and VB.NET.

In order to compare various languages, I will be implementing an anagram search program in a variety of languages and testing and analyzing the resulting code. This entry describes the Python entrant into that discussion. A naive anagram search program will typically have dismal performance, because the “obvious” brute force approach of looking at every word and comparing it to every other word is a dismal one. The more accepted algorithm is to use a Dictionary or Hashmap, and map lists of words to their sorted versions, so ‘mead’ and ‘made’ both sort to ‘adem’ and will be present in a list in the Hashmap with that key. A pseudocode representation:

  1. Loop through all words. For each word:
    1. create a new word by sorting the letters of the word.
    2. use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.
  2. When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.

The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.

Implemented in Python, the solution is surprisingly short:
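A sketch of the idea (the dictionary filename is an assumption):

    anagrams = {}
    with open("dict.txt") as f:
        for line in f:
            word = line.strip()
            # sorted() returns a list of characters; join them into the key
            key = ''.join(sorted(word))
            anagrams.setdefault(key, []).append(word)

    # values with more than one entry are anagrams of one another
    for group in anagrams.values():
        if len(group) > 1:
            print(','.join(group))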

This solution is nice and short (as I noted) because of the first-class dictionary and list support in Python, as well as its dynamic typing. The resulting program was run without the shown comments, to prevent parsing overhead (if applicable). Over the course of several dozen runs, the speed of this solution averages around 1.2 seconds on my machine, running it via Python 3.2.3. The Python language provides a number of powerful constructs that often blend functional programming elements into traditional imperative language design concepts. I often hear the claim that Python is “Object Oriented” but have never really figured out why that is said. For example, the sorted() function used here is a free function, not an object method, and the program itself does not belong to an object. Instead I typify Python as one language among many that provides the tools for a programmer to use any number of programming styles, without being forced to use a specific one. This is something that should be noted and even lauded, not the fact that it happens to support Object Oriented programming.

Posted By: BC_Programming
Last Edit: 27 Jan 2013 @ 09:25 AM

 22 Sep 2012 @ 12:10 AM 

This is part of an occasionally updated series on various programming languages. It should not be interpreted as a benchmark, but rather as a casual look at various programming languages and how they might be used by somebody for a practical purpose.
Currently, there are articles written regarding Python, C#, Java and VB6 (merged for some reason), Scala, F# & Ruby, Perl, Delphi, PHP, C++, Haskell, D, and VB.NET.

Much like my previous posts on the subject of programming language polarization, this one comes as a response to what I feel are flawed perspectives on programming languages. In particular, it comes in response to claims that C# is a “beginners” language; the poster also claimed to have tested older languages as well. This is doubtful, however, because that claim was backed up with no reproducibles, such as source code and timings. Not to mention the assertion that “Even OLD Basic loops and functions are faster than C#”, which seems hard to prove since “OLD” Basic (BASICA and GW-BASIC) didn’t have functions in the modern sense (with the exception of DEF FN statements, which aren’t quite the same idea and are more similar to macros). Additionally, empty loops and functions aren’t typical constituents of a program. That is, a test showing that some specific structure is faster in Language X than in Language Y is focusing on a small area of a much larger picture.

The purpose of this short series of posts is to provide a relatively simple, but not entirely trivial, algorithm; I will implement it in several languages, and then compare performance results as well as, of course, the resulting code itself. The purpose of this experiment is to provide actual results. Many claim that “Language X is faster than Language Y”, but as described above they typically focus far too narrowly; for example, I have seen claims that loops are faster in C++ than in, say, C#. The actual test code was not provided, of course, so it’s impossible to show what they were really measuring (hint: it wasn’t loops). More importantly, languages are designed for solving problems, not for looping or executing simple if instructions repeatedly.

I will not, of course, claim this to be unbiased. I feel that the performance of a language is not nearly as important as that language’s succinctness and expressiveness, because those two qualities contribute more to code quality, and code quality is directly proportional to how maintainable a codebase is- and the maintainability of a codebase should be a more important concern than how many milliseconds a contrived example takes to execute. Many of the “swings” being taken are by C/C++ users, usually claiming that garbage-collected languages aren’t “real programming” or that without determinate memory management all programs will be really slow and unusable. Perhaps ironically, while C/C++ users are making these claims, creating their derived tests and contrived scenarios where garbage collection fails, those using those “not real” programming languages are writing programs and getting work done. It’s arguable whether the C/C++ users who make these attacks do so because they feel that newer garbage-collected languages are a true detriment to programming, or whether they- consciously or not- realize that these languages relieve a burden of responsibility from the programmer, and that in making programming more accessible, it becomes far less of an elite skillset; therefore, they need to attempt to trivialize the abilities of those using the new technologies in order to restore their own egos. This is highly speculative, of course, but from what I’ve seen, a lot of those who push C/C++ and criticize various managed frameworks and virtual machines really don’t have the first clue about that which they complain about. Many of their arguments are so easily refuted as to be silly.

This series of posts is not attempting to discredit C/C++; in fact I’m doubtful either one of those languages will show up in the tests at all; rather, it is designed as a comparison of the same thing implemented in different languages. I have tried to do so with as little bias as possible but I must admit not being familiar with many of the languages I have used for these tests, so my lack of ability may impair the performance of some implementations. Corrections or improvements to these implementations are both welcome and encouraged.

I hold the opinion that, overall, language performance only matters in some very specific cases. For example, determinate memory management is useful most when you have a small, finite amount of memory; if you are writing a program to run on a small embedded system with a limited feature set and only a few hundred K of memory, it makes sense to use a language like C. In fact, the 640K limitation of DOS could almost single-handedly be pointed to as the reason that most earlier garbage collection implementations did not get critical acclaim. The languages’ architecture and design were usually very powerful, and they did have garbage collectors that tried to be as conservative as possible (or used XMS for their heaps), but there was still the issue that DOS itself only had 640K to deal with. Additionally, programmers had become used to allocating and freeing all the memory they used.

This resulted in a powerful backlash to the implementation and use of garbage-collected languages- or even any language that claimed to be “easy to use”. This is because programmers are, to be quite honest, egotistical douchebags when it comes to programming. They like to think of themselves as “better” than the unwashed masses who cannot program or have no desire to learn to do so; they feel smarter. When C was gaining popularity, Assembly programmers cried out that it wasn’t “real” programming. Eventually, however, even the last Assembly die-hards acknowledged that C had a very important role to play in programming. With the introduction of languages like Pascal and Modula, C programmers cried out that using Pascal wasn’t “real” programming. Of course, without actually defining what “real programming” was, they were just being, for lack of a better word, whiny bitches. What they didn’t like wasn’t that the language wasn’t similar to C, but rather that the language was easier to use. This trivialized the effort that they had put into learning C. In order to prove to themselves that they made the correct choice in learning C, they would research and find myriad flaws in the language in question, declare it to be “beyond repair”, and fall back on C again.

As time progressed, C++ started to come to the fore. C++ got exactly the same criticisms as C had: mostly ethereal claims based on little actual evidence, usually saying something that indicated that using C++ wasn’t “real programming”, or citing various other supposedly detrimental factors.

C++ is popular now, and newer languages- such as Java and C#- have borne a lot of criticism from people who learned C++. The issue is not whether C++ is the better language at all; what it boils down to is that C++ programmers are trying to convince themselves that they made the right choice learning C++, or that they made the right choice not learning Java or C#. I find this sort of logic somewhat silly, since there is nothing mutually exclusive about programming languages.

I suppose an important term to define is what it means for a language to be “powerful”. In my definition, the power of a programming language comes from its expressiveness: how easy it is to represent abstract concepts in said language. Take even a relatively basic concept such as an array. In C, there truly is no such thing as an array; what you are able to do is use the subscript operator for pointer arithmetic as well as for declarations and allocations, but there are no actual “array” objects that you access. You cannot determine the length of an array, for example, because, fundamentally, there is no array- just a block of bytes that you happen to be able to access using subscripts. Compare this to higher-level languages, where arrays are first-class citizens with a large variety of capabilities and methods. The C implementation is of course a lot faster, because you are really just doing pointer math, and there is no extra data to keep track of. However, the latter is more powerful because it takes the burden of tracking that data off the hands of the programmer. A C program might declare an array, add values into the array, sort the array, and display the array as separate operations; a higher-level language such as Ruby can do this in a single statement. The Ruby solution, while many tens of lines shorter, will often be slower, because of Ruby’s interpreted nature. However, while a C++ implementation will be faster and longer, a hand-tuned assembly variant will be even faster and longer still. I guess the question is why C/C++ purists seem to think that the performance difference between C++ and higher-level languages is so important, but the same gains accessible via hand-tuned assembly are not. And I think it boils down to a simple fact: many C/C++ purists do not know assembly. Advocating its use would therefore be like a bricklayer encouraging the use of wood siding.

Given the above, the power of languages differs, and the fact is, C# is a more powerful language than C++. A lot of C#-like behaviour- things like iterators, lambda expressions, etcetera- can either be mocked up using C++ templates or macros, or are part of the C++11 specification, but C# has had many of these features for a while now, and they are first-class citizens. Many C++ purists point to the fact that C# compiles to IL code- as opposed to native code- as if this were some sort of Achilles heel. Much the same is said of Java bytecode. What such arguments fail to take into consideration is the wealth of information and metadata that IL code carries. No C++ program can dynamically construct class instances from some other C++ program, because, fundamentally, a compiled C++ program is just flat code; there are no classes, objects, instances, or methods. With C# and Java, the class and function information remains intact: a C# program can run another C# program directly in its own process, or construct and use objects from some other C# program. This is very much the same for Java. Introspection and RTTI (Run-Time Type Information) are very powerful language features which can only, at best, be emulated in C++ code using strict coding practices and templates.

Whether C++ is faster than C# is a point of debate; but whether it is faster to develop in C# or C++ is a no-brainer: C++ takes a significant investment of programmer time at each step. Design takes longer because C++’s implementation of OOP is incomplete and fractured; differences in the implementation of templates between compilers can make code compile and run fine on one compiler, compile and crash on another, and not even compile on yet another. This can be traced back to the complex standard of the language, which can make parsers nearly impossible to write and complicates static analysis. This brings us, yet again, back to introspection: if all types and data of a structure are preserved, static analysis becomes a lot easier, if still far from trivial.

Essentially, an argument that language A is faster than language B only matters if language A and language B are equally easy to use by the programmer. Arguing that C++ is faster than C# seems reasonable to people, but those same people jump to defend C++ against the equally solid argument that C is better than C++, or that Assembly is better than C. It used to be impossible to write and publish a game with reasonable performance without learning assembly. Over time, it became possible to write those programs in C; and later, in higher-level languages. As computers have gained more memory, faster and more processors and cores, and faster hard drives, micromanaging those resources within your program like a nanny becomes a game of diminishing returns. Even today many AAA games are no longer entirely written in C++ or C, instead opting for Python or another scripting language for some of the scripting. Not every piece of code ever written needs to run as fast as humanly possible; in fact, some accepted “fastest” algorithms have worst cases that can be slower than the accepted O(n²) slow algorithm- for example, with certain array configurations and sizes, a QuickSort can run slower than a bubble sort. It all depends on the data and what is being done to it. Benchmarking how fast empty loops run in different languages tells you nothing about which language is faster, and as argued above, the end-result performance is no more important than the time required to make that end product.

That said, one could argue that I was making this argument to skirt the “obvious” fact that C# is slower than, say, Python, as evidenced by the above. I decided to give this a try with a nontrivial example: in this case, a program to find all anagrams in a given list of words. An anagram, for those not in the know, is a word formable by rearranging the letters of another word.

Those who don’t look before they leap may find performance dismal; the brute-force approach of looking at every word and comparing it to every other word is dismal at best, performance-wise. The accepted algorithm is to use a Dictionary structure, use the sorted letters of a word as a key into that dictionary, and essentially add each word to a list. A pseudocode representation:

  1. Loop through all words. For each word:
    1. create a new word by sorting the letters of the word.
    2. use the sorted word as a key into a dictionary/hashmap structure, which indexes Lists. Add this word (the normal unsorted one) to that list.
  2. When complete, anagrams are stored in the hashmap structure indexed by their sorted version. Each value is a list, and those that contain more than one element are word anagrams of one another.

The basic idea is to have a program that isn’t too simple, but also not something too complicated. I’ll also admit that I chose a problem that could easily leverage the functional constructs of many languages. Each entry in this series will cover an implementation in a different language and an analysis of such.

Posted By: BC_Programming
Last Edit: 15 Apr 2013 @ 04:26 AM

 18 Jun 2012 @ 5:05 AM 

It is a frequent pastime in many web communities that contain programmers- or back-seat programmers, as it were- to argue that such and such language is better than another, or that “if it was written in this language it would be faster”. A lot of the time this comes from people calling themselves “real programmers”, or at least putting on the air that “real” programmers don’t use managed languages like Java or C#; instead, they seem to think that “real” programmers use C. And arguably, any programmer should at least be able to read C. However, a good programmer is defined not by what language they use, but by their product. A program that works is a program that works, regardless of the language it was written in.

Anyway, the idea that C- or C++ is superior to Java- or, more generally, that any language is better than another for speed- comes with the implicit assumption that there is some limit. Because you know what can probably create faster code than C++? C. And faster than that? Hand-tuned assembly. Saying C++ is superior is basically saying that it’s worth it to dump the language-based advantages that Java has over C++ for the speed improvement, but somehow it isn’t worth it to make nearly the same amount of gains by switching to C, or by switching from C to Assembly.

Thing is, unless you choose Assembly language, there will always be a language that can make your program faster. But we use programming languages to abstract away the details. Instead of writing a series of direct instructions to the Floating Point Unit, placing values on the FPU and CPU stacks to perform operations in Assembly, we simply use C or a higher-level language and hand it an infix expression, as we are used to. Can you sometimes make such code run faster in Assembly? Probably, if you take advantage of U and V pipelining, make sure to reduce wait states and memory accesses, and so forth.

The bigger question is, “Is it worth it?” And largely, no, it isn’t. In fact, it very seldom is.

Another point is that the primary thing you are doing with a programming language now is interfacing with libraries. C++, C, and Assembly don’t make the libraries run any faster. On Windows, if you allocate memory- whether by way of creating objects in Java or C#, using new in C++, or using malloc, calloc, or whatever in C- it all boils down to OS-specific functions; in Windows’ case, all of those eventually become calls to LocalAlloc (or GlobalAlloc). But whether you make that call from Java or C doesn’t make the function itself execute faster.

Sure, you can argue that Java or C# probably has more overhead; from new Object() to the actual allocation there is probably some garbage collection housekeeping and the allocating of various fields. But the fact is that in C you will usually be doing that housekeeping yourself anyway; depending on the nature of the memory allocation and what it is for, you will probably be making a lot of calls to malloc() and free(), and every single one is a tango with the evil poltergeist known as memory allocation problems. Accidentally forget to free a block of memory and then reassign the value being used to store the pointer? Leak. Accidentally call free twice on the same block of memory? Double-free. All that extra code adds up; and while I don’t think it quite equals the time penalties associated with Java- which might accrue to about a tenth of a second over a constant year of execution- it certainly takes a toll on anybody trying to work on the code. Having to remember “I have to allocate that structure on 16-byte boundaries, and I need to make sure all the char arrays are packed before I call this function,” and so on.

And even then, you could easily eke out a similar performance gain over a C or C++ implementation by completely retooling the entire program in Assembly.

For any number of projects written in Java or C#- particularly games- you can usually find a number of posts on forums calling for, or at least implying, that the game should be re-written in C or C++ “for speed”. But why stop at C or C++? Why not call for the entire thing to be re-written in Assembly? Because it’s not worth the effort. But the thing is, it’s not worth the effort to rewrite it in C or C++, either; by the time any such rewrite was completed, computers would have gained enough speed to make the improvement moot. The reason that Assembly language isn’t used is because it is no longer necessary.

Programs used to pretty much have to be written in Assembly to be reasonably fast. QDOS/MS-DOS was originally coded in Assembly language; same with every early Operating System. But those Operating Systems were dead simple by comparison. Now, C is the main language used for Operating Systems. Did C code get faster? Not really- at least not in comparison to hand-tuned Assembly. But the fact is that if writing twice as much code could make your program 10% faster, it was only worth it if that 10% speed difference actually mattered. With the 386, it did, and could often mean your program showing a chart a full second before your competition’s. Now, your program running 10% faster often means that it shows a chart imperceptibly sooner, which is hardly justification for tripling the amount of code, making it difficult to maintain with low-level constructs, and discarding numerous useful abstractions.

That last word sort of touches on what a programming language truly is- just a set of abstractions. Let’s take a simple language construct that exists in nearly every language: the humble if. In assembly, the equivalent is to use a compare instruction, then a jump instruction- for example, JNE (Jump if Not Equal)- to jump to a specific address. Most Assemblers also add features that don’t directly translate to machine code, such as macros, making some of this a little easier. Your typical C if statement might translate to quite a few of these compares and jumps. Can you make a given if run faster in assembly? Well, probably, if you are a skilled assembly programmer. But most people aren’t, and even those who are would only do this in a time-critical inner loop.
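To illustrate, something like if (x == 5) { ... } boils down to a compare-and-jump pair (an x86 sketch):

    cmp eax, 5        ; compare the value in EAX with 5
    jne skipbody      ; if not equal, jump over the body
    ; ...body of the if...
    skipbody: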

Nowadays, inner-loop speed is not as critical as it used to be. While an Assembly rewrite of a critical section of code might have made that code 200% faster on a 386, that doesn’t mean it would have the same effect on a modern machine, because modern machines have multiple processors, some of which need to access the cache concurrently, and there are numerous pipelining issues to consider. For the most part, optimizing code with Assembly on a modern processor is not a feasible solution.

It used to be that Assembly had to be used for everything. If you used C, you were a loser and your code was too slow for general use. Then it was that C was used for everything. If you used C++, you were a loser and your code was too slow for general use. Now, it feels like C++ is used “for everything”. If you use anything else, your code is too slow for general use.

But the fact is that the people sticking to these older- if arguably faster- languages are sticking to the same languages that have made possible almost every single security vulnerability in every modern Operating System. C code just begs to be buffer-overflowed, and the simplest oversight can topple a company’s entire network, pass along trade secrets, or otherwise be a gigantic pain in the ass. In today’s networked world, speed is not the first thing to be required of a programming language, both because of its myriad trade-offs- such as costing more programmer time and anguish- and because you can always buy a faster computer later. Obviously, as long as the algorithms used are sound, you aren’t going to get gigantic gains on your implementations just by switching to another language; sometimes it can even make things slower. Managed languages are a good idea for application code- and games- simply because they are applications. Games don’t need to be close to the hardware because, like a wife after a long marriage, the hardware hasn’t even been touched in years. All “hardware” interaction is done today either through OpenGL or DirectX, which themselves delegate through software-based drivers anyway.

Computer time is cheap. Programmer time is not.

Posted By: BC_Programming
Last Edit: 19 Jun 2012 @ 02:51 AM

