GPU Edge-MTTV - Fast edge energy detector

-Rengarajan (Wed Nov 18 11:05:10 CST 2015)

Efficient file parsing

-Rengarajan (Mon Jul 6 18:17:32 CDT 2015)
One of the most basic, yet in my opinion most ignored, parts of programming is file parsing. Efficient techniques for parsing files were worked out by talented programmers more than three decades ago and have been refined ever since. Most casual programmers, however, do not use these tools, thinking there is no need for such complexity in a one-off special case. Unfortunately, I/O is fairly complex (not to mention time consuming), and code written with that mindset is bound to break at some point, mainly around error handling and efficient tokenization. The standard approach of reading each line and splitting it into tokens with delimiters is prevalent, and it is basically how efficient parsers tackle the problem too; the significant differences lie in how many corner cases the code handles and how it recovers from errors. Grammar/rule based parsers are among the most efficient ways to parse a file. The tokenization process is the same as in the front end of any compiler: the tokenizer's main job is to read tokens one at a time from a stream and pass them to the parser. One can define exactly which token types are acceptable and what action to take otherwise. I will leave some background reading for those not familiar with the theory of compilers and optimizations by attaching a link.

Flex and GNU Bison are standard tools that can be leveraged to write, very easily, safe and highly optimized parsers. The idea is simple and the end solution is elegant and flexible. Flex is the lexical analyzer and Bison is the actual parser: you write the rules for accepting tokens in a flex file and the overall structure, or grammar, of the file in a corresponding bison file. Bison asks flex for tokens so that the input can be matched against the grammar definitions; if a rule fits, you can write code to accept the input. Usually programmers write a layer on top of the bison grammar to separate functionality in a clear and concise manner, or even for compilation purposes (I will explain this in a detailed blog next week).
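To make the division of labor concrete, here is a minimal sketch of how the two files fit together for a file of whitespace-separated integers. The file names, token names, and grammar are all my own for illustration; a real parser's rules would be more involved.

The flex side defines the tokens:

```
/* numbers.l -- hypothetical token rules (sketch) */
%{
#include "numbers.tab.h"   /* token definitions generated by bison -d */
%}
%option noyywrap
%%
[0-9]+      { yylval = atoi(yytext); return NUMBER; }
[ \t]+      ;                    /* skip whitespace between tokens */
\n          { return NEWLINE; }
.           { return yytext[0]; /* unknown char: let the parser report it */ }
%%
```

The bison side defines the grammar those tokens must satisfy:

```
/* numbers.y -- hypothetical grammar (sketch) */
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "parse error: %s\n", s); }
%}
%token NUMBER NEWLINE
%%
file:    /* empty */ | file line ;
line:    NEWLINE | numbers NEWLINE { /* one record accepted here */ } ;
numbers: NUMBER | numbers NUMBER ;
%%
int main(void) { return yyparse(); }
```

Building is the usual two-step: `flex numbers.l`, then `bison -d numbers.y`, then compile `lex.yy.c` and `numbers.tab.c` together. Any line that does not match the grammar lands in yyerror, which is where all the error recovery logic hangs.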

So, the question is: why should casual programmers be interested in this? I will get into the details here with an actual, relevant example that we use in the lab all the time. We parse KW18 files constantly, and most of the biomedical imaging work in our lab uses this file format. It is essentially an unstructured text format with 20 columns of information, usually just floats. Now, I personally wouldn't use unstructured text files for storing this kind of data, but that is an argument about inefficiencies in the research pipeline and can be a later blog post. For this test, I pulled the parser from Kolam and compared it against a far more flexible flex-bison based parser written from scratch. Both parsers were given the same text files as input, and to be fair the test was conducted with one large and one small file. Large files are typically on the order of several megabytes of plain text, while the smaller files are a few hundred kilobytes. The flex-bison parser significantly outperformed the custom parser in every case, running about 3 times faster! This is in addition to all the elegant error handling and the safe typecasting that comes with writing such parsers. Here are the actual numbers:

Filename                             Flex-Bison parser (ms)   Custom parser (ms)
track_BigGroupmore (8.5 MB)          305.102                  933.907
MU_ARGUS_GT_V2Registered (514 KB)    19.789                   73.499
(All times averaged over 10x10 runs.)

To explain in a bit more detail, I had to change some of the code in the Kolam parser to make the test as fair as possible. The original code converted string tokens to doubles, which is significantly slower than converting to integers, so I changed both parsers (mine with the flex-bison & Kolam's custom parser) to read the inputs as integers and push them to vectors. Take this with a grain of salt, as the test is nowhere close to thorough enough, but I believe it is sufficient to encourage people to look for better solutions even for the most mundane of tasks, such as parsing. What it boils down to is that if you are doing something interactive, a 900 ms delay is noticeable while a 300 ms delay is not, and I think that is a simple enough conclusion for today's post. I have uploaded the whole code with a full CMake toolchain to my Github, and I would ask anyone interested to start monitoring the site for another push in the next few days.
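For readers who want to see what the "read integers and push them to vectors" step looks like, here is a hypothetical sketch in the style of the hand-rolled parser, using strtol with error checking. The function name is illustrative and is not taken from either parser's actual code.

```cpp
#include <cstdlib>
#include <cerrno>
#include <string>
#include <vector>

// Parse all base-10 integers in a whitespace-separated line.
// Returns false if any token fails to convert cleanly.
bool parse_int_line(const std::string &line, std::vector<long> &out) {
    const char *p = line.c_str();
    char *end = nullptr;
    while (*p) {
        if (*p == ' ' || *p == '\t') { ++p; continue; }  // skip separators
        errno = 0;
        long v = std::strtol(p, &end, 10);
        if (end == p || errno == ERANGE)
            return false;               // not a number, or out of range
        out.push_back(v);
        p = end;                        // continue after the parsed token
    }
    return true;
}
```

Note that even this simple loop needs explicit checks for "no digits consumed" and overflow; a flex-bison parser gets the equivalent checks from its token rules, which is where the safe typecasting mentioned above comes from.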

Motivated by Static Code Analysis

Pursuing research in an academic setting can definitely cut into the time that can be allotted to programming. However, there are great sources of motivation out there that can help keep the focus on programming. I personally follow many great programmers through blogs (and/or reddit) and social media (twitter) and try to learn what tools they use to make life easier. Standard tools such as CMake for building and Git for version control are no-brainers (check the New Members page), but are there tools that play an assistive role in improving efficiency and/or particular programming styles? The answer is yes, of course there are, and John Carmack's article on static code analysis inspired me to run static analysis on some of our own tools, such as LOFT and Kolam. Both projects have been worked on by many different people from various backgrounds (more so in the case of Kolam than LOFT), and it was a wonderful experience analyzing both code bases for any obvious programming slips that can be corrected. Here are the links for Kolam's and LOFT-Lite's detailed code analysis output. I used Cppcheck, a truly easy tool, to get this output with all options enabled and the force check flag on. If anyone is curious why this matters for C/C++ projects, however large or small, try searching for static code analysis on twitter; after John Carmack's interview you will see a wave of tweets highlighting why this is something programmers have been dying to point out to the community as a whole. Happy coding!
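For anyone who wants to reproduce this kind of report, here is the sort of invocation I mean, using Cppcheck's documented --enable and --force options. The source directory and output file name are illustrative, not the actual paths used for Kolam or LOFT.

```shell
# Enable every checker and force analysis of all preprocessor
# configurations, then save the diagnostics (written to stderr)
# for later review.
cppcheck --enable=all --force src/ 2> cppcheck-report.txt
```

Without --force, Cppcheck limits how many #ifdef configurations it will try per file, so on heavily configured C/C++ code bases the flag is what makes the sweep exhaustive (at the cost of a longer run).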