Over the past week I’ve been working on DTLocalizableStringScanner, or genstrings2 for short. The original genstrings dates back to the NeXTSTEP days. You know how it is: “never change a winning team”, but also “the good is the enemy of the great”, because if something kind of works, why change it?
Besides the other problems I alluded to in my previous article, genstrings is very slow. Internally it is written in Objective-C, as you can tell from the occasional stack trace when it crashes yet again. But it was created at a time when Macs had only a single CPU core and we did not yet have the awesome LLVM with ARC, GCD and multi-threading.
At the time of this writing we have implemented genstrings2 twice: once based extensively on NSScanner, which I became intimately acquainted with while working on DTCoreText. The other is a complete rewrite that ditches NSScanner entirely and does all scanning directly on the individual characters.
My feeling is that the lower-level scanning will be orders of magnitude faster, and at the same time way more robust, because for non-literal macro parameters it knows how to deal with pairs of parentheses and with commas contained in strings. But this just begs to be benchmarked.
In order to have a sufficient amount of code to scan through, I downloaded the latest source bundle of Adium. Getting a time for the individual versions is easy with the time command:
time genstrings -q -s AILocalizedString ~/Desktop/adium-1.4.4/Source/*.m

real	0m7.872s
user	0m6.848s
sys	0m0.053s
When benchmarking you also need to mention the type of bench you are marking. These numbers come from my 2010 MacBook Air, which skews things somewhat because disk I/O is possibly faster than usual. But we are mostly interested in parsing efficiency. The benchmark machine has 2 CPU cores (1.6 GHz Intel Core 2 Duo).
You can see in Activity Monitor that the original genstrings is single-threaded: it uses up to 100% of a single core and shows only one thread.
The version of genstrings2 using NSScanner clocked in much better: 1.7 seconds, down from 7.9 seconds, almost 5 times as fast.
real	0m1.663s
user	0m2.950s
sys	0m0.070s
Now for the turbo version that does all scanning without NSScanner. Instead it works with the individual unichar characters that make up the strings. These are 16 bits wide because internally all NSStrings use UTF-16.
real	0m2.167s
user	0m3.973s
sys	0m0.074s
Whoa, hold your horses: why is that slower than the NSScanner version? My first thought was that maybe NSScanner really is very efficient. But once I fired up Instruments I found the real reason: lots of CFRelease calls were bogging down the turbo scanner.
Looking at the sample distribution in _scanMacro revealed the culprit.
From this it was quite obvious that creating the mutable array to hold the scanned macro parameters was taking most of the CPU time in this area of the code. Turns out the code was allocating and releasing this parameter array way more often than necessary. We only need to create it once we are sure that this is actually one of the macros we are looking for, and not some random word that happens to be made up of macro-name characters.
So moving the array creation a few lines down and changing it to alloc/initWithCapacity: (we know that we will probably have fewer than 10 parameters) yielded this result in the benchmark:
real	0m1.385s
user	0m2.045s
sys	0m0.081s
Looking at Instruments again we see even more potential for optimization. If a single line takes a quarter of the CPU time spent in a method, you have to ask yourself whether there is a way to get rid of this work. Now, validMacroCharacters is already supposed to be a lazy initialization, so how can we optimize this?
While it is lazy, there is still quite some overhead in the method call itself. The first time around the character set is initialized; on subsequent calls the method just returns the initialized ivar. We can easily get around this by moving the initialization of the ivar to init and referencing the ivar directly.
Normally this method call, plus the if check, plus returning the ivar would not be a problem, but _scanMacro is called thousands of times, and so, as proven by Instruments, this is an easy optimization. The benchmark improvement is very impressive:
real	0m1.065s
user	0m1.839s
sys	0m0.059s
At this stage the turbo scanner (with two changes) is already almost twice as fast as the NSScanner version and 7 times as fast as the original genstrings. BUT the third time is the charm, and it so happens that I had a stroke of genius.
My train of thought went: if I can defer CPU-intensive work to the latest possible moment, and if I can reduce the number of times it is necessary, what other method could I come up with to bail out of _scanMacro early? After the previous optimizations, most of the CPU time was split between creating an NSString object and the containsObject: check against the macro dictionary, two items that cannot easily be optimized further.
Upon closer inspection I found that far too often, words were found that were made up of macro-name characters. This resulted in those two statements being called way more often than necessary, even for words that cannot plausibly be macro names because they are too short. In the NSScanner version I had had an optimization to ignore short keywords like “if” and “do”, which look like macro names but are in reality reserved words; but there it did not yield any advantage.
The stroke of genius was to determine the lengths of the shortest and longest macro names (also in init) and then add a simple if comparison to return from _scanMacro if the character sequence is too short or too long to be one of our macro names. The results blew me away.
real	0m0.442s
user	0m0.691s
sys	0m0.044s
Now we have truly reached a point where further optimization seems impossible. Looking at Instruments, the CPU time is now spread much more widely; only characterIsMember: accumulates more CPU time than the other lines, but this method cannot be optimized further, since that would mean writing a replacement for NSCharacterSet.
Let’s summarize the lessons learned:
- Good C code easily outperforms NSScanner
- Defer creating objects to the latest possible moment
- Avoid lazy-instantiation accessors in tight loops that call them extremely often
- Try to find simple, quickly evaluated mathematical or logical abort conditions to avoid CPU-intensive work where possible
Granted, I could spend more effort working out the performance problems of the NSScanner version, but I doubt I could get anywhere near the double-turbo version. The final results thus are:
- genstrings: 7.9 sec
- genstrings (NSScanner): 1.7 sec
- genstrings (optimized turbo): 0.4 sec
The optimized turbo version is 20 times as fast as the original; let’s be content with that. 😀