Scotty, We Need More Power!

Mar 02, 2013

I remember a long time ago when CPUs still had only a single core and the chip manufacturers where racing for getting the highest Gigahertz. Then some technical limit was reached they shifted their philosophy to increasing the number of CPU cores. The unfortunate side-effect of this approach is that we developers need to adapt our code to make use of these multiple cores or else our CPU-intensive code will find a bottle neck in being only able to max out one core.

Scotty, we need more power!

In a previous article I described how optimizations in genstrings2 (our Open Source replacement for genstrings) led to the re-implementation being upwards of 30 times faster than the original.

The reason of this was mostly grounded in some smart using of multiple GCD queues. In this blog post I shall explore how we can get DTCoreText to also make use of all available CPU cores.

Since its inception DTCoreText has seen two major shifts in architecture. The first in version 1.1 was to switch from using NSScanner to libxml2. The second was to simplify the synthesis of the mutable attributed string by having a node tree and only append whenever nodes are being closed, as opposed to all the time. This was version 1.2.

Consider this Instruments CPU time graph. The big purple plateau is DTCoreText parsing War and Peace HTML and creating an NSAttributedString for it. This is about the longest ePub document there is, if not the longest. So this is a perfect candidate to test the effectiveness of optimizations.

There are some things quite apparent in this chart. This is obviously a dual-core device (iPhone 5) and it maxes out one core fully. Then there are two spikes where a second core kicks in for a very short time.

A Tale of 2 GCD Queues

The reason for this comes from DTCoreText already using two queues for its work, one for running the parser and one for appending to the mutable attributed string.

But these queues are only responsible for the left half of the CPU graph until the second spike. This right half is actually the building of the Core Text layout. The first 1 second spike is the actual parsing of the HTML data, after that the parsing queue is basically idle. Then for the remaining 6 seconds of the first phase the string assembly queue continues to execute the individual asynchronous blocks that where added in response to the DTHTMLParser delegate callbacks.

The second part, which we will not look at today, 8 seconds in duration, could also be halved if two cores would be used instead of one. Which is somewhat harder because as it is working presently we are manually typesetting the entire attributed string from beginning to end. Let’s make a note and revisit that some day as well.

Scotty, We Need More Queues!

At this point it becomes clear that we need to split off part of the work currently filling up the string assembly queue into a new queue. Here’s how the “Heaviest Stack Trace” looks like for this range

When I first saw this I thought to myself, “GCD is 50% overhead”. From this view in Instruments we can clearly see that _dispatch_call_block_and_release is doing quite a bit of work here. No clue how to optimize that, if you have any ideas let me know!

Next in the ranking of heaviest functions is our end-of-element callback from DTHTMLParser. Then there is the attributedString method of DTHTMLElement of which about half is DTHTMLElementText attributedString. So one idea is to split out the creating of the attributed substrings. In the following flow chart you can see the new design.

Even though it was a bit of a crap shot to determine where to best separate the functionality we now are getting a dramatically different profile:

The left portion is visibly taller and shorter in time.

The usual question to ask in this scenario is if the existing functionality is still intact. Initially it looked like it, but the unit tests started to fail. I needed to add a bit of synchronization around several properties of DTHTMLElement and rearranged some code. It took a bit of tweaking but I eventually got the unit tests to pass again.

Now how much does that gain us?

Version 1.2:
DTCoreText created string from 3.2 MB in 5.68 sec

with tree queue:
DTCoreText created string from 3.2 MB in 4.14 sec

with function for start:
DTCoreText created string from 3.2 MB in 4.26 sec

and function for end:
DTCoreText created string from 3.2 MB in 4.33 sec

with fixed unit test:
2DTCoreText created string from 3.2 MB in 4.11 sec

release build: 3.8 sec versus 4.09 sec

As you can see I did some other experiments as well.

One idea was to replace the dispatched blocks with functions. But as the tests clearly demonstrated this was not worth the effort. You have to do some extra magic to reproduce the variable capturing functionality that blocks to automatically for us. But all this work makes it ever so slightly slower. Slower performance coupled with less readability of the code is a clear No Go.

Updated: What About Single Core

One good question that came up in the comments was from Rob67:

Have you thought how this type of change will impact a single core device?

I assume there is a slight overhead with the extra dispatching, so will this actually run slower on a single core or will there be no noticeable difference? (or will it run faster there too?)

That I didn’t think of that … well, I tested it on my iPad 1 and got these results.

New Version
DTCoreText created string from 3.2 MB HTML in 20.20 sec

Version 1.2 (reverted)
DTCoreText created string from 3.2 MB HTML in 20.11 sec

New Version
DTCoreText created string from 3.2 MB HTML in 19.88 sec

So there was one measurement that was 1% slower and one measurement that was 1.2% faster. This we can round to “about the same”, call it “margin of error” and proclaim that more GCD queues don’t negatively effect single core performance.

Then I also did the same test on Mac that definitely has more than 2 cores, my 27″ i7 iMac.

The interesting bit here is that you can see a short while where all three queues are busy maxing at 312% (so about 3 cores) which then drops to 198.4% when only the tree and string assembly queues are still busy. Finally it goes to 99.2% when only one queue has work left to do.

War&Peace takes about one second to process on my Mac, obviously not using all the CPU cores available there. As a rule of thumb with serial queues you can only ever get to 100% times the number of GCD serial queues.

The Verdict

Having multiple dispatch queues allows for a greater deal of concurrency while still working of individual tasks in a serial fashion. Obviously you have to parse through the HTML data from top to bottom, build the tree the same way and always append to the attributed string at the end. But that does not mean that the interleaved work units have to be forced to wait for each other.

The final result from our benchmarks have shown a 28% improvement in duration for creating an attributed string from our War&Peace test document. Granted this is not the halving of the duration that we where hoping for, but nevertheless it is a great start.

By having one serial queue per stream of tasks I have shown how you can improve usage of multiple CPU cores where available. As more and more iOS devices sport multiple CPU cores it is up to us developers to code our CPU-intensive code such that it can finish those tasks in the least amount of time.

Categories: Recipes

7 Comments »

Anonymous
March 3, 2013 | 3:16 pm

Hm according to the profiler, _dispatch_call_block_and_release is only 3% overhead (self % is what is interesting here). You probably have some other calls on the “same level” further down that is responsible for the remaining ~50%!
rob67
March 3, 2013 | 11:09 pm

Have you thought how this type of change will impact a single core device?

I assume there is a slight overhead with the extra dispatching, so will this actually run slower on a single core or will there be no noticeable difference? (or will it run faster there too?)

Ad