DTCoreText – New Formula!

Jan 25, 2012

I chose this article’s title to try and grab your attention. The product is still the same and does the same thing; the only difference is under the hood. And as such it is your job, should you choose to accept it, to marvel at the benefits that the new old parsing engine brings us.

Ever since my friends at Scribd showed me a first prototype for HTML parsing based on libxml2 I was, admittedly, jealous. This prototype basically worked by having libxml2 parse out the individual paragraphs of an article and then displaying each paragraph in its own cell of a UITableView. Back then I brushed off the suggestion and went with something I understood: NSScanner.

All the C code necessary to deal with libxml2 seemed overly daunting, so I went with a simple basic structure for the parsing, something like this (pseudocode):

while (there_is_more)
{
   if (we_scanned_a_tag)
   {
      if (is_tag_a)
      {
         // code specific to tag a
      }
      else if (is_tag_b)
      {
         // code specific to tag b
      }
      else if ...
   }
   else
   {
      // we must be inside a tag
 
      // deal with skipping over a tag that is incomplete, i.e. crap
 
      if (we_scanned_some_text_before_the_next_tag_open)
      {
         // append the text with correct formatting to the output
      }
   }
}

This structure had grown organically as I added support for additional tags, and it went into the initWithHTML category method because that seemed to be the logical place. I did not think far enough ahead to see that this approach would preclude having an event-based parser do callbacks into my code, because that would mean adding these event-handling methods to NSAttributedString.

Even the above pseudocode is long, so you can imagine how much of a spaghetti the original code became. Much of the problem didn’t actually come from all these tags; those were simply a big if-statement. The complexity came from having to deal with all the special cases where the HTML might not be well-formed.

In my open source genstrings2 I saw how much faster pure C code performs than NSScanner. This also rekindled my wish to switch to libxml2, because it too is written in low-level, highly optimized C.

I approached libxml2 in several steps:

1. Jealousy – “Boy, wouldn’t it be great, but I’m afraid that this is out of my league”
2. Announce Intention – I wrote an issue on GitHub hoping for somebody to step forward
3. Do an Experiment and Document – I googled a bit and put together Part 1 of my libxml HTML tutorial.
4. Write a Wrapper – For Part 2 of my libxml HTML tutorial I wrote an Objective-C wrapper for libxml2.
5. Astonishment – The feeling you get when you find that you begin to understand the C code needed
6. Benchmark – I removed all string building code and compared the raw parsing performance of both approaches
7. Transform the Pasta – Moved the code for building the attributed string into an aptly named class and have this driven from events generated by the new HTML parser.

I named the final step this way because if you break up spaghetti code into several logical pieces and then stack these in several layers, you get a different kind of pasta: lasagne.

For step 6, the benchmark, I moved the NSScanner parsing loop into its own class and directly compared the running times on my iPhone 4S, resulting in this tweet:

War&Peace HTML (3.4 MB), NSScanner: 4.264s, libxml2: 1.398s = 3x as fast on single thread, plus latter fixes HTML structure

At this stage it was clear to me that I needed no further convincing, so I went to work in a branch of the project. Most of the work was simply copying and pasting the attributed string building code into the right place: the tag start, tag end or characters found event method. This also allowed me to omit some workarounds that were needed to deal with HTML that is not well-formed.

The second big BIG advantage of libxml2 is that its HTML parser fixes up the structure for you and also adds missing html and body tags, so that you end up with a perfect structure. It even adds a </br> right after a <br>. Even though this is completely unnecessary, it makes the HTML look like perfect XML, which is much nicer to work with.

While I was doing the migration, new issues and pull requests with fixes started to come in. This put me under a bit of stress because I needed to include the fixes in both branches, which is why I decided to merge the branch back into master at the earliest possible time.

The initWithHTML method has shrunk to a much more manageable size:

- (id)initWithHTML:(NSData *)data options:(NSDictionary *)options documentAttributes:(NSDictionary **)dict
{
	// only with valid data
	if (![data length])
	{
		return nil;
	}

	// all parsing and string building now lives in the builder class
	DTHTMLAttributedStringBuilder *stringBuilder = [[DTHTMLAttributedStringBuilder alloc] initWithHTML:data options:options documentAttributes:dict];

	// run the parse; the parser events drive the string building
	[stringBuilder buildString];

	return [stringBuilder generatedAttributedString];
}

Benchmark after the Merge

And of course I did some further benchmarking on my iPhone 4S to measure the resulting speed increase. All tests were done by simply adding two NSLogs at the beginning and end of the initWithHTML method and subtracting the timestamps. Note that this is just the time for building the attributed string and does not include layout or drawing.

Demo HTML Snippet

NSScanner: 50 ms

libxml2: 43 ms

War&Peace ePub HTML

NSScanner: 10.968 sec

libxml2: 8.796 sec

This cuts the overall parsing and string building time by roughly 14% for the demo snippet and 20% for War&Peace.

You might now ask what happened to the roughly 3x speedup we saw from just comparing the parsers. The string building still happens on the same thread as the parsing, evidently makes up the bulk of the total time, and by itself has many opportunities for optimization.

The next thing I’d like to do is to put the string building operations onto their own GCD background queue. This way the events coming in from libxml2 would be handed off to a queue running on a different thread, and the parser would immediately return to parsing. This could easily double the overall scanning performance because you can now make use of the two CPU cores present in the more modern iOS devices.

And there are obviously many more optimizations that are now feasible, because you no longer face a daunting spaghetti monster but instead stand a chance of understanding what is happening in DTHTMLAttributedStringBuilder.

Conclusion

The three main advantages of libxml2 over NSScanner are performance, resilience against HTML that is not well-formed, and simpler code through event-based handling of HTML.

This merge now makes it possible again for interested developers to contribute optimizations and new features because the code has become so much simpler to read and understand.

Categories: Projects
