DTCoreText 1.1

Other people take off the Christmas holidays to have fun. I delight in using the time away from normal programming work to work on DTCoreText.

There are enough wide-reaching changes to warrant bumping the second digit of the version number. I need to sum them up here because you might have projects that rely on DTCoreText for displaying attributed strings.

Version 1.1 marks the second major improvement related to parsing HTML. The first step was to replace NSScanner with libxml2. The approach there was to keep track of a sort of DOM fragment made up of DTHTMLElements. The actual text would be tracked separately, and at certain points the current text would be put into the current element and appended to the string being built.

The problem with the previous approach was that it caused the code to be extremely convoluted because I needed to keep track of a multitude of parameters: whether the previous string had whitespace at the end, whether the next string fragment would require a newline before being flushed because it belonged to a block-level element and much more.

DOM-inating

The new parsing approach greatly simplifies the code inside DTHTMLAttributedStringBuilder, both in the DTHTMLParser delegate callbacks and in the blocks that are executed at the beginning and end of tags. DTHTMLElement is now itself a subclass of DTHTMLParserNode. This represents a node in the parse tree which has a name, attributes and child nodes.

Because this approach constructs an actual Document Object Model (DOM) of the HTML document, you can output an HTML-like structure similar to how Safari shows the structure of an inspected document.

[Image: DOM structure]

This output works because DTHTMLParserNode provides a debugDescription method which constructs this properly indented display.
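To illustrate the idea, here is a minimal sketch of a node class whose debugDescription walks the tree recursively and indents each level. SketchNode and its methods are made-up stand-ins, not the actual DTHTMLParserNode implementation, and the two-space indent and tag notation are assumptions.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for a parse tree node; not the library's actual class.
@interface SketchNode : NSObject
@property (nonatomic, copy) NSString *name;
@property (nonatomic, strong) NSMutableArray *childNodes;
- (instancetype)initWithName:(NSString *)name;
- (void)addChildNode:(SketchNode *)child;
@end

@implementation SketchNode

- (instancetype)initWithName:(NSString *)name
{
	self = [super init];
	if (self)
	{
		_name = [name copy];
		_childNodes = [[NSMutableArray alloc] init];
	}
	return self;
}

- (void)addChildNode:(SketchNode *)child
{
	[_childNodes addObject:child];
}

// appends this node and, recursively, its children, indenting two spaces per level
- (void)appendToString:(NSMutableString *)string indentLevel:(NSUInteger)level
{
	NSString *indent = [@"" stringByPaddingToLength:level * 2 withString:@" " startingAtIndex:0];

	// leaf nodes are written in self-closing form
	if ([_childNodes count] == 0)
	{
		[string appendFormat:@"%@<%@ />\n", indent, _name];
		return;
	}

	[string appendFormat:@"%@<%@>\n", indent, _name];
	for (SketchNode *child in _childNodes)
	{
		[child appendToString:string indentLevel:level + 1];
	}
	[string appendFormat:@"%@</%@>\n", indent, _name];
}

- (NSString *)debugDescription
{
	NSMutableString *string = [NSMutableString string];
	[self appendToString:string indentLevel:0];
	return string;
}

@end
```

Calling debugDescription on the root then yields one line per tag, with children nested one level deeper than their parent.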

The default mode of DTHTMLAttributedStringBuilder is to discard child nodes that it does not need any more, but you can have the structure preserved and output it at any stage during or after the build process. The _rootNode IVAR points to the root node of the document, and _bodyElement points to the element representing the <body> tag.

libxml2 always adds a body tag even if the parsed HTML did not have one. It does an exceptionally good job in making the HTML well-formed. Must be from its XML parsing heritage.

Class-Cluster Polymorphism

This new approach to a DOM tree allowed me to remove all these crufty state IVARs and a few specialized methods from the begin and end tag handlers. Those are blocks that get called after a new node is opened or closed by libxml2.

Instead there are now multiple subclasses of DTHTMLElement which provide specialized behavior for several HTML tags. Each such HTML element primarily provides an attributedString method that returns a representation of said element as NSAttributedString. The base class provides an extensive implementation that also walks through its children. But for an <img> tag we don’t care about these anyway, so the overridden version of attributedString does not need to iterate over the child nodes.

This technique is what they call a “class cluster” and I got the inspiration for it from UIButton. You might remember that you normally don’t create a UIButton with alloc/init but rather with buttonWithType:. The reason for this is that you don’t necessarily get a plain UIButton back; a button type can return a specialized UIButton subclass.

The same goes for DTHTMLElement: there the tag name is used to look up the class to be used for initialization.

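+ (void)initialize
{
	// +initialize also runs for subclasses that do not override it, so build the table only once
	if (self != [DTHTMLElement class])
	{
		return;
	}

	// lookup table so that we quickly get the correct class to instantiate for special tags
	NSMutableDictionary *tmpDict = [[NSMutableDictionary alloc] init];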
 
	[tmpDict setObject:[DTHTMLElementBR class] forKey:@"br"];
	[tmpDict setObject:[DTHTMLElementHR class] forKey:@"hr"];
	[tmpDict setObject:[DTHTMLElementLI class] forKey:@"li"];
	[tmpDict setObject:[DTHTMLElementStylesheet class] forKey:@"style"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"img"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"object"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"video"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"iframe"];
 
	_classesForNames = [tmpDict copy];
}
 
+ (DTHTMLElement *)elementWithName:(NSString *)name attributes:(NSDictionary *)attributes options:(NSDictionary *)options
{
	// look for specialized class
	Class class = [_classesForNames objectForKey:name];
 
	// use generic if none found
	if (!class)
	{
		class = [DTHTMLElement class];
	}
 
	DTHTMLElement *element = [[class alloc] initWithName:name attributes:attributes options:options];
 
	return element;
}

Since all these subclasses derive from DTHTMLElement I can treat them as such whenever I am iterating over them. Each provides an attributedString method, however specialized it might be. This polymorphism should give a further performance improvement because there is never a need for giant “if trees”.
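The polymorphism described above can be sketched roughly as follows. These are simplified, made-up classes (SketchElement, SketchElementBR), not the real DTHTMLElement code, and returning a plain newline for the line break is an assumption made for illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical, simplified stand-in; the real element classes do much more.
@interface SketchElement : NSObject
@property (nonatomic, copy) NSString *text;             // text content, if any
@property (nonatomic, strong) NSMutableArray *children; // child SketchElements
- (NSAttributedString *)attributedString;
@end

@implementation SketchElement

- (instancetype)init
{
	self = [super init];
	if (self)
	{
		_children = [[NSMutableArray alloc] init];
	}
	return self;
}

// generic behavior: own text followed by whatever the children return
- (NSAttributedString *)attributedString
{
	NSMutableAttributedString *result = [[NSMutableAttributedString alloc] initWithString:(_text ?: @"")];
	for (SketchElement *child in _children)
	{
		[result appendAttributedString:[child attributedString]];
	}
	return result;
}

@end

// specialized behavior for a line break: children are never visited
@interface SketchElementBR : SketchElement
@end

@implementation SketchElementBR

- (NSAttributedString *)attributedString
{
	return [[NSAttributedString alloc] initWithString:@"\n"];
}

@end
```

The caller only ever sends attributedString to a SketchElement and never has to check which tag it is dealing with; the runtime dispatches to the right override.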

The only remaining items in the tag start/end handling blocks are related to special CSS style cases that are difficult to model with the current brute-force CSS selector methodology, for example setting the headerLevel property to 1 for an <h1>. Of course I could make a special H1 DTHTMLElement subclass or maybe even one for all headers.

One exception to this rule is the block that deals with a <style> block. Version 1.0 of DTCoreText has a bug where only the first part of a lengthy style sheet is actually parsed, simply because I never thought that libxml2 would send multiple CDATA callbacks for longer blocks.
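A minimal sketch of how multiple CDATA callbacks can be collected into one style sheet is shown below. The class and method names are hypothetical and only mimic the shape of parser callbacks; this is not the actual DTHTMLAttributedStringBuilder code, and UTF-8 input is assumed.

```objc
#import <Foundation/Foundation.h>

// Hypothetical collector that mimics start/CDATA/end parser callbacks.
@interface SketchStyleCollector : NSObject
- (void)didStartElement:(NSString *)name;
- (void)foundCDATA:(NSData *)chunk;
- (NSString *)didEndElement:(NSString *)name; // complete style text when a style element ends, else nil
@end

@implementation SketchStyleCollector
{
	NSMutableData *_styleData; // non-nil only while inside a style element
}

- (void)didStartElement:(NSString *)name
{
	if ([name isEqualToString:@"style"])
	{
		_styleData = [[NSMutableData alloc] init];
	}
}

- (void)foundCDATA:(NSData *)chunk
{
	// may be called several times for one block, so append instead of replace
	[_styleData appendData:chunk];
}

- (NSString *)didEndElement:(NSString *)name
{
	if (![name isEqualToString:@"style"] || !_styleData)
	{
		return nil;
	}

	// decode once at the end so a multi-byte character split across chunks stays intact
	NSString *styles = [[NSString alloc] initWithData:_styleData encoding:NSUTF8StringEncoding];
	_styleData = nil;
	return styles;
}

@end
```

The important part is that the style text is only interpreted once the closing tag arrives, not on the first callback.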

In the previous version, HTML created from attributed strings would attach the style information to the individual tags. Inspired by Apple’s NSHTMLWriter I created DTHTMLWriter, and I’m now also aggregating styles into classes per tag. Naturally this increases the size of such a style block enough for the above-mentioned problem to manifest.

This is a case where a state in the string builder class needs to be modified because of the contents of the stylesheet and so the logical place is in the handler block.

Oh, and some bug fixes, too!

First there was the problem with the long style blocks. That’s working now.

Actually the motivation for this week’s worth of changes came from an open issue. The problem was that if you had two underlined words then the space between them would also get underlined. This was quite hard to fix in the old parse style because of the detachment of text and formatting information. This is fixed now.

There was a bug in DTLinkButton that would manifest if you were using highlighting when tapping links. The assumption was that there could only ever be one glyph run, so only the first one got displayed. As a result, hyperlinks consisting of multiple Chinese characters, which are made up of multiple glyph runs, would show only the first run. Fixed.

There was also a second problem with DTLinkButton that would only show for very small fonts. This button has a feature to make itself grow in size if the hit area becomes too small for normal fingers. This caused small text to shift to the left side. Fixed.

Then there was a problem with pt versus px as units of length. Previously I had treated them as equivalent, but it turns out that 1 pt should be 1.3333 px. Because of this, HTML that specified font sizes in pt would make the text appear too small. Fixed.
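The factor comes from CSS defining 1 inch as both 96 px and 72 pt, so 1 pt is 96/72 px. A small sketch of such a conversion follows; SketchPixelsFromCSSLength is a hypothetical helper that only understands px and pt, not the function DTCoreText actually uses.

```objc
#import <Foundation/Foundation.h>

// Converts a CSS length such as "12pt" or "16px" to px.
// Hypothetical helper: anything that does not end in "pt" is treated as px.
static double SketchPixelsFromCSSLength(NSString *length)
{
	NSString *trimmed = [length stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];

	// doubleValue parses the leading number and ignores the unit suffix
	double value = [trimmed doubleValue];

	if ([trimmed hasSuffix:@"pt"])
	{
		// 1in = 96px = 72pt, so scale by 96/72 (about 1.3333)
		return value * 96.0 / 72.0;
	}

	return value;
}
```

With this, a font size given as 12pt comes out as 16 px, whereas treating pt as px would have left it at 12.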

Conclusion

I was too lazy to do a benchmark comparison between both versions, but my guess is that the new method should be faster. Maybe somebody can verify this by timing it.
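For anyone who wants to try, a rough timing harness could look like the sketch below. SketchAverageDuration is a made-up helper; the block you pass in would contain the HTML-to-attributed-string conversion, built once against version 1.0 and once against 1.1.

```objc
#import <Foundation/Foundation.h>

// Runs the block `iterations` times and returns the average seconds per run.
static NSTimeInterval SketchAverageDuration(NSUInteger iterations, void (^work)(void))
{
	if (iterations == 0)
	{
		return 0;
	}

	NSDate *start = [NSDate date];
	for (NSUInteger i = 0; i < iterations; i++)
	{
		// drain temporary objects each round so they do not pile up across iterations
		@autoreleasepool
		{
			work();
		}
	}

	return [[NSDate date] timeIntervalSinceDate:start] / iterations;
}
```

Averaging over many iterations with the same input smooths out one-off costs, though this is wall-clock time and only a rough indication.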

I’m hoping that the source code in this new version is so much simpler to understand that more people can contribute to the project. For example, now that the entire tree of a <table> tag exists, one could implement a simple form of table support.

Just two days ago it didn’t look to me like I could get this development branch to pass 100% of the unit tests, but I persevered and finally got all test cases to pass, including a few new ones. This should mean that the results are identical to or better than what version 1.0 was generating.

Your feedback is cordially invited.


Categories: Updates

9 Comments »

  1. Does DTCoreText support Chinese?

  2. Help!
    I use CoreText from GitHub, but the demo won’t compile.
    I set up a new project step by step, but it always shows
    ‘DTHTMLParser.h’ file not found
    Why?

  3. You also need to clone the submodules, which include DTHTMLParser in DTFoundation. See the README.