Our DNA is written in Objective-C
DTCoreText 1.1

Other people take off the Christmas holidays to have fun. I delight in using the time away from normal programming work to work on DTCoreText.

There are enough wide-reaching changes to warrant bumping the second digit of the version number. I am summing them up here because you might have projects that rely on DTCoreText for displaying attributed strings.

Version 1.1 marks the second major improvement related to parsing HTML. The first step was to replace NSScanner with libxml2. The approach there was to keep track of a sort of DOM fragment composed of DTHTMLElements. The actual text was tracked separately, and at certain points the current text would be put into the current element and appended to the string being built.

The problem with the previous approach was that it caused the code to be extremely convoluted because I needed to keep track of a multitude of parameters: whether the previous string had whitespace at the end, whether the next string fragment would require a newline before being flushed because it belonged to a block-level element and much more.

DOM-inating

The new parsing approach greatly simplifies the code inside DTHTMLAttributedStringBuilder, both in the DTHTMLParser delegate callbacks and in the blocks executed at the beginning and end of tags. DTHTMLElement is now itself a subclass of DTHTMLParserNode, which represents a node in the parse tree that has a name, attributes and child nodes.

Because this approach constructs an actual Document Object Model (DOM) of the HTML document, you can output an HTML-like structure similar to how Safari shows the structure of an inspected document.

DOM structure

This output works because DTHTMLParserNode provides a debugDescription method which constructs this properly indented display.
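To illustrate how such an indented display can be built, here is a minimal sketch of a recursive debugDescription. SketchNode and its methods are hypothetical names made up for this example. This is not the actual DTHTMLParserNode implementation, which also handles attributes and text.

```objc
#import <Foundation/Foundation.h>

// Hypothetical stand-in for a parse tree node, NOT the real DTHTMLParserNode
@interface SketchNode : NSObject

@property (nonatomic, copy) NSString *name;
@property (nonatomic, strong) NSMutableArray *children;

- (instancetype)initWithName:(NSString *)name;
- (void)addChild:(SketchNode *)child;
- (NSString *)debugDescription;

@end

@implementation SketchNode

- (instancetype)initWithName:(NSString *)name
{
	self = [super init];

	if (self)
	{
		_name = [name copy];
		_children = [[NSMutableArray alloc] init];
	}

	return self;
}

- (void)addChild:(SketchNode *)child
{
	[self.children addObject:child];
}

// appends this node and its subtree, indenting two spaces per nesting level
- (void)appendToString:(NSMutableString *)string indentLevel:(NSUInteger)level
{
	NSString *indent = [@"" stringByPaddingToLength:level * 2 withString:@" " startingAtIndex:0];

	if ([self.children count] == 0)
	{
		// leaf nodes are shown as a single self-closing line
		[string appendFormat:@"%@<%@ />\n", indent, self.name];
		return;
	}

	[string appendFormat:@"%@<%@>\n", indent, self.name];

	for (SketchNode *child in self.children)
	{
		[child appendToString:string indentLevel:level + 1];
	}

	[string appendFormat:@"%@</%@>\n", indent, self.name];
}

- (NSString *)debugDescription
{
	NSMutableString *string = [NSMutableString string];
	[self appendToString:string indentLevel:0];

	return string;
}

@end
```

Building an html node that contains a body node that contains an empty p node and logging its debugDescription gives each level its own line, with body indented by two spaces and p by four.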

The default mode of DTHTMLAttributedStringBuilder is to discard child nodes that it no longer needs, but you can have the structure preserved and output it at any stage during or after the build process. The _rootNode IVAR points to the root node of the document, and _bodyElement points to the element representing the <body> tag.

libxml2 always adds a body tag even if the parsed HTML did not have one. It does an exceptionally good job in making the HTML well-formed. Must be from its XML parsing heritage.

Class-Cluster Polymorphism

This new approach to a DOM tree allowed me to remove all these crufty state IVARs and a few specialized methods from the begin and end tag handlers. Those are blocks that get called after a new node is opened or closed by libxml2.

Instead there are now multiple subclasses of DTHTMLElement which provide specialized behavior for several HTML tags. Each such HTML element primarily provides an attributedString method that returns a representation of the element as NSAttributedString. The base class provides an extensive implementation that also walks through its children. But for certain tags we don’t care about the children anyway, so the overridden version of attributedString for those does not need to iterate over the child nodes.
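The idea can be sketched with two small classes. SketchElement and SketchLineBreak are hypothetical names for this example only, and using a line break as the specialized case is my assumption for illustration. The real DTHTMLElement implementation is far more extensive.

```objc
#import <Foundation/Foundation.h>

// Hypothetical generic element, NOT the real DTHTMLElement
@interface SketchElement : NSObject

@property (nonatomic, copy) NSString *text;
@property (nonatomic, strong) NSMutableArray *children;

- (NSAttributedString *)attributedString;

@end

@implementation SketchElement

- (instancetype)init
{
	self = [super init];

	if (self)
	{
		_children = [[NSMutableArray alloc] init];
	}

	return self;
}

// generic behavior: own text first, then the representation of each child in order
- (NSAttributedString *)attributedString
{
	NSString *ownText = self.text ? self.text : @"";
	NSMutableAttributedString *result = [[NSMutableAttributedString alloc] initWithString:ownText];

	for (SketchElement *child in self.children)
	{
		[result appendAttributedString:[child attributedString]];
	}

	return result;
}

@end

// Hypothetical specialized element that ignores any children it may have
@interface SketchLineBreak : SketchElement
@end

@implementation SketchLineBreak

- (NSAttributedString *)attributedString
{
	// no iteration over children needed, the result is always a single newline
	return [[NSAttributedString alloc] initWithString:@"\n"];
}

@end
```

The caller never needs to know which subclass it is dealing with. A parent element simply asks each child for its attributedString and appends whatever comes back.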

This technique is what they call a “class cluster” and I got the inspiration for it from UIButton. You might remember that you normally do not create a UIButton with alloc/init but rather via buttonWithType:. The reason for this is that you do not necessarily get a plain UIButton back, since a button type can return a specialized UIButton subclass.

The same goes for DTHTMLElement, where the tag name is used to look up the class to be used for the initialization.

// lookup table shared by all elements, declared at file scope in the implementation file
static NSDictionary *_classesForNames = nil;

+ (void)initialize
{
	// +initialize also runs for each subclass that does not override it, so build the table only once
	if (self != [DTHTMLElement class])
	{
		return;
	}

	// lookup table so that we quickly get the correct class to instantiate for special tags
	NSMutableDictionary *tmpDict = [[NSMutableDictionary alloc] init];

	[tmpDict setObject:[DTHTMLElementBR class] forKey:@"br"];
	[tmpDict setObject:[DTHTMLElementHR class] forKey:@"hr"];
	[tmpDict setObject:[DTHTMLElementLI class] forKey:@"li"];
	[tmpDict setObject:[DTHTMLElementStylesheet class] forKey:@"style"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"img"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"object"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"video"];
	[tmpDict setObject:[DTHTMLElementAttachment class] forKey:@"iframe"];

	_classesForNames = [tmpDict copy];
}

+ (DTHTMLElement *)elementWithName:(NSString *)name attributes:(NSDictionary *)attributes options:(NSDictionary *)options
{
	// look for specialized class
	Class class = [_classesForNames objectForKey:name];

	// use generic if none found
	if (!class)
	{
		class = [DTHTMLElement class];
	}

	DTHTMLElement *element = [[class alloc] initWithName:name attributes:attributes options:options];

	return element;
}

Since all these subclasses derive from DTHTMLElement, I can treat them as such whenever I am iterating over them. Each provides an attributedString method, however specialized it might be. This polymorphism should give a further performance improvement because there is never a need for giant “if trees”.

The only remaining items in the tag start/end handling blocks are related to special CSS style cases that are difficult to model with the current brute-force CSS selector methodology. One example is setting the headerLevel property to 1 for an <h1>. Of course I could make a special H1 DTHTMLElement subclass or maybe even one for all headers.

One exception to this rule is the block that deals with a