If you have ever looked at UIPasteboard you might have seen that there are a variety of public types, like colors, images, plain text or URLs. But applications are also free to implement their own types to be put on the pasteboard when the existing types don’t do justice to your content.
One such custom type is being used between Apple’s iOS apps, like mobile Safari, whenever you copy HTML snippets: “Apple Web Archive pasteboard type”. At first glance this looks really secretive because all you can see there is a long string of numbers representing the NSData for it.
Today I am unveiling an Open Source solution for consuming this pasteboard type as well: DTWebArchive. Using this you can let your users copy something from Safari or Mail and paste it into your app while preserving the rich text. I put this code into a new GitHub repository because this project can be very useful to you even if you don’t dabble with CoreText and NSAttributedString+HTML.
When I started implementing cut/copy/paste for my DTRichTextEditor I found that text copied from Safari is available on the pasteboard both as plain text and as the above-mentioned type. At that point I thought “Boy, wouldn’t it be great if we could get at the HTML that was pasted”. Well, now you can.
The reverse-engineering began by inspecting the NSData with a text editor. All I needed was to see the “bplist” tag at the beginning to know that this is in fact a binary plist. I deserialized it and was astonished to find that it is simply an NSDictionary.
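The same check can be sketched outside of Cocoa. The following is a minimal illustration in Python (not the DTWebArchive code, which is Objective-C): it tests for the “bplist” magic at the start of the data and then deserializes it into a dictionary. The helper names and the sample bytes are made up for this sketch; on iOS the bytes would come from the pasteboard instead.

```python
import plistlib


def looks_like_binary_plist(data: bytes) -> bool:
    """Binary property lists begin with the ASCII magic "bplist"."""
    return data.startswith(b"bplist")


def decode_pasteboard_data(data: bytes) -> dict:
    """Deserialize raw pasteboard bytes into a plain dictionary."""
    if not looks_like_binary_plist(data):
        raise ValueError("data is not a binary plist")
    return plistlib.loads(data)


# Stand-in for the bytes read from the pasteboard (hypothetical sample data).
sample_bytes = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceMIMEType": "text/html",
            "WebResourceData": b"<p>Hello</p>",
        }
    },
    fmt=plistlib.FMT_BINARY,
)

archive = decode_pasteboard_data(sample_bytes)
print(sorted(archive.keys()))
```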
There are several elements to this archive, the most important one being the main contents in a WebArchive class, which have the MIME type “text/html” and are the pure HTML that we are seeking. In addition to that – if you have copied images as well – you see an array of WebResource elements where each encapsulates a file, typically an image, sometimes a CSS file. Those are basically cached copies. The original HTML still has the web URLs in it, but you could search for the WebResource with the same URL to find the local version.
Finally, if you have IFRAMEs in the main HTML then each IFRAME will also have a corresponding WebArchive in a separate array. For example, an HTML5 YouTube video would be shown in an IFRAME.
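To make the lookup by URL concrete, here is a small Python sketch operating on the deserialized dictionary. The key names (WebMainResource, WebSubresources, WebResourceURL, WebResourceData, WebResourceMIMEType) are the ones I understand WebKit to use in its web archive plist, so treat them as an assumption and verify them against a real archive. The function name, the URLs and the sample data are invented for illustration.

```python
from typing import Optional


def find_subresource(archive: dict, url: str) -> Optional[dict]:
    """Return the cached subresource whose URL matches, or None if absent."""
    for resource in archive.get("WebSubresources", []):
        if resource.get("WebResourceURL") == url:
            return resource
    return None


# Hypothetical archive: one HTML main resource and one cached image.
archive = {
    "WebMainResource": {
        "WebResourceURL": "http://example.com/",
        "WebResourceMIMEType": "text/html",
        "WebResourceData": b'<p>Hi <img src="http://example.com/a.png"></p>',
    },
    "WebSubresources": [
        {
            "WebResourceURL": "http://example.com/a.png",
            "WebResourceMIMEType": "image/png",
            "WebResourceData": b"\x89PNG placeholder bytes",
        }
    ],
}

image = find_subresource(archive, "http://example.com/a.png")
missing = find_subresource(archive, "http://example.com/missing.css")
```

A renderer would call something like this for every image URL it meets in the HTML and fall back to loading from the web when nothing matches.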
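Because each IFRAME carries a full archive of its own, the structure is recursive. A hedged Python sketch of walking it follows. It assumes the nested archives sit under a WebSubframeArchives key, which is my understanding of the WebKit plist layout; the function name and the sample data are hypothetical.

```python
from typing import Iterator


def iter_archives(archive: dict) -> Iterator[dict]:
    """Yield the archive itself, then every nested subframe archive, depth first."""
    yield archive
    for sub_archive in archive.get("WebSubframeArchives", []):
        yield from iter_archives(sub_archive)


# Hypothetical page with one embedded IFRAME.
page = {
    "WebMainResource": {"WebResourceURL": "http://example.com/"},
    "WebSubframeArchives": [
        {"WebMainResource": {"WebResourceURL": "http://example.com/embed"}}
    ],
}

frame_urls = [a["WebMainResource"]["WebResourceURL"] for a in iter_archives(page)]
```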
Because WebKit is Open Source you can see for yourself why my cleanroom re-implementation is way more portable. Here’s the source for WebArchive.mm and here for WebResource.mm, which relies heavily on LegacyWebArchive.cpp.
Here’s an example where I copied text with one image.
I decoded this pasteboard item with DTWebArchive and output the HTML.
This is not entirely correct, because it hard-codes NSUTF8StringEncoding but should take the actual encoding of the WebArchive contents. But you get the idea.
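One way to avoid the hard-coded encoding is to read the encoding name stored next to the data and only fall back to UTF-8 when it is missing or unknown. The sketch below shows the idea in Python rather than Objective-C. It assumes the main resource stores the name under WebResourceTextEncodingName, as I believe WebKit does; the function name and the sample archive are made up.

```python
def decode_main_html(archive: dict, fallback: str = "utf-8") -> str:
    """Decode the main resource with its declared text encoding if possible."""
    main = archive["WebMainResource"]
    data = main["WebResourceData"]
    encoding = main.get("WebResourceTextEncodingName") or fallback
    try:
        return data.decode(encoding)
    except LookupError:
        # Encoding name not known to Python: fall back instead of failing.
        return data.decode(fallback, errors="replace")


# Hypothetical archive whose HTML is Latin-1 encoded, not UTF-8.
latin1_archive = {
    "WebMainResource": {
        "WebResourceTextEncodingName": "ISO-8859-1",
        "WebResourceData": "<p>caf\u00e9</p>".encode("latin-1"),
    }
}

html = decode_main_html(latin1_archive)
```

Decoding these particular bytes as UTF-8 would fail, which is exactly the kind of problem the hard-coded version runs into.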
I copied this into a snippet to look at it with the NSAttributedString+HTML demo project and this is what you see:
(Actually I had to cheat a bit because I found that I first had to fix something in NSAS+HTML that would cause the image to shift upwards.)
The next step for me is to add this project as a submodule to NSAS+HTML and implement convenience methods to convert between attributed strings and web archives. A convenience method to get the UIImage for a specific URL might also be nice.
For you this means that you should add copying and pasting of rich HTML to your applications as well.