Over the course of the last year, I’ve had quite a few side projects that required some way to get text from a variety of sources, with the code and frameworks for that scattered across a number of private repos. A while ago, I had an inkling to start pulling those together into an open source project. So this will be my Christmas gift for you this year.
SwiftText collects various ways of getting text — or, if possible, Markdown — from a variety of sources and places.
Update: … now Images, PDFs, Word DOCX and also HTML pages or URLs.
One such use case was to get pure text from bank statements for my investment portfolio, so that I could parse the text and construct a CSV file to upload my holdings to Yahoo Finance.
Reading PDFs
For the most part these statements were normal PDFs that had been programmatically created. The advantage of those is that you can get the actual text from selection ranges, just like when you select the text and then copy it to the pasteboard. This is the first sort of PDF you might find: one with vector data. Essentially those files are just a record of drawing commands issued into a vector context.
But there was a problem, because some of those statements were scanned from paper. This is the other, less useful, sort of PDF: those are essentially collections of bitmap images, one per page. But thankfully we do have quite capable OCR on Mac and iOS in the form of the Vision framework.
With both PDF selection ranges and text fragments from Vision you get rectangles with text. So I made it such that you only have to ask a PDFPage for its textLines(). It will first attempt to get the text from the selection ranges, and if that fails it will render the page into a 300 DPI bitmap and then OCR it, to still give you more or less the same result. Those text lines are composed of the fragments that likely form a line, even though there might be tabs or whitespace between them.
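To illustrate the idea of assembling fragments into lines, here is a minimal sketch. It is not SwiftText's actual implementation: the `TextFragment` type, the function names, and the "vertical center within half a line height" rule are all assumptions made for this example, and coordinates are assumed to have a top-left origin.

```swift
import Foundation

/// Hypothetical stand-in for a piece of text plus its bounding box,
/// as you might get from a PDF selection or an OCR result.
/// Assumes a top-left origin with y growing downward.
struct TextFragment {
    let text: String
    let x: Double
    let y: Double
    let width: Double
    let height: Double

    var midY: Double { y + height / 2 }
}

/// Groups fragments into lines. A fragment joins the current line when its
/// vertical center is within half the height of that line's first fragment.
func groupIntoLines(_ fragments: [TextFragment]) -> [[TextFragment]] {
    // Sort top to bottom so fragments of the same line end up adjacent.
    let sorted = fragments.sorted { $0.midY < $1.midY }
    var lines: [[TextFragment]] = []

    for fragment in sorted {
        if let reference = lines.last?.first,
           abs(fragment.midY - reference.midY) < reference.height / 2 {
            lines[lines.count - 1].append(fragment)
        } else {
            lines.append([fragment])
        }
    }

    // Within each line, order the fragments left to right.
    return lines.map { line in line.sorted { $0.x < $1.x } }
}

/// Joins the fragments of each line with a single space.
func lineStrings(_ fragments: [TextFragment]) -> [String] {
    groupIntoLines(fragments).map { line in
        line.map(\.text).joined(separator: " ")
    }
}
```

A real implementation would probably also have to deal with rotated pages, multi-column layouts, and differing font sizes on the same line, all of which this sketch ignores.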
This was the state of this private framework for the longest time. It saw a lot more usage in a receipt scanner I am building for myself. And when a friend asked me to translate several PDFs, it was extremely lucky that I had a quick way to get the raw text from those PDFs to feed into ChatGPT. This opened my mind to the possibility that this might be quite useful in agentic scenarios where agents need to get to the text of things.
So the idea for SwiftText was born: it should be an open source project that collects various forms of getting text — or, if possible, Markdown — from a variety of sources and places.
Reading DOCX
For PDFs I had already covered both types of files, and extracting the OCR part for standalone bitmap images was a simple exercise. Then there was a case where I had to get the pure text from a Word document (DOCX) instead of a PDF. Granted, I could just copy the text out of the document, but my goal is to have this in a form, a tool, that I could use to automate such work in the future.
I had a look at how DOCX files are constructed: they are just a ZIP archive of a couple of XML files. At the heart there is a document.xml which contains the actual document text. So I gave this task to Codex and with nearly no extra input from me it was able to create a utility that would output the pure text from such a Word doc. Behind the scenes it uses XMLParser, so the only external dependency for that is ZIPFoundation, because to my knowledge there is no first-party ZIP reading capability that fits this use case across Apple’s platforms.
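The XMLParser part of this can be sketched roughly as follows. This is not the code Codex produced, just an illustration under a few assumptions: the `document.xml` data has already been pulled out of the ZIP archive (for example with ZIPFoundation), the WordprocessingML elements use the conventional `w:` prefix, and only paragraphs (`w:p`) and text runs (`w:t`) matter. The type and function names are made up for this example.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Collects the text of `w:t` elements, producing one string per `w:p` paragraph.
/// Assumes the conventional "w:" namespace prefix and no nested paragraphs.
final class DocumentXMLTextExtractor: NSObject, XMLParserDelegate {
    private(set) var paragraphs: [String] = []
    private var currentParagraph = ""
    private var insideTextRun = false

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        switch elementName {
        case "w:p": currentParagraph = ""   // a new paragraph begins
        case "w:t": insideTextRun = true    // the following characters are document text
        default: break
        }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // May be called several times per element, so append rather than assign.
        if insideTextRun { currentParagraph += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        switch elementName {
        case "w:t": insideTextRun = false
        case "w:p": paragraphs.append(currentParagraph)
        default: break
        }
    }
}

/// Returns the paragraphs of a `document.xml` payload, separated by newlines.
func plainText(fromDocumentXML data: Data) -> String {
    let extractor = DocumentXMLTextExtractor()
    let parser = XMLParser(data: data)
    parser.delegate = extractor
    parser.parse()
    return extractor.paragraphs.joined(separator: "\n")
}
```

Tabs, line breaks inside a paragraph, tables, and text boxes would each need their own element handling, which is left out here.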
Markdown has a slight edge over pure text because it marks emphasis on specific terms, tells us about headlines of different levels, and also clearly structures lists — numbered or bulleted. But my Codex agent also had no problem pulling out this style information from the DOCX contents.
SwiftText comes with a demo CLI app that lets you try this out. This gives you Markdown for a Word file:
swift run swifttext docx file.docx --markdown
For PDF or bitmaps you do:
swift run swifttext ocr file
For the latter I do have experimental Markdown support, but it’s been very challenging to get semantic information from those kinds of sources. I have the beginnings of a semantic parser, again from Vision, which promises proper paragraphs, tables, and lists. But unfortunately I have not been able to get it to work reliably so far. The problem with tables is that Vision seems to be very easily thrown off by some layouts, detects superfluous columns, and whatnot. The best approach here would probably be to look at lines that always have text at the same x positions and then infer the table structure from that. This is clearly future work.
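The x-position idea could look something like the following sketch. None of this exists in SwiftText as far as this post says. The function name, the tolerance of 4 points, and the rule that a column must appear in at least 80% of the lines are all assumptions picked for illustration.

```swift
import Foundation

/// Given the left x positions of the fragments on each line, returns the
/// x positions that recur (within `tolerance`) on at least `minimumShare`
/// of all lines. Those are candidates for table column starts.
func inferColumnStarts(lineLeftEdges: [[Double]],
                       tolerance: Double = 4,
                       minimumShare: Double = 0.8) -> [Double] {
    guard !lineLeftEdges.isEmpty else { return [] }

    // Tag every edge with the index of the line it came from, then sort by x.
    var tagged: [(x: Double, line: Int)] = []
    for (lineIndex, edges) in lineLeftEdges.enumerated() {
        for x in edges { tagged.append((x: x, line: lineIndex)) }
    }
    tagged.sort { $0.x < $1.x }

    var columns: [Double] = []
    var clusterXs: [Double] = []
    var clusterLines = Set<Int>()

    // Finishes the current cluster and keeps its average x position
    // if enough distinct lines contributed to it.
    func closeCluster() {
        guard !clusterXs.isEmpty else { return }
        let share = Double(clusterLines.count) / Double(lineLeftEdges.count)
        if share >= minimumShare {
            columns.append(clusterXs.reduce(0, +) / Double(clusterXs.count))
        }
        clusterXs.removeAll()
        clusterLines.removeAll()
    }

    for edge in tagged {
        // Start a new cluster once we move too far right of the cluster's first edge.
        if let first = clusterXs.first, edge.x - first > tolerance {
            closeCluster()
        }
        clusterXs.append(edge.x)
        clusterLines.insert(edge.line)
    }
    closeCluster()
    return columns
}
```

This only finds left-aligned columns. Right-aligned numeric columns, which are common in bank statements, would need the same treatment applied to right edges.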
Of course the easiest would be to just hand your files to ChatGPT, or some local vision-enabled LLM, and ask it to just give you the text. But with this decision you leave the area of perfect determinism and structure, and you also start paying for those tokens. There is still something to be said for a purely local solution that leverages functionality available natively on Apple platforms. The reliance on the Vision framework in particular will make it impossible for this to ever be available on other platforms. But I can live with only being able to support iOS and Mac with SwiftText.
Warning: Traits
This package has another first for me: package traits.
With those — if you use Swift tools 6.1 or higher — you can import SwiftText as an umbrella module which itself contains SwiftTextOCR, SwiftTextPDF, and SwiftTextDOCX.
If I understand that correctly, at some point in the future SwiftPM will be able to omit external dependencies if they are not needed. Right now they are still being resolved and downloaded, although not compiled if not referenced by code. The one immediate nicety is that you can simply import SwiftText in your code, and the specified traits decide what gets packaged into that for you.
This is an improvement over the previous method of having separate imports for all targets/products you want: import SwiftTextPDF and import SwiftTextDOCX (and perhaps future traits like — dare I say — HTML).
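For illustration, a consuming package manifest might look roughly like this. Treat it as a sketch: the package URL and version are placeholders, `MyTool` is a made-up consumer, and the trait names `PDF` and `DOCX` are assumptions, so check the SwiftText manifest for the names it actually declares.

```swift
// swift-tools-version: 6.1
import PackageDescription

let package = Package(
    name: "MyTool",  // hypothetical consumer package
    dependencies: [
        // Placeholder URL and version; the trait names are assumed, not confirmed.
        .package(
            url: "https://example.com/SwiftText.git",
            from: "1.0.0",
            traits: ["PDF", "DOCX"]
        ),
    ],
    targets: [
        .executableTarget(
            name: "MyTool",
            dependencies: [
                // The umbrella product; the enabled traits decide what it contains.
                .product(name: "SwiftText", package: "SwiftText"),
            ]
        ),
    ]
)
```

With something like that in place, a single `import SwiftText` in your code would be enough, instead of one import per sub-module.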
Quo Vadis?
I have a few more private things that I would like to see move into SwiftText. I do have a functioning tool that gets Markdown from HTML, which requires libXML. This is handy for getting an LLM-friendly version of web pages.
Some web pages build their content with JavaScript, e.g. the OpenAI API documentation. I’ve got a solution for that as well: it loads the web page with WebKit and waits for the DOM to be complete. Then it extracts the DOM’s HTML and parses that.
So these will be some of the next additions to this project. Then there’s of course more document semantics. It would be great to get proper Markdown tables from anywhere. We’ll see about that. That might come more quickly from Word than from PDFs because XML is orders of magnitude more structured than PDFs.
Conclusion
I am excited to share SwiftText with the OSS community because it has proven its worth to me on many occasions. I could have waited until it is even more polished but I was eager to make my work here public. I have some ideas for the future direction of SwiftText and I invite you to get in touch with specific use cases where enhancements might fit with the spirit of SwiftText.
Update, later the same day….
Because Codex is really amazing at copying code between projects while integrating it, I was able to add my libXML-based HTMLParser as well as the code to convert HTML to Markdown. Enjoy!