
LiteParse brings browser-based PDF text extraction with OCR

LiteParse, an open-source project, now has a web version that extracts PDF text entirely in the browser. It uses PDF.js, optional OCR through Tesseract.js, and spatial parsing to keep layouts readable, and its UI shows the result as text or JSON.

via Simon Willison

🔍 Let's dive in

LlamaIndex’s LiteParse, traditionally a Node.js CLI tool for PDF text extraction, now has a version that runs entirely in the browser. The browser version uses PDF.js, offers OCR through Tesseract.js as an option, and applies spatial text parsing to preserve logical reading order, including in multi-column layouts. The article also covers the demo on the project homepage and the plan-driven development process behind it, including a plan.md artifact and deployment to GitHub Pages via GitHub Actions.
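To make the idea of spatial text parsing concrete, here is a minimal TypeScript sketch of one way positioned text fragments could be turned back into reading order. It is not LiteParse's actual algorithm: the `TextFragment` type, the `groupIntoLines` and `readTwoColumns` functions, the `tolerance` parameter, and the fixed `splitX` column boundary are all hypothetical names and simplifications chosen for illustration.

```typescript
// Hypothetical shape for a positioned text fragment. PDF extraction libraries
// expose similar data (a string plus a position), but this type is an assumption.
interface TextFragment {
  text: string;
  x: number; // left edge, in page units
  y: number; // distance from the top of the page, in page units
}

// Group fragments into lines: a fragment joins the current line when its y
// value is within `tolerance` of the y value that started that line.
function groupIntoLines(fragments: TextFragment[], tolerance = 2): string[] {
  const sorted = [...fragments].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: { y: number; items: TextFragment[] }[] = [];
  for (const frag of sorted) {
    const current = lines[lines.length - 1];
    if (current !== undefined && Math.abs(frag.y - current.y) <= tolerance) {
      current.items.push(frag);
    } else {
      lines.push({ y: frag.y, items: [frag] });
    }
  }
  // Within each line, order fragments left to right and join them with spaces.
  return lines.map((line) =>
    [...line.items]
      .sort((a, b) => a.x - b.x)
      .map((f) => f.text)
      .join(" ")
  );
}

// Read a two-column page column by column: every line left of `splitX`
// first, then every line at or right of it. A real parser would have to
// detect the column boundary; here it is passed in (an assumption).
function readTwoColumns(fragments: TextFragment[], splitX: number): string[] {
  const left = fragments.filter((f) => f.x < splitX);
  const right = fragments.filter((f) => f.x >= splitX);
  return [...groupIntoLines(left), ...groupIntoLines(right)];
}
```

Calling `groupIntoLines` on a whole two-column page would merge fragments from both columns that share a baseline into one line, which is the interleaving that column-aware reading order is meant to avoid. `readTwoColumns` shows the simplest possible fix under the assumption that the column boundary is already known.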

Lead coverage: Simon Willison — Extract PDF text in your browser with LiteParse for the web ↗

🕰 The timeline · 1 source

Simon Willison analyst · 4h ago · 3/5

Extract PDF text in your browser with LiteParse for the web ↗


🏷 Tags

software-development, enterprise, consumer, Claude, Claude Code, Llama

🔧 Debug

Cluster ID: c2ae8022eb
Importance (max): 3
Members: 1
Sources: Simon Willison
Earliest: 2026-04-23T21:54:24.000Z
Latest: 2026-04-23T21:54:24.000Z
Lead URL: https://simonwillison.net/2026/Apr/23/liteparse-for-the-web