
PDF Text Not Searchable? Run OCR to Fix It

2026-05-17 8 min read

Why Your PDF Refuses to Let You Search It

You hit Ctrl+F, type a word you know is on page 4, and... nothing. The text is right there, clear as day, but your PDF acts like it's a photograph. That's because, for all practical purposes, it *is* a photograph.

This maddening situation usually happens for one of two reasons. Someone might have scanned a physical document (a signed contract, an old invoice, a medical record) and saved it as a PDF without any text recognition. The scanner just captured a picture of the page, not the letters and words on it. Alternatively, some software applications create PDFs by flattening everything into a single image layer, discarding the underlying text data even if the original file had perfectly selectable text.

The result is a PDF that looks completely normal but contains zero machine-readable characters. You can't search it. You can't copy-paste from it. Screen readers find nothing to read. And if you try converting it to Word or Excel, you'll get a blank document or a file full of empty boxes.

The solution is Optical Character Recognition, or OCR. OCR software analyzes the pixels in an image, identifies the shapes of letters, and reconstructs the actual text. After running OCR, your PDF gains a hidden text layer that sits invisibly beneath the visual image. It still looks identical, but now Ctrl+F works, copy-paste works, and your conversions to editable formats will actually contain content.
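If you want to confirm that a PDF really is image-only before reaching for OCR, one option is to extract the text from each page and see whether anything comes back. The sketch below is a minimal illustration, not part of CocoConvert: the helper names and the 20-character threshold are assumptions, and `extract_page_texts` relies on the third-party pypdf library being installed.

```python
def looks_image_only(page_texts, min_chars_per_page=20):
    """Return True if no page yields a meaningful amount of extractable text.

    page_texts: one string per page (empty when nothing could be extracted).
    min_chars_per_page: assumed threshold, so a stray page number alone
    does not count as a real text layer.
    """
    return all(len(text.strip()) < min_chars_per_page for text in page_texts)


def extract_page_texts(pdf_path):
    """Read the text layer of each page with pypdf (third-party)."""
    # Imported lazily so looks_image_only() works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


# Hand-written page texts standing in for real extraction results.
scanned = ["", "  ", ""]
born_digital = ["Invoice 2024-118\nTotal due: 1,250.00 EUR",
                "Terms and conditions apply."]

print(looks_image_only(scanned))       # True
print(looks_image_only(born_digital))  # False
```

On a real file you would call `looks_image_only(extract_page_texts("scan.pdf"))`, where `scan.pdf` is a placeholder path. A `True` result suggests the file is a candidate for OCR.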

What OCR Actually Does (and Where It Can Go Wrong)

At its core, an OCR engine breaks an image into regions, isolates individual character shapes, and plays a high-stakes matching game against its trained models. Modern engines, like the Tesseract-based pipeline CocoConvert uses, are trained on millions of real-world documents. They handle standard fonts, mixed-case text, and common layouts with accuracy rates that often exceed 98% on clean scans.

But don't let that 98% lull you into a false sense of security. A 10-page document with 500 words per page has roughly 30,000 characters (5,000 words at about six characters each, counting spaces). With 98% character-level accuracy, you're still looking at around 600 errors. That's more than enough to make a legal document unreliable or a financial report dangerously misleading.

Accuracy plummets with poor source material. Low-resolution scans (anything under 200 DPI), pages with heavy background textures, funky decorative fonts, irregularly spaced columns, and documents in less common languages all present challenges. A faded thermal receipt scanned at 96 DPI will produce pure gibberish, no matter how smart the OCR engine is.

Even the page orientation matters. A document scanned just 3–4 degrees crooked can throw off the character segmentation process. Good OCR pipelines, including CocoConvert's, run a 'deskew' step to automatically detect and correct this rotation. But if your scan is badly angled (think of a quick phone photo), the results will be imperfect.

Handwriting is the final boss. Standard OCR is built for printed text. Cursive, in particular, will produce wildly unreliable results from any general-purpose tool. While specialized handwriting recognition exists, it's a completely different technology, and CocoConvert does not currently offer it. If your document is handwritten, OCR will try its best, but you must expect significant errors and plan on a full manual review.
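The error arithmetic above is easy to reproduce for your own documents. This is a back-of-the-envelope sketch: the function name is made up for illustration, and the figure of six characters per word (about five letters plus a space) is an assumption chosen to match the 30,000-character estimate.

```python
def expected_char_errors(pages, words_per_page, accuracy, chars_per_word=6):
    """Estimate total characters and expected misread characters.

    accuracy is per-character accuracy as a fraction, e.g. 0.98 for 98%.
    chars_per_word=6 is an assumed average (letters plus one space).
    """
    total_chars = pages * words_per_page * chars_per_word
    errors = round(total_chars * (1 - accuracy))
    return total_chars, errors


total, errors = expected_char_errors(pages=10, words_per_page=500, accuracy=0.98)
print(total, errors)  # 30000 600

# Even a much better engine leaves errors to find in review.
print(expected_char_errors(10, 500, 0.999)[1])  # 30
```

The point of running the numbers is that a small change in accuracy changes the review workload by an order of magnitude.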

How to Run OCR on a Scanned PDF Using CocoConvert

Getting this done is simple. Head over to CocoConvert and find the PDF to Searchable PDF converter. You can find it under the PDF Tools section or just type 'OCR' into the main search bar.

Now, upload your file. CocoConvert takes PDFs up to 200 MB on the free tier, and that limit jumps to 2 GB for paid plans. If you're tackling a massive scanned archive that's bigger than your plan allows, you'll need to split it first with the PDF Split tool before running OCR.

After the upload, you'll see an OCR settings panel. Pay attention here. The most important choice is language. While the default is English, the engine supports over 100 languages. If your document is in French, German, Spanish, or something else, you must select it. Choosing the wrong language won't break the conversion, but your error rate will spike, especially with accented characters.

The other key choice is output format. You can get a searchable PDF (where the original image is preserved with a text layer added underneath) or a text-only PDF (which rebuilds the document from the recognized text alone). For almost any common use case, such as contracts, invoices, or reports, you want the searchable PDF. The text-only option can be useful for pulling raw text to edit elsewhere, but it will discard the original layout and any embedded images.

Hit 'Convert,' give it a minute (a 20-page scan usually takes 30–90 seconds), and download your file. Open it, press Ctrl+F, and try searching for a word. It's a little bit of magic.
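To decide whether a file needs splitting before upload, you can compare its size against the plan limits quoted above. This is only a rough sketch: the names are hypothetical, and it assumes decimal units (1 MB = 1,000,000 bytes), since the article does not say how CocoConvert counts megabytes.

```python
MB = 1000 ** 2  # assumption: decimal megabytes
GB = 1000 ** 3

# Upload limits as stated in the text above.
PLAN_LIMITS = {"free": 200 * MB, "paid": 2 * GB}


def parts_needed(file_size_bytes, plan="free"):
    """Minimum number of pieces so that each piece can fit under the plan limit."""
    limit = PLAN_LIMITS[plan]
    # Integer ceiling division avoids floating-point rounding surprises.
    return max(1, (file_size_bytes + limit - 1) // limit)


archive = 450 * MB
print(parts_needed(archive, "free"))  # 3
print(parts_needed(archive, "paid"))  # 1
```

Treat the result as a lower bound. A real split happens on page boundaries and pages vary in size, so it is safer to aim for pieces comfortably below the limit. For an actual file, `os.path.getsize(path)` gives the size in bytes.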

Checking OCR Quality Before You Rely on the Output

Never trust OCR output blindly. Just because the conversion finished doesn't mean it's perfect. It just means the engine processed every page. Now you need to verify the quality.

The quickest way is the copy-paste test. Seriously, do this every time. Open your new PDF, select a full paragraph of text, copy it, and paste it into a simple text editor. Now read it. Look for the classic OCR mistakes: garbled words, spaces disappearing between words, numbers mistaken for letters (the digit '0' becoming the letter 'O' is an old favorite), and mangled punctuation.

For any document where accuracy is non-negotiable (legal contracts, medical records, financial statements), you need to be more thorough. Open the original scan and the new searchable version side-by-side. Spot-check at least 10% of the pages, paying special attention to dense text, small fonts, or any areas where the original scan looked blurry.

If you're finding error rates above 1–2%, the problem is almost certainly your source file. Rescanning at 300 DPI instead of 150 DPI can work wonders. Most modern scanners default to 200 or 300 DPI; check your settings for 'Scan Resolution' or 'Output Quality.' If you're using phone photos, dedicated scanner apps like Microsoft Lens or Adobe Scan are vastly superior to your default camera app, as they correct perspective and boost contrast.

One thing to know: CocoConvert doesn't provide a confidence score or highlight questionable words in the output. This is a real limitation for certain high-stakes workflows. Enterprise platforms like ABBYY FineReader offer this, and for compliance-sensitive work, that extra verification layer can justify the higher cost.
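If you retype (or otherwise obtain) a trusted version of one paragraph, you can put a number on the copy-paste test by computing a character error rate: the edit distance between the trusted text and the OCR text, divided by the length of the trusted text. The sketch below uses only the standard library. The function names are illustrative, and the sample strings are invented to show the classic 0/O confusion.

```python
import random


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_text):
    """Fraction of reference characters that OCR got wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)


def pages_to_spot_check(page_count, percent=10, seed=None):
    """Pick a random sample covering at least `percent` of the pages (1-based)."""
    k = max(1, (page_count * percent + 99) // 100)  # integer ceiling
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, page_count + 1), k))


reference = "Invoice total: 1,050.00 due on 30 June"
ocr_text = "Invoice total: 1,O5O.00 due on 3O June"  # three zeros read as 'O'

cer = char_error_rate(reference, ocr_text)
print(f"{cer:.1%}")  # 7.9%
print("rescan recommended" if cer > 0.02 else "acceptable")
```

The 2% cutoff in the last line mirrors the upper end of the 1–2% rule of thumb. `pages_to_spot_check(20)` returns two page numbers to compare side-by-side, which matches the 10% guideline.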

Converting a Scanned PDF to an Editable Word Document

A searchable PDF is great, but what if you need to actually *edit* the content? Maybe you need to fix typos, update numbers, or completely reformat a section. For that, you'll want to convert the scanned PDF directly into a Word document.

CocoConvert can do this in one shot. Just use the PDF to Word converter and make sure you enable the OCR option in the settings; look for a toggle labeled 'Enable OCR for scanned documents.' When this is on, the engine first recognizes the text and then does its best to reconstruct the original layout in Word, complete with matching fonts and paragraph styles.

The key phrase here is 'does its best.' The quality of this reconstruction can vary wildly depending on how complex your document is. A simple, single-column document like a letter or memo will probably convert very cleanly. A multi-column magazine layout, a dense table, or anything with text wrapped around images will absolutely require manual cleanup. Tables are a notorious challenge; the OCR might recognize the text in the cells perfectly, but rebuilding the table structure depends entirely on how clear the borders are in the scan.

You must budget time to clean up the Word output. For a 10-page report with standard formatting, plan on at least 20–30 minutes of tidying up fonts, page numbers, and headers. For a 50-page beast with tables and mixed layouts, it will be significantly more. Think of OCR-to-Word conversion as giving you a powerful head start, not a finished product.

When OCR Is the Wrong Tool for the Problem

OCR is a powerful fix, but only for the right problem. Before you run a file through an OCR engine, it's smart to diagnose what's actually wrong with your PDF, because not all unsearchable PDFs are simple image scans.

Sometimes, a PDF has real text, but it's encoded with a custom font that doesn't map to standard characters. You'll know this is the case if you can select text, but copying and pasting it results in gibberish: random symbols, empty boxes, or jumbled letters. This is a font encoding issue, not an image issue. Running OCR on it is like putting a band-aid on a broken leg; it won't fix the underlying problem and just adds another layer of potential errors. The real solution is to re-export the PDF from its source with standard font embedding.

Another culprit is password protection. Some PDFs are set to restrict text copying, which can make them seem unsearchable. OCR is useless here because the text data is present, just locked. You need the password to remove the restriction first.

And of course, sometimes a PDF is just corrupted. If the file structure is damaged, it might not even render correctly. While CocoConvert can repair minor corruption, a severely damaged file might just fail to process at all.

Finally, don't mistake OCR for a full accessibility solution. If your goal is to make a PDF fully usable by screen readers for visually impaired users, OCR is only the first step. True accessibility requires a tagged structure (defining headings, lists, reading order, and alt text for images), which is a separate, more involved process that automated tools don't handle well yet.
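The diagnosis above boils down to a short decision procedure. The sketch below encodes it as a function. The parameter names and return labels are invented for illustration, and the answers to each question still come from you opening the file and trying things by hand.

```python
def diagnose_unsearchable_pdf(can_select_text,
                              paste_is_gibberish=False,
                              copying_restricted=False,
                              renders_correctly=True):
    """Map observed symptoms to the fix suggested in the text above."""
    if not renders_correctly:
        # Damaged file structure: try a repair tool before anything else.
        return "repair"
    if copying_restricted:
        # Text exists but is locked: the password is needed, not OCR.
        return "unlock"
    if can_select_text and paste_is_gibberish:
        # Font encoding problem: re-export from the source application.
        return "re-export"
    if not can_select_text:
        # No text layer at all: this is the case OCR is meant for.
        return "ocr"
    return "no-action"


print(diagnose_unsearchable_pdf(can_select_text=False))
# ocr
print(diagnose_unsearchable_pdf(can_select_text=True, paste_is_gibberish=True))
# re-export
```

The order of the checks is a design choice: corruption and copy restrictions are ruled out first because either one can make a perfectly good text layer look missing.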

Practical Tips for Better OCR Results Every Time

The quality of your source file is the single biggest factor in OCR accuracy. Garbage in, garbage out. The good news is, this part is entirely within your control.

First, scan at 300 DPI. I can't stress this enough. This is the resolution archivists and legal offices commonly recommend, and for good reason. At 300 DPI, characters are sharp and clear. At 150 DPI, small fonts (anything below 10pt) start to get fuzzy and ambiguous. Going up to 600 DPI gives you only marginal gains while quadrupling the pixel count, so 300 is the sweet spot for most documents.

For text-only documents, use grayscale or black-and-white mode. Color scans are bigger and can introduce compression artifacts that blur text. Unless you need to preserve color charts or photos, stick to grayscale.

And please, clean your scanner glass. That tiny smudge or dust speck will show up as a black mark on every single page of your scan, and the OCR engine will waste time trying to figure out what letter it is. Anyone who has fought a misbehaving PDF export knows small details matter.

If you're scanning a book, press the spine down flat and scan one page at a time. Trying to scan two pages at once introduces a shadow and curve near the spine that will wreck OCR accuracy in that area.

Finally, for large projects, remember that CocoConvert's paid plans support batch processing. If you have a folder of 50 scanned PDFs to process, you can ZIP them up and upload them in one go. It's a huge time-saver for anyone digitizing an old archive.
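The case for 300 DPI becomes concrete once you work out how many pixels a character actually gets, and what each step up in resolution costs. The sketch below is a rough calculation under stated assumptions: a US Letter page (8.5 × 11 inches), uncompressed pixel data, and hypothetical function names. Real scan files are compressed and will be smaller, but the ratios hold.

```python
def em_height_px(font_size_pt, dpi):
    """Pixel height of the font's em box; one point is 1/72 inch."""
    return font_size_pt / 72 * dpi


def raw_page_bytes(dpi, width_in=8.5, height_in=11.0, bytes_per_pixel=1):
    """Uncompressed size of one page; 1 byte/pixel is grayscale, 3 is color."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel


for dpi in (150, 300, 600):
    height = em_height_px(10, dpi)
    gray_mb = raw_page_bytes(dpi) / 1_000_000
    color_mb = raw_page_bytes(dpi, bytes_per_pixel=3) / 1_000_000
    print(f"{dpi} DPI: 10pt text is {height:.1f} px tall, "
          f"{gray_mb:.1f} MB grayscale, {color_mb:.1f} MB color")
```

At 150 DPI a 10pt em box is only about 21 pixels tall, and lowercase letters typically occupy roughly half of that, which leaves very little detail for telling similar shapes apart. Each doubling of DPI multiplies the raw pixel count by four, and color triples it again, which is why grayscale at 300 DPI is a sensible default.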
