
PDF Text Not Searchable? Run OCR to Fix It

2026-05-17 8 min read

Why Your PDF Refuses to Let You Search It

You hit Ctrl+F, type a word you know is on page 4, and the search bar returns zero results. The text is sitting right there on screen, and you can read it with your eyes, but the PDF treats it like a photograph. That's because it is a photograph, at least as far as the software is concerned.

This happens in two main scenarios. First, someone scanned a physical document (a signed contract, an old invoice, a medical record) and saved the output as a PDF without running any text recognition. The scanner captured an image of the page, not the characters on it. Second, some software exports PDFs by flattening everything into a raster layer, stripping out any underlying text data even when the source document had perfectly good selectable text.

The result is a PDF that looks completely normal but contains no machine-readable characters. You can't search it, you can't copy text from it, screen readers can't parse it, and if you try to convert it to Word or Excel, you'll get a blank document or a file full of empty boxes.

The fix is Optical Character Recognition (OCR). OCR software analyzes the pixel patterns in an image, identifies letter shapes, and reconstructs the underlying text. Once that process runs, your PDF gets a hidden text layer that sits beneath the visual image. The page still looks identical, but now Ctrl+F works, copy-paste works, and conversion to editable formats works.

What OCR Actually Does (and Where It Can Go Wrong)

OCR engines work by breaking an image into regions, isolating individual character shapes, and matching them against trained models. Modern engines, including the Tesseract-based pipeline CocoConvert uses, are trained on millions of document samples and handle standard fonts, mixed-case text, and common layouts with character-level accuracy rates above 98% on clean scans.

That 98% sounds excellent until you do the math. A 10-page document with 500 words per page contains roughly 30,000 characters (5,000 words at about six characters each, counting spaces). At 98% accuracy, 2% of those characters are wrong. That's 600 errors, enough to make a legal document unreliable or a financial report misleading.

Accuracy drops sharply in several conditions: low-resolution scans (below 200 DPI), pages with heavy background texture, handwritten text, unusual or decorative fonts, columns with irregular spacing, and documents in less common languages. A faded thermal receipt scanned at 96 DPI will produce garbage output regardless of which OCR engine you use.

Rotation matters too. A page scanned even 3–4 degrees off-straight can confuse character segmentation. Good OCR pipelines include a deskew step that detects and corrects rotation before recognition runs. CocoConvert applies automatic deskew, but if your scan is badly skewed (say, a phone photo taken at an angle), results will be imperfect.

Handwriting is the hardest case. Standard OCR is trained on printed text. Cursive handwriting, in particular, produces unreliable results with any general-purpose OCR tool. Specialized handwriting recognition exists, but it's a different technology entirely, and CocoConvert does not currently offer it. If your document is handwritten, OCR will give you something, but you should expect significant errors and plan to review the output manually.
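To make the accuracy arithmetic from this section concrete, here is a minimal Python sketch. The function name and the default of six characters per word (letters plus a trailing space) are assumptions chosen to match the article's rough figure of 30,000 characters; real documents will vary.

```python
def expected_ocr_errors(pages: int, words_per_page: int, accuracy: float,
                        chars_per_word: float = 6.0) -> int:
    """Estimate how many characters an OCR pass gets wrong.

    chars_per_word is an assumed average (letters plus one space),
    not a measured value.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    total_chars = pages * words_per_page * chars_per_word
    # Errors are simply the share of characters that were not recognized correctly.
    return round(total_chars * (1.0 - accuracy))


# The article's example: 10 pages, 500 words per page, 98% accuracy.
print(expected_ocr_errors(10, 500, 0.98))  # 600
```

Plugging in your own page count and a more pessimistic accuracy figure for a poor scan gives a quick sense of how much manual review to plan for.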

How to Run OCR on a Scanned PDF Using CocoConvert

The process is straightforward. Go to CocoConvert and select the PDF to Searchable PDF converter. You'll find it under the PDF Tools section of the main menu, or you can search for 'OCR' in the converter search bar at the top of the page.

Upload your file. CocoConvert accepts PDFs up to 200 MB on the free tier and up to 2 GB on paid plans. If you're working with a scanned document that's larger than your plan's limit (say, a 300-page archive scan), you'll need to split it first using the PDF Split tool before running OCR.

Once uploaded, you'll see an OCR settings panel before you confirm the conversion. The most important setting is language. The default is English, but the engine supports over 100 languages. If your document contains French, German, Spanish, or any other supported language, select it here. Choosing the wrong language won't crash the conversion, but it will increase error rates noticeably, especially for accented characters.

The output format option lets you choose between a searchable PDF (the image is preserved, a text layer is added underneath) and a text-only PDF (the visual appearance is reconstructed from recognized text). For most use cases, such as contracts, invoices, and reports, you want the searchable PDF option. The text-only option can produce cleaner results for further editing but will lose the original layout and any images embedded in the document.

Hit Convert, wait for processing (a 20-page scan typically takes 30–90 seconds depending on server load), then download your file. Open it in any PDF reader, press Ctrl+F, and test a word you know appears in the document.
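Before uploading, it can help to check whether a file fits your plan's limit and, if not, how many pieces to split it into. The sketch below uses the 200 MB and 2 GB limits quoted above; the tier labels, the function name, and the reading of MB and GB as binary units are assumptions, and it is not part of any CocoConvert API.

```python
import math

# Limits quoted in the article. Treating 1 MB as 1024**2 bytes is an assumption.
TIER_LIMITS_BYTES = {
    "free": 200 * 1024**2,  # 200 MB
    "paid": 2 * 1024**3,    # 2 GB
}


def parts_needed(file_size_bytes: int, tier: str = "free") -> int:
    """Return the minimum number of parts so each part can fit the tier limit.

    Assumes the parts come out roughly equal in size, which a page-based
    split does not guarantee.
    """
    limit = TIER_LIMITS_BYTES[tier]
    return max(1, math.ceil(file_size_bytes / limit))


size = 450 * 1024**2  # a hypothetical 450 MB archive scan
print(parts_needed(size, "free"))  # 3
print(parts_needed(size, "paid"))  # 1
```

Because scanned pages differ in size, treat the result as a lower bound and check each part after splitting.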

Checking OCR Quality Before You Rely on the Output

Don't assume the OCR output is perfect just because the conversion completed without errors. A successful conversion means the engine processed every page, but it doesn't mean every word was recognized correctly.

The fastest quality check is the copy-paste test. Open the converted PDF, select a paragraph of text, copy it, and paste it into a plain text editor. Read through it. You're looking for garbled words, missing spaces between words that ran together, numbers that became letters (the digit '0' becoming the letter 'O' is a classic OCR error), and punctuation that got swapped or dropped.

For documents where accuracy is critical, such as legal contracts, medical records, and financial statements, you should do a more systematic review. Open the original scan side by side with the converted version. Spot-check at least 10% of the pages, focusing on sections with dense text, small font sizes, or any areas where the original scan looked faded or blurry.

If you find error rates above 1–2%, consider whether the source scan quality is the problem. Rescanning at 300 DPI instead of 150 DPI can cut error rates dramatically. Most modern flatbed scanners default to 200 or 300 DPI for document scanning; check your scanner software settings under 'Scan Resolution' or 'Output Quality.' If you're working from phone photos, apps like Microsoft Lens or Adobe Scan will produce better input files than a standard camera app because they apply perspective correction and contrast enhancement before saving.

CocoConvert doesn't currently offer a confidence score or per-page quality report alongside the output file. That's a genuine limitation. Enterprise OCR platforms like ABBYY FineReader do provide character-level confidence highlighting, which is useful for compliance-sensitive workflows. For high-stakes documents, that additional verification layer may be worth the investment.
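If you retype a short passage from the original scan by hand, you can turn the copy-paste test into a number and compare it against the 1–2% threshold. The sketch below computes a character error rate with a plain Levenshtein edit distance. The function names and sample strings are illustrative, and this is one common way to measure OCR errors, not something CocoConvert reports.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, ocr_text: str) -> float:
    """Edit distance divided by the length of the hand-checked reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_text) / len(reference)


reference = "Invoice total: 1000 USD"  # typed by hand from the scan
ocr_text = "Invoice total: 1O00 USD"   # pasted from the converted PDF
print(f"{character_error_rate(reference, ocr_text):.1%}")  # 4.3%
```

A single swapped character in a 23-character line already exceeds the threshold, so run this on a paragraph or more to get a meaningful rate.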

Converting a Scanned PDF to an Editable Word Document

Making a PDF searchable is useful, but sometimes you need to actually edit the content: fix errors, update figures, reformat sections. In that case, you want to convert the scanned PDF directly to a Word document rather than stopping at a searchable PDF.

CocoConvert handles this in a single step. Use the PDF to Word converter and enable the OCR option in the settings panel, where you'll see a toggle labeled 'Enable OCR for scanned documents.' With OCR on, the engine will recognize the text and attempt to reconstruct the Word document with matching fonts, paragraph styles, and basic layout.

The reconstruction quality varies significantly with document complexity. A simple one-column text document, such as a letter, a memo, or a basic report, will convert cleanly. A multi-column magazine layout, a document with complex tables, or anything with text wrapped around images will require manual cleanup after conversion. Tables are particularly tricky: OCR can recognize the text inside cells, but reconstructing the table structure accurately depends on how clearly the cell borders appear in the scan.

Expect to spend time cleaning up the Word output for any document beyond simple text. Fonts may not match exactly if the original used uncommon typefaces. Page numbers, headers, and footers sometimes end up in the wrong place. Images embedded in the original may appear misaligned. For a 10-page scanned report with standard formatting, budget about 20–30 minutes of cleanup after conversion. For a 50-page document with tables and mixed layouts, budget significantly more. OCR-to-Word conversion is a starting point, not a finished product.

When OCR Is the Wrong Tool for the Problem

OCR solves the problem of image-based PDFs, but not every unsearchable PDF is image-based. Before running OCR, it's worth diagnosing which problem you actually have.

Some PDFs contain real text but have that text encoded with a custom or embedded font that doesn't map correctly to standard Unicode characters. When you copy text from these PDFs, you get nonsense characters: boxes, random symbols, or garbled strings. This isn't an OCR problem; it's a font encoding problem. Running OCR on top of it will technically add a searchable text layer, but you're adding a second layer of potential errors on top of an already broken document. The better fix for encoding issues is to re-export the PDF from the original source application with standard font embedding settings.

Password-protected PDFs that restrict text copying will also appear to have no selectable text in some viewers. OCR won't help here, because the text data exists and is just access-restricted. You need to remove the restriction first, which requires the document password.

Some PDFs are simply corrupted: the file structure is damaged and the document doesn't render correctly at all. CocoConvert will attempt to repair minor corruption during conversion, but severely damaged files may fail to process entirely.

Finally, if your goal isn't searchability but accessibility (making a PDF readable by screen readers for visually impaired users), OCR alone isn't sufficient. Accessible PDFs require tagged structure (headings, lists, reading order, alt text for images), which OCR doesn't produce. PDF accessibility remediation is a separate workflow that goes beyond what any automated conversion tool currently handles well.
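A rough way to tell the first two cases apart is to look at what you get when you copy a paragraph out of the PDF. The sketch below is a heuristic only: the function names, the 20% threshold, and the category labels are all assumptions. It flags text that is mostly replacement, private-use, unassigned, or control characters, which is a common symptom of broken font encoding. It will miss garbling that happens to use ordinary letters, and an empty result can also mean copying is restricted rather than that the page is an image.

```python
import unicodedata

# Unicode categories that rarely appear in real prose:
# Co = private use, Cn = unassigned, Cc = control.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def suspicious_char_ratio(text: str) -> float:
    """Share of non-whitespace characters that look like encoding debris."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(1 for c in chars
              if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES)
    return bad / len(chars)


def diagnose(copied_text: str, threshold: float = 0.2) -> str:
    """Classify text copied from a PDF; the threshold is an arbitrary starting point."""
    if not copied_text.strip():
        return "image_based"    # nothing copied: OCR is the likely fix
    if suspicious_char_ratio(copied_text) > threshold:
        return "font_encoding"  # garbled: re-export from the source instead
    return "readable"           # text layer looks fine: OCR probably not needed


print(diagnose(""))                          # image_based
print(diagnose("\ue001\ue002\ue003 \ufffd"))  # font_encoding
print(diagnose("Payment is due in 30 days."))  # readable
```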

Practical Tips for Better OCR Results Every Time

Source quality is the single biggest factor in OCR accuracy, and it's entirely within your control before the file ever reaches CocoConvert.

Scan at 300 DPI minimum. This is the standard recommendation from archivists, legal professionals, and document management systems alike. At 300 DPI, character edges are sharp enough for reliable recognition. At 150 DPI, small fonts (anything below 10pt) become ambiguous. At 600 DPI, you get marginally better results but much larger file sizes, which is usually not worth it for standard business documents.

Use grayscale or black-and-white mode for text-only documents. Color scans of text documents are larger files and sometimes introduce JPEG compression artifacts that blur character edges. For documents with color charts or photographs you need to preserve, use color. For contracts, invoices, and reports, grayscale at 300 DPI is the right setting.

Clean the scanner glass before scanning. A smudge or dust particle on the glass will appear as a dark mark on every page of a multi-page scan, and OCR engines may try to interpret those marks as characters.

If you're scanning a bound book, press the spine flat and scan each page individually rather than scanning two pages at once. The curve near the spine creates a shadow that degrades OCR accuracy significantly in that region.

Batch processing is available on CocoConvert's paid plans. If you have a folder of 50 scanned PDFs that all need OCR, you can upload them as a ZIP archive and process them in one job rather than uploading files one at a time. The output is returned as a ZIP of converted files with matching filenames. For anyone digitizing an archive of paper records, this saves considerable time.
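If you'd rather script the ZIP step than build the archive by hand, a minimal sketch with Python's standard library follows. The function name and folder layout are hypothetical, and the flat archive structure (no subfolders) is an assumption about what a batch upload expects, not documented CocoConvert behavior.

```python
import zipfile
from pathlib import Path


def zip_pdfs(folder: str, archive_path: str) -> int:
    """Pack every .pdf directly inside folder into one ZIP; return how many were added."""
    pdfs = sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for pdf in pdfs:
            # Store only the filename so the archive has no folder prefix.
            zf.write(pdf, arcname=pdf.name)
    return len(pdfs)
```

A call such as zip_pdfs("scans", "scans_batch.zip") would bundle a hypothetical scans folder. Keeping the original filenames inside the archive makes it easy to pair each converted file with its source, since the output comes back with matching filenames.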
