How Computers Detect File Types: Magic Bytes Explained

2026-05-17 9 min read

Why File Extensions Are Only Half the Story

Most people assume a computer knows a file is a JPEG because it ends in .jpg. That assumption is unreliable. File extensions are nothing more than a naming convention: a hint to the operating system, not a guarantee of content. Rename a PNG file to .txt and Windows will happily open it in Notepad and display gibberish. Rename a Word document to .pdf and Adobe Acrobat will refuse to open it.

This matters enormously for file conversion. When you upload a file to a conversion service like CocoConvert, the system cannot blindly trust the extension you provided. A malicious actor could rename an executable (.exe) to .jpg and attempt to slip it past a naive file handler. A well-meaning user might accidentally save a file with the wrong extension. Files can also lose their extension entirely when a download or transfer goes wrong.

The solution the computing world settled on decades ago is to read the file's actual binary content, specifically its opening bytes, and compare those bytes against a known database of signatures. These signatures are called magic bytes, or more formally, file signatures. They are the best available evidence of what a file actually contains, independent of whatever name it was given.

What Magic Bytes Actually Are

Most binary file formats reserve the first few bytes of the file for a unique identifier. These bytes are usually written in hexadecimal notation and are part of the format specification itself, not something an operating system adds afterward. Here are some concrete examples you can verify yourself:

- **JPEG images** always start with the bytes FF D8 FF. The first two bytes (FF D8) mark the start of the JPEG data stream; the third (FF) begins the first marker segment.
- **PNG images** start with 89 50 4E 47 0D 0A 1A 0A, which is 8 bytes. The 50 4E 47 portion is ASCII for 'PNG', which is why the sequence is known as the PNG signature.
- **PDF files** begin with 25 50 44 46, which is ASCII for '%PDF'. You can open any PDF in a plain text editor and read that prefix yourself.
- **ZIP archives** (and by extension DOCX, XLSX, PPTX, and JAR files, which are all ZIP-based) start with 50 4B 03 04. The first two bytes are ASCII 'PK', the initials of Phil Katz, the format's creator.
- **MP3 audio** files commonly begin with FF FB or FF F3, the start of an MPEG audio frame header. Files that carry an ID3v2 tag begin with 49 44 33 (ASCII 'ID3') instead.
- **EXE files** on Windows start with 4D 5A, which is ASCII 'MZ', after Mark Zbikowski, one of the format's designers.

The term 'magic bytes' comes from Unix, where the `file` utility has shipped with a signature database called `/etc/magic` (or `/usr/share/misc/magic` on modern systems) since the early 1970s. When you run `file photo.jpg` in a Linux terminal, the utility reads the start of the file, consults its database, and returns the actual file type regardless of the extension.
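The prefix comparison behind these signatures can be sketched in a few lines of Python. This is a minimal illustration, not how `file` or any particular library is implemented: the `SIGNATURES` table and the names `detect_by_prefix` and `detect_file` are invented for this example, and the table covers only the formats listed above.

```python
from typing import Optional

# Signatures from the list above. Longer signatures come first so that a
# short prefix never shadows a more specific one.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (bytes.fromhex("25504446"), "PDF"),
    (bytes.fromhex("504B0304"), "ZIP"),
    (bytes.fromhex("FFD8FF"), "JPEG"),
    (bytes.fromhex("FFFB"), "MP3"),
    (bytes.fromhex("FFF3"), "MP3"),
    (bytes.fromhex("4D5A"), "EXE"),
]


def detect_by_prefix(data: bytes) -> Optional[str]:
    """Return a format label if data starts with a known signature, else None."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return None


def detect_file(path: str) -> Optional[str]:
    """Read only the first 16 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return detect_by_prefix(handle.read(16))
```

A real detector needs far more entries and, as later sections show, more than a prefix match, but the core idea is no more than this lookup.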

How Detection Software Reads the Signature

Reading magic bytes is not complicated in principle, but implementing it robustly requires handling a surprising number of edge cases. The basic process works like this: the software opens the file as a raw binary stream, reads a small buffer (typically the first 512 bytes, though some formats require reading further), and then compares that buffer against a list of known signatures.

The comparison is not always a simple prefix match. Some formats have signatures at a fixed offset rather than at byte zero. The ISO disk image format, for example, places its signature ('CD001') at byte offset 32,769. ZIP files keep their central directory at the end of the file, and a valid archive can have other data in front of it (self-extracting archives are the usual example), which is why some ZIP detectors scan from the tail.

Modern libraries like Apache Tika (Java), python-magic (Python), or libmagic (C) handle this complexity behind the scenes. Apache Tika alone recognizes more than a thousand file types and can detect MIME types, character encodings, and even embedded metadata. CocoConvert uses server-side signature detection as the first validation step before any conversion job begins: if the declared MIME type from the browser doesn't match what the binary signature says, the file is flagged for secondary inspection.

There are also container formats that complicate detection. A DOCX file and a JAR file both start with 50 4B 03 04 because they are both ZIP archives. Distinguishing them requires reading deeper into the archive: specifically, looking for a [Content_Types].xml entry (present in DOCX and the other Office Open XML formats) versus a META-INF/MANIFEST.MF entry (JAR). This two-pass detection is standard practice in any serious file handling pipeline.
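The fixed-offset check and the second pass over a ZIP container can be sketched as follows. This is a rough illustration using only Python's standard library, not the logic of Tika, libmagic, or CocoConvert. The function names, the returned labels, and the extra `word/` test used to separate DOCX from other Office Open XML files are assumptions made for the example.

```python
import io
import zipfile

ISO_MAGIC_OFFSET = 32769  # offset of 'CD001' cited in the text above


def is_iso9660(data: bytes) -> bool:
    """Check for the ISO signature at its fixed offset rather than at byte zero."""
    return data[ISO_MAGIC_OFFSET:ISO_MAGIC_OFFSET + 5] == b"CD001"


def classify_zip(data: bytes) -> str:
    """First pass: check the ZIP signature. Second pass: inspect entry names."""
    if not data.startswith(b"PK\x03\x04"):
        return "not-zip"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        # The header looked right, but the archive structure is unreadable.
        return "broken-zip"
    if "[Content_Types].xml" in names:
        # Assumption for this sketch: Word documents keep their parts under word/.
        if any(name.startswith("word/") for name in names):
            return "docx"
        return "ooxml"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"
```

Note that `classify_zip` loads the whole archive into memory for simplicity. A production pipeline would more likely work on a file handle and cap how much it is willing to read.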

When Magic Bytes Fail (And They Do)

Magic bytes are reliable, but they are not infallible. Several real-world scenarios cause signature-based detection to produce wrong or ambiguous results.

**Truncated files** are the most common problem. If a download was interrupted at 40%, the file will still have a valid JPEG header but only part of the image data. The signature check passes, but the conversion will fail downstream when the decoder tries to parse a complete image and runs out of data.

**Intentionally crafted files** can exploit the fact that magic bytes only cover the header. A file can have a valid PNG header followed by malicious payload data. A related attack vector is the polyglot file, a single binary that is simultaneously valid in two different formats. Security researchers have demonstrated JPEG/JavaScript polyglots that browsers will execute as scripts while image viewers display them as photos. Signature detection alone does not catch these.

**Format version conflicts** create another layer of ambiguity. Microsoft Office files before 2007 (DOC, XLS, PPT) use the Compound File Binary Format, which starts with D0 CF 11 E0 A1 B1 1A E1. All three formats share this same signature. You cannot distinguish a .doc from a .xls from a .ppt using magic bytes alone; you have to parse the internal structure further.

**Plain text formats** such as CSV, JSON, XML, HTML, and Markdown have no magic bytes at all. They are just sequences of characters. Detection for these relies on heuristic analysis: looking for specific patterns like angle brackets (HTML/XML), curly braces (JSON), or comma-delimited rows (CSV). Heuristics can be wrong. A CSV file with semicolon delimiters will confuse many detectors.

CocoConvert will tell you upfront if it cannot confidently identify a file type. Rather than guessing and producing corrupt output, the service returns an error with a description of what was found, which is more useful than a silently broken conversion.
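The truncation case shows why a header check alone is not enough. One cheap extra test is to look at the end of the file as well: a complete JPEG stream normally ends with the end-of-image marker FF D9, and a PNG ends with its IEND chunk. The sketch below assumes exactly that and nothing more. The function name `looks_complete` is invented for this example, and the check can misjudge valid files that carry extra data after the end marker.

```python
from typing import Optional

JPEG_MAGIC = b"\xff\xd8\xff"
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker
PNG_MAGIC = bytes.fromhex("89504E470D0A1A0A")
# A complete IEND chunk: zero length, the type 'IEND', and its fixed CRC.
PNG_IEND = bytes.fromhex("0000000049454E44AE426082")


def looks_complete(data: bytes) -> Optional[bool]:
    """Return True or False for JPEG and PNG data, None for anything else.

    This only compares the trailing bytes. It does not decode the image,
    so a True result is a hint and not proof that the file is intact.
    """
    if data.startswith(JPEG_MAGIC):
        return data.endswith(JPEG_EOI)
    if data.startswith(PNG_MAGIC):
        return data.endswith(PNG_IEND)
    return None
```

A file that passes the signature check but returns `False` here is a strong candidate for an interrupted download.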

Practical Implications for File Conversion

Understanding magic bytes changes how you troubleshoot failed conversions. When a conversion job fails with a message like 'unsupported or unrecognized file format,' the problem is rarely the extension and usually the content. Here are the most common causes and what to do about them:

**The file is actually a different format than expected.** This happens frequently with images exported from design tools. Figma, for example, can export a file labeled .jpg that is actually a PNG internally. Open the file in a hex editor (HxD on Windows, Hex Fiend on macOS) and check the first four bytes yourself. If you see 89 50 4E 47, it is a PNG regardless of what the filename says. Rename it and resubmit.

**The file is password-protected or DRM-locked.** Encrypted Office documents still start with D0 CF 11 E0, so signature detection passes. But the content is ciphertext, not readable data. CocoConvert, and frankly any conversion service, cannot convert a password-protected file without the password. This is not a limitation of the service; it is a deliberate security property of the encryption.

**The file is a container holding unexpected content.** A .docx file that was created by renaming a .zip will have the right signature (50 4B) but will lack the required word processing XML inside. The converter will open the archive, find no document structure, and fail.

**Codec mismatches in video files.** An MKV container (starts with 1A 45 DF A3) can hold video encoded in H.264, H.265, AV1, VP9, or dozens of other codecs. Magic bytes confirm the container format, not the codec. If CocoConvert supports MKV conversion but the video stream uses an obscure codec like RealVideo 4, the container detection will succeed but the transcoding step will fail.
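The first troubleshooting step above, comparing what the filename claims against what the bytes say, is easy to automate. The following is a small sketch under stated assumptions: the `EXPECTED_PREFIXES` table and the function `extension_matches_content` are hypothetical names for this example, and the table holds only signatures quoted in this article.

```python
from pathlib import Path
from typing import Optional

# Extension -> acceptable leading bytes, using signatures quoted in this article.
EXPECTED_PREFIXES = {
    ".jpg": [b"\xff\xd8\xff"],
    ".jpeg": [b"\xff\xd8\xff"],
    ".png": [bytes.fromhex("89504E470D0A1A0A")],
    ".pdf": [b"%PDF"],
    ".zip": [b"PK\x03\x04"],
    ".docx": [b"PK\x03\x04"],
    ".mkv": [bytes.fromhex("1A45DFA3")],
}


def extension_matches_content(filename: str, header: bytes) -> Optional[bool]:
    """Compare a filename's extension with the file's leading bytes.

    Returns None when the extension is not in the table, so callers can
    tell "unknown extension" apart from "known extension, wrong content".
    """
    prefixes = EXPECTED_PREFIXES.get(Path(filename).suffix.lower())
    if prefixes is None:
        return None
    return any(header.startswith(prefix) for prefix in prefixes)
```

For the mislabeled export described above, calling `extension_matches_content("export.jpg", header)` with a header that begins 89 50 4E 47 returns `False`, which is the cue to rename the file before resubmitting. A `True` result for .docx or .mkv still says nothing about the archive contents or the codec inside.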

How to Check a File's True Type Without Specialized Software

In most cases you do not need to install anything to verify what a file actually is. Several straightforward methods work on every major operating system.

**On Windows:** Open PowerShell and run `Format-Hex -Path 'C:\path\to\yourfile.ext' | Select-Object -First 3`. This prints the first 48 bytes in hex, 16 per row. Compare the first row against the signatures listed earlier in this article.

**On macOS and Linux:** Open Terminal and run `xxd yourfile | head -3`. The output shows offset, hex values, and ASCII representation side by side. Alternatively, run `file yourfile`. The `file` utility is built into both operating systems and returns the detected type immediately.

**In a browser, without any tools:** Go to CocoConvert's detection page (cocoConvert.com/detect), upload the file, and the service will return the detected MIME type, the matched signature bytes, and the confidence level. This is useful when you are on a locked-down work machine where you cannot run terminal commands.

**In Python (for developers):** Install python-magic with `pip install python-magic` and run this one-liner: `import magic; print(magic.from_file('yourfile', mime=True))`. This returns a standard MIME type string like `image/jpeg` or `application/pdf`. If you need to run on Windows without the libmagic DLLs, consider the `filetype` library, which has no system dependencies.

Knowing the true MIME type before uploading to any conversion service saves time. If the detected type is not in the service's supported input list, you know immediately that conversion is not possible without first transforming the file in another way.
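If neither `xxd` nor `Format-Hex` is at hand but Python is, a rough equivalent of those three-row previews can be written with the standard library alone. This is an illustrative sketch, not a reimplementation of either tool: the names `hex_preview` and `preview_file` and the exact column layout are choices made for this example.

```python
def hex_preview(data: bytes, rows: int = 3, width: int = 16) -> str:
    """Format the first rows * width bytes as offset, hex, and ASCII columns."""
    pad = width * 3 - 1  # width of a full hex column, used to align short rows
    lines = []
    for offset in range(0, min(len(data), rows * width), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(pad)}  {ascii_part}")
    return "\n".join(lines)


def preview_file(path: str, rows: int = 3) -> str:
    """Read only as many bytes as the preview needs, then format them."""
    with open(path, "rb") as handle:
        return hex_preview(handle.read(rows * 16), rows=rows)
```

Running `print(preview_file("yourfile"))` shows the same 48 bytes as the commands above, so the first row can be compared directly against the signatures listed earlier.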

The Limits of What Any Conversion Service Can Promise

Magic byte detection is a solved problem for the roughly 200 file formats that cover the vast majority of everyday use. For the long tail of specialized, proprietary, or legacy formats, no service, CocoConvert included, can offer complete coverage. Formats like Autodesk DWG (CAD drawings), proprietary medical imaging formats like DICOM variants from specific scanner manufacturers, or niche audio formats from vintage hardware synthesizers often have partial or no open documentation. Even if the magic bytes are known, the internal structure may be so poorly documented that a converter produces output with missing data, wrong colors, or dropped metadata layers.

CocoConvert supports approximately 300 input formats as of this writing. That number sounds large until you consider that PRONOM, the file format registry maintained by the UK National Archives, documents over 2,000 distinct file formats. The gap represents formats that are too obscure to justify engineering effort, legally encumbered by patents, or documented so poorly that any conversion would be lossy in unpredictable ways.

The honest recommendation: if you are working with files in industries like medical imaging, geospatial data, or professional broadcast video, verify format support before committing to any conversion workflow. CocoConvert's format support page lists every supported input and output combination with notes on known limitations. Check it before uploading a 4GB broadcast master file only to find that the specific flavor of MXF your camera produces is not supported.

Magic bytes tell a computer what a file is. They do not tell it whether converting that file will produce something useful. That second question depends on format documentation, codec licensing, and engineering work that no amount of clever byte-sniffing can replace.
