
How Computers Detect File Types: Magic Bytes Explained

2026-05-17 9 min read

Why File Extensions Are Only Half the Story

Most people assume a computer knows a file is a JPEG because its name ends in .jpg. That's a reasonable assumption, but it's only half right. File extensions are just a naming convention. They're a hint to the operating system, not a guarantee of what's inside. Rename a PNG file to .txt, and Windows will happily open it in Notepad, showing you a screen of gibberish. Try to open a Word document you've renamed to .pdf, and Adobe Acrobat will flatly refuse.

This matters enormously for file conversion. When you upload a file to a service like CocoConvert, the system can't just trust the extension. Someone could rename a malicious executable (.exe) to .jpg to slip it past a naive file handler. Or, more commonly, a user might just save a file with the wrong extension by accident. We've all seen downloads that arrive with the wrong extension or with none at all.

The solution, settled on decades ago, is to ignore the filename and read the file's actual binary content. Specifically, a program reads the first few bytes and compares them against a database of known signatures. These signatures are called magic bytes, or more formally, file signatures. They are much stronger evidence of a file's contents than its name.

What Magic Bytes Actually Are

Most binary file formats reserve their first few bytes for a fixed identifier. These bytes, usually written in hexadecimal, are part of the format's specification itself; they aren't added by the operating system later. You can see them for yourself with a hex editor. Look at a few common file types:

- **JPEG images** always start with FF D8 FF. The first two bytes (FF D8) are the start-of-image marker, and the third (FF) begins the first marker segment.
- **PNG images** start with an 8-byte signature: 89 50 4E 47 0D 0A 1A 0A. Notice the 50 4E 47 part is ASCII for 'PNG'.
- **PDF files** begin with 25 50 44 46, which is ASCII for '%PDF'. You can open any PDF in a plain text editor and see this right at the top.
- **ZIP archives** start with 50 4B 03 04. The first two bytes are 'PK' in ASCII, for Phil Katz, the format's creator. Because DOCX, XLSX, PPTX, and JAR files are all ZIP-based, they share this signature.
- **MP3 audio** files often begin with FF FB or FF F3, the start of an MPEG audio frame header. Files that carry an ID3v2 tag begin with 49 44 33 ('ID3') instead.
- **EXE files** on Windows start with 4D 5A, which is 'MZ' in ASCII, for Mark Zbikowski, one of the leading early developers of MS-DOS.

The name 'magic bytes' comes from the Unix `file` utility, which has used a database called `/etc/magic` (or `/usr/share/misc/magic` on modern systems) since the 1970s. When you run `file photo.jpg` in a terminal, the utility reads the opening bytes, consults its database, and tells you the real file type, completely ignoring the .jpg extension.
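The prefix comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not how `file` or any particular library is implemented: the table holds only signatures mentioned in this article, and the names `SIGNATURES`, `identify`, and `identify_file` are made up for the example.

```python
# Signatures from this article, as (prefix bytes, label) pairs.
# The table and function names are illustrative assumptions.
SIGNATURES = [
    (bytes.fromhex("89504E470D0A1A0A"), "PNG image"),
    (bytes.fromhex("FFD8FF"), "JPEG image"),
    (bytes.fromhex("25504446"), "PDF document"),
    (bytes.fromhex("504B0304"), "ZIP archive (or ZIP-based format)"),
    (bytes.fromhex("FFFB"), "MP3 audio"),
    (bytes.fromhex("FFF3"), "MP3 audio"),
    (bytes.fromhex("4D5A"), "Windows executable"),
]


def identify(header: bytes) -> str:
    """Return a label for the first signature that prefixes `header`."""
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    return "unknown"


def identify_file(path: str) -> str:
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))
```

Calling `identify_file("photo.jpg")` on a real JPEG should return "JPEG image" whatever the file is named. A real detector needs a much larger table and the offset and container handling discussed in the next section.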

How Detection Software Reads the Signature

In principle, reading magic bytes is simple. In practice, the edge cases will get you. The basic process involves opening the file as a raw binary stream, reading a small buffer (often the first 512 bytes), and comparing that buffer against a list of known signatures. Some formats, however, require reading much further.

The comparison isn't always a simple prefix match. Some formats place their signature at a fixed offset. The ISO disk image format, for instance, has its 'CD001' signature at byte offset 32,769. ZIP archives keep their central directory at the end of the file, and a self-extracting archive can have other data in front of the ZIP content, so some detectors scan from the tail instead of the head.

Modern libraries like Apache Tika (Java), python-magic (Python), and libmagic (C) handle this complexity. Apache Tika alone knows over 1,300 file types, detecting MIME types, character encodings, and even embedded metadata. At CocoConvert, we use server-side signature detection as our first line of defense. If the MIME type your browser declares doesn't match what the binary signature says, the file gets flagged for a closer look before any conversion starts.

Container formats make things even trickier. A DOCX file and a JAR file both start with 50 4B 03 04 because they're both ZIP archives. To tell them apart, software has to look deeper inside the archive for specific files, like [Content_Types].xml and word/document.xml for DOCX or META-INF/MANIFEST.MF for a JAR. This two-pass detection is standard practice for any serious file handling pipeline.
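Both ideas, a signature at a fixed offset and a second pass inside a ZIP container, can be sketched with the Python standard library. This is an illustration under stated assumptions, not CocoConvert's pipeline: the function names are hypothetical, and the extra check for `word/document.xml` is an addition here, since `[Content_Types].xml` also appears in other ZIP-based Office formats.

```python
import io
import zipfile

ISO_SIGNATURE_OFFSET = 32769  # offset of 'CD001', as given in the text


def looks_like_iso(stream) -> bool:
    """Check for the ISO signature at its fixed offset (not at byte 0)."""
    stream.seek(ISO_SIGNATURE_OFFSET)
    return stream.read(5) == b"CD001"


def classify_zip(stream) -> str:
    """Second pass: open the archive and look for marker entries."""
    if not zipfile.is_zipfile(stream):
        return "not a ZIP archive"
    stream.seek(0)
    with zipfile.ZipFile(stream) as archive:
        names = set(archive.namelist())
    # Assumption: a Word document has both of these entries.
    if "[Content_Types].xml" in names and "word/document.xml" in names:
        return "DOCX"
    if "META-INF/MANIFEST.MF" in names:
        return "JAR"
    return "generic ZIP"
```

Both functions take a seekable binary stream, so they work on `open(path, "rb")` as well as on an in-memory `io.BytesIO` buffer holding an upload.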

When Magic Bytes Fail (And They Do)

Magic bytes are reliable, but they aren't infallible. Several real-world scenarios can trip up signature-based detection and give you wrong or ambiguous answers.

**Truncated files** are a constant headache. If a download gets interrupted, you might have a file with a valid JPEG header but little or no image data after it. The signature check passes, but the conversion fails later when the decoder expects a complete image and finds only a fragment.

**Intentionally crafted files** can exploit the fact that magic bytes only cover the header. A file can have a valid PNG header followed by a malicious payload. A well-known variant of this attack is the polyglot file: a single binary that is simultaneously valid as two different file types. Researchers have created JPEG/JavaScript polyglots that a browser will execute as a script while an image viewer displays them as a photo. Signature detection alone won't catch these.

**Format version conflicts** create another layer of ambiguity. Before 2007, Microsoft Office files (DOC, XLS, PPT) used the Compound File Binary Format, which starts with D0 CF 11 E0 A1 B1 1A E1. All three formats share the exact same signature. You can't distinguish a .doc from an .xls from a .ppt using magic bytes; you have to parse the internal structure.

**Plain text formats** like CSV, JSON, XML, HTML, and Markdown have no magic bytes at all. They are just sequences of characters. Detection here relies on heuristic analysis: looking for patterns like angle brackets (HTML/XML) or curly braces (JSON). These heuristics can be wrong. Anyone who has ever wrestled with a 'CSV' that uses semicolons instead of commas knows this pain all too well.

If CocoConvert can't confidently identify a file type, it tells you. We think returning a clear error is far more useful than guessing and producing a corrupt file.
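To make the plain-text case concrete, here is a rough sketch of heuristic detection using only the Python standard library. It is deliberately simplistic and does not describe how CocoConvert or any particular library decides: the function name, the 200- and 2048-character sample sizes, and the candidate delimiter list are all assumptions chosen for the example.

```python
import csv
import json


def guess_text_format(text: str) -> str:
    """Guess a text format from surface patterns. Can be wrong by design."""
    stripped = text.lstrip()
    if not stripped:
        return "empty"
    # Curly braces or square brackets: try to parse as JSON.
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "JSON"
        except json.JSONDecodeError:
            pass
    # Angle brackets: HTML if it says so, otherwise assume XML.
    if stripped.startswith("<"):
        lowered = stripped[:200].lower()
        if lowered.startswith("<!doctype html") or "<html" in lowered:
            return "HTML"
        return "XML (or another angle-bracket format)"
    # Let the csv module look for a consistent delimiter,
    # including the semicolons mentioned above.
    try:
        dialect = csv.Sniffer().sniff(stripped[:2048], delimiters=",;\t")
        return f"CSV (delimiter {dialect.delimiter!r})"
    except csv.Error:
        return "plain text"
```

The order of the checks matters, and prose that happens to contain a regular pattern of commas can still be misread as CSV. That is the kind of wrong guess the heuristics above are prone to, which is why a clear "could not identify" result is often the safer response.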

Practical Implications for File Conversion

So how does this help you? It completely changes how you troubleshoot failed conversions. When a service reports an 'unsupported or unrecognized file format,' the problem is almost never the filename extension. It's the content. Here are the most common culprits and what to do about them:

**The file is a different format than you think.** This happens a lot with exports from design tools. Figma, for instance, might export a file labeled .jpg that is actually a PNG. Your best bet is to open the file in a hex editor (like HxD on Windows or Hex Fiend on macOS) and check the first few bytes. If you see 89 50 4E 47, it’s a PNG, regardless of the name. Rename it and try again.

**The file is password-protected or DRM-locked.** Encrypted Office documents still start with D0 CF 11 E0, so the signature check passes. But the content inside is ciphertext. CocoConvert can't decrypt password-protected files. Don't mistake this for a limitation of the service; it's a fundamental security feature of encryption itself.

**The file is a container with the wrong contents.** A file created by renaming a generic .zip to .docx will have the right signature (50 4B), but it will fail conversion because it lacks the required Word processing XML structure inside. The converter opens the archive, finds nothing to work with, and gives up.

**Codec mismatches in video files.** An MKV container (starts with 1A 45 DF A3) can hold video encoded in H.264, H.265, AV1, VP9, or dozens of others. Magic bytes confirm the MKV container, not the video stream's codec. If CocoConvert supports MKV but your file uses an obscure codec like RealVideo 4, the initial detection will pass but the transcoding step will fail.
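The first culprit, a name that disagrees with the content, is easy to check programmatically. The sketch below compares a filename's extension with the signatures mentioned in this article. The mapping and the function name are hypothetical, and the table is nowhere near complete.

```python
from pathlib import Path

# Hypothetical mapping from signature prefix to plausible extensions.
SIGNATURE_TO_EXTENSIONS = {
    bytes.fromhex("89504E47"): {".png"},
    bytes.fromhex("FFD8FF"): {".jpg", ".jpeg"},
    bytes.fromhex("25504446"): {".pdf"},
    bytes.fromhex("504B"): {".zip", ".docx", ".xlsx", ".pptx", ".jar"},
    # Legacy Office files; encrypted Office documents also start this way,
    # so the newer extensions are accepted here as well.
    bytes.fromhex("D0CF11E0"): {".doc", ".xls", ".ppt", ".docx", ".xlsx", ".pptx"},
    bytes.fromhex("1A45DFA3"): {".mkv"},
}


def check_extension(filename: str, header: bytes) -> str:
    """Report whether the extension is plausible for the leading bytes."""
    extension = Path(filename).suffix.lower()
    for signature, extensions in SIGNATURE_TO_EXTENSIONS.items():
        if header.startswith(signature):
            if extension in extensions:
                return "ok"
            likely = sorted(extensions)[0]
            shown = extension or "(none)"
            return f"mismatch: content looks like {likely}, name says {shown}"
    return "unknown signature"
```

An "ok" result only means the container matches the name. As the other three cases show, the file can still be encrypted, structurally empty, or encoded with an unsupported codec.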

How to Check a File's True Type Without Specialized Software

You don't need to install specialized software to verify a file's real identity. Most of these methods use tools your operating system already ships with.

**On Windows:** Open PowerShell and run `Format-Hex -Path 'C:\path\to\yourfile.ext' | Select-Object -First 3`. This command prints the first three rows of output, 16 bytes per row, so you see the first 48 bytes in hexadecimal. Compare the first row of bytes to the signatures listed earlier in this article.

**On macOS and Linux:** Open the Terminal and run `xxd yourfile | head -3`. This shows the byte offset, hex values, and ASCII representation. Even better, just run `file yourfile`. The `file` utility ships with macOS and most Linux distributions and gives you a clean, human-readable answer immediately.

**In a browser, without any tools:** If you're on a locked-down machine where you can't run commands, go to CocoConvert's detection page at cocoConvert.com/detect. Upload the file, and our service will report the detected MIME type, the signature bytes it matched, and its confidence level.

**In Python (for developers):** After `pip install python-magic`, you can run `import magic; print(magic.from_file('yourfile', mime=True))`. This gives a standard MIME type like `image/jpeg`. Note that python-magic is a wrapper around libmagic, which must be installed separately. My advice for production Python code, however, is to use the `filetype` library instead. It has no system dependencies, which makes it much easier to deploy on Windows without wrestling with libmagic DLLs.

Knowing the true MIME type before you upload saves a lot of time. If a file's detected type isn't on a service's supported input list, you know right away that the conversion will fail.
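If Python is available but you can't install packages, the standard library alone can produce a dump similar to the first rows of `xxd`. This is a small sketch with made-up function names, and its layout only approximates `xxd`'s.

```python
def hex_preview(data: bytes, rows: int = 3, width: int = 16) -> str:
    """Format the first rows*width bytes as offset, hex values, and ASCII."""
    lines = []
    pad = width * 3 - 1  # width of a full row of hex pairs with spaces
    for offset in range(0, min(len(data), rows * width), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(pad)}  {ascii_part}")
    return "\n".join(lines)


def preview_file(path: str, rows: int = 3) -> str:
    """Read just enough of the file to fill the requested rows."""
    with open(path, "rb") as handle:
        return hex_preview(handle.read(rows * 16), rows)
```

Running `print(preview_file("yourfile"))` shows the first 48 bytes, which is enough to compare against every signature listed in this article.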

The Limits of What Any Conversion Service Can Promise

Magic byte detection is a solved problem for the roughly 200 file formats that cover 99% of everyday use. But for the long tail of specialized, proprietary, or legacy formats, no service, CocoConvert included, can promise complete coverage. Formats like Autodesk DWG for CAD drawings, proprietary medical imaging variants from specific scanner brands, or niche audio formats from vintage synthesizers often have poor or no open documentation. Even if the magic bytes are known, the internal structure might be a black box. A converter might produce output with missing data, incorrect colors, or dropped metadata layers.

As of this writing, CocoConvert supports about 300 input formats. While that sounds like a lot, the PRONOM registry, maintained by the UK's National Archives, documents over 2,000 distinct file formats. That gap represents formats that are too obscure to justify the engineering effort, are legally encumbered by patents, or are so poorly documented that any conversion attempt would be a gamble.

My recommendation is this: if you work in industries like medical imaging, geospatial data, or professional video, you must verify format support before committing to a conversion workflow. Check CocoConvert's format support page. It lists every supported input and output, along with notes on known limitations. It's better to check first than to upload a 4 GB broadcast master file only to find its specific MXF flavor isn't supported.

Magic bytes tell a computer what a file *is*. They don't tell it whether converting that file will produce something *useful*. That second, more important question depends on good documentation, clear licensing, and dedicated engineering work that no amount of clever byte-sniffing can replace.
