Mastering File Upload Security: Understanding File Types

Marek Elmayan · 10 min read

Laptop with a malicious file uploading on a server represented by a cloud with a lock

In today’s digital landscape, file exchange has become an integral part of our online activities, from sharing work documents to uploading pictures on social media. However, this convenience comes with inherent risks. File upload functions, seemingly innocuous, can serve as potent vectors for high-severity attacks, potentially jeopardizing your data and user security. Whether you’re a seasoned developer or just embarking on your coding journey, ensuring the security of your file upload features is paramount.

To address this concern, it’s crucial to understand the concept of file types, including file extensions and MIME types. These are fundamental to ensure the proper handling and validation of uploaded files and thus prevent some malicious files from being uploaded to your application.

How to determine the type of a file? Introducing extension & MIME type

Visual breakdown of a PDF file's anatomy emphasizing security elements. Key metadata like MIME type and magic number are highlighted, suggesting their role in content security and integrity checks.

File Extensions

File extensions are suffixes added to a file's name, typically separated from the base filename by a dot (e.g., .jpg, .pdf, .txt). They serve as a way to identify the file format, which can be important for both the operating system and the user.

File extensions are a common way for users to recognize the type of a file. For example, .jpg or .jpeg extensions are associated with image files, while .pdf extensions denote documents. However, it’s important to note that file extensions can be manipulated or spoofed by malicious users, making them an unreliable method for file type verification.

Developers should not solely rely on file extensions when validating the types of uploaded files in a web application. An attacker could easily rename a malicious file with a benign extension, exploiting this vulnerability.

MIME Types

MIME (Multipurpose Internet Mail Extensions) types, on the other hand, provide a more reliable way to identify the content type of a file. MIME types are associated with the actual content of the file rather than its name.

Illustration depicting examples of MIME types. At the top, the general structure of MIME type is described as 'type/subtype'. Below, 3 examples are listed: 'text/plain', 'image/png', and 'application/pdf'.

A MIME type is typically expressed as a two-part identifier, separated by a slash (/). The first part specifies the general category of the file, such as “text” for plain text documents or “image” for image files. The second part provides more specific information about the file’s format, such as “html” for HTML web pages or “jpeg” for JPEG image files. Together, these two parts form a complete MIME type, like “text/html” for HTML documents or “image/jpeg” for JPEG images.
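As a small illustration of the two-part structure, here is a minimal Python sketch that splits a MIME type into its type and subtype. The helper name split_mime_type is hypothetical, and the handling of parameters such as "; charset=utf-8" is an assumption added for the example.

```python
def split_mime_type(mime_type: str) -> tuple:
    """Split 'type/subtype' into its two parts, ignoring any parameters."""
    # Drop optional parameters such as "; charset=utf-8" and normalize case.
    essence = mime_type.split(";", 1)[0].strip().lower()
    main_type, separator, subtype = essence.partition("/")
    if not separator or not main_type or not subtype:
        raise ValueError(f"not a valid MIME type: {mime_type!r}")
    return main_type, subtype


print(split_mime_type("text/html"))                 # ('text', 'html')
print(split_mime_type("image/jpeg"))                # ('image', 'jpeg')
print(split_mime_type("text/html; charset=utf-8"))  # ('text', 'html')
```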

These MIME types are standardized, and web servers use them to indicate the content type of a file in HTTP headers when serving files over the web. You can check the Common MIME types 👀

When dealing with file uploads, you can extract the MIME type from the file’s metadata, allowing you to verify its content regardless of the file’s name or extension. This helps prevent attackers from disguising malicious content by simply changing the file extension.

Here is a quick command line example showing that modifying the extension of a file doesn’t change its MIME type and doesn’t prevent it from being executed.

cat hello.php
# output: <?php echo "Hello World!"; ?>

file --mime-type hello.php
# output: hello.php: text/x-php

php hello.php
# output: Hello World!

cp hello.php hello.png

file --mime-type hello.png
# output: hello.png: text/x-php

php hello.png
# output: Hello World!

🪄 Magic Numbers hidden in the File Metadata

File metadata refers to the hidden information or attributes associated with a file. Put simply, metadata is data about the data: it provides essential context about the file, such as its creation date, author, size, and information about its content. From a security point of view, it is this last item that interests us. By examining metadata, we can gain insights into the nature of uploaded files and deduce their MIME type ✨.

Magic numbers, also known as magic bytes or file signatures, are special values or sequences of bytes located at the beginning of a file, in the metadata, that help identify the file’s format or type. They serve as a way for software to quickly determine the nature of a file without having to rely solely on its file extension. Magic numbers are commonly used to ensure that the correct software or application is used to open and interpret the file.

Here are some common examples:

File type     Extension   Hex digits
PDF format    .pdf        25 50 44 46
PNG format    .png        89 50 4e 47
GIF format    .gif        47 49 46 38

You can check a more extensive list of file signatures.

The command hexdump -C is used in Unix-like operating systems to display the contents of a file in hexadecimal format along with printable ASCII characters. Here is a way to check that your PDF is legitimate by verifying its magic number.

Output of the hexdump -C command on a pdf in a terminal. With a highlight on the first hex digits 25 50 44 46 corresponding to the PDF format
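The same check can be sketched in Python. This is a minimal example, assuming only the three signatures from the table above; the names SIGNATURES and sniff_mime_type are hypothetical, and a real application would more likely use a maintained detection library with a much larger signature database.

```python
from typing import Optional

# Signatures taken from the table above (first four bytes of each format).
SIGNATURES = {
    b"%PDF": "application/pdf",  # 25 50 44 46
    b"\x89PNG": "image/png",     # 89 50 4e 47
    b"GIF8": "image/gif",        # 47 49 46 38
}


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Return the MIME type whose magic number starts the data, or None."""
    for magic, mime_type in SIGNATURES.items():
        if data.startswith(magic):
            return mime_type
    return None


print(b"%PDF".hex(" "))                                   # 25 50 44 46
print(sniff_mime_type(b"%PDF-1.7\n"))                     # application/pdf
print(sniff_mime_type(b'<?php echo "Hello World!"; ?>'))  # None
```

For a file on disk, reading only the first few bytes (for example with open(path, "rb").read(8)) is enough to feed this function.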

🛠️ Unreliable tools to help with quick type validation

What is the Content-Type header?

The Content-Type header is an HTTP header used to indicate the media type (MIME type) of the data being sent in an HTTP request or response. This header is crucial for web servers and clients to understand how to process the content they receive. For example, when a web server sends an HTML web page to a browser, it includes the Content-Type header with a value of “text/html” to inform the browser that it should interpret and display the content as an HTML document.

However, the Content-Type header should not be relied upon for robust file type validation, especially for security-critical applications.

The Content-Type for uploaded files is provided by the client in the HTTP request, thus it can be manipulated by malicious users. Hence it cannot be trusted, as it is trivial to spoof. Although it should not be relied upon for security, it can provide a quick check to prevent users from unintentionally uploading files with the incorrect type.
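To make the spoofing concrete, here is a minimal Python sketch that builds a multipart/form-data body by hand. The function name build_multipart and the boundary value are hypothetical. The point is only that the per-file Content-Type line is plain text written by whoever builds the request.

```python
def build_multipart(filename: str, declared_type: str, payload: bytes,
                    boundary: str = "demo-boundary") -> bytes:
    """Build a one-file multipart/form-data body with a client-chosen Content-Type."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {declared_type}\r\n"
        "\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail


# The client claims image/png, but the bytes are a PHP script.
body = build_multipart("hello.png", "image/png", b'<?php echo "Hello World!"; ?>')
print(body.decode())
```

A server that only reads the declared type would see image/png here, which is why the declared value is useful as a convenience check but not as a security control.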

<input type='file'/> HTML component

To let users upload files to your web application, you can use the HTML <input type="file"> element. This element creates a file picker dialog that allows users to select one or more files from their local device. The selected files can then be uploaded to the server using a form submission or an AJAX request.

<form action="/upload" method="post" enctype="multipart/form-data">
  <input type="file" name="file" />
  <input type="submit" value="Upload" />
</form>

You can even specify the types of files that users can select by using the accept attribute. This attribute accepts a comma-separated list of MIME types or file extensions. For example, to only allow users to select image files, you can use the following code:

<input type="file" name="file" accept="image/*" />

Nevertheless, it’s important to note that the accept attribute is only a hint to the browser and does not provide any security. It’s trivial for malicious users to bypass this restriction by simply modifying the file picker dialog or sending a request directly to the server. According to the documentation, only the extension is used to determine the type of the file. Therefore, it is not a reliable way to validate the type of uploaded files but it can be used to prevent users from unintentionally uploading files with the incorrect type.


🔒 Secure your app and your file upload feature

Allow only the extensions your business actually requires, and reject every other extension.

🧬 Using Both Extensions and MIME Types:

To achieve robust security, it's advisable to combine both file extensions and MIME types when validating uploaded files. By cross-referencing the file extension with the extracted MIME type, you can add an extra layer of protection to your application. This ensures that the file's content matches its declared extension.
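The cross-check can be sketched as follows in Python. This is an illustrative example under stated assumptions: the ALLOWED table and the validate_upload helper are hypothetical, and content detection is reduced to comparing the first four bytes with a known signature.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical allowlist: extension -> (expected MIME type, expected magic number).
ALLOWED = {
    ".pdf": ("application/pdf", b"%PDF"),
    ".png": ("image/png", b"\x89PNG"),
    ".gif": ("image/gif", b"GIF8"),
}


def validate_upload(filename: str, data: bytes) -> Optional[str]:
    """Return the MIME type if extension and content agree, else None."""
    extension = PurePosixPath(filename).suffix.lower()
    expected = ALLOWED.get(extension)
    if expected is None:
        return None  # extension is not on the allowlist
    mime_type, magic = expected
    # The content must start with the signature that matches the extension.
    return mime_type if data.startswith(magic) else None


print(validate_upload("report.pdf", b"%PDF-1.7\n"))                     # application/pdf
print(validate_upload("hello.png", b'<?php echo "Hello World!"; ?>'))  # None
print(validate_upload("hello.php", b"\x89PNG\r\n\x1a\n"))              # None
```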

✅ Whitelist relevant types

Based on your web application’s needs, allow only the lowest-risk file types. By limiting the list of allowed file types, you already prevent executables, scripts and other potentially malicious content from being uploaded to your application.

For instance, allow only common image types for a picture sharing feature, such as png, jpeg, svg.

⚠️ Having file type protection in place doesn’t mean that it is robust. Flaws are likely to exist in the mechanisms implemented to validate the file type. One example among many: an extension check can be bypassed with a double extension such as image.png.php, or with null byte injection such as image.php%00.png.

→ The use of a deny list is inherently flawed, as it’s difficult to explicitly block every possible file extension that could be used to execute code. It can be bypassed by using lesser-known, alternative file extensions that may still be executable, such as .php5, .shtml, .docm, …

→ Ensure that the validation occurs after decoding the file name and that a proper filter is set in place in order to avoid certain known bypasses.
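Here is a minimal Python sketch of such a filter. It decodes the file name first, then rejects null bytes, path separators and multiple extensions before consulting an allowlist. The names ALLOWED_EXTENSIONS and safe_extension are hypothetical, and the allowlist content is only an example.

```python
from typing import Optional
from urllib.parse import unquote

# Example allowlist for a picture sharing feature (an assumption for this sketch).
ALLOWED_EXTENSIONS = {"png", "jpg", "jpeg"}


def safe_extension(raw_filename: str) -> Optional[str]:
    """Return the allowed extension of a file name, or None if it is rejected."""
    name = unquote(raw_filename)  # decode first, so "%00" becomes a real null byte
    if "\x00" in name or "/" in name or "\\" in name:
        return None  # null byte injection or path separators
    base, *extensions = name.lower().split(".")
    if not base or len(extensions) != 1:
        return None  # no extension, or a double extension like image.png.php
    extension = extensions[0]
    return extension if extension in ALLOWED_EXTENSIONS else None


print(safe_extension("image.png"))         # png
print(safe_extension("image.png.php"))     # None
print(safe_extension("image.php%00.png"))  # None
print(safe_extension("shell.php5"))        # None
```

Rejecting every name that contains more than one dot is deliberately strict: it also refuses harmless names such as my.photo.png. Whether that trade-off is acceptable depends on your application.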

For more information and practical knowledge, check the following Portswigger topic.


🥷 Exploiting file type vulnerabilities with Polyglot Files 🎭 

Tampering directly with the magic number of a file isn’t a big risk, since it makes the file unreadable to programs, thus preventing malicious code execution. Nevertheless, there is a special category of files, called polyglot files, that are used to exploit vulnerabilities in file validation.

Inspired by the analogy to multilingualism, they are files that are deliberately crafted to be valid and functional in multiple file formats simultaneously.

For instance, a PDF-ZIP polyglot can be opened as a valid PDF document and also at the same time decompressed as a valid ZIP archive. Hence if you allow the upload of PDF files then you can bypass the type restriction and upload a ZIP archive.

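Here is a polyglot program that can be interpreted as either a C++ or a Python program, depending on whether it is compiled with a C++ compiler or run by the Python interpreter. The common technique is to exploit the fact that the two languages treat the same characters differently: lines starting with # are comments in Python but preprocessor directives in C++.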

#if 0
print('Hello world')
#endif
#if 0
""" "
#endif
#include <iostream>
int main() {
    std::cout << "Hello world" << std::endl;
}
#if 0
" """
#endif

For a PDF-ZIP file, it is more complicated since you need to understand the structure of the file and how to craft it to be valid in both formats by manipulating bytes. Many different methods can be used to create polyglot files and it is not limited to 2 types for a single file. I will not go into details about the process of creating polyglot files but here are some resources to get started on this fascinating topic. Here is a list of polyglot files you can manipulate and test. Otherwise, you can use Mitra to create your own polyglot files.
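To see why a check on the first bytes alone does not catch this kind of file, here is a toy Python sketch. It is not a real PDF-ZIP polyglot: the data below only carries the PDF magic number followed by a ZIP archive, which is an assumption made to keep the example short. It relies on the fact that ZIP readers locate the archive from the end of the file, so leading bytes do not prevent it from being read.

```python
import io
import zipfile

# Build a small ZIP archive in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("payload.txt", "hidden content")

# Toy file: PDF signature up front, ZIP archive appended after it.
# This is NOT a valid PDF, it only starts with the PDF magic number.
data = b"%PDF-1.7\n" + archive.getvalue()

starts_like_pdf = data.startswith(b"%PDF")
also_a_zip = zipfile.is_zipfile(io.BytesIO(data))
names = zipfile.ZipFile(io.BytesIO(data)).namelist()

print(starts_like_pdf)  # True
print(also_a_zip)       # True
print(names)            # ['payload.txt']
```

A validation that stops at the magic number would classify this data as a PDF, while a ZIP tool would still extract payload.txt from it.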

How to prevent polyglot files from being uploaded to your application?

All the above methods are complex, costly to implement, and not always necessary depending on the context of your application. However, it is important to be aware of the existence of polyglot files and the risks they represent. The next step is to configure your server safely, thus preventing the execution of malicious code.


🚀 Stay tuned for the next steps

I hope you now understand more about how file types work. File type validation is a complex process, and no single method is infallible or sufficient. Using a combination of methods to protect against file upload attacks is recommended.

If you’d like to find out more about DoS attacks and antivirus scanning for file uploads, take a look at the second article in this series: Mastering File Upload Security: DoS attacks and Antivirus.
