Shared MIME-info Database

X Desktop Group

Thomas Leonard

<tal197@users.sf.net>

Introduction

Version

This is version 0.11 of the Shared MIME-info Database specification, last updated 17 Apr 2003.

What is this spec?

Many programs and desktops use the MIME system[MIME] to represent the types of files. Frequently, it is necessary to work out the correct MIME type for a file. This is generally done by examining the file's name or contents, and looking up the correct MIME type in a database.

It is also useful to store information about each type, such as a textual description of it, or a list of applications that can be used to view or edit files of that type.

For interoperability, it is useful for different programs to use the same database so that different programs agree on the type of a file and information is not duplicated. It is also helpful for application authors to only have to install new information in one place.

This specification attempts to unify the MIME database systems currently in use by GNOME[GNOME], KDE[KDE] and ROX[ROX], and provide room for future extensibility.

The MIME database does NOT store user preferences (such as a user's preferred application for handling files of a particular type). It may be used to store static information, such as that files of a certain type may be viewed with a particular application.

Language used in this specification

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119[RFC-2119].

Overview of previous systems

KDE

KDE uses .desktop files, with Type=MimeType, one file per type to determine type from file name. The files are arranged in the filesystem to mirror the two-level MIME type hierarchy. The syntax is very similar to other .desktop files, with Name=, Comment= etc.

Example file:

[Desktop Entry]
Encoding=UTF-8
MimeType=application/x-kword
Comment=KWord
Comment[af]=kword
[... etc. other translations ]
Icon=kword
Type=MimeType
Patterns=*.kwd;*.kwt;
X-KDE-AutoEmbed=false

[Property::X-KDE-NativeExtension]
Type=QString
Value=.kwd

KDE does not have a separate system for specifying extension matches, but uses case-sensitive glob patterns for everything.

A single file stores all the rules for recognising files by content. This is almost identical to file(1)'s magic.mime database file, but without the encoding field.

The format is described in the file itself as follows:

# The format is 4-5 columns:
#    Column #1: byte number to begin checking from, ">" indicates continuation
#    Column #2: type of data to match
#    Column #3: contents of data to match
#    Column #4: MIME type of result

GNOME

GNOME uses the gnome-vfs library to determine the MIME type of a file. This library loads name-to-type rules from files with a '.mime' extension in a system-wide directory (set at install time), and merged with those in the user's directory. It loads textual descriptions for the types from files in the same directories, ending with '.keys'. The file gnome-vfs.mime in the system directory is always loaded first (allowing everything else to override it). The file user.mime in the user's directory is always loaded last, making these settings take precedence over all others.

The format of the .mime files are described as follows:

# Mime types as provided by the GNOME libraries for GNOME.
#
# Applications can provide more mime types by installing other
# .mime files in the PREFIX/share/mime-info directory.
#
# The format of this file is:
#
# mime-type
#	ext[,prio]:	list of extensions for this mime-type
#	regex[,prio]:	a regular expression that matches the filename
#
# more than one ext: and regex: fields can be present.
#
# prio is the priority for the match, the default is 1. This is required
# to distinguish composed filenames, for example .gz has a priority of 1
# and .tar.gz has a priority of 2 (thus a file having the filename
# something.tar.gz will match the mime-type for tar.gz before the mime-type
# for .gz
#
# The values in this file are kept in alphabetical order for convenience.
# Please maintain this when adding new types. Also consider adding a
# human-readable description to gnome-vfs.keys when adding a new type here.
#
# Also do please not add illegal mime types, observe the mime standard when
# adding new types.

When looking up the type for a file, gnome-vfs looks first for an exact-case match for the extension, then an all upper-case match, then an all lower-case match. If no matches are found, or there is no '.' in the name, then the regular expression matches are checked. It does this first for rules with priority 2, then for those with priority 1. The modification time on the mime-info directories is used to detect changes.

The .keys files contain type-to-description rules, eg:

application/msword
	description=Microsoft Word document
	[de]description=Microsoft Word-Dokument
	...

Guidelines for writing descriptions can be found in the mime-descriptions-guidelines.txt file.

The format for magic entries is defined as:

# The format of magic entries is:
#
#     offset_start[:offset_end] pattern_type pattern [&pattern_mask] type
#
# <offset_start> and <offset_end> are decimal numbers (file offsets).
#
# <pattern_type> is (byte | short | long | string | date | beshort |
#                    belong | bedate | leshort | lelong | ledate).
#
# <pattern> is an ASCII string with non-printable characters escaped
# as hex or octal escape sequences, and spaces and other important
# whitespace escaped with '\'.
#
# <pattern_mask> is a string of hex digits. The mask must be the same
# length as the pattern.
#
# <type> is a valid MIME type.
#
# Order magic patterns such that ambiguous ones (such as
# application/x-ms-dos-executable) are at the end of the list and
# therefore get applied last.
#
# Avoid rules that require a seek deep into the examined file. If you
# must, locate such rules at the end of the list so that they get
# applied last
#
# When designing new document formats, make them easily recognizable
# by defining a sufficiently unique magic pattern near the document
# start. A good pattern is at least four bytes long and contains one
# or two non-printable characters so that text files won't be
# misidentified.

ROX

Note that ROX is now using this specification. This section details the previous implementation.

ROX searches MIME-info directories in CHOICESPATH (~/Choices/MIME-info:/usr/local/share/Choices/MIME-info:/usr/share/Choices/MIME-info by default). Files from earlier directories override those in later ones, but the order within a directory is not specified.

The files are in the same format as GNOME, except:

There are no .keys files, so files of all extensions are loaded.
The priority is ignored.
A case-sensitive match is tried first, then a lower-case match. No upper-case match is tried.

Multiple extensions are allowed. Eg:

application/x-compressed-postscript
	ext: ps.gz eps.gz

When looking up the type for a file, ROX starts with the first '.' and tries a case-sensitive match of the remaining text against the extensions. The it tries again with the filename in lower-case. It then tries again from the second '.', and so on. If no type is found, it tries the regular expressions.

ROX has no rules for determining a file's type from its contents.

ACAP Media Type Dataset Class

The ACAP Media Type Dataset Class[ACAP] draft proposes a network-based solution to the problem. The sytem is intended mainly to allow email clients to find a viewer application for an attachment. The draft gives the following example of a record for the `image/jpeg' type:

attribute                            value
---------                            -----
entry                                JPEG image
mediatype.common.type                image/jpeg
mediatype.common.extension           jpg
mediatype.common.extensionOther      (jpeg jpe jfif jfi)
mediatype.common.description         JPEG is an image format most
				     suitable for compressing photographs
mediatype.common.suppressWarning     1
mediatype.macOS.type.bin             JPEG
mediatype.macOS.creator.bin          JVWR
mediatype.macOS.creator.name         JPEG View
mediatype.macOS.action.edit.bin      8BIM
mediatype.macOS.action.edit.name     Adobe Photoshop
mediatype.unix.action.view           xv

Finding a type involves sending a query to a remote database, which is intended to allow companies to configure MIME information on a company-wide basis. The system is designed to allow a single database to be shared between different platforms.

It is not designed for handing a large number of lookups quickly, making the system too slow for use in file managers and similar. Both its name and contents matching are more primitive than other systems; only extensions can be matched, case is ignored, and the contents matching can only match an exact string at the start of the file. It provides similar information to this specification, except that it defines many items of platform-specific information in the core specification, instead of using namespacing to allow vendor extensions.

Unified system

In discussions about these systems, it was clear that the differences between the databases were simply a result of them being separate, and not due to any fundamental disagreements between developers. Everyone is keen to see them merged.

This specification proposes:

A standard way for applications to install new MIME related information.
A standard way of getting the MIME type for a file.
A standard way of getting information about a MIME type.
Standard locations for all the files, and methods of resolving conflicts.

Further, the existing databases have been merged into a single package [SharedMIME].

Directory layout

There are two important requirements for the way the MIME database is stored:

Applications must be able to extend the database in any way when they are installed, to add both new rules for determining type, and new information about specific types.
It must be possible to install applications in /usr, /usr/local and the user's home directory (in the normal Unix way) and have the MIME information used.

This specification uses the XDG Base Directory Specification[BaseDir] to define the prefixes below which the database is stored. In the rest of this document, paths shown with the prefix <MIME> indicate the files should be loaded from the mime subdirectory of every directory in XDG_DATA_HOME:XDG_DATA_DIRS.

For example, when using the default paths, “Load all the <MIME>/text/html.xml files” means to load /usr/share/mime/text/html.xml, /usr/local/share/mime/text/html.xml, and ~/.local/share/mime/text/html.xml (if they exist).

Each application that wishes to contribute to the MIME database will install a single XML file, named after the application, into one of the three <MIME>/packages/ directories (depending on where the user requested the application be installed). After installing, uninstalling or modifying this file, the application MUST run the update-mime-database command, which is provided by the freedesktop.org shared database[SharedMIME].

update-mime-database is passed the mime directory containing the packages subdirectory which was modified as its only argument. It scans all the XML files in the packages subdirectory, combines the information in them, and creates a number of output files.

Where the information from these files is conflicting, information from directories lower in the list takes precedence. Any file named Override.xml takes precedence over all other files in the same packages directory. This can be used by tools which let the user edit the database to ensure that the user's changes take effect.

The files created by update-mime-database are:

<MIME>/globs (contains a mapping from names to MIME types)
<MIME>/magic (contains a mapping from file contents to MIME types)
<MIME>/XMLnamespaces (contains a mapping from XML (namespaceURI, localName) pairs to MIME types)
<MIME>/MEDIA/SUBTYPE.xml (one file for each MIME type, giving details about the type)

The format of these generated files and the source files in packages are explained in the following sections. This step serves several purposes. First, it allows applications to quickly get the data they need without parsing all the source XML files (the base package alone is over 700K). Second, it allows the database to be used for other purposes (such as creating the /etc/mime.types file if desired). Third, it allows validation to be performed on the input data, and removes the need for other applications to carefully check the input for errors themselves.

The source XML files

Each application provides only a single XML source file, which is installed in the packages directory as described above. This file is an XML file whose document element is named mime-info and whose namespace URI is http://www.freedesktop.org/standards/shared-mime-info. All elements described in this specification MUST have this namespace too.

The document element may contain zero or more mime-type child nodes, in any order, each describing a single MIME type. Each element has a type attribute giving the MIME type that it describes.

Each mime-type node may contain any combination of the following elements, and in any order:

glob elements have a pattern attribute. Any file whose name matches this pattern will be given this MIME type (subject to conflicting rules in other files, of course).
KDE's glob system replaces GNOME's and ROX's ext/regex fields, since it is trivial to detect a pattern in the form '*.ext' and store it in an extension hash table internally. The full power of regular expressions was not being used by either desktop, and glob patterns are more suitable for filename matching anyway.

magic elements contain a list of match elements, any of which may match, and an optional priority attribute for all of the contained rules. Low numbers should be used for more generic types (such as 'gzip compressed data') and higher values for specific subtypes (such as a word processor format that happens to use gzip to compress the file). The default priority value is 50, and the maximum is 100.

Each match element has a number of attributes:

Attribute	Required?	Value
type	Yes	`string`, `host16`, `host32`, `big16`, `big32`, `little16`, `little32` or `byte`.
offset	Yes	The byte offset(s) in the file to check. This may be a single number or a range in the form `start:end', indicating that all offsets in the range should be checked. The range is inclusive.
value	Yes	The value to compare the file contents with, in the format indicated by the type attribute.
mask	No	The number to AND the value in the file with before comparing it to `value'. Masks for numerical types can be any number, while masks for strings must be in base 16, and start with 0x.

Each element corresponds to one line of file(1)'s magic.mime file. They can be nested in the same way to provide the equivalent of continuation lines.

comment elements give a human-readable textual description of the MIME type. There may be many of these elements with different xml:lang attributes to provide the text in multiple languages.
root-XML elements have namespaceURI and localName attributes. If a file is identified as being an XML file, these rules allow a more specific MIME type to be chosen based on the namespace and localname of the document element.
If localName is present but empty then the document element may have any name, but the namespace must still match.

Applications may also define their own elements, provided they are namespaced to prevent collisions. Unknown elements are copied directly to the output XML files like comment elements. A typical use for this would be to indicate the default handler application for a particular desktop ("Galeon is the GNOME default text/html browser"). Note that this doesn't indicate the user's preferred application, only the (fixed) default.

Here is an example source file, named diff.xml:

<?xml version="1.0"?>
<mime-info xmlns='http://www.freedesktop.org/standards/shared-mime-info'>
  <mime-type type="text/x-diff">
    <comment>Differences between files</comment>
    <comment xml:lang="af">verskille tussen lêers</comment>
    ...
    <magic priority="50">
      <match type="string" offset="0" value="diff\t"/>
      <match type="string" offset="0" value="***\t"/>
      <match type="string" offset="0" value="Common subdirectories: "/>
    </magic>
    <glob pattern="*.diff"/>
    <glob pattern="*.patch"/>
  </mime-type>
</mime-info>

In practice, common types such as text/x-diff are provided by the freedesktop.org shared database. Also, only new information needs to be provided, since this information will be merged with other information about the same type.

The MEDIA/SUBTYPE.xml files

These files have a mime-type element as the root node. The format is as described above. They are created by merging all the mime-type elements from the source files and creating one output file per MIME type. Each file may contain information from multiple source files. The magic, glob and root-XML elements will have been removed.

The example source file given above would (on its own) create an output file called <MIME>/text/x-diff.xml containing the following:

<?xml version="1.0" encoding="utf-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="text/x-diff">
<!--Created automatically by update-mime-database. DO NOT EDIT!-->
  <comment>Differences between files</comment>
  <comment lang="af">verskille tussen lêers</comment>
  ...
</mime-type>

The glob files

This is a simple list of lines containing a MIME type and pattern, separated by a colon. For example:

# This file was automatically generated by the
# update-mime-database command. DO NOT EDIT!
...
text/x-diff:*.diff
text/x-diff:*.patch
...

Applications MUST first try a case-sensitive match, then a case-insensitive one. This is so that main.C will be seen as a C++ file, but IMAGE.GIF will still use the *.gif pattern.

If several patterns match then the longest pattern SHOULD be used. In particular, files with multiple extensions (such as Data.tar.gz) MUST match the longest sequence of extensions (eg '*.tar.gz' in preference to '*.gz'). Literal patterns (eg, 'Makefile') must be matched before all others. It is suggested that patterns beginning with `*.' and containing no other special characters (`*?[') should be placed in a hash table for efficient lookup, since this covers the majority of the patterns. Thus, patterns of this form should be matched before other wildcarded patterns.

There may be several rules mapping to the same type. They should all be merged. If the same pattern is defined twice, then they MUST be ordered by the directory the rule came from, as described above.

Lines beginning with `#' are comments and should be ignored. Everything from the `:' character to the newline is part of the pattern; spaces should not be stripped. The file is in the UTF-8 encoding. The format of the glob pattern is as for fnmatch(3). The format does not allow a pattern to contain a literal newline character, but this is not expected to be a problem.

Common types (such as MS Word Documents) will be provided in the X Desktop Group's package, which MUST be required by all applications using this specification. Since each application will then only be providing information about its own types, conflicts should be rare.

The magic files

The magic data is stored in a binary format for ease of parsing. The old magic database had complex escaping rules; these are now handled by update-mime-database.

The file starts with the magic string "MIME-Magic\0\n". There is no version number in the file. Incompatible changes will be handled by creating both the current `magic' file and a newer `magic2' in the new format. Where possible, compatible changes only will be made. All numbers are big-endian, so need to be byte-swapped on little-endian machines.

The rest of the file is made up of a sequence of small sections. Each section is introduced by giving the priority and type in brackets, followed by a newline character. Higher priority entries come first. Example:

[50:text/x-diff]\n

Each line in the section takes the form:

[ indent ] ">" start-offset "=" value
[ "&" mask ] [ "~" word-size ] [ "+" range-length ] "\n"

Part	Example	Meaning
indent	1	The nesting depth of the rule, corresponding to the number of '>' characters in the traditional file format.
">" start-offset	>4	The offset into the file to look for a match.
"=" value	=\0x0\0x2\0x55\0x40	Two bytes giving the (big-endian) length of the value, followed by the value itself.
"&" mask	&\0xff\0xf0	The mask, which (if present) is exactly the same length as the value.
"~" word-size	~2	On little-endian machines, the size of each group to byte-swap.
"+" range-length	+8	The length of the region in the file to check.

Note that the value, value length and mask are all binary, whereas everything else is textual. Each of the elements begins with a single character to identify it, except for the indent level.

The word size is used for byte-swapping. Little-endian systems should reverse the order of groups of bytes in the value and mask if this is greater than one. This only affects `host' matches (`big32' entries still have a word size of 1, for example, because no swapping is necessary, whereas `host32' has a word size of 4).

The indent, range-length, word-size and mask components are optional. If missing, indent defaults to 0, range-length to 1, the word-size to 1, and the mask to all 'one' bits.

Indent corresponds to the nesting depth of the rule. Top-level rules have an indent of zero. The parent of an entry is the preceding entry with an indent one less than the entry.

If an unknown character is found where a newline is expected then the whole line should be ignored (there will be no binary data after the new character, so the next line starts after the next "\n" character). This is for future extensions.

The text/x-diff above example would (on its own) create this magic file:

00000000  4d 49 4d 45 2d 4d 61 67  69 63 00 0a 5b 35 30 3a  |MIME-Magic..[50:|
00000010  74 65 78 74 2f 78 2d 64  69 66 66 5d 0a 3e 30 3d  |text/x-diff].>0=|
00000020  00 05 64 69 66 66 09 0a  3e 30 3d 00 04 2a 2a 2a  |..diff..>0=..***|
00000030  09 0a 3e 30 3d 00 17 43  6f 6d 6d 6f 6e 20 73 75  |..>0=..Common su|
00000040  62 64 69 72 65 63 74 6f  72 69 65 73 3a 20 0a     |bdirectories: .|

The XMLnamespaces files

Each XMLnamespaces file is a list of lines in the form:

namespaceURI " " localName " " MIME-Type "\n"

For example:

http://www.w3.org/1999/xhtml html application/xhtml+xml

The lines are sorted (using strcmp) and there are no lines with the same namespaceURI and localName in one file. If the localName was empty then there will be two spaces following the namespaceURI.

Storing the MIME type using Extended Attributes

An implementation MAY also get a file's MIME type from the user.mime_type extended attribute. The type given here should normally be used in preference to any guessed type, since the user is able to set it explicitly. Applications MAY choose to set the type when saving files. Since many applications and filesystems do not support extended attributes, implementations MUST NOT rely on this method being available.

Recommended checking order

Because different applications have different requirements, they may choose to use the various methods provided by this specification in any order. However, the RECOMMENDED order to perform the checks is:

If a MIME type is provided explicitly (eg, by a ContentType HTTP header, a MIME email attachment, an extended attribute or some other means) then that should be used instead of guessing.
If no explicit type is present, the glob rules should be applied to the name to get the type.
If no glob rules match, the magic rules should be tried next.
If nothing matches, the default type of application/octet-stream should be used for binary data, or text/plain for textual data. Checking the start of the file for ASCII control characters is a good way to guess whether a file is binary or text, but note that files with high-bit-set characters should still be treated as text since these can appear in UTF-8 text, unlike control characters.

There are several reasons for checking the glob patterns before the magic. Some applications don't check the magic at all, and this makes it more likely that both will get the same type. Users can easily understand why calling their text file README.mp3 makes the system think it's an MP3, whereas they have trouble understanding why their computer thinks README.txt is a PostScript file. If the system guesses wrongly, the user can often rename the file to fix the problem.

Security implications

The system described in this document is intended to allow different programs to see the same file as having the same type. This is to help interoperability. The type determined in this way is only a guess, and an application MUST NOT trust a file based simply on its MIME type. For example, a downloader should not pass a file directly to a launcher application without confirmation simply because the type looks `harmless' (eg, text/plain).

Do not rely on two applications getting the same type for the same file, even if they both use this system. The spec allows some leeway in implementation, and in any case the programs may be following different versions of the spec.

User modification

The MIME database is NOT intended to store user preferences. Users should never edit the database. If they wish to make corrections or provide MIME entries for software that doesn't provide these itself, they should do so by means of the Override.xml mentioned in the section called “Directory layout”. Information such as "text/html files need to be opened with Mozilla" should NOT go in the database.

Contributors

Thomas Leonard <tal197@users.sf.net>

David Faure <david@mandrakesoft.com>

Alex Larsson <alexl@redhat.com>

Seth Nickell <snickell@stanford.edu>

Keith Packard <keithp@keithp.com>

Filip Van Raemdonck <mechanix@debian.org>

Christos Zoulas <christos@zoulas.com>

References

[GNOME] The GNOME desktop, http://www.gnome.org

[KDE] The KDE desktop, http://www.kde.org

[ROX] The ROX desktop, http://rox.sourceforge.net

[DesktopEntries] Desktop Entry Specification, http://www.freedesktop.org/standards/desktop-entry-spec.html

[SharedMIME] Shared MIME-info Database, http://www.freedesktop.org/standards/shared-mime-info.html

[RFC-2119] Key words for use in RFCs to Indicate Requirement Levels, http://www.ietf.org/rfc/rfc2119.txt?number=2119

[BaseDir] XDG Base Directory Specification http://www.freedesktop.org/standards/basedir/draft/basedir-spec/basedir-spec.html

[ACAP] ACAP Media Type Dataset Class ftp://ftp.ietf.org/internet-drafts/draft-ietf-acap-mediatype-01.txt