How MathML is stored in files and the clipboard

This document describes how MathML is embedded in all the file formats in which MathType can save equations.

MathType can save its equations in a variety of file formats and object types. In each file, MathType stores its own equation data structures (MTEF) (see: How MTEF is Stored in Files and Objects) and also stores the MathML translation of the equation.

To generate the Presentation MathML that it embeds, MathType uses its standard MathML 2.0 (namespace attr) translator, which is defined by the translator file "MathML2 (namespace attr).tdl" included in the MathType installation.

Most of the file types that MathType can output are defined by standards that are not under Wiris's control and, therefore, there is no opportunity to make MathML an official part of the file format. Luckily, the designers of these file formats have seen the need to store application-specific data and have provided mechanisms to allow this. The rest of this document describes how MathType makes use of such mechanisms for storing MathML.

This feature applies only to MathType version 6.0 and later.

Whenever MathML is embedded in a file it is encoded as UTF-8 text, with white space (tabs, CR, LF, blanks, etc.) removed to save space. The MathML also contains a comment to specify the translator used (see examples below).

PICT was the native graphics format for Mac QuickDraw. QuickDraw was a core part of the classic Mac OS and was superseded by the Quartz graphics system. QuickDraw was officially deprecated in Mac OS X 10.4 (Tiger), and PICT was dropped as the native graphics format in favor of PDF. MathType for Mac doesn't generate pure PICT, but rather PICT with embedded PostScript. Some PICT capability is maintained for use in Office 2011.

MathType embeds MathML into PICT as a picture comment that begins with the following header:

// picture comment header
typedef struct {
    long appl_sig;    // 'MMLP'
    short local_kind; // 1 for len and checksum present, 0 if not
    short len;        // length of data in bytes
    short checksum;
    // followed by the MathML data
} PComHeader;

The Windows Metafile format is Microsoft Windows' native graphics metafile (picture) format and is used in WMF files, on the clipboard, and in OLE objects.

MathML is embedded in WMF data using the MFCOMMENT escape function in the same way that MTEF is; the format of the MathML and MTEF data embedded in WMF is described in the document:

MathML data is stored in an EPS file as a PostScript comment immediately following the MTEF data (see: How MTEF is Stored in EPS). The MTEF data in turn immediately follows the header required by the EPS format, and both precede the PostScript code generated by MathType to draw the equation. The first line of the comment identifies it, and the following lines contain the MathML text. For example:

%MathType!MathML!1!1!+-
%<?xmlversion="1.0"?><!--MathType@Translator@5@5@Ma
%thML2(Clipboard).tdl@MathML2.0(Clipboard)@--><math
%display='block'xmlns='http://www.w3.org/1998/Math/
%MathML'><mrow><msqrt><mi>a</mi></msqrt></mrow></ma
%th><!--MathType@End@5@5@-->!

When reading this data, the characters after %MathType!MathML on the first line can be ignored. Note the absence of white space (tabs, CR, LF, blanks, etc., removed to save space) and the comment specifying the translator used.
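The reading procedure described above can be sketched in C. This is only a minimal illustration, not MathType's own reader: the function name extract_eps_mathml is hypothetical, and the sketch assumes that every MathML line starts with a single '%' and that the data ends with the "-->" of the closing comment followed by '!', as in the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the MathML embedded in EPS text into `out`.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * identifying comment is missing or `out` is too small.
 * Assumption: the data ends where the joined text ends with "-->!". */
long extract_eps_mathml(const char *eps, char *out, size_t out_size)
{
    static const char marker[] = "%MathType!MathML";
    const char *p = eps;
    size_t n = 0;

    if (out_size == 0)
        return -1;

    /* Find the identifying comment at the start of a line. */
    while ((p = strstr(p, marker)) != NULL) {
        if (p == eps || p[-1] == '\n' || p[-1] == '\r')
            break;
        p += sizeof marker - 1;
    }
    if (p == NULL)
        return -1;

    /* The rest of the identifying line can be ignored. */
    p += strcspn(p, "\r\n");
    p += strspn(p, "\r\n");

    /* Join the following comment lines, dropping the leading '%'. */
    while (*p == '%') {
        size_t len;
        p++;
        len = strcspn(p, "\r\n");
        if (n + len >= out_size)
            return -1;
        memcpy(out + n, p, len);
        n += len;
        p += len;
        p += strspn(p, "\r\n");
        /* Stop at the assumed terminator and drop the trailing '!'. */
        if (n >= 4 && memcmp(out + n - 4, "-->!", 4) == 0) {
            n--;
            break;
        }
    }
    out[n] = '\0';
    return (long)n;
}
```

Because the lines are wrapped at a fixed width, tags and entity names can be split across lines, so the pieces have to be joined before the result is handed to an XML parser.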

MathML text is embedded into a GIF file as an Application Extension Record, which consists of a 14-byte header (Application Extension Descriptor) followed by the MathML data. The header contains:

Byte Introducer = 0x21;
Byte ExtensionLabel = 0xFF;
Byte BlockSize = 0x0B;
Byte ApplicationId[8] = "MathType";
Byte AuthenticationCode[3] = "003";

The data follows this header and is written as a series of blocks, each containing 255 bytes or fewer. Each block starts with a single-byte count followed by that many bytes of data. The end is marked by a block with length 0.

The header is distinctive enough that the easiest way to extract the data may be to scan the file for the 14-byte header and then read the MathML data blocks that follow it. Properly decoding the GIF records is not much harder, but requires reading the GIF specification.
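The scanning approach can be sketched in C as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function name extract_gif_mathml is hypothetical, the whole GIF file is assumed to be in memory already, and no attempt is made to rule out a chance match of the header bytes inside image data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scans a GIF held in memory for the 14-byte
 * MathType header and joins the data blocks that follow it into `out`.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * header is not found, the file is truncated, or `out` is too small. */
long extract_gif_mathml(const unsigned char *gif, size_t gif_len,
                        char *out, size_t out_size)
{
    static const unsigned char header[14] = {
        0x21, 0xFF, 0x0B,                        /* introducer, label, block size */
        'M', 'a', 't', 'h', 'T', 'y', 'p', 'e',  /* ApplicationId */
        '0', '0', '3'                            /* AuthenticationCode */
    };
    size_t i, pos, n = 0;

    if (gif_len < sizeof header || out_size == 0)
        return -1;

    /* Scan for the header. */
    for (i = 0; i + sizeof header <= gif_len; i++)
        if (memcmp(gif + i, header, sizeof header) == 0)
            break;
    if (i + sizeof header > gif_len)
        return -1;

    /* Read blocks: one count byte, then that many data bytes. */
    pos = i + sizeof header;
    for (;;) {
        size_t count;
        if (pos >= gif_len)
            return -1;                 /* truncated before the end block */
        count = gif[pos++];
        if (count == 0)
            break;                     /* zero-length block marks the end */
        if (count > gif_len - pos || count >= out_size - n)
            return -1;
        memcpy(out + n, gif + pos, count);
        n += count;
        pos += count;
    }
    out[n] = '\0';
    return (long)n;
}
```

The count bytes are not part of the MathML, which is why the blocks are copied one at a time instead of taking a single slice of the file.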

MathType registers a clipboard format named "MathML Presentation" and uses this type for MathML data transferred via the Windows clipboard or drag-and-drop mechanisms.

MathType 6 always places "MathML Presentation" on the clipboard when copying or cutting an equation. It is always enumerated second, after "MathType MTEF" (MathType's native format).

For reference, here's the MathML MathType puts on the clipboard for $$A=\pi^2$$, in C++ string constant form:

"<?xml version=\"1.0\"?>"
"<!-- MathType@Translator@5@5@MathML2 (Clipboard).tdl@MathML 2.0 (Clipboard)@ -->"
"<math display='block' xmlns='http://www.w3.org/1998/Math/MathML'>"
"<semantics>"
"<mrow>"
"<mi>A</mi><mo>=</mo><msup>"
"<mi>&#x03C0;</mi>"
"<mn>2</mn>"
"</msup>"
"</mrow>"
"<annotation encoding='MathType-MTEF'>MathType@MTEF@5@5@+=
feaagKart1ev2aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn
hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr
4rNCHbGeaGqiFu0JLipgYlb91rFfpec8Eeeu0xXdbba9frFj0=OqFf
ea0dXdd9vqai=hGuQ8kuc9pgc9q8qqaq=dir=f0=yqaiVgFr0xfr=x
fr=xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaamyqaiabg2
da9iabec8aWnaaCaaaleqabaGaaGOmaaaaaaa@3C26@
</annotation>"
"</semantics>"
"</math>"
"<!-- MathType@End@5@5@ -->" 

Note that the MathML is UTF-8 encoded and NUL-terminated.