File formats and structures


Before we begin, note that the familiar .doc file extension has been used for many different Microsoft Word file formats over the years. There are in fact only four major file formats, but each has several variations to cater for features introduced in the more than 20 major versions of Microsoft Word to date. The underlying structure of the saved file determines how numbering behaves in the product, and what goes wrong with it.

The four major file formats are:

Within these formats, there are two basic structures: the original text + formatting structure, and the newer OLE structured storage.

Text + formatting formats

Word for DOS and the first versions of Word for Windows and Mac Word produce files that contain the text of the document in plain ANSI, followed by binary data that specifies the formatting. In the early 1980s this was a great advance on traditional word processor file formats, which embedded the formatting commands in the text itself.

In these formats, the numbering exists in the text as real text and remains stable once created. Unfortunately, people rarely use these versions of Word these days.
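As a rough illustration of the text + formatting idea (not the actual byte layout of any Word file), the sketch below keeps the plain text in one place and the formatting in a separate list of runs. The names `FormatRun` and `apply_runs` are hypothetical, and the runs are assumed not to overlap. Note that the "1." is ordinary text in the string, which is why the number stays put once it has been created.

```python
from dataclasses import dataclass


@dataclass
class FormatRun:
    start: int  # index of the first character the run covers
    end: int    # index one past the last character the run covers
    style: str  # illustrative label such as "bold"


def apply_runs(text, runs):
    """Split text into (fragment, style) pairs; gaps between runs are 'plain'.

    Assumes the runs do not overlap.
    """
    pieces = []
    pos = 0
    for run in sorted(runs, key=lambda r: r.start):
        if run.start > pos:
            pieces.append((text[pos:run.start], "plain"))
        pieces.append((text[run.start:run.end], run.style))
        pos = run.end
    if pos < len(text):
        pieces.append((text[pos:], "plain"))
    return pieces


# The text comes first; the formatting is described separately.
text = "1. Introduction to numbering"
runs = [FormatRun(3, 15, "bold")]
pieces = apply_runs(text, runs)
```

Here the formatting can change without touching the text, and the number is just three more characters at the start of the string.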

OLE Structured Storage

To make full use of linked and embedded objects from other applications (such as pictures, charts, or spreadsheets), Word versions 6 and later use OLE structured storage files. Each file is a nested series of containers, like Chinese boxes. For example, each paragraph is a container of characters. Each Section Break is a container of paragraphs. Each document is a container of Section Breaks.
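Because the .doc extension alone does not say which structure a file uses, it can help to look at the file itself. A minimal sketch, assuming only the standard eight-byte signature that begins an OLE compound file; the function name `looks_like_ole` is hypothetical, and a match says nothing about which version of Word wrote the file.

```python
# Signature found at the start of OLE structured storage (compound) files.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(path):
    """Return True if the file starts with the OLE compound file signature."""
    with open(path, "rb") as f:
        return f.read(len(OLE_SIGNATURE)) == OLE_SIGNATURE
```

For example, `looks_like_ole("report.doc")` (a hypothetical file name) would be expected to return False for a file saved in one of the older text + formatting formats.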

Each container contains something (it may be a stream of characters, a set of Word drawing objects, or a spreadsheet, picture or other binary object). The container also contains one or more pointers that connect it with property tables that determine the formatting and behaviour of the contained “something”.

In Word versions 6 and higher, numbering is expressed as a property, a little container that contains a marker to say where in the text it is to print, and a pointer to an entry in a list of numbering formats that determines what kind of number it is, and how it increments. It is important to understand that this property does not contain the actual number. Word works this out on the fly just before it displays or prints the document.
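The idea can be sketched as follows. This is a greatly simplified model, not Word's real data structures: `NumberFormat`, `Paragraph`, and `render` are hypothetical names, and the "pointer" is just a key into a dictionary of formats. The point it shows is that no number is stored anywhere; each one is computed when the document is rendered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumberFormat:
    template: str      # how the number is displayed, e.g. "{n}."
    start_at: int = 1  # first value in the sequence


@dataclass
class Paragraph:
    text: str
    list_id: Optional[int] = None  # key into the format table; None = unnumbered


def render(paragraphs, formats):
    """Work out the numbers on the fly, as the document is displayed."""
    counters = {}
    lines = []
    for p in paragraphs:
        if p.list_id is None:
            lines.append(p.text)
            continue
        fmt = formats[p.list_id]
        n = counters.get(p.list_id, fmt.start_at)
        counters[p.list_id] = n + 1
        lines.append(fmt.template.format(n=n) + " " + p.text)
    return lines


formats = {0: NumberFormat("{n}.")}
doc = [Paragraph("Scope", 0), Paragraph("A note"), Paragraph("Definitions", 0)]
before = render(doc, formats)

# Insert a numbered paragraph at the top: every later number shifts,
# although none of the existing paragraphs were edited.
doc.insert(0, Paragraph("Preface", 0))
after = render(doc, formats)
```

In this model "Scope" is shown as item 1 in `before` and as item 2 in `after`, even though the paragraph itself never changed. That is the behaviour that distinguishes property-based numbering from numbers stored as real text.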
