How to save a Word 2000 and higher document as an HTML file without getting unnecessary XML/CSS and other padding
Article contributed by Dave Rado and Bob Buckland
Download the HTML 2 filter (the Office 2000 HTML Filter).
The filter has three modules: a macro; a GUI module which you'll find under Start + Programs;
and a command line one which you can pass parameters to. A readme file is supplied with the filter.
The modules are:
Filter.exe: Runs from a command prompt (but not a DOS box) and can be used in batch files. It does not need to be installed.
MSFilter.exe: Launches a GUI dialog (a bit more user-friendly than Filter.exe) with the same options as Filter.exe. As far as I can tell, it cannot be passed command line parameters. It uses a second file, MSFilter.DLL, and can also be run without being installed.
MSFilter.dot: Works as an add-in for Word 2000 and also uses MSFilter.DLL. It makes some use of the settings under Tools + Options + General + Web Options.
When you save a Word 2000 document as a web page, it exports a huge amount of XML, which enables
Word to read the document back subsequently without any loss of information. It also exports
CSS details not only for the styles in use but for all styles available to the
document! All of which can turn what should be a 4k HTML file into a 70k file, or even larger!
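To give a feel for what that padding looks like, here is a minimal Python sketch that strips a few kinds of Word-only markup from a saved page. It is only an illustration: the pattern list is my own assumption about typical Word 2000 output, it is far from exhaustive, and it is not how the HTML filter itself works. The name `strip_word_markup` is hypothetical.

```python
import re

# Patterns for markup that Word adds mainly for round-tripping.
# Illustrative only; a real saved page contains much more than this.
WORD_ONLY_PATTERNS = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    re.compile(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE),
    # Embedded <xml> ... </xml> islands
    re.compile(r"<xml>.*?</xml>", re.DOTALL | re.IGNORECASE),
    # Office-namespace tags such as <o:p> and </o:p>
    # (tags only; any text between them is kept)
    re.compile(r"</?o:[^>]*>", re.IGNORECASE),
]


def strip_word_markup(html: str) -> str:
    """Return html with the Word-only constructs listed above removed."""
    for pattern in WORD_ONLY_PATTERNS:
        html = pattern.sub("", html)
    return html


# A tiny hand-made sample in the style of a Word 2000 web page.
sample = (
    "<html><head><!--[if gte mso 9]><xml><o:DocumentProperties>"
    "<o:Author>Someone</o:Author></o:DocumentProperties></xml><![endif]-->"
    "</head><body><p class=MsoNormal>Hello<o:p></o:p></p></body></html>"
)
cleaned = strip_word_markup(sample)
print(len(sample), "->", len(cleaned))
```

Even on this tiny sample most of the bytes are padding rather than content. Regular expressions are a blunt tool for HTML, so treat this as a way to see the problem rather than as a replacement for the filter.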
If you need “round-tripability”, the standard File + Save As Web Page works well (provided you view the page in IE; it won't look right in Netscape). But if you're planning to put the resulting pages onto a proper web site, and page download times matter to your users, you'll want to use the HTML 2 filter.
Another tip: If you have access to Macromedia's Dreamweaver, it does a brilliant job of stripping
out unnecessary code. I once got a 100k HTML file saved from Word down to 5k using
Dreamweaver. Both files looked identical in my web browser!
HTML Tidy has also been
spoken well of in the newsgroups (although it doesn't appear to have a
batch-processing option, unfortunately).
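If you want to run Tidy over a whole folder anyway, a short wrapper script can supply the loop. The Python sketch below assumes a `tidy` executable is on your PATH and uses Tidy's `-q` (quiet), `-m` (modify the file in place) and `--word-2000 yes` options; the function names are hypothetical, and I'd check the options against the documentation for your Tidy version before relying on it. Because `-m` overwrites files, run it on copies.

```python
import subprocess
from pathlib import Path


def build_tidy_command(html_file: Path) -> list[str]:
    """Build the Tidy command line for one file."""
    # -q: quiet, -m: modify in place,
    # --word-2000 yes: ask Tidy to strip Word 2000's surplus markup.
    return ["tidy", "-q", "-m", "--word-2000", "yes", str(html_file)]


def tidy_folder(folder: Path, run=subprocess.run) -> list[Path]:
    """Run Tidy on every .htm/.html file in folder.

    Returns the files for which Tidy reported errors.
    The run parameter exists so the loop can be tried without Tidy installed.
    """
    failed = []
    for html_file in sorted(folder.iterdir()):
        if html_file.suffix.lower() not in {".htm", ".html"}:
            continue
        result = run(build_tidy_command(html_file))
        # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
        if result.returncode >= 2:
            failed.append(html_file)
    return failed
```

Calling `tidy_folder(Path("exported"))`, where `exported` is a placeholder folder name, cleans each saved page in turn and returns the ones Tidy could not process.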
Another trick I often use is to select Insert + File from within FrontPage (I got this idea from fellow MVP Cindy Meister). It's no use if you want to preserve all of Word's formatting, but it is a lot better than pasting from Word to Notepad to FrontPage: the HTML that comes in is clean, and I find it less time-consuming than other methods for the sort of web pages I create. The styles information is replaced by manual formatting, but you can Select All and press Ctrl+Spacebar in FrontPage, and then apply your CSS styles.
Running the HTML filter in Word 2002
Word 2002 lets you save as filtered HTML as standard (File + Save As Web Page,
Filtered). You can even set the default file save format to be Web Document,
Filtered (although if you then use Word 2002's File + Preview in Browser
command, it ignores that and uses Word's regular Web Page format). The functionality of the MSFilter.dot/.DLL is built in, and
it doesn't use any of the three separate files, but it does pay more attention
to the features in Word 2002 Tools + Options + General + Web Options.
Unfortunately, the Word 2002 feature does not let you do batch processing from the command line, but you can download the Office 2000 HTML Filter and use that with Word 2002 files. All three modules work with Word 2002, but the installer doesn't know how to recognize Word 2002/Office XP, so if there are no Office 2000 applications installed, the files need to be unpacked and placed manually.
Another feature of the Office 2000 filter that Word 2002 did not retain is the copy HTML macro built into MSFilter.dot, which lets you view or grab the HTML source of any Word 2000 or 2002 file without going to View Source.