How to save a Word 2000 and higher document as an HTML file without getting unnecessary XML/CSS and other padding
Article contributed by Dave Rado and Bob Buckland
Word 2000
Download the HTML 2 filter (the Office 2000 HTML Filter).
The filter has three modules: a macro; a GUI module which you'll find under Start + Programs;
and a command line one which you can pass parameters to. A readme file is supplied with the filter.
The modules are:
Filter.exe: Runs from a command prompt (but not a DOS box) and can be used in batch files. It does not need to be “installed”.
MSFilter.exe: Launches a GUI dialog (a bit more user friendly than Filter.exe) with the same options as Filter.exe. It cannot be passed command line parameters as far as I can tell. It uses a second file, MSFilter.DLL. It can also be run without being installed.
MSFilter.dot: Works as an add-in for Word 2000 and also uses MSFilter.DLL. It uses (somewhat) the settings in Tools + Options + General + Web Options to produce “Compact HTML”.
When you save a Word 2000 document as a web page, it exports a huge amount of XML, which enables Word to read the document back subsequently without any loss of information. It also exports CSS details – not only for the styles in use – but for all styles available to the document! All of which can turn what should be a 4k HTML file into a 70k file, or even larger!
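To make the kind of padding described above concrete, here is a minimal Python sketch that strips the most common Word-specific markup from a saved page. It is not how the HTML filter works internally; the function name `strip_word_padding` and the regular expressions are illustrative assumptions, and regex-based cleaning is crude (it could, for example, also remove text in the page body that merely looks like a `class=` attribute).

```python
import re

def strip_word_padding(html: str) -> str:
    """Crude, illustrative removal of Word-specific markup (not the real filter)."""
    flags = re.S | re.I
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->,
    # which usually wrap the XML that Word uses for round-tripping.
    html = re.sub(r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", "", html, flags=flags)
    # Any remaining XML islands.
    html = re.sub(r"<xml>.*?</xml>", "", html, flags=flags)
    # The embedded style sheet listing every style available to the document.
    html = re.sub(r"<style[^>]*>.*?</style>", "", html, flags=flags)
    # Office namespace tags such as <o:p> and </o:p> (their contents are kept).
    html = re.sub(r"</?[ovw]:[^>]*>", "", html, flags=re.I)
    # Inline style= and class= attributes, quoted or unquoted.
    html = re.sub(r"\s+(?:style|class)=(?:\"[^\"]*\"|'[^']*'|[^\s>]+)", "", html, flags=re.I)
    return html

if __name__ == "__main__":
    sample = (
        "<html><head>"
        "<!--[if gte mso 9]><xml><o:Author>X</o:Author></xml><![endif]-->"
        "<style>p.MsoNormal {margin:0cm;}</style>"
        "</head><body>"
        "<p class=MsoNormal style='margin-left:36.0pt'>Hello<o:p></o:p></p>"
        "</body></html>"
    )
    print(len(sample), "->", len(strip_word_padding(sample)))
```

Because this throws away all style information, the result will not round-trip back into Word and will lose its formatting; it only shows where the extra kilobytes come from.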
If you need “round-tripability” the standard File + Save As Web Page works well (provided you view the page in IE – it won't look right in Netscape). But if you're planning to put the resulting pages onto a proper web site, and page download times are of any importance to your users, you'll want to use the HTML 2 filter.
Another tip: If you have access to Macromedia's Dreamweaver, it does a brilliant job of stripping out unnecessary code. I once got a 100k HTML file saved from Word down to 5k using Dreamweaver. Both files looked identical in my web browser!
HTML Tidy has also been spoken well of in the newsgroups (although it doesn't appear to have a batch-processing option, unfortunately).
Another trick I often use is to select Insert + File from within FrontPage (I got this idea from fellow MVP Cindy Meister). It's no use if you want to preserve all of Word's formatting, but it is a lot better than pasting from Word to Notepad to FrontPage: the HTML that comes in is clean, and I find it less time-consuming than other methods for the sort of web pages I create. The style information is replaced by manual formatting, but you can Select All and press Ctrl+Spacebar in FrontPage, and then apply your CSS styles.
Running the HTML filter in Word 2002
Word 2002 lets you save as filtered HTML as standard (File + Save As Web Page, Filtered). You can even set the default file save format to be Web Document, Filtered (although if you then use Word 2002's File + Preview in Browser command, it ignores that and uses Word's “regular” Web Page format). The functionality of the MSFilter.dot/.DLL is built in, and it doesn't use any of the three separate files, but it does pay more attention to the features in Word 2002 Tools + Options + General + Web Options.
Unfortunately, the Word 2002 feature does not let you do batch processing from the command line, but you can download the Office 2000 HTML Filter and use that with Word 2002 files. All three modules work with Word 2002, but the installer doesn't know how to recognize Word 2002/Office XP so the files need to be unpacked and placed manually if there are no Office 2000 applications installed.
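The general shape of a batch run over a folder of saved pages can be sketched in Python as follows. This sketch does not call Filter.exe (its parameters are described in the readme supplied with the filter and are not reproduced here). Instead, `batch_clean` is a hypothetical helper that applies whatever cleaning function you pass in to every .htm/.html file. The default `cp1252` encoding is an assumption about how the pages were saved.

```python
from pathlib import Path
from typing import Callable, List

def batch_clean(src_dir: str, dst_dir: str,
                clean: Callable[[str], str],
                encoding: str = "cp1252") -> List[str]:
    """Apply `clean` to each .htm/.html file in src_dir, writing results to dst_dir.

    Hypothetical helper for illustration; `encoding` is an assumed default.
    Returns the names of the files that were written.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        # Skip subfolders and anything that is not a saved web page.
        if not path.is_file() or path.suffix.lower() not in {".htm", ".html"}:
            continue
        text = path.read_text(encoding=encoding, errors="replace")
        (dst / path.name).write_text(clean(text), encoding=encoding, errors="replace")
        written.append(path.name)
    return written
```

The originals are left untouched and the cleaned copies go to a separate folder, so a bad cleaning function cannot destroy the source pages.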
Another feature of the Office 2000 filter that Word 2002 did not retain is the one you can use to “view” or “grab” the HTML source from all Word 2000 and 2002 files without going to View Source. That feature is the “copy HTML” macro built into MSFilter.dot.