Word Doc to HTML Preparation
From TechWiki
MS Word docs often need to be converted, both for tagging and for use in an OSF portal. This document presents one workflow for doing so.
MS Word Docs
Though the process outlined herein will work for any MS Word document, the accuracy of the conversion depends on the complexity of the source document. Source documents with tables, nested lists, columns, fancy fonts or boxes, references, tables of contents or footnotes will be harder to convert accurately without further manipulation.
It is strongly advised that source MS Word docs use standard heading levels and simple layout design for the cleanest conversion.
Conversion to HTML
Word docs are often wanted as HTML docs or pages within a portal. To convert from native MS Word to clean HTML, the online tool Word2CleanHTML is recommended.
It is a free resource. Documents are entered into the text conversion box one at a time. While specific circumstances may warrant other settings, these are the suggested standard conversion settings that should be checked:
- Remove empty paragraphs
- Convert <b> to <strong>, <i> to <em>
- Replace non-ascii with HTML entities
- Replace smart quotes with ascii equivalents
- Indent with tabs, not spaces
- Replace non-breaking spaces with ordinary spaces
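To make the effect of these settings concrete, the following is a minimal Python sketch of comparable cleanup steps. It is not how Word2CleanHTML works internally; the function name `clean_word_html` and the regular expressions are assumptions made for illustration, and the "indent with tabs" setting is not covered.

```python
import re

# Smart (curly) quotes mapped to their ASCII equivalents.
SMART_QUOTES = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
}


def clean_word_html(html: str) -> str:
    """Hypothetical helper: apply cleanup steps resembling the settings above."""
    # Replace non-breaking spaces with ordinary spaces.
    html = html.replace("\u00a0", " ").replace("&nbsp;", " ")
    # Replace smart quotes with ASCII equivalents.
    for smart, plain in SMART_QUOTES.items():
        html = html.replace(smart, plain)
    # Convert <b> to <strong> and <i> to <em>; any attributes are dropped.
    # The patterns do not match <br>, <body>, <img> and similar tags.
    html = re.sub(r"<(/?)b(\s[^>]*)?>", r"<\1strong>", html, flags=re.IGNORECASE)
    html = re.sub(r"<(/?)i(\s[^>]*)?>", r"<\1em>", html, flags=re.IGNORECASE)
    # Remove paragraphs that contain only whitespace.
    html = re.sub(r"<p(\s[^>]*)?>\s*</p>", "", html, flags=re.IGNORECASE)
    # Replace remaining non-ASCII characters with numeric HTML entities.
    return html.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "<p><b>Caf\u00e9</b> \u201cmenu\u201d</p><p>&nbsp;</p>"
    print(clean_word_html(sample))
```

Regular expressions are adequate for simple markup like this, but a real HTML parser would be the safer choice for complex or malformed input.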
Batch Conversion
Another product ($99) that allows batch (entire directory) conversions is:
Although it offers more conversion options and formats, it is harder to use, and its standard settings are not nearly as good as the "out-of-the-box" settings of Word2CleanHTML. Nonetheless, if converting documents one at a time proves onerous, this option is likely the preferred alternative.
Further Cleaning
The HTML generated from the above is suitable for direct use within the Drupal WYSIWYG editor or in a third-party product like KompoZer. Either option should be suitable for cleaning up any stray HTML that remains.
The result of this step should be directly usable with a Drupal OSF portal.
Prep for scones Processing
Actual document tagging is aided by starting with straight text (no HTML) files. The easiest way to produce such straight text is to copy and paste the cleaned HTML views of the converted pages above into a standard text or programmer's editor, such as the recommended Notepad++. Once copied and pasted, the text document can be saved as a straight *.txt file. If placed into a shared directory, the entire directory can then be processed in batch by scones.
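Where copying and pasting by hand becomes tedious, the same result can be approximated with a script. The sketch below is one possible approach using only the Python standard library; it assumes the cleaned pages have been saved as UTF-8 `*.html` files, and the names `html_to_text` and `convert_directory` and the directory layout are hypothetical. It is not part of scones.

```python
from html.parser import HTMLParser
from pathlib import Path

# Tags treated as line breaks when flattening HTML to text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # nesting depth inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Return the straight text of an HTML fragment, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line) + "\n"


def convert_directory(src_dir, out_dir):
    """Write a .txt file into out_dir for every .html file in src_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob("*.html")):
        target = out / (src.stem + ".txt")
        text = html_to_text(src.read_text(encoding="utf-8"))
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_directory("cleaned_html", "scones_input")` (both directory names are placeholders) would fill the output directory with straight *.txt files, which can then be placed in the shared directory for batch processing.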
See the scones instructions for further details.
If batch processing is desirable, there are many commercial and open source alternatives.