Introduction to Docvert

The Drupal Docvert module is a plugin that assists with converting proprietary Microsoft Word documents into workable HTML, managed by Drupal as nodes, pages and books.

Important: This Drupal module does not perform the conversion by itself. It must integrate with a Docvert service hosted on a nearby machine. Setting up that stand-alone service can be a substantial job and requires full admin control of the machine; if you cannot supply or find such a machine, the Drupal module alone will not be much help to you. Instructions for setting it up are, however, available here and from the Docvert home site.

The Process

  1. An editor creates content within Drupal, and attaches an MS Word document as a filefield attachment.
  2. On demand, that attached file is POSTed to the Docvert web service.
  3. On the service end:
    1. An instance of LibreOffice (née OpenOffice) is launched to handle the file.
    2. API methods within LibreOffice are called to parse the Word document and export it as structured HTML, including the embedded images and a subset of the formatting.
    3. Additional optional 'pipeline' methods are called using XSLT to process the HTML into tidied, templated results.
    4. The entire result is packaged into a ZIP file and returned to the caller.
    See the Docvert FAQ for more about how and why this works.
  4. The Drupal Docvert module unpacks those results and inserts the text into the Drupal node, or into a set of book pages.
  5. Returned embedded images are saved locally, and re-linked into the resulting pages.
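
As a rough illustration of steps 4 and 5 above, the following Python sketch unpacks a returned ZIP, saves the images locally, and re-links them in the HTML. It is not the module's own code (the module runs inside Drupal). The archive layout assumed here (one .html file plus image files) and the names unpack_conversion_result, image_dir and image_url_prefix are hypothetical; the real Docvert archive may be organised differently.

```python
import io
import os
import re
import zipfile

# Assumption: images in the archive use one of these extensions.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def unpack_conversion_result(zip_bytes, image_dir, image_url_prefix):
    """Return the archive's HTML with image links pointing at saved copies."""
    html = None
    saved = {}  # archive member name -> public URL of the saved copy
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            lower = name.lower()
            base = os.path.basename(name)
            if lower.endswith(".html") and html is None:
                # Assumption: the first .html member is the converted document.
                html = archive.read(name).decode("utf-8")
            elif lower.endswith(IMAGE_EXTENSIONS) and base:
                # Keep only the base name so a path in the archive
                # cannot write outside image_dir.
                with open(os.path.join(image_dir, base), "wb") as out:
                    out.write(archive.read(name))
                saved[name] = image_url_prefix.rstrip("/") + "/" + base
    if html is None:
        raise ValueError("no HTML file found in the archive")

    def relink(match):
        # Swap the src value only when it names an image we saved.
        src = match.group(2)
        return match.group(1) + saved.get(src, src) + match.group(3)

    return re.sub(r'(<img\b[^>]*\bsrc=")([^"]*)(")', relink, html)
```

The regular expression only handles double-quoted src attributes; a production version would more likely use an HTML parser and would also need to deal with file name collisions between documents.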