Facility to import an existing, static HTML site structure into Drupal Nodes.
This is done by allowing an admin to define a source directory (siteroot) of a traditional HTML website, and importing (as much as possible) the content and structure into a Drupal site.
Files will be absorbed completely, and their existing cross-links should be maintained, whilst the standard headers, chrome and navigation blocks should be stripped and replaced with Drupal equivalents. Old structure will be inferred and imported from the old folder hierarchy.
See the setup section for details. Because of the number of settings, this is not just a point-and-go module.
This module uses no database tables of its own. It requires XML support on the server, which can be tricky to set up if it's not already enabled.
Given a working system, the process is thus:
By following these instructions, you should be able to end up with a version of the old content in the new layout. For large sites (200+ pages) some extra tuning may be necessary, eg using different templates for different sources.
Incremental imports, processing just sections at a time, or repeated imports as you tune the content or the transformation should be non-destructive. Re-importing the same file will retain the same node ID path, and any Drupal-specific additions made so far.
This is intended as a run-once sort of tool, that, once tuned right on a handful of pages, can churn through a large number of reasonably structured, reasonably formatted pages doing a lot of the boring copy & paste that would otherwise be required.
The existing file paths of the source content will be used to create an automatic menu, and therefore a hierarchical structure identical to the source URLs. With path.module, appropriate aliases will also be created, so that a Drupal instance can TRANSPARENTLY REPLACE an existing static site without breaking any bookmarks!
A peek under the hood into what happens in what order
The more valid and more homogeneous the source site is, the better. A
creation using strict XHTML and useful, semantic IDs like #title
and #content could be imported swiftly. One with a
variety of table structures may not...
Of course, this tool is supposed to be useful when dealing with messy,
non-homogeneous legacy sites that need a makeover. Sometimes
regular expression parsing may come to the rescue for content
extraction, but that's not implemented yet.
I'm choosing XSL because I know it, it's powerful for converting content out of (well-structured) HTML, and I've had success with this approach in the past. Others may object to this abstract technology (XSL does NOT have an easy learning curve) but the alternative options include RegExp weirdness or cut and paste (which I may patch on as alternative methods, or someone else can have a go). I've also used both of those approaches successfully in bulk site templating (over THOUSANDS of pages) but it's my call. Making your own XSL import template is non-trivial.
In the interests of good housekeeping, imported files with spaces in the
filenames will be renamed to use underscores.
Although spaces can be worked around, they just cause trouble in
website URLs. Thus, references to the spaced, or %20 versions of the files
may break. This rewrite can be disabled in the settings.
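The renaming rule can be sketched in a few lines. This is only an illustration in Python, not the module's own code: the function name normalise_filename and the rewrite_spaces flag (standing in for the setting mentioned above) are made up for the example.

```python
def normalise_filename(rel_path: str, rewrite_spaces: bool = True) -> str:
    """Return rel_path with literal and %20-encoded spaces replaced by underscores.

    rewrite_spaces is a hypothetical stand-in for the module setting that
    disables the rewrite. Case is left untouched, as filenames stay
    case-sensitive.
    """
    if not rewrite_spaces:
        return rel_path
    return rel_path.replace("%20", "_").replace(" ", "_")
```

Any link that still points at the spaced or %20 form would need the same treatment, or it will break as described above.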
Filenames are assumed to be, and will remain, case-sensitive.
The module can use either the PHP4 or the PHP5 implementation (which are quite different) but the PHP modules do have to be enabled somehow. This can be tricky as they often require extra libraries to be put in your path somewhere. Please don't ask me for instructions, every time I've done it it hurts my head.
If you can see the words XSL or XSLT in your phpinfo() output, you should be fine. The module will test and warn you anyway.
PHP 4.3 has at least one known bug.
The module also uses the famous HTMLTidy tool. There is now a PHP module that implements HTMLTidy natively, but that needs to be installed and enabled. If you don't have access to that, we can run it from the command line. Find the appropriate binary release of HTMLTidy for your system, and place it in your PATH, in the module's install directory, or wherever you like, then define the path to the executable in the settings. This works fine under Windows too.
If this sounds complicated, and you have limited access to a Unix host and need to use it, there is an auto-installer that can attempt to set up tidy even on a box you don't have login access to.
As mentioned above, the preferred method is to enable the official,
binary release tidy extension (not the PECL extension if you can help it).
On some distros (Windows, Redhat) this is just a matter of uncommenting
    extension=tidy.so
in your php.ini.
In Ubuntu (as of 2007) the tidy extension has been left out of the default Debian PHP package :( although it may be found in certain repositories.
Official instructions are to recompile php5 from source --with-tidy, but that's a bit scary if you are used to using a package manager.
Instead, this post gives instructions on how to compile just the extension, then add it to PHP.
I also had to apt-get php5-dev to get "phpize" on a brand new clean system, and had to use
    ./configure --with-tidy=tidy-20051018/
instead of just ./configure
An import template defines the mapping between existing HTML content
and our node values. It uses the XSL language because of the power it
has to select bits of a structured document. For example, select="//*[@id='content']"
will find the element with the id 'content', of any type, anywhere in
the page, and select="//table[@class='main']//td[position()=3]"
will locate the third TD of a row in the table with the class 'main'. Both these
examples would be common when trying to extract the actual text
from a legacy site.
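To try selections like these outside of an XSL engine, here is a small sketch in Python. It assumes the page has already been tidied into well-formed XHTML, and it uses the standard library's ElementTree, which only supports a subset of XPath (paths start with a dot, and td[3] is used for the position test). The sample page is invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample page, already well-formed (as it would be after tidy).
PAGE = """<html><body>
<table class="main"><tr><td>nav</td><td>ads</td><td>Main text</td></tr></table>
<div id="content"><p>Hello</p></div>
</body></html>"""

root = ET.fromstring(PAGE)

# Any element, anywhere in the page, with the id 'content'.
content = root.find(".//*[@id='content']")

# The third td of a row, inside the table with class 'main'.
third_cell = root.find(".//table[@class='main']//td[3]")

print(content.tag)      # the element type that carried the id
print(third_cell.text)  # the text of the third cell
```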
You can begin with the example XSL template. It contains code that attempts to translate a page containing the usual HTML structures (either title or h1 for the title, and either the div called 'content' or the entire body tag for the content) into a standard, minimal, vanilla, semantically-tagged HTML doc.
It's likely that whatever site you are importing will NOT be shaped exactly as needed to translate straight across using this template. You have to identify the parts of your existing pages that can reliably be scanned for to define content, then come up with an XPath expression for each of them.
If your source, for example, didn't use nice H1 tags to denote the page title, but instead always looked like
<font size='+2'><B>my page</B></font>
... your template could be made to find it, wherever it was in the
page, using select="//font[@size='+2']/B"
and proceed to use that as the node title.
No, the code is not pretty, and if Regular Expressions are a foreign
language to you, this is worse.
But this is why developers have been ranting for the last ten
years about using semantic markup!!
The uniformity, and the usefulness of the metadata detected in the
source files will play a big part here.
It's easier to develop and test the XSLT using a third-party tool; I recommend Cooktop. Be sure to set the XSL engine to 'Sablotron', which is the one that PHP uses under the hood.
Although it would be possible to configure a logical mapping system to select different import templates based on different content, at this stage the administrator is expected to be doing a bit of hand-tweaking, and predicting all possible inputs is impossible. Some of this sort of logic can however be built into the powerful XSL template, if you are good at XSL.
Once importing is taking place, you can even filter it more to improve the structure of the input, for example by removing all redundant FONT tags, or by ensuring that every H1,2,3 tag has an associated #ID for anchoring. Yay XSL.
If you have taxonomy enabled and the source is tagged, these terms can be imported. Links with a rel='tag' attribute will be taken to refer to keywords, tags or terms in your available taxonomy. Either of the following syntaxes should be detected as the term 'Interesting':
    <a href='whatever/term' rel='tag' >interesting</a>
    <link href='whatever/term' rel='tag' title='interesting' />
First, if a term of that name already exists in any valid vocabulary, the imported page will be tagged with it. If not, the first available freetagging vocabulary will be used to insert the tag. If no freetagging is enabled, only pre-existing terms will be used.
Although very rarely used, it's possible to specify the target vocab with syntax such as:
    <a href='whatever/term' rel='tag' >subject:interesting</a>
    <a href='places/77' rel='tag' >location:aotearoa</a>
which will place the page as 'Interesting' in the 'Subject' vocabulary and 'Aotearoa' in the 'Location' vocabulary. Raw imports probably won't have this level of namespace, but you can use it to translate contextual information in the page into import clues using the XSL template. Eg, this can be translated quite easily in XSL:
    <div class='subjects'>
      <h3>More:</h3>
      <a href='whatever/term/13' rel='tag' >Interesting</a>
      <a href='whatever/term/funny' rel='tag' >Funny</a>
    </div>
    <div class='location'>
      <h3>Places:</h3>
      <a href='places/aotearoa' rel='tag' >Aotearoa</a>
    </div>
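The tag detection described in this section could be sketched roughly as below. This is a Python illustration only, not the module's code: the function names are made up, the input is assumed to be well-formed XHTML, and capitalising the term (so 'interesting' becomes 'Interesting') is a guess at the normalisation based on the examples.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

def parse_tag_label(label: str) -> Tuple[Optional[str], str]:
    """Split an optional 'vocab:term' label into (vocabulary, term).

    Returns (None, term) when no vocabulary prefix is present.
    Capitalising both parts is an assumption made for this sketch.
    """
    label = label.strip()
    vocab, sep, term = label.partition(":")
    if sep and vocab.strip() and term.strip():
        return vocab.strip().capitalize(), term.strip().capitalize()
    return None, label.capitalize()

def find_tags(xhtml: str) -> List[Tuple[Optional[str], str]]:
    """Collect terms from every element carrying rel='tag'."""
    found = []
    for el in ET.fromstring(xhtml).iter():
        if el.get("rel") != "tag":
            continue
        # <link> carries the term in its title, <a> in its text.
        label = el.get("title") if el.tag == "link" else el.text
        found.append(parse_tag_label(label or ""))
    return found
```

Choosing which vocabulary actually receives an unprefixed term (existing term first, then the first freetagging vocabulary) is left out here, as it depends on the Drupal site.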
The base functionality supports placing found content into the $node->body
field, not naturally into any arbitrary CCK fields, but this is also
possible.
If you have a CCK node with (eg) fields:
    field_text, field_byline, field_image
and your input pages are nice and semantically tagged, eg
    <body>
      <h1 id='title'>the title</h1>
      <div id='image'><img src='this.gif'/></div>
      <h3 id='byline'>By me</h3>
      <div id='text'>the content html etc</div>
    </body>
a mapping from HTML ids to CCK fields will be done automatically, and the content should just fall into place.
    $node->title = "the title";
    $node->field_image = "<img src='this.gif'/>";
    $node->field_byline = "By me";
    $node->field_text = "the content html etc";
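The id-to-field mapping can be pictured with a short sketch. This is Python rather than the module's PHP, the node is just a dictionary, and KNOWN_CCK_FIELDS, inner_xml and map_ids_to_fields are names invented for the example. It assumes well-formed XHTML input.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of CCK field names (without the field_ prefix)
# that the target content type knows about.
KNOWN_CCK_FIELDS = {"text", "byline", "image"}

def inner_xml(el: ET.Element) -> str:
    """Return the text and child markup inside el, without el's own tag."""
    parts = [el.text or ""]
    parts.extend(ET.tostring(child, encoding="unicode") for child in el)
    return "".join(parts).strip()

def map_ids_to_fields(xhtml: str) -> dict:
    """Copy every element that has an id into a node-like dictionary."""
    node = {}
    for el in ET.fromstring(xhtml).iter():
        el_id = el.get("id")
        if not el_id:
            continue
        # Known CCK fields get the field_ prefix, everything else is as-is.
        key = "field_" + el_id if el_id in KNOWN_CCK_FIELDS else el_id
        node[key] = inner_xml(el)
    return node
```

Run against the sample body above, this gives the keys title, field_image, field_byline and field_text.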
It's common that imported source may contain related content you want to capture.
First, edit your target content type to include a multiline, multivalue text field called 'field_sidebar'.
Any source data with the class or id 'sidebar' will now arrive into that textarea. The XSL may need to be adjusted, eg like so:
    <xsl:template name="sidebar" match="//*[@id='leftcol']">
      <div id="sidebar">
        <xsl:apply-templates />
      </div>
    </xsl:template>
The above snippet will rename any 'leftcol' content into your own 'sidebar' field.
This field can then be rendered as you wish within Drupal, eg by using cck_block.module to put it BACK into a column within your theme :-)
This method can be extended a lot.
In fact, ANY element found in the source text with an ID or class
gets added to the $node object during import, although most
data found this way is immediately discarded again if the content type
doesn't know how to serialize it.
A special case demonstrated here prepends field_ to known
CCK field names. Normally they get labelled as-is.
If the source data is NOT tagged, you'll have to develop a bit of custom XSL to produce the same effect.
    ... xsl preamble ...
    <xsl:template name="html_doc" match="/">
      <html>
        <body>
          ... other extractions ...
          <h3 id="byline">
            <xsl:value-of select="./descendant::xhtml:img[2]/@alt" />
          </h3>
        </body>
      </html>
    </xsl:template>
In this example, the byline we wanted to extract was the alt value of
the second image found in the page (a real-world example). This has now
been extracted and wrapped in an ID-ed h3 during an early phase of the
import process, and should now turn up in the CCK field_byline as
desired.
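For anyone who wants to check what that XPath picks up before writing the XSL, the same extraction can be tried in Python. This is only a sketch: the sample page is invented, and the xhtml prefix is bound to the standard XHTML namespace on the assumption that the tidied input declares it.

```python
import xml.etree.ElementTree as ET

# Prefix mapping, assuming the tidied source is namespaced XHTML.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Invented sample page: the byline hides in the alt of the second image.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<img src="logo.gif" alt="Site logo"/>
<p><img src="author.gif" alt="By me"/></p>
</body></html>"""

root = ET.fromstring(PAGE)

# All img descendants in document order; the second one holds the byline.
images = root.findall(".//xhtml:img", NS)
byline = images[1].get("alt") if len(images) > 1 else None

# Wrap it in an ID-ed h3, as the XSL template does.
h3 = ET.Element("h3", {"id": "byline"})
h3.text = byline
print(ET.tostring(h3, encoding="unicode"))
```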
XSL is complex, but magic.
Any add-on modules can use a hook to extract their own data from the input file and add data to the node object.
Core modules all add bits as defined in the import_html_modules.inc file; see that for the hook prototype and docs. Contrib examples are in the import_html/modules directory.
If, for example you were able to extract date/time information out of an import page, event.module could be told to do so, and create detailed event nodes.
On the admin/settings/import_html screen, you can (if you wish):
Files and folders beginning with _ or . are nominally 'hidden', so they are skipped and do not show up on this listing. While it's possible to list a thousand or so files, it may be a good idea to make the listing more selective, to scale to larger sites. Do this by entering the Subsection to list before clicking list, instead of waiting for every file on the server to be enumerated.
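The listing rule (skip anything starting with _ or ., optionally starting from a subsection) looks roughly like this as a Python sketch. The function name list_importable and its parameters are made up for the example and are not the module's API.

```python
import os

def list_importable(siteroot: str, subsection: str = "") -> list:
    """List files under siteroot/subsection as paths relative to siteroot.

    Files and folders whose names start with '_' or '.' are treated as
    hidden and skipped, including everything inside a hidden folder.
    """
    start = os.path.join(siteroot, subsection)
    found = []
    for dirpath, dirnames, filenames in os.walk(start):
        # Prune hidden folders in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith(("_", ".")))
        for name in sorted(filenames):
            if name.startswith(("_", ".")):
                continue
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, siteroot).replace(os.sep, "/"))
    return found
```

Passing a subsection such as archives/2007 keeps the walk to that folder, which is the point of the Subsection box on a large site.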
As mentioned in Usage, this module uses no database tables of its own. Pages are read straight into 'page' nodes. I guess it could feed into flexinode if your import files had extra parsable content blocks, and I've successfully used it to import other random XML formats (RecipeML) although the advantages of doing so are limited.
It's easy to imagine this system set up as a synchroniser that could re-fetch and refresh local nodes when remote content changes. This would involve recording exactly what the source URL was (which isn't currently done) but would be a fun feature.
I may fork off the page-parsing into a pluggable method, so that a regexp version can be developed alongside, and be used for folk without XSL support.
How to leverage this to import a local site to a remote server? You must either unpack the source files somewhere on that machine, then provide the absolute path where the server can find them, or upload a zip package and I'll try to take it from there (TODO).
Also TODO is a 'Spidering' method to try to import URL sites. Way in the future!
TODO Allow settings to set import content type to something other than 'page' (done)
TODO Find a way to map more meta-data from the original page (assuming there is any to be extracted) to Drupal properties, eg get the contents of META keywords into Taxonomy associations
TODO There are issues when a page links directly, via an href, to a file that would be regarded as a resource. Most hrefs are re-written to point to the new node, but things like large images or word docs get imported under 'files'. The XSL rewrite_href_and_src.xsl attempts to correct for this, but there may be some side-effects. Always run a link checker after import.
The PHP4 XML parser (Sablotron) has trouble with duplicate attributes - if found in a tag (like from old bad HTML) all subsequent input will be flattened to plaintext. Older versions of HTMLTidy, however, do not detect and fix these for us. Make sure that tidy supports option repeated-attributes. It seems the commandline version fixed this somewhere between the 2000 and 2004 release. (Not sure about the PHP module version - it's PHP5, so should be OK)
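If you are running tidy from the command line yourself to check this, an invocation along these lines asks it to repair duplicate attributes. This is a Python sketch with assumptions: tidy_command and run_tidy are names invented here, the options shown are standard HTMLTidy ones but not necessarily the set the module passes, and keep-last is just one of the two values the repeated-attributes option accepts.

```python
import subprocess

def tidy_command(source_path: str, tidy_bin: str = "tidy") -> list:
    """Build a tidy command line that outputs XHTML and fixes repeated attributes."""
    return [
        tidy_bin,
        "-quiet",                  # suppress the summary chatter
        "-asxhtml",                # convert HTML to well-formed XHTML
        "-numeric",                # numeric entities, safer for XML parsers
        "--repeated-attributes", "keep-last",  # drop duplicate attributes
        source_path,
    ]

def run_tidy(source_path: str, tidy_bin: str = "tidy") -> str:
    """Run tidy and return the cleaned markup from stdout."""
    result = subprocess.run(
        tidy_command(source_path, tidy_bin), capture_output=True, text=True
    )
    # tidy exits with 0 for clean input, 1 for warnings, 2 for errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

An old binary that does not know the repeated-attributes option will complain about it, which is itself a quick way to find out that an upgrade is needed.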
An issue has been found running under
    PHP 4.3.11
    xsltproc 1.0.16
    libxml 2.6.2
    libxslt 1.1.0
(possibly other similar configurations) whereby &lt; and &gt; encoded entities in the input are prematurely converted to literal < and > characters, outputting unencoded results. We are pretty sure this is a limitation of the old, incomplete XML implementations of the time. An upgrade to PHP 4.4 solved it in one case, so good luck.
If your server has PHP open_basedir restrictions in effect, the webserver/PHP process may
be prevented from accessing files outside of webroot. This is a good
security measure, but may stop import_html from reading your source data
(even though browsing the source directories may still appear to work).
The open_basedir setting can be seen in your phpinfo.
An error like:
    Local file copy failed (/tmp/1fixed/simple.htm to files/imported/simple.htm)
when you are sure the source file does exist and its permissions are
readable, may be symptomatic of good security on your server. A reasonable fix is to place your
source data inside webroot/files (even if just temporarily) to run the
import process, then delete it later. Alternatively, copy your data over
top of web root (as described in walkthrough.htm) to do an in-place
import. Disabling open_basedir is not recommended, and probably requires
root privileges anyway. (See the Drupal.org issue discussion.)
I've gone to great lengths to rewrite the links from the new node
locations into relative links to the resources that moved over
into /files/ but there are problems. When a/long/path/index.html
links to its image by going ../../../files/a/long/path/pic.jpg
it works, which is good. But as a/long/path/index.html
is also aliased to a/long/path, that up-and-over path is wrong
now that the page is being served from what looks to the browser like a
different place.
I don't favour embedding anything that hard-codes the Drupal
base_url, and we don't want to use HTML BASE. I want to continue to
support portable subsites, so embedding site-rooted links (/files/etc)
is not great either.
Currently, by happy chance, going up one ../ too far
will get ignored by most browsers, so if you are not
running Drupal in a subdirectory, the requests for both styles of page will
just work. Which will mean that 80% of cases should get by OK. The rest
may need an output filter of some sort developed some day.
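The effect is easy to reproduce with a URL resolver. The sketch below uses Python's urljoin, which follows the RFC 3986 rules (including dropping a ../ that climbs above the root), as a stand-in for what a browser does. The example.com host and the /drupal/ subdirectory are made up for the illustration.

```python
from urllib.parse import urljoin

# The relative link written into the imported page.
LINK = "../../../files/a/long/path/pic.jpg"

def resolve(page_url: str) -> str:
    """Resolve LINK against the URL the page is served from."""
    return urljoin(page_url, LINK)

# Drupal at the site root: both the file path and the alias land on the
# same target, because the one extra ../ above the root is dropped.
print(resolve("http://example.com/a/long/path/index.html"))
print(resolve("http://example.com/a/long/path"))

# Drupal in a subdirectory: the alias climbs out of /drupal/ and misses.
print(resolve("http://example.com/drupal/a/long/path/index.html"))
print(resolve("http://example.com/drupal/a/long/path"))
```

The last case resolves to /files/... instead of /drupal/files/..., which is the 20% that would need the output filter.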
If you find that duplicate, identical menu items are created (both '/mycontent' and '/mycontent/mycontent') or that child items are created under inaccessible or non-existent parents, check the 'Default Document' in the settings. The process will interpret 'index.htm' and 'index.html' differently, and only the correct one can be used as the parent item. Enter the filename appropriate to the site you are importing from.
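Why the Default Document matters can be seen in a small sketch of how a source path might turn into an alias and a parent menu item. This is a guess at the logic written in Python, not the module's code: alias_and_parent is an invented name, and leaving non-default filenames unchanged in the alias is an assumption.

```python
import posixpath

def alias_and_parent(rel_path: str, default_document: str = "index.htm") -> tuple:
    """Return (alias, parent_alias) for a source file path.

    A file whose name matches default_document stands in for its folder,
    so its alias is the folder path. Any other file keeps its own path
    and sits underneath its folder.
    """
    directory, filename = posixpath.split(rel_path)
    alias = directory if filename == default_document else rel_path
    return alias, posixpath.dirname(alias)

# Setting matches the site: the index page becomes the folder item itself.
print(alias_and_parent("mycontent/index.html", "index.html"))

# Setting does not match: the index page is treated as an ordinary child
# of 'mycontent', and nothing ever creates the 'mycontent' item itself.
print(alias_and_parent("mycontent/index.html", "index.htm"))
```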
/var/www/old_site/html/
archives/2007
There is provision for pages and resources to be placed in different places in the site.
In general, pages will be accessed under the normal site root, while Drupal conventions place images and documents into a /sites/site.name/files folder. Most legacy sites have images in /images etc. This needs to become /sites/site.name/files/images
The rewriting does this, by detecting the suffix of the file being linked to in href= or src= tags. Any file type that is not in the list of known HTML file types is regarded as a 'resource' and rewritten to appear under the 'files' directory.
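The suffix test can be sketched like this. It is a Python illustration, not the module's code: HTML_SUFFIXES is a made-up stand-in for the list of known HTML file types, the 'files' prefix is simplified from the full /sites/site.name/files path, and treating suffix-less links (folders) as pages is an assumption.

```python
import posixpath

# Hypothetical stand-in for the module's list of known HTML file types.
HTML_SUFFIXES = {".htm", ".html", ".shtml", ".php"}
FILES_PREFIX = "files"

def classify(link: str) -> str:
    """Return 'page' or 'resource' based on the suffix of the linked file."""
    path = link.split("#", 1)[0].split("?", 1)[0]
    suffix = posixpath.splitext(path)[1].lower()
    # Assumption: a link with no suffix points at a folder, so it is a page.
    if suffix == "" or suffix in HTML_SUFFIXES:
        return "page"
    return "resource"

def rewrite_target(site_path: str) -> str:
    """Move resources under the files directory and leave pages where they are."""
    if classify(site_path) == "resource":
        return posixpath.join(FILES_PREFIX, site_path.lstrip("/"))
    return site_path
```

So images/logo.gif ends up as files/images/logo.gif, while about/index.html is left alone.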
The $import_html_file_classes array is currently hard-coded in the module.
File suffixes are not good enough for this. Should the suffix list be
editable, or should I scan the files themselves?
TODO Hard http://old.site/ links-to-self may need to be removed using a different process.
Long ago, I started building this with reference to the existing import/export module but I couldn't find too many common features. The transitional format the XSL templates convert into is a 'microformat' of XHTML (basically XHTML, but with strictly controlled classes and IDs). This is how I see a platform-agnostic dump of content should be exported, when this eventually morphs into import_export_HTML.