Creating Drupal content using Python

I wanted to import data into Drupal from elsewhere - in my case, from XML exported by another system.

Perhaps the easiest way to interface with Drupal at a simple level is to use the Services module. This lets you, for example, write applications in other languages which can create, list, or update Drupal content. The documentation for it is patchy, and some of it is a bit dated.

So here’s a quick example of how you can use a Python program to create a node in Drupal. Hope this helps to get you started if you need to do something similar.

#! /usr/bin/python
# 
# Creating nodes in Drupal 6 via the Services module using 
# Python and XML-RPC.
# 
# To use this, get and install the Services module from 
#    http://drupal.org/project/services
# 
# Enable the XMLRPC Server module, the System Service module and the 
# Node Service module at least.
#
# Note that this is intended to demo quick importing of data from other
# systems. It makes no use of authentication; in fact, you need to enable
# various permissions for anonymous users before it will work.
# Enable things like 'administer nodes', 'create page content',
# 'create url aliases' and 'administer url aliases'.
# 
# You should remember, therefore, to DISABLE them before you go live 
# with a production site!
# 
# This assumes you have 'Use sessid' but not 'Use keys' enabled in the 
# Services settings at http://yoursite/admin/build/services/settings.

import xmlrpclib, time

# Put the URL for your Service in here. See admin/build/services.
s = xmlrpclib.ServerProxy('http://myhost.com/services/xmlrpc')

class node:
    # You need to set uid and username appropriately for your site if you don't want
    # everything to be posted by Anonymous.
    def __init__(self, title, body, path, ntype='page', uid=1, username='qsf'):
        self.title = title
        self.body = body
        self.path = path
        self.type = ntype
        self.uid = uid
        self.name = username
        self.promote = False

try:
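    # system.connect returns a struct, which xmlrpclib hands back as a dict
    # (assumed here to have 'sessid' and 'user' keys), so look the values up
    # by key; unpacking the dict directly would only give you the key names.
    connection = s.system.connect()
    sessid, user = connection['sessid'], connection['user']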
    # Here you could read in the content for each node from some other source and do
    #   n = node(title, body, path)
    # but for now we'll just do:
    n = node('A test node', 'This is an interesting page', 'interesting')
    # and then save it into Drupal
    s.node.save(sessid, n)
    # where it should appear at /interesting.

except xmlrpclib.Fault, err:
    print "A fault occurred"
    print "Fault code: %d" % err.faultCode
    print "Fault string: %s" % err.faultString
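
The comment in the try block mentions reading the content for each node from some other source. As a rough sketch of what that might look like for an XML export: the <export>/<item> layout, the SAMPLE_EXPORT data and the read_items name below are all invented for illustration, so you'd need to adapt them to whatever your other system actually produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical export format, made up for this example.
SAMPLE_EXPORT = """<export>
  <item>
    <title>First page</title>
    <path>about/first</path>
    <body>Hello from the old system.</body>
  </item>
  <item>
    <title>Second page</title>
    <path>about/second</path>
    <body>More imported text.</body>
  </item>
</export>"""

def read_items(xml_text):
    """Yield a (title, body, path) tuple for each <item> in the export."""
    root = ET.fromstring(xml_text)
    for item in root.findall('item'):
        yield (item.findtext('title', default=''),
               item.findtext('body', default=''),
               item.findtext('path', default=''))

for title, body, path in read_items(SAMPLE_EXPORT):
    # In the script above, this is where you would do
    #   s.node.save(sessid, node(title, body, path))
    print("%s -> /%s" % (title, path))
```

Each tuple is in the same order as the arguments to node(), so the loop body is the only part that needs to talk to Drupal.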

Importing content from Plone to Drupal

I used a variant of this script to convert a Plone-based site to Drupal. Whether any of the following would work for somebody else’s site depends to a great degree on the particular configurations involved, but I offer it in case it’s useful as inspiration! You’ll almost certainly need to tweak it.

As those familiar with Plone will know, the content is stored in an object-oriented database, and mapping that onto other things can be tricky. Most of my content was reasonably straightforward, though, and I tackled the problem by switching on the Zope options to make the Plone site accessible by FTP. I could then use the FTP tool of my choice to access what looked like a filesystem and copy the whole hierarchy onto another machine.

My Plone content was basically HTML and images, and each node in Plone came across as a simple HTML file without all the boilerplate formatting. Perfect! I could then write a script which would take, for example, the file in news/article1, read it as XML, gather the bits I wanted and create a node in Drupal which also had news/article1 as the path.

The main bit of manual tweaking was to do with relative URLs. In Drupal you really want every internal link on your site to be an absolute URL (i.e. to begin with ‘/’), because the same content can be chopped up and displayed under so many different paths. On the old site, news/article1 might link to article2 in the same folder, which isn’t going to work if news/article1 is sometimes accessed as node/29, or recentstuff/thisweek, or whatever. You need to look for links that begin with neither http nor / and update them appropriately, to /news/article2 or whatever. I didn’t do that in the script because there were few enough that I could fix them using global searches in my editor.
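
That check could also be automated. Here is a minimal sketch, which is not part of the import script below: the names make_root_relative and rewrite_links are made up for illustration, and the regular expression only handles double-quoted href and src attributes, so treat it as a starting point rather than a complete solution. The import fallback is there so that it runs on both Python 2 and Python 3.

```python
import re

try:
    from urllib.parse import urljoin, urlsplit  # Python 3
except ImportError:
    from urlparse import urljoin, urlsplit      # Python 2

# Matches double-quoted href="..." and src="..." attributes only.
LINK_RE = re.compile(r'\b(href|src)="([^"]*)"')

def make_root_relative(link, page_path):
    """Resolve a relative link against the page it appears on.

    Links that already have a scheme (http:, mailto:), a host, or a leading
    '/' are returned unchanged, as are empty links and in-page '#' anchors.
    """
    parts = urlsplit(link)
    if (not link or parts.scheme or parts.netloc
            or link.startswith('/') or link.startswith('#')):
        return link
    return urljoin('/' + page_path.lstrip('/'), link)

def rewrite_links(html, page_path):
    """Rewrite every relative href/src in html so that it begins with '/'."""
    def fix(match):
        return '%s="%s"' % (match.group(1),
                            make_root_relative(match.group(2), page_path))
    return LINK_RE.sub(fix, html)

print(rewrite_links('<a href="article2">next</a>', 'news/article1'))
# prints: <a href="/news/article2">next</a>
```

The page path passed in is the same one given on the command line (news/article1 and so on), so the function could be applied to the body text just before the node is saved.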

OK, so here’s how I ran my import.py script. Having copied the Plone content to my local drive by FTP, I changed into the folder and did, for example:

./import.py news/article1 news/article2 ...

And here’s the script itself. I hope it’s reasonably self-explanatory. I use the tidy utility to convert sometimes non-conformant HTML into valid XML, and then parse the result with ElementTree.

#! /usr/bin/python
# 
# This assumes you have 'Use sessid' but not 'Use keys' enabled in the 
# Services settings at http://yoursite/admin/build/services/settings.
# You'll need the appropriate permissions set, and the Path module enabled.

import xmlrpclib, time, sys, subprocess, os
import xml.etree.ElementTree as ET

# Where is the XML-RPC service on your Drupal site? ServerProxy needs the full
# URL, scheme and host included, as in the first example.
s = xmlrpclib.ServerProxy(':8888/services/xmlrpc')

class node:
    # You need to set uid and username appropriately for your site if you don't want
    # everything to be posted by Anonymous.
    def __init__(self, title, body, path, ntype='page', date=None, uid=1, username='qsf'):
        self.title = title
        self.body = body
        self.path = path
        self.type = ntype
        self.uid = uid
        self.name = username
        self.promote = False
        self.format = 3
        self.comment = 0
        if date:
            # self.created = date
            # self.changed = date
            self.date = date
            print "date = ",date

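# system.connect returns a struct, which xmlrpclib hands back as a dict
# (assumed here to have 'sessid' and 'user' keys), so look the values up
# by key; unpacking the dict directly would only give you the key names.
connection = s.system.connect()
sessid, user = connection['sessid'], connection['user']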
    
for path in sys.argv[1:]:
    try:
        # Read the XML file, tidying it up and making valid XML.
        tidypipe = subprocess.Popen(["tidy", "-q", "-asxml", "-n",  path], 
            stdout=subprocess.PIPE)
        # The 'tidy' process will sometimes add namespace declarations.
        # ElementTree stuff below gets messy if we have to parse namespaces
        # so I'm just going to throw them away before we read the file.
        nsstrip = subprocess.Popen(["sed", "s/^<html .*>$/<html>/"], 
            stdin = tidypipe.stdout, stdout=subprocess.PIPE)
        # Read the file into a DOM tree
        tree = ET.parse(nsstrip.stdout)
        root = tree.getroot()
        desc = None
        date = None
        ntype = 'page'
        # Some of the info we want is in the <meta> tags.
        for i in root.findall(".//meta"):
            if i.get('name')=='Description': desc = i.get('content')
            if i.get('name')=='Effective_date': 
                date = i.get('content')
            if i.get('name')=='Type' and i.get('content')=='News Item': 
                ntype='story'
        
        # This is where we build up the body of our new node.
        body = ""
        # We turn what was the 'Description' in Plone into the teaser text in Drupal
        if desc:
            body += "<p>%s</p>\n<!--break-->\n" % desc
            
        # The title comes from <title>
        title = root.find('.//title').text
        # and the rest from <body> 
        body_outer = root.find('.//body')
        # The <body> may have text that isn't inside any element
        # (.text is None when there isn't any, hence the 'or').
        body += body_outer.text or ""
        # then we add all the elements.
        for n in body_outer:
            body += ET.tostring(n)

        # Strip off index_html, if that's part of the path
        path = path.replace('/index_html', '')

        # Show the user what's happening
        print "Title:",title
        print "Desc: ",desc
        print "Date: ",date
        
        # And create a new node
        n = node(title, body, path, ntype=ntype, date=date)
        s.node.save(sessid, n)

        # On a Mac this will open it in the browser so you can check it out.
        # os.system('open %s/%s' % (':8888', path))
        print "------------------------------------------"

    except xmlrpclib.Fault, err:
        print "======================================"
        print "A fault occurred with",path
        print "Fault code: %d" % err.faultCode
        print "Fault string: %s" % err.faultString
        print "======================================"
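
If you'd rather not depend on sed, the namespaces that tidy adds could instead be removed after parsing. This is an alternative sketch rather than what the script above does: strip_namespaces is a made-up helper name, and the sample XHTML string simply stands in for tidy's output. It only renames element tags and leaves attribute names alone.

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    """Remove the '{namespace-uri}' prefix ElementTree puts on tag names."""
    for element in root.iter():
        tag = element.tag
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(tag, str) and tag.startswith('{'):
            element.tag = tag.split('}', 1)[1]
    return root

# Stand-in for the output of 'tidy -asxml', which declares the XHTML namespace.
SAMPLE = ('<html xmlns="http://www.w3.org/1999/xhtml">'
          '<head><title>Article 1</title></head>'
          '<body><p>Some text</p></body></html>')

root = strip_namespaces(ET.fromstring(SAMPLE))
print(root.find('.//title').text)
```

With the prefixes gone, plain paths such as './/title' and './/meta' match in the same way as they do in the script.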

I don’t normally have comments enabled on this site, so if you’d like to add any comments or tips, it’s probably best to do so on the Status-Q post.

Quentin Stafford-Fraser, Dec 2008.