Observer.com Migrates to Drupal 6 Every Night

By Austin Smith on January 12th, 2009 at 4:23am
Posted in Data Migration, Drupal, Free Advice, New York Observer and Observer.com

Every night for the next month, while I am sleeping, all of the data in the Drupal 5 version of Observer.com will migrate to the Drupal 6 version. This means that our editors will always get to see yesterday's content on the build site of the Drupal 6 version and we will have witnessed more than 30 (hopefully) successful data migrations and have plenty of time to make adjustments. When we're ready to launch the site, we'll just let the data migrate as usual, then dump the newly created database and files directory, install them on the production MySQL and NFS servers, and change Apache's configuration to point to the Drupal 6 version--it couldn't be easier, and it won't take me longer than thirty minutes. So the why of this should be pretty obvious who's done this sort of migration before. But the how isn't so clear, or it wouldn't have taken me so long to get right.

I should emphasize that this is the result of a couple of hard lessons from Drupal migrations. When we went from version 1 to version 2 of our Drupal 5 site, I was up for 28 straight hours trying to make sense of what our vendors built and merge it with our data--I've never spent so much time re-weighting blocks and shifting them around regions. When we went from version 2 to version 3, I was better prepared with a tested script, but I hadn't yet done it for real and my team had some really hilarious trouble (involving recursively creating nodes, infinitely, on cron). So this, being the biggest migration yet, save for the first one that Moshe Weitzman did, I wanted to be absolutely prepared.

Moving a ton of data

The most cumbersome part is getting all the data from the current Drupal 5 production server to the Drupal 6 development server. Our MySQL slave server also houses our NFS server for Drupal's files/ directory, so all the action takes place here. If you have a slave, you should dump data off of it, not the master. We have a script, better_dump.sh, which creates a SQL database dump.

#!/bin/bash
mysqldump -uxxx -pxxxx --no-data dbname
mysqldump -uxxx -pxxxx --no-create-info \
    --ignore-table=nyobserver.sessions \
    --ignore-table=nyobserver.watchdog \
    --ignore-table=nyobserver.cache \
    --ignore-table=nyobserver.cache_menu \
    --ignore-table=nyobserver.cache_page dbname

See all those "--ignore-table" options? We have no use for the sessions, watchdog, or any cache tables, so we leave them behind. The data file still clocks in at over a gigabyte, so we gzip it down to about 400 mb. Don't pipe this shell script through gzip, let it run, and then gzip it. Then just move it with SCP.

Moving the files was a total bear the first time. I just made a tarball of the files/ directory and copied it (more than 5 gigabytes). After that first time, though--no sweat. You can use find to get the files created since the last dump, and -exec the output to tar. Like this:

#!/bin/bash
cd /exports/nyobserver/
find . -ctime 1 -exec tar -rf ~/files.tar {} \;

"find . -ctime 1" means "find everything under the current directory newer than a day." Then just SCP as before. Note that there's little point in gzipping it too because most of what's in my files/ directory are images.

NB: The servers are configured with relatively unprivileged SSH keys so they can log into each other anonymously and SCP files back home. The actual generation of the data and file dumps is done on cron, as is the downloading and importing of them on the development server.

Three MySQL Databases, Four Drupal Sites

This approach requires three different MySQL databases and four different Drupal sites. A "blank" database, which has the new Drupal 6 architecture, generated from a permanent (not changing every night) Drupal installation. The script dumps this database into a second database (the target database, rebuilt every night) to re-initialize it to the blank state. Then the Drupal 5 database is re-imported, because it will have changed in the last 24 hours. The PHP script then runs--it's in the Drupal 5 codebase because it has to massage lots of data in place. These are the structural changes we're making:

  1. At present, blogs, columns, and series are all CCK node types. In the new version, they're just terms in a vocabulary because we don't expect to retain any additional data about them. If we need to, we will find some other way, because we get a big bonus out of being able to use the same view for any term.
  2. At present, we're using ImageField. In the Drupal 6 version, we're reverting to image.module for a number of reasons detailed in a prior post.
  3. At present, reporters, the people who actually write and get credit for articles are stored as nodes. I want them to be users.
  4. We hacked check_plain to accept bulletin-board style tags so reporters could use italics in their title. In this version, we're just using a CCK field named "rich title" to display in HTML, and the normal node title, sans markup or fancy quotes, to display in the browser title bar and in syndication.

These changes and a number of others take place during the nightly migration. The first script that runs is within the Drupal 5 version to alter vocabularies and convert "author" nodes to real users, and then another script runs from the Drupal 6 version to finish migrating images and clean up titles in nodes.

So I said four Drupal sites. Here they are:

  1. One that houses the just-imported-from-production Drupal 5 database. It has to be a full, bootstrappable installation, though, because it uses some Drupal functions in its migration script.
  2. One that always houses the blank state of the Drupal 6 database. We make all our theme, view, module, etc. changes here, and then they get built into the migrated version automatically.
  3. One that always houses the most recent migration from Drupal 5 to Drupal 6.
  4. One that allows me to "pause" the migration so I can develop for awhile with live data. At some point, though, I'll have to make the changes to the blank state site so that they can get merged back in.

Doesn't that take forever?

No! The entire process--dumping the database, tarballing the files, transferring them to the dev server, decompressing them, reverting the databases to the blank state, and then running the migration scripts--takes less than an hour. Of that hour, the actual migration takes about five minutes--and this for a database with almost 100,000 nodes. The key is this basic query:

INSERT INTO drupal6.foo SELECT * FROM drupal5.foo

This won't always work exactly, but for tables which didn't change between D5 and D6, it works perfectly--and very, very fast, too. For tables with minor changes, I was usually able to simply list the columns I wanted from D5. The only data I iterate over using PHP is the article nodes themselves, article authors, and images. Technically I needed to make three queries per article node, but I made those three queries once with no where clause and cached them in PHP arrays (used in this case like a hashtable). This way I end up making four select queries total rather than 3n + 1, where n is the number of nodes I need to convert.

The script is broken into three distinct phases: copying files from the remote server, reverting the databases to the blank state, then running the migration. So if I'm running the process manually (as I am while I write this post), I have just three bash scripts to worry about.

Once more, from the top

  1. Cron running on our database slave dumps the database and the most recent file uploads.
  2. Cron running on our development server downloads the database and file uploads from the database slave.
  3. When the downloads are complete, the development server erases the Drupal 5 database and imports it from the just-arrived dump. It copies the new files into the Drupal 6 site.
  4. The dev server dumps the blank state database and copies it to the nightly build database, erasing last night's migration and merging today's work in with the real data.
  5. The script which runs the actual migration is triggered. It bootstraps the Drupal 5 site and massages all the data it can, but mostly it just sends commands to MySQL to import data using INSERT INTO... SELECT FROM... syntax.
  6. When this is done, the import is almost complete and the site is browseable. A script running under the Drupal 6 site finishes some things that are best done with built-in Drupal functions, runs some unit tests, and shuts down.

Drupal: Still just software

There's a deeper story here about doing build automation in Drupal. This is one of my very favorite practices--build all the code, put it all on the same server, import a lot of production data, and run your unit tests. When you come to work, you'll know for sure whether you actually completed yesterday's task. We do continuous integration all the time just by having one central installation of our site which updates every time anybody commits, but it doesn't always have the latest data. Build automation is hot stuff in Drupal right now: testing.drupal.org is making the work of patch review a whole lot easier, and there's no reason for you not use these strategies on your Drupal sites. Best practices are best practices--Drupal or not.

Questions? Anyone with experience doing this have advice?

Very interesting article.

Very interesting article. Thanks

Awesome!

At Sony we're doing a LOT of d5 -> d6 upgrades. We have around 70 sites built on our d5 platform and as new features are required, we want to move those sites off of our aging d5 platform and onto our new, cool d6 platform. We've built a prep module for d5 and a complementary finishing module for d6. The whole process is accomplished by pointing both d5 and d6 sites to the d5 database, enable the d5 module, then run update.php on the d6 site.

So we have the upgrade process pretty nailed down (for our specific platform at least), but the idea of doing an automated upgrade for x number of nights before doing the live upgrade is brilliant. We will definitely be moving towards this strategy, since we have so many upgrades in our future.

Speedier Database Backups

With so much data, you may want to consider mysqlhotcopy

http://dev.mysql.com/doc/refman/5.0/en/mysqlhotcopy.html

Something I picked up from a session at Drupalcon Szeged
http://szeged2008.drupalcon.org/files/mysqlBackupSecurity-DrupalCon-2008...

andre

+1

Totally going to do this. Thanks!

not on innodb

mysqlhotcopy won't run on innodb tables, so if you're using those, it's not an option. There's a paid alternative though, but I haven't tried it (last time I looked it cost more than I was paying for my VPS, although I'm not on a VPS any longer and neither is observer.com).

Great sharing

This sort of write-up must read insanity for small site owners, but it's going on a lot at the corporate/enterprise level - much more than what we hear about in blogs. My suspicion is that people are frequently re-inventing the wheel, since you can't exactly commit an end-to-end process like this in a code repo. It's really great that you've taken the time to write this!

Great writeup

Thanks for documenting your process Austin. This is some great material that we can all build on.

Post new comment

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <img> <p>
  • Lines and paragraphs break automatically.

More information about formatting options