Every night for the next month, while I am sleeping, all of the data in the Drupal 5 version of Observer.com will migrate to the Drupal 6 version. This means that our editors will always get to see yesterday's content on the build site of the Drupal 6 version and we will have witnessed more than 30 (hopefully) successful data migrations and have plenty of time to make adjustments. When we're ready to launch the site, we'll just let the data migrate as usual, then dump the newly created database and files directory, install them on the production MySQL and NFS servers, and change Apache's configuration to point to the Drupal 6 version--it couldn't be easier, and it won't take me longer than thirty minutes. So the why of this should be pretty obvious who's done this sort of migration before. But the how isn't so clear, or it wouldn't have taken me so long to get right.
I should emphasize that this is the result of a couple of hard lessons from Drupal migrations. When we went from version 1 to version 2 of our Drupal 5 site, I was up for 28 straight hours trying to make sense of what our vendors built and merge it with our data--I've never spent so much time re-weighting blocks and shifting them around regions. When we went from version 2 to version 3, I was better prepared with a tested script, but I hadn't yet done it for real and my team had some really hilarious trouble (involving recursively creating nodes, infinitely, on cron). So this, being the biggest migration yet, save for the first one that Moshe Weitzman did, I wanted to be absolutely prepared.
The most cumbersome part is getting all the data from the current Drupal 5 production server to the Drupal 6 development server. Our MySQL slave server also houses our NFS server for Drupal's files/ directory, so all the action takes place here. If you have a slave, you should dump data off of it, not the master. We have a script, better_dump.sh, which creates a SQL database dump.
#!/bin/bash
mysqldump -uxxx -pxxxx --no-data dbname
mysqldump -uxxx -pxxxx --no-create-info \
--ignore-table=nyobserver.sessions \
--ignore-table=nyobserver.watchdog \
--ignore-table=nyobserver.cache \
--ignore-table=nyobserver.cache_menu \
--ignore-table=nyobserver.cache_page dbnameMoving the files was a total bear the first time. I just made a tarball of the files/ directory and copied it (more than 5 gigabytes). After that first time, though--no sweat. You can use find to get the files created since the last dump, and -exec the output to tar. Like this:
#!/bin/bash
cd /exports/nyobserver/
find . -ctime 1 -exec tar -rf ~/files.tar {} \;NB: The servers are configured with relatively unprivileged SSH keys so they can log into each other anonymously and SCP files back home. The actual generation of the data and file dumps is done on cron, as is the downloading and importing of them on the development server.
This approach requires three different MySQL databases and four different Drupal sites. A "blank" database, which has the new Drupal 6 architecture, generated from a permanent (not changing every night) Drupal installation. The script dumps this database into a second database (the target database, rebuilt every night) to re-initialize it to the blank state. Then the Drupal 5 database is re-imported, because it will have changed in the last 24 hours. The PHP script then runs--it's in the Drupal 5 codebase because it has to massage lots of data in place. These are the structural changes we're making:
These changes and a number of others take place during the nightly migration. The first script that runs is within the Drupal 5 version to alter vocabularies and convert "author" nodes to real users, and then another script runs from the Drupal 6 version to finish migrating images and clean up titles in nodes.
So I said four Drupal sites. Here they are:
No! The entire process--dumping the database, tarballing the files, transferring them to the dev server, decompressing them, reverting the databases to the blank state, and then running the migration scripts--takes less than an hour. Of that hour, the actual migration takes about five minutes--and this for a database with almost 100,000 nodes. The key is this basic query:
INSERT INTO drupal6.foo SELECT * FROM drupal5.foo
This won't always work exactly, but for tables which didn't change between D5 and D6, it works perfectly--and very, very fast, too. For tables with minor changes, I was usually able to simply list the columns I wanted from D5. The only data I iterate over using PHP is the article nodes themselves, article authors, and images. Technically I needed to make three queries per article node, but I made those three queries once with no where clause and cached them in PHP arrays (used in this case like a hashtable). This way I end up making four select queries total rather than 3n + 1, where n is the number of nodes I need to convert.
The script is broken into three distinct phases: copying files from the remote server, reverting the databases to the blank state, then running the migration. So if I'm running the process manually (as I am while I write this post), I have just three bash scripts to worry about.
There's a deeper story here about doing build automation in Drupal. This is one of my very favorite practices--build all the code, put it all on the same server, import a lot of production data, and run your unit tests. When you come to work, you'll know for sure whether you actually completed yesterday's task. We do continuous integration all the time just by having one central installation of our site which updates every time anybody commits, but it doesn't always have the latest data. Build automation is hot stuff in Drupal right now: testing.drupal.org is making the work of patch review a whole lot easier, and there's no reason for you not use these strategies on your Drupal sites. Best practices are best practices--Drupal or not.
Questions? Anyone with experience doing this have advice?
Very interesting article.
Very interesting article. Thanks
Awesome!
At Sony we're doing a LOT of d5 -> d6 upgrades. We have around 70 sites built on our d5 platform and as new features are required, we want to move those sites off of our aging d5 platform and onto our new, cool d6 platform. We've built a prep module for d5 and a complementary finishing module for d6. The whole process is accomplished by pointing both d5 and d6 sites to the d5 database, enable the d5 module, then run update.php on the d6 site.
So we have the upgrade process pretty nailed down (for our specific platform at least), but the idea of doing an automated upgrade for x number of nights before doing the live upgrade is brilliant. We will definitely be moving towards this strategy, since we have so many upgrades in our future.
Speedier Database Backups
With so much data, you may want to consider mysqlhotcopy
http://dev.mysql.com/doc/refman/5.0/en/mysqlhotcopy.html
Something I picked up from a session at Drupalcon Szeged
http://szeged2008.drupalcon.org/files/mysqlBackupSecurity-DrupalCon-2008...
andre
+1
Totally going to do this. Thanks!
not on innodb
mysqlhotcopy won't run on innodb tables, so if you're using those, it's not an option. There's a paid alternative though, but I haven't tried it (last time I looked it cost more than I was paying for my VPS, although I'm not on a VPS any longer and neither is observer.com).
Great sharing
This sort of write-up must read insanity for small site owners, but it's going on a lot at the corporate/enterprise level - much more than what we hear about in blogs. My suspicion is that people are frequently re-inventing the wheel, since you can't exactly commit an end-to-end process like this in a code repo. It's really great that you've taken the time to write this!
Great writeup
Thanks for documenting your process Austin. This is some great material that we can all build on.
Post new comment