Search & Replace PHP Serialized Strings

Or how we improved the WordPress database migration speed by 40x with Go.

At VIP we move WordPress databases around many times per day and as you can probably imagine, our customers tend to have of a lot of data. When we move data between environments, one of the bottlenecks has historically been search and replace of the domain names. The two main use cases are importing data containing development domain names and moving production data to a hosted development environment.

The best option in most situations is WP-CLI’s search-replace command. WP-CLI is fairly ubiquitous at this point and it’s easy to use. The problem that we tend to have is that it’s not quite fast enough on really huge datasets.

You may be wondering how a simple search and replace could be the bottleneck. WordPress stores lots of PHP serialized strings in the database. Since object lengths are encoded in serialized strings, you can’t simply search and replace domain names in the database unless they happen to be the same length.

php > echo serialize( "google.com" ); s:10:"google.com";

WP-CLI deals with this by pulling the objects out of the database, unserialzing them, running recursive search and replace on the resulting object, reserializing them, and then updating the database. While this only takes a few seconds on most WordPress sites, it can take many minutes or even hours on sites with millions of post objects.

Since our migration process is based on .sql files, we thought it might be faster to run the search and replace outside WordPress as long as we could reliably fix the encoded lengths. So, I wrote Automattic/go-search-replace to do that.

There are two main things that happen:

  1. The search and replace. We simply replace one domain with another.
  2. Fix encoded string lengths as necessary.

It turns out that dealing with the string lengths is not as hard as we originally thought. Modifying nested objects is not a concern because we’re not changing the number of nested items. We only need to look at the string length immediately preceding the string we’re replacing.

Another problem that could have been hard to deal with is maintaining the integrity of the .sql file. It would be easy to replace the characters that are used by the MySQL importer to delineate records, like ),\n(. Our solution is to limit the search domain roughly to the characters that make up valid domain names.

Using this new command line tool, we were able to improve the search and replace process by about 25x. I wondered if we could make it even faster using concurrency.

One challenge is that we need to ensure every line in the resulting file is in the same place. For that, we use a buffered channel of channels. For each line, we write a new string channel to the channel and asynchronously write to that channel. Effectively, we put a placeholder on the channel synchronously, and write to it asynchronously. Here’s that code on GitHub.

We’ve been using this in production for over a year now. In some tests we’ve seen the search and replace performance improved by up to 40x. Since we use STDIN and STDOUT, we can use go-search-replace as part of an import pipeline. In some cases, the search and replace runs at line rate and has no effect on the total import time.

Docker WordPress

Almost a year ago now, I started a Docker image for WordPress so I could guarantee there would be new builds available immediately after new WordPress releases.

Since I was working on an image anyway, I decided to build in support for Memcached. Object cache plugins like Memcached Object Cache are pretty common on high traffic WordPress sites.

I also set up wp-config.php to load some of the configuration options from environment variables. To make the image a little more extensible, I also have it automatically loading PHP files in the /var/www/html/wp-config directory. You can mount a volume with all your extra config files in the container to automatically load them.

After the initial setup, maintenance was super easy. Docker Hub automatically builds every time I push to GitHub. I’ve also got a script that checks for new WordPress and WP-CLI releases and automatically pushes updates to GitHub.

There are currently PHP 7.3 builds for WordPress core (PHP-FPM) and WP-CLI available.

Lossy Compression with Photon

Photon, the image service hosted by Automattic, does lossless compression automatically. Page Speed will probably still complain that images aren’t compressed enough. Luckily, Photon has a way to fix that.

There are a couple of parameters, quality and strip, that will further reduce the file size of JPEG images. Quality is pretty straight forward. The strip parameter will let you strip EXIF and color data. I use a snippet like this to set the quality to 80% on my site.

https://gist.github.com/joshbetz/9399b50dc4d47e774bd4

The results can be pretty dramatic. At full size, this image of downtown Madison goes from 16MB to 2.7MB by setting the quality to 80%. That’s a big deal on a mobile connection and it’s pretty hard to spot the difference on most images unless you’re looking at them side by side.

Madison

One Year at Automattic

It’s been a year plus one day since I graduated from the University of Wisconsin, which means I’ve been working at Automattic for exactly a year.

In the past year, I’ve seen whales in Boston, palm trees in Santa Cruz, dolphins in Key West, and a sun rise in Napa.

View this post on Instagram

Whale!

A post shared by Josh Betz (@jshbz) on

View this post on Instagram

Good morning Santa Cruz

A post shared by Josh Betz (@jshbz) on

View this post on Instagram

Stayed up for the sunrise. #wpvip

A post shared by Josh Betz (@jshbz) on

I’ve also had the chance to work with the best people, coolest clients, and interesting projects I could have hoped for in my first year out of school. I interned at Automattic the summer before my final year of school. I wanted to work here for a long time before that, but after meeting the people I would work with, I didn’t want to work anywhere else. I’m proud of the work we do, contributing to the WordPress community, our open source code, but it’s the people that really make it a great company.

I’ve been able to do all of this without moving away from my friends, family, and the best place in the world that’s buried by snow for 5 months every year.

View this post on Instagram

#onwisconsin

A post shared by Josh Betz (@jshbz) on

View this post on Instagram

At least the snow looks nice.

A post shared by Josh Betz (@jshbz) on

This weekend at the 2014 UW graduation, I was reminded how I felt a year ago — proud of what I had accomplished and where I was going. We get better at what we do every day and I’m just as proud to work here today as I was then.

#uwgrad

Posted by Josh Betz on Monday, May 19, 2014

Also, we’re hiring. 🙂

WordCamp Milwaukee 2013 Slides

On June 8, 2013 I gave a talk about automating your WordPress development Workflow at WordCamp Milwaukee. Here are the slides along with some notes.

https://speakerdeck.com/joshbetz/automate-all-the-things

As I’ve mentioned before, computers are really good at carrying out repetitive tasks and we should use that to our advantage to make development easier.

We all know you should develop locally and on the latest development version of WordPress. But in order to checkout the trunk, you have to use SVN — unless you know about git-svn. You can checkout the entire core repository with:

git svn clone -s http://core.svn.wordpress.org

That’s going to take a while because it is going to download all the WordPress history from forever, but once it’s done it’s just another simple git-svn command to keep it up to date:

git svn rebase

You can also put git repositories inside other git repositories. This is useful so you can use a different repository for each plugin and theme. Just initialize a new repository in the themes or plugins directory and it will work just like normal — even if you’re already using git to manage your development version of WordPress.

Automate your Environment

If you like using a virtual machine for your local development environment, vagrant has to be the way to go. A simple configuration file defines the virtual machine and vagrant, which relies on VirtualBox, takes care of setting it up. Pair that with provisioning software like Puppet or Chef and you’ve got a development environment that you can get up and running with a single command.

Vagrant depends on a Vagrantfile that defines the virtual machine. It’s important to note that the folder that the Vagrantfile lives in will be shared using VirtualBox’s built in sharing on the guest to /vagrant.

Puppet scripts define how a server should look. So, “has apache, php, and mysql installed” is an example. The idempotence of puppet is really cool. You can run a puppet script over and over again and it won’t break anything. If you say a server should have nginx and it already has nginx, puppet will just skip that step. The weird part about puppet, and the thing I found most confusing at first, was that stuff doesn’t run in order. You have to clearly specify dependences because the software is going to optimize your script as much as possible. This gets easier to follow once you work with it for a while.

My current setup is at https://github.com/joshbetz/WCMKE–2013-Vagrant-Puppet

When everything is configured, simply run vagrant up and vagrant will start the virtual machine. Other commands are vagrant ssh, vagrant halt, vagrant destroy, and vagrant provision. SSH is straight forward. Halt shuts down the machine. Destroy removes the machine from VirtualBox. Provision runs the provisioner, whether that’s Puppet, Chef, or a shell script.

This is really powerful. Like your dotfiles, you can put this in the cloud and download the configuration when you need it. And simply run vagrant up to start the machine.

The other nice thing is that our development environments can be open source now. Working together on this stuff is a powerful thing, but I don’t think I have to convince the WordPress community of the power of open source.

Automate your Development

Since we are talking about WordPress development, I’m also going to mention some ways you can make the software work for you while you’re writing code.

  1. Developer Plugin – http://wordpress.org/extend/plugins/developer/
    It’s maintained by Automattic and basically just helps you set up your local development environment the right way. From checking your wp-config.php for the proper constants, to verifying that you’re on the latest development release of WordPress, to recommending awesome plugins that make WordPress development run more smoothly.

  2. Theme Unit Test – http://codex.wordpress.org/Theme_Unit_Test
    This one is maintained by the WordPress Theme Review Team. Even though it’s called the Theme Unit Test, this is useful for all developers. It’s a collection of a huge range of content for you to import into WordPress. Even if you’re developing the next hit plugin, you’re going to need content to test on. There are posts of every post format and any kind of example you could think of.

  3. Underscores.me – http://underscores.me
    Something else that is maintained by Automattic. While it’s technically a theme, I think of it more as a collection of awesome snippets that you can drop into your theme. That’s not to say you couldn’t download this and style it the way you want and be done. It would probably be great as a theme by itself, but there’s so much awesome in here that I just like to scan through the code once in a while and see what new magic I find.

So, to wrap this up, make your computer work for you. Automate everything that can be automated unless you know you’ll never need to do it again. When you do something twice, that’s a good sign that you’ll probably do it again. 🙂