Search & Replace PHP Serialized Strings

Or how we improved the WordPress database migration speed by 40x with Go.

At VIP we move WordPress databases around many times per day and, as you can probably imagine, our customers tend to have a lot of data. When we move data between environments, one of the bottlenecks has historically been search and replace of the domain names. The two main use cases are importing data containing development domain names and moving production data to a hosted development environment.

The best option in most situations is WP-CLI’s search-replace command. WP-CLI is fairly ubiquitous at this point and it’s easy to use. The problem that we tend to have is that it’s not quite fast enough on really huge datasets.

You may be wondering how a simple search and replace could be the bottleneck. WordPress stores lots of PHP serialized strings in the database. Since the length of each string is encoded in the serialized format, you can’t simply search and replace domain names in the database unless they happen to be the same length.

php > echo serialize( "google.com" );
s:10:"google.com";
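To make the problem concrete, here is a minimal Go sketch of what goes wrong with a plain replacement. The function name naiveReplace and the replacement domain example.org are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveReplace (hypothetical name) swaps one domain for another without
// touching the s:<length>: prefix that PHP writes in front of the string.
func naiveReplace(serialized, from, to string) string {
	return strings.ReplaceAll(serialized, from, to)
}

func main() {
	in := `s:10:"google.com";`
	out := naiveReplace(in, "google.com", "example.org")
	// The prefix still claims 10 bytes, but "example.org" is 11 bytes long,
	// so PHP's unserialize() would reject the result.
	fmt.Println(out) // prints s:10:"example.org";
}
```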

WP-CLI deals with this by pulling the objects out of the database, unserializing them, running a recursive search and replace on the resulting object, reserializing them, and then updating the database. While this only takes a few seconds on most WordPress sites, it can take many minutes or even hours on sites with millions of post objects.

Since our migration process is based on .sql files, we thought it might be faster to run the search and replace outside WordPress as long as we could reliably fix the encoded lengths. So, I wrote Automattic/go-search-replace to do that.

There are two main things that happen:

1. The search and replace. We simply replace one domain with another.
2. Fix encoded string lengths as necessary.

It turns out that dealing with the string lengths is not as hard as we originally thought. Modifying nested objects is not a concern because we’re not changing the number of nested items. We only need to look at the string length immediately preceding the string we’re replacing.
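The idea can be sketched in Go roughly as follows. This is not the code from Automattic/go-search-replace; it is a simplified illustration that assumes the serialized data is unescaped (a real .sql dump escapes the quotes) and that a string value never contains the sequence `";`. The names replaceAndFix and serializedString are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// serializedString matches a PHP serialized string such as s:10:"google.com";
// Assumption: the value itself never contains the sequence `";`.
var serializedString = regexp.MustCompile(`s:(\d+):"(.*?)";`)

// replaceAndFix (hypothetical name) replaces from with to, then rewrites the
// length prefix of every serialized string that now contains the new domain.
func replaceAndFix(line, from, to string) string {
	if !strings.Contains(line, from) {
		return line // nothing to do, leave the line untouched
	}
	replaced := strings.ReplaceAll(line, from, to)
	return serializedString.ReplaceAllStringFunc(replaced, func(m string) string {
		parts := serializedString.FindStringSubmatch(m)
		value := parts[2]
		if !strings.Contains(value, to) {
			return m // this string was not changed, keep its prefix
		}
		// PHP counts bytes, and so does len() on a Go string.
		return "s:" + strconv.Itoa(len(value)) + `:"` + value + `";`
	})
}

func main() {
	in := `a:1:{s:3:"url";s:17:"http://google.com";}`
	fmt.Println(replaceAndFix(in, "google.com", "example.org"))
	// prints a:1:{s:3:"url";s:18:"http://example.org";}
}
```

Note that the array header a:1 and the neighbouring s:3:"url" entry are left alone: only the length directly in front of the modified string changes.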

Another problem that could have been hard to deal with is maintaining the integrity of the .sql file. It would be easy to replace the characters that are used by the MySQL importer to delineate records, like ),\n(. Our solution is to restrict the strings being searched for and replaced roughly to the characters that make up valid domain names, so a replacement can never touch those delimiters.
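A guard of this kind might look like the following Go sketch. The exact character set the real tool accepts is not spelled out here, so the pattern below (letters, digits, dots and hyphens) is an assumption, and isDomainLike is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// domainChars is an assumed whitelist: letters, digits, dots and hyphens.
// The real tool's rule is only described as "roughly" domain characters.
var domainChars = regexp.MustCompile(`^[A-Za-z0-9.-]+$`)

// isDomainLike (hypothetical name) reports whether s is safe to use as a
// search or replacement string under the assumed whitelist.
func isDomainLike(s string) bool {
	return domainChars.MatchString(s)
}

func main() {
	for _, arg := range []string{"example.com", "dev-site.example.org", "),\n(", "a'b"} {
		fmt.Printf("%q allowed: %v\n", arg, isDomainLike(arg))
	}
}
```

Rejecting anything outside the whitelist up front means the record delimiters and quote characters of the dump can never appear in either argument.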

Using this new command line tool, we were able to improve the search and replace process by about 25x. I wondered if we could make it even faster using concurrency.

One challenge is that every line in the resulting file must keep its original position, even though lines are processed concurrently and may finish out of order. For that, we use a buffered channel of channels. For each line, we create a new string channel, put it on the outer channel synchronously, and then write the processed line to it asynchronously. Effectively, the inner channel is a placeholder that reserves the line's position in the output until its result is ready. Here’s that code on GitHub.
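For readers who want to see the pattern without opening the repository, here is a small self-contained Go sketch of a channel of channels. It is an illustration of the technique, not the code linked above; processLines, the buffer size of 4 and the sample data are all made up.

```go
package main

import (
	"fmt"
	"strings"
)

// processLines (hypothetical name) applies transform to every line
// concurrently while returning the results in the original order.
func processLines(lines []string, transform func(string) string) []string {
	// The outer channel holds one placeholder per line, in input order.
	pending := make(chan chan string, 4)

	go func() {
		for _, line := range lines {
			result := make(chan string, 1)
			pending <- result // reserve the slot synchronously
			go func(l string) {
				result <- transform(l) // fill the slot asynchronously
			}(line)
		}
		close(pending)
	}()

	var out []string
	for result := range pending {
		// Blocks until this particular line is done, which keeps the order.
		out = append(out, <-result)
	}
	return out
}

func main() {
	lines := []string{"one google.com", "two", "three google.com"}
	replace := func(l string) string {
		return strings.ReplaceAll(l, "google.com", "example.org")
	}
	for _, l := range processLines(lines, replace) {
		fmt.Println(l) // printed in input order
	}
}
```

The buffer size of the outer channel caps how far the workers can run ahead of the writer, so memory use stays bounded even on very large files.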

We’ve been using this in production for over a year now. In some tests we’ve seen search and replace performance improve by up to 40x. Since we use STDIN and STDOUT, we can use go-search-replace as part of an import pipeline. In some cases, the search and replace runs at line rate and has no effect on the total import time.