Search & Replace PHP Serialized Strings

Or how we improved the WordPress database migration speed by 40x with Go.

At VIP we move WordPress databases around many times per day and as you can probably imagine, our customers tend to have of a lot of data. When we move data between environments, one of the bottlenecks has historically been search and replace of the domain names. The two main use cases are importing data containing development domain names and moving production data to a hosted development environment.

The best option in most situations is WP-CLI’s search-replace command. WP-CLI is fairly ubiquitous at this point and it’s easy to use. The problem that we tend to have is that it’s not quite fast enough on really huge datasets.

You may be wondering how a simple search and replace could be the bottleneck. WordPress stores lots of PHP serialized strings in the database. Since object lengths are encoded in serialized strings, you can’t simply search and replace domain names in the database unless they happen to be the same length.

php > echo serialize( "google.com" );
s:10:"google.com";

WP-CLI deals with this by pulling the objects out of the database, unserialzing them, running recursive search and replace on the resulting object, reserializing them, and then updating the database. While this only takes a few seconds on most WordPress sites, it can take many minutes or even hours on sites with millions of post objects.

Since our migration process is based on .sql files, we thought it might be faster to run the search and replace outside WordPress as long as we could reliably fix the encoded lengths. So, I wrote Automattic/go-search-replace to do that.

There are two main things that happen:

  1. The search and replace. We simply replace one domain with another.
  2. Fix encoded string lengths as necessary.

It turns out that dealing with the string lengths is not as hard as we originally thought. Modifying nested objects is not a concern because we’re not changing the number of nested items. We only need to look at the string length immediately preceding the string we’re replacing.

Another problem that could have been hard to deal with is maintaining the integrity of the .sql file. It would be easy to replace the characters that are used by the MySQL importer to delineate records, like ),\n(. Our solution is to limit the search domain roughly to the characters that make up valid domain names.

Using this new command line tool, we were able to improve the search and replace process by about 25x. I wondered if we could make it even faster using concurrency.

One challenge is that we need to ensure every line in the resulting file is in the same place. For that, we use a buffered channel of channels. For each line, we write a new string channel to the channel and asynchronously write to that channel. Effectively, we put a placeholder on the channel synchronously, and write to it asynchronously. Here’s that code on GitHub.

We’ve been using this in production for over a year now. In some tests we’ve seen the search and replace performance improved by up to 40x. Since we use STDIN and STDOUT, we can use go-search-replace as part of an import pipeline. In some cases, the search and replace runs at line rate and has no effect on the total import time.

Wire

Wire is an RSS reader for iOS that displays articles with their native formatting — or an optimized mobile version for sites that support AMP — and doesn’t require yet another account to sign up for.

As mentioned previously, it had been quite some time since I wrote any iOS apps, so I decided to use some of my free time this summer to build an app that I wanted to use.

There are other RSS readers. It’s not exactly new territory. But like most apps, the good ones all seem to require you to sign up for another account. I don’t know about you, but I find the number of accounts I already have to worry about somewhat overwhelming. I don’t need anymore. Relying on an iCloud account won’t solve this problem for everyone, but in this case it solves it for me.

The other main feature I wanted in an RSS reader was the ability to disable the monotonous E-reader-like view that has become so common. So many of the websites I read look great on mobile.

As you can see, it’s a pretty standard design. There’s a simple list of articles that you can group however you want. The article view loads a selected article in a web view. I really prefer this real view of the site to the reader-ized version other apps use.

I have found Wire to be quite good for what I want. It doesn’t do much more than aggregate articles from the websites you want to keep up with, but I think it does that well.

Core Data & Concurrency

It’s been a while since I last worked on any mobile apps, so I thought the last couple weeks of sabbatical would be a good time to get caught up on iOS and specifically to learn Swift, which didn’t exist last time ?. For the most part, it was pretty easy to pick up and I was able to move fairly quickly on some apps I had been thinking about.

One of the only problems I ran into was related to Core Data concurrency. Specifically, if two different threads are reading and writing data, you get errors like 'NSGenericException', reason: Collection <__NSArrayM: 0x7fabb400> was mutated while being enumerated.

The solution is private queue contexts. I won’t do a full explanation of queue contexts here, but the Apple Developer Documentation has some good information. The idea is to create a private context to operate on while we’re doing background work.

For this to work as expected there are a few things that need to happen:

  1. Set context.parent. In order for the changes to eventually be written to disk, we have to associate the new, private context with the main context by setting newContext.parent = oldContext.
  2. Use context.object(with: objectID) to make core data relationships. We need to get a reference in the current context to any objects that were created outside the context.
  3. After saving, we also need to save the parent context to commit the changes to disk. To make it thread safe, we use oldContext.perform or oldContext.performAndWait depending on whether is should be asynchronous or not.

I put together a gist to demonstrate:

https://gist.github.com/joshbetz/01d86cbfb2e04fd30df7ac92e8c7b4c5

The following articles were especially helpful to understand how to fix this problem in my case:

  1. Apple’s Core Data Programming Guide > Concurrency
  2. Core Data Concurrency & Maintaining a Silky Smooth UI

Automate git bisect

I occasionally use git bisect to figure out where I’ve introduced test failures. If you’re not familiar, given known good and bad commits, along with a test command, git bisect does a binary search over your repo to determine where a bug or test failure was introduced. This is especially useful on large repos with a test suite that takes a minute or two (or more ?) to run.

Most of the time, I just need to check the last handful of commits. To that end, I’ve written a bash function that assumes the current HEAD is bad, 10 commits prior is good, and automatically runs a test command that you define.

gbi() {
	git bisect start
	git bisect bad
	git checkout HEAD~10
	git bisect good
	git bisect run "$@"
}

This won’t cover every case, but should help automate this fairly verbose process most of the time.

Go Configure

I’ve been dabbling in Golang on and off for a little while now. Recently I was looking for a package to read and parse configuration files with the following requirements in mind:

  1. Be fast.
  2. Use environment variables and JSON files (in that order).
  3. Support hot reloading.

I looked at some of the popular configuration packages, but didn’t see anything that I loved. Many of them don’t meet the requirements I had in mind or are more complex than I’d like.

So, I decided to write config — I know, it’s a great name. It’s pretty simple. We cache values in memory to minimize disk IO. On SIGHUP (or any time you call config.Reload()), we clear the cache, so you can update the configuration without restarting your app.

Usage looks something like this:

https://gist.github.com/joshbetz/bcd1e0a862b7118060a56ea379781833

There are more examples and details on the API in the README on Github.

If you have questions or ideas, issues and pull requests are welcome on Github. Let me know if you use it!