Updated 7/13/22: Updated to reflect some changes to the gist which simplify the graceful shutdown logic.

At VIP, we run a highly available Node service that powers much of our platform. One challenge we see teams face is the question of how to scale a highly available API.

That’s a broad problem to solve, but let’s assume we already have adequate test coverage and everything in front of the API taken care of for us. We only care about things we can change about the Node app itself.

Our typical answer looks something like this:

  1. Use Node’s cluster module to fully take advantage of multiple CPUs
  2. Gracefully reload worker processes for deploys and uncaught exceptions

Node Cluster

Node’s cluster module uses child_process.fork() to create each worker process. The main process and its workers communicate over an IPC channel, which is a Unix domain socket on Unix-like systems.

The net module’s server.listen() function hands off most of the work to the main process, allowing child processes to act like they’re all listening on the same port.

HTTP Server Example

Let’s take a simple http server as an example. Here we have a server that listens on port 3000 by default and returns Hello World!. It also throws an uncaught exception 0.001% of the time to simulate a bug we haven’t accounted for.

/**
 * External dependencies
 */
const { createServer } = require( 'http' )

module.exports = createServer( ( req, res ) => {
	if ( Math.random() > 0.99999 ) {
		// Randomly throws an uncaught error 0.001% of the time
		throw Error( '0.001% error' )
	}

	res.end( 'Hello World!\n' )
} ).listen( process.env.port || 3000 )

A real server would be much more complex, but this toy example is enough for our purposes. We could run it with node server.js and we’d have a single HTTP server process listening on port 3000.
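Because the failure is random, it is hard to trigger on demand. The sketch below is one hypothetical variation, not part of the original gist: makeHandler and fakeResponse are names invented here, and the random source is passed in so both branches of the handler can be exercised deterministically.

```javascript
// Hypothetical factory: `random` is injected so the rare branch can be forced
const makeHandler = ( random = Math.random ) => ( req, res ) => {
	if ( random() > 0.99999 ) {
		throw new Error( '0.001% error' )
	}

	res.end( 'Hello World!\n' )
}

// Minimal stand-in for http.ServerResponse that records what was sent
const fakeResponse = () => ( {
	body: null,
	end( chunk ) {
		this.body = chunk
	},
} )

// Normal path: any value at or below 0.99999 produces the greeting
const ok = fakeResponse()
makeHandler( () => 0.5 )( {}, ok )

// Failure path: a value above the threshold throws
let thrown = null
try {
	makeHandler( () => 0.999999 )( {}, fakeResponse() )
} catch ( error ) {
	thrown = error
}

console.log( JSON.stringify( ok.body ), thrown.message )

// The real server could then be created much as before:
// require( 'http' ).createServer( makeHandler() ).listen( process.env.port || 3000 )
```

With the default argument the handler behaves like the one in server.js, so nothing changes for production traffic.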

The first thing we’ll do is use Node’s cluster module to start one copy of the server per CPU. The cluster module then distributes incoming connections between the workers for us.

#!/usr/bin/env node

/**
 * External dependencies
 */
const cluster = require( 'cluster' )

const WORKERS = process.env.WORKERS || require( 'os' ).cpus().length

if ( cluster.isMaster ) {
	for ( let i = 0; i < WORKERS; i++ ) {
		cluster.fork()
	}

	cluster.on( 'listening', ( worker, address ) => {
		console.log( 'Worker %d (pid %d) listening on http://%s:%d',
			worker.id,
			worker.process.pid,
			address.address || '127.0.0.1',
			address.port
		)
	} );
} else {
	const server = require( './server' )
}

This will start one copy of the server for each CPU in our system. The operating system will take care of scheduling these processes across the CPUs.
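One detail worth noting is that process.env.WORKERS is always a string. The loop above still works for values like '4' because the comparison coerces it to a number, but a typo such as 'abc' would silently start zero workers. A minimal sketch of a stricter parser, assuming a hypothetical helper named resolveWorkerCount that is not in the original gist:

```javascript
const os = require( 'os' )

// Hypothetical helper: turn the WORKERS setting into a positive integer,
// falling back to the CPU count when it is missing or invalid
const resolveWorkerCount = ( value, cpuCount = Math.max( 1, os.cpus().length ) ) => {
	const parsed = Number.parseInt( value, 10 )
	return Number.isInteger( parsed ) && parsed > 0 ? parsed : cpuCount
}

// Prints the number of workers that would be started on this machine
console.log( resolveWorkerCount( process.env.WORKERS ) )
```

The second parameter exists only to make the fallback easy to check, for example resolveWorkerCount( 'abc', 8 ) falls back to 8.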

Graceful Reload

Now that we have multiple processes, we can gracefully reload these in case of errors and for deploys.

Errors

In case of errors, we terminate the worker process and spawn a new one. This is important because an uncaught exception means the process is now in an inconsistent state. In other words, an exception occurred that was not accounted for and we’re not sure what side effects that will have.

First, we’ll ensure that worker processes are restarted if any exit unexpectedly. In the isMaster branch:

cluster.on( 'exit', ( worker, code, signal ) => {
	if ( ! worker.exitedAfterDisconnect ) {
		console.log( 'Worker %d (pid %d) died with code %d and signal %s, restarting', worker.id, worker.process.pid, code, signal )
		cluster.fork()
	}
} )

Here worker.exitedAfterDisconnect would be true if we call worker.disconnect(), but false if the worker itself calls process.exit(). That becomes important in this next step, where we automatically terminate the worker process in the case of an uncaught exception.

process.on( 'uncaughtException', error => {
	// Log the stack trace so the crash can be investigated later
	console.log( error.stack )

	// Exit immediately, no need to wait for graceful shutdown
	process.exit( 1 )
} )

We terminate the process with process.exit( 1 ). Since there was some kind of uncaught error, we just want to terminate the worker and spawn a new one. There is no need to wait for graceful shutdown in this case.
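To see how the exit handler and exitedAfterDisconnect work together, here is a small simulation. It is only a sketch: fakeCluster is a plain EventEmitter standing in for the cluster module, and the worker objects are hand-built with just the one property the handler reads.

```javascript
const { EventEmitter } = require( 'events' )

// Stand-in for the cluster module: only the 'exit' event and fork() are modelled
const fakeCluster = new EventEmitter()
let restarts = 0
fakeCluster.fork = () => {
	restarts++
}

// Same decision as the real handler: restart only on unplanned exits
fakeCluster.on( 'exit', worker => {
	if ( ! worker.exitedAfterDisconnect ) {
		fakeCluster.fork()
	}
} )

// A worker that crashed and called process.exit( 1 ) itself
fakeCluster.emit( 'exit', { id: 1, exitedAfterDisconnect: false }, 1, null )

// A worker we disconnected on purpose, for example during a deploy
fakeCluster.emit( 'exit', { id: 2, exitedAfterDisconnect: true }, 0, null )

console.log( restarts ) // 1
```

Only the crashed worker is replaced, which is what keeps a deliberate shutdown from being undone by the restart logic.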

Deploys

For deploys, we gracefully reload all the worker processes one at a time to avoid any downtime in the process.

In the worker, we listen for the disconnect event. The handler calls server.close() to stop accepting new connections and terminates the process once all active connections have closed.

const server = require( './server' )
process.on( 'disconnect', () => {
	server.close( () => process.exit( 0 ) );
} );

Upon SIGHUP, we create one new worker for each active worker and gracefully shut down the old worker once the new one is ready to accept connections.

process.on( 'SIGHUP', () => {
	console.log( 'Caught SIGHUP, reloading workers' )

	// Snapshot the current workers so the replacements forked below are not iterated
	for ( const worker of Object.values( cluster.workers ) ) {
		cluster.fork().once( 'listening', () => {
			gracefulShutdown( worker )
		} )
	}
} )
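The ordering matters here: an old worker should only be retired after its replacement is listening. The sketch below models that ordering without forking real processes. Everything in it (fakeCluster, reload, the placeholder gracefulShutdown that just records ids) is a hypothetical stand-in for illustration.

```javascript
const { EventEmitter } = require( 'events' )

// Stand-in for the cluster module; only `workers` and `fork()` are modelled
const fakeCluster = { workers: {}, nextId: 1 }
fakeCluster.fork = () => {
	const worker = new EventEmitter()
	worker.id = fakeCluster.nextId++
	fakeCluster.workers[ worker.id ] = worker
	return worker
}

// Placeholder for the real gracefulShutdown: just record who was retired
const retired = []
const gracefulShutdown = worker => retired.push( worker.id )

const reload = () => {
	const replacements = []
	// Snapshot the current workers so freshly forked ones are not iterated
	for ( const old of Object.values( fakeCluster.workers ) ) {
		const fresh = fakeCluster.fork()
		fresh.once( 'listening', () => gracefulShutdown( old ) )
		replacements.push( fresh )
	}
	return replacements
}

fakeCluster.fork() // worker 1
fakeCluster.fork() // worker 2
const [ first, second ] = reload() // forks workers 3 and 4

const retiredBeforeListening = retired.length // nothing retired yet
first.emit( 'listening' ) // worker 3 is ready, so worker 1 is retired
const retiredAfterFirst = [ ...retired ]
second.emit( 'listening' ) // worker 4 is ready, so worker 2 is retired

console.log( retiredBeforeListening, retiredAfterFirst, retired )
```

At no point in the simulation is an old worker retired before its replacement has reported that it is listening, which is the property that avoids downtime during a deploy.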

Gracefully shutting down a worker takes two steps.

First, we call worker.disconnect(), which triggers the disconnect event in the worker. As mentioned before, when all the connections are closed, the worker process will terminate itself. Second, since we want to ensure this worker is stopped within a reasonable timeframe, we force it to close with worker.kill() if it is still running after SHUTDOWN_TIMEOUT milliseconds (5 seconds by default).

const SHUTDOWN_TIMEOUT = process.env.SHUTDOWN_TIMEOUT || 5000
const gracefulShutdown = worker => {
	const shutdown = setTimeout( () => {
		// Force shutdown after timeout
		worker.kill();
	}, SHUTDOWN_TIMEOUT );

	worker.once( 'exit', () => clearTimeout( shutdown ) );
	worker.disconnect();
}
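The two outcomes of that routine can be sketched with fake workers. This is an illustration under stated assumptions: fakeWorker is a hypothetical stand-in built on EventEmitter, and gracefulShutdown takes the timeout as a second argument here only so the demo can use a short one.

```javascript
const { EventEmitter } = require( 'events' )

// Same logic as above, with the timeout passed in for the demo
const gracefulShutdown = ( worker, timeout ) => {
	const shutdown = setTimeout( () => worker.kill(), timeout )
	worker.once( 'exit', () => clearTimeout( shutdown ) )
	worker.disconnect()
}

// Hypothetical stand-in for a cluster worker. A cooperative worker exits as
// soon as it is disconnected; a stuck one never exits on its own.
const fakeWorker = isCooperative => {
	const worker = new EventEmitter()
	worker.killed = false
	worker.exited = false
	worker.on( 'exit', () => {
		worker.exited = true
	} )
	worker.disconnect = () => {
		if ( isCooperative ) {
			worker.emit( 'exit' )
		}
	}
	worker.kill = () => {
		worker.killed = true
		worker.emit( 'exit' )
	}
	return worker
}

const cooperative = fakeWorker( true )
const stuck = fakeWorker( false )

gracefulShutdown( cooperative, 50 ) // exits at once, so the timer is cleared
gracefulShutdown( stuck, 50 ) // force-killed once the 50 ms timeout fires

setTimeout( () => {
	console.log( 'cooperative killed:', cooperative.killed ) // false
	console.log( 'stuck killed:', stuck.killed ) // true
}, 100 )
```

Clearing the timer on exit is what lets a well-behaved worker shut down without the forced kill ever running.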

Upon SIGINT or ^C, we’ll perform a similar graceful shutdown routine. The only difference is that we don’t need to restart each worker this time.

process.on( 'SIGINT', () => {
	console.log( 'Caught SIGINT, initiating graceful shutdown' )

	for ( const id in cluster.workers ) {
		gracefulShutdown( cluster.workers[ id ] )
	}
} )

Pressing ^C in a terminal sends SIGINT to the whole foreground process group, so the workers receive it too. To keep that initial SIGINT from immediately terminating them, we handle the signal separately in the workers. The first one is ignored, but if you press ^C or otherwise send SIGINT twice, all worker processes exit immediately, bypassing the graceful shutdown.

process.on( 'SIGINT', () => {
	// Ignore first SIGINT from parent

	process.on( 'SIGINT', () => {
		process.exit( 1 )
	} )
} )
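The nested handler can look odd at first, so here is the same pattern run against stand-ins. In this sketch fakeProcess is a plain EventEmitter and exit only records the code, both hypothetical substitutes for process and process.exit so that nothing is actually terminated.

```javascript
const { EventEmitter } = require( 'events' )

// Stand-ins for `process` and `process.exit`
const fakeProcess = new EventEmitter()
const exitCodes = []
const exit = code => exitCodes.push( code )

fakeProcess.on( 'SIGINT', () => {
	// First SIGINT: do nothing except arm the handler for the next one
	fakeProcess.on( 'SIGINT', () => exit( 1 ) )
} )

fakeProcess.emit( 'SIGINT' )
const afterFirst = exitCodes.length // 0, the first signal is ignored

fakeProcess.emit( 'SIGINT' )
const afterSecond = exitCodes.length // 1, the second signal exits

console.log( afterFirst, afterSecond ) // 0 1
```

The inner handler is only registered while the first signal is being handled, so it cannot run until a second signal arrives.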

I hope this was helpful. You can see the full example on GitHub.