Updated 7/13/22: Updated to reflect some changes to the gist which simplify the graceful shutdown logic.
At VIP, we run a highly available Node service that powers much of our platform. One challenge we see teams face is the question of how to scale a highly available API.
That’s a broad problem to solve, but let’s assume we already have adequate test coverage and everything in front of the API taken care of for us. We only care about things we can change about the Node app itself.
Our typical answer looks something like this:
- Use Node’s cluster module to fully take advantage of multiple CPUs
- Gracefully reload worker processes for deploys and uncaught exceptions
Node Cluster
Node’s cluster module uses child_process.fork() to create a new process, where communication between the main process and the worker happens over a unix socket. The net module’s server.listen() function hands off most of the work to the main process, allowing child processes to act like they’re all listening on the same port.
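To make that concrete, here’s a minimal, illustrative sketch of the fork-and-message mechanism on its own (separate from the server we’ll build below):

const cluster = require( 'cluster' )

if ( cluster.isMaster ) {
	// The main process forks a worker and exchanges messages with it over IPC
	const worker = cluster.fork()

	worker.on( 'message', message => {
		console.log( 'Main process received: %s', message )
		worker.disconnect() // Close the channel so both processes can exit
	} )

	worker.send( 'ping' )
} else {
	// The worker answers over the same channel
	process.on( 'message', message => process.send( message + ' pong' ) )
}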
HTTP Server Example
Let’s take a simple HTTP server as an example. Here we have a server that listens on port 3000 by default and returns Hello World!. It also throws an uncaught exception 0.001% of the time to simulate a bug we haven’t accounted for.
/**
 * External dependencies
 */
const { createServer } = require( 'http' )

module.exports = createServer( ( req, res ) => {
	if ( Math.random() > 0.99999 ) {
		// Randomly throws an uncaught error 0.001% of the time
		throw Error( '0.001% error' )
	}

	res.end( 'Hello World!\n' )
} ).listen( process.env.port || 3000 )
Obviously a real server would be much more complex, but this toy version is adequate for our purposes. We could run it with node server.js and we’d have an HTTP server up and listening on port 3000.
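As a quick sanity check (illustrative only; this assumes the server is running locally on the default port), we can fetch a response from a second terminal with a couple lines of Node:

// Illustrative only: print the response body from the running server
require( 'http' ).get( 'http://localhost:3000', res => res.pipe( process.stdout ) )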
The first thing we’ll do is use Node’s cluster module to start one copy of the server per CPU, which will automatically load balance between them.
#!/usr/bin/env node

/**
 * External dependencies
 */
const cluster = require( 'cluster' )

const WORKERS = process.env.WORKERS || require( 'os' ).cpus().length

if ( cluster.isMaster ) {
	for ( let i = 0; i < WORKERS; i++ ) {
		cluster.fork()
	}

	cluster.on( 'listening', ( worker, address ) => {
		console.log( 'Worker %d (pid %d) listening on http://%s:%d',
			worker.id,
			worker.process.pid,
			address.address || '127.0.0.1',
			address.port
		)
	} )
} else {
	const server = require( './server' )
}
This will start one copy of the server for each CPU in our system. The operating system will take care of scheduling these processes across the CPUs.
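A note on how connections are balanced: on every platform except Windows, the cluster module defaults to round-robin scheduling (SCHED_RR), where the main process accepts incoming connections and distributes them across workers; on Windows, distribution is left to the operating system. If you’d rather let the OS handle it everywhere, you can opt out before forking, as in this small sketch:

// Must be set before the first cluster.fork() call;
// setting NODE_CLUSTER_SCHED_POLICY=none in the environment has the same effect
cluster.schedulingPolicy = cluster.SCHED_NONE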
Graceful Reload
Now that we have multiple processes, we can gracefully reload these in case of errors and for deploys.
Errors
In case of errors, we terminate the worker process and spawn a new one. This is important because an uncaught exception means the process is now in an inconsistent state. In other words, an exception occurred that was not accounted for and we’re not sure what side effects that will have.
First, we’ll ensure that worker processes are restarted if any exit unexpectedly. In the isMaster branch:
cluster.on( 'exit', ( worker, code, signal ) => {
	if ( ! worker.exitedAfterDisconnect ) {
		console.log( 'Worker %d (pid %d) died with code %d and signal %s, restarting', worker.id, worker.process.pid, code, signal )
		cluster.fork()
	}
} )
Here worker.exitedAfterDisconnect would be true if we call worker.disconnect(), but false if the worker itself calls process.exit(). That becomes important in this next step, where we automatically terminate the worker process in the case of an uncaught exception.
process.on( 'uncaughtException', error => {
	console.log( error.stack )

	// Exit immediately, no need to wait for graceful shutdown
	process.exit( 1 )
} )
We terminate the process with process.exit( 1 ). Since there was some kind of uncaught error, we just want to terminate the worker and spawn a new one. There is no need to wait for graceful shutdown in this case.
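To see this error handling in action, we can keep requesting the server until the simulated bug fires. Here’s a rough, illustrative load loop (the name hammer and the approach are ours, not part of the example project): when a worker dies mid-request, the client sees a dropped connection, and the main process’s exit handler forks a replacement.

/**
 * Illustrative only: issue requests back-to-back until one fails,
 * which happens when a worker hits the simulated 0.001% error.
 */
const http = require( 'http' )

const hammer = () =>
	http.get( 'http://localhost:3000', res => {
		res.resume()
		res.on( 'end', hammer )
	} ).on( 'error', error => console.log( 'Request failed:', error.code ) )

hammer()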
Deploys
For deploys, we gracefully reload all the worker processes one at a time to avoid any downtime in the process.
In the worker, we watch for the disconnect event. When it fires, we call server.close() to stop accepting new connections, then terminate the process once all active connections have closed.
const server = require( './server' )

process.on( 'disconnect', () => {
	server.close( () => process.exit( 0 ) )
} )
Upon SIGHUP (which you can send to the main process with kill -HUP <pid>), we create one new worker for each active worker, then gracefully shut down the old worker once its replacement is ready to accept connections.
process.on( 'SIGHUP', () => {
	console.log( 'Caught SIGHUP, reloading workers' )

	for ( const id in cluster.workers ) {
		cluster.fork().on( 'listening', () => {
			gracefulShutdown( cluster.workers[ id ] )
		} )
	}
} )
Gracefully shutting down a worker involves a few steps.
First, we call worker.disconnect(), which triggers the disconnect event in the worker. As mentioned before, once all the connections are closed, the worker process will terminate itself. Since idle keep-alive connections can hold the server open indefinitely, we also want to ensure the worker stops within a reasonable timeframe, so we force it to close with worker.kill() after a timeout (5 seconds by default).
const SHUTDOWN_TIMEOUT = process.env.SHUTDOWN_TIMEOUT || 5000

const gracefulShutdown = worker => {
	const shutdown = setTimeout( () => {
		// Force shutdown after timeout
		worker.kill()
	}, SHUTDOWN_TIMEOUT )

	worker.once( 'exit', () => clearTimeout( shutdown ) )

	worker.disconnect()
}
Note that because a gracefully shut down worker exits after worker.disconnect() (or worker.kill()) was called, its exitedAfterDisconnect flag is true, so the exit handler above won’t spawn an unwanted replacement.
Upon SIGINT or ^C, we’ll perform a similar graceful shutdown routine. The only difference is that we don’t need to restart each worker this time.
process.on( 'SIGINT', () => {
	console.log( 'Caught SIGINT, initiating graceful shutdown' )

	for ( const id in cluster.workers ) {
		gracefulShutdown( cluster.workers[ id ] )
	}
} )
To prevent the initial SIGINT from immediately terminating the workers (the terminal sends the signal to the entire foreground process group, so each worker receives it too), we handle it separately in the worker. The first SIGINT is ignored, but if you press ^C or otherwise send SIGINT twice, all workers exit immediately, bypassing the graceful shutdown.
process.on( 'SIGINT', () => {
	// Ignore first SIGINT from parent
	process.on( 'SIGINT', () => {
		process.exit( 1 )
	} )
} )
I hope this was helpful. You can see the full example on GitHub.