By Jeff Dickey
When running a Node application in production, you need to keep stability, performance, security, and maintainability in mind. In this article, I’ll outline what I think are the best practices for putting Node.js into production.
By the end of this guide, the setup will include 3 servers: a load balancer (lb) and 2 app servers (app1 and app2). The load balancer will health-check the app servers and balance traffic between them. The app servers will use a combination of systemd and the Node cluster module to load balance and route traffic across multiple Node processes on each server. Deploys will be a one-line command from the developer’s laptop and cause zero downtime or failed requests.
It will look roughly like this:
[Architecture diagram: a load balancer routing traffic to two app servers, each running clustered Node processes. Photo credit: Digital Ocean]
How this article is written
This article is targeted at those with beginning operations experience. You should, however, be at least basically familiar with what a process is, what Upstart/systemd/init are, and what process signals are. To get the most out of it, I suggest you follow along with your own servers (but still using my demo Node app for parity). Beyond that, there are some useful configuration settings and scripts here that should make a good reference for running Node.js in production.
The final app is hosted here: https://github.com/dickeyxxx/node-sample.
For this guide I will be using Digital Ocean and Fedora. However, it’s written as generically as possible so there should be value here no matter what stack you’re on.
I will be working off of vanilla Digital Ocean Fedora 20 servers. I’ve tested this guide a few times, so you should be able to follow along with each step without a problem.
Why Fedora?
All Linux distros (aside from Gentoo) are moving to systemd from various other init systems. Because Ubuntu (probably the most popular flavor in the world) hasn’t yet moved over to systemd (they’ve announced they will), I felt that it would be inappropriate to teach Upstart here.
systemd offers some significant advantages over Upstart including advanced, centralized logging support, simpler configuration, speed, and way more features.
Install Node.js
The first thing you’ll need to do on the server is set it up to run Node. On Digital Ocean, I was able to get it down to just these 4 commands:
yum update -y
yum install -y git nodejs npm
npm install -g n
n stable
This installs Node from yum (which might install an old version), then the awesome n package to install/switch Node versions. Finally, n installs the latest stable build of Node. From here, run: # node --version
and you should see it reporting the latest Node version. Later we’ll see how to automate this step with Ansible.
Create a Web User
Because it is insecure to run your application as root, we will create a web user for our system. To create this user (-m gives it a home directory, -r makes it a system account, and -U creates a matching web group), run: # useradd -mrU web
Adding the application
Now that we’ve added Node and our user, we can move on to creating our Node app.
- Create a folder for the app:
# mkdir /var/www
- Set the owner to web:
# chown web /var/www
- Set the group to web:
# chgrp web /var/www
- cd into it:
# cd /var/www/
- Switch to the web user:
$ su web
- Clone the sample hello world app repo:
$ git clone https://github.com/dickeyxxx/node-hello-world
The sample code consists of a very simple Node.js app:
var http = require('http');
var PORT = process.env.PORT || 3000;
http.createServer(function (req, res) {
console.log('%d request received', process.pid);
res.writeHead(200, {'Content-Type': 'text/plain'});
res.end('Hello world!\n');
}).listen(PORT);
console.log('%d listening on %d', process.pid, PORT);
Run the app using: $ node app.js
You should be able to go to the server’s IP address on port 3000 in your browser and see the app up and running.
Note: you may need to run # iptables -F
to flush iptables, as well as # firewall-cmd --permanent --zone=public --add-port=3000/tcp
to open the firewall port (since that is a permanent rule, you may also need # firewall-cmd --reload for it to take effect).
Another thing to note is that this runs on port 3000. Making it run on port 80 would be possible using a reverse proxy (such as nginx), but for this setup we will actually run the app servers on port 3000 and the load balancer (on a different server) will run on port 80.
systemd
Now that we have a way to run the server, we need to add it to systemd to ensure it will stay running in case of a crash.
Here’s a systemd script we can use:
[Service]
WorkingDirectory=/var/www/node-hello-world
ExecStart=/usr/bin/node app.js
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=node-hello-world
User=web
Group=web
Environment='NODE_ENV=production'
[Install]
WantedBy=multi-user.target
- Copy this file as root to /etc/systemd/system/node-sample.service
- Enable it:
# systemctl enable node-sample
- Start it:
# systemctl start node-sample
- See status:
# systemctl status node-sample
- See logs:
# journalctl -u node-sample
You can try killing the Node process by its PID and watch systemd start it back up!
Clustering processes
Now that we can keep a single process running, we need to use Node’s built-in cluster module, which will automatically load balance traffic to multiple processes. Here’s a script you can use to host a Node.js app in a cluster; a minimal sketch of it is below. Simply run that file next to app.js: $ node boot.js
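The author’s actual boot.js is linked later in this article (see the comments); what follows is my own minimal reconstruction of such a master script, so the worker count, the './app.js' path, and the restart details should be read as assumptions rather than the author’s exact code:

// boot.js -- a minimal cluster master sketch (illustrative reconstruction)
var cluster = require('cluster');
var WORKERS = 2;

if (cluster.isMaster) {
  for (var i = 0; i < WORKERS; i++) cluster.fork();

  // replace any worker that dies unexpectedly
  cluster.on('exit', function (worker) {
    if (!worker.suicide) {
      console.log('worker %d died, forking a new one', worker.process.pid);
      cluster.fork();
    }
  });

  // zero-downtime restart on SIGHUP: replace workers one at a time
  process.on('SIGHUP', function () {
    var old = Object.keys(cluster.workers).map(function (id) {
      return cluster.workers[id];
    });
    (function replace(i) {
      if (i >= old.length) return;
      var fresh = cluster.fork();
      fresh.once('listening', function () {
        old[i].once('exit', function () { replace(i + 1); });
        old[i].disconnect(); // finish in-flight requests, then exit
      });
    })(0);
  });
} else {
  require('./app.js'); // each worker runs the HTTP server
}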
This script will run 2 instances of the app, restarting each one if it dies. It will also allow you to perform a zero-downtime restart by sending SIGHUP. Try that now by making a change to the response in app.js. You can see the server update by running: $ kill -HUP [pid]
It will gracefully restart one process at a time. You’ll need to update the systemd configuration if you want it to boot the clustered version of your app as opposed to the single-instance one. Also, if you add an ExecReload=/bin/kill -HUP $MAINPID
line to your systemd config, you can run # systemctl reload node-sample
to do a zero-downtime restart! Here’s an example of the Node cluster systemd config:
[Service]
WorkingDirectory=/var/www
ExecStart=/usr/bin/node boot.js
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=node-sample
User=web
Group=web
Environment='NODE_ENV=production'
[Install]
WantedBy=multi-user.target
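One systemd detail worth remembering: after you copy in or edit a unit file, systemd has to re-read it before the change takes effect:
# systemctl daemon-reload
# systemctl restart node-sample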
Load Balancing
In production you’ll need at least 2 servers in case one goes down; I would not deploy a real system to just a single box. Keep in mind that boxes don’t only go down because they break; perhaps you want to take one down for maintenance. A load balancer can perform health checks on the boxes, and if one has a problem, it will remove it from the rotation.
First, set up another Node.js app server using all of the previous steps. Next, create a new Fedora box in Digital Ocean (or wherever) and SSH into it.
Install HAProxy: # yum install haproxy
Then change /etc/haproxy/haproxy.cfg to the following (replacing the server IPs with your app server IPs):
defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  http-server-close
    option  forwardfor
    option  redispatch
    retries 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s

frontend main *:80
    stats enable
    stats uri /haproxy?stats
    stats auth myusername:mypass
    default_backend app

backend app
    balance roundrobin
    server app1 107.170.145.120:3000 check
    server app2 192.241.205.146:3000 check
Now restart HAProxy: # systemctl restart haproxy
You should see the app running on port 80 of the load balancer, and you can go to /haproxy?stats to see the HAProxy stats page (credentials: myusername/mypass, as set by the stats auth line above). Note that the check keyword defaults to a plain TCP connect check; adding option httpchk to the backend makes it an HTTP-level health check. For more information on setting up HAProxy, check out this guide I used, or the official docs.
Deploying Your Code with Ansible
Now most production guides would stop here, but I don’t think the setup is complete without a deploy! Without a deploy script, updating our code isn’t a terrible process. It would look something like this:
- SSH into app1
- cd into the app: cd /var/www/node-hello-world
- Pull the latest code: git pull
- Restart the app: systemctl reload node-sample
The major downside is that we have to do this on each server, making it a bit laborious. Using Ansible, we can push our code out from the dev machine and properly reload the app. Ansible tends to scare people; I think they assume it’s similar to complicated tools like Chef and Puppet, but it’s a lot closer to Fabric or Capistrano. It basically just SSHes into boxes and runs commands. There are no clients, no master server, and no complicated cookbooks, just commands. It does have features that make it great at provisioning too, but you can use it just to deploy code if you wish. Here are the Ansible files needed if you’d like to deploy code like this:
deploy.yml:
---
- hosts: app
  tasks:
    - name: update repo
      git: repo=https://github.com/dickeyxxx/node-hello-world version=master dest=/var/www/node-hello-world
      sudo: yes
      sudo_user: web
      notify:
        - reload node-sample
  handlers:
    - name: reload node-sample
      service: name=node-sample state=reloaded

production (the inventory file):
[app]
192.241.205.146
107.170.233.117
Run it with the following from your dev machine (make sure you’ve installed Ansible): $ ansible-playbook -i production deploy.yml
That production file is called an inventory file in Ansible. It simply lays out the hostnames of all the servers and their roles.
The yml file here is called a playbook. It defines the tasks to run. In this case, it gets the latest code from GitHub. If there are changes, it calls the ‘notify’ task that will reload the app server. If there are no changes, that handler does not get called. If you wanted to also, say, install npm packages, you could do that here as well. Make sure you use npm shrinkwrap if you don’t check your packages into the repo, by the way.
Note that if you want to pull down a private git repo, you’ll need to set up SSH Agent Forwarding.
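One way to wire that up is in ansible.cfg; treat this as an assumption to verify against your Ansible version’s docs:

# ansible.cfg (next to your playbooks)
[ssh_connection]
ssh_args = -o ForwardAgent=yes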
Ansible for Provisioning
Ideally, we’d automate building the app servers so we don’t have to go through these steps every time. For that, we can use the following Ansible playbook to provision the app servers the way we just did manually:
app.yml:
---
- hosts: app
  tasks:
    - name: Install yum packages
      yum: name={{item}} state=latest
      with_items:
        - git
        - vim
        - nodejs
        - npm
    - name: install n (node version installer/switcher)
      npm: name=n state=present global=yes
    - name: install the latest stable version of node
      shell: n stable
    - name: Create web user
      user: name=web
    - name: Create project folder
      file: path=/var/www group=web owner=web mode=755 state=directory
    - name: Add systemd conf
      template: src=systemd.service.j2 dest=/etc/systemd/system/node-sample.service
      notify:
        - enable node-sample
  handlers:
    - name: enable node-sample
      shell: systemctl enable node-sample
systemd.service.j2 (the template referenced above):

[Service]
WorkingDirectory={{project_root}}
ExecStart=/usr/bin/node boot.js
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier={{project_name}}
User=web
Group=web
Environment='NODE_ENV=production'
[Install]
WantedBy=multi-user.target
Run it using: $ ansible-playbook -i [inventory file] app.yml
The load balancer can be provisioned with a similar playbook.
Final app
Here’s a GitHub project with the final result of all these steps. As noted there, updating the inventory file and running the provision and deploy steps should build out a full app automatically.
Staging?
Making other environments is easy. Simply add a new inventory file (patterned after ansible/production) for staging and start referencing it when calling ansible-playbook.
Testing
Test your setup! If for no other reason than that it’s really fun to try to find ways to knock your cluster offline. Use Siege in load-test mode. Try sending kill -9 to various processes. Knock a server offline. Send random signals to things. Run out of disk space. Just find things you can do to mess with your cluster and ensure that the availability percentage doesn’t drop. A tiny availability checker is sketched below.
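For example, here’s a small availability checker you could run while you break things (a hypothetical helper, not part of the sample repo; the host argument and request count are placeholders):

// smoke.js -- fire sequential requests at the load balancer and report
// the success rate. Usage: node smoke.js <load-balancer-ip>
var http = require('http');
var HOST = process.argv[2] || '127.0.0.1';
var TOTAL = 100;
var done = 0;
var ok = 0;

function next() {
  if (++done < TOTAL) return hit();
  console.log('availability: %d%%', (ok / TOTAL) * 100);
}

function hit() {
  http.get({ host: HOST, port: 80, path: '/' }, function (res) {
    if (res.statusCode === 200) ok++;
    res.resume(); // drain the body so the socket is freed
    next();
  }).on('error', next);
}

hit();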
Improvements to be made
No production cluster is perfect, and this is no exception. I would feel pretty comfortable rolling this into production, but if I wanted to harden it further, here’s what I would do:
HAProxy Failover
Right now HAProxy, while stable, is a single point of failure. We could change that with DNS failover. DNS failover is not instantaneous and would result in a few seconds of downtime while DNS propagates. I’m not really concerned about HAProxy failing, but I am concerned about human error when changing the LB config.
Rolling deploys
In case a deploy goes out that breaks the cluster, I would set up a rolling deploy in Ansible to roll changes out slowly, health checking along the way. Ansible’s serial keyword is the standard starting point for this; a sketch follows below.
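A sketch of what that could look like, building on the deploy.yml from earlier (the wait_for task is a stand-in for a real health check):

---
# rolling-deploy.yml -- illustrative; reload is an unconditional task here
# rather than a handler, so the health check runs against the new code
- hosts: app
  serial: 1                # update one app server at a time
  max_fail_percentage: 0   # abort the rollout if any host fails
  tasks:
    - name: update repo
      git: repo=https://github.com/dickeyxxx/node-hello-world version=master dest=/var/www/node-hello-world
    - name: reload node-sample
      service: name=node-sample state=reloaded
    - name: wait for the app to answer before moving to the next host
      wait_for: port=3000 timeout=30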
Dynamic inventory
I think others would rate this higher than I do. In this setup you have to commit the hostnames of the servers into the source code. You can configure Ansible to use dynamic inventory to query the hosts from Digital Ocean (or another provider). You could also use this to create new boxes. Really though, creating a server in Digital Ocean isn’t the most difficult thing.
Centralized Logging
JSON logging is really the way to go, since you can easily aggregate and search through the data; I would take a look at Bunyan for this (a quick sketch follows below).
It’d be nice if the logs for all of this were drained to one queryable location, perhaps using something like Loggly, but there are lots of ways to do this.
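For a feel of what Bunyan output looks like, here is a minimal sketch (assuming npm install bunyan; the logger name and fields are illustrative):

var bunyan = require('bunyan');

// one JSON object per line on stdout -- easy to ship and query
var log = bunyan.createLogger({
  name: 'node-hello-world',
  serializers: bunyan.stdSerializers // sensible output for err/req/res fields
});

log.info({ pid: process.pid, port: 3000 }, 'listening');
log.error({ err: new Error('boom') }, 'request failed');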
Error Reporting and Monitoring
Again, there are lots of solutions for error reporting and logging. There are none that I’ve tried on Node that I have really liked though, so I’m hesitant to suggest anything. Please post in the comments if there’s a solution to either that you’re a fan of.
For more tips, check out the awesome Joyent guide on running Node.js in production.
There you have it! This should make for a simple, stable Node.js cluster. Let me know if you have any tips on how to enhance it!
This article was originally published at https://blog.carbonfive.com/2014/06/02/node-js-in-production/
Comments
Just wondering where a tool like PM2 (https://github.com/unitech/pm2) comes into play here – it seems to take some of the manual work out of setting up clustering, monitoring, etc., replacing at least node-cluster. How reliable/mature is it compared to the manual approach? I’m running an app in a staging environment and it seems stable so far.
The boot.js script and systemd script replace PM2. I think PM2 is way too much magic, and it’s pretty bulky for a Node package.
Thank you for the good guide. If you’re currently stuck on a system without systemd, or want to run in Docker, then supervisord can be a good alternative. However, one thing I can’t seem to find with this setup is how dead workers are replaced. I think it’s recommended to let a process die when you get an uncaught exception; that’s the default behavior anyway.
It’s sort of buried, but dead workers are handled here: https://gist.github.com/dickeyxxx/0f535be1ada0ea964cae#file-boot-js-L67
Error handling inside the process, however, was out of scope for this article. Joyent has a great guide here, though: https://www.joyent.com/developers/node/design/errors
Man, this was a great post. Thanks for the info.
I’m wondering how you would handle deploying code while the process is still running tasks. Wouldn’t systemctl reload node-sample kill the current processes and cause you to lose anything that was in-flight? By in-flight I mean a task run by the Node process that a user is waiting on.
No. boot.js cleanly closes the socket when it receives a SIGTERM signal, allowing current connections to complete.
Thanks for the article Jeff – awesome write up.
How does this compare with running your Node application on AWS using Elastic Beanstalk? What are the pros and cons of either approach?
Can’t say unfortunately, I’ve never used elastic beanstalk. Try it and write about it! I’d love to know!
Instead of doing DNS failover, you could go with HAproxy and keepalived.
Agreed. That’s probably the most bullet proof setup, although keepalived would be trickier to configure
Jeff,
Nice article – thanks for the great information. I’m relatively new to Node and have been looking for good information about production deployment of Node applications.
I had two quick questions about your solution. First, in the clustered solution, how does each app.js worker process know about which port it should listen on for requests, or can each worker listen on the same port? The app.js script looks like it defaults to 3000 unless set in the environment, and I don’t see anyplace where a different environment value is being provided. When requests for app.js to handle come in from outside, do they somehow get automatically routed to one of the workers? I would have assumed you would need some sort of reverse proxy in front of app.js even with the cluster to route the requests to one of the available workers.
Second, in the systemd script for the clustered solution, it looks like boot.js is being executed by systemd from /var/www. boot.js is being instructed to launch app.js in each worker, and app.js is located in /var/www/node-hello-world, so does the cluster module know to look in subdirectories for worker scripts to launch, or how does that work?
Thanks again for the great article.
Rich
For your first question, that’s all the HAProxy box: it listens on port 80 and forwards to port 3000 on the app boxes. Within a box, Node’s cluster module handles it: the workers can all listen on the same port because the master shares the listening socket with them. There’s no reverse proxy in front of the app, though; there are merits to using a reverse proxy with Node and merits to skipping it. It basically comes down to whether or not you want to buffer your requests (I prefer not to; the thing I like about Node is being very close to the networking layer). If you had relatively large requests, buffering might be good, but that’s atypical for Node apps.
The second part I think might be a mistake, I’ll look into it.
Nice article. If I want to install, for example, MongoDB, where should it be installed?
Hi, when testing the boot.js file on my local machine, all requests are handled by a single process; specifically, the last process to start handles all of them. How would I get the other process to handle some requests?