How to achieve zero downtime deployment with nodejs

nodejs servers, like any other server software, can go down at any time. In this article, we will explore early incident detection techniques, with a special focus on preventing service disruption at deployment time.

This blog post is a follow-up to “Easy nodejs deployment”, an article that covers deployment on traditional servers, and “Deploying nodejs applications”, an article that covers deployment in native and non-native cloud environments. Future iterations of this post may also cover leveraging containers (Docker and Kubernetes) to achieve zero deployment downtime.

In this article, we will talk about the genesis of downtimes, monitoring, recovering from failure, and rollback strategies.

Even though this blog post was designed to offer complementary materials to those who bought my Testing nodejs Applications book, the content can help any software developer level up their knowledge. You may use this link to buy the book.

nodejs configuration and deployment

A series of battle-hardening operations takes place before the code hits the production environment and lands in the hands of customers.

While those steps may be of interest to the reader, we cannot cover everything in this piece. They have been covered in the following blog posts:

Now that we have an idea of how deployment and configuration work, let's identify the sources of downtime.

Genesis of downtimes

Downtime occurs for a variety of reasons. We will revisit intentional and unintentional downtimes, as well as resilient, highly available systems.

Intentional downtime results from an upgrade or update that requires a system shutdown. Traditionally, a new deployment requires shutting down the system that is about to be replaced so that the new system can boot, especially when the two systems use the same port.
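
One common way to soften an intentional shutdown is to drain in-flight requests before the old process exits, so a process manager can boot the replacement with minimal disruption. Here is a minimal sketch, assuming a plain http server and an arbitrary 10-second drain timeout:

```js
// Minimal sketch: drain in-flight requests before exiting on SIGTERM.
const http = require('http');

const server = http.createServer((req, res) => {
  res.end('hello\n');
});

server.listen(process.env.PORT || 3000);

process.on('SIGTERM', () => {
  // Stop accepting new connections; requests already in flight are allowed to finish.
  server.close(() => process.exit(0));
  // Safety net: force-exit if draining takes too long (the 10s threshold is arbitrary).
  setTimeout(() => process.exit(1), 10000).unref();
});
```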

Unintentional downtime is unexpected, resulting either from fatal unhandled errors or from crashes caused by running out of resources (disk/CPU/RAM). In either case, in a multi-process system, downtime only happens when all faulty server processes die and are never replaced with new, healthy processes.
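
One common pattern for fatal unhandled errors is to fail fast: log the error and exit deliberately, so that a supervisor (cluster primary, systemd, pm2, and the like) can replace the process rather than leaving it alive in an undefined state. A minimal sketch:

```js
// Sketch: fail fast on fatal errors and let a supervisor replace the process.
process.on('uncaughtException', (err) => {
  console.error('Fatal uncaught exception:', err);
  process.exit(1); // exit so the process manager can spawn a healthy replacement
});

process.on('unhandledRejection', (reason) => {
  console.error('Unhandled promise rejection:', reason);
  process.exit(1);
});
```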

Resilient systems are designed so that when some faulty processes crash, a self-healing mechanism boots up new, healthier processes. This can be achieved by leveraging the cluster API, a combination of reverse proxying/load balancing to healthier instances, or, more recently, containerization.
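
Here is a minimal sketch of that self-healing idea using the cluster API: the primary process forks one worker per CPU and re-forks a worker whenever one dies.

```js
// Sketch: self-healing with the cluster API — re-fork workers as they die.
const cluster = require('cluster');
const os = require('os');
const http = require('http');

if (cluster.isPrimary || cluster.isMaster) {
  os.cpus().forEach(() => cluster.fork());

  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.process.pid} died (${signal || code}), spawning a new one`);
    cluster.fork(); // replace the dead worker with a healthy one
  });
} else {
  http.createServer((req, res) => res.end('ok\n')).listen(3000);
}
```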

Monitoring

Monitoring, custom alerts, and notification systems

Monitoring overall system health makes it possible to take immediate action as soon as something unusual happens. Key metrics to look at are CPU usage, memory availability, disk capacity, read/write failure rates, particular errors/failures, and software error rates.

Monitoring systems make it easy to detect, identify, and eventually repair or recover from a failure in a reasonable time. When monitoring production applications, the aim is to respond quickly to incidents. Sometimes incident resolution can even be automated: for instance, a notification system can trigger a prescribed script that remediates the problem. These kinds of systems are also known as self-healing systems.
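
As a rough illustration of such a self-healing setup, the sketch below periodically samples the process memory footprint and triggers a remediation command when a threshold is crossed. The 500MB threshold and the ./scripts/remediate.sh path are placeholders, not real project files:

```js
// Sketch: a naive watchdog that samples process memory and triggers a
// remediation command when a threshold is crossed. The threshold and the
// './scripts/remediate.sh' command are placeholders for illustration only.
const { exec } = require('child_process');

const MEMORY_LIMIT_BYTES = 500 * 1024 * 1024; // arbitrary 500MB threshold

setInterval(() => {
  const { rss } = process.memoryUsage();
  if (rss > MEMORY_LIMIT_BYTES) {
    console.error(`rss ${rss} exceeds limit, triggering remediation`);
    exec('./scripts/remediate.sh', (err, stdout, stderr) => {
      if (err) console.error('remediation failed:', stderr || err);
    });
  }
}, 30 * 1000);
```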

For more on installing and customizing reporting and monitoring tools, please read these articles: “How to install reporting tools”, “How to monitor deployed applications for reporting and quick response time”, “Notifications via email or custom scripts”, “Logging for issue discovery and traceability”, and “Monitoring nodejs applications”.

It is a good idea to use a monitoring tool hosted outside the application environment (or server). This strategy pays off when the downtime affects a whole rack of servers, or even an entire data center. However, monitoring tools deployed on the same server have the advantage of better taking the pulse of the environment the application is deployed on. A winning strategy is to deploy both solutions, so that notifications can still go out even when an entire data center experiences downtime.
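
A common way to let an external monitor take the pulse of a deployed instance is to expose a lightweight health endpoint it can poll. The /health path and the fields reported below are illustrative assumptions, not a prescribed contract:

```js
// Sketch: a lightweight health endpoint an external monitor can poll.
// The '/health' path and the reported fields are assumptions for illustration.
const http = require('http');

http.createServer((req, res) => {
  if (req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify({
      status: 'ok',
      uptime: process.uptime(),          // seconds this process has been alive
      memory: process.memoryUsage().rss, // resident set size in bytes
    }));
  }
  res.end('hello\n');
}).listen(3000);
```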

Other tools worth exploring cover uptime monitoring and monitoring dashboards.

Recovering from failure

Recovering from a failure is a broad term. One of the best ways to recover from a failure is to have no failure at all. Alternatively, it is possible to spread the risk of failure across multiple layers and design a recovery mechanism around each individual layer.

To elaborate on the point above: a database server crash creates a domino effect that causes the application server to crash as well. When the database server is in fact a cluster of servers, it becomes unlikely that all database servers in the pool crash at the same time. The risk of a database server crash is spread across multiple database servers, and is therefore minimal, consequently reducing the chances of bringing down the entire service at once. The remaining vulnerability of this approach is tied to the region, or data center, the database servers are hosted in.
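
As a rough sketch of a "recovery mechanism around each individual layer", the helper below retries a flaky dependency with exponential backoff instead of letting the crash cascade to the application server. The connectToDatabase call in the usage comment is a placeholder for whatever driver is actually in use:

```js
// Sketch: retry a flaky dependency (e.g. a database connection) with
// exponential backoff instead of letting the crash cascade to the app server.
// `connectToDatabase` is a placeholder for the driver call actually in use.
async function withRetry(operation, retries = 5, delayMs = 1000) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt === retries) throw err; // give up after the last attempt
      console.warn(`attempt ${attempt} failed, retrying in ${delayMs}ms:`, err.message);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // exponential backoff
    }
  }
}

// usage: withRetry(() => connectToDatabase(process.env.DB_URL));
```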

Another example, in the nodejs world, would be to spawn multiple server processes instead of running a single-process server. That way, when one server process dies, the remaining processes take over while waiting for the crashed process to recover, or for the system to spin up a new server process. This is made easy with the cluster API, or by leveraging an nginx load balancer to redirect traffic to processes that are still alive, as sketched below.
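
On the nginx side, one way to wire this up is an upstream block that spreads traffic across several node processes; by default, nginx temporarily stops sending requests to an upstream server that fails to respond. The ports and server_name below are illustrative assumptions:

```nginx
# Sketch: nginx load-balancing traffic across multiple node processes.
# Ports and server_name are illustrative; adjust to the actual deployment.
upstream node_app {
  server 127.0.0.1:3000;
  server 127.0.0.1:3001;
  server 127.0.0.1:3002;
}

server {
  listen 80;
  server_name example.com;

  location / {
    proxy_pass http://node_app;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
  }
}
```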

Related reading: “How to leverage the cluster API to avoid deployment time downtimes”, “How to configure nginx for resiliency” (which discusses how to restart nginx whenever it dies for some reason), “How to configure nginx to serve nodejs applications”, and “How to leverage streams to avoid deployment downtimes”.

Rollback Strategy

How to roll back a deployment from a version that fails to a version that works

The best strategy to roll back a failing deployment is obviously not to have a failing deployment in the first place. When that happens anyway, the backup plan is to catch those failures as early as possible and flip the switch back to the systems that work. This kind of approach was made popular by the blue/green deployment strategy.

It is also possible to run a canary version, where a less mature product gets a taste of the harsh environment it will be running in, on a limited number of customers. The cohort enrolled in the canary build should be tolerant of the shortcomings of the canary product they are using. It is quite reasonable to call these customers beta testers as well.

Another alternative is to release and deploy versioned products. This is especially the case for static assets (SPAs) and API endpoints, to name a few. In case a version is broken, the client can simply switch back to a version that works while waiting for a patch that fixes the broken version.
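
As a small sketch of that idea, the server below keeps an older API version mounted alongside a newer one, so clients can pin the version that works while a broken release is being patched. The paths and payloads are illustrative only:

```js
// Sketch: keep old API versions mounted so clients can pin a version that
// works while a broken release is being patched. Paths are illustrative.
const http = require('http');

const handlers = {
  '/v1/status': (res) => res.end(JSON.stringify({ version: 1, status: 'stable' })),
  '/v2/status': (res) => res.end(JSON.stringify({ version: 2, status: 'canary' })),
};

http.createServer((req, res) => {
  const handler = handlers[req.url];
  res.writeHead(handler ? 200 : 404, { 'Content-Type': 'application/json' });
  if (handler) return handler(res);
  res.end(JSON.stringify({ error: 'not found' }));
}).listen(3000);
```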

Conclusion

In this article, we revisited how to achieve zero downtime by leveraging tools that either already exist in the nodejs runtime or are available free of charge as open-source software. There are additional complementary materials in the “Testing nodejs applications” book.

References