2023/04/21 • 3 min read

How Node Exporter helps to monitor our on-premises systems

Maintaining monolithic, on-premises applications, some built years ago, can be challenging for many teams. In an ideal world, applications would produce consistent logs, metrics, and traces that teams can rely on to monitor health and performance. In practice, some of our systems need additional monitoring tools to make them observable, so that we can confirm healthy, consistent behaviour and react quickly to unexpected performance before it affects the business.

What is Node Exporter

Node Exporter is an open-source Prometheus exporter that collects and exposes hardware and OS metrics such as CPU, system load, RAM, network traffic, and disk usage. It runs on the system where the application is hosted and acts as a layer between that system and the Prometheus server, which scrapes the metrics it exposes.
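To make the idea of "exposing metrics" more concrete, here is a minimal sketch of reading the plain-text format that Prometheus exporters serve (Node Exporter listens on port 9100 by default, under /metrics). The sample text, its values, and the parse_metrics helper are illustrative assumptions, not output from our systems. The parser is deliberately simplified: it ignores escaped quotes inside label values and any optional timestamps.

```python
import re

# Hypothetical excerpt of what a Node Exporter /metrics endpoint returns.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 1.234e+06
node_network_receive_bytes_total{device="lo"} 2048
"""

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


metrics = parse_metrics(SAMPLE)
for name, labels, value in metrics:
    print(name, labels, value)
```

In a real setup this parsing is done by the Prometheus server itself on every scrape; the sketch only shows what kind of data the exporter makes available.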

Node exporter metrics can then be used for several purposes, like visual dashboarding, real-time monitoring, and/or alerting.

Figure 1: Node Exporter metric example in Prometheus Graphical User Interface.

 

Node Exporter metrics on Grafana

At La Redoute, we use Grafana with a Prometheus data source to visualize node exporter metrics. There is no need to create dashboards from scratch: it is easy to find Grafana templates that anyone can use and adapt.

At La Redoute, we started with two templates.

1. A Basic template, displayed on our TV screens, that is used to monitor the health and performance of the VMs of our applications:

Figure 2: Representation of a Basic Grafana dashboard with node exporter metrics

 

2. An Advanced template, used for deeper analysis when an issue is raised about the performance or health of an application's system.

Figure 3: Representation of an Advanced Grafana dashboard with node exporter metrics

 

Closely monitor our applications

Node Exporter was introduced as one of the monitoring tools for the VMs of our on-premises business applications.

A specific past incident made us realize that, had we been monitoring node exporter metrics, the issue could have been prevented or its business impact mitigated.

Node exporter metrics form patterns over time that let us determine whether something is going wrong in the system. If a metric diverges from its usual pattern, it is a signal that system health is degrading. Combined with real-time alerting, this gives us time to analyze, find, and fix the issue before any business impact, such as performance degradation or an outage, is felt.

Figure 4 shows the behavior of the Network Traffic metrics (node_network_receive_bytes_total, node_network_transmit_bytes_total) from the deployment that introduced a defect on 17/09 until its resolution on 13/10. The defect went unnoticed until the business started to complain about performance issues almost 3 weeks later. It was adding inbound traffic to the system and slowly consuming all the system resources, up to the point where the business felt the impact. An outage followed, which forced the infra teams to add more resources to the VM until the issue was fixed.

Figure 4: Representation of Network Traffic node exporter metrics in Grafana
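The network metrics named above are counters: they only ever increase, so a traffic graph is built from how fast they grow rather than from their raw value. The sketch below shows one way to turn a series of counter readings into a bytes-per-second rate, treating a drop in the value as a counter reset (for example after a reboot). It is similar in spirit to PromQL's rate() function but without its extrapolation. The scrape values and the per_second_rate helper are made up for illustration.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")

    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went down, so it was reset to zero in between:
            # count everything accumulated since the reset.
            increase += current
    return increase / elapsed


# Hypothetical scrapes of node_network_receive_bytes_total, 15 seconds apart.
# The drop between t=30 and t=45 simulates a counter reset.
scrapes = [
    (0, 1_000_000),
    (15, 1_150_000),
    (30, 1_300_000),
    (45, 150_000),
    (60, 300_000),
]

rate = per_second_rate(scrapes)
print(f"{rate:.0f} bytes/s received ({rate * 8:.0f} bits/s)")
```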

 

The System Load was also affected by the defect. Figure 5 clearly shows a slow increase of the System Load (node_loadn) from the deployment until the hotfix. This increase was not properly monitored and, in the end, led to business impact.

Figure 5: Representation of System Load node exporter metric in Grafana
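A slow increase like this one is hard to spot by eye on a day-to-day basis, but it shows up clearly as a positive trend over a longer window. As a sketch, and assuming "node_loadn" stands for the load averages Node Exporter exposes (node_load1, node_load5, node_load15), the snippet below fits a least-squares line through one load value per day and flags the series when the slope exceeds a threshold. The daily values and the threshold are hypothetical, not the ones from the incident.

```python
def slope_per_day(values):
    """Least-squares slope of a series with one value per day."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Hypothetical daily averages of a load metric after a deployment.
daily_load = [1.0, 1.25, 1.5, 1.75, 2.0, 2.25, 2.5]

# Assumed threshold: alert if load grows by more than 0.1 per day.
TREND_THRESHOLD = 0.1

trend = slope_per_day(daily_load)
is_rising = trend > TREND_THRESHOLD
print(f"trend: {trend:+.2f} per day, rising: {is_rising}")
```

A trend check of this kind complements fixed thresholds: the load may stay below an absolute limit for weeks while still climbing steadily.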

 

A debriefing on what could have been done to prevent or mitigate the business impact made it clear that, from the introduction of the defect until the business felt the impact, the Network Traffic and System Load metrics were unstable and slowly increasing (shown in red in both figures).

This outage led us to identify the most important metrics exposed by node exporter for this specific system, find their patterns, monitor them, and finally implement real-time alerts for when instability is detected in a metric's value.

As a first step towards monitoring and alerting, we selected the two simple metrics that we consider the most valuable for this specific case.
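One simple way to express "instability relative to a known pattern" is to compare the current value with a baseline taken from a healthy period. The sketch below flags a value that is more than a chosen number of standard deviations away from the baseline mean. The baseline values, the tolerance, and the is_unstable helper are assumptions for illustration; in a Prometheus setup, a rule like this would normally live in the alerting configuration rather than in application code.

```python
from statistics import mean, pstdev


def is_unstable(baseline, current, tolerance=3.0):
    """True if `current` deviates from the baseline pattern.

    baseline:  values observed while the system was healthy.
    tolerance: allowed distance from the mean, in standard deviations
               (an assumed default, to be tuned per metric).
    """
    centre = mean(baseline)
    spread = pstdev(baseline)
    return abs(current - centre) > tolerance * spread


# Hypothetical load values from a healthy week.
healthy_load = [1.0, 1.2, 0.8, 1.1, 0.9]

print(is_unstable(healthy_load, 1.3))  # within the usual range
print(is_unstable(healthy_load, 2.5))  # far outside the usual range
```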

  • System Load – A measure of CPU demand. It represents the average number of processes that are being executed or waiting to be executed over a recent time window (1, 5, or 15 minutes).
  • Network Traffic – Helps to calculate network bandwidth usage from the bytes transmitted or received by the system.

By implementing alerting for these two metrics, we reduced critical-impact events to almost zero.

Figure 6 shows an example of the system load values exposed by node exporter, where an event that affected the system is visible on 20/12. Because an automated alert was raised, the support teams were able to quickly understand what happened and fix it before the system load increased to the point of impacting business applications.

Figure 6: Representation of System Load node exporter metric in Grafana

 

In conclusion, node exporter has become an essential tool in La Redoute’s portfolio for monitoring and alerting. Since its implementation, it has helped us monitor the performance and health of our legacy systems. It has enabled the engineering and support teams to track deployments, maintenance, and unplanned events, and to avoid or mitigate unexpected issues that were previously not getting attention. Node exporter allows our teams to act proactively on potential issues and, when an issue cannot be solved before it happens, to react quickly, diagnose it, and resolve it rapidly.

 
