IOTA Fabio Consul Automatic LoadBalancer

Introduction

I am very excited to write about the following subject. At my job (Jexia) I was introduced to Fabio and Consul, and I quickly learned to appreciate the power of combining the two.

After seeing several community members set up load balancers in different ways, e.g. pure DNS round-robin or nginx, I realized what Fabio and Consul can offer the IOTA network.

In this blog I would like to present a Proof-of-Concept which can easily be tweaked and used as a fully automated, truly scalable, production-ready solution for load balancing IOTA full-nodes. The load balancing is meant to help clients (humans or, better yet, machines) seamlessly find a node’s API port to connect to (default 14265) and process their data (make a transaction, send data onto the tangle, etc.).

Update:

I am going to present this Proof-of-Concept at the IOTA NRW meetup in Cologne on the 27th of February 2018. Stay tuned!

Use Cases

There are two main use cases that come to mind when discussing this fully automated, highly available load-balancer setup: the first is load-balancing connections from light-wallets to full-nodes; the second is carrier-grade infrastructure for an IoT backbone.

A fully automated, auto-scaling and self-healing infrastructure can serve tens of thousands of clients, be they real people or an entire fleet of sensors sending data to the tangle.

By using a load balancer, we can add thousands of full-nodes to our pool. For every new client’s connection to the load balancer’s address, the client will be routed to one of the full-nodes in our pool.

The client (wallet, sensor, vehicle etc) doesn’t have to maintain a list of full-nodes or verify a full-node’s health status. The client uses a single Fully Qualified Domain Name which resolves to the IP of the nearest load-balancer, and ultimately to a healthy full node.

It is important to note that the client does not see which node it is connecting to, nor does it need to know that. The only address used by the client is the load-balancer’s (FQDN) address.

Links

Fabio: https://github.com/fabiolb/fabio
Consul: https://www.consul.io/

Another very cool load balancer that can use different back-ends to hot-load route configurations: http://docs.traefik.io/

Why Fabio?

Fabio is a low-latency, “zero configuration” and very fast load balancer written in Go.

While nginx and (especially) HAProxy are amazing products, the community editions of both lack the ability to hot-load configurations in the way I am going to describe here. There are a number of workarounds, none of which I find as neat as Consul+Fabio and none of which can be automated as easily.

Managing routes

Indeed, nginx and HAProxy allow you to reload new configurations. Yet in my eyes this is still a little different: you need to copy files to your load balancers, or maintain a template or a separate database for your pool of full-nodes.

Imagine you need 3, 5 or even 7 load balancers to support a huge infrastructure: you would need to have shared storage (introducing a new single point of failure) or copy the new configuration files to all 7 load balancers in order to maintain the pool (adding and removing full-nodes dynamically).
In addition, you need a smart template system to easily add new services to, or remove existing ones from, the configuration files. And we haven’t even talked about the health checks yet.

Fabio makes all of this very easy — thanks to Consul.

The author of Fabio (Frank Schroeder) works at eBay, and Fabio is used on marktplaats.nl, a huge Dutch website owned by eBay.

Traefik

I would also like to point out that Traefik is a slightly more comprehensive load balancer with a handful of extra features (e.g. rate limiting, header manipulation, etc.).

It can also use Consul (and many other back-ends!) to configure its routes.

Depending on your requirements, I think Traefik is a nice option, with the advantage that it offers a few more features than Fabio.

I hope to add a follow up blog on how to use Traefik with Consul, and explore the options it provides.

How?

Consul is an advanced discovery service that includes a key-value store, service health checks, a DNS interface and more. Fabio reads its route configuration from Consul. Consul runs the health checks, based on which Fabio determines when to remove a full node from its routes.

For instance, we can write a custom health check for full-node X. If it fails, Fabio instantly removes this route from its configuration, thereby preventing clients from connecting to this node. If full-node X returns to a healthy state, it is immediately added back to Fabio’s route table.

Why Health Checks?

We only want to let clients connect to full nodes that pass certain criteria. For example: is the full node responsive? Does it run the latest IRI version? Is it secured (does it limit certain API commands)? And so on.

High Availability Setup

In a highly available setup, each IRI node has a Consul agent running on it. It is responsible for communicating with the Consul servers (on the load balancers) and registering the service (IRI).

Proof-of-Concept

Who is this blog intended for?

It is intended for those who possess some experience and basic knowledge of load balancing and overall Linux administration (…DevOps).

The idea of this PoC is to help demonstrate a powerful, secure and auto-scaling load-balancer solution for large infrastructures.

This PoC can be tested but should not be used in production unless some refactoring is done. For example, I am not setting up HTTPS certificates for Fabio. This can obviously be done; such options are configurable in Fabio as in any other load balancer. In addition, in this PoC I am not demonstrating the auto-scaling, as that is a topic in itself.

High Availability

I am a firm believer that a standalone load balancer should never be used in production unless it is deployed in a highly available setup. This PoC can be extended to support a clustered HA setup.

The PoC has been tested on CentOS 7.4. It can also be used on Ubuntu 16 or 17.

Note that both Fabio and Consul can be installed and run as Docker containers.

Playbook?

Why don’t I create an awesome playbook to automate this installation on multiple servers as I’ve done for IRI? https://github.com/nuriel77/iri-playbook

Unfortunately, I am short on time. However, that is how I would configure a real production environment: using Ansible to bootstrap it. Better yet, use Kubernetes, which offers high availability and auto-scaling out of the box.

Installation

Let’s install some dependencies.

For CentOS run:
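For example (curl, wget, unzip and jq are used throughout this guide; on CentOS 7, jq comes from the EPEL repository):

```bash
sudo yum install -y epel-release
sudo yum install -y curl wget unzip jq
```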

For Ubuntu:
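For example:

```bash
sudo apt-get update
sudo apt-get install -y curl wget unzip jq
```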

And then let’s begin by installing Consul.

Consul

Let’s create a user for consul to run under:
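For example, a system user without a login shell:

```bash
sudo useradd --system --no-create-home --shell /bin/false consul
```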

Then, create the required directories:
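A possible layout (these paths are my choice and are reused throughout the rest of this guide: configuration in /etc/consul.d, scripts in /opt/consul, data in /var/lib/consul):

```bash
sudo mkdir -p /etc/consul.d /opt/consul /var/lib/consul
sudo chown -R consul:consul /etc/consul.d /opt/consul /var/lib/consul
```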

Download (currently latest) consul:
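For example (1.0.6 was a recent release around the time of writing; replace it with the latest version listed on releases.hashicorp.com):

```bash
cd /tmp
wget https://releases.hashicorp.com/consul/1.0.6/consul_1.0.6_linux_amd64.zip
```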

Unzip the package:
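For example:

```bash
unzip consul_1.0.6_linux_amd64.zip
```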

Move the binary to its directory and make executable:
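For example:

```bash
sudo mv consul /usr/local/bin/consul
sudo chmod +x /usr/local/bin/consul
```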

Add systemd control file:
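A minimal unit file could look like this (the paths match the choices made above):

```bash
sudo tee /etc/systemd/system/consul.service > /dev/null <<'EOF'
[Unit]
Description=Consul agent
After=network-online.target
Wants=network-online.target

[Service]
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
```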

Generate an encryption key:
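Keep the output at hand, it goes into the configuration file in the next step:

```bash
CONSUL_KEY=$(consul keygen)
echo $CONSUL_KEY
```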

Add configuration file:
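A minimal single-server configuration could look like this (the datacenter name and bind addresses are my assumptions; enable_script_checks is needed for the custom health-check script used later; the encrypt value is the key generated above):

```bash
sudo tee /etc/consul.d/consul.json > /dev/null <<EOF
{
  "datacenter": "dc1",
  "data_dir": "/var/lib/consul",
  "server": true,
  "bootstrap_expect": 1,
  "ui": true,
  "bind_addr": "127.0.0.1",
  "client_addr": "127.0.0.1",
  "encrypt": "$CONSUL_KEY",
  "enable_script_checks": true
}
EOF
```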

Enable and start consul:
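For example:

```bash
sudo systemctl enable consul
sudo systemctl start consul
```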

Check if consul is active:
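For example:

```bash
systemctl status consul
```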

And check last lines of the logs:
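For example:

```bash
sudo journalctl -u consul -n 20 --no-pager
```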

If we’ve gotten this far, well done!

Now we can create a master token for Consul (this is optional!)
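One simple way is to generate a UUID and use it as the master token (how you generate the token is up to you; this is just a sketch):

```bash
export TOKEN=$(uuidgen)
```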

If all went well, your master token is now in $TOKEN; this command should display it:
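For example:

```bash
echo $TOKEN
```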

You can store the token in an additional JSON file; this is better because you can always check what it was:
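For example, using the legacy ACL configuration keys (the datacenter must match the one in consul.json; the permissive default policy is my assumption):

```bash
sudo tee /etc/consul.d/acl.json > /dev/null <<EOF
{
  "acl_datacenter": "dc1",
  "acl_master_token": "$TOKEN",
  "acl_default_policy": "allow"
}
EOF
```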

Restart consul to ensure all is running fine:
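For example:

```bash
sudo systemctl restart consul
```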

Check its status (should be Active):
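For example:

```bash
systemctl status consul
```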

Remember that this is just a PoC. If you want to run Fabio & Consul in high availability, you need to create a Consul cluster. In addition, each IRI node will have a Consul agent running next to it, which is responsible for registering IRI with the Consul servers. This is just to give you an idea; the high-availability setup is outside the scope of this blog (I might just write about it in a next blog… if you are interested!)

Fabio

Create a user for Fabio to run under:
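For example:

```bash
sudo useradd --system --no-create-home --shell /bin/false fabio
```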

Add the required directories:
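For example (the path is my choice and is reused below):

```bash
sudo mkdir -p /etc/fabio
sudo chown -R fabio:fabio /etc/fabio
```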

Download fabio and make executable:
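Something along these lines (the release file name changes per version and Go build; check the Fabio releases page for the current one):

```bash
sudo wget -O /usr/local/bin/fabio \
  https://github.com/fabiolb/fabio/releases/download/v1.5.8/fabio-1.5.8-go1.10-linux_amd64
sudo chmod +x /usr/local/bin/fabio
```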

Add systemd control file:
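A minimal unit file could look like this (paths match the choices made above):

```bash
sudo tee /etc/systemd/system/fabio.service > /dev/null <<'EOF'
[Unit]
Description=Fabio load balancer
After=network-online.target consul.service
Wants=network-online.target

[Service]
User=fabio
Group=fabio
ExecStart=/usr/local/bin/fabio -cfg /etc/fabio/fabio.properties
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
```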

Add Fabio’s configuration file:
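A minimal sketch (the ports are Fabio’s defaults; the heredoc substitutes the value of $TOKEN into the token line):

```bash
sudo tee /etc/fabio/fabio.properties > /dev/null <<EOF
proxy.addr = :9999
ui.addr = :9998
registry.consul.addr = 127.0.0.1:8500
registry.consul.token = $TOKEN
EOF
```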

Note: the line with the $TOKEN above is optional — only if you created a TOKEN for consul!

If you decide to run fabio on ports under 1024 (443 or 80) you will have to allow this, as unprivileged users are not allowed to bind to ports under 1024:
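For example, by granting the binary the capability to bind privileged ports:

```bash
sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/fabio
```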

If you are configuring this on CentOS you might need to configure Selinux with:
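If SELinux gets in the way, one possible adjustment (an assumption on my part, depending on which port you chose) is to label the port as an HTTP port:

```bash
sudo yum install -y policycoreutils-python
sudo semanage port -a -t http_port_t -p tcp 9999
```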

Enable and start Fabio
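For example:

```bash
sudo systemctl enable fabio
sudo systemctl start fabio
```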

Check Fabio’s status:
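For example:

```bash
systemctl status fabio
```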

And check the last log lines:
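For example:

```bash
sudo journalctl -u fabio -n 20 --no-pager
```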

If all is right, we now have both Consul and Fabio running.

As a last check, see if the ports are all listening:
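For example:

```bash
sudo ss -tlnp | egrep 'consul|fabio'
```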

You should see Consul listening on its default ports (8300, 8301, 8302, 8500 and 8600) and Fabio on its default proxy port (9999) and UI port (9998).

Custom Health Check

This is a simple bash script which acts as a health check for each service (full node) added to Consul. Based on this check (exit code 0 is “passing”, 1 is warning and 2 is critical), Fabio decides whether to add or remove the full node’s route from its configuration.

Download the custom node check script:
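Fetching it could look like this (the destination path and file name are my choice; replace the placeholder with the script’s actual URL):

```bash
sudo wget -O /opt/consul/node_check.sh <URL-of-the-check-script>
```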

Then make it executable:
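For example:

```bash
sudo chmod +x /opt/consul/node_check.sh
sudo chown consul:consul /opt/consul/node_check.sh
```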

Some notes about the check:
It requires an address as an argument (e.g. -a https://test.server.com:14265) and accepts optional arguments such as timeout values, the minimum number of neighbors a node must have, or -i to skip checking whether a node has limited certain API commands.

Configuration

Now that we have both services working, we can start the configuration process.
The idea behind this PoC, if you are a creative DevOps engineer or programmer, is to be able to add or remove full nodes simply by creating or removing services in Consul.

For example, let’s say you have a registration page where full-node owners enter their email address and their full node’s address. You confirm their email and proceed to add their node as a service to Consul (all of this done automatically, of course). If the node is okay, it will be added to the list of nodes Fabio can route to.

Registering a Fullnode in Consul as a Service

Registering services (and respective health checks) into Consul can be done using Consul’s HTTP/REST API.

We have a new node we’d like to add. Let’s say we already have IRI running on the node where we’re configuring this (http://127.0.0.1:14265).

First, let’s export the consul token into a variable, in case you are starting a new session, or had to logout/login:
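For example (substitute your own token value):

```bash
export TOKEN=<your-consul-master-token>
```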

Set your node’s (load balancer) fully qualified domain name:
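For example (lb.example.com is a placeholder; the variable is reused in the curl commands below):

```bash
export LB_FQDN=lb.example.com
```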

Then enter the consul directory:
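I will use the /opt/consul directory created earlier (any working directory will do):

```bash
cd /opt/consul
```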

Create a json file e.g.:
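Something along these lines (the file name, service name, check interval and catch-all urlprefix-/ tag are my assumptions; the check script path matches where it was installed above):

```bash
sudo tee iri_localhost.json > /dev/null <<'EOF'
{
  "ID": "127.0.0.1",
  "Name": "iri",
  "Address": "127.0.0.1",
  "Port": 14265,
  "Tags": ["urlprefix-/"],
  "Check": {
    "Script": "/opt/consul/node_check.sh -a http://127.0.0.1:14265 -i",
    "Interval": "30s"
  }
}
EOF
```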

This JSON contains a definition to register a new service with ID 127.0.0.1 and a health check against the full node’s address. In addition, the -i flag tells the health-check script to skip checking the REMOTE_LIMIT_API commands (this is localhost, so they always work).

The urlprefix tag is what Fabio uses in order to construct its route.

Register the service:
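For example:

```bash
curl -s -X PUT \
  --header "X-Consul-Token: $TOKEN" \
  --data @iri_localhost.json \
  http://localhost:8500/v1/agent/service/register
```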

Check the service is in Consul:
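For example:

```bash
curl -s --header "X-Consul-Token: $TOKEN" \
  http://localhost:8500/v1/agent/services | jq .
```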

Check the health check is in Consul:
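For example:

```bash
curl -s --header "X-Consul-Token: $TOKEN" \
  http://localhost:8500/v1/agent/checks | jq .
```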

Depending on the “interval” value in the service’s check json definition, you might need to wait 30 seconds until the health check runs.

To de-register a service run:
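For example, using the service ID registered above:

```bash
curl -s -X PUT --header "X-Consul-Token: $TOKEN" \
  http://localhost:8500/v1/agent/service/deregister/127.0.0.1
```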

Let’s add another service, this time one that uses HTTPS. Here’s the json:
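A sketch (the hostname, port and file name are placeholders; note the proto=https option in the tag):

```bash
sudo tee iri_remote_https.json > /dev/null <<'EOF'
{
  "ID": "node.example.com",
  "Name": "iri",
  "Address": "node.example.com",
  "Port": 443,
  "Tags": ["urlprefix-/ proto=https"],
  "Check": {
    "Script": "/opt/consul/node_check.sh -a https://node.example.com:443",
    "Interval": "30s"
  }
}
EOF
```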

Note that the urlprefix tag now also contains proto=https, because Fabio isn’t the SSL termination point for this particular node and has to talk to it over HTTPS.

Register the service:
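For example:

```bash
curl -s -X PUT \
  --header "X-Consul-Token: $TOKEN" \
  --data @iri_remote_https.json \
  http://localhost:8500/v1/agent/service/register
```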

Now, let’s check that we have the routes in Fabio:
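For example, via Fabio’s UI port (9998 by default):

```bash
curl -s http://localhost:9998/api/routes | jq .
```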

If all is okay, you should get a list of routes.

If you do not get the expected number of routes, the health checks have probably not reached the “passing” state in Consul. To verify this, run:
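For example:

```bash
curl -s --header "X-Consul-Token: $TOKEN" \
  http://localhost:8500/v1/health/state/any | jq .
```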

See whether any of the checks is not passing; you should also be able to see the script’s output, which helps determine the problem.

Testing

To test if routes work, you should be able to access your node’s address (and fabio port) using curl:
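For example, a getNodeInfo call through the load balancer (assuming Fabio’s default proxy port 9999 and the FQDN exported earlier):

```bash
curl -s http://${LB_FQDN}:9999 \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'X-IOTA-API-Version: 1' \
  -d '{"command": "getNodeInfo"}' | jq .
```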

If you run this command multiple times (and have multiple full nodes registered) you should see a difference in the responses. For example, each node has a different number of neighbors or a different amount of free RAM, so the output should change between runs of the curl command.

If you are just testing and used a fake fully qualified domain name for your load balancer node, you can trick Fabio by adding a Host header with your fake FQDN to the curl command, e.g.:
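Something like this (only relevant if your urlprefix tags are host-specific; with the catch-all prefix used above the Host header is ignored):

```bash
curl -s http://<load-balancer-IP>:9999 \
  -X POST \
  -H "Host: ${LB_FQDN}" \
  -H 'Content-Type: application/json' \
  -H 'X-IOTA-API-Version: 1' \
  -d '{"command": "getNodeInfo"}' | jq .
```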

Conclusion and Final Thoughts

What did we achieve? A system that automates the inclusion or removal of full nodes behind a load balancer.
Now you can write a little front-end program/registration page to allow full-node owners to add their nodes themselves. Of course, this has only been a Proof of Concept, but I hope you enjoyed it and got the idea.

HTTPS?

Yes, you can do a lot with Fabio. You can add your own SSL certificates to Fabio and let people connect securely (not too big a concern, to be honest, because no sensitive information is being sent; however, certificates are a benefit for validating your server’s identity). Please refer to Fabio’s configuration file for HTTPS certificates and more:

https://github.com/fabiolb/fabio/blob/master/fabio.properties

High Availability

To reach true high availability, you would need a minimum of 3 load balancers. You can of course also run IRI on those servers if they are strong enough.

In a nutshell, what do you need in order to have HA?
– Three servers minimum with consul and fabio.

Fabio doesn’t need to know about the other servers. That keeps it simple.
Consul has to be clustered; I advise you to refer to Consul’s documentation on how to create a cluster. This way, each server running Consul and Fabio sees the same services.

DNS

But now we have 3 servers, each with a different external IP address. Which address do we tell our clients to connect to? Simple: use DNS with multiple A records. DNS will then round-robin over the addresses, and each of the servers will be used.
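As an illustration, a zone file fragment with three A records for the same name could look like this (name and addresses are placeholders):

```
lb.example.com.    300  IN  A  192.0.2.10
lb.example.com.    300  IN  A  192.0.2.11
lb.example.com.    300  IN  A  192.0.2.12
```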

To set up the Consul cluster properly, the Consul instances have to communicate with each other, and this is best done using HTTPS. This is especially important if Consul is using external IPs.
Alternatively, you can set up OpenVPN and use the private network for Consul’s traffic, in which case you don’t need to bother with HTTPS in Consul.

Consul Clustered

While configuring Consul in a cluster works out of the box, you still need your application to talk to one of the consul servers to register or remove services.

Your application will have to handle the case when one of the Consul servers in the cluster is down, and talk to another. Or, you can use a shared IP (VRRP/Keepalived).

Who am I

I am a cloud/DevOps engineer at Jexia, which is a very exciting startup.
We are developing awesome software meant for developers; I recommend you have a look.

Feel free to look me up and message me on LinkedIn, GitHub or Twitter (nuriel77).

Donations

Did you like this blog? Feel free to send a donation:
