As part of my never-ending quest to improve how I build cool things, I’ve been working for some time on building out infrastructure to help automate and monitor how my apps and servers are doing. I’ve written about horizontal scaling before, but today I’d like to get into one specific facet of its implementation: automated network discovery, and how we use it at FarmGeek to build reliable applications.


The problem

So let’s say you have a few servers: a load balancer, two application servers and a database server, for example. Everything’s working fine until BAM, one of your application servers crashes. To make things worse, for some unexplained reason nobody finds out about it. Your HAProxy checks do work, however, so the node leaves the connection pool as expected.

Your server capacity has just silently halved, without any notifications and with no way of recovering from the problem. That’s not good.

There are a bunch of problems with the “standard” setup being described here:

  • There’s no way of understanding what resources are available among the servers currently switched on - every server suffers from a “Network Blindness”.
  • HAProxy’s checks fail silently.
  • There’s no way of handling IP changes or new servers without manually editing HAProxy’s config.

Using Consul, and with some help from Diplomat and Envoy, we aim to fix all three of these issues.

Introducing Consul

The first problem on this list can be solved with the help of a handy little idea known as Automated Service Discovery. One such implementation is Consul by the lovely fellows at HashiCorp, which is our weapon of choice at FarmGeek.

There are three core things Consul can do that help us:

  • It provides a distributed Key-Value store which allows us to persist configuration data across a network, thus allowing our services to become more portable and easier to run in parallel - as they can share configuration data between each other without relying on a datastore being present.
  • It provides a DNS service for services on the network which allows our servers to become more “Network Aware” with almost zero extra work. The DNS service also doubles as a simple Load Balancer.
  • It provides health checks against those services, and will remove them from the DNS pool if they begin to fail.

Of course, Consul does a heap of other things for us, but we’ll focus on these three today as they’re the most relevant to solving our problem.

I’m not going to go over installing Consul here, as there’s a brilliant tutorial on, but I will explain services, as they’re the key to how we achieve a fully distributed system.

A Service is defined in Consul with (you guessed it) a Service Definition. A Service Definition outlines what kind of service we’re describing, which port it’s on, and what we have to run to check its health. I recommend running service checks on at least the database and the application instances. You can check the service however you want (bash script, ruby script, etc). The main stipulation is that the check exits with a non-zero status for less-than-perfect results. This allows Consul to decide whether a service is unhealthy, which in turn allows it to remove dodgy services from the pool of connections.
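To make that concrete, here is a rough sketch of a check script in Ruby, plus the kind of Service Definition that might point at it. Treat it as an illustration rather than our production setup: the service name, port, script path, thresholds, and helper names are all made up for the example, and the definition assumes a Consul version that accepts the `args` form of a script check. Consul treats exit code 0 as passing, 1 as a warning, and anything else as critical.

```ruby
require "json"
require "socket"

# Exit codes Consul understands for script checks:
# 0 = passing, 1 = warning, anything else = critical.
PASSING  = 0
WARNING  = 1
CRITICAL = 2

# Time a TCP connection to host:port. Returns the elapsed seconds,
# or nil if the connection could not be made within timeout_s.
def connect_time(host, port, timeout_s)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Socket.tcp(host, port, connect_timeout: timeout_s) { |_socket| }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
rescue StandardError
  nil
end

# Map a measurement to an exit code (the threshold is illustrative).
def status_for(elapsed, slow_after_s)
  return CRITICAL if elapsed.nil?

  elapsed > slow_after_s ? WARNING : PASSING
end

# Build a Service Definition as JSON (all values are hypothetical).
def service_definition(name, port, script_path)
  JSON.pretty_generate(
    "service" => {
      "name"  => name,
      "port"  => port,
      "check" => { "args" => [script_path], "interval" => "10s" }
    }
  )
end

puts service_definition("postgres", 5432, "/usr/local/bin/check_postgres.rb")

# In the check script itself, the last line would be something like:
#   exit status_for(connect_time("127.0.0.1", 5432, 1.0), 0.5)
```

Keeping the exit-code decision in its own small function makes it easy to tune what counts as "slow" without touching the connection logic.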

Another important point is how Consul’s DNS API works. Yes - Consul has a DNS API. The way that it works is simple: it provides you with a random IP if you send it a specially crafted domain to resolve. It can even give you a more detailed answer if you ask for an SRV record. Very cool. But the question is, how do you get your app (or any tool for that matter) to send DNS requests to Consul? At FarmGeek, we’re using DNSMasq to achieve this. All you need to do is install Consul using their guide, install DNSMasq, and then create a /etc/dnsmasq.d/10-consul file with the following contents:
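A minimal sketch of that file, assuming Consul’s DNS interface is listening on its default address of 127.0.0.1, port 8600 (adjust both if your agent is configured differently):

```
# Forward lookups for the .consul domain to the local Consul agent
server=/consul/127.0.0.1#8600
```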


Restart dnsmasq and you’ll be able to resolve consul’s *.consul domains without breaking your regular DNS resolution. Simple!
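If you’d rather ask Consul directly from code, without going through dnsmasq, something along these lines should work with Ruby’s standard `resolv` library. It’s a sketch: it assumes the agent’s DNS interface is on 127.0.0.1 port 8600 (Consul’s default), and `consul_name` and `lookup_service` are names invented for this example.

```ruby
require "resolv"

# Build the name Consul's DNS interface expects:
#   [tag.]<service>.service.consul
def consul_name(service, tag: nil)
  [tag, service, "service", "consul"].compact.join(".")
end

# Ask the Consul agent for SRV records, which carry the port as well
# as the host. Returns an array of { host:, port: } hashes.
# The nameserver address and port are assumptions (Consul defaults).
def lookup_service(service, tag: nil, nameserver: "127.0.0.1", port: 8600)
  Resolv::DNS.open(nameserver_port: [[nameserver, port]]) do |dns|
    dns.getresources(consul_name(service, tag: tag), Resolv::DNS::Resource::IN::SRV)
       .map { |record| { host: record.target.to_s, port: record.port } }
  end
end

puts consul_name("postgres")
puts consul_name("postgres", tag: "primary")
```

With an agent running, `lookup_service("postgres")` returns one entry per healthy instance. The `host` in each entry is a node name under the .consul domain, so it needs one more lookup to turn into an IP.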


Introducing Diplomat

Consul allows our servers to talk to one another and to check on the services on our servers, but how do our apps talk to Consul? Consul has a DNS and an HTTP API for us to use, and Diplomat is a lightweight Ruby wrapper for the HTTP API. At FarmGeek, we use it to store basic configuration data amongst our servers that we’d traditionally provide within Environment Variables.

To use Diplomat, simply add it to your Gemfile, then use Diplomat’s static methods anywhere you’d like to get or set key-value data.
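As a rough illustration of how that replaces Environment Variables, the helper below reads a key from a key-value store and falls back to an environment variable when the key can’t be read. It’s a sketch, not part of Diplomat: `config_value` is a made-up helper, and the store is anything that responds to `get(key)`, which is the call the Diplomat examples in this post rely on. A tiny in-memory store stands in for Consul so the snippet runs on its own.

```ruby
# Stand-in for a Consul-backed store. It raises on a missing key,
# which is the behaviour this sketch assumes of the real one.
class MemoryStore
  def initialize(data)
    @data = data
  end

  def get(key)
    @data.fetch(key)
  end
end

# Read `key` from the store, falling back to an environment variable.
# For example, "project/db/name" falls back to PROJECT_DB_NAME.
def config_value(key, store:, env: ENV)
  store.get(key)
rescue StandardError
  env.fetch(key.upcase.tr("/", "_"))
end

store = MemoryStore.new("project/db/name" => "farm_production")
puts config_value("project/db/name", store: store)
puts config_value("project/db/user", store: store, env: { "PROJECT_DB_USER" => "deploy" })
```

In a real app you would pass Diplomat itself as the store, for example `config_value('project/db/name', store: Diplomat)`, which keeps the app bootable on a machine where Consul isn’t reachable.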

An example use-case would be to configure Rails’ database connection in config/database.yml. The example used in the README looks like this:

<% if Rails.env.production? %>
production:
  adapter:            postgresql
  encoding:           unicode
  host:               <%= Diplomat::Service.get('postgres').Address %>
  database:           <%= Diplomat.get('project/db/name') %>
  pool:               5
  username:           <%= Diplomat.get('project/db/user') %>
  password:           <%= Diplomat.get('project/db/pass') %>
  port:               <%= Diplomat::Service.get('postgres').ServicePort %>
<% end %>

However, since we have DNS resolution working now, we could have Consul balance our database connections by setting the host to postgres.service.consul. If we have more than one postgres service available in the network, we’ll be randomly switched between them automatically.

Introducing Envoy

NB: Envoy is now unsupported as it has been superseded by Consul Template. Use that instead!

At this point our servers are aware of one another, our services are aware of one another, and our apps are able to share configurations. The final step is to connect our apps to our services. Usually this is straightforward. In the case of HAProxy, however, it’s a bit trickier.

So we came up with Envoy, a really simple NodeJS script that FarmGeek has released on GitHub under the MIT license to connect HAProxy to Consul. It’s designed to be very hackable and lightweight, and it should run on each HAProxy server. Envoy reloads your config simply by calling service haproxy reload, so it may require sudo.
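To give a feel for the moving parts, here’s the general shape of a poll, rewrite, and reload cycle. This is a Ruby sketch of the idea rather than Envoy’s actual NodeJS code; `write_if_changed`, `sync_once`, and the injected `render` and `reload` callables are all hypothetical names, and the config path in the comment is the usual HAProxy location rather than anything Envoy mandates.

```ruby
# Write `rendered` to `path` only when it differs from what is there.
# Returns true if the file was (re)written.
def write_if_changed(path, rendered)
  current = File.exist?(path) ? File.read(path) : nil
  return false if current == rendered

  File.write(path, rendered)
  true
end

# One pass of the cycle: render the config, and reload HAProxy only
# when the rendered result actually changed.
def sync_once(path, render:, reload:)
  changed = write_if_changed(path, render.call)
  reload.call if changed
  changed
end

# A real loop might look like this (not run here):
#   reload = -> { system("service", "haproxy", "reload") }
#   loop do
#     sync_once("/etc/haproxy/haproxy.cfg", render: render, reload: reload)
#     sleep 10
#   end
```

Comparing the rendered output against the file on disk means HAProxy is only reloaded when something in Consul has really changed, not on every poll.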

To use Envoy, clone the repository onto your server, add in an HAProxy template based on the sample one in the repository, and run it (as a service, preferably). Envoy will periodically poll Consul for changes, and if it finds any, it’ll replace your HAProxy config and reload. Simple! I’ve outlined an example configuration to explain what Envoy does:

  global
    log local0
    log local1 notice
    chroot /var/lib/haproxy
    maxconn 4096
    stats timeout 30s
    stats socket /tmp/haproxy.status.sock mode 660 level admin
    user haproxy
    group haproxy

    # Default ciphers to use on SSL-enabled listening sockets.
    # For more information, see ciphers(1SSL).
    ssl-default-bind-ciphers RC4-SHA:AES128-SHA:AES256-SHA

  defaults
    log global
    mode http
    option httplog
    option dontlognull
    option redispatch
    retries 3
    maxconn 2000
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

  listen stats :1234
    mode http
    stats enable
    stats uri /
    stats refresh 2s
    stats realm Haproxy\ Stats
    stats auth username:password

  frontend incoming
    bind *:80
    reqadd X-Forwarded-Proto:\ http
    mode http
    acl api hdr_dom(host) -i
    acl web hdr_dom(host) -i
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>

  frontend incoming_ssl
    bind *:443 ssl crt /etc/ssl/ssl_certification.crt no-sslv3 ciphers RC4-SHA:AES128-SHA:AES256-SHA
    reqadd X-Forwarded-Proto:\ https
    mode http
    acl api hdr_dom(host) -i
    acl web hdr_dom(host) -i
    <% if (services.indexOf('api') > -1) { %>
    use_backend api if api
    <% } %>
    <% if (services.indexOf('web') > -1) { %>
    use_backend web if web
    <% } %>

<% services.forEach(function(service) { %>
  backend <%= service %>
    # Redirect to https if it's available
    redirect scheme https if !{ ssl_fc }
    # Data is proxied in http mode (not tcp mode)
    mode http
    <% backends[service].forEach(function(node) { %>
    server <%= node['node'] + ' ' + node['ip'] + ':' + node['port'] %>
    <% }); %>
<% }); %>

I won’t go over how HAProxy works, as there are plenty of guides on the internet for that, but let’s dive into the areas which aren’t “standard” compared to most configs:

 1 frontend incoming
 2     bind *:80
 3     reqadd X-Forwarded-Proto:\ http
 4     mode http
 5     acl api hdr_dom(host) -i
 6     acl web hdr_dom(host) -i
 7     <% if (services.indexOf('api') > -1) { %>
 8     use_backend api if api
 9     <% } %>
10     <% if (services.indexOf('web') > -1) { %>
11     use_backend web if web
12     <% } %>

Line 5 - acl api hdr_dom(host) -i - uses HAProxy’s access control list system to set the variable “api” when the incoming traffic is requesting the API’s hostname. In line 8, we then use that variable to decide whether to use the backend or not. However, we must also check that Consul has a backend of the same name, so in line 7 we check for a matching backend before we try to use it.

 1 <% services.forEach(function(service) { %>
 2   backend <%= service %>
 3     # Redirect to https if it's available
 4     redirect scheme https if !{ ssl_fc }
 5     # Data is proxied in http mode (not tcp mode)
 6     mode http
 7     <% backends[service].forEach(function(node) { %>
 8     server <%= node['node'] + ' ' + node['ip'] + ':' + node['port'] %>
 9     <% }); %>
10 <% }); %>

In this segment, we’re taking all the services that Envoy has found through Consul and spitting them out as backend services. Part of this includes spitting out all of the healthy nodes attached to the service, which can be seen in lines 7-9.
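For a sense of the data the template is looping over, the sketch below builds a `backends` structure with the same `node`, `ip`, and `port` keys from the kind of JSON that Consul’s health endpoint (`/v1/health/service/<name>?passing`) returns, then prints the resulting `server` lines. It’s written in Ruby for illustration and is an assumption about the shape of the data, not a copy of what Envoy does internally; the helper names and sample values are invented.

```ruby
require "json"

# Trimmed sample of what /v1/health/service/web?passing might return.
SAMPLE = <<~PAYLOAD
  [
    { "Node":    { "Node": "app1", "Address": "10.0.0.11" },
      "Service": { "Service": "web", "Address": "", "Port": 3000 } },
    { "Node":    { "Node": "app2", "Address": "10.0.0.12" },
      "Service": { "Service": "web", "Address": "10.0.1.12", "Port": 3000 } }
  ]
PAYLOAD

# Turn health entries into the node list used by the template.
# The service address wins when set; otherwise use the node address.
def backend_nodes(entries)
  entries.map do |entry|
    service_ip = entry["Service"]["Address"].to_s
    {
      "node" => entry["Node"]["Node"],
      "ip"   => service_ip.empty? ? entry["Node"]["Address"] : service_ip,
      "port" => entry["Service"]["Port"]
    }
  end
end

# Equivalent of the template's inner loop: one `server` line per node.
def server_lines(nodes)
  nodes.map { |n| "server #{n['node']} #{n['ip']}:#{n['port']}" }
end

backends = { "web" => backend_nodes(JSON.parse(SAMPLE)) }
puts server_lines(backends["web"])
```

Because the `?passing` filter only returns instances whose checks are healthy, a crashed application server simply disappears from the list on the next poll.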

When all is said and done

Now we’ve connected up our systems, we’ve made a great stride towards building a more fault-tolerant system.

Next steps are:

  • Building an orchestration tool which can cut costs by powering servers up or down depending on load and health.
  • Building a notification tool which alerts the admin when something acts oddly (perhaps signalling a bug).
  • Working out how to handle distributed storage.