As my business has grown over the past 10 months, I’ve found myself in charge of more and more servers. Being a one-man band most of the time, I’ve become increasingly aware that I don’t have a comprehensive understanding of what my servers (and apps!) are doing, and that if they’re crying out for help, they might not get it until something goes amiss.

So I set about looking for a solution.

Over the course of the last month, I’ve greatly improved almost every aspect of how our apps are delivered: from using configuration management to set up servers and ensure consistency across them, to deploying and testing with configuration management (which I’ve blogged about before), to centralising logs and system resource information and building metrics on top of them.

This article is an account of my adventure so far, and how it’s made my life so much easier (and my servers so much easier to maintain!).

In the beginning, there was nothing

This is how I felt when a server went wrong

At first, I was configuring everything by hand on the server itself. This was bad in so many ways, but here’s a quick summary:

  • You’re relying on your host to keep backups of your server if it is ever destroyed
  • You have no easy way of tracking what changes you’ve made
  • It’s almost impossible to duplicate the server without doing some snapshot-fu, and even if you do successfully clone the server, you’ll need to configure it by hand to deal with the scaling / new virtualhosts / whatever you’re doing.
  • Shell scripts are typically not idempotent - they just blindly hack away at your server, not knowing if the changes they’re making are necessary or not.

In the back of my mind, I knew this was going to become a problem, so as I began needing more and more servers, I decided to try out some configuration management to allow myself to configure blank servers exactly as I wanted them.

Configuration Management

Puppet

I started off experimenting with Puppet, as it seemed to be the most popular option at the time. Unfortunately, I found it incredibly difficult to set up in the way I wanted. I did manage to get it working, but after about 3 days it stopped working for some reason. Instead of persevering, I decided I’d spent too much time wrestling with it, reverted to “the old way”, and went searching for an alternative.

Ansible

I decided to skip Chef entirely, as I came across a thread discussing Salt on Hacker News, which looked cool. By the time I’d worked out what made Salt so good, I’d come across Ansible on that same thread. The big thing for me was that Ansible didn’t need a whole load of dependencies in order to run. Looking at the documentation, I also preferred Ansible’s approach to dependency chains and the execution order of tasks on the server.

So I took the plunge. I spent a weekend on it and came up with 7 “roles” for the different parts of my server architecture. These 7 roles would allow me to construct my version of a perfect Rails server, running on nginx with Unicorn, with a MySQL database and an Upstart script to keep the server in check. These roles would build nginx virtualhosts, set up the database and user, git clone the repository - the works.

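To give a rough idea of what those roles contained, here’s a minimal sketch of the kind of tasks involved - the variable names, paths and handler are illustrative rather than lifted from my actual playbooks:

```yaml
# Hypothetical excerpt from a Rails app role. A "reload nginx" handler is
# assumed to be defined elsewhere in the role.
- name: create the nginx virtualhost from a template
  template: src=vhost.conf.j2 dest=/etc/nginx/sites-available/{{ app_name }}.conf
  notify: reload nginx

- name: enable the virtualhost
  file: src=/etc/nginx/sites-available/{{ app_name }}.conf dest=/etc/nginx/sites-enabled/{{ app_name }}.conf state=link
  notify: reload nginx

- name: clone the application repository
  git: repo={{ app_repo }} dest=/var/www/{{ app_name }} version=master

- name: install the Upstart script for the unicorn workers
  template: src=unicorn.upstart.j2 dest=/etc/init/{{ app_name }}.conf
```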

As I started integrating Ansible into my production environment, I discovered a bug which meant that Ansible would crash when it tried to ensure that a MySQL user had access to a database. I was in the middle of deploying to production, and didn’t exactly want to wait for the fix, as it would mean having half of my servers (and apps) managed by hand and the others through configuration management, so I rolled up my sleeves and fixed the bug. The codebase was very foreign to me, and I don’t have much experience with Python, but I managed to come up with a small patch which was accepted into the repository. Happy days!
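
For reference, the task that exposed the crash was nothing exotic - roughly a grant like the one below (all names are placeholders), which asks Ansible to make sure a MySQL user exists and has access to the app’s database:

```yaml
# Hypothetical task: ensure the application's MySQL user exists and has
# full privileges on its own database.
- name: create the database user and grant access
  mysql_user: name={{ app_name }} password={{ db_password }} priv={{ app_name }}.*:ALL state=present
```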

Ansible config 1.5

Inevitably, my server setup got more complicated. I had to start supporting static sites, as well as a PHP project. I built the roles out as I needed to support more things, and my 7 roles turned into 30 as I deployed extra tools and dependencies on certain servers. As this happened, deploying got slower and slower.

So, like any responsible developer, I refactored! I found a thread on the Ansible mailing list showing how the when: "'x' in y" syntax can be used to skip tasks in certain situations. I took advantage of that to separate configuration management from application deployment within my roles. This resulted in a very large speed improvement, and also stopped tasks being run in duplicate (if more than one virtualhost was being deployed, for example).
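
As a rough illustration of the idea (the variable name and tasks are made up), each task is guarded by a when clause checking a list of “stages” passed in at runtime, so a deploy-only run skips all of the provisioning work:

```yaml
# Hypothetical sketch: "stages" is a list variable supplied at runtime,
# e.g. ansible-playbook site.yml -e '{"stages": ["deploy"]}'
- name: install nginx
  apt: pkg=nginx state=present
  when: "'config' in stages"

- name: push the latest application code
  git: repo={{ app_repo }} dest=/var/www/{{ app_name }}
  when: "'deploy' in stages"
```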

Another cheap speed boost came from Ansible 1.3’s accelerated mode: a lightweight daemon which allows Ansible to relay its commands over a persistent socket instead of reconnecting over SSH for every task. This resulted in at least a 2x speed improvement.
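
Enabling it is a one-line change at the play level - something like the sketch below (the host group and roles are placeholders). The first connection still goes over SSH to launch the daemon; subsequent tasks talk to it over a socket.

```yaml
# Accelerated mode is switched on per play (Ansible 1.3+).
- hosts: webservers
  accelerate: true
  roles:
    - nginx
    - rails_app
```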

Databases

Oh my God, I’m doing databases wrong

Usually around once or twice a year, I’ll come across an article or website which makes me physically gawk at the screen. This is usually the result of seeing something truly mind-boggling (emscripten’s BananaBread demo comes to mind as one of the highlights of 2012, for example). For 2013, the big revelation came from 12Factor.net - a relatively comprehensive guide to delivering apps the way they should be delivered. It became immediately obvious to me that, put simply, I’d been doing deployment all wrong, and it was wasting my time.

So I went through each of the 12 factors on the website, and did a full audit of what I was doing right, and what could be improved.

Going through the audit, I was glad to see I was already doing a fair few things right, but there were two glaring omissions: treating logs as event streams, and not hard-coding backing services into the app. I realised that I needed a way of supplying the unique identifiers for these services - authentication details, API tokens and the like (especially for the databases and SMTP servers) - from outside the codebase. Cue Ansible again.

The first thing I did was write a task in Ansible which detected what kind of codebase was being deployed. I then built generic configuration templates for each codebase type, and a task to apply them on top of the codebase with all the authentication details filled in. Because Ansible also handles the creation of the database and user, it already had all the details, and they were guaranteed to be current!
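
A hedged sketch of how that detection might look (paths, file names and variables are illustrative): stat a framework-specific file, then template the appropriate config over the checkout.

```yaml
# Hypothetical tasks: detect a Rails codebase, then lay a generated
# database config over the top of it.
- name: check whether this looks like a Rails app
  stat: path={{ app_path }}/config/application.rb
  register: rails_check

- name: apply the generated database config
  template: src=database.yml.j2 dest={{ app_path }}/config/database.yml
  when: rails_check.stat.exists
```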

In each of the codebases affected, I replaced the old configs with ones which took their values from environment variables by default. This meant that each developer on the project could set their config up based on their preferences, and would never check it into version control by accident. I even wrote a small zsh plugin which checks for a .env file and sources it, allowing the developer to easily apply different settings to different projects. Nice!
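
The replacement configs end up looking something like this - a hedged sketch of a Rails database.yml where every sensitive value comes from the environment (the variable names are assumptions):

```yaml
# config/database.yml - nothing sensitive lives in version control;
# each developer sets these variables in their own (git-ignored) .env file.
development:
  adapter: <%= ENV['DB_ADAPTER'] || 'mysql2' %>
  database: <%= ENV['DB_NAME'] || 'myapp_development' %>
  username: <%= ENV['DB_USER'] || 'myapp' %>
  password: <%= ENV['DB_PASSWORD'] %>
  host: <%= ENV['DB_HOST'] || 'localhost' %>
```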

Oh my God, I’m doing databases wrong: the sequel

The next revelation came from watching a video called “Why you shouldn’t use MySQL”. I was stunned at how dodgy MySQL 5.5’s default settings are. So I looked into switching to Postgres - with support for UUIDs, arrays and other cool things, it looked like a good choice.

Thanks to the fact that I was using an ORM for all of my projects, switching over was almost painless. There were a few things here and there I had to tweak to get my tests to pass, but in total it took about 4 hours to move all of my codebases over. Of course, this meant another Ansible repo, which took another 3 or so hours to build, so the whole process took around a day all-in.
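
The new Postgres work boils down to a couple of tasks along these lines (a sketch with placeholder names), using Ansible’s postgresql modules:

```yaml
# Hypothetical excerpt from the Postgres role, run as the postgres user.
- name: create the application database
  postgresql_db: name={{ app_name }} state=present
  sudo: yes
  sudo_user: postgres

- name: create the database user and grant access
  postgresql_user: db={{ app_name }} name={{ app_name }} password={{ db_password }} priv=ALL state=present
  sudo: yes
  sudo_user: postgres
```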

The result is that since the move, I’ve been able to take advantage of some of the more advanced datatypes within my applications, which has made life much easier, and has almost certainly resulted in performance gains.

Metrics

Taking off the blindfold

The other big thing missing from my infrastructure strategy was any notion of monitoring. Upon realising this, I was suddenly aware of just how blind I was. Server monitoring software is often associated with high-traffic websites and large, enterprise-scale systems, but I came to realise that monitoring was even more important for me than it was for them - at a large organisation there would at least be more eyes on the servers, whereas I was the only person in charge of the system, and the buck stopped with me. To mitigate this, I decided to build out my monitoring tools, because the quicker I know about potential problems, the quicker I can react and stop them becoming actual problems.

I did a bit of research, and came across logstash - a JRuby log parser and aggregator, which uses the excellent Elasticsearch database to store logs. Add a slice of kibana to the mix to visualise the whole thing, and you’ve got yourself a seriously classy visualisation tool.

Seriously classy visualisation with kibana and logstash

I spent quite a lot of time honing my logstash config to handle all sorts of log types, as well as installing the excellent logstasher gem into all of my Rails apps, but the result is that I now have a real-time stream of every message they produce.

But I wasn’t satisfied with that. Once I’d realised how useful it was to aggregate data such as this, I decided to follow a blog post by Aaron Mildenstein and add collectd to all of my servers, allowing for basic system stats reporting through kibana.

collectd sending events through to kibana makes me happy

Next Steps

Sensu

I don’t see the point in using a tool like Nagios - which is notoriously hard to set up and configure (and very ugly) - when I have a blank slate and can choose whatever I think is best suited to the job and easiest to maintain. I’ve looked around and found a monitoring framework called Sensu which looks like just what I need. The fact that it has Redis and RabbitMQ support means that integration with logstash should be relatively straightforward (who knows, I might even have a go at writing a Sensu plugin for logstash), and the additional alerting system will allow me to send notifications before things go awry.

Open Source

I’ve built an extensive Ansible playbook for installing, configuring and securing logstash, Elasticsearch and kibana, which I’ve put on GitHub. I want to continue to refine my playbooks (one thing in particular is to make better use of logstash 1.0.14’s configuration directory support in order to make the configuration more Ansible-friendly).
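
The gist of the configuration directory approach is that each concern gets its own templated snippet, so roles can add inputs and filters independently instead of fighting over one monolithic file - roughly like this (file names and paths are assumptions):

```yaml
# Hypothetical tasks: drop individual config snippets into logstash's
# conf.d directory. A "restart logstash" handler is assumed to exist.
- name: install logstash config snippets
  template: src={{ item }}.conf.j2 dest=/etc/logstash/conf.d/{{ item }}.conf
  with_items:
    - input-syslog
    - filter-rails
    - output-elasticsearch
  notify: restart logstash
```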

Slack Integration

Internally we use Slack for communication, so the plan is to build a Slack notification plugin for Sensu (we’re already doing this kind of thing for continuous integration passes & failures, version control changes, API documentation changes and more).

Dev-friendly Authentication Management

The next big thing is authentication management for developers. We use SSH keys for all our servers, but I’d like to find a good method of allowing users to obtain keypairs and authentication text files without asking around the rest of the team (maybe even by using Slack’s API to let users query a bot and get a private message in response, I dunno).
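
Whatever the delivery mechanism ends up being, the server side of it is already straightforward with Ansible’s authorized_key module - a hypothetical task like this keeps every developer’s public key in place on the deploy user:

```yaml
# Hypothetical sketch: push each developer's public key (kept alongside the
# playbook) onto the deploy user of every server.
- name: install developer SSH keys
  authorized_key: user=deploy key="{{ lookup('file', item) }}" state=present
  with_fileglob:
    - keys/*.pub
```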

Docker

I really want to start using Docker in the future. A great way of achieving that could be to adopt Flynn or a similar project.

CoreOS

I’d love to move to CoreOS once I have all my apps and services running in Docker. This would also allow me to use etcd to handle replication configuration amongst other things, which would be super cool.

Conclusion

Wow, this article ended up far longer than I thought it would be. The past few months have been very exciting for me, not least because of how I’ve managed to clean up the way my servers are provisioned and maintained. 2014 is going to be a big year, and having a clean infrastructure will hopefully make it as smooth as possible.