I often get asked what it is exactly that I do for a living, and usually I can’t really talk about it due to non-disclosure agreements and possible fall-out, but in this case I got permission to do a write-up of my most recent job.
Recently I got called in on a project that had gone off the rails. The system (a relatively low traffic website with a few simple pages) was built using Angular and Symfony2, used postgres and redis for its back-end, centos as the operating system for the unix portion of the system, 8 HP blades with 128 GB RAM each in two blade enclosures and a very large HP 3par storage array underneath all that, as well as some exotic hardware-related windows machines and one special purpose application server running a java app. At first glance the hardware should have been easily able to deal with the load. The system had a 1 Gbps uplink and every blade ran multiple (VMware) VMs. After interviewing the (former) CTO I got the impression that he thought the system was working A-ok. Meanwhile users were leaving or screaming blue murder, and management was getting seriously worried about the situation being out of control and possibly beyond repair, an existential threat if there ever was one.
So for the last 5 weeks I’ve been working pretty much day and night on trying to save the project (and indirectly the company) and to get it to a stage where the users are happy again. I don’t mind doing these jobs, they take a lot of energy and they are pretty risky for me financially but in the end when - if - you can turn the thing around it is very satisfying. Usually I have a pretty good idea of what I’m getting into. This time, not so much!
The job ended up being team work, it was way too much for a single person and I’m very fortunate to have found at least one kindred spirit at the company as well as a network of friends who jumped to my aid at first call. Quite an amazing experience to see a team of such quality materialize out of thin air and go to work as if they had been working together for years. My ‘plan B’ was to do an extremely rapid re-write of the system but fortunately that wasn’t necessary.
It turned out that the people that had originally built the system (a company in a former east-bloc country) had done a reasonably good job when it came to building the major portions of the application. On the systems level there was lots that was either inefficient or flat-out wrong. The application software was organized more or less ok, but due to time pressure and mismanagement of the relationship and the project as a whole the builders had gotten more and more sloppy towards the end, in an attempt to deliver something that clearly wasn’t ready by the agreed-upon deadline. I don’t blame them for this, they were placed in a nearly impossible situation by their customer. The (Dutch, not associated with the company that built the whole thing) project manager was dictating what hardware they were going to have to run their project on (I have a suspicion why but that’s another story) and how that was all set up, and so the whole thing had to become the justification of all the expenses made up front. He also pushed them to release the project at a stage where they were pretty much begging him (in writing) for more time and had made it abundantly clear that they felt the system was not really ready for production. That letter ended up in a drawer somewhere. A single clueless person in a position of trust with non-technical management, an outsourced project and a huge budget, what could possibly go wrong… And so the big red button was pushed, the system was deployed to production and from there it went downhill very rapidly. By the time I got called in the situation had become very serious indeed.
Here is an enumeration of some of the major things that we found:
When I first started working on the project the system comprised a whopping 130 VMs, each of them seriously restricted in terms of CPU and memory available (typically: 3 GB RAM, 2 or 4 cores). I wish I were joking here, but I’m not: for every silly system function there was a VM, a backup VM and another two that were make-believe running in another DC (they weren’t, that second blade enclosure was sitting one rack over). Now that it’s all done the whole thing is running comfortably on a single server. Yes, that puts all your eggs in one basket. But that single server has an MTBF that is a (large) multiple of the system the way it was set up before and does not suffer from all the communications overhead and possible sources of trouble that are part and parcel of distributed systems. Virtualization when used properly is a very powerful tool. But you can also use it to burn up just about any CPU and memory budget without getting much performance (or even reliability) in return. Don’t forget that if you assign a VM to just about every process you are denying the guest OS the ability to prioritize and schedule; you’re entirely relying on the VM architecture (and hence on yourself) to divide resources fairly in a mostly static fashion, and that setup doesn’t have the in-depth knowledge the guest OS does about the multiple processes it is scheduling. Never mind the fact that each and every one of those VMs will have to be maintained, kept in sync, tuned and secured. The overhead of keeping a VM spun up is roughly equivalent to keeping a physical server alive. So use virtualization, but use it with care, not with abandon, and be aware of where and how virtualization will affect your budgets ($, cycles, mem). Use it for the benefits it provides (high availability, isolation, snapshotting). Over time we got rid of most of the VMs; we’re left with a handful now, carefully selected with regard to functional requirements. This significantly reduced total system complexity and potential attack surface and made a lot of the problems they were experiencing tractable and eventually solvable.
Application level documentation was non-existent. For the first few days we were basically charting the system as we worked our way through it, to figure out what we had, which bits went where and how they communicated. Having some usable application level documentation would have been a real time saver here, but as is usual with jobs like these, documentation is the thing everybody hates to do and pushes as far into the future as possible. It’s usually seen as some kind of punishment to have to write documentation. What I wouldn’t have given for a nice two-level drawing and description of how the whole thing was supposed to work on day #1.
The system was scaled out prematurely. The traffic levels were incredibly low for a system this size (< 10K visitors daily) and still it wouldn’t perform. First you scale ‘up’ as far as you can, then you scale ‘out’. You don’t start with scaling out, especially not if you have < 100K visitors per day. At that level of traffic a well tuned server (possibly a slightly larger one) is what you need. Maybe at some point you’d have to off-load a chunk of the traffic to another server (static images for instance, or the database). And if you can no longer run your site comfortably on a machine with 64 cores and 256 GB of RAM or so (the largest still-affordable server that you can get quickly today), then the time has come to scale out. But you want to push that point forward as far as you can, because the overhead and associated complexity of a clustered system compared to a single server will slow down your development, make it much harder to debug and troubleshoot, and in general eat up your most precious resource (the time the people on your project have) in a hurry. So keep an eye out for that point in time where you are going to have to scale out and try to plan for it, but don’t do it before you have to. The huge cluster as outlined above should have been able to support more than one million users per day for the application the company runs, and yet it did not even manage to support 10K. You can blow your way through any budget, dollars, memory, cycles and storage, if you’re careless.
There was no caching of any kind, anywhere. No opcode cache for the PHP code in Symfony, no partials caching, no full page cache for those pages that are the same for all non logged in users, no front-end caching using varnish or something to that effect. No caching of database queries that were repeated frequently and no caching of the results either. All this adds up to tremendous overhead. Caching is such a no-brainer that it should be the first action you take once the system is more or less feature complete and you feel you need more performance. The return is huge for the amount of time and effort invested. Adding some basic caching reduced the CPU requirements of the system considerably at a slight expense in memory. We found a weird combination of PHP version and opcode cache on the servers: PHP 5.5 with a disabled xcache. This is strange for several reasons; PHP 5.5 provides its own opcode cache, but that one had not been compiled in. After enabling xcache it turned out the system load jumped up considerably without much of an explanation to go with it (it should have gone down!). Finally, after gaining a few gray hairs and on a suggestion by one of the people I worked with, we threw out xcache and recompiled PHP to enable opcache support, and then everything was fine. One more substantial jump in performance.
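For reference, enabling the bundled opcode cache on PHP 5.5 comes down to recompiling with the cache included and then loading and switching on the extension. The snippet below is a minimal sketch; paths and values are illustrative, not the exact configuration we ended up with:

    # rebuild PHP 5.5 with the bundled Zend OPcache, then load and enable it
    ./configure --enable-opcache && make && make install
    cat >> /etc/php.d/opcache.ini <<'EOF'
    zend_extension=opcache.so
    opcache.enable=1
    opcache.memory_consumption=128       ; MB of shared memory for cached opcodes
    opcache.max_accelerated_files=10000
    EOF
    php -v    # should now report "with Zend OPcache"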
Sessions were stored on disk. The HP 3par is a formidable platform when used appropriately, but in the end it is still rotating media and there is a cost associated with that. Having a critical high-update resource on the other side of a wire doesn’t help, so we moved the sessions to RAM. Eventually these will probably be migrated to redis so they survive a reboot. Moving the sessions to RAM significantly reduced the time required per request.
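Both steps are small; sketched out here with illustrative paths (the mount point and redis address are assumptions, not our actual setup):

    # 1) keep the session files in RAM instead of on the storage array
    mount -t tmpfs -o size=512m tmpfs /var/lib/php/session

    # 2) the eventual target: store sessions in redis so they survive a reboot
    #    (assumes the phpredis extension is installed)
    cat >> /etc/php.d/session.ini <<'EOF'
    session.save_handler = redis
    session.save_path    = "tcp://127.0.0.1:6379"
    EOF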
The VMs were all ‘stock’; beyond some basic settings (notably max open files) they weren’t tuned at all for their workload. The database VM for instance had all of 3 GB worth of RAM and was run with default settings plus some half-assed replication thrown in. It was slow as molasses. Moving the DB to the larger VM (with far more RAM) and tuning it to match the workload significantly improved performance.
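To give an idea of what ‘tuning it to match the workload’ means in practice, the changes are of this general shape in postgresql.conf. The numbers below are purely illustrative and need to be sized against the actual machine and workload; they are not the values we used:

    shared_buffers = 8GB             # default is tiny; give postgres a real buffer pool
    effective_cache_size = 24GB      # tell the planner how much OS cache to expect
    work_mem = 32MB                  # per sort/hash operation, so mind concurrency
    maintenance_work_mem = 1GB       # speeds up index builds and vacuum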
The database didn’t have indices beyond primary keys. Hard to believe that in this day and age there are people who call themselves DBAs who will let a situation like that persist for more than a few minutes, but apparently it’s true. Tons of queries were dutifully logged by postgres as being too slow (> 1000 ms per query), typically because they were scanning tables with a few hundred thousand or even millions of records. Adding indices and paring down the databases to what was actually needed (one neat little table contained a record of every request ever made to the server and the associated response…) again made the system much faster than it had been up to that point.
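Both halves of that fix are a few lines each. The slow query logging was already in place; the knob for it lives in postgresql.conf:

    log_min_duration_statement = 1000    # log any statement slower than one second

And for each offending query an index on the column(s) being scanned does the trick. The table and column names below are made up, since I can’t show the real schema:

    -- build the index without locking the table against writes
    CREATE INDEX CONCURRENTLY idx_requests_user_id ON requests (user_id);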
The system experienced heavy load spikes once every hour, and at odd moments during the night. These load spikes would take the system from its - then - average load of 1.5 to 2 or so all the way to 100 and beyond, and caused the system to be unresponsive. This took some time to track down; eventually we found two main causes (with the aid of the man who had configured the VMware and storage subsystem, to rule out any malfunctioning there). The first was that the linux kernel elevator sort and the 3par get into a tug of war over who knows best what the drive geometry looks like. Setting the queue scheduler to ‘noop’ got rid of part of the problem, but the hourly load spike remained. It turned out that postgres has an ‘auto vacuum’ setting that, when enabled, will cause the database to go on an introspective tour every hour, which was the cause of the enormous periodic loads. Disabling auto vacuum and running it once nightly, when the system is very quiet anyway, solved that problem.
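For completeness, the two changes in concrete form; the device name and cron schedule below are illustrative, not the exact ones in use:

    # let the 3par do the request reordering, not the guest kernel
    echo noop > /sys/block/sda/queue/scheduler

    # postgresql.conf: stop the hourly introspection...
    autovacuum = off

    # ...and run it once a night from cron instead, e.g. in /etc/cron.d/vacuum:
    30 4 * * * postgres vacuumdb --all --analyze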
The system was logging copious information to disk; on every request large amounts of data were written to log files that grew extremely large. To make things worse, in a bit of a hurry the builders had made the root of the web tree world-writeable, and the log files were stored there for easy access by the general public, wanna-be DDOSers, hackers and competitors. So disabling these logs (they were there for debug purposes) killed two birds with one stone: it significantly reduced the amount of data written to disk for a nice performance gain, and it closed a major security hole.
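The cleanup itself was trivial once we knew where to look; something along these lines, with illustrative paths:

    # get the debug logs out of the document root and lock it back down
    mv /var/www/site/web/*.log /var/log/app/
    chmod 755 /var/www/site/web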
What didn’t help either is that the hardware - in spite of being pretty pricey - broke down during all this. Kudos to VMware, I can confirm that their high availability solution can save your ass in situations like that, but still, it’s pretty annoying to have to deal with hardware failures on top of all the software issues. One of the blades failed, was fixed (faulty memory module) and then a few weeks later another blade failed (cause still unknown). Highly annoying, and for hardware this expensive I’d expect better. It probably is nothing but bad luck.
Besides all of the above there were numerous smaller fixes. So, now the load is at 0.6 or so when the system is busy. That’s still too high for my taste and I’m sure it can be improved upon, but it is more than fast enough now to keep the users happy, and spending more time to make it faster still would be premature optimization. We fixed a ton of other issues besides these that had a direct impact on the user experience (front end stuff in Angular, and some back end PHP code), but since I’ve been mostly concentrating on system level stuff that’s what this post is about. The company is on a very long road to recovering lost business now, and it will take them a while to get there. But the arterial bleeding has been stopped, and they’re doing an ok job of managing the project now with an almost entirely new local team working in concert with the original supplier of the bespoke system. Emphasis will now be on testing methodology and incremental improvements, evolutionary rather than revolutionary, and I’m sure they’ll do fine.
Instrumental in all this work was a system that we set up very early in the project that tracked the interaction between users and the system in a fine-grained manner using a large number of counters. This allowed us to make a detailed analysis of the system under load in a non-intrusive way, and it also gave us a quick way to analyze the effect of a change (better or worse). In a sense this became a mini bookkeeping system that tracked the interactions in such a way that, if it all worked the way it should, this would be reflected in certain relationships between the counters. A mismatch indicated either a lack of understanding of how the system worked or pointed to a bug (usually the latter…). Fixing the holes then incrementally improved the bookkeeping until the margin of error for most of these counters was small enough that we were confident the system was working as intended. A few hold-outs remain, but these have no easy fixes and will take longer to squash.
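The mechanics are simple; it’s the relationships between the counters that do the work. A toy version of the idea, with made-up counter names and a made-up invariant rather than our actual bookkeeping: every code path bumps a named counter in redis, and a periodic job checks that the numbers add up.

    # each interaction increments its counter, e.g.:
    redis-cli INCR counters:orders_submitted
    redis-cli INCR counters:orders_confirmed

    # a periodic check then verifies invariants such as:
    #   orders_submitted == orders_confirmed + orders_rejected + orders_pending
    # any persistent mismatch is either a gap in our understanding or a bug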
Lessons learned for the management of this particular company:
- Trust but verify
It’s ok to trust the people that you hand a chunk of your budget and responsibilities to, but verify their credentials before you trust them and make sure they stay on track over time by checking up on them regularly. Demand that you be shown what is being produced. Do not become isolated from your suppliers, especially not when working with freelancers. And if you’re not able to do the verification yourself, get another party on board to do that for you, one that is not part of the execution. Do not put the executive and control functions in the same hands.
- Don’t be sold a bill of goods
Know what you need. If you need an ‘enterprise level’ solution, make sure your business really is ‘enterprise level’. It sounds great to have a million dollars’ worth of hardware and software licenses, but if you don’t actually need them it’s just money wasted.
- Know your business
Every business has a number of key parameters that determine the health of that business. Know yours, codify them and make sure that every system you run in house ties in with them in realtime, or as close to realtime as you can afford (once per minute is fine). Monitor those KPIs and if they are off, act immediately.
- Be prepared to roll back
If the new system you accept on Monday causes a huge drop in turnover by Tuesday, roll back, analyze the problem and fix it before you try to deploy again. A roll-back is a no-brainer; if the drop is on the order of a few percent it may be worth it to let the system continue to run, but if you are operating at any kind of serious level of turnover a roll-back is probably the most cost efficient solution. A drop in turnover may have some other cause, but usually it simply is indicative of one or more problems with the new release.
- Work incrementally, release frequently
Try to stay away from big bang releases as if your company depends on it (it does). Releasing bit by bit whilst monitoring those KPIs like a hawk is what will save your business from disaster; it will also make it much easier to troubleshoot any problems because the search space is so much smaller.
This was a pretty heavy job physically; there were days when I got home at 4:30 am and was back in the traces at 10:30 the next day. That probably doesn’t sound too crazy until you realize I live 200 km away from where the work was done; part of the time I spent in a holiday resort nearby just to save on the time and fuel wasted on traveling. I’ve been mostly sleeping for the last couple of days, and recovery will take a while after this one, I’m definitely no longer in my 20’s when working this hard came easy. Even so, I am happy they contacted me to get their problems resolved and I’m proud to have worked with such an amazing team thrown together in such a short time.
Thanks guys! And a pre-emptive Happy and Healthy 2015 to everybody reading this.