This is the rare occasion I’ll get to enlighten some of my non-technical friends into the kind of stuff that I do in my idle time for fun.
As many of you know, I’ve been working with the Cirion Group on Anchor Tab for awhile now. We’re live and in production, doing fine. But with any product it helps to be forward thinking. This evening I spent some time sketching out scenarios for the time when Anchor Tab becomes too big to fit on one server efficiently. (Hopefully because we have paying customers coming out our ears.) I thought I’d share some of the scenarios I devised with the reasoning behind them.
Currently, Anchor Tab has two services that are of interest to this conversation:
- The Webapp: The actual Anchor Tab application code.
- The Database: Where the Anchor Tab application records information. For Anchor Tab, we use MongoDB.
I host all of my services at Linode for a few different reasons. First, the performance of Linode throws Amazon’s EC2 in the dust for the equivalent price. Second, bandwidth is included in the cost for my Linode. That means that, barring extraordinary circumstances, I know what I’m going to be billed every month. Currently, we’re using one “Linode 1024” server to run Anchor Tab, affectionately dubbed “Leon” after the character Leon Vance on NCIS.
Leon currently runs both the webapp and the database with plenty of room to spare in terms of processing power, memory, and disk space. In short, our network diagram looks something like this:
In short, it’s pretty boring, right? But, the important thing is that this organization – topology, if you will, is meeting our needs. This arrangement costs us $20/mo, and does the job. But, I like to dream.
Below are some of the options I came up with for scaling out.
The Likely Scenario
Saying that the solution to load is “more servers” is about as intelligent as saying the solution to missed deadlines is “more workers.” If you go in blind, you’re going to be sorely disappointed with the results your “more servers” get you. Thankfully, I don’t have to go in blind. I built Anchor Tab. I’ve done some thinking about how it’s going to be used.
It stands to reason that our product will get a lot more traffic to the API (the part that makes Anchor Tabs appear on a customer’s site) over time than it will to the main site or tab manager. So, then, it stands to reason that the first step towards scaling out would be to separate out the API and the main application – to put them on different servers, if you will.
Because we want the API and the Webapp (the part of Anchor Tab humans interact directly with) to still work if the other server goes down, it also makes sense in this scenario to move Mongo to its own instance so that we’re somewhat insulated against the entire service going down if the webapp or API nodes go down. That means I’d build out something like this:
The arrangement above would cost us around $60/month ($20 for each server). In the grand scheme of things, that’s not really that expensive. It’s around the same price I pay every month for my cable bill actually. So, here, people interacting with anchortab.com directly get directed to the Webapp node and requests for information about a tab that’s supposed to be displayed on a customer’s website get directed to the API node. Both of them talk to the same database.
This means that if traffic to the API greatly outweighed traffic to the main webapp, the main webapp wouldn’t get bogged down responding to API requests. The hard truth of the matter is that potential customers don’t notice if a tab on our customer’s website takes a few extra seconds to load. They do notice if our homepage or the tab manager takes an inordinate amount of time to load.
The problem we run into here is that both the API and the webapp fall over if the Mongo node becomes unavailable for some reason. In techno-lingo, we’d call this a single point of failure: the Achilles heel of our server setup. We want to try to avoid these as much as reasonable once we get to the point where we have multiple servers.
Now, if we wanted to be cost effective, we could do something like this:
In this arrangement, each node has a local copy of the database and they are connected to a third node that also has a copy of the database and is additionally running the arbiter. In the MongoDB world, this database arrangement is called a replica set. The trick is that only one database in the set can accept write requests. That database is called primary and every other database (the read-only ones) are called secondaries. The problem we run into is this: what happens when primary fails?
That is why we have the arbiter. The arbiter is responsible for helping the replica set elect a new primary member so that normal operations can resume as quickly as possible.
In the arrangement above, we’ve removed the single point of failure for the database that we identified in the first arrangement. Here, any one node could go down and the entire system would still function so long as the primary database and the arbiter did not go down at the same time. There are a few different ways to mitigate that risk at no cost, but the best one is to configure Mongo such that the database paired with the arbiter will never be primary if there is another database online. That would mean that as long as the API or the Webapp node are online, one of them would be the primary.
Here’s the next question: what if two nodes fail at the same time? Remember that term “single point of failure”? We still have one for the Webapp and the API. So if one of those nodes goes down, that service is gone. At the point we’re using this configuration, that’s probably OK. What we care about is what happens to the other service(s) that still remain.
- If the Webapp and the API nodes both go down: Probably the worst case scenario. Nothing is visible to the public at all. Full service outage.
- If the Arbiter and Webapp nodes both go down, and Webapp was primary: The API would still be online, but it couldn’t write any data. In this scenario, we would still be able to provide our core service – the embedded tab – but users would be unable to access anchortab.com or change their settings. We would have to provide some safeguards in the API so that it will record things like views and subscriptions in memory until such a time as a new primary node is online.
- If the Arbiter and Webapp nodes both go down, and API was primary: Business as usual for the API, but anchortab.com is unreachable.
- If the API and Arbiter go down, and Webapp is primary: anchortab.com functions as normal. Tabs don’t appear on customers websites until API is back online.
- If the API and Arbiter go down, and API was primary: anchortab.com functions, but in read-only mode until the database is back up and running. Logins/sessions would probably be disabled and you’d really only be able to view the public landing page and a few other things.
To be honest, this isn’t a bad place to be. Sure, we still have a single point of failure for API or Webapp, but they are capable of limping along without each other (and without the arbiter) if they need to for a little while. Only two of the five scenarios result in our core service not working. Pretty effective use of $60 per month.
No Single Point of Failure
So I mentioned before that single points of failure should be avoided. What kind of architecture would I use to avoid that? Well, I’m glad you asked. If we want to avoid a single point of failure for our entire service, and we could afford it, I’d probably implement the following network layout.
Above, “NB” represents a Linode Node Balancer. Those are managed for us and are actually a series of different servers run by Linode. Each “NB” just represents one configuration for load balancing across multiple servers.
Anywho, in the arrangement above, I’ve organized the Anchor Tab application into three clusters: one for API, one for the webapp (anchortab.com), and one for the database. Assuming the Arbiter can never be primary so long as another database is online, no single server failure can result in the Anchor Tab service ceasing to function normally. Only when entire clusters fail do we start seeing issues.
We could further insulate this design by locating additional webapp and API clusters in different parts of the world (and making the database cluster as a whole geographically diverse). The starting price tag on this kind of design with Linode? At least $180/month.
This is the kind of stuff I think about for fun.