A video version of this article as available on YouTube here.
Facebook. One of the biggest software companies in the world. It didn’t always start like that though. Facebook started with a simple idea, and evolved over time to be a global application used by billions of users around the world.
In this blog post, we are going to walk through the evolution of Facebook’s software architecture. We are going to learn about how Facebook started with a LAMP (Linux, Apache, MySQL and PHP) stack, to a horizontally scaled distributed system.
Now how do I know anything about this? Well, in 2005 – about one year after Facebook’s initial launch, Mark Zuckerburg was a Guest Lecturer at Harvard University’s CS50 course. And if you ‘ve never heard of CS 50, its basically the most popular introductory computer science course on the internet. Now, during Zuckerburg’s 1.5 hour lecture, he talks a lot about how Facebook faced different types of problems as the application expanded to more universities, more users, and more features. Personally, I find that there is lots of valuable learnings to be had from this type of analysis because you get to understand the problem as it was at the time, and the progressive steps Zuckerburg’s team took to solve it. Now lucky for you, you don’t have to watch that 1.5 hour video because I watched the entire thing and I’m going to summarize it for you in this article. BUT if you do, the link is available here. But first, you need a little bit of context into how Facebook originally worked.
In today’s world, Facebook is a global application – you can search for absolutely anyone on the platform and provided you know the person’s name, and the person doesn’t have some overly restrictive privacy settings. An interesting historical fact here is that Facebook was originally developed so that you could only connect with a user that went to the same school. So for instance, if you went to one school and your friend goes to another, you wouldn’t have been able to interact with one another, but if you both went to the same school, you would — and this is an important point because this product decision had an interesting effect on the way the architecture was able to evolve to scale to the crazy amount of users Facebook had – around 6 million or so at the time.
So in the very early stages the entire facebook platform was hosted on a single rented computer. The exact configuration that Zuckerburg used was a LAMP stack which consisted of Linux for the Operating System, Apache for the web server, MySQL for the database, and PHP as the programming language (ew). Of course a good amount of HTML and CSS were also used for styling of the website.
This architecture was super straightforward – when a request came in, the Apache server would receive the request, query the database that was running on the same machine, and then finally return the content back to the browser. And this is how it originally worked for the first thousand users or so.
Now most of you may be inclined to think that Facebook started with some super sophisticated architecture and well thought out design, but we can see here that this is completely the opposite. LAMP stacks were and continue to be one of the most simple tech stacks out there, and by the way WordPress sites which power almost 33% of the internet’s websites are based on the LAMP stack – its literally everywhere.
Now what I want to point out before I move on to the is that you don’t need to be a genius or overengineer a solution when starting a new project. If you build a good enough product that solves a problem for people, its bound to do well. So as I always say, when you’re starting a new project, KEEP IT SIMPLE, STUPID!
The First Problem – How to Scale?
During the rapid expansion in 2004 and 2005, Zuckerburg describes a key architecture decision he needed to make to allow the application to scale to millions of users. The biggest barrier for scale at the time was the algorithm that was used to power the “Friends of Friends” or “Connections” feature that everyone loves. Theres something so darn interesting about finding mutual friends.
Now Zuckerburg goes into a lot of depth about this problem, but I’ll summarize it briefly for you here. Say theres Me, (Daniel) and you and we’re friends with one another, And I want to know who are our mutual friends. So who are the people that we both have as friends. How would I go about doing that? Well the simple and most straightforward way to figure this out is to 1 by 1 examine my friends, and see if they’re connected to you. And separately, examine each one of your friends, and see if they’re connected to me. The resulting users would be those individuals that are mutual friends of ours. Now this is all fine when you’re only looking 1 level down like we are here. But what happens if we want to know who are the common mutual friends’ of your friend’s friends’ – This would be 3 levels deep.
The complexity for solving this computation grows exponentially, since with each level, we need to examine more and more users to find mutual friends, or similarly, to find out how or if two people are somehow connected to one another.
Zuckerburg describes this problem as being one of the key drivers of a architectural decision – how to scale their application to still maintain this feature, but also make the computation more manageable.
How Zuckerburg and his company decided to solve this problem to partition his single database of all users, into one database per university. Now, when a request came in to find who are our mutual friends 4 or 5 levels deep, the number of users we have to search through to find the connections becomes smaller, because you’re only looking for mutual friends within your own school. -CUT- now some could say he didn’t really solve the problem, he just restricted usability of Facebook to allow it to scale, but that’s a topic for a different video.
Zuckerburg himself acknowledged that this was a technical decision that worked out well for the company and allowed them to scale to around 50 schools.
Problem 2 – Variance of Usage
Now the next key problem Zuckerburg faced after this was that there was a large usage variance between schools. For instance, a school like large Penn State had over 50,000 users, while a small university in ohio would only have 1000 users or maybe less. Now the existing architecture meant that if a surge of users from Penn State were overloading the server with requests and brought down the host machine, users from other schools would also be affected. Obviously a bad experience.
In order to solve this problem, Zuckerburg opted to separate out application machines that received web traffic, from his database machines which stored and served all the underlying data. Separately, he split the databases up onto multiple different machines. This approach had many benefits, but the biggest two were as follows:
- The application machines would be used generically to process a request for ANY university. Making the application fleet of machines generic allowed them to simply keep adding machines to the fleet in order to scale. This concept is called Horizontal scaling and I have a video on it that I’ll put in the description below.
- The second key benefit is that databases for each school could now exist on different machines and scale independently. This means that an outage in one machine would not affect an outage in another. This was a huge boost to the reduce the blast radius so that only a subset of users were affected in the case of an issue with a single database.
Problem 3 – Maxing out MySQL
MySQL is a pretty great out of the box database. But when you’re getting over 100 million page views a day and several hundred thousand concurrent users at a given time, even MySQL has its limits.
The next major hurdle was for Zuckerburg to scale the application even further by introducing a caching layer. Now if you haven’t heard of a cache before, Im not going to go into it into too much depth (I actually a completely separate video on caching that you can check out – I’ll put that down below), but essentially caching is the idea that – for a given request for a user, instead of having the application query the database, have the database look up the information, return that information, parse it, etc etc., we could store a copy of that information IN MEMORY (and I emphasize MEMORY here because in memory lookups are EXTREMELY fast), By using this approach we could get a huge performance benefit.
To be precise, Zuckerburg used Memcache which is an open source caching library and configured it to be distributed – so cached data was split up across multiple different machines in the cache fleet. This approach allowed the application to scale to even more users, but had some big holes in it. Specifically, Zuckerburg mentioned that Facebook had major issues when a single cache fleet machine went down. And the main reason this happens is because caches are generally key-value lookup stores. And what do you think Zuckerburg probably used for the key? It was probably the school – right? Now the way memcache works is that it distributes your data based on key – so if a single machine went down, it could be storing all the cached data for an entire school. So if this happened, incoming requests would find no data in the cache and have to query the underlying database. At facebook’s scale, this could cause the underlying database to go down or get over burdened with requests.
Now Zuckerburg doesn’t go much further into this problem, but does mention they made some custom modifications to the Memcached client to mitigate this problem, but I think its interesting anyways.
So throughout this exercise we’ve added a whole bunch of different layers and went from something simple, to something sophisticated that could handle millions of users. As a quick little post-analysis of all these changes I think its very interesting to see how facebook’s architecture evolved to solve all of these small problems as they arose.
Hope you enjoyed this article.