Livemocha, the world's largest language-learning site, has had an incredible growth story. We have users from over 200 countries and are doubling every month.
While it's challenging to keep up with the growth, it is even harder given that the site has to perform well from many countries.
Over the last year we have done many things to optimize our site. Most of what we did initially was fairly standard - you usually don't need to try non-standard techniques until you have exhausted the simple, standard, effective ones.
The evolution of your architecture for the first year will probably look like this:
- Single Server, Single Database: You will be at this stage while testing with close friends and relatives.
- Multiple Servers, Single Database: The next challenge is having multiple servers serving your pages, since this requires you to add a load balancer. You should do this soon (we had moved to this by the time we launched our alpha), since you really don't want to be dealing with the site down at 3 am because your only web server went down. There are various software load balancers that are cost effective and can get you by for a while - we were able to use one for almost 6 months after launch. If you are using AWS, they have a load balancer service that you can use.
- Multiple Servers, Master-Slave Database: Again, you will get to this stage quickly because you don't want to be stuck if your master database crashes. Even if you do not use the slave for queries, you will want it for redundancy.
- Adding caching layers - either APC (webserver level) or memcached
- Servers using master and slave for queries - the slave would be read only.
- Multiple slaves.
- Separate servers to respond to static content requests.
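The master/slave split and the caching layer in the list above can be sketched roughly as follows - a minimal read-through cache in front of a read-only replica. All names are illustrative, and plain dicts stand in for memcached and the master/slave databases:

```python
import time

# Stand-ins for real infrastructure: in production these would be
# memcached and MySQL master/slave connections.
cache = {}          # key -> (value, expiry_timestamp)
master_db = {}      # writable store
slave_db = {}       # read-only replica

CACHE_TTL = 60  # seconds

def write(key, value):
    """All writes go to the master; invalidate any cached copy."""
    master_db[key] = value
    slave_db[key] = value       # replication, simulated synchronously here
    cache.pop(key, None)        # avoid serving a stale entry

def read(key):
    """Reads hit the cache first, then fall back to the read-only slave."""
    entry = cache.get(key)
    if entry is not None:
        value, expires = entry
        if time.time() < expires:
            return value
    value = slave_db.get(key)
    cache[key] = (value, time.time() + CACHE_TTL)
    return value

write("user:42:name", "Ana")
print(read("user:42:name"))   # first read comes from the slave, then is cached
```

The invalidate-on-write step is exactly the kind of complexity mentioned below: get it wrong and users see stale data, and the bug could live in any of three layers.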
Initially, keep things simple - this makes it easy to develop, iterate and debug. Each layer of optimization adds to the complexity of debugging. For example, if you add a caching layer for your database queries and a user sees the wrong list of subscriptions, the problem could be in the view, in the DB query, or in a stale cache entry.
Optimize DB queries: Always monitor your slow database queries and keep optimizing them. Database query degradation is not linear. For example, you may find that a query to find users who have 10 friends takes 200 ms with 2 million users but 2 seconds with 2.5 million users, because you just crossed the threshold that causes the query to go to disk instead of being answered from memory. This is something we do constantly at Livemocha. I would recommend creating automatic reports from the slow query log and query log so you can monitor this continuously.
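An automatic report like the one suggested above can start very small - something that scans the slow query log and surfaces anything over a threshold. This is only a sketch: the log format here is simplified (a real MySQL slow query log carries more fields per entry), and the threshold is illustrative:

```python
import re

# A few lines in the shape of a MySQL slow query log (simplified
# for illustration; real entries have more header fields).
LOG = """\
# Query_time: 2.1  Lock_time: 0.0
SELECT * FROM friends WHERE user_id = 7;
# Query_time: 0.3  Lock_time: 0.0
SELECT name FROM users WHERE id = 7;
# Query_time: 3.4  Lock_time: 0.1
SELECT * FROM friends WHERE user_id = 9;
"""

def slow_queries(log_text, threshold=1.0):
    """Yield (query_time, query) pairs that exceed the threshold."""
    query_time = None
    for line in log_text.splitlines():
        m = re.match(r"# Query_time: ([\d.]+)", line)
        if m:
            query_time = float(m.group(1))
        elif query_time is not None and line.strip():
            if query_time >= threshold:
                yield query_time, line.strip()
            query_time = None

for t, q in slow_queries(LOG):
    print(f"{t:>5.1f}s  {q}")
```

Run from cron against the real log file, a report like this catches the non-linear degradation early - you see the 200 ms query creeping up before it becomes a 2-second query.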
Asynchronous service calls: Decouple services so you can handle requests asynchronously. This can be done easily on the client side with Ajax calls and on the backend using web services. If you try to do everything synchronously, you will hit bottlenecks soon. This can usually be done without the end user noticing anything. Here is an example: when a user added a friend, we used to generate notifications to their existing friends about the event, and this was done synchronously. Needless to say, this got out of hand as soon as people started to have a few friends, and the add-friend call started waiting on the notification-generation code. The right way to do this is to queue up the request to generate the notifications and return immediately.
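The queue-and-return pattern from the friend-notification example can be sketched like this. The names and data are made up, and a deque stands in for a real job queue (a database-backed queue, Gearman, beanstalkd, or similar in production):

```python
from collections import deque

# Stand-in job queue; in production this would be a persistent queue
# drained by a separate worker process.
notification_queue = deque()

friends = {"ana": ["bea", "carl"]}

def add_friend(user, new_friend):
    """Handle the request quickly: update data, queue the slow work."""
    friends.setdefault(user, []).append(new_friend)
    notification_queue.append((user, new_friend))   # cheap; return immediately
    return "ok"

def notification_worker():
    """Runs out-of-band (e.g. a daemon or cron job) to drain the queue."""
    sent = []
    while notification_queue:
        user, new_friend = notification_queue.popleft()
        for existing in friends[user]:
            if existing != new_friend:
                sent.append((existing, f"{user} added {new_friend}"))
    return sent

add_friend("ana", "dee")
print(notification_worker())
```

The key property: `add_friend` now costs the same whether the user has 2 friends or 2,000, because the fan-out happens in the worker, not in the request path.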
Assume services will fail: As I mentioned above, part of the architecture consideration is not just supporting load but also supporting failure and redundancy. The more you can design with this in mind, the easier supporting the site will be. For example, don't keep per-user state on servers, like sticky sessions that route a user to the same server - it makes it harder to bounce servers.
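The alternative to sticky sessions is keeping the session in a shared store so any server can handle any request. A toy sketch, with a dict standing in for the shared store (memcached or a database in practice) and the class names purely illustrative:

```python
# Shared session store: in production this would be memcached or a DB,
# reachable by every web server in the pool.
session_store = {}

class WebServer:
    def __init__(self, name):
        self.name = name

    def handle(self, session_id):
        # Any server can load the session; no server keeps local user
        # state, so any server can be bounced without logging users out.
        session = session_store.setdefault(session_id, {"views": 0})
        session["views"] += 1
        return self.name, session["views"]

a, b = WebServer("web1"), WebServer("web2")
print(a.handle("sess-1"))   # ('web1', 1)
print(b.handle("sess-1"))   # ('web2', 2) -- same session, different server
```

Because the session lives outside the web tier, the load balancer is free to send each request anywhere, which is also what makes the horizontal scaling below work.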
Scale by adding horizontally: This will sound like a cliche - everyone says it now. You want to design so that supporting the next level of traffic can be done not by upgrading your hardware to more memory, CPU, etc., which gets expensive, but ideally by adding another node to the system. With an open-source, Linux-based server environment, your cost of adding capacity is usually only the cost of the hardware, so scaling horizontally is more cost effective. For example, when you store media files, choose a distributed file system rather than mounting a 750 GB RAIDed NFS server - with the latter it gets harder to deal with failure and to grow to the next level of storage. The same applies to web server pools, etc.
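The simplest way to spread media files over a pool of storage nodes is to hash the file path onto the node list. This is only a sketch with made-up node names (real deployments use a distributed file system such as MogileFS or HDFS):

```python
import hashlib

def node_for(path, nodes):
    """Pick a storage node deterministically by hashing the file path."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["store1", "store2", "store3"]
print(node_for("media/u42/avatar.jpg", nodes))
```

One caveat worth knowing: naive hash-mod remaps most keys when the node list changes, so systems that add nodes frequently use consistent hashing instead - but the principle is the same: adding capacity means adding a node, not buying a bigger box.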
Without monitoring and alerting you won't be able to tell what's wrong. Graph everything you can: CPU, disk, memory, HTTP requests, network, database status variables, etc. Use an alerting system like Nagios to alert you immediately when something is wrong, based on that monitoring.
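At its core, Nagios-style alerting is just comparing each graphed metric against warning and critical thresholds. A toy version, with metric names and threshold values chosen purely for illustration:

```python
# metric -> (warning, critical); values are illustrative, tune per system.
THRESHOLDS = {
    "cpu_percent": (80, 95),
    "disk_percent": (85, 95),
    "slow_queries_per_min": (10, 50),
}

def check(metrics):
    """Compare current readings against thresholds, worst level wins."""
    alerts = []
    for name, value in metrics.items():
        warn, crit = THRESHOLDS.get(name, (None, None))
        if crit is not None and value >= crit:
            alerts.append((name, "CRITICAL", value))
        elif warn is not None and value >= warn:
            alerts.append((name, "WARNING", value))
    return alerts

print(check({"cpu_percent": 91, "disk_percent": 40,
             "slow_queries_per_min": 60}))
```

The point is that the thresholds encode what "wrong" means for your system, so you get paged at the first sign of trouble instead of finding out from users.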
Make the test machine where you run your performance tests as close to production data as possible. SQL queries can perform very differently depending on the amount of data in the tables.
You should also measure how the site is doing. More on that in a follow-on post.