Things are calm at the moment, so it seems a good time for me to ruminate on the current state of Dreamwidth's load/capacity/etc. Please let me know if anything is unclear, or if you have any concerns, or whatever -- I'll do my best to answer everything.
The summary -- Dreamwidth has definitely been hit with a lot of extra load in the past few weeks, but it's maybe not as much as you thought. We're over double what we were a month ago, but it's still only double -- we already had a good amount of load. Here's a good graph showing the bump:

There are several main "systems" that make up Dreamwidth (or, really, most web sites). They are the frontend, the web servers, the cache, the databases, and the miscellaneous services.
Let's take it one by one... we'll start with the easy things. For each system, we'll talk about its current state and its scaling -- "scaling" being a term that loosely means "making it handle more traffic" (where traffic is more users, more features, more whatever).
Miscellaneous Services
These are pretty straightforward. Scaling these is, typically, a matter of just running more of them. We're nowhere near capacity on most of these, and any that are, we can just run more of the worker processes. I'm not too worried about these -- even if they do get overloaded, they won't stop the site itself from working. People can still read, post, and do stuff.
If these overload, it will affect emails going out, search, payments, and similar things that are considered non-critical services. (I.e., if search goes down for a day, it's frustrating and I will do my best to get it back online ASAP, but it's a lower priority than web servers or databases.)
Cache
We use memcached for most of our caching. Having this service online is crucial for the site to be working -- without our cache, the databases will overload and croak. No good.
Thankfully, though, this system is nearly free to scale. All we have to do is deploy another few instances and update the site config to use them. The downside is that adding more instances will cause the site to slow down for a little while because the entire cache has to be emptied and redistributed across the new, larger cache cluster.
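To make that concrete, here's a toy sketch -- not Dreamwidth's real client code, and with a made-up key set -- of why changing the number of cache instances reshuffles where keys live. Memcached clients pick a server by hashing the key; this sketch uses the naive modulo scheme:

```python
import hashlib

def server_for(key: str, num_servers: int) -> int:
    """Map a cache key to a server index by hashing (naive modulo scheme)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

# With a hypothetical 10,000 keys, count how many land on a different
# server when the pool grows from 6 instances to 8.
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if server_for(k, 6) != server_for(k, 8))
print(f"{moved / len(keys):.0%} of keys move to a different server")
```

With this scheme, roughly three quarters of the keys end up on a different server after the resize -- every one of those is now a cache miss, which is why the cache effectively starts cold. Consistent hashing schemes reduce the churn, but some amount of it is unavoidable.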
We're getting close to the point where I want to deploy new cache instances, and I will be doing that when I get the new databases up and ready. I'll schedule it for a low traffic time so it should have minimal impact on the site.
I'm very comfortable with our status here and our ability to scale out for more capacity.
Web servers
These are the actual machines that handle processing the web pages, as you might expect. The nice thing about them, though, is that they are horizontally scalable. This means that adding more of them adds more capacity in a linear fashion. If we have ten web servers and they're overloaded, adding ten more doubles our capacity for this service.
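As a toy illustration of that linearity (the per-server throughput figure here is invented purely for the example, not a real Dreamwidth number):

```python
REQS_PER_SERVER = 100  # hypothetical requests/second one web server handles

def capacity(num_servers: int) -> int:
    """Total requests/second the web tier can absorb, scaling linearly."""
    return num_servers * REQS_PER_SERVER

# Ten servers overloaded? Adding ten more doubles the tier's capacity.
assert capacity(20) == 2 * capacity(10)
print(capacity(6), "requests/second with six web servers")  # → 600
```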
We currently have six machines handling web requests and we can easily add more. It just takes about 48 hours' notice to our hosting provider for them to spin up and deploy a new machine. As soon as I notice us getting close to capacity on this tier, I submit a request and we get more up. Since the big bump of users two weeks ago, we've added two more web servers. If the load holds where it is now, we'll stay at this level -- but again, it's easy to add more.
I'm very comfortable with our status here, too.
Databases
We're currently running on MySQL databases. These machines are a lot more expensive than web servers -- more RAM, fancy disks, a RAID card with BBU, etc. -- and they're a lot harder to scale than the webs. Harder, but not impossible.
Physically, we have two machines. Logically, though, there are two types of databases -- the global database cluster and the user clusters. We have to talk about scaling the database in terms of its logical components, since they have different scaling requirements.
The user clusters are effectively horizontally scalable, just like the web servers. We put two more machines online, create a new user cluster, and start moving people over to it. We can balance the load on the user databases by increasing or decreasing the number of users who "live" on each machine. You can see what user cluster you're on with our Where am I? tool.
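The "which cluster does a user live on" lookup can be sketched like this -- hypothetical table and connection strings for illustration only; the real mapping lives in the global database:

```python
# Hypothetical user-to-cluster assignments (the real table is in the
# global database).
user_cluster = {"alice": 1, "bob": 1, "carol": 2}

# Hypothetical connection strings, one per user cluster.
clusters = {
    1: "mysql://db-user1.internal/dw",
    2: "mysql://db-user2.internal/dw",
}

def db_for(username: str) -> str:
    """Find which user cluster a journal lives on and return its database."""
    return clusters[user_cluster[username]]

# Rebalancing is, logically, copying a user's rows to the new cluster and
# then flipping this one pointer.
user_cluster["alice"] = 2
print(db_for("alice"))  # → mysql://db-user2.internal/dw
```

Because each journal lives wholly on one cluster, adding a cluster and moving users over spreads the load without any single database needing to know about everyone's entries.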
The global cluster is harder to scale. There are some bits of data that have to live in one place because running it in several places makes code very, very hard to get right. Think about it like having two bosses -- if you have two bosses who do the same thing, you're never really sure who to listen to. Jim may tell you to work on project X, but Sally might say work on project Y. How do you decide what to do?
On the plus side, our global cluster is a lot, lot smaller than the user clusters. It only stores things like payments, user login information, and some other data that is pretty small and lightly used. It has a much higher capacity (how much load we can throw at it) before we have to consider scaling it.
Even then, scaling it can be done by adding more machines in as slaves -- i.e., exact copies of the master global database. This will buy us a decent amount of headroom before we have to consider doing something fancier like moving to SSDs instead of rotating disks. We can also add more cache machines to give us even more capacity.
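The reason slaves buy headroom is that reads can be spread across the copies while writes stay on the single master. A minimal sketch of that routing, with hypothetical server names (this isn't Dreamwidth's actual code):

```python
import random

MASTER = "global-master"                     # the single source of truth
SLAVES = ["global-slave1", "global-slave2"]  # exact copies of the master

def pick_server(sql: str) -> str:
    """Send writes to the master; spread reads across the slave copies."""
    is_write = sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    return MASTER if is_write else random.choice(SLAVES)

print(pick_server("UPDATE users SET ..."))  # always the master
print(pick_server("SELECT * FROM users"))   # one of the slaves
```

Each slave added this way absorbs a share of the read traffic; writes still all land on one machine, which is the eventual limit this approach runs into -- hence the "something fancier" later.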
We're hitting close to capacity on our existing databases, but we have two more machines on their way right now. They should be set up pretty soon (in the next day or so) and then we'll have more than double our current capacity. Also, we're still running on a MySQL version that is two years old -- there have been a lot of improvements to MySQL (particularly the Percona branch) since then, and I will be upgrading us soon.
All told, I'm pretty comfortable with our scaling here. Our existing systems are getting loaded but there's a very clear path from here to get us to more than 10x our current size. Once we start getting that big we'll have to do some more interesting work, but if we get to 10x our current size, we should have enough money that it will be no problem at all.
Frontend
Finally, the frontend -- our load balancer -- the machine that handles getting all of the user traffic from the Internet to our web servers. We're running a combination of software on this machine, primarily Pound and Perlbal. (Although soon I will be adding Varnish to help with userpic caching.)
Scaling the frontend is easy up to a certain point, after which it becomes really hard. Thankfully that "certain point" is fairly far off. Right now we're at about 25% capacity on this machine -- this is after the doubled load! -- and adding in a Varnish cache for userpics should help reduce that to about 15%.
When we start getting closer to that point I have a few ideas that will help with the load -- notably offloading the Perlbal instances to another machine -- and that will allow us to go up to the bandwidth limit of the machine. We're doing up to about 25Mbps right now and we can go close to 800Mbps before we start to hit capacity on that front.
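Plugging in the figures from the paragraph above as a back-of-envelope check:

```python
# Back-of-envelope headroom check using the numbers quoted in the post.
current_mbps = 25   # roughly what we're pushing today
limit_mbps = 800    # where the machine's link starts to saturate

headroom = limit_mbps / current_mbps
print(f"~{headroom:.0f}x bandwidth headroom on the frontend")  # → ~32x
```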
In short, then, I believe we're in good shape on this front and have a clear path to scaling this out to more than 10x our current load.
Code/other concerns
Honestly, the part that is most likely to bite us is also one of the easiest to fix -- and that's our code. There are certainly inefficient things in our codebase and we will have to address them as they come up. This is also exactly the kind of thing that has led LJ to temporarily suspend ONTD and similar communities from time to time, because that's the most expedient way to get the service back to normal for everybody else while they isolate and fix the problem triggered by the heavy users.
Dreamwidth will have the same policy, too. If the site goes down and it turns out to be because of a particular community or heavy user, we'll take what action we need to bring the site back -- and then we'll work our tails off to get service restored to that particular user/group. I also promise that we will communicate with anybody affected by this and let you know what's up -- you won't sit and wonder what happened.
Open floor!
All that said... any questions? Fire away, I'll answer them to the best of my ability. (Although I will say that right now I'm going to step away from the computer and go make some bread. It's New Year's Eve and I'd like to spend some time with my partner, aposiopetic. I'll check back in though!)
And, if I haven't said it enough, thank you for using Dreamwidth. It's really gratifying to see people moving in and giving things a whirl. We've worked really hard on this site for the past few years -- this is our baby! -- and I'm so excited to share.
Date: 2012-01-01 12:15 am (UTC)
Admittedly, I understood about 3/4 of the post, but it sounds like Dreamwidth is in very capable hands. Thank you and
Date: 2012-01-01 12:23 am (UTC)
Happy New Year to you and yours!!!
Date: 2012-01-01 12:25 am (UTC)
Best wishes for a very good New Year.
Date: 2012-01-01 12:25 am (UTC)
Thank you so much for being this open with us and telling us so much about how things work.
Date: 2012-01-01 12:26 am (UTC)
Thanks for all you do and Happy New Year.
Date: 2012-01-01 12:31 am (UTC)
Also, Happy New Year! It's 2012 here already <3
Date: 2012-01-01 12:45 am (UTC)
Also? IDK how you do it, Mark, but when you explain things, I understand everything! *___*
Uh, I actually have a question about the importer (mind you, it's not really important, but I'm curious). Say I'm importing a community. If I hit the 'refresh' link, I'm taken back to the importer "main" page, and my main journal is selected. To actually see the status of the import, I need to re-select the comm I'm importing. Is that supposed to happen? I mean, it's no trouble at all to just click a couple of times to get to the import status, but iirc when I hit refresh and I'm re-importing content to my main journal, the import status page stays put.
Date: 2012-01-01 12:48 am (UTC)
Is the Varnish cache related to the 'Varnish errors' that LJ consistently suffers from? I think everyone who used LJ a lot in the past year has come to kneejerk loathe that word without really knowing what it is.
Date: 2012-01-01 01:53 am (UTC)
For now, at least, I'm only looking at using it for static content like userpics, CSS, images, etc. It's really the best system out there for doing file caching these days.
Happy New Year!
Date: 2012-01-01 12:55 am (UTC)
Many thanks to the Dreamwidth staff, volunteers and everyone who keeps making this place more loveable with each day!
Date: 2012-01-01 12:59 am (UTC)
I really appreciate you keeping us in the loop. Have a happy new year and some very good bread!
Date: 2012-01-01 01:46 am (UTC)
I feel very comfortable in buying services from DW and am super psyched that my RP game is moving here! This is exactly the kind of business I'd like to support with my hobby!
Date: 2012-01-01 02:18 am (UTC)
The additional servers increase our server cost some -- we've added about $1,500/month of servers -- but our monthly spend is probably just around $10,000/month to run everything right now. Fiscally speaking, it basically comes down to this: if we have a lot of usage, and the percentage of paid accounts stays where it is now, we can afford to run things no problem. If the usage goes down, then the number of servers and such we need goes down as well, so we can scale back our monthly costs.
In reality we can scale back the monthly costs quite a bit. Right now we pay a little bit each month to make sure that I'm not the only person available if stuff breaks (all hail
Thank You!
Date: 2012-01-01 02:38 am (UTC)
THANK YOU!! And may you and your family have a wonderful New Year!!
Date: 2012-01-01 08:45 am (UTC)
I'm totally with you though. At least, for now, I'm very happy with how things are going. We're able to pay a few people a little bit and make a really awesome service without having to count every penny and worrying about the next month's bills. That's a lot more success than most projects ever enjoy, and I sure as heck don't want to take it for granted.
Date: 2012-01-01 08:43 am (UTC)
Really, though, I'm not going to be using Varnish for the main web stuff. I know LJ is doing some fancy caching to try to reduce the backend load, but that's not necessary for us now or for the foreseeable future. We don't have anybody like ONTD. Of course, if we do go down that path and somebody that big starts using Dreamwidth, we'll do what we have to do to make the site work. That's where a lot of up-front and honest communication comes in, though...