- Throw more resources at it, i.e. hardware. This puts off the point at which it'll break to some time in the future. On our graph this would move the "break" line upwards.
- Change how it works. This changes the rate at which the "social graph" line increases.
- before it breaks
- as it breaks
- after it breaks
If it wasn't broken, why fix it? ... We fix it because it was about to become broken.
If you didn't notice, it's because the site threw more resources at it + before it broke. If you did notice, it's either because it actually broke or because you saw the effects of it breaking as it was breaking.
Or the site changed how it worked + before it broke ... at which point the vocal users threw a hissy fit and tossed toys out of prams. They threatened to move to another website, held protests, wrote blog posts, and caused news sites desperate for stories (and traffic) to write articles like "Why website-x is dying" and "The end of site-x". But at the same time, it's currently pretty much the only sane thing to do.
Why Things Break:
That cool "user-who-did-x-also-did-y" feature was calculated whenever you visited your homepage. This worked for the 500 initial users (the site's builders and their friends) but started to take too long when they hit 1,000 users. The site solved this by caching (storing the results for an amount of time) the calculations. The users complained that they were being shown incorrect data because everyone they knew was doing stuff all the time and it wasn't updating fast enough. The site solved this by invalidating (removing the stored results so they need to be recalculated) the cache whenever anyone did anything. The site hits 5,000 users and the cache is being invalidated every sodding second ... the homepage takes too long to load. The site solves this by writing their own custom code for managing off-line tasks and puts everything into a task queue to be processed. 98% of users accept that the section that used to be called "What your friends are doing right now" gets changed to "What your friends have recently been doing". The other 2% of users throw a tantrum and accuse the site of being run by useless gibbering idiots.
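To make the caching story above a bit more concrete, here's a minimal Python sketch. The `ActivityFeed` class, its method names and the users are all made up for illustration; it isn't the code any real site ran. All it tries to show is why "invalidate whenever anyone does anything" stops helping once actions arrive as fast as homepage visits.

```python
class ActivityFeed:
    """Toy homepage feed: cached per user, invalidated on every action."""

    def __init__(self):
        self.actions = []          # everything anyone has done on the site
        self.cache = {}            # user -> cached feed
        self.recalculations = 0    # how often we did the expensive work

    def record_action(self, user, action):
        self.actions.append((user, action))
        # "Invalidate whenever anyone does anything": drop every cached feed.
        self.cache.clear()

    def homepage(self, user):
        if user not in self.cache:
            self.recalculations += 1
            # Hypothetical stand-in for the expensive
            # "user-who-did-x-also-did-y" calculation.
            self.cache[user] = [a for u, a in self.actions if u != user]
        return self.cache[user]


feed = ActivityFeed()

# Quiet site: one action, two visits, the second visit is served from cache.
feed.record_action("anna", "uploaded a photo")
feed.homepage("dave")
feed.homepage("dave")
quiet_recalculations = feed.recalculations

# Busy site: someone does something before every single visit,
# so every visit has to recalculate from scratch.
for i in range(100):
    feed.record_action("anna", f"action {i}")
    feed.homepage("dave")
busy_recalculations = feed.recalculations - quiet_recalculations
```

On the quiet site the cache saves half the work. On the busy site every one of the 100 visits recalculates, so the cache is pure overhead, which is roughly the point where moving the work into an off-line task queue starts to look attractive.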
Why Things Continue to Break:
Now that the site has moved to doing the cool features in off-line tasks, the resources will look something like this as the number of users grows ...
When the site is still "small" the resource that's being used up is often Dave's Desktop Machine. You'll be able to recognise these machines: they'll be the ones tucked under a desk somewhere with a yellow Post-it note on them saying ...
"Admin0. Do not switch off!!!"
But let's just assume there's a pool of finite resource, no matter where it is. It can be used up in a few ways ...
Hard Drive Space/Memory.
The amount needed keeps increasing until there's no room left for the "Standard Features". The cool stuff continues to work, but everything else on the site is starved of space and starts failing. Otherwise known as "We ran out of swap space on Admin0".
Money.
Everything is done in the cloud, so the solution is just to spin up more instances/pay for more CPU resource. As long as the site has money, then this isn't a problem.
Time/CPU.
Any point where processing the data takes longer than the time it took to generate it. For example, those off-line tasks that used to crunch through 24 hours' worth of information in just 6 hours now take 27 hours, so the backlog grows every day. These tasks often come about when someone says "Hey, wouldn't it be cool if we could tell each user X based on stuff they and their contacts have done?" and someone else replies ... "Sure, I'll just get my machine to grab the data, work out all the connections and put the results back into the database". That machine then becomes "Admin0. Do not switch off!!!" and starts to get hot.
Space & Money are both fairly easily solved by throwing smarts and money at them. Time, though, is the kicker: it's a problem that any site depending on the "social graph" will reach sooner or later, based on user growth. Sites can solve this for a while by getting faster machines (or at the very least adding more fans) but that only works for so long, which leaves the site a few other options ...
"Federate the users", which at its simplest basically means splitting the users into smaller groups. Process all the data for one group on resource-pool A and the other group on resource-pool B. That means you can get through all the calculations again, but group A doesn't know what group B is doing and vice versa. Which involves users saying "I'm not getting all my updates, how hard can it be??" and throwing the aforementioned hissy fit.
Deciding how to split the users up into sub-groups is a fine art. Much like the brain, there's probably a not-utterly-terrible way and a fucking awful way to split it into two halves if you really had to. The problem is then normally solved by hiring very smart people with PhDs in Complexity Theory. The ones with the right skills are fairly rare and are either already the site's CTO or employed by other sites trying to solve the same problem.
Another option (cheaper) is to simplify the calculations.
The outcome of this is that everything starts working again (or hopefully didn't break noticeably in the first place), you get a lot more headroom for the next several months and 98% of your users will continue to find the feature does pretty much what they want. The remaining 2% will say "I'm not getting all my updates/some updates are slightly out of date, how hard can it be??" and once more throw a hissy fit, write blog posts, call everyone involved stupid and accuse them of callously rejecting their most influential early adopters. Even though what the programmers have actually done is cleverly keep the site running for everyone for the next 6 months, including the 2% of vocal "cutting-edge" early adopters, if only they'd shut up for a second. The programmers will probably also open source their solution.
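For the "Federate the users" option described earlier, here's a rough Python sketch of the trade-off. Everything in it is an assumption made for illustration: the function names are invented, the cost model pretends the work is proportional to the number of user pairs, and splitting users by a hash of their name is about the crudest split possible (the "fine art" is finding a better one).

```python
import zlib


def group_of(user, groups=2):
    # crc32 gives the same answer on every run, unlike Python's
    # built-in hash() on strings, so a user always lands in the same pool.
    return zlib.crc32(user.encode("utf-8")) % groups


def federate(users, groups=2):
    """Split users into resource pools (pool A, pool B, ...)."""
    pools = [[] for _ in range(groups)]
    for user in users:
        pools[group_of(user, groups)].append(user)
    return pools


def pair_count(users):
    """Assumed cost model: one unit of work per pair of users."""
    n = len(users)
    return n * (n - 1) // 2


def lost_connections(friendships, groups=2):
    """Friendships that span two pools, so neither pool sees them."""
    return [(a, b) for a, b in friendships
            if group_of(a, groups) != group_of(b, groups)]


users = [f"user{i}" for i in range(1000)]
pools = federate(users, groups=2)

all_pairs = pair_count(users)                               # 499500
federated_pairs = sum(pair_count(pool) for pool in pools)   # less work overall

friendships = [("user1", "user2"), ("user3", "user4"), ("user5", "user6")]
missed = lost_connections(friendships)
```

Under this toy cost model, two evenly sized pools need roughly half the pair comparisons of one big pool. Anything that ends up in `missed` is exactly the "I'm not getting all my updates" complaint: those friends sit in different pools, so nothing calculated in either pool knows about the other.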
One last random reason why things may "break"
Something that turned out to be hugely popular was actually written by Dave in his favourite programming language, REBOL, as a side project on a stack of 20 networked Amiga 2000s. WebOps refuse to support the Amiga 2000s. Dave re-writes the code in Perl and leaves, taking his Amigas with him. Everyone still loves the feature but hates using Perl, so it gets re-written 3 times in PHP, and it still doesn't scale. Someone re-writes it in an afternoon in Python but it only works and scales if sub-feature "x" gets left out. 98% of users don't notice, 1.9% of users form a protest #hashtag on Twitter. 0.1% of users argue about the merits of scaling in PHP vs Python vs Their Favourite Language; they write a blog post about it (using their own blogging platform they wrote themselves in 1997), Slashdot links the post and ironically declares the original site "over".
A couple of reasons why things may change
One or more of the current features, which seemed cool, turn out to create a system-gaming situation. This allows a small core group of users to cause antisocial behavior and polarized views, and to screw up data calculations that end up affecting all users, while also taking the site in a direction that the creators didn't originally intend or want. At that point those "damaging" features are removed, changed or replaced. Once again tantrums and accusations are thrown around, even though the overall aim is to keep the site ticking along nicely for everyone, thank you very much.
Another reason is just general improvements. Usually nothing is currently breaking or about to break (well, beyond the normal), but watching how the majority of users are using the site and the tools surrounding it suggests ways in which the process can be improved and new technologies taken advantage of. These will generally be to a) make things simpler/better for new users, and b) level-up the current majority of users by raising the profile of lesser used but cool features that fit in with the site's overall strategy. The main reaction to that is people just hating change.
A recap: FAQ
Q. Why, as a paying customer, wasn't I asked?
A. Because we have so many users that they'll all say different things. We have to try and keep the vast (silent) majority of our users happy, while still keeping our own goals for the site in sight. (When I say "we" and "our" I mean "any site and their owners".)
A. Because the vast majority of users are silent, we use metrics to track what the vast majority are actually doing and where they are finding the main usefulness of the site. You are an edge case using the site in a particularly smart but resource-expensive fashion (probably).
A. Because we need to keep making money and keep 95% of the stuff you use paid for. If we have to cut resource-expensive feature "x" to allow us to cover the costs of everything else, then we'll do that. Asking you won't make any difference ... or we can just slap more advertising and sponsorship everywhere so people can complain about that instead.
Q. Can't you just give us the option to continue using the old version? It's just a check-box and you have all the old code anyway.
A. No, because the old code wasn't scaling, that's why it had to be replaced. The new version is a lot faster/scales better. It's just not possible to run both systems at the same time, for various architectural reasons. Sorry you hate it, but 98% of the people are better off.
Q. I manage the databases for my company so I know it's not that hard, how stupid are you?
A. Good for you ... go get a PhD in Complexity Theory and start your own site.
Q. Why did you change the website/remove my favourite feature/have something not quite working right?
A. Because it's better than having the whole site go down in a ball of fire and failure.
Q. New site Y has my favourite feature/can do X, why can't you?
A. New site Y only has 500 users, just wait until it has to scale ... also all your friends are here.
Q. I hate you, I hate you all! You've removed everything good about this site, I was here from the early days, you don't respect anything your users want, I hate the way you run your site ... IF there was an alternative that was any good I'd use them in an instant!
A. That's not a question.
Part 2: If Anna can add feature X with greasemonkey and site-y does something "better", why can't your "brilliant" programmers do it? ... coming "soon".
- Flickr Engineers Do It Offline
- Twitter, Ruby, and Scaling
- Everything on the Etsy Code is Craft blog
- The Art of Capacity Planning: Scaling Web Resources by John Allspaw
- Web Operations: Keeping the Data On Time by John Allspaw & Jesse Robbins