
Rev Dan Catt's Blog

A random collection of random thoughts


A user’s guide to websites, part 1: If it wasn’t broken why fix it?

27 September, 2010 by Reverend Dan Catt

OMGWTF?! your favourite website has just changed feature X, added something over here and removed something from over there. Or even, gasp, redesigned the whole site.

“Why, as a (paying) customer, wasn’t I asked?” Alternatively “Can’t you just give us the option to continue using the old version? It’s just a check-box and you have all the old code anyway. I manage the Databases for my company so I know it’s not that hard!”

Well, here’s why (and believe me this is the short version) …

All big huge massive (cool) websites started off small at some point. Features were evolving quickly, exciting stuff was in place from day one and there were a few thousand users … maybe you were one of them.

The site was probably run off three racked servers (two serving content, one as the Database server) and 2 desktop machines doing various offline calculations and Database backup. Yes, we also have the cloud now, more on that later.

For a lot of large (cool) sites there’ll be a graph that looks like this …

An abstract view of websites.

There will (hopefully) be users joining at an increasing rate. The standard/simple features that probably make up the bulk of the site will in essence be pages displaying single database records (or an amalgamation of a few) in a very beautiful CSS3 + HTML5 + AJAX + jQuery fashion (probably without the HTML5 bit, but don’t tell PR that). Scaling that is relatively painless.

The ones we’re interested in are the “exciting features that probably involve the social graph”. These are along the lines of “Things people you are following are buying/doing/following” and “people who did this also did that”.

As a site gets more users, these features consume more and more resources until they reach the point at which things break.

Things can be fixed in (generally) two ways …

  1. Throw more resources at it, i.e. hardware. Which puts off the-point-at-which-it’ll-break to some time in the future. On our graph this would move the “break” line upwards.
  2. Change how it works. Which changes the rate at which the “social graph” line increases.
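To put some rough numbers on that graph, here’s a minimal sketch. The cost functions and capacity figures are made up purely for illustration (they aren’t from any real site), and `users_at_break` is a hypothetical helper that finds the user count at which a feature’s cost first crosses the “break” line.

```python
def pairwise_cost(users):
    """Assumed cost of a social-graph feature: one unit of work per pair of users."""
    return users * (users - 1) // 2


def per_user_cost(users):
    """Assumed cost after 'changing how it works': a flat 10 units per user."""
    return 10 * users


def users_at_break(capacity, cost_fn, upper=10**9):
    """Return the smallest user count whose cost exceeds capacity.

    Assumes cost_fn never decreases as users grow, so a binary search is safe.
    """
    low, high = 0, upper
    while low < high:
        mid = (low + high) // 2
        if cost_fn(mid) > capacity:
            high = mid
        else:
            low = mid + 1
    return low


if __name__ == "__main__":
    print(users_at_break(1000, pairwise_cost))  # the original hardware
    print(users_at_break(2000, pairwise_cost))  # fix 1: double the hardware
    print(users_at_break(1000, per_user_cost))  # fix 2: change how it works
```

With these invented numbers, doubling the hardware only moves the break from 46 users to 64, while changing how the feature works moves it to 101 on the original hardware.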

And these fixes can happen at three points

  1. before it breaks
  2. as it breaks
  3. after it breaks

Which brings us back to …

If it wasn’t broken why fix it?

… we fix it, because it was about to become broken.

If you didn’t notice it’s because the site threw more resources at it + before it broke.

If you did notice it’s either because it actually broke or you saw the effects of it breaking as it was breaking.

Or

The site changed how it worked +  before it broke …

… at which point the vocal users threw a hissy fit, tossed toys out of prams, threatened to move to another website, held protests, wrote blog posts and caused News Sites desperate for stories (and traffic) to write articles like “Why website-x is dying” and “The end of site-x”.

But at the same time, currently it’s pretty much the only sane thing to do.

Why Things Break:

That cool “user-who-did-x-also-did-y” feature was calculated whenever you visited your homepage. This worked for the 500 initial users (the site’s builders and their friends) but started to take too long when they hit 1,000 users.

The site solved this by caching (storing the results for an amount of time) the calculations. The users complained that they were being shown incorrect data because everyone they knew was doing stuff all the time and it wasn’t updating fast enough.

The site solved this by invalidating (removing the stored results so they need to be recalculated) the cache whenever anyone did anything. The site hits 5,000 users and the cache is being invalidated every sodding second … the homepage takes too long to load.
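Those two steps, caching and then invalidating, can be sketched roughly like this. `FeedCache` and its methods are hypothetical names for illustration rather than anything a particular site uses, and the clock is passed in as a plain number to keep the example predictable.

```python
class FeedCache:
    """Stores each user's calculated feed for a limited time (a TTL cache)."""

    def __init__(self, compute, ttl_seconds):
        self.compute = compute      # the expensive "who-did-x-also-did-y" calculation
        self.ttl = ttl_seconds
        self.store = {}             # user_id -> (expires_at, value)
        self.recomputations = 0     # how often we had to do the expensive bit

    def get(self, user_id, now):
        entry = self.store.get(user_id)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: serve the stored result
        value = self.compute(user_id)
        self.recomputations += 1
        self.store[user_id] = (now + self.ttl, value)
        return value

    def invalidate_all(self):
        """Throw away every stored result, e.g. because somebody did something."""
        self.store.clear()


if __name__ == "__main__":
    cache = FeedCache(lambda user_id: f"feed for user {user_id}", ttl_seconds=60)
    cache.get(1, now=0)
    cache.get(1, now=10)     # still fresh, served from the cache
    cache.invalidate_all()   # "anyone did anything"
    cache.get(1, now=11)     # has to be recalculated
    print(cache.recomputations)
```

If `invalidate_all` ends up running every second, nearly every `get` has to recalculate, which is roughly the hole the 5,000-user site has dug itself into.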

The site solves this by writing their own custom code for managing off-line tasks and puts everything into a task queue to be processed.
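As a rough idea of what that looks like, here’s a minimal in-memory sketch. The class and method names (`OfflineFeedBuilder`, `enqueue`, `run_worker`, `homepage`) are assumptions made up for this example; a real site would use a proper queueing system and separate worker machines.

```python
from collections import deque


class OfflineFeedBuilder:
    """User actions queue up work; a worker does it later; pages only read results."""

    def __init__(self, compute):
        self.compute = compute   # the expensive calculation, run off-line
        self.pending = deque()   # user ids waiting to be recalculated
        self.queued = set()      # avoids queueing the same user twice
        self.results = {}        # user_id -> last calculated feed

    def enqueue(self, user_id):
        if user_id not in self.queued:
            self.queued.add(user_id)
            self.pending.append(user_id)

    def run_worker(self, max_jobs):
        """Process up to max_jobs queued users and return how many were done."""
        done = 0
        while self.pending and done < max_jobs:
            user_id = self.pending.popleft()
            self.queued.discard(user_id)
            self.results[user_id] = self.compute(user_id)
            done += 1
        return done

    def homepage(self, user_id):
        """Cheap read: whatever was last calculated, however stale it is."""
        return self.results.get(user_id, [])
```

The homepage is now quick no matter how busy everyone is, but what it shows is only as fresh as the worker’s last pass through the queue.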

98% of users accept that the section that used to be called “What your friends are doing right now” gets changed to “What your friends have recently been doing”. The other 2% of users throw a tantrum and accuse the site of being run by useless gibbering idiots.

Why Things Continue to Break:

Now that the site has moved to doing the cool features in off-line tasks, the resources will look something like this as number of users grow …

How resources get used up

When the site is still “small” the resource that’s being used up is often Dave’s Desktop Machine. You’ll be able to recognise these machines, they’ll be the ones tucked under a desk somewhere with a yellow PostIt note on saying …

“Admin0. Do not switch off!!!”

But let’s just assume there’s a pool of finite resource, no matter where it is. It can be used up in a few ways …

Hard Drive Space/Memory.

The amount needed keeps increasing until there’s no room left for the “Standard Features”. The cool stuff continues to work, but everything else on the site is starved of space and starts failing.

Otherwise known as “We ran out of swap space on Admin0”.

Money.

Everything is done in the cloud, so the solution is just to spin up more instances/pay for more CPU resource. As long as the site has money, then this isn’t a problem.

Time/CPU.

Any point where processing the data takes longer than the time it took to generate it. For example, those off-line tasks that could crunch through 24 hours’ worth of information in just 6 hours now take 27 hours.
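The arithmetic behind that is simple enough to sketch. `backlog_hours` is a hypothetical helper, and it assumes a single machine that can do at most 24 hours of processing per day.

```python
def backlog_hours(days, hours_to_process_a_day):
    """Hours of processing still outstanding after the given number of days.

    Assumes each day of site activity needs hours_to_process_a_day of work,
    and that one machine can do at most 24 hours of work per day.
    """
    backlog = 0.0
    for _ in range(days):
        backlog += hours_to_process_a_day   # a new day's worth of data arrives
        backlog -= min(backlog, 24.0)       # the machine works flat out all day
    return backlog


if __name__ == "__main__":
    print(backlog_hours(7, 6))    # the early days: always caught up
    print(backlog_hours(7, 27))   # later on: falling further behind every day
```

At 6 hours per day the backlog stays at zero. At 27 hours per day it grows by 3 hours every day, so after a week the machine is 21 hours behind and never catches up.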

They often come about when someone says “Hey wouldn’t it be cool if we could tell each user X based on stuff they and their contacts have done?” and someone else replies …

“Sure, I’ll just get my machine to grab the data, work out all the connections and put the results back into the database.” That machine then becomes “Admin0. Do not switch off!!!” and starts to get hot.

Space & Money are both fairly easily solved by throwing smarts and money at them.

Time, though, is the kicker: it’s a problem that any site depending on the “social graph” will hit sooner or later as its user base grows.

Sites can solve this for a while by getting faster machines (or at the very least adding more fans) but that only works for so long. That leaves the site a few other options …

“Federate the users”, which at its simplest basically means splitting the users down into smaller groups. Process all the data for one group on resource-pool A and the other group on resource-pool B.

Which means you can now get through all the calculations again, but group A don’t know what group B are doing and vice versa. Which involves users saying “I’m not getting all my updates, how hard can it be??” and throwing the aforementioned hissy fit. Deciding how to split the users up into sub-groups is a fine art. Much like the brain there’s probably a not-utterly-terrible way and a fucking awful way to split it into two halves if you really had to.
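To see why the split matters, here’s a small sketch that counts how many “follow” relationships end up straddling two groups. Splitting on `user_id % shard_count` is an assumption picked because it’s about the crudest split possible (and so probably in the awful category), and `shard_of` and `split_follows` are hypothetical names.

```python
def shard_of(user_id, shard_count):
    """Crude assumed split: assign users to groups by user id modulo group count."""
    return user_id % shard_count


def split_follows(follows, shard_count):
    """Separate (follower, followed) pairs into same-group and cross-group lists."""
    local, crossing = [], []
    for follower, followed in follows:
        if shard_of(follower, shard_count) == shard_of(followed, shard_count):
            local.append((follower, followed))
        else:
            crossing.append((follower, followed))
    return local, crossing


if __name__ == "__main__":
    follows = [(1, 3), (1, 2), (2, 4), (3, 4)]
    local, crossing = split_follows(follows, shard_count=2)
    print("updates that still arrive:", local)
    print("updates that go missing:", crossing)
```

Every pair in `crossing` is an update one group never hears about. A better split keeps that list short, which is where the fine art comes in.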

The problem is then normally solved by hiring very smart people with PhDs in Complexity Theory. The ones with the right skills are fairly rare and are either already the site’s CTO or employed by other sites trying to solve the same problem.

Another option (cheaper) is to simplify the calculations. The outcome of this is that everything starts working again (or hopefully didn’t break noticeably in the first place), you get a lot more headroom for the next several months and 98% of your users will continue to find the feature does pretty much what they want.

The remaining 2% will say “I’m not getting all my updates/some updates are slightly out of date, how hard can it be??” and once more throw a hissy fit, write blog posts, call everyone involved stupid and accuse them of callously rejecting their most influential early adopters.

Even though what the programmers have actually done is cleverly keep the site running for everyone for the next 6 months, including the 2% of vocal “cutting-edge” early adopters if only they’d shut up for a second. They’ll probably also open source their solution.

One last random reason why things may “break”

Something that turned out to be hugely popular was actually written by Dave in his favourite programming language, REBOL, as a side project on a stack of 20 networked Amiga 2000s.

WebOps refuse to support the Amiga 2000s. Dave re-writes the code in Perl and leaves taking his Amigas with him. Everyone still loves the feature but hates using Perl, it gets re-written 3 times in PHP, it still doesn’t scale.

Someone re-writes it in an afternoon in Python but it only works and scales if sub-feature “x” gets left out. 98% of users don’t notice, 1.9% of users form a protest #hashtag on twitter. 0.1% of users argue about the merits of scaling in PHP vs Python vs Their Favourite Language, they write a blogpost about it (using their own blogging platform they wrote themselves in 1997) slashdot links the post and ironically declares the original site “over”.

A couple of reasons why things may change

One or more of the current features, which seemed cool, turn out to create a system-gaming situation, allowing a small core group of users to cause antisocial behavior and polarized views, and to screw up data calculations that end up affecting all users, while also taking the site in a direction that the creators didn’t originally intend or want.

At that point those “damaging” features are removed, changed or replaced. Once again tantrums and accusations are thrown around, even though the overall aim is to keep the site ticking along nicely for everyone, thank you very much.

Another reason is just general improvements. Usually nothing is currently breaking or about to break (well, beyond the normal), but watching how the majority of users are using the site, and the tools surrounding the site, suggests ways in which the process can be improved and new technologies taken advantage of.

These will generally be to a) make things simpler/better for new users, and b) level-up the current majority of users by raising the profile of lesser used but cool features that fit in with the site’s overall strategy.

The main reaction to that is people just hating change.

A recap: FAQ

Q. Why, as a paying customer, wasn’t I asked?

A. Because we have so many users that they’ll all say different things. We have to try and keep the vast (silent) majority of our users happy, while still keeping our own goals for the site in sight.

(when I say “we” and “our” I mean “any site and their owners”)

A. Because the vast majority of users are silent, we use metrics to track what the vast majority are actually doing and where they are finding the main usefulness of the site. You are an edge case using the site in a particularly smart but resource-expensive fashion (probably).

A. Because we need to keep making money and keep 95% of the stuff you use paid for. If we have to cut resource-expensive feature “x” to allow us to cover the costs of everything else, then we’ll do that. Asking you won’t make any difference … or we can just slap more advertising and sponsorship everywhere so people can complain about that instead.

Q. Can’t you just give us the option to continue using the old version, it’s just a check-box and you have all the old code anyway.

A. No, because the old code wasn’t scaling; that’s why it had to be replaced. The new version is a lot faster/scales better. It’s just not possible to run both systems at the same time, for various architectural reasons. Sorry you hate it, but 98% of the people are better off.

Q. I manage the Databases for my company so I know it’s not that hard, how stupid are you?

A. Good for you … go get a PhD in Complexity Theory and start your own site.

Q. Why did you change the website/remove my favourite feature/have something not quite working right?

A. Because it’s better than having the whole site go down in a ball of fire and failure.

Q. New site Y has my favourite feature/can do X why can’t you?

A. New site Y only has 500 users, just wait until it has to scale … also all your friends are here.

Q. I Hate you, I hate you all! You’ve removed everything good about this site, I was here from the early days, you don’t respect anything your users want, I hate the way you run your site … *IF* there was an alternative that was any good I’d use them in an instant!

A. That’s not a question.

Part 2: If Anna can add feature X with greasemonkey and site-y does something “better”, why can’t your “brilliant” programmers do it? … coming “soon”.

Further Reading:

  • Flickr Engineers Do It Offline
  • Twitter, Ruby, and Scaling
  • Everything on the Etsy Code is Craft blog
  • The Art of Capacity Planning: Scaling Web Resources by John Allspaw
  • Web Operations: Keeping the Data On Time by John Allspaw & Jesse Robbins

Disclaimer:

The views expressed in this blog post are solely those of me and not of my employers both past and present. Most of these opinions were formed over several conversations while grabbing lunch from the Ferry Building in San Francisco on Tuesday lunchtimes.

Posted in general | 26 Comments

26 Responses

  1. on 27 September, 2010 at 6:20 pm k

    I love you, really hard, and quite inappropriately. Especially the bit where you did the graph with ‘stuff involving twitter’.


  2. on 27 September, 2010 at 10:14 pm Rod

    WHY HAVEn’T YOU WRITEN PART 2 YET??? #FAIL


  3. on 28 September, 2010 at 12:11 pm » TTMMHTM: Scaling and redesigns, iPad for access, old games, HTML5 polyfills and unicorns - Christian Heilmann's blog – Wait till I come!

    [...] Catt of the Guardian (and Flickr fame) has a great write-up on website redesigns and scaling issues with social elements.Ablenet provides us with SoundingBoard, an iPad/iPhone/iPad Touch app that allows you to create [...]


  4. on 28 September, 2010 at 9:02 pm Asif

    Hilarious post, and highly insightful. :)


  5. on 28 September, 2010 at 9:16 pm Jayel Aheram

    Sounds awfully familiar.


  6. on 28 September, 2010 at 11:27 pm Links – September 24th, 2010 « A Modern Hypatia

    [...] An interesting explanation of why a website might not be working – especially large, database driven sites. And why adding new features is not just “Let’s add everything anyone wants!” [...]


  7. on 29 September, 2010 at 1:41 am links for 2010-09-28 | links.kburke.org

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? « Rev Dan Catt's Blog Social media is really hard to scale. Features that work for 500 users break easily once you start adding more… the workarounds are imperfect at best (tags: startup) [...]


  8. on 29 September, 2010 at 3:30 pm droope

    Lawl!!!!!!!! :)))

    Thanks, entertaining read.


  9. on 29 September, 2010 at 4:46 pm John Fiala

    I liked your post – very nicely done. But the font is somewhat broken in my firefox, which left the text headachy to read.

    Happily, I’ve got firebug, so I went in and turned off the ‘ff-dax-web-pro-condensed-1′ and ‘-2′ font-family notes so that I could read it.


  10. on 29 September, 2010 at 5:28 pm Anna Savoy

    Funny stuff. Makes you wonder what it must be like to have a social site with half a billion users.


  11. on 29 September, 2010 at 5:31 pm About Website Redesign and Updates | friskyGeek

    [...] excellent, albeit long, post about website lifecycle and development. Anyone that has worked on website redesign projects of any kind can relate to this. Q. I Hate you, [...]


  12. on 29 September, 2010 at 5:54 pm Piotr P.

    It’s brilliant, thank you very much and I’m looking forward to the second part.


  13. on 29 September, 2010 at 11:57 pm Paul

    So damn true. Including the desktop machine/’admin0′ bit. I’ve got war stories from the early days of $large_popular_vod_site which I’ll reserve for the book I’ll write one day, but are on similar lines.

    My own rules of thumb are: design for failure (you’ll need it), cache everything (but only for long enough), only be fast/accurate enough, and denormalisation is cheap. Oh, and don’t let UX people dictate your algorithms, even indirectly.


  14. on 30 September, 2010 at 4:49 pm Chris Lake

    Lovely article and lots of sound insight.

    I always wondered why sites like eBay didn’t change, given the scale of their resources, the upside, and the schoolboy-level user experience issues and errors found on its website. The reason? eBay is too big and is scared of change, of upsetting users (including sellers), and isn’t agile enough to adapt quickly. The bigger you are, the more responsibility you have, and the harder it is.

    It raises another question: at what point does something become *broken*? A tyre with a puncture will still rotate. Is it better to keep an awkward user interface that millions of users have learned, or to rebuild a new intuitive one (that forces users to reprogramme their brains)? Users might not like change, but what they really hate is the cognitive effort in figuring out what’s new. Multiple that by millions of users and you could have one mighty pissed off user base.

    I favour an ‘iterate and optimise’ approach. Lots of constant small changes rather than one big one every few years. That way you learn as you go along, you work in an environment based on data and listening, and users should be happier at adapting to more gradual change.

    Anyway, maximum kudos for the post.


  15. on 30 September, 2010 at 6:22 pm Rikki

    Admit it.. you got the inspiration for this excellent blog post from the forums I run, because the user responses are spot on.. except your ones don’t have auto censored words in them. One part you forgot though Is when you update all the more vocal users will be crying about how the previous version was the ever and perfect …. then in a few years when the process happens again those same users will declare that in fact this version is the bestest ever and ever and how dare you change it… and so the cycle continues


  16. on 30 September, 2010 at 7:44 pm Stephen Burgoyne Coulson

    The issue is rudeness (and not the rudeness of the users – which is often a reaction to the rudeness of the site owners).

    When a site pulls a change and then just lays it on the users (and tries to sell it to the users as something fabulously new and rad just for them) then the users have every right to throw a hissy fit. They were using it, used to it, knew where all the stuff was and no matter how much better/cooler/faster/cheaper it is now you are still a dink for dumping this surprise on them.

    Give users reasons for change and time up front and don’t lie about how you are “improving things” for them (e.g. “store closed to serve you better”) – use reason and honesty in a timely fashion. Yeah, you might be worried about how competitors might leverage the info but the site is about your users, not your competitors (or even about the technology that runs it for the most part). If people pay, keep them happy and treat them well.


  17. on 30 September, 2010 at 7:47 pm Noah

    How about just not adding that useless social graph shit in the first place, that only 5% of users use? I don’t like it, don’t use it, don’t want to be bothered with it, and I liked the old site better.

    Diminishing returns. That’s what it is.


  18. on 30 September, 2010 at 10:26 pm Canada Travel Guide News | Canada Travel Guide

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? « Rev Dan Catt’s Blo… [...]


  19. on 30 September, 2010 at 10:44 pm Thailand Travel Guide News | Thailand Travel Guide

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? « Rev Dan Catt’s Blo… [...]


  20. on 1 October, 2010 at 3:01 am links for 2010-09-30 (Jarrett House North)

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? « Rev Dan Catt's Blog The developer's view to feature selection, or why some good features die when sites scale. Brilliant. (tags: development webdesign productmanagement) [...]


  21. on 1 October, 2010 at 9:35 pm The Technology newsbucket: mobile malware, shorter Google, Yahoo sheds and more | Nur, was da steht

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? >> Rev Dan Ca…Of Flickr (and GuardianRoulette) fame: “Everyone still loves feature X but hates using Perl [in which it's been re-written], it gets re-written 3 times in PHP, it still doesn’t scale. “Someone re-writes it in an afternoon in Python but it only works and scales if sub-feature “x” gets left out. 98% of users don’t notice, 1.9% of users form a protest #hashtag on twitter. 0.1% of users argue about the merits of scaling in PHP vs Python vs Their Favourite Language, they write a blogpost about it (using their own blogging platform they wrote themselves in 1997) slashdot links the post and ironically declares the original site “over”.” [...]


  22. on 2 October, 2010 at 2:40 am Hank

    I’m not sure how you got into my company’s last ‘Strategic IT Planning Meeting’ but thanks for taking such good notes!


  23. on 5 October, 2010 at 11:49 pm » Practical Progressive Enhancement – my talk at the Future of Web Apps London 2010 - Christian Heilmann's blog – Wait till I come!

    [...] video of Dav Glass explaining the detailsYUI supporting capability-based loading and CSS transitionsProblems of scaling web sites and redesignsDouglas Crockford: [...]


  24. on 14 October, 2010 at 6:54 pm Blog SWL-Projekt » Practical Progressive Enhancement – my talk at the Future of Web Apps London 2010

    [...] Problems of scaling web sites and redesigns [...]


  25. on 10 March, 2011 at 8:03 am The Technology newsbucket: mobile malware, shorter Google, Yahoo sheds and more

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? >> Rev Dan Ca…Of Flickr (and GuardianRoulette) fame: “Everyone still loves feature X but hates using Perl [in which it's been re-written], it gets re-written 3 times in PHP, it still doesn’t scale. “Someone re-writes it in an afternoon in Python but it only works and scales if sub-feature “x” gets left out. 98% of users don’t notice, 1.9% of users form a protest #hashtag on twitter. 0.1% of users argue about the merits of scaling in PHP vs Python vs Their Favourite Language, they write a blogpost about it (using their own blogging platform they wrote themselves in 1997) slashdot links the post and ironically declares the original site “over”.” [...]


  26. on 15 September, 2011 at 2:17 am Mobile malware, shorter Google, Yahoo sheds and more | Tech2Crave

    [...] A user’s guide to websites, part 1: If it wasn’t broken why fix it? >> Rev Dan Ca… Of Flickr (and GuardianRoulette) fame: “Everyone still loves feature X but hates using Perl [in which it's been re-written], it gets re-written 3 times in PHP, it still doesn’t scale. “Someone re-writes it in an afternoon in Python but it only works and scales if sub-feature “x” gets left out. 98% of users don’t notice, 1.9% of users form a protest #hashtag on twitter. 0.1% of users argue about the merits of scaling in PHP vs Python vs Their Favourite Language, they write a blogpost about it (using their own blogging platform they wrote themselves in 1997) slashdot links the post and ironically declares the original site “over”.” [...]



Comments are closed.

  • About

    Dan Catt works at the Guardian doing some serious (and not so serious) number crunching. Previously he spent 4 years working at Flickr as a frontend engineer (+stuff), from when it had newly moved to California until about 4 Billion photos later.

    These views do not reflect those of the Guardian or Flickr. Apart from any sweary bits which totally reflect those of Flickr.

    Twitter: @revdancatt
