On the internet, success is most likely to cause failure.
That’s a paradox that can take entrepreneurs by surprise. Maybe it’s true about everything? The internet simply makes it happen faster. A web server that was happily serving 100 people today probably isn’t going to happily serve 100,000 tomorrow. At the very moment the world has finally discovered your solution to life-the-universe-and-everything, that solution dies in a heap of technical errors, and the world goes away again, suitably unimpressed.
This article describes my personal experiences of scaling a website up to serve tens of thousands of users each day, most recently by hosting Gnomish literature in a “Cloud” environment.
The graph below shows total monthly views of pages with content, for El’s Anglin’ and Professions domains. The period is the 12 months from June 2008 to the end of May 2009. The vertical scales for each domain are slightly different: El’s Anglin’ almost reaches 3 million monthly page views, while El’s Professions peaks at 1.3 million. Page views are shown here, rather than visitors, because it is page views that tend to cause scaling problems.
Monthly Page Views
Anglin’s growth looks fairly modest, but is still quite significant: from 0.9 million page views in September to 2.5 million in December – almost a 3-fold increase in 4 months. El’s Professions saw 4-fold growth from September to October, then quickly fell back. Both patterns reflect changing demand for content:
- The steady rise in page views for El’s Anglin’ towards the end of 2008 reflects rising interest in fishing brought by the latest World of Warcraft expansion. As seen in the previous expansion, interest in fishing lags slightly behind other parts of the game.
- The final peak in April 2009 coincided with a new game patch, which added a lot of fishing content.
- The pre-expansion game patch in October 2008 introduced Inscription – the main subject of El’s Professions. In the first week interest in the subject was intense, but very little has since changed.
However, monthly variations only hint at the real hour-by-hour chaos within.
The server resource graph below is a good example of the type of short-term peaking in activity that World of Warcraft players can cause. The graph is for Saturday 12 December 2009. The x-axis is Pacific Standard Time (hours). The 2 y-axes show average processor load (how much work the server’s CPU is doing) and memory used (how much physical memory is being used by the server), respectively.
The absolute values aren’t important. Instead, look at the variations: between 02:00 and 05:00, server activity increases by a factor of at least 5. Then after a gap of a few hours a similar pattern occurs again, before finally settling down in the evening.
What’s happening? 12th December was the first time the Kalu’ak Fishing Derby had been held. The event occurred at 14:00 local time. The 05:00 peak is due to European players. The later peaks occur as the contest moves across the 4 main United States timezones.
In this case I was fortunate: Different parts of the world peaked at different times. But you can hopefully visualise what would happen if everyone arrived at about the same time. October 2008 and April 2009 both saw just that. It was messy.
For the website operator this poses some problems: How much slack do you pay for to ensure you can handle the peaks – 5, 10, 20 times the normal capacity? Can you even fund that much redundancy? And when a peak does arrive, you’re probably already busier than usual, with even less spare time to panic about technical problems – for example, the peak is usually caused by new content, which is (normally) still being researched or checked.
Survival of the Fittest
Blaine Cook’s instruction, to “Cache the Hell out of Everything”, is good starting advice. Here caching means creating a static copy of a page, which can be served to most readers without the webserver having to process any information. When first writing a website it’s normally much easier to hold all the information in one database, and use templates to format that data as users request it. But if (as for El) pages rarely change, this becomes terribly inefficient: during a peak, several pages may be requested each second, so the webserver is probably processing precisely the same thing several times each minute – wasting computing power just when it is under greatest pressure.
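The idea can be sketched in a few lines of Python. This is a minimal illustration, not the site’s actual code: the cache directory, the time-to-live, and the `render_page` stand-in are all invented for the example.

```python
import os
import time

CACHE_DIR = "cache"   # hypothetical directory for static copies
CACHE_TTL = 3600      # rebuild a page at most once per hour

def render_page(slug):
    """Stand-in for the expensive database-and-template rendering step."""
    return f"<html><body>Content for {slug}</body></html>"

def serve_cached(slug):
    """Serve a static copy if one exists and is fresh; rebuild otherwise."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, slug + ".html")
    fresh = (os.path.exists(path)
             and time.time() - os.path.getmtime(path) < CACHE_TTL)
    if not fresh:
        # The expensive step now runs once per hour, not once per request.
        with open(path, "w") as f:
            f.write(render_page(slug))
    with open(path) as f:
        return f.read()
```

During a peak, every request after the first simply reads a file – work a webserver can do very cheaply.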
Another discovery was the value of using far fewer supporting files – scripts, stylesheets, images. For example, El’s Inscription displayed a separate small icon image beside almost every item on the page, intended to help the reader quickly recognize items. Each file was tiny, but there were a lot of them, so the webserver was actually serving tens of files each second.
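For text assets such as stylesheets and scripts, the standard fix is to combine many small files into one, so the browser makes one request instead of dozens (images need a sprite sheet instead). A minimal sketch, with the filenames and contents invented for illustration:

```python
def combine_assets(sources):
    """Concatenate many small stylesheet/script sources into one payload.
    `sources` maps filename -> file contents (strings)."""
    parts = []
    for name, body in sources.items():
        parts.append(f"/* {name} */")   # marker comment, so origins stay traceable
        parts.append(body.strip())
    return "\n".join(parts)
```

The combined file is slightly larger than any individual piece, but one request for it costs the server far less than dozens of requests for fragments.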
Designing with the expectation of high numbers of users helps to keep the peaks manageable. But ultimately a very popular website will exhaust the technical resources of a single webserver, however well-designed the content is.
I struggled to define Cloud computing last March, and I’m still struggling. I fear it will become a new “Web 2.0” – a phrase that’s so cool to use, it ceases to have meaning. The basic concept involves using the pooled resources of many computers. In practice, Cloud hosting is then defined by 2 criteria:
- Is the cloud all in the same place (data centre) or are the Cloud’s servers spread across the world? In the first case, Cloud hosting will only be as reliable as the data centre the computers live in. If the link to that location breaks, or that data centre loses power, it’s game over. Single-location clouds might be called a “very large Virtual Private Server”. Generally hosting in one place is technically easier, meaning fewer limitations, but obviously results in less redundancy.
- Are you simply buying computing/storage power (which you then administer yourself) or are you buying webspace on a hosting platform (fully maintained by the hosting company)? A lot of commercial Cloud hosting (such as Amazon Web Services) essentially involves buying computing/storage power, and then largely managing it yourself. Great for very large websites, but not so great for me: I need the flexibility to scale from 1 server to erm, maybe 2, and I don’t want the hassle of actively managing anything.
Within “platform clouds”, it seems there is a trade-off between redundancy and software/programming flexibility. For example:
Google’s App Engine scales up from the smallest website, and is hosted in more than one place. Great. But the platform is optimized for code written in specific languages (currently Python and Java), with significant limitations on database queries (notably no joins between tables within queries – no relational databases in the conventional sense). So while it is probably perfect for a new programmer’s project, an existing website would need to be heavily re-written.
Rackspace’s Cloud Sites are optimised for more traditional website programming (PHP/mySQL, ASP/MSSQL), but everything is hosted in one place – slightly more flexibility for slightly less redundancy. Personally, although I use Python on El’s sites (primarily to build static pages), for historic reasons all the dynamic content is written in PHP. In the final analysis, I was looking for a PHP-based platform with just enough Python support to let me run scripts in the background.
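To make the App Engine join limitation concrete: without joins, the usual workaround is to fetch the two entity sets separately and combine them in application code. A sketch with plain Python dicts standing in for datastore entities – the item and profession data here are invented purely for illustration:

```python
# Invented example data: what two separate (join-free) queries might return.
items = [
    {"id": 1, "name": "Glyph of Fishing", "profession_id": 10},
    {"id": 2, "name": "Glyph of Inscription", "profession_id": 20},
]
professions = [
    {"id": 10, "name": "Fishing"},
    {"id": 20, "name": "Inscription"},
]

def join_client_side(items, professions):
    """What SQL would do in a single JOIN, done instead as two fetches
    plus an in-memory lookup table."""
    name_by_id = {p["id"]: p["name"] for p in professions}
    return [{"item": i["name"], "profession": name_by_id[i["profession_id"]]}
            for i in items]
```

Simple enough for two tables, but every relational query in an existing site needs this kind of rewrite – which is why porting to such a platform is heavy work.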
Moving from a conventional Virtual Private Server to Rackspace Cloud Sites still caused a few problems.
On a conventional web host, you and your code would have access to a command line – a text prompt from which to run programs. Cloud Sites have no command line. This isn’t just a problem for you: scripts running on the webserver that try to execute a system command tend to fail. For example, PHP’s exec() seems all but worthless. Fortunately Python (and Perl) scripts can be run from the public side of the website. They have to be placed in a specific directory (cgi-bin – just like in the 1990s). I am not sure how efficiently such code is run, so this might not be suitable for every use. But it allowed me to build a small private administration area, loaded with all sorts of useful scripts.
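A script of this kind is just the classic CGI pattern: print HTTP headers, a blank line, then the page body. The sketch below is illustrative – the page content is invented, and nothing here is specific to Rackspace:

```python
#!/usr/bin/env python
# Minimal CGI-style script of the kind described above: dropped into
# the site's cgi-bin directory and requested over HTTP.

def response():
    """Build a complete CGI response: header, blank line, then body."""
    body = "<html><body><h1>Admin scripts</h1></body></html>"
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    import sys
    sys.stdout.write(response())
```

Anything the script writes to standard output after the blank line becomes the page the browser sees – which is enough to build small administrative tools without any shell access.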
There are many little limitations – things that simply require a different method to be adopted. For example, symbolic links don’t appear to be supported. These would normally allow one directory to virtually mirror another. I host several sets (different sizes and coloring) of every World of Warcraft icon image – about 25,000 files in total. Most will never be used, but it is easier for me to upload them all than to work out which I need. I used to host one copy for El’s Anglin’, and symlink to El’s Professions: the same image became available from either domain. Now all the Professions icons directly reference the Anglin’ site in their URL.
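The symlink workaround amounts to a one-off rewrite of icon references, pointing them at the domain that holds the single real copy. A minimal sketch – the base URL and the `/icons/` path layout are assumptions for illustration, not the site’s actual structure:

```python
import re

# Hypothetical base URL for the domain that actually stores the icons.
ICON_HOST = "http://www.elsanglin.com"

def absolutise_icons(html):
    """Turn relative /icons/... references into absolute URLs on the
    host that stores the files, replacing what a symlink used to do."""
    return re.sub(r'src="(/icons/[^"]+)"',
                  lambda m: f'src="{ICON_HOST}{m.group(1)}"',
                  html)
```

Run once over each generated page, this makes the second domain’s pages load their images from the first – no duplicate uploads, no symlinks required.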
Ironically, the biggest single annoyance is the painfully slow-to-load Rackspace Cloud administration interface. Fortunately once everything is set up, one rarely needs to use it.
Bitter experience has taught me that there is no such thing as a reliable web host – although, generally, the more you pay, the more reliable hosting gets. It is too early to judge performance fairly, but the last 2 weeks of December have been less than stellar:
- A complete outage for about 40 minutes. Fortunately, Rackspace’s Cloud hosts many high-profile websites, so outages tend to get noticed (18th).
- Several hours of “degraded performance” – meaning, “your site is so slow most users will give up before anything loads” (27th).
- And the worst user error message I’ve seen for some time:
“Unfortunately there were no suitable nodes available to serve this request.” Say what? This isn’t just user-unfriendly techno-babble. The system seems to ignore my custom files for handling internal server errors, which would give a slightly more friendly message.
The main advantage of the current hosting is that I no longer waste time worrying about whether the allocation of resources is sufficient. That became a significant distraction on the previous Virtual Private Server. Hopefully the technical glitches will get smoothed out as Cloud web-hosting matures.
Charges are based on a fixed monthly fee, with additional charges for disk space (site storage), bandwidth (size of pages/files transferred), and “computing cycles” (a measure of processor usage) above a basic quota. Based on the first 2 weeks, I estimate that the computing cycles quota will sustain 2-3 million page views per month (including all the related images, styles and scripts), and I won’t run out of bandwidth until at least 5 million page views per month.
Almost every page contains advertising. I currently earn more per advert than it will cost to serve the additional pages. That can’t simply be assumed on a video-games website: we attract the cheapest advertising, and when there is more advertising inventory than advertisers to fill that space, we’re the ones that run empty ad spaces. So in a depressed economy, we suffer more than most.
Last year I argued that cloud computing could allow hyper-virality (near-instantaneous adoption of a service or product), and I think that’s still where we’re heading. What’s changing now is that cloud computing is becoming an option for more modest websites, run by people outside the “tech bubble”.
There are trade-offs and limitations. But a lot of individuals and smaller businesses aren’t doing anything more complex than running standard Content Management or ‘blog software. Cloud-based hosting could give them the reassurance that the moment the world suddenly wants to discover them, their website won’t die in a heap of embarrassing technical errors.