Operations Category

Velocity: Jack and SAN, sitting in a tree

D-E-P-LOY-I-N-G.

Tonight at Ignite Velocity, Jack Valko (our Sr Director of Ops) spoke about the WhitePages.com data storage layer - in particular, why we chose a SAN over a collection of local disks for our 180MM listings, their PostgreSQL masters, and their many, many indexes provided for fast searching across virtually any collection of columns.

In the consumer software startup market, commodity (and, yes, cloud) hardware - dozens of 1Us with local drives, failing over all night long - is the norm. Our unusual decision saved dollars, operational time, and Planet Earth, though, and three years later we’re still happy with it.

Plus, free cocktails. Check out the slides; video coming soon.

From Velocity2008: Perceived Render Time

Ready for the latest new buzzword in web performance? We’ve been talking about this for a while internally, but today during the panel discussion Scott uttered those three little words that make our toes curl: perceived render time. Perceived Render Time (PRT) is the amount of time a user waits until a functional part of a web application can accept interaction.
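To make the definition concrete, here is a minimal Python sketch that computes PRT from two client-side timestamps. The `PageTiming` type and its field names are hypothetical, made up for illustration; how those timestamps actually get captured is browser-specific and is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageTiming:
    """Hypothetical timestamps reported by one page view (milliseconds)."""
    navigation_start_ms: int  # when the user requested the page
    interactive_ms: int       # when the first functional element accepted input


def perceived_render_time_ms(timing: PageTiming) -> int:
    """Return PRT: the wait from request until the page can accept interaction."""
    elapsed = timing.interactive_ms - timing.navigation_start_ms
    if elapsed < 0:
        # A negative value means the two clocks or marks are inconsistent.
        raise ValueError("interactive_ms precedes navigation_start_ms")
    return elapsed


sample = PageTiming(navigation_start_ms=1_000, interactive_ms=2_350)
print(perceived_render_time_ms(sample))
```

The arithmetic is the easy part; the hard part is getting a trustworthy `interactive_ms` out of the browser in the first place.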

Unfortunately, browsers are not created equal and the data they collect under the hood isn’t reliable. For instance, one thing we tried to measure was the amount of time it took for a page to reach DOMReady status, which is supposed to be when the document completes writing. This could be a good indication of render time, but DOMReady is implemented inconsistently across browsers so it is not a reliable way to measure performance in a Web2 world. Using a stopwatch, QA could never get a qualified reproduction of their experience vs. what the browser was telling them.

Has anyone solved this problem?

From Velocity2008: Performance Measurements in the Wild

Measuring a web application’s performance used to be limited to instrumenting CPU and IOPS on web servers, database throughput or query times, and the backend network that connected your servers. In a Web2 world this model is outdated since the vast majority of your application interaction happens either somewhere in the cloud or in your users’ browsers. In other words, completely outside your datacenter. Products are coming to market to try to quantify this interaction and deliver data to businesses that rely on this new reality. You cannot manage what you cannot measure.

Web2 admins are realizing the scope of this issue and are hungry for access to these new streams of data. Hyperic today announced CloudStatus.com, a 3rd party measurement tool for Amazon’s EC2 and S3. Much like Keynote and Gomez cut their teeth in Web1, CloudStatus attempts to quantify and qualify the performance and availability of Amazon’s popular service. If you’re relying on a cloud to provide critical services, CloudStatus is on the right track to deliver data not only when things are going well, but also to assist in debugging when things are going horribly, terribly wrong.

Software developers are starting to get this new reality too. Frankly, it’s about time. After fighting with developers for years to make their code more efficient, I’m glad they are finally on my bus. Cool products like YSlow and Jiffy start to deliver tools to developers that help them diagnose and debug algorithms or page layouts that lead to poor user experiences. The data these tools gather are in the Ops wheelhouse and provide some common ground for hardware and software guys to meet on.

As first steps these are great, much better than what we had only a few years ago. What we’re still missing is a holistic view of Web2 that includes all of the factors in a chain that can affect user experience. After all, it’s your website no matter where your content comes from. Anything that affects revenue in a negative way will get the attention of a business analyst and bring him or her to your cubicle sooner or later. Obtaining a universal view of the entire system would give administrators and analysts the tools required to enforce SLAs, debug your code, debug your partners’ code, etc.

And while I’m thinking of SLAs, it is worth noting they are completely irrelevant now. They do nothing to address this problem, are difficult if not impossible to enforce, and usually provide loopholes that enable bad behavior and poor business practices.

From the audience at Velocity2008: David Slays Goliath

The parade of morning keynotes and product introductions included some heavy hitters, namely Keynote and Google, and one lightweight, WhitePages.com. Taking a back seat to these guys is usually fine with us as we stay out of the line of fire and can make mistakes with very few people actually noticing. We also learn a lot from the big boys and tend to follow their lead. After all, these are the industry heavies and can attract and hire more smart people and buy more infrastructure than we could ever hope to acquire.

Keynote demoed a free product today called KITE (Keynote Internet Testing Environment) which allows developers to instrument a workflow or click-stream in a web application and benchmark it from 3 locations in the US to gauge performance and user experience. For a free service it is pretty good, and the audience seemed somewhat receptive to the idea that it’s okay to send your performance data to Keynote, but someone behind me muttered, “How much does it cost for more than 3 datacenters, I wonder?”

Then Scott took the stage and announced Jiffy, which is basically the same product except you get performance telemetry from all the streams you care to implement, from all of your users, a few seconds after they happen. Oh, ours is free too and we give you the source.

A little guy with a slingshot (a good idea) and a few round rocks (some wicked free code) can do more than break a few windows …

Scott Ruthfield Announces Jiffy

Introducing Jiffy: Real-World Browser Instrumentation and Reporting

During 2007, we shipped a lot of new features on WhitePages.com and other sites in the network. While each of these improved the functionality and usefulness of our service for our customers, we weren’t doing a great job tracking the performance impact of these changes – and so our site was getting slower. Performance is a feature too and we had ignored it.

So we grabbed Firebug and started loading our pages over and over again, looking for patterns. This anecdotal foray just sent us on wild chases, so we backed up and looked for a system that would help us see what our customers were seeing, and allow us to measure every little thing – when individual third-party components loaded, how long it took the search forms to render, etc.

We didn’t find one. We found some third-party tools from the usual suspects that sell performance monitoring services – we even implemented one of them – but we found that they weren’t flexible enough to give us the data we needed, slowed down our site, and weren’t real-time enough for our taste.

So for Hack Week, Don & Jack hacked together a project that would log how long it took to render each page in the proxy logs, and then report on the performance. Later in the year, Ben came on board and added the ability to capture individual moments on the page and write those to the proxy log as well, Devin designed the database schemas and rollups, John and Travis made sure it worked, and I… did some things. So all told, we built this thing that we now use that tells us how long some things take to happen. No single part of this system is rocket surgery, but the entire combination is truly something that we haven’t found in the market from anyone – paid, free, self-hosted or service-hosted. We’re proud of the work and already see benefits from it.
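For a rough feel of the "log it, then roll it up" idea, here is a minimal Python sketch. It is not Jiffy's code: the log line format, the field names, and the `rollup` function are all assumptions invented for illustration, and a real rollup would live in database tables rather than in memory.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical log line format, for illustration only, e.g.:
#   "page=/search mark=search_form elapsed_ms=412"
LINE_RE = re.compile(r"page=(?P<page>\S+) mark=(?P<mark>\S+) elapsed_ms=(?P<ms>\d+)")


def rollup(lines):
    """Group timings by (page, mark) and summarize count, mean, and max."""
    buckets = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip log lines that carry no timing data
        buckets[(match["page"], match["mark"])].append(int(match["ms"]))
    return {
        key: {"count": len(values), "mean_ms": mean(values), "max_ms": max(values)}
        for key, values in buckets.items()
    }


log_lines = [
    "page=/search mark=search_form elapsed_ms=400",
    "page=/search mark=search_form elapsed_ms=600",
    "page=/search mark=ad_loaded elapsed_ms=1500",
    "GET /favicon.ico 200",  # unrelated line, ignored by the parser
]
for key, stats in sorted(rollup(log_lines).items()):
    print(key, stats)
```

Keying on both page and mark is what lets a report separate "the search form was slow" from "a third-party component was slow" on the same page.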

Additionally, late last year, we were discussing how we wanted to give back to the open source community: as a LAMPP (Linux, Apache, MySQL, PostgreSQL, & Perl) shop, open source systems have played a core role in our company’s technology stack. Individuals on the WhitePages team have occasionally contributed to projects, but we didn’t have anything significant that we as a company brought to the table.

So today, we’re releasing that performance toolset as an open source project, under the Apache 2.0 license, with the genuine hope that other web publishers get benefit out of the ability to gather real-world measurements from all of their users about the timings of individual aspects of their pages. Jiffy is the name for the toolset. There’s a whole other blog post about the naming process, but a jiffy is a standard term for a small unit of time, and that’s what we’re helping you measure.

We announced the public availability of Jiffy this morning at the O’Reilly Velocity Conference, with a Plenary session following Bill Coleman (the B in BEA Systems – I think of myself as the R in WhitePages.com) and a new release from Keynote Systems. The code is linked from code.whitepages.com (also a new site for us) - the slides are there as well, and the video will be available soon. We’ll have more to say about Jiffy in the upcoming days, but in short – take a look, let us know if you have questions (you can use the Jiffy Google Group so everybody can see them, but of course we’ll be here too), and we look forward to seeing how others use Jiffy for their sites!