More about the Whitepages Developer API
Now that you’ve read Scott’s big picture posting about the new WhitePages.com public API offering, let me tell you a little about the down and dirty of developing our new API. Our data covers 180 million people and provides approximately 80% coverage of the US. When the opportunity came across my desk to build the API that would allow us to share that data, I was elated.
Let me start by giving you an overview of how we deal with the hard problems of searching those 180 million listings in under a quarter of a second and delivering them to the front-end website.
Some people think we have ‘just a database’ or ‘it’s just a website’. But what we do is hard work. We have multiple data vendors, some onsite and some offsite via their own API calls, all with differing data formats and the merge issues that result. Our onsite data takes up 3 terabytes of storage (with indexes), is rebuilt monthly with no identification to tie the data together, and handles billions of queries per year.
We use Oracle, Postgres, MySQL, and BerkeleyDB, depending on which has the strengths we need for any given job. We handle residential, nicknames, households, business listings and work number listings. Our data can be bizarre with fractional streets, decimal house numbers and misleading names like streets named “North”.
Yes, “it’s just a website” that happens to power 1300 affiliate sites, does over 100 million searches and has 34 million unique users per month. We have tiered, redundant systems with strict privacy controls that allow for non-published numbers and our own opt-out list.
All of this is built using Linux, Apache mod_perl and our own special sauce, and it runs on just 60 boxes (16 run our backend code). Our work includes an internal search API that is strong on speed, comprehensive in its searches and absolutely inappropriate to turn loose on the world (some of our return keys have bizarre names). What we needed was an extensible platform that would allow us to wrap our own API, make it palatable and easier to work with, and provide multiple output formats.
Back in October, Colin (one of our Architects) and I sat down to sketch out what this would look like. We decided that we would leverage our known strengths and use Apache mod_perl, a YAML file for config, Oracle for user preferences and an on-disk cache of those preferences to ensure reliability. It would have to be extensible to allow for new search types and versions, and we would have to allow for small developers, large partners who could send millions of queries/day, and internal use. We considered SOAP but decided that a RESTful interface was easier for more people to interact with. We would provide an XSD for people to validate the XML against and JSON output for those who were doing JavaScript. New versions would only rev the version number when the output format changed, but everyone would get additional data entries and data fields as they became available.
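To make the "on-disk cache of those preferences" idea concrete, here is a minimal sketch. It is written in Python purely for illustration (the real system is Perl), and every name in it (`PreferenceStore`, `fetch_from_db`, the `daily_limit` field, one JSON file per key) is a hypothetical stand-in rather than anything from the actual code. It assumes API keys are safe to use as file names.

```python
import json
import os
import tempfile


class PreferenceStore:
    """Look up user preferences, keeping a fallback copy on disk.

    Hypothetical sketch: fetch_from_db stands in for the database lookup
    and is any callable that maps an API key to a dict of preferences.
    """

    def __init__(self, fetch_from_db, cache_dir):
        self._fetch = fetch_from_db
        self._cache_dir = cache_dir

    def _path(self, api_key):
        # Assumes api_key contains only filesystem-safe characters.
        return os.path.join(self._cache_dir, f"{api_key}.json")

    def get(self, api_key):
        try:
            prefs = self._fetch(api_key)
        except Exception:
            # Database unreachable: serve the last copy written to disk.
            # If no copy exists yet, the FileNotFoundError propagates.
            with open(self._path(api_key)) as fh:
                return json.load(fh)

        # Refresh the disk copy. Write to a temp file and rename it so a
        # reader never sees a half-written cache entry.
        fd, tmp_path = tempfile.mkstemp(dir=self._cache_dir)
        with os.fdopen(fd, "w") as fh:
            json.dump(prefs, fh)
        os.replace(tmp_path, self._path(api_key))
        return prefs
```

The point of the design is that a database outage degrades to slightly stale preferences instead of failed requests; how stale a copy is acceptable is a policy decision this sketch does not address.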
We looked at writing our own user management system, but decided the way to go was to partner with Mashery and leave that and the community site up to their infrastructure while we focused on building the actual API.
We build OO Perl here so our first order of business after sketching out the rough requirements was to determine what classes would need to be built. We would need an Apache response handler which would handle the overall logistics, something to clean up and validate input, a class to take the output from the search and process it, and an output transformation class that would take that output and deliver it in whatever output format was requested. All of these factory classes would need to be versionable to allow for changes within our internal API as well as updates to our XSD as we build more functionality into our public API.
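The division of labor described above can be sketched roughly as follows. This is an illustrative Python outline, not the Perl classes themselves: the function and class names, the internal keys `nm_disp` and `ph_1`, and the version string "1.0" are all invented for the example, and the `search` argument stands in for the internal search API.

```python
import json
from xml.etree import ElementTree as ET


def clean_input(params):
    """Validate and normalise raw query parameters (hypothetical rules)."""
    name = params.get("name", "").strip()
    if not name:
        raise ValueError("name is required")
    return {"name": name.lower()}


def process_results(raw_rows):
    """Map internal result keys (invented here) onto public field names."""
    return [{"name": row["nm_disp"], "phone": row["ph_1"]} for row in raw_rows]


class JsonOutput:
    content_type = "application/json"

    def render(self, listings):
        return json.dumps({"listings": listings})


class XmlOutput:
    content_type = "text/xml"

    def render(self, listings):
        root = ET.Element("listings")
        for item in listings:
            node = ET.SubElement(root, "listing")
            for key, value in item.items():
                ET.SubElement(node, key).text = value
        return ET.tostring(root, encoding="unicode")


# Factory table keyed on (API version, format). A new version or format
# is a new entry here; existing entries stay untouched.
OUTPUTS = {("1.0", "json"): JsonOutput, ("1.0", "xml"): XmlOutput}


def output_for(version, fmt):
    try:
        return OUTPUTS[(version, fmt)]()
    except KeyError:
        raise ValueError(f"unsupported version/format: {version}/{fmt}") from None


def handle(version, fmt, params, search):
    """Response-handler logistics: clean, search, process, transform."""
    query = clean_input(params)
    listings = process_results(search(query))
    out = output_for(version, fmt)
    return out.content_type, out.render(listings)
```

A caller would invoke something like `handle("1.0", "json", {"name": "Doe"}, search)` and get back a content type and a rendered body. Keeping the version in the factory key is one simple way to let old and new output formats live side by side.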
Once we got the generic framework worked out and determined that we could leverage it to handle the Mashery integration as well, it allowed us to bring Ewa onboard (giving her a good view of development from the other side of the fence). Ewa has been with WhitePages QA for just over three years and she is my go-to person if I need any question answered about testing our internal API. She stepped up and took point for Mashery integration without missing a beat in addition to her duties doing end to end testing of our final product.
All through November and into December, Colin and I worked out the details and wrote the code, taking the blueprints and making them real. By January 10th we had a rough and ready working version showcasing our three main search types, and just in time: Hack Week was looming and many members of our engineering team were chomping at the bit to get their hands on this API. You can see some of the products of Hack Week showcased in our sample applications section over at developer.whitepages.com. It also allowed Zine, a new member of our QA team, to hit the ground running and start devising new and intricate ways to torture our poor code before we gave the final stamp of approval.
Hack Week was a lot of fun for me: I babysat the code as it was really being used for the first time, and I found out for myself how easy it was to extend the framework to deliver data that isn’t accessed from our internal API. How easy? Well, in two days I had two totally different methods built, both of them accessing raw data directly and serving it out. It’s always a pleasure to find out that your design decisions really do work out the way you planned.
So what have we been doing since then? Writing the Technical Documentation that you see at developer.whitepages.com/docs, fixing the bugs we’ve found during the QA testing phase, writing and testing the Mashery integration code, working with business to allow us to expose nearly all of our data to you, and generally wondering when the other shoe would drop. This project has gone way too smoothly and we really couldn’t have done it without the help of our full team. I was reflecting on the number of people who have had a hand in this process and while I won’t name them here, the number exceeds 30 and spans nearly every functional group within the 20% of the company that it represents.
It’s been a fun couple of months and I can’t wait to see what else we come up with for the API. I’ve got my list but I’m even more excited to see what other people will come up with in their own wish lists.
Cheers!
Dan Sabath, api lead dev