Wednesday, April 8, 2009

Google App Engine and BigTable - VERY interesting!

Every so often you come across a radically different approach to a certain class of data processing problem which makes you completely rethink what you knew before about how best to develop applications in that space. Systems which I would put in this category over my career include:
  • Smallworld VMDS (early 90s), for its approach to handling long transactions and graphically intensive applications in a database
  • Apama (early 2000s), for its approach to streaming analytics on real-time data streams, by indexing queries instead of indexing data
  • Netezza (last year - for me), for its approach to data warehousing using "SQL in hardware" in its smart disk readers, together with extreme parallel processing
I would now add to that list Google's BigTable for incredibly scalable request-oriented web applications (update: Barry Hunter pointed out in the comments that the App Engine datastore and BigTable are not the same thing - the datastore is built on top of the lower level BigTable, and adds extra capabilities. I haven't updated the whole post, but in most cases where I say BigTable, it should say datastore. Thanks Barry!). I know I'm a bit behind the times on this - Google has been using it internally for years, and it was first made available for external use with the release of Google's App Engine last year. I had read about App Engine but hadn't got around to looking at it in any detail until last weekend, when for some reason I read a few more detailed articles, downloaded the App Engine development environment, ran through the tutorial, and played around with it a bit.

There are too many interesting things to talk about in this regard for one post, so I'll spread them over several. And I should add the caveat that everything I say here is based on a few hours of poking around App Engine and BigTable, so it is quite possible I have missed or misunderstood certain things - if anyone with more experience in the environment has thoughts I would be very interested to hear them.

In general I was very impressed with App Engine - in less than an hour I was able to run through the getting started tutorial, which included setting up a local development environment, working through 5 or 6 iterations of a simple web application (covering user authentication, database setup, etc.), and deploying several of those iterations online. It takes care of a huge amount for you, including the ability to automatically scale to zillions of users. We could throw away large portions of our code for whereyougonnabe if we moved to App Engine, something I am now seriously considering.
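To give a flavor of how little code is involved, here is a minimal request handler along the lines of the one in the getting started tutorial (a sketch using the Python SDK's webapp framework, reconstructed from memory rather than copied from the tutorial):

    from google.appengine.ext import webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class MainPage(webapp.RequestHandler):
        def get(self):
            # respond to GET / with plain text
            self.response.headers['Content-Type'] = 'text/plain'
            self.response.out.write('Hello, App Engine!')

    # map URL patterns to handlers; App Engine runs this as a WSGI app
    application = webapp.WSGIApplication([('/', MainPage)], debug=True)

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()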

But for the rest of this post I'd like to talk about BigTable, the "database" behind App Engine. Google stresses that it isn't a traditional database - this paper, from several years ago, describes it as a "distributed storage system". It can handle petabytes of data spread across thousands of servers and is used by many Google applications, including search and Google Earth. So clearly BigTable is enormously scalable.

However, it also has some limitations on the types of queries it allows, which at first glance seem incredibly restrictive to someone used to a traditional relational database management system. Some of these restrictions seem to have good technical reasons and some seem a bit arbitrary. For example:
  • A query cannot return more than 1000 rows
  • You cannot use an inequality operator (<, <=, >=, >, !=) on more than one property (aka "field") in a query - so you can do
    SELECT * FROM Person WHERE birth_year >= :min
    AND birth_year <= :max
    but not
    SELECT * FROM Person WHERE birth_year >= :min_year
    AND height >= :min_height
    (see the Python sketch after this list)
  • If a query has both a filter with an inequality comparison and one or more sort orders, the query must include a sort order for the property used in the inequality, and the sort order must appear before sort orders on other properties.
  • And more - see the full list.
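To make these restrictions concrete, here is a quick sketch in Python using the SDK's db module (the Person model and its properties are illustrative names of mine, not anything from the documentation):

    from google.appengine.ext import db

    class Person(db.Model):
        name = db.StringProperty()
        birth_year = db.IntegerProperty()
        height = db.IntegerProperty()

    # Allowed: both inequality filters are on the same property
    q1 = Person.gql("WHERE birth_year >= :min AND birth_year <= :max",
                    min=1970, max=1980)

    # Allowed: with an inequality filter, the first sort order must be
    # on the same property as the inequality
    q2 = Person.gql("WHERE birth_year >= :1 ORDER BY birth_year, name", 1970)

    # Not allowed: inequality filters on two different properties -
    # the datastore rejects this query with an error
    # q3 = Person.gql("WHERE birth_year >= :1 AND height >= :2", 1970, 180)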
While this sort of constraint imposes some challenges, the positive side is that as far as I can see, you can't write an inefficient query using BigTable (if anyone has a counterexample to this statement - based, as I said, on a couple of hours' exposure to BigTable - please let me know!). It changes your whole approach to the problem. A traditional relational DBMS makes it very easy to ask whatever question you want (generally speaking), but you may then need quite a lot of work in terms of indexing, tuning, even data model redesign, to make the answer to that question come back quickly. It's easy to be sloppy about data model and query design. With BigTable you may need to think more up front about how to fit your problem into the constraints it imposes, but if you can then you are guaranteed (I think!) that it will run quickly and scale.

There's an interesting example of this type of redesign in this post, which shows how to handle a query on a date range, where the obvious approach is to have two fields storing start_date and end_date and run a query with an inequality operator against both fields - something which BigTable does not allow. The solution given there is to use one field containing a list of (two) dates, which BigTable does allow (and most traditional DBMSs don't). This is a real world example of a query which is pretty inefficient if you do it in the obvious way in a traditional database (I have seen performance issues with this type of query in the development of whereyougonnabe) - BigTable forces you to structure the data in a different way which ends up being far more efficient.
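As a concrete sketch of the trick (the Event model and its fields are hypothetical names of mine, not from the linked post): store both dates in a single list property, and rely on the fact that each filter on a list property can be satisfied by a different element of the list.

    import datetime
    from google.appengine.ext import db

    class Event(db.Model):
        name = db.StringProperty()
        # one list property holding [start, end] instead of two fields
        dates = db.ListProperty(datetime.datetime)

    Event(name='conference',
          dates=[datetime.datetime(2009, 4, 6),
                 datetime.datetime(2009, 4, 10)]).put()

    # Find events whose range contains `when`. Both inequalities are on
    # the *same* property, so the query is allowed: dates <= :when is
    # satisfied by the start date, dates >= :when by the end date.
    when = datetime.datetime(2009, 4, 8)
    q = Event.gql("WHERE dates <= :when AND dates >= :when", when=when)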

I am still in two minds about the restriction of not allowing inequality operators on more than one field. This clearly guarantees that the query can run quickly, but prevents you from answering certain questions directly. Most database management systems use the approach of having a "primary filter" and a "secondary filter" for a compound query like this: the system uses the primary filter to efficiently retrieve candidate records which satisfy the first condition, and then tests each of those against the second condition to decide whether to return them. This is very common in spatial queries, where you return candidate records based on a bounding box search, which can be done very efficiently, and then compare the candidates against a more precise polygon to decide whether they belong in the result set. But this also adds complexity - it is non-trivial to decide which of multiple clauses to use as the primary filter (this is what a query optimizer does), and it is quite possible that you end up scanning large portions of a table, which seems to be one of the things that BigTable wants to avoid.
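Of course, nothing stops you from applying a secondary filter yourself in application code. A sketch, reusing the hypothetical Person model from above:

    # Primary filter: the single inequality the datastore allows,
    # subject to the 1000-row limit on results
    candidates = Person.gql("WHERE birth_year >= :1", 1970).fetch(1000)

    # Secondary filter: test the second condition in Python
    results = [p for p in candidates if p.height >= 180]

The trade-off is exactly the one a query optimizer normally handles for you: if the primary filter is not selective, you fetch (and pay for) far more candidates than you keep.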

Nonetheless, technically it would be easy for Google to implement a secondary filter capability, so I can only assume it is a conscious decision to omit this, to force you to design your data structures and queries in a way that only scans a small portion of a table (as the restriction of 1000 records returned does) - so ensuring the scalability of the application. I would be curious whether some of these restrictions, like the 1000 record limit, apply to internal Google applications also, or just to the public site where App Engine runs (in order to stop applications consuming too many resources). When App Engine first came out it was free with quotas, but Google now has a system for charging based on system usage (CPU, bandwidth, etc) once you go above certain limits - so it will be interesting to see whether they lift some of these restrictions at some point.

But in general it's an interesting philosophical approach to impose certain artificial restrictions to make you design things in a certain way (in this case, for efficiency and scalability). Twitter's 140 character message limit is constraining in certain ways, but it imposes a communication style which is key to how Twitter works. The Ignite and Pecha-Kucha presentation formats impose restrictive and artificial constraints on how you present a topic (20 slides which auto-advance after 15 or 20 seconds respectively), but they force you to really think about how to present your subject matter concisely.

With BigTable I think there is an interesting mix of constraints which have clear technical reasons (they can't be easily overcome) and those which don't (they could be easily overcome - like secondary inequality filters and the 1000 record limit). Whether there is really a conscious philosophy here, or the approach is just to avoid overloading resources on the public site, or a mix of both, I am intrigued by this idea of a system where seemingly any query you can write is "guaranteed" to run fast and be extremely scalable (not formally guaranteed by Google, but it seems to me that this should be the case).

Of course one key question for me and for readers of this blog is how well BigTable handles geospatial data - especially since a standard bounding box query involves inequality operators on multiple fields, which is not allowed. BigTable does support a simple "GeoPt" data type, but doesn't support spatial queries out of the box. I have seen some examples using a geohash (which Wikipedia claims is a recent invention - as a referencing scheme that may be true, but as an indexing scheme it is just a simple form of the good old quadtree index, which has been in use since the 1980s - see the end of this old paper of mine). The geohash examples I have seen so far are simple and just "approximate" - they will give good results in some cases but incorrect results in others. I have several ideas for using simple quadtree or grid indexes which I will experiment with, but I'll save the discussion of spatial data in BigTable for another post in the not too distant future.
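To sketch the quadtree idea (purely my own illustration with hypothetical names, not a worked-out solution): interleave the quadrant choice at each subdivision level into a string key, so that spatial containment becomes string prefix containment, and looking up a cell becomes a range query on a single property - exactly the kind of query the datastore allows.

    def quadkey(lon, lat, depth=16):
        # Return a quadtree cell key for a point: one quadrant digit
        # (0-3) per subdivision level, most significant first.
        minx, miny, maxx, maxy = -180.0, -90.0, 180.0, 90.0
        digits = []
        for _ in range(depth):
            midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
            d = 0
            if lon >= midx:
                d |= 1
                minx = midx
            else:
                maxx = midx
            if lat >= midy:
                d |= 2
                miny = midy
            else:
                maxy = midy
            digits.append(str(d))
        return ''.join(digits)

    # All points inside the cell with key prefix p have keys in the
    # lexicographic range [p, p + '4'), so a cell lookup needs
    # inequalities on only one (hypothetical) property, e.g.:
    #   db.GqlQuery("SELECT * FROM Place WHERE quadkey >= :1 "
    #               "AND quadkey < :2", p, p + '4')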

7 comments:

  1. Interesting posts - they summarize the App Engine environment quite well. The geoindexing part is also of particular interest to me; as you say, geohash gets floated around as a rough and ready solution (I proposed it first, IIRC), but it's possible to do better - there are some libraries around implementing bit-quadtrees.

    But a pet niggle of mine: bigtable != datastore. The datastore exposed in App Engine is not BigTable - they are quite different. The datastore is implemented on top of BigTable. BigTable itself does not have indexes like those implemented by the datastore; it's much more low level.

  2. Barry, thanks for the clarification on bigtable versus datastore - I'll make an update to that effect. Definitely a lot of interesting possibilities with it though.

  3. Peter:

    Brian Flood of Arc2Earth fame has done a ton of work on storing, indexing, and querying spatial data in BigTable. He has some posts and videos about this, and about extending his Arc2Earth tools to access BigTable datastores via ArcMap.


    B. Timoney

  4. @Brian T, thanks - Brian F got in touch with me also, and I'll mention his stuff in the next installment on spatial data and App Engine.

  5. There is also a 1000 limit on the number of results you can browse/access for a single query on the Google search engine, so maybe the limit for the datastore is intrinsic to the BigTable on which both are based.

  6. Very interesting post on geohash and google app engine spatial search.

    Peter, did you find an efficient way to do bounding box queries on App Engine?

  7. what about this...
    http://www.carlosble.com/2010/11/goodbye-google-app-engine-gae/
