Monday, April 13, 2009

Upcoming conferences (and 25% discount on Where 2.0!)

There are several conferences coming up that I wanted to mention, some of which I am speaking at.

First I should mention the GITA Geospatial Infrastructure Solutions Conference, which is in Tampa next week. I have been strongly involved with GITA since moving to the US in 1993, including five years on the GITA board (which I stood down from last December). I am sure the conference will be excellent as always, especially with the Obama administration's focus on renewing US infrastructure providing some topical interest. This year for the first time they will be using Zerista to provide social networking and mapping services for attendees, so I'll be interested to hear how that goes (Zerista is the latest evolution of MapBuzz, founded by my friend Charlie Savage). It is with some sadness that I report that I won't be at the conference this year - the first one I will have missed since 1991! I am sure I will be back again in future, but with my current focus on whereyougonnabe, and a limited startup budget :O, other conferences are higher on the list for me this year.

Which brings us to Where 2.0, which I am actually attending for the first time this year, somewhat surprisingly - I have managed to have conflicting commitments for the past 2 or 3 years. I will be giving a five minute "Ignite" talk on the Tuesday evening, which I am looking forward to - I have done one Ignite format event before (20 slides which auto-advance every 15 seconds, for a total of 5 minutes), which I enjoyed, and I have recently got into Presentation Zen, whose ideas also lend themselves well to the format. I'll be mentioning several significant new developments we've been working on with whereyougonnabe, as part of discussing the challenges of modeling future location for social networking. And as a reader of this blog, you can get a 25% discount on registration by quoting the code WHR09FSP. Early registration ends tomorrow (April 14), so if you've been thinking about signing up, now is a good time to do so! You can register here. I'm also planning to stay around afterwards to attend my first WhereCamp, which I'm also looking forward to.

Moving on to July, I am very tempted to try to make it over to Amsterdam for the State of the Map conference, the 3rd annual conference for OpenStreetMap - but I haven't quite worked out whether I will make it yet. I continue to find the growth of OpenStreetMap, and crowdsourced data in general, to be one of the most interesting trends in the geospatial industry at the moment, and I am sure this will be an excellent event. I am weighing up whether to submit a talk, and may well do so. I think that the (relatively) recent announcement of the Cloudmade development platform / API will help accelerate the growth of OpenStreetMap. In many ways the rapid growth of OpenStreetMap is even more remarkable when you consider that the tools for building applications on top of the data have been relatively limited. The new Cloudmade development platform makes this much easier, which should result in a lot more applications based on OpenStreetMap data, which in turn will provide a greater incentive for people to contribute to collecting and maintaining the data.

And last but not least, I should mention the GeoWeb conference in Vancouver at the end of July (also organized by GITA, together with Galdos). I've attended the past couple of years and presented last year (watch the video), and really enjoyed it both times - some good content, an interesting mixture of attendees from across the spectrum of the "neo" and "traditional" geo camps (including a good number like myself who don't care too much for those labels and have a foot in both camps), and of course it's in beautiful Vancouver, which, as I've said before, I always like to visit. I've had a presentation accepted so will definitely be at this one.

Wednesday, April 8, 2009

Google App Engine and BigTable - VERY interesting!

Every so often you come across a radically different approach to a certain class of data processing problem which makes you completely rethink what you knew before about how best to develop applications in that space. Systems which I would put in this category over my career include:
  • Smallworld VMDS (early 90s), for its approach to handling long transactions and graphically intensive applications in a database
  • Apama (early 2000s), for its approach to streaming analytics on real time data streams, by indexing queries instead of indexing data
  • Netezza (last year - for me), for its approach to data warehousing using "SQL in hardware" in its smart disk readers, together with extreme parallel processing
I would now add to that list Google's BigTable for incredibly scalable request-oriented web applications (update: Barry Hunter pointed out in the comments that the App Engine datastore and BigTable are not the same thing - datastore is built on top of the lower level BigTable, and adds extra capabilities. I haven't updated the whole post but in most cases where I say BigTable, it should say datastore. Thanks Barry!). I know I'm a bit behind the times on this - Google has been using it internally for years, and it was first made available for external use with the release of Google's App Engine last year. I had read about App Engine but hadn't got around to looking at it in any detail until last weekend, when for some reason I read a few more detailed articles, downloaded the App Engine development environment and ran through the tutorial and played around with it a bit.

There are too many interesting things to talk about in this regard for one post, so I'll spread them over several. And I should add the caveat that everything I say here is based on a few hours of poking around App Engine and BigTable, so it is quite possible I have missed or misunderstood certain things - if anyone with more experience in the environment has thoughts I would be very interested to hear them.

In general I was very impressed with App Engine - in less than an hour I was able to run through the getting started tutorial, which included getting a local development environment set up, going through 5 or 6 iterations of a simple web application, including user authentication, database setup, etc, and deploying several iterations of the application online. It takes care of a huge amount for you, including the ability to automatically scale to zillions of users. We could throw away large portions of our code for whereyougonnabe if we moved to App Engine, something which I am now seriously considering.

But for the rest of this post I'd like to talk about BigTable, the "database" behind App Engine. Google stresses that it isn't a traditional database - this paper, from several years ago, describes it as a "distributed storage system". It can handle petabytes of data spread across thousands of servers and is used by many Google applications, including search and Google Earth. So clearly BigTable is enormously scalable.

However, it also has some limitations on the types of queries it allows, which, for someone used to a traditional relational database management system, seem incredibly restrictive at first glance. Some of these restrictions seem to have good technical reasons, while others seem a bit arbitrary. For example:
  • A query cannot return more than 1000 rows
  • You cannot use an inequality operator (<, <=, >=, >, !=) on more than one property (aka "field") in a query - so you can do
    SELECT * FROM Person WHERE birth_year >= :min
    AND birth_year <= :max
    but not
    SELECT * FROM Person WHERE birth_year >= :min_year
    AND height >= :min_height
  • If a query has both a filter with an inequality comparison and one or more sort orders, the query must include a sort order for the property used in the inequality, and the sort order must appear before sort orders on other properties.
  • And more - see the full list.
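To make the second and third restrictions concrete, here is a toy checker I sketched out - this is not part of the App Engine SDK, just my own illustration of the rules, with a query described as a list of (property, operator) filters plus an ordered list of sort properties:

```python
# Toy validity checker for two of the datastore query restrictions
# described above. NOT part of the App Engine SDK - purely illustrative.

INEQUALITY_OPS = {"<", "<=", ">=", ">", "!="}

def is_valid_query(filters, sort_orders):
    """Return True if the query obeys two of the restrictions:
    1. Inequality filters may touch at most one property.
    2. If there is an inequality filter and any sort orders, the
       first sort order must be on the inequality property.
    """
    ineq_props = {prop for prop, op in filters if op in INEQUALITY_OPS}
    if len(ineq_props) > 1:
        return False  # inequality operators on more than one property
    if ineq_props and sort_orders:
        if sort_orders[0] != next(iter(ineq_props)):
            return False  # must sort first on the inequality property
    return True

# The birth_year range query from above: allowed
print(is_valid_query([("birth_year", ">="), ("birth_year", "<=")], []))  # True
# Inequalities on two different properties: rejected
print(is_valid_query([("birth_year", ">="), ("height", ">=")], []))      # False
# Inequality plus a sort on a different property first: rejected
print(is_valid_query([("birth_year", ">")], ["height", "birth_year"]))   # False
```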
While these sorts of constraints impose some challenges, the positive side is that, as far as I can see, you can't write an inefficient query using BigTable (if anyone has a counterexample to this statement - based, as I said, on a couple of hours' exposure to BigTable - please let me know!). It changes your whole approach to the problem. A traditional relational DBMS makes it very easy to ask whatever question you want (generally speaking), but you may then need quite a lot of work in terms of indexing, tuning, or even data model redesign, to make the answer to that question come back quickly. It's easy to do sloppy data model and query design. With BigTable you may need to think more up front about how to fit your problem into the constraints it imposes, but if you can, then you are guaranteed (I think!) that it will run quickly and scale.

There's an interesting example of this type of redesign in this post, which shows how you can redesign a query on a date range, where the obvious approach is to have two fields storing start_date and end_date, and run a query which includes an inequality operator against both fields - something which BigTable does not allow. The interesting solution given here is to use one field containing a list of (two) dates, which BigTable does allow (and most traditional DBMSs don't). This is a real world example of a query which is pretty inefficient if you do it in the obvious way in a traditional database (I have seen performance issues for this type of query in the development of whereyougonnabe) - BigTable forces you to structure the data in a different way which ends up being far more efficient.
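Here is a small plain-Python simulation of why the trick works. The key datastore behavior is that a filter on a list property matches if any element of the list satisfies it, so with [start_date, end_date] stored in one "dates" property, a "contains this date" query can use both <= and >= on that single property. The event names and dates below are just made up for the example:

```python
from datetime import date

# Simulation of the datastore list-property trick described above,
# in plain Python rather than the real datastore API. A filter on a
# list property matches if ANY element of the list satisfies it.

events = [
    {"name": "Where 2.0", "dates": [date(2009, 5, 19), date(2009, 5, 21)]},
    {"name": "SOTM",      "dates": [date(2009, 7, 10), date(2009, 7, 12)]},
]

def active_on(events, d):
    """Simulate: SELECT * WHERE dates <= :d AND dates >= :d.
    Each filter passes if some element of the list satisfies it,
    i.e. start_date <= d (first filter) and end_date >= d (second)."""
    return [e["name"] for e in events
            if any(x <= d for x in e["dates"])
            and any(x >= d for x in e["dates"])]

print(active_on(events, date(2009, 5, 20)))  # ['Where 2.0']
print(active_on(events, date(2009, 6, 1)))   # []
```

Note that both inequality filters are on the same property, so the query is legal, and there are no false positives: a date before the start fails the <= filter, and a date after the end fails the >= filter.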

I am still in two minds about the restriction of not allowing inequality operators on more than one field. This clearly guarantees that the query can run quickly, but restricts you from answering certain questions. Most database management systems would use the approach of having a "primary filter" and a "secondary filter" for a compound query like this - the system uses the primary filter to efficiently retrieve candidate records from the database which satisfy the first condition, and then you test each of those against the second condition to decide whether to return them. This is very common in spatial queries, where you return candidate records based on a bounding box search which can be done very efficiently, and then you compare candidate records returned against a more precise polygon to decide if they should be included in the result set. But this also adds complexity - it is non-trivial to decide which one of multiple clauses to use as the primary filter (this is what a query optimizer does), and it is quite possible that you end up scanning large portions of a table, which seems to be one of the things that BigTable wants to avoid.
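As a sketch of what that primary/secondary filter pattern looks like in the spatial case - entirely illustrative, not any particular database's API - a cheap bounding box test narrows the candidates, then an exact point-in-polygon test decides inclusion:

```python
# Illustrative primary/secondary filter pattern for a spatial query:
# a cheap bounding-box check (primary filter) narrows the candidates,
# then an exact ray-casting point-in-polygon test (secondary filter)
# decides which candidates actually belong in the result set.

def in_bbox(pt, bbox):
    (x, y), (minx, miny, maxx, maxy) = pt, bbox
    return minx <= x <= maxx and miny <= y <= maxy

def in_polygon(pt, poly):
    """Standard ray-casting point-in-polygon test."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

triangle = [(0, 0), (4, 0), (0, 4)]
bbox = (0, 0, 4, 4)  # the triangle's bounding box
points = [(1, 1), (3, 3), (5, 1)]

# Primary filter (cheap), then secondary filter (exact):
candidates = [p for p in points if in_bbox(p, bbox)]
results = [p for p in candidates if in_polygon(p, triangle)]
print(results)  # [(1, 1)]
```

Here (5, 1) is rejected by the primary filter, while (3, 3) passes the bounding box test but fails the exact polygon test - the secondary filter's job.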

Nonetheless, technically it would be easy for Google to implement a secondary filter capability, so I can only assume it is a conscious decision to omit this, to force you to design your data structures and queries in a way which only scans a small portion of a table (as the restriction of 1000 records returned does) - thus ensuring the scalability of the application. I would be curious as to whether some of these restrictions, like the 1000 record limit, also apply to internal Google applications, or just to the public site where App Engine runs (in order to stop applications consuming too many resources). When App Engine first came out it was free with quotas, but Google now has a system for charging based on system usage (CPU, bandwidth, etc) once you go above certain limits - so it will be interesting to see whether they lift some of these restrictions at some point.

But in general it's an interesting philosophical approach to impose certain artificial restrictions to make you design things in a certain way (in this case, for efficiency and scalability). Twitter imposing a message length of 140 characters is limiting in certain ways, but imposes a certain communication style which is key to how it works. The Ignite and Pecha-Kucha presentation formats impose restrictive and artificial constraints on how you present a topic (20 slides which auto-advance after 15 or 20 seconds respectively), but they force you to really think about how to present your subject matter concisely.

With BigTable I think there is an interesting mix of constraints which have clear technical reasons (they can't be easily overcome) and those which don't (they could be easily overcome - like secondary inequality filters and the 1000 record limit). Whether there is really a conscious philosophy here, or the approach is just a way to avoid overloading resources on the public site (or a mix of both), I am intrigued by the idea of a system where seemingly any query you can write is "guaranteed" to run fast and be extremely scalable (not formally guaranteed by Google, but it seems to me that this should be the case).

Of course one key question for me and for readers of this blog is how well BigTable handles geospatial data - especially since a standard bounding box query involves inequality operators on multiple fields, which is not allowed. BigTable does support a simple "GeoPt" data type, but doesn't support spatial queries out of the box. I have seen some examples using a geohash, which Wikipedia claims is a recent invention - which as a referencing scheme may be true, but as an indexing scheme it is just a simple form of the good old quadtree index, which has been in use since the 1980s (see the end of this old paper of mine). The examples I have seen so far using a geohash are simple and just "approximate" - they will give good results in some cases but incorrect results in others. I have several ideas for using simple quadtree or grid indexes which I will experiment with, but I'll save the discussion of spatial data in BigTable for another post in the not too distant future.
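As a small teaser of the basic quadtree idea, here is a toy encoding I sketched out (my own illustration, not the geohash standard): recursively halve the world and record which quadrant the point falls in at each level, so that nearby points share a key prefix and a spatial lookup becomes a string prefix match on a single property - exactly the kind of query the datastore handles well.

```python
# Toy quadtree cell encoding - my own illustration, not the geohash
# standard. Each character '0'-'3' records which quadrant of the
# current cell the point falls in; nearby points share a key prefix.

def quadtree_key(lon, lat, depth=8):
    minx, miny, maxx, maxy = -180.0, -90.0, 180.0, 90.0
    key = ""
    for _ in range(depth):
        midx, midy = (minx + maxx) / 2, (miny + maxy) / 2
        q = 0
        if lon >= midx:          # right half?
            q += 1
            minx = midx
        else:
            maxx = midx
        if lat >= midy:          # top half?
            q += 2
            miny = midy
        else:
            maxy = midy
        key += str(q)
    return key

# Denver and Boulder (close together) share a long prefix;
# Tampa diverges after the first character.
print(quadtree_key(-104.99, 39.74))   # Denver
print(quadtree_key(-105.27, 40.01))   # Boulder
print(quadtree_key(-82.46, 27.95))    # Tampa
```

A stored key like this can be queried with a single-property range filter (key >= prefix AND key < prefix + '4'), which sidesteps the multiple-inequality restriction entirely - more on this in the future post.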