Today I spent some time working updating on updating the airport dataset we use with whereyougonnabe. I thought that it was an interesting process to iteratively come up with a reasonable algorithm for what I needed to do, so I thought I would explain it here. I'd be interested in any feedback both on improvements to the technical approach (or entirely different approaches), and also on what the "right answer" should be from a user perspective in some of the questionable cases I mention.
As I've mentioned previously, we use Google Local Search for geocoding in general, but in our experience this really doesn't work well for geocoding airports based on the three letter airport codes. When we import travel itineraries from Tripit, the three letter airport codes are the best way to unambiguously identify which airports you will be in, so we need a reliable way of geocoding those.
We have been using the geonames dataset to do this, a rich open source database, which includes cities, airports and other points of interest from around the world. There are about 6.6 million points in the whole dataset, of which about 23000 are airports, and of those about 3500 have three letter IATA codes. Coverage seems to be fairly complete, but we have found a few airports with IATA codes missing - Victoria, BC, is one (code YYJ), and today I found that Knoxville was missing its code (TYS). I updated Victoria in the source data online, and it shows up in the online map but for some reason still does not appear in the downloaded data, I have that on my list of things to look into.
In general, the geonames airport records usually have null data in the city field. This is something I really wanted for each airport, so I can show a short high level description like "departing from Denver" or "arriving in London". Often the city name is included in the airport name, but not always - for example, the main airport in Las Vegas is just named "McCarran International Airport", which would not be an immediately obvious location to many people. And even where the city is included, the name is often longer than I want for a short description, for example "Ronald Reagan Washington National Airport".
For our first quick pass at populating the city for each airport, we used the Yahoo geocoding service, as it has a reverse geocoding function (and we thought we would try it out). You just pass this a coordinate and it gives you a city name (and other data) back. This is the data we are using in the live version of whereyougonnabe today. However, I noticed that this doesn't always give me what I would like - for example, it tells me that London Heathrow airport is in Ashford, and Vancouver airport is in Richmond. This may be technically correct (or not, I don't know for sure), but I'd really find it more useful to see the main city nearby in my high level description - so it will say I am arriving in London or Vancouver, in these two examples.
One nice aspect of the geonames dataset is that it includes the population of (many) cities, so we can use this in determining which is the largest nearby city. So in my first pass at solving the problem, I ran a PostGIS query for each airport to look for "large" cities nearby. Somewhat arbitrarily I initially specified population thresholds of 250000, 100000 and 0, and distance thresholds of 5, 25 and 50 miles. I would first search for cities with population > 250000 within 5 miles, and if I didn't find any I would extend to 25 miles and then 50 miles. If I still hadn't found anything, I would decrease the population to the next threshold and try again. If I found more than one in a given category I chose the closest, in this initial approach (the largest is obviously another alternative, but neither will be the right answer in all cases).
This worked reasonably well, returning London and Vancouver for the examples already mentioned. But it gave some wrong answers too, including Denver Airport which was associated with Aurora (a large suburb of Denver), and for a lot of small town airports it returned a larger town nearby, which in some cases might have been reasonable but in many cases wasn't.
So I decided to extend my algorithm to include matching on names - if there was a city called 'Denver' near 'Denver International Airport' then that is the one I was after. This required a little bit of SQL trickery. If I knew the city name and wanted to search for airports this is easy, you just use a SQL clause like:
where name like '%Denver%'
But it was the other way around, the value I had was a superstring of the value in the field I was querying. I found that in PostgreSQL you can do a statement like the following:
where 'Denver International Airport' like '%' || name || '%'
(|| is the string concatenation in PostgreSQL, so if the value in the field "name" is "Denver", the expression evaluates to 'Denver International Airport' like '%Denver%', which is true). I wouldn't really want to be running that sort of clause alone against the very large geonames table, for performance reasons, but since I also had clauses based on location and population to use as primary filters, performance was fine. And in fact to eliminate issues of upper versus lower case, the clause actually looked like:
where 'denver international airport' like lower('%' || name || '%')
So this would match the city names I found to any substring of the airport name (whether the city name was at the beginning or not). If I found a matching city name within a specified distance (I am currently using 40 miles), I use that, if not then I fall back to using the first approach to look for a "large city" nearby. On my first pass through with this approach, I was able to match names for 2356 of the 3562 airports with an IATA code.
I noticed that some cases were not matching when they looked like they should, where the relevant city names had accented characters. I realized that in some cases, the airport name had been entered with non-accented characters, so it was not matching the version of the city name which did have accented characters. The geonames data includes both an accented and non-accented (ascii) version of each city name, so I extended my where clause to do the wild card matching above on either the (accented) name field or the (non-accented) asciiname field. This increased the number of matching names to 2674, which was a pretty good improvement.
I noticed that in a few places, there were two potential "city" names in the airport name - in particular London Gatwick was matching on a "city" called Gatwick, rather than London, and Amsterdam Schiphol was matching a "city" called Schiphol. Looking at maps of the local area, I'm actually not convinced there is a city (or even a village) called Gatwick, and on further examination the geoname record shows its population as zero, and the same turned out to be true of Schiphol. But regardless, I decided to handle this by checking whether there was more than one city name match with the airport name, within the search radius, and if so to choose the largest city found (rather than the closest). There are some airports which serve two real cities also, and have both in their title, so this is a reasonable approach in those cases (no offense intended to the smaller cities!). This change returned me London instead of Gatwick and Amsterdam instead of Schiphol, as I wanted.
I was now getting pretty close, but I stumbled on a couple of small local airports where I was choosing a really really small town which happened to be closest, rather than a not quite so small town nearby which was really the obvious choice. For example, for Garfield County Airport in the mountains of Colorado, I was picking the town of Antlers (population listed as 0, though I think it might be a few more than that) rather than Rifle, population 7897! This was fixed just by added one more value to my population thresholds (I used 5000).
Obviously tweaking the population and distance thresholds would change some results, but from a quick skim I think the current results are pretty reasonable. One example which I am not sure whether to regard as right is JeffCo airport, a small airport which is in Broomfield, just outside Denver. The current thresholds associate this with Denver, which is probably reasonable, but you could argue that Broomfield is large enough to be named (45000).
If you're interested in checking out the full list, to see if your favorite airport is assigned to the city you think it should be associated with, you can check out this text file (the encoding of accented characters is garbled here, though it is correct in our database - but I haven't taken the time to figure out how to handle this using basic line printing in Java). Let me know if you think anything is wrong! The format of a line in the report is as follows:
13: GRZ Graz-Thalerhof-Flughafen -- 8073 Feldkirchen to Graz (Graz), true
The IATA airport code is first (GRZ), then the airport name from geonames (Graz-Thalerof-Fluhafen), then the city name we got from Yahoo reverse geocoding (8073 Feldkirchen), then the associated city name generated by our algorithm (Graz), followed by the ascii version of the name in parentheses (the same in this case), and lastly the "true" indicates that a match was found on the city name, rather than just using proximity and population.
So anyway, I thought this was an interesting example of how you can use the geonames dataset. I would be happy to contribute these city names back to geonames if people think this would be valuable - it is on my list to look into how best to do this, but if anyone has particular suggestions on this front let me know.