Monday, September 29, 2008
The following is a graduated symbol map showing number of Facebook users by city in the US.
This is an example of one of the map creation screens - the whole process has nice clear graphics which are well thought out to explain the process and options, with ways of getting more information when you want it.
I am a big fan of FortiusOne's vision of putting spatial analysis in the hands of people who are not geospatial specialists. There are still a lot of traditional GIS people who think the sky will fall in (people may draw invalid conclusions if they don't understand data accuracy, etc etc) if you let "untrained" people analyze geospatial data, but I think this is nonsense. I think the best analogy is the spreadsheet. Of course people can draw incorrect conclusions from any kind of data, geospatial or not, but in the great majority of cases this does not happen, and they discover useful insights. Spreadsheets let "untrained" people do useful analysis on all kinds of other data, and I think that FortiusOne is trying to democratize access to spatial analysis in the same way that the spreadsheet has done for non-spatial data. The benefits of having a much larger set of people able to do basic geospatial analysis are huge.
As I said above, I think that this first release looks great, and I look forward to seeing where they take this in the future. I understand that the public release of this will be coming soon.
Monday, September 22, 2008
Friday, September 19, 2008
The first was that if you can provide information at "the speed of thought", or the speed of a click, this enables people to do interesting things, and work in a different and much more productive way. Google Search is an example - you can ask a question, and you get an answer immediately. The answer may or may not be what you were looking for, but if it isn't you can ask a different question. And if you do get a useful answer, it may trigger you to ask additional questions to gain further insight on the question you are investigating. Netezza sees information at the speed of thought as a goal for complex analytics, which can lead us to get greater insights from data - more than you would if you spent the same amount of time working on a system which was say 20 times slower (spread over 20 times as much elapsed time), as you lose the continuity of thought. This seems pretty plausible to me.
A second idea is that when you are looking for insights from business data, the most valuable data is "on the edges" - one or two standard deviations away from the mean. This leads to another Netezza philosophy which is that you should have all of your data available and online, all of the time. This is in contrast to the approach which is often taken when you have very large data volumes, where you may work on aggregated data, and/or not keep a lot of historical data, to keep performance at reasonable levels (historical data may be archived offline). In this case of course you may lose the details of the most interesting / valuable data.
This got me to thinking about some of the places where you might apply some of those principles in the geospatial world. The following examples are somewhat speculative, but they are intended to get people thinking about the type of things we might do if we can do analysis 100x faster than we can now on very large data volumes, and follow the principle of looking for data "on the edges".
One area is in optimizing inspection, maintenance and management of assets for any organization managing infrastructure, like a utility, telecom or cable company, or local government. This type of infrastructure typically has a long life cycle. What if you stored say the last 10 or 20 years of data on when equipment failed and was replaced, when it was inspected and maintained, etc. Add in information on load/usage if you have it, detailed weather information (for exposed equipment), soil type (for underground equipment), etc, and you would have a pretty interesting (and large) dataset to analyze for patterns, which you could apply to how you do work in the future. People have been talking about doing more sophisticated pre-emptive / preventive maintenance in utilities for a long time, but I don't know of anyone doing very large scale analysis in this space. I suspect there are a lot of applications in different areas where interesting insights could be obtaining by analyzing large historical datasets.
This leads into another thought, which is that of analyzing GPS tracks. As GPS and other types of data tracking (like RFID) become more pervasive, we will have access to huge volumes of data which could provide valuable insights but are challenging to analyze. Many organizations now have GPS in their vehicles for operational purposes, but in most cases do not keep much historical data online, and may well store relatively infrequent location samples, depending on the application (for a long distance trucking company, samples every 5, 15 or even 60 minutes would provide data that had some interest). But there are many questions that you couldn't answer with a coarse sampling but could with a denser sampling of data (like every second or two). Suppose I wanted to see how much time my fleet of vehicles spent waiting to turn left compared to how long they spend waiting to turn right, to see if I could save a significant amount of time for a local delivery service by calculating routes that had more right turns in them (assuming I am in a country which drives on the right)? I have no idea if this would be the case or not, but it would be an interesting question to ask, which could be supported by a dense GPS track but not by a sparse one. Or I might want to look at how fuel consumption is affected by how quickly vehicles accelerate (and model the trade-off in potential cost savings versus potential time lost) - again this is something that in theory I could look at with a dense dataset but not a sparse one. Again, this is a somewhat speculative / hypothetical example, but I think it is interesting to contemplate new types of questions we could ask with the sort of processing power that Netezza can provide - and think about situations where we may be throwing away (or at least archiving offline) data that could be useful. In general I think that analyzing large spatio-temporal datasets is going to become a much more common requirement in the near future.
I should probably mention a couple of more concrete examples too. I have talked to several companies doing site selection with sophisticated models that take a day or two to run. Often they only have a few days to decide whether (and how much) to bid for a site, so they may only be able to run one or two analyses before having to decide. Being able to run tens or hundreds of analyses in the same time would let them vary their assumptions and test the sensitivity of the model to changes, and analyze details which are specific to that site - going back to the "speed of thought" idea, they may be able to ask more insightful questions if they can do multiple analyses in quick succession.
Finally, for now, another application that we have had interest in is analyzing the pattern of dropped cell phone calls. There are millions of calls placed every day, and this is an application where there is both interest in doing near real time analysis, as well as more extended historical analysis. As with the hurricane analysis application discussed previously, the Netezza system is well suited to analysis on rapidly changing data, as it can be loaded extremely quickly, in part because of the lack of indexes in Netezza - maintaining indexes adds a lot of overhead to data loading in traditional system architectures.
Wednesday, September 17, 2008
Interview with Rich Zimmerman about Netezza Spatial from Peter Batty on Vimeo.
Tuesday, September 16, 2008
For me, the most impressive aspect of today at the Netezza User Conference was the presentation from Shajy Mathai of Guy Carpenter, the first customer for Netezza Spatial, who talked about how they have improved the performance of their exposure management application, which analyzes insurance risk due to an incoming hurricane. They have reduced the time taken to do an analysis of the risk on over 4 million insured properties from 45 minutes using Oracle Spatial to an astonishing 5 seconds using Netezza (that’s over 500x improvement!). Their current application won the Oracle Spatial Excellence “Innovator Award” in 2006. About half of the 45 minutes is taken up loading the latest detailed weather forecast/risk polygons and other related data, and the other half doing point in polygon calculations for the insured properties. In Netezza the data updates just run continuously in the background as they are so fast, and the point in polygon analysis takes about 5 seconds. For insurance companies with billions of dollars of insured properties at risk, this time difference to get updated information is hugely valuable. The performance improvement you will see over traditional database systems will vary depending on the data and the types of analysis being performed - in general we anticipate performance improvements will typically be in the range of 10x to 100x.
Netezza is a company I have been very impressed with (and in the interests of full disclosure, I am currently doing some consulting work for them and have been for several months). They have taken a radically different approach to complex database applications in the business intelligence space, developing a “database appliance” – a combination of specialized hardware and their own database software, which delivers performance for complex queries on large (multi-terabyte) databases which is typically 10 to 100 times faster than traditional relational database architectures like Oracle or SQL Server. There are two primary means by which they achieve this level of performance. One is by highly parallelizing the processing of queries – a small Netezza configuration has about 50 parallel processing units, each one a powerful computer in its own right, and a large one has around 1000 parallel units (known as Snippet Processing Units or SPUs). Effectively parallelizing queries is a complex software problem – it’s not just a case of throwing lots of hardware at the issue. The second key element is their smart disk readers, which use technology called Field Programmable Gate Arrays (FPGAs), which essentially implement major elements of SQL in hardware, so that basic filtering (eliminating unwanted rows) and projection (eliminating unwanted fields) of data all happens in the disk reader, so unnecessary data is never even read from disk, which eliminates a huge bottleneck in doing complex ad hoc queries in traditional systems.
Apart from outstanding performance, the other key benefit of Netezza is significantly simpler design and administration than with traditional complex database applications. Much of this is due to the fact that Netezza has no indexes, and design of indexes and other ongoing performance tuning operations usually take a lot of time for complex analytic applications in a traditional environment.
Netezza’s technology has been validated by their dramatic success in the database market, which in my experience is quite conservative and resistant to change. This year they expect revenues of about $180m, growth of over 40% over last year’s $127m. About a year ago, Larry Ellison of Oracle said in a press conference that Oracle would have something to compete with Netezza within a year. This is notable because it’s unusual for them to mention specific competitors, and even more unusual to admit that they basically can’t compete with them today and won’t for a year. Given the complexity of what Netezza has done, and the difficulty of developing specialized hardware as well as software, I am skeptical about others catching them any time soon.
So anyway (to get back to the spatial details), the exciting news for people trying to do complex large scale spatial analytics is that Netezza has now announced support for spatial data types and operators – specifically vector data types: points, lines and areas. They support the OGC standard SQL for Simple Features, as well as commonly used functions not included in the standard (the functionality is similar to PostGIS). This enables dramatic performance improvements for complex applications, and in many cases lets us answer questions that we couldn’t even contemplate asking before. We have seen strong interest already from several markets, including insurance, retail, telecom, online advertising, crime analysis and intelligence, and Federal government. I suspect that many of the early users will be existing Netezza customers, or other business intelligence (BI) users, who want to add a location element to their existing BI applications. But I also anticipate some users with existing complex spatial applications and large data volumes, for whom Netezza can deliver these substantial performance improvements for analytics, while simplifying adminstration and tuning requirements.
One important thing to note is that Netezza is specifically not focused on "operational" geospatial applications. The architecture is designed to work effectively for mass queries and analysis - if you are just trying to access a single record or small set of records with a pre-defined query, then a traditional database architecture is the right solution. So in cases where the application focus is not exclusively on complex analytics, Netezza is likely to be an add-on to existing operational systems, not a replacement. This is typical in most organizations doing business intelligence applications, where data is consolidated from multiple operational systems into a corporate data warehouse for analytics (whether spatial or non-spatial).
Aside from the new spatial capabilities, the Netezza conference has been extremely interesting in general, and I will post again in the near future with more general comments on some of the interesting themes that I have heard here, including "providing information at the speed of thought"!
Having worked with interesting innovations in spatial database technologies for many years, from IBM's early efforts on storing spatial data in DB2 in the mid to late eighties, to Smallworld's innovations with long transactions, graphical performance and extreme scalability in terms of concurrent update users in the early nineties, and Ubisense's very high performance real time precision tracking system more recently, it's exciting to see another radical step forward for the industry, this time in terms of what is possible in the area of complex spatial analytics.
Friday, September 12, 2008
One interesting thing you might do is analyze the projected impact of hurricane Ike in a much more comprehensive and timely fashion than you can do with current technologies, and we'll have a case study about that next week. I'll be blogging about all this in much more detail next week.
Thursday, September 4, 2008
Some key features include:
- WYTV, a new animated map display inspired by the cool Twittervision, but using data from whereyougonnabe. This displays an interesting mix of activities from your friends and public activities from other users. You can click through on friends’ activities to see more details.
- Calendar synchronization now uses Google Local Search, based on the context of which city you’ve told us you’re in. So if you’re in Denver, CO and you put a location of “1100 Broadway” or “Apple Store” on an activity in your calendar that week, it will find a suitable location in Denver.
- We now support display of distances in kilometers as well as miles, a request we have had from several users. If your home is in the US or UK the default setting is miles, elsewhere it is km. You can change your preference on the settings page.
- The experience for new users is improved – when you add the application you can immediately view data from your friends, or public activities from other users, without having to enter any data. Users just need to tell us their home location before creating their own activities. There is more to come on this theme.
- Various small usability improvements and bug fixes.
As always, we welcome your feedback on the new features, and ideas for further improvements. We will have another new release coming very shortly with a number of things that didn’t quite make it in time for this one. You can try the new release here.
Wednesday, September 3, 2008
If you had five minutes on stage what would you say? What if you only got 20 slides and they rotated automatically after 15 seconds? Around the world geeks have been putting together Ignite nights to show their answers.
You can register here for free (and if you want to be on my team for the trivia quiz let me know!). I'll be talking about my thoughts on future location and social networking, among a fairly eclectic agenda which ranges from a variety of techie topics to the wonderful world of cigars and how to swear in French! I'm looking forward to the challenge of trying out this presentation format for the first time. If you're in the neighborhood I encourage you to stop by, I think it should be fun.
The following day, Thursday September 11, I switch from having to talk about future location for 5 minutes to talking for an hour and a half at the GIS in the Rockies conference in Loveland, CO. The standard presentations there are 30 minutes but the organizers asked if I would talk for longer, which was nice of them! It was a tough job to get my GeoWeb presentation into 30 minutes, so I'll have the luxury of being able to expand on the topics I talked about there, as well as talking more about the technology we're using, including PostGIS and Google Maps, more about some of the scalability testing and system design challenges we have addressed, and more about how I see the overall marketspace we're in and how we fit into that. I'm on from 3:30pm to 5pm - hope to see some of you there.