Friday, September 19, 2008

Analysis at the speed of thought, and other interesting ideas

As I have posted previously, I spent last week out at the Netezza User Conference, where they announced their new Netezza Spatial product for very high performance spatial analytics on large data volumes. I thought it was an excellent event, and I continue to be very impressed with Netezza's products, people and ideas. I thought I would discuss a couple of general ideas that I found interesting from the opening presentation by CEO Jit Saxena.

The first was that if you can provide information at "the speed of thought", or the speed of a click, this enables people to do interesting things, and work in a different and much more productive way. Google Search is an example - you can ask a question, and you get an answer immediately. The answer may or may not be what you were looking for, but if it isn't you can ask a different question. And if you do get a useful answer, it may trigger you to ask additional questions to gain further insight on the question you are investigating. Netezza sees information at the speed of thought as a goal for complex analytics, which can lead us to get greater insights from data - more than you would if you spent the same amount of time working on a system which was say 20 times slower (spread over 20 times as much elapsed time), as you lose the continuity of thought. This seems pretty plausible to me.

A second idea is that when you are looking for insights from business data, the most valuable data is "on the edges" - one or two standard deviations away from the mean. This leads to another Netezza philosophy which is that you should have all of your data available and online, all of the time. This is in contrast to the approach which is often taken when you have very large data volumes, where you may work on aggregated data, and/or not keep a lot of historical data, to keep performance at reasonable levels (historical data may be archived offline). In this case of course you may lose the details of the most interesting / valuable data.

This got me to thinking about some of the places where you might apply some of those principles in the geospatial world. The following examples are somewhat speculative, but they are intended to get people thinking about the type of things we might do if we can do analysis 100x faster than we can now on very large data volumes, and follow the principle of looking for data "on the edges".

One area is in optimizing inspection, maintenance and management of assets for any organization managing infrastructure, like a utility, telecom or cable company, or local government. This type of infrastructure typically has a long life cycle. What if you stored say the last 10 or 20 years of data on when equipment failed and was replaced, when it was inspected and maintained, etc. Add in information on load/usage if you have it, detailed weather information (for exposed equipment), soil type (for underground equipment), etc, and you would have a pretty interesting (and large) dataset to analyze for patterns, which you could apply to how you do work in the future. People have been talking about doing more sophisticated pre-emptive / preventive maintenance in utilities for a long time, but I don't know of anyone doing very large scale analysis in this space. I suspect there are a lot of applications in different areas where interesting insights could be obtaining by analyzing large historical datasets.

This leads into another thought, which is that of analyzing GPS tracks. As GPS and other types of data tracking (like RFID) become more pervasive, we will have access to huge volumes of data which could provide valuable insights but are challenging to analyze. Many organizations now have GPS in their vehicles for operational purposes, but in most cases do not keep much historical data online, and may well store relatively infrequent location samples, depending on the application (for a long distance trucking company, samples every 5, 15 or even 60 minutes would provide data that had some interest). But there are many questions that you couldn't answer with a coarse sampling but could with a denser sampling of data (like every second or two). Suppose I wanted to see how much time my fleet of vehicles spent waiting to turn left compared to how long they spend waiting to turn right, to see if I could save a significant amount of time for a local delivery service by calculating routes that had more right turns in them (assuming I am in a country which drives on the right)? I have no idea if this would be the case or not, but it would be an interesting question to ask, which could be supported by a dense GPS track but not by a sparse one. Or I might want to look at how fuel consumption is affected by how quickly vehicles accelerate (and model the trade-off in potential cost savings versus potential time lost) - again this is something that in theory I could look at with a dense dataset but not a sparse one. Again, this is a somewhat speculative / hypothetical example, but I think it is interesting to contemplate new types of questions we could ask with the sort of processing power that Netezza can provide - and think about situations where we may be throwing away (or at least archiving offline) data that could be useful. In general I think that analyzing large spatio-temporal datasets is going to become a much more common requirement in the near future.

I should probably mention a couple of more concrete examples too. I have talked to several companies doing site selection with sophisticated models that take a day or two to run. Often they only have a few days to decide whether (and how much) to bid for a site, so they may only be able to run one or two analyses before having to decide. Being able to run tens or hundreds of analyses in the same time would let them vary their assumptions and test the sensitivity of the model to changes, and analyze details which are specific to that site - going back to the "speed of thought" idea, they may be able to ask more insightful questions if they can do multiple analyses in quick succession.

Finally, for now, another application that we have had interest in is analyzing the pattern of dropped cell phone calls. There are millions of calls placed every day, and this is an application where there is both interest in doing near real time analysis, as well as more extended historical analysis. As with the hurricane analysis application discussed previously, the Netezza system is well suited to analysis on rapidly changing data, as it can be loaded extremely quickly, in part because of the lack of indexes in Netezza - maintaining indexes adds a lot of overhead to data loading in traditional system architectures.


Anonymous said...

your question about the time associated with waiting to turn left, vs. making a turn right has already been computed... UPS and its operational engineers have measured the difference, and it is now SOP to do 3 right turns instead of wait for a left...

Peter Batty said...

Thanks, that is interesting about the 3 right turns versus a left. But actually this triggers an interesting thought on the potential of more detailed analysis (the theme of the post), since this is obviously an average and will vary significantly by location and time of day - in a case where there is very little traffic, obviously it would be less efficient to do three right turns versus a left. So this is useful as a rule of thumb, but ideally you would have an algorithm which took these other factors into account.