Tuesday, August 4, 2009

Netezza announces new architecture with 10-15x price-performance improvement

I have previously discussed Netezza, who produce data warehousing appliances that provide outstanding performance and simplicity for complex analytics on very large data volumes. I did some consulting work with them last year as they added spatial capabilities to their system. Today they announced a major new architecture, which they say gives a 3-5x performance improvement for typical workloads (more for some operations, less for others), and reduces price per terabyte by a factor of 3. So overall price-performance improves by a factor of 10-15. Database guru Curt Monash has a good discussion of the new architecture and pricing implications on his blog.

The new hardware architecture is more flexible than the old one, making it easier to vary the proportions of processor, memory and disk, which will allow them to provide additional product families in future:
  • High storage (more disk, lower cost per terabyte, lower throughput)
  • High throughput (higher cost per terabyte but faster)
  • Entry level models
I think that the entry-level and high-throughput models will be especially interesting for geospatial applications, many of which could do interesting analytics with a Netezza appliance, but may not have the super large data volumes (multiple terabytes) that Netezza's business intelligence customers have. Another interesting change for the future is that Netezza's parallel processing units (now Snippet blades, or S-blades, formerly Snippet Processing Units or SPUs) now run Linux, whereas previously they ran a rather more obscure operating system called Nucleus. In future, this should make it easier to port existing analytic applications to take advantage of Netezza's highly parallel architecture (though this is not available yet). The parallel processing units also do floating point operations in hardware rather than software, which should give a significant performance benefit for their spatial capabilities.

I continue to think that Netezza offers some very interesting capabilities for users wanting to do high-end geospatial analytic applications on very large data volumes, and that there will be a lot of scope for its use in analyzing historical location data generated by GPS and other location sensors. And I am just impressed by anyone who produces an overnight 10-15x price-performance improvement in any product :) !

Thursday, February 12, 2009

Webinar next week on data warehouse appliances for Location Intelligence

I have posted previously about Netezza, who make data warehouse appliances, which can perform certain types of complex spatial analysis from 10x to 100x faster than traditional systems - I did some consulting for them last year. On Thursday next week I am speaking in a free webinar hosted by Directions Magazine and sponsored by Netezza, on the topic of data warehouse appliances for Location Intelligence. My talk will include the following topics:
  • One enterprise DBMS?
  • Data warehousing concepts
  • Data warehouse appliances
  • New possibilities for geospatial applications
On the subject of "one enterprise DBMS" I get to re-use the following slide, a slightly different version of which I first used back in 1993 or so when talking about Smallworld's VMDS database ... it's good to have a little material that can last that long in these times of rapid change :) !
One size fits all
The other main speaker will be Shajy Mathai from reinsurance company Guy Carpenter, who have been doing some very interesting things with Netezza - Shajy gave an excellent presentation at the Netezza User Conference and I look forward to hearing what he has to say.

If you're interested you can get more information and sign up here.

Friday, September 19, 2008

Analysis at the speed of thought, and other interesting ideas

As I have posted previously, I spent last week out at the Netezza User Conference, where they announced their new Netezza Spatial product for very high performance spatial analytics on large data volumes. I thought it was an excellent event, and I continue to be very impressed with Netezza's products, people and ideas. I thought I would discuss a couple of general ideas that I found interesting from the opening presentation by CEO Jit Saxena.

The first was that if you can provide information at "the speed of thought", or the speed of a click, this enables people to do interesting things, and work in a different and much more productive way. Google Search is an example - you can ask a question, and you get an answer immediately. The answer may or may not be what you were looking for, but if it isn't you can ask a different question. And if you do get a useful answer, it may trigger you to ask additional questions to gain further insight on the issue you are investigating. Netezza sees information at the speed of thought as a goal for complex analytics, which can lead to greater insights from data - more than you would get if you spent the same amount of time working on a system that was, say, 20 times slower (spread over 20 times as much elapsed time), as you lose the continuity of thought. This seems pretty plausible to me.

A second idea is that when you are looking for insights from business data, the most valuable data is "on the edges" - one or two standard deviations away from the mean. This leads to another Netezza philosophy, which is that you should have all of your data available and online, all of the time. This is in contrast to the approach often taken with very large data volumes, where you may work on aggregated data, and/or not keep a lot of historical data online, in order to keep performance at reasonable levels (historical data may be archived offline). In that case, of course, you may lose the details of the most interesting / valuable data.
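
To make the "edges" idea a little more concrete, here is a rough sketch of the kind of query involved - the table and column names are purely hypothetical, and the exact statistical functions available vary by platform, but the point is simply that you keep the detailed rows that sit well away from the mean online and queryable, rather than working only with aggregates:

    -- Hypothetical example: pull the transactions that are more than two
    -- standard deviations away from the mean order value. Table and
    -- column names are purely illustrative.
    SELECT t.*
    FROM   sales_transactions t,
           (SELECT AVG(amount)    AS mean_amt,
                   STDDEV(amount) AS sd_amt
            FROM   sales_transactions) stats
    WHERE  ABS(t.amount - stats.mean_amt) > 2 * stats.sd_amt;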

This got me thinking about places where you might apply these principles in the geospatial world. The following examples are somewhat speculative, but they are intended to get people thinking about the types of things we might do if we could do analysis 100x faster than we can now on very large data volumes, and follow the principle of looking for data "on the edges".

One area is in optimizing inspection, maintenance and management of assets for any organization managing infrastructure, like a utility, telecom or cable company, or local government. This type of infrastructure typically has a long life cycle. What if you stored, say, the last 10 or 20 years of data on when equipment failed and was replaced, when it was inspected and maintained, and so on? Add in information on load/usage if you have it, detailed weather information (for exposed equipment), soil type (for underground equipment), etc, and you would have a pretty interesting (and large) dataset to analyze for patterns, which you could apply to how you do work in the future. People have been talking about doing more sophisticated pre-emptive / preventive maintenance in utilities for a long time, but I don't know of anyone doing very large scale analysis in this space. I suspect there are a lot of applications in different areas where interesting insights could be obtained by analyzing large historical datasets.
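
As a purely illustrative sketch of what such an analysis might look like in SQL (all of the table and column names here are hypothetical), the key point is that once all of the history is online this becomes a fairly straightforward join and aggregation:

    -- Speculative example: look for patterns in historical equipment
    -- failures by equipment type, weather band and soil type. All tables
    -- and columns are hypothetical.
    SELECT e.equipment_type,
           w.max_temp_band,
           s.soil_type,
           COUNT(*)                    AS failures,
           AVG(f.age_at_failure_years) AS avg_age_at_failure
    FROM   equipment_failures f
    JOIN   equipment          e ON e.equipment_id = f.equipment_id
    JOIN   daily_weather      w ON w.station_id   = e.nearest_station_id
                               AND w.obs_date     = f.failure_date
    JOIN   soil_zones         s ON s.zone_id      = e.soil_zone_id
    GROUP  BY e.equipment_type, w.max_temp_band, s.soil_type
    ORDER  BY failures DESC;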

This leads into another thought, which is that of analyzing GPS tracks. As GPS and other types of data tracking (like RFID) become more pervasive, we will have access to huge volumes of data which could provide valuable insights but are challenging to analyze. Many organizations now have GPS in their vehicles for operational purposes, but in most cases do not keep much historical data online, and may well store relatively infrequent location samples, depending on the application (for a long distance trucking company, samples every 5, 15 or even 60 minutes would still provide data of some interest). But there are many questions that you couldn't answer with a coarse sampling but could with a denser sampling of data (like every second or two). Suppose I wanted to see how much time my fleet of vehicles spent waiting to turn left compared to how much time they spent waiting to turn right, to see if I could save a significant amount of time for a local delivery service by calculating routes that had more right turns in them (assuming I am in a country which drives on the right). I have no idea whether this would be the case or not, but it would be an interesting question to ask, which could be supported by a dense GPS track but not by a sparse one. Or I might want to look at how fuel consumption is affected by how quickly vehicles accelerate (and model the trade-off in potential cost savings versus potential time lost) - again this is something that in theory I could look at with a dense dataset but not a sparse one. Again, this is a somewhat speculative / hypothetical example, but I think it is interesting to contemplate new types of questions we could ask with the sort of processing power that Netezza can provide - and think about situations where we may be throwing away (or at least archiving offline) data that could be useful. In general I think that analyzing large spatio-temporal datasets is going to become a much more common requirement in the near future.
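
For what it's worth, here is a very rough sketch of how the left-turn versus right-turn question might be approached with a dense track. It assumes a hypothetical table of one-second GPS samples and a platform that supports SQL window functions, and it glosses over details like heading wrap-around, but it gives a flavor of the kind of query a dense spatio-temporal dataset makes possible:

    -- Speculative sketch: treat a large heading change between consecutive
    -- one-second samples as a turn (negative change = left, since heading
    -- is measured clockwise), then count the near-stationary samples in
    -- the preceding 60 seconds as "waiting" time, grouped by turn
    -- direction. gps_points(vehicle_id, ts_epoch, speed_mph, heading_deg)
    -- is a hypothetical table; heading wrap-around (e.g. 350 -> 10
    -- degrees) is ignored for brevity.
    WITH turns AS (
        SELECT vehicle_id,
               ts_epoch AS turn_ts,
               heading_deg - LAG(heading_deg) OVER
                   (PARTITION BY vehicle_id ORDER BY ts_epoch) AS turn_deg
        FROM   gps_points
    )
    SELECT CASE WHEN t.turn_deg < 0 THEN 'left turns'
                ELSE 'right turns' END                      AS turn_type,
           SUM(CASE WHEN g.speed_mph < 2 THEN 1 ELSE 0 END) AS waiting_seconds
    FROM   turns t
    JOIN   gps_points g
      ON   g.vehicle_id = t.vehicle_id
     AND   g.ts_epoch BETWEEN t.turn_ts - 60 AND t.turn_ts
    WHERE  ABS(t.turn_deg) > 30
    GROUP  BY 1;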

I should probably mention a couple of more concrete examples too. I have talked to several companies doing site selection with sophisticated models that take a day or two to run. Often they only have a few days to decide whether (and how much) to bid for a site, so they may only be able to run one or two analyses before having to decide. Being able to run tens or hundreds of analyses in the same time would let them vary their assumptions and test the sensitivity of the model to changes, and analyze details which are specific to that site - going back to the "speed of thought" idea, they may be able to ask more insightful questions if they can do multiple analyses in quick succession.
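
One simple way to picture the "vary your assumptions" idea (with entirely hypothetical tables and a deliberately naive revenue model) is to evaluate a whole grid of scenarios in a single pass, rather than re-running the model once per scenario:

    -- Hypothetical sketch: score one candidate site under every
    -- combination of assumptions in scenario_params (a small table of,
    -- say, capture rates and average ticket sizes). The names and the
    -- scoring formula are purely illustrative.
    SELECT p.capture_rate,
           p.avg_ticket,
           SUM(b.households * b.visits_per_year
               * p.capture_rate * p.avg_ticket) AS projected_revenue
    FROM   trade_area_blocks b
    CROSS  JOIN scenario_params p
    WHERE  b.candidate_site_id = 'SITE_123'
    GROUP  BY p.capture_rate, p.avg_ticket
    ORDER  BY projected_revenue DESC;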

Finally (for now), another application we have seen interest in is analyzing patterns of dropped cell phone calls. There are millions of calls placed every day, and this is an application where there is interest both in near real time analysis and in more extended historical analysis. As with the hurricane analysis application discussed previously, the Netezza system is well suited to analysis on rapidly changing data, as data can be loaded extremely quickly, in part because of the lack of indexes in Netezza - maintaining indexes adds a lot of overhead to data loading in traditional system architectures.
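
To give a flavor of the near real time side of this (again, the table and column names are hypothetical, and interval syntax varies by SQL dialect), a query along these lines could be re-run continuously against freshly loaded call detail records to spot cells whose drop rate is spiking:

    -- Illustrative only: rank cells by dropped-call rate over the last
    -- hour. Table and column names are hypothetical.
    SELECT cell_id,
           COUNT(*)                                                 AS calls,
           SUM(CASE WHEN call_status = 'DROPPED' THEN 1 ELSE 0 END) AS dropped,
           SUM(CASE WHEN call_status = 'DROPPED' THEN 1.0 ELSE 0 END)
               / COUNT(*)                                           AS drop_rate
    FROM   call_detail_records
    WHERE  call_start >= NOW() - INTERVAL '1 hour'
    GROUP  BY cell_id
    ORDER  BY drop_rate DESC;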

Wednesday, September 17, 2008

Interview with Rich Zimmerman about Netezza Spatial

A new development for the geothought blog, our first video interview! It's not going to win any awards for cinematography or production, but hopefully it will be of some interest to the geospatial database geeks out there :). Rich Zimmerman of IISi is the lead developer of the recently announced spatial extensions to Netezza, and I chatted to him about some technical aspects of the work he's done. Topics include the geospatial standards followed in the development, why he chose not to use PostGIS source code directly, and how queries work in Netezza's highly parallelized architecture.

Interview with Rich Zimmerman about Netezza Spatial from Peter Batty on Vimeo.

Tuesday, September 16, 2008

Netezza Spatial

I have alluded previously to some interesting developments going on in very high performance spatial analytics, and today the official announcement went out about Netezza Spatial (after being pre-announced via Adena at All Points Blog and James Fee).

For me, the most impressive aspect of today at the Netezza User Conference was the presentation from Shajy Mathai of Guy Carpenter, the first customer for Netezza Spatial, who talked about how they have improved the performance of their exposure management application, which analyzes insurance risk due to an incoming hurricane. They have reduced the time taken to analyze the risk on over 4 million insured properties from 45 minutes using Oracle Spatial to an astonishing 5 seconds using Netezza (that’s over a 500x improvement!). Their current application won the Oracle Spatial Excellence “Innovator Award” in 2006. About half of the 45 minutes is taken up loading the latest detailed weather forecast/risk polygons and other related data, and the other half doing point-in-polygon calculations for the insured properties. In Netezza the data updates just run continuously in the background as they are so fast, and the point-in-polygon analysis takes about 5 seconds. For insurance companies with billions of dollars of insured properties at risk, this time difference in getting updated information is hugely valuable. The performance improvement you will see over traditional database systems will vary depending on the data and the types of analysis being performed - in general we anticipate improvements in the range of 10x to 100x.
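
In SQL terms the core of that kind of exposure calculation is conceptually quite simple - something along the lines of the sketch below, using OGC-style function names of the sort described later in this post (the actual Netezza Spatial syntax may differ, and the table and column names are hypothetical). The hard part is making it run fast on millions of points and constantly refreshed polygons:

    -- Illustrative point-in-polygon exposure query. Function names follow
    -- the OGC / PostGIS style; actual Netezza Spatial syntax may differ,
    -- and the tables and columns are hypothetical.
    SELECT r.risk_band,
           COUNT(*)             AS insured_properties,
           SUM(p.insured_value) AS total_exposure
    FROM   insured_properties p
    JOIN   hurricane_risk_polygons r
      ON   ST_Within(p.location, r.geom)
    GROUP  BY r.risk_band
    ORDER  BY total_exposure DESC;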

Netezza is a company I have been very impressed with (and in the interests of full disclosure, I am currently doing some consulting work for them and have been for several months). They have taken a radically different approach to complex database applications in the business intelligence space, developing a “database appliance” – a combination of specialized hardware and their own database software, which delivers performance for complex queries on large (multi-terabyte) databases that is typically 10 to 100 times faster than traditional relational database architectures like Oracle or SQL Server. There are two primary means by which they achieve this level of performance. One is by highly parallelizing the processing of queries – a small Netezza configuration has about 50 parallel processing units, each one a powerful computer in its own right, and a large one has around 1000 parallel units (known as Snippet Processing Units or SPUs). Effectively parallelizing queries is a complex software problem – it’s not just a case of throwing lots of hardware at the issue. The second key element is their smart disk readers, which use Field Programmable Gate Arrays (FPGAs) to implement major elements of SQL in hardware. Basic filtering (eliminating unwanted rows) and projection (eliminating unwanted fields) happen in the disk reader, so unnecessary data is never even read from disk - this eliminates a huge bottleneck in doing complex ad hoc queries on traditional systems.

Apart from outstanding performance, the other key benefit of Netezza is significantly simpler design and administration than with traditional complex database applications. Much of this is because Netezza has no indexes - index design and other ongoing performance tuning operations usually take a lot of time for complex analytic applications in a traditional environment.

Netezza’s technology has been validated by their dramatic success in the database market, which in my experience is quite conservative and resistant to change. This year they expect revenues of about $180m, up more than 40% from last year’s $127m. About a year ago, Larry Ellison of Oracle said in a press conference that Oracle would have something to compete with Netezza within a year. This is notable because it’s unusual for them to mention specific competitors, and even more unusual to admit that they basically can’t compete with them today and won’t be able to for a year. Given the complexity of what Netezza has done, and the difficulty of developing specialized hardware as well as software, I am skeptical about others catching them any time soon.

So anyway (to get back to the spatial details), the exciting news for people trying to do complex large scale spatial analytics is that Netezza has now announced support for spatial data types and operators – specifically vector data types: points, lines and areas. They support the OGC Simple Features for SQL standard, as well as commonly used functions not included in the standard (the functionality is similar to PostGIS). This enables dramatic performance improvements for complex applications, and in many cases lets us answer questions that we couldn’t even contemplate asking before. We have seen strong interest already from several markets, including insurance, retail, telecom, online advertising, crime analysis and intelligence, and Federal government. I suspect that many of the early users will be existing Netezza customers, or other business intelligence (BI) users, who want to add a location element to their existing BI applications. But I also anticipate some users with existing complex spatial applications and large data volumes, for whom Netezza can deliver substantial performance improvements for analytics while simplifying administration and tuning requirements.
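
To give a flavor of the sort of query this enables - again, the function names here follow the OGC / PostGIS style and everything else is hypothetical, so treat it as a sketch rather than actual Netezza Spatial syntax - a simple crime analysis example might look like this:

    -- Illustrative: count crime incidents within 500 meters of each
    -- transit stop. Assumes geometries are stored in a projected
    -- coordinate system with units of meters; all table and column names
    -- are hypothetical.
    SELECT s.stop_id,
           COUNT(i.incident_id) AS incidents_within_500m
    FROM   transit_stops s
    LEFT   JOIN crime_incidents i
      ON   ST_Distance(s.location, i.location) <= 500
    GROUP  BY s.stop_id
    ORDER  BY incidents_within_500m DESC;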

One important thing to note is that Netezza is specifically not focused on "operational" geospatial applications. The architecture is designed to work effectively for mass queries and analysis - if you are just trying to access a single record or small set of records with a pre-defined query, then a traditional database architecture is the right solution. So in cases where the application focus is not exclusively on complex analytics, Netezza is likely to be an add-on to existing operational systems, not a replacement. This is typical in most organizations doing business intelligence applications, where data is consolidated from multiple operational systems into a corporate data warehouse for analytics (whether spatial or non-spatial).

Aside from the new spatial capabilities, the Netezza conference has been extremely interesting in general, and I will post again in the near future with more general comments on some of the interesting themes that I have heard here, including "providing information at the speed of thought"!

Having worked with interesting innovations in spatial database technologies for many years, from IBM's early efforts on storing spatial data in DB2 in the mid to late eighties, to Smallworld's innovations with long transactions, graphical performance and extreme scalability in terms of concurrent update users in the early nineties, and Ubisense's very high performance real time precision tracking system more recently, it's exciting to see another radical step forward for the industry, this time in terms of what is possible in the area of complex spatial analytics.