Tuesday, September 16, 2008

Netezza Spatial

I have alluded previously to some interesting developments going on in very high performance spatial analytics, and today the official announcement went out about Netezza Spatial (after being pre-announced via Adena at All Points Blog and James Fee).

For me, the most impressive aspect of today at the Netezza User Conference was the presentation from Shajy Mathai of Guy Carpenter, the first customer for Netezza Spatial, who talked about how they have improved the performance of their exposure management application, which analyzes insurance risk due to an incoming hurricane. They have reduced the time taken to do an analysis of the risk on over 4 million insured properties from 45 minutes using Oracle Spatial to an astonishing 5 seconds using Netezza (that’s an improvement of over 500x!). Their current application won the Oracle Spatial Excellence “Innovator Award” in 2006. About half of the 45 minutes is taken up loading the latest detailed weather forecast/risk polygons and other related data, and the other half doing point-in-polygon calculations for the insured properties. In Netezza the data updates just run continuously in the background as they are so fast, and the point-in-polygon analysis takes about 5 seconds. For insurance companies with billions of dollars of insured properties at risk, this time difference to get updated information is hugely valuable. The performance improvement you will see over traditional database systems will vary depending on the data and the types of analysis being performed - in general we anticipate performance improvements will typically be in the range of 10x to 100x.
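For readers who have not met the term, a point-in-polygon calculation simply asks whether a location (here, an insured property) falls inside an area (here, a risk polygon). The following is a minimal Python sketch of the classic ray-casting test, purely to illustrate the kind of check involved. It is not Guy Carpenter's or Netezza's implementation, and the function names, property IDs and coordinates are all made up for the example.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: cast a ray to the right of the point and
    count how many polygon edges it crosses (odd count = inside)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the ring
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def properties_at_risk(properties: Dict[str, Point],
                       risk_polygon: List[Point]) -> List[str]:
    """Return the IDs of the (hypothetical) properties inside the polygon."""
    return [pid for pid, location in properties.items()
            if point_in_polygon(location, risk_polygon)]


# Illustrative data only: a square "risk area" and three properties.
risk_area = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
insured = {"A": (5.0, 5.0), "B": (15.0, 5.0), "C": (1.0, 9.0)}
at_risk = properties_at_risk(insured, risk_area)
```

The test itself is cheap. The cost in a real application comes from repeating it for millions of properties against detailed polygons, which is why this kind of workload benefits so much from being run in parallel.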

Netezza is a company I have been very impressed with (and in the interests of full disclosure, I am currently doing some consulting work for them and have been for several months). They have taken a radically different approach to complex database applications in the business intelligence space, developing a “database appliance” – a combination of specialized hardware and their own database software. It delivers performance for complex queries on large (multi-terabyte) databases that is typically 10 to 100 times faster than traditional relational database architectures like Oracle or SQL Server. There are two primary means by which they achieve this level of performance. One is by highly parallelizing the processing of queries – a small Netezza configuration has about 50 parallel processing units, each one a powerful computer in its own right, and a large one has around 1000 parallel units (known as Snippet Processing Units or SPUs). Effectively parallelizing queries is a complex software problem – it’s not just a case of throwing lots of hardware at the issue. The second key element is their smart disk readers, which use Field Programmable Gate Arrays (FPGAs) to implement major elements of SQL in hardware. Basic filtering (eliminating unwanted rows) and projection (eliminating unwanted fields) of data all happens in the disk reader, so unnecessary data is never even read from disk. This eliminates a huge bottleneck in doing complex ad hoc queries in traditional systems.
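To make the two ideas concrete, here is a rough software analogy in Python. It is only a sketch under my own assumptions: the real system does the filtering and projection in FPGA hardware on each SPU, not in threads, and the function names (partition_rows, scan_partition, parallel_query) and the sample table are invented for illustration. The data is spread across several partitions, each partition is scanned independently, and each scan discards unwanted rows and fields before anything is passed on.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, n):
    """Spread rows round-robin across n partitions (stand-ins for SPUs)."""
    return [rows[i::n] for i in range(n)]


def scan_partition(rows, predicate, columns):
    """Scan one partition: filter out unwanted rows, then project
    each surviving row down to the requested columns only."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


def parallel_query(partitions, predicate, columns):
    """Scan every partition in parallel and merge the small results."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(
            lambda part: scan_partition(part, predicate, columns),
            partitions))
    merged = []
    for part_result in results:
        merged.extend(part_result)
    return merged


# Illustrative table: ten rows with a wide "notes" field we never need.
table = [
    {"id": i, "state": "FL" if i % 2 == 0 else "TX",
     "value": i * 100, "notes": "long free text"}
    for i in range(10)
]
spus = partition_rows(table, 3)
florida = parallel_query(spus, lambda r: r["state"] == "FL", ["id", "value"])
```

Note that there is no index anywhere in this sketch: every partition is scanned in full, and the speed comes from doing many small scans at once and throwing away unneeded data as early as possible.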

Apart from outstanding performance, the other key benefit of Netezza is significantly simpler design and administration than with traditional complex database applications. Much of this is due to the fact that Netezza has no indexes, and design of indexes and other ongoing performance tuning operations usually take a lot of time for complex analytic applications in a traditional environment.

Netezza’s technology has been validated by their dramatic success in the database market, which in my experience is quite conservative and resistant to change. This year they expect revenues of about $180m, growth of over 40% over last year’s $127m. About a year ago, Larry Ellison of Oracle said in a press conference that Oracle would have something to compete with Netezza within a year. This is notable because it’s unusual for them to mention specific competitors, and even more unusual to admit that they basically can’t compete with them today and won’t for a year. Given the complexity of what Netezza has done, and the difficulty of developing specialized hardware as well as software, I am skeptical about others catching them any time soon.

So anyway (to get back to the spatial details), the exciting news for people trying to do complex large scale spatial analytics is that Netezza has now announced support for spatial data types and operators – specifically vector data types: points, lines and areas. They support the OGC standard SQL for Simple Features, as well as commonly used functions not included in the standard (the functionality is similar to PostGIS). This enables dramatic performance improvements for complex applications, and in many cases lets us answer questions that we couldn’t even contemplate asking before. We have seen strong interest already from several markets, including insurance, retail, telecom, online advertising, crime analysis and intelligence, and Federal government. I suspect that many of the early users will be existing Netezza customers, or other business intelligence (BI) users, who want to add a location element to their existing BI applications. But I also anticipate some users with existing complex spatial applications and large data volumes, for whom Netezza can deliver these substantial performance improvements for analytics, while simplifying administration and tuning requirements.

One important thing to note is that Netezza is specifically not focused on "operational" geospatial applications. The architecture is designed to work effectively for mass queries and analysis - if you are just trying to access a single record or small set of records with a pre-defined query, then a traditional database architecture is the right solution. So in cases where the application focus is not exclusively on complex analytics, Netezza is likely to be an add-on to existing operational systems, not a replacement. This is typical in most organizations doing business intelligence applications, where data is consolidated from multiple operational systems into a corporate data warehouse for analytics (whether spatial or non-spatial).

Aside from the new spatial capabilities, the Netezza conference has been extremely interesting in general, and I will post again in the near future with more general comments on some of the interesting themes that I have heard here, including "providing information at the speed of thought"!

Having worked with interesting innovations in spatial database technologies for many years, from IBM's early efforts on storing spatial data in DB2 in the mid to late eighties, to Smallworld's innovations with long transactions, graphical performance and extreme scalability in terms of concurrent update users in the early nineties, and Ubisense's very high performance real time precision tracking system more recently, it's exciting to see another radical step forward for the industry, this time in terms of what is possible in the area of complex spatial analytics.

7 comments:

Anonymous said...

Wow all over. Just to clarify - do they reprogram the FPGAs dynamically to handle whatever query is being processed at the time? That would be *AWESOME*! Or are the FPGAs just there to decrease time-to-market and allow bugfixes and updates every so often? I checked the press releases, etc, but couldn't find this detail.

Anonymous said...

It would be even cooler if they didn't have ridiculous limitations like 4000 point polygons. I am sure in some very specialized use cases it is fast, but it is like going back to 1985 in terms of functionality and limits.

Anonymous said...

>> ridiculous limitations

Doing things in hardware often introduces some hard limitations, since you can't just magic up an extra set of registers/processors/pipelines out of thin air like you can in software. Pays money, takes choice. :-)

Peter Batty said...

@Anonymous: I assume from the tone of your comment that you are probably a concerned competitor of Netezza Spatial :). I started working in the geospatial industry in 1986, and I can assure you that there is absolutely no resemblance between the functionality available in Netezza Spatial and what was available back then! Functionally it is very rich for a first release – as I mentioned it conforms to the OGC Simple Features standard, and does a lot beyond that. It is very comparable in functionality to PostGIS and SQL Server Spatial. Oracle has some additional functionality that none of the others have, like network modeling, workspace management, etc. In general a lot of these other areas have had limited uptake in the market from my perspective – I still see most people using functionality provided by the various GIS vendors in what you might call these “non core” spatial areas.

It is true that there is currently a limit of 4000 vertices in a single geometry, due to a size limit in the underlying Netezza architecture. But there are very straightforward ways around this limit, which can easily be made transparent to the user.

In summary, functionally it is highly competitive with the existing spatial databases in terms of handling vector geospatial data. For complex queries on large datasets, it generally blows the existing systems away in terms of performance. As I mentioned in the main post, if you have a well indexed query which returns a small number of records quickly in a traditional system, then Netezza may not give you much improvement – that’s really not their focus. If you have queries that are currently taking minutes or hours rather than seconds, those are probably very strong candidates for a dramatic speedup with Netezza.

Peter Batty said...

@tartley: yup, the FPGAs are programmed dynamically for the current query - definitely cool :) !!

Anonymous said...

> yup, the FPGAs are programmed dynamically for the current query - definitely cool :) !!

Nope, it's still just software. There isn't any dynamically programmed hardware, but the magic is that it is massively parallel. The Netezza architecture is basically doing a full scan/read of the entire database for every query. That is why they don't need indexes. The beauty of it of course is that it splits that scan across LOTS of blades. For queries that process all the records one at a time, it really is magic.

As was pointed out above, it is slower than traditional systems (by a lot) for finding and returning a few records as well as some other types of more complex aggregate queries. There are some things that indexes are plain better for. The real promised land of course is still waiting for someone to combine the massively parallel processing of Netezza with the amazing indexing capabilities of Sybase IQ among other systems. I am honestly curious who will get there first. In the meantime, there is a very cool (albeit expensive) new arrow in the quiver.

Anonymous said...

@anonymous

The source code for the query is generated from (notionally) your SQL. The generated C is cross-compiled on the node for the FPGA instruction set, then the code is written to the FPGA where it is executed. That's my understanding.