Table Partitioning – Greenplum and Netezza Differences

Greenplum Data Warehouse Partitioning

Table Partitioning in Greenplum Database

Greenplum Database divides tables into parts (also known as partitions) to enable massively parallel processing. Tables are partitioned during CREATE TABLE using the PARTITION BY (and optionally the SUBPARTITION BY) clause. When you partition a table in Greenplum Database, you create a top-level (or parent) table with one or more levels of sub-tables (or child tables). Internally, Greenplum Database creates an inheritance relationship between the top-level table and its underlying partitions, similar to the functionality of the INHERITS clause of PostgreSQL. Greenplum uses the partition criteria defined during table creation to create each partition with a distinct CHECK constraint, which limits the data that table can contain. The query planner uses CHECK constraints to determine which table partitions to scan to satisfy a given query predicate. The Greenplum system catalog stores partition hierarchy information so that rows inserted into the top-level parent table propagate correctly to the child table partitions. To change the partition design or table structure, alter the parent table using ALTER TABLE with the PARTITION clause.
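
Because each child partition is a regular table, you can add, drop, or exchange partitions on the parent and inspect the hierarchy through the system catalog. A minimal sketch, assuming a monthly-partitioned sales table like the one created in the next section:

-- Add another monthly partition to the parent table (hypothetical date range):
ALTER TABLE sales ADD PARTITION
    START (date '2009-01-01') INCLUSIVE
    END (date '2009-02-01') EXCLUSIVE;

-- View the partition hierarchy stored in the Greenplum system catalog:
SELECT partitiontablename, partitionrangestart, partitionrangeend
FROM pg_partitions
WHERE tablename = 'sales';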

Creating Partitioned Tables
You partition tables when you create them with CREATE TABLE. This section provides
examples of SQL syntax for creating a table with various partition designs.
To partition a table:
1. Decide on the partition design: date range, numeric range, or list of values.
2. Choose the column(s) on which to partition the table.
3. Decide how many levels of partitions you want. For example, you can create a
date range partition table by month and then subpartition the monthly partitions by
sales region.
• Defining Date Range Table Partitions
• Defining Numeric Range Table Partitions
• Defining List Table Partitions
• Defining Multi-level Partitions
• Partitioning an Existing Table

Defining Date Range Table Partitions:
CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( START (date '2008-01-01') INCLUSIVE
END (date '2009-01-01') EXCLUSIVE
EVERY (INTERVAL '1 day') );
You can also declare and name each partition individually. For example:

CREATE TABLE sales (id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( PARTITION Jan08 START (date '2008-01-01') INCLUSIVE,
PARTITION Feb08 START (date '2008-02-01') INCLUSIVE,
PARTITION Mar08 START (date '2008-03-01') INCLUSIVE,
PARTITION Apr08 START (date '2008-04-01') INCLUSIVE,
PARTITION May08 START (date '2008-05-01') INCLUSIVE,
PARTITION Jun08 START (date '2008-06-01') INCLUSIVE,
PARTITION Jul08 START (date '2008-07-01') INCLUSIVE,
PARTITION Aug08 START (date '2008-08-01') INCLUSIVE,
PARTITION Sep08 START (date '2008-09-01') INCLUSIVE,
PARTITION Oct08 START (date '2008-10-01') INCLUSIVE,
PARTITION Nov08 START (date '2008-11-01') INCLUSIVE,
PARTITION Dec08 START (date '2008-12-01') INCLUSIVE
END (date '2009-01-01') EXCLUSIVE );
You do not have to declare an END value for each partition, only the last one. In this
example, Jan08 ends where Feb08 starts.
Defining Numeric Range Table Partitions
A numeric range partitioned table uses a single numeric data type column as the
partition key column. For example:
CREATE TABLE rank (id int, rank int, year int, gender char(1), count int)
DISTRIBUTED BY (id)
PARTITION BY RANGE (year)
( START (2001) END (2008) EVERY (1),
DEFAULT PARTITION extra );
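
Defining List Table Partitions
A list partitioned table can use any data type column that supports equality comparisons as the partition key. A minimal sketch, using the same rank table but partitioning it by gender instead:
CREATE TABLE rank (id int, rank int, year int, gender char(1), count int)
DISTRIBUTED BY (id)
PARTITION BY LIST (gender)
( PARTITION girls VALUES ('F'),
PARTITION boys VALUES ('M'),
DEFAULT PARTITION other );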

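Defining Multi-level Partitions
You can also create multi-level designs, as mentioned in step 3 above. The sketch below (column names are illustrative) partitions a sales table by month and subpartitions each month by sales region, using a SUBPARTITION TEMPLATE so every monthly partition gets the same subpartition design:
CREATE TABLE sales (trans_id int, date date, amount decimal(9,2), region text)
DISTRIBUTED BY (trans_id)
PARTITION BY RANGE (date)
SUBPARTITION BY LIST (region)
SUBPARTITION TEMPLATE
( SUBPARTITION usa VALUES ('usa'),
SUBPARTITION europe VALUES ('europe'),
SUBPARTITION asia VALUES ('asia'),
DEFAULT SUBPARTITION other_regions )
( START (date '2008-01-01') INCLUSIVE
END (date '2009-01-01') EXCLUSIVE
EVERY (INTERVAL '1 month'),
DEFAULT PARTITION outlying_dates );
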
Refer to the Greenplum DBA Guide for further information.

Netezza's alternative to partitioning

Netezza zone maps combine the best of both partitioning and indexing. If your data is well organized, zone maps minimize the amount of data read off disk. Beyond the simpler DDL, traditional database indexes and partitions have significant disadvantages compared to the zone map approach.

Netezza does not have indexes that the DBA must create and maintain; instead, it automatically builds and maintains zone maps for all tables in the database. These zone maps record the minimum and maximum value of every zone-mappable column in each extent of each table. Before a query starts reading from disk, Netezza compares the query predicates to the zone maps to determine which table extents can be skipped and which must be read.

For example, if you want to calculate the total sales of red Ford Mustangs in June 2011, Netezza can skip any table extent that holds no data for June 2011, that is, any extent whose maximum date is earlier than, or whose minimum date is later than, June 2011. For a database with 7 years of sales history organized on the date, that is roughly 83 of 84 months skipped, eliminating 98% or more of the required I/O.

Indexes illustrate the advantage of zone maps even more clearly. Most indexes are useful only for single-record retrieval, and the problem is the cost of maintaining, storing, and reading them. While implementations differ (B-tree, bitmap, and so on), they all boil down to mapping a single index value to a single record: a billion-row fact table means a billion entries in the (thin) index. Zone maps are always an order of magnitude smaller because they are kept at the extent level, so their maintenance cost is trivial.

Also, if you are processing a large percentage of the data, indexes are no help at all. If the data is not organized (sorted) on the indexed column, the values you are looking for are likely to be spread throughout the extents on disk. That means you have to read the entire table anyway, with the additional cost of reading the index first.

Syntax:
CREATE [ TEMPORARY | TEMP ] TABLE table_name
(
column_name type [ [ constraint_name ] column_constraint [ constraint_characteristics ] ] [, ... ]
[ [ constraint_name ] table_constraint [ constraint_characteristics ] ] [, ... ]
)
[ DISTRIBUTE ON ( column [, ...] ) ]
[ ORGANIZE ON { ( <column> [, ...] ) | NONE } ]

ORGANIZE ON { (<columns>) | NONE }
Specifies the columns (from one to four) on which the table is organized. Not available for external tables. If columns are specified, those columns cannot be dropped, the table cannot have any materialized views, and all specified column data types must be zone-mappable. When NONE is specified, any organizing key definitions are removed from the host catalog. The data reorganization itself takes effect when GROOM TABLE is run. For more information, see “Using Clustered Base Tables” in the IBM Netezza System Administrator’s Guide.
Example:
CREATE TABLE test_dm (id int, date_dt timestamp, amt decimal(10,2))
DISTRIBUTE ON (id)
ORGANIZE ON (id,date_dt);

insert into test_dm values (100,'2014-05-01 00:00:00',10.1);

insert into test_dm values (200,'2014-05-05 00:00:00',20.1);

insert into test_dm values (300,'2014-05-06 00:00:00',340.1);
insert into test_dm values (400,now(),10.1);
explain verbose select * from test_dm where date_dt::date = now()::date;

QUERY VERBOSE PLAN:

Node 1.
  [SPU Sequential Scan table "TEST_DM" {(TEST_DM."ID")}]
      -- Estimated Rows = 1, Width = 20, Cost = 0.0 .. 0.0, Conf = 64.0
      Restrictions:
        ((TEST_DM.DATE_DT < '2014-05-13 00:00:00'::"TIMESTAMP") AND (TEST_DM.DATE_DT >= '2014-05-12 00:00:00'::"TIMESTAMP"))
      Projections:
        1:TEST_DM."ID" 2:TEST_DM.DATE_DT 3:TEST_DM.AMT
  [SPU Return]
  [Host Return]

QUERY SQL:
explain verbose select * from test_dm where date_dt::date = now()::date

QUERY PLANTEXT:
NOTICE: QUERY PLAN:
Sequential Scan table "TEST_DM" (cost=0.0..0.0 rows=1 width=20 conf=64) {(TEST_DM."ID")}
    (xpath_none, locus=spu subject=self)
    (spu_send, locus=host subject=self)
    (host_return, locus=host subject=self)
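
Note that the Restrictions line shows the planner rewrote the date_dt::date = now()::date predicate into a simple range on DATE_DT, which is exactly the kind of predicate the zone maps on an organized table can use to skip extents. If you change the organizing keys of an existing table, the new organization only takes effect after the table is groomed. A minimal sketch, assuming the test_dm table above:

ALTER TABLE test_dm ORGANIZE ON (date_dt);

-- Reorganize the existing rows so the zone maps reflect the new organizing key:
GROOM TABLE test_dm RECORDS ALL;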
