Parallel Restore and Partitioned Tables in Greenplum Database
NOTE: This is a crosspost of a blog article I wrote for the greenplum.org website.
The release of Greenplum Database 7 (GPDB7) brings with it many new features, and each of them requires some thought to make sure we get the most out of them. Today, I want to talk about the tweaks we on the Greenplum Kernel team have made to get the most out of the new partitioning syntax. Hopefully this will be both interesting and useful for any other applications that are looking to interact with partitioned tables in large numbers.
Background
GPDB7 introduces a new syntax for creating and managing partitioned tables, which we are calling
Modern syntax. This should be very familiar to anyone who’s worked with PostgreSQL, as it is
identical to theirs. While Greenplum has had its own unique partitioning syntax for quite a while
now (now going by the name Classic syntax), we believe that aligning with our upstream will be very
valuable for our users. Additionally, Modern syntax is very flexible and allows a much clearer way
to express variety across leaf tables. This does, however, come at the cost of being substantially
more verbose, which means it requires a bit more finesse to scale up efficiently. In our tests,
this verbosity showed up as substantial performance regressions when restoring large amounts of
metadata using either gprestore or pg_dump. Obviously, we do not want our shiny new features to
cause performance regressions, so the Kernel team dug in to figure out a way to have our cake and
eat it too.
Baseline Measurements
I put together an adversarial schema to test this problem, using tables from TPC-DS as a base. It uses Classic syntax to cause heavy partitioning, with 28 root tables and ~164,000 tables overall.
Here’s an example of one of the tables:
CREATE TABLE wide.orders
(O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS CHAR(1),
O_TOTALPRICE DECIMAL(15,2),
O_ORDERDATE DATE,
O_ORDERPRIORITY CHAR(15),
O_CLERK CHAR(15),
O_SHIPPRIORITY INTEGER,
O_COMMENT VARCHAR(79))
DISTRIBUTED BY (O_ORDERKEY)
PARTITION BY RANGE (O_ORDERDATE)
(START ('1982-01-01') INCLUSIVE END ('2015-12-31') INCLUSIVE EVERY (5),
 DEFAULT PARTITION others);
This schema is represented in Classic syntax as 28 CREATE TABLE statements and 28 ALTER OWNER
statements. In one possible implementation of Modern syntax, it instead needs ~164k each of
CREATE TABLE, ALTER OWNER, and ATTACH PARTITION statements. While each of those initial 28
statements is far bigger, every statement carries its own overhead, and across hundreds of
thousands of statements that overhead hurts performance. This overhead is what we’re trying to
address.
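For a sense of scale, here is roughly what that per-leaf work looks like in Modern syntax. This is
a hedged sketch: the leaf name, bounds, and owner below are illustrative, and gpbackup’s actual
output differs in its details.

-- Hypothetical per-leaf restore statements in Modern syntax. The root is created
-- once with PARTITION BY RANGE (O_ORDERDATE); then each of the ~164k leaves needs
-- its own CREATE TABLE, ALTER OWNER, and ATTACH PARTITION.
CREATE TABLE wide.orders_1_prt_2 (
    LIKE wide.orders INCLUDING DEFAULTS INCLUDING CONSTRAINTS
) DISTRIBUTED BY (O_ORDERKEY);
ALTER TABLE wide.orders_1_prt_2 OWNER TO gpadmin;  -- owner name is illustrative
ALTER TABLE wide.orders ATTACH PARTITION wide.orders_1_prt_2
    FOR VALUES FROM ('1982-01-01') TO ('1982-01-06');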
Before diving into performance work, we need data! We loaded the schema into a 3-segment GPDB7
cluster on a developer workstation, backed it up with gpbackup, and then restored it with
gprestore. Below are the time measurements for where we started out with Classic and Modern
syntax, and as you can see we’re looking at roughly a 7x slowdown in the simple implementation.
More importantly, this ratio is generally where any other application with a simple implementation
will land, as we were firing off individual statements over a persistent connection.
Classic Syntax – 14 minutes
time gpbackup --dbname=ajr --metadata-only --backup-dir perf_backups
real 0m22.737s
user 0m0.754s
sys 0m0.154s
time gprestore --create-db --redirect-db=ajr_rest --backup-dir perf_backups --timestamp=20230817153445
real 13m43.712s
user 0m0.672s
sys 0m0.388s
Modern Syntax – 96 minutes
time gpbackup --dbname=ajr --metadata-only --backup-dir perf_backups
real 0m48.048s
user 0m7.622s
sys 0m2.635s
time gprestore --create-db --redirect-db=ajr_rest --backup-dir perf_backups --timestamp=20230817112323
real 96m1.435s
user 0m13.270s
sys 0m8.043s
So what do we do about it?
In gprestore
, these tables fall into what we call pre-data metadata
(or just predata
) which
is all the metadata that we need to restore before we can pull data in. The way we handle predata
had some room for optimization to address these performance regressions.
Solution 1: Slap a transaction on it
Historically, we have not wrapped predata restore in transactions. However, each of these
ATTACH PARTITION statements has the overhead of a two-phase commit, and those can add up quite a
bit. Wrapping each batch of the DDL in explicit BEGIN and END statements brings that down to just
a handful of commits. This gets us a lot of the way back to a good place.
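In SQL terms, the batched predata now looks conceptually like this (a simplified sketch with
illustrative names, not gprestore’s literal output):

-- One explicit transaction around a whole batch of DDL, so the batch shares a
-- single two-phase commit instead of paying for one per statement.
BEGIN;
ALTER TABLE wide.orders ATTACH PARTITION wide.orders_1_prt_2
    FOR VALUES FROM ('1982-01-01') TO ('1982-01-06');
ALTER TABLE wide.orders ATTACH PARTITION wide.orders_1_prt_3
    FOR VALUES FROM ('1982-01-06') TO ('1982-01-11');
-- ...many more statements in the same batch...
END;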
time gprestore --create-db --redirect-db=ajr_rest --backup-dir perf_backups --timestamp=20230817112323
real 31m51.755s
user 0m8.116s
sys 0m8.322s
Solution 2: Parallelize all the things
gprestore already supports restoring data in parallel across a configurable number of connections
(the --jobs flag). The data in one table is generally independent from the data in others, and the
data load usually represents the bulk of the runtime for a restore, so there was lots of incentive
to cut it down. Thus far, though, gprestore has run all predata statements in a single, linear
fashion. There isn’t really a reason they couldn’t be run in parallel, as many DDL statements are
independent of each other.
To organize how metadata is restored, gpbackup already relies on pg_depend. This catalog table
tracks every object along with the objects it depends on. We use this information to ensure that
no object is restored before its dependencies are available. We accomplish this by using a
depth-first search (DFS) topological sort to walk through the dependencies and get them into a
workable order. During backup we record the intended order of execution for all predata
statements, and on restore these are simply executed in the noted order.
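If you are curious what gpbackup is working from, you can poke at pg_depend yourself. The query
below is purely illustrative; it is not the query gpbackup actually runs.

-- Illustrative only: show each object alongside an object it depends on.
SELECT pg_describe_object(d.classid, d.objid, d.objsubid)          AS object,
       pg_describe_object(d.refclassid, d.refobjid, d.refobjsubid) AS depends_on,
       d.deptype
FROM pg_depend d
WHERE d.deptype IN ('n', 'a')  -- normal and automatic dependencies
ORDER BY object;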
Enhancing this approach to support parallel restore requires some changes. During backup, in addition to noting the order of execution, we must also preassign the objects to groups in some way so that the parallel connections don’t contend with or deadlock each other. We do this in two parts.
First, we record the steps of the DFS, using the depth as a “tier” value. This means that objects
with no dependencies will go into the first tier, and each subsequent level of the tree will be in
a subsequent tier. Our metadata restore will then begin and commit a separate transaction for each
tier. This allows us to keep the transactions a bit shorter for better stability in case of errors,
and gives us some room to push things upstream or downstream if we find that the information from
pg_depend isn’t sufficient for keeping things in the correct order. However, this does not actually
facilitate parallelism by itself, as the tiers must still go in order. So we want to be able to
split up each tier to be run across multiple connections.
Within each tier, we also assign every object to a “cohort”. This is done by examining the dependencies of each object and assigning its whole tree to a specific cohort. Then, if any subsequent object already has part of its tree mapped, that object’s whole tree is thrown into the same cohort. Once this is done, we can safely say that within each tier, no object can depend on anything from any other cohort, and thus each cohort is safe to restore completely independently of the others.
Pictures being worth many words, Figure 1 is a diagram of how this might look for a simple schema.
Each rectangle formed by a tier+cohort is a single transaction. Each leaf table can be created
independently, but the ATTACH PARTITION statements in tier 2 take a SHARE UPDATE EXCLUSIVE lock on
the parents from tier 1 and the leaves from tier 2, so they have a shared dependency and must go in
the same cohort; otherwise, the transactions would deadlock each other trying to acquire locks
already held elsewhere. Obviously the relationships between objects in a SQL database system can
get incredibly complex, and we won’t go into the full details, but you can imagine how easy it
would be to wind up with a deadlock in a schema of production-grade complexity if we had a less
conservative approach.
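If you want to see the locks for yourself, you can run an ATTACH PARTITION inside an open
transaction and peek at pg_locks from a second session. This is just a quick illustration (reusing
the hypothetical leaf names from above), not part of the restore tooling.

-- Session 1: attach a leaf inside an open transaction, holding its locks.
BEGIN;
ALTER TABLE wide.orders ATTACH PARTITION wide.orders_1_prt_2
    FOR VALUES FROM ('1982-01-01') TO ('1982-01-06');

-- Session 2: inspect what that ATTACH is holding on the parent and the leaf.
SELECT n.nspname, c.relname, l.mode, l.granted
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'wide'
  AND c.relname IN ('orders', 'orders_1_prt_2');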
One of the particularly neat things about this is that it gives our Table of Contents (toc) files a convenient way of showing exactly how much parallelism is available for any given backup. The highest observed cohort value for any given tier is the theoretical maximum number of jobs that metadata restore could put to use. Here’s an example toc entry from our test backups, showing that we could, in theory, see performance improvements all the way up to 28 jobs. Don’t try that in prod, though…
- schema: wide
  name: supplier
  objecttype: TABLE
  referenceobject: ""
  startbyte: 13512
  endbyte: 13554
  tier:
  - 1
  - 28
On the restore side, then, all we have to do is hand out cohorts to each connection. We do this by iterating through each object’s restore statement and assigning it to whichever connection its cohort belongs to. Each time a new cohort is seen, it gets assigned to the connection that currently has the fewest statements. Once all assignments are made, each connection opens a transaction and simply executes its statements in order, with no further logic needed.
Testing with a fairly modest 3 jobs (too many connections will swamp a system during the data restore, so we need to be cautious), we see that we’ve gotten the run time almost back to where we were with Classic syntax, but now with the power of our new partitioning approach! Do note that this is in combination with the transaction improvements discussed above.
time gprestore --create-db --redirect-db=ajr_rest --backup-dir perf_backups --jobs=3 --timestamp=20230830163152
real 14m15.468s
user 0m7.006s
sys 0m5.994s
Now, you can’t get a new toy like this and not give it a proper test drive, right? So we did exactly what I said not to do above, and tried it out with 28 jobs.
time gprestore --create-db --redirect-db=ajr_rest --backup-dir perf_backups --jobs=28 --timestamp=20230830163152
real 6m44.285s
user 0m6.964s
sys 0m5.717s
From 96 minutes down to under 7. Not too bad!
Results
So there we have it. With a combination of two different approaches, we were able to restore 164,074 tables in just 32 seconds more than the original Classic-syntax restore took, while still supporting the flexibility and expressiveness of Modern syntax!
Future work
To be honest, there’s still some performance left on the table. It’s just a question of triaging the effort, as there is always a lot to do.
Modern syntax allows leaf tables to have different owners than their root tables, so we can’t just
blindly swap over to applying the ownership changes at the root level. However, we could, on the
backup side, check for ownership heterogeneity and conditionally print only the root table
ALTER OWNER statement if it’s safe. That’d be another 164k statements cut out of our test example.
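As a rough sketch of what that backup-side check could look like, the query below counts distinct
owners across one partition tree. It assumes the upstream pg_partition_tree() function (GPDB7 is
based on PostgreSQL 12) and is not actual gpbackup code.

-- Hypothetical heterogeneity check for the wide.orders tree: if every table in
-- the tree has the same owner, a single root-level ALTER OWNER could be safe.
SELECT count(DISTINCT c.relowner) AS distinct_owners
FROM pg_partition_tree('wide.orders'::regclass) AS pt
JOIN pg_class c ON c.oid = pt.relid;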
Modern syntax also supports a CREATE TABLE PARTITION OF style, which attaches the partition
automatically. This is able to express everything that the ATTACH PARTITION version can, but
requires some significant processing logic to get there. We could rework how gpbackup dumps
Modern syntax partitioned tables to use this approach and cut yet another 164k statements out of
our test example. We’re planning to experiment with that, so keep an eye out for a future post,
maybe!
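For reference, the PARTITION OF form collapses the per-leaf CREATE TABLE and ATTACH PARTITION into
a single statement, roughly like this (names and bounds are again illustrative):

-- One statement creates the leaf and attaches it to the parent in one go.
CREATE TABLE wide.orders_1_prt_2
    PARTITION OF wide.orders
    FOR VALUES FROM ('1982-01-01') TO ('1982-01-06');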