Heap and AO Table Formats
Apache Cloudberry supports two storage models: Heap storage and Appended-Optimized (AO) storage. The choice of storage model should be determined based on your data type and query type. This document introduces both storage models and provides recommendations for selecting the optimal one.
Heap storage
By default, Apache Cloudberry uses the same heap storage model as PostgreSQL. Heap table storage works best with OLTP-type workloads where the data is often modified after it is initially loaded. UPDATE
and DELETE
operations require storing row-level versioning information to ensure reliable database transaction processing. Heap tables are best suited for smaller tables, such as dimension tables, that are often updated after they are initially loaded.
Append-optimized storage
Append-optimized table storage works best with denormalized fact tables in a data warehouse environment. Denormalized fact tables are typically the largest tables in the system. Fact tables are usually loaded in batches and accessed by read-only queries. Moving large fact tables to an append-optimized storage model eliminates the storage overhead of the per-row update visibility information. This allows for a leaner and easier-to-optimize page structure. The storage model of append-optimized tables is optimized for bulk data loading. Single row INSERT
statements are not recommended.
To create a heap table
Row-oriented heap tables are the default storage type.
=> CREATE TABLE foo (a int, b text) DISTRIBUTED BY (a);
Use the WITH
clause of the CREATE TABLE
command to declare the table storage options. The default is to create the table as a regular row-oriented heap-storage table. For example, to create an append-optimized table with no compression:
=> CREATE TABLE bar (a int, b text)
WITH (appendoptimized=true)
DISTRIBUTED BY (a);
You use the appendoptimized=value
syntax to specify the append-optimized table storage type. appendoptimized
is a thin alias for the appendonly
legacy storage option. Apache Cloudberry stores appendonly
in the catalog, and displays the same when listing storage options for append-optimized tables.
UPDATE
and DELETE
are not allowed on append-optimized tables in a repeatable read or serizalizable transaction and will cause the transaction to end prematurely.
Choose row or column-oriented storage
Apache Cloudberry provides a choice of storage orientation models: row, column, or a combination of both. This section provides general guidelines for choosing the optimal storage orientation for a table. Evaluate performance using your own data and query workloads.
- Row-oriented storage: good for OLTP types of workloads with many iterative transactions and many columns of a single row needed all at once, so retrieving is efficient.
- Column-oriented storage: good for data warehouse workloads with aggregations of data computed over a small number of columns, or for single columns that require regular updates without modifying other column data.
For most general purpose or mixed workloads, row-oriented storage offers the best combination of flexibility and performance. However, there are use cases where a column-oriented storage model provides more efficient I/O and storage. Consider the following requirements when deciding on the storage orientation model for a table:
-
Updates of table data. If you load and update the table data frequently, choose a row-oriented heap table. Column-oriented table storage is only available on append-optimized tables.
See Heap storage for more information.
-
Frequent INSERTs. If rows are frequently inserted into the table, consider a row-oriented model. Column-oriented tables are not optimized for write operations, because column values for a row must be written to different places on disk.
-
Number of columns requested in queries. If you typically request all or the majority of columns in the
SELECT
list orWHERE
clause of your queries, consider a row-oriented model. Column-oriented tables are best suited to queries that aggregate many values of a single column where theWHERE
orHAVING
predicate is also on the aggregate column. For example:SELECT SUM(salary)...
SELECT AVG(salary)... WHERE salary > 10000
Or where the
WHERE
predicate is on a single column and returns a relatively small number of rows. For example:SELECT salary, dept ... WHERE state='CA'
-
Number of columns in the table. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small. Column-oriented tables can offer better query performance on tables with many columns where you access a small subset of columns in your queries.
-
Compression. Column data has the same data type, so storage size optimizations are available in column-oriented data that are not available in row-oriented data. For example, many compression schemes use the similarity of adjacent data to compress. However, the greater adjacent compression achieved, the more difficult random access can become, as data must be uncompressed to be read.
To create a column-oriented table
The WITH
clause of the CREATE TABLE
command specifies the table's storage options. The default is a row-oriented heap table. Tables that use column-oriented storage must be append-optimized tables. For example, to create a column-oriented table:
=> CREATE TABLE bar (a int, b text) WITH (appendoptimized=true, orientation=column) DISTRIBUTED BY (a);