After creating and preparing a database in Lesson 2: Create and Prepare a Database, you can start to create tables in the database.

To introduce Cloudberry Database, this tutorial uses a public data set, the Airline On-Time Statistics and Delay Causes data set, published by the United States Department of Transportation at http://www.transtats.bts.gov/. The On-Time Performance data set records flights by date, airline, originating airport, destination airport, and many other flight details. Data is available for flights since 1987. The exercises in this guide use data for about a million flights in 2009 and 2010. You are encouraged to review the SQL scripts in faa.tar.gz, located in the configs directory of the 000-cbdb-sandbox GitHub repository, as you work through this introduction. You can run most of the exercises by entering the commands yourself or by executing a script in the faa directory.
Create tables using a SQL file in psql
In Cloudberry Database, you can use the CREATE TABLE SQL statement to create a table. In the following steps, you will be guided to run a SQL file, create_dim_tables.sql, which contains the CREATE TABLE statements needed to create the dimension tables in the faa schema.
1. Log into Cloudberry Database in Docker as gpadmin. Then enter the faa directory, in which the SQL file create_dim_tables.sql is located.

[gpadmin@mdw tmp]$ cd /tmp
[gpadmin@mdw tmp]$ tar xzf faa.tar.gz
[gpadmin@mdw tmp]$ cd faa
2. Take a look at the create_dim_tables.sql file.

[gpadmin@mdw faa]$ more create_dim_tables.sql

The create_dim_tables.sql file contains the following CREATE TABLE statements:

DROP TABLE IF EXISTS faa.d_airports; -- Drops the table if it exists to avoid name conflict.
CREATE TABLE faa.d_airports ( -- Creates the table.
AirportID INTEGER,
Name TEXT,
City TEXT,
Country TEXT,
airport_code TEXT,
ICOA_code TEXT,
Latitude FLOAT8,
Longitude FLOAT8,
Altitude FLOAT8,
TimeZoneOffset FLOAT,
DST_Flag TEXT,
TZ TEXT
)
DISTRIBUTED BY (airport_code); -- Specifies the distribution column airport_code.
DROP TABLE IF EXISTS faa.d_wac;
CREATE TABLE faa.d_wac (wac SMALLINT, area_desc TEXT)
DISTRIBUTED BY (wac);
DROP TABLE IF EXISTS faa.d_airlines;
CREATE TABLE faa.d_airlines (airlineid INTEGER, airline_desc TEXT)
DISTRIBUTED BY (airlineid);
DROP TABLE IF EXISTS faa.d_cancellation_codes;
CREATE TABLE faa.d_cancellation_codes (cancel_code TEXT, cancel_desc TEXT)
DISTRIBUTED BY (cancel_code);
DROP TABLE IF EXISTS faa.d_delay_groups;
CREATE TABLE faa.d_delay_groups (delay_group_code TEXT, delay_group_desc TEXT)
DISTRIBUTED BY (delay_group_code);
DROP TABLE IF EXISTS faa.d_distance_groups;
CREATE TABLE faa.d_distance_groups (distance_group_code TEXT, distance_group_desc TEXT)
DISTRIBUTED BY (distance_group_code);
3. Connect to the tutorial database as lily using psql. You will run the SQL file as lily.

[gpadmin@mdw faa]$ psql -U lily tutorial
Password for user lily:  # changeme
psql (14.4, server 14.4)
Type "help" for help.

tutorial=>
4. Run the create_dim_tables.sql file using the psql command \i.

tutorial=> \i create_dim_tables.sql
The following messages are displayed:
psql:create_dim_tables.sql:1: NOTICE: table "d_airports" does not exist, skipping
DROP TABLE
CREATE TABLE
psql:create_dim_tables.sql:18: NOTICE: table "d_wac" does not exist, skipping
DROP TABLE
CREATE TABLE
psql:create_dim_tables.sql:22: NOTICE: table "d_airlines" does not exist, skipping
DROP TABLE
CREATE TABLE
psql:create_dim_tables.sql:26: NOTICE: table "d_cancellation_codes" does not exist, skipping
DROP TABLE
CREATE TABLE
psql:create_dim_tables.sql:30: NOTICE: table "d_delay_groups" does not exist, skipping
DROP TABLE
CREATE TABLE
psql:create_dim_tables.sql:34: NOTICE: table "d_distance_groups" does not exist, skipping
DROP TABLE
CREATE TABLE
5. Use the \dt command to display all the tables in the faa schema. You can see that six tables have been created.

tutorial=> \dt
List of relations
Schema | Name | Type | Owner | Storage
--------+----------------------+-------+-------+---------
faa | d_airlines | table | lily | heap
faa | d_airports | table | lily | heap
faa | d_cancellation_codes | table | lily | heap
faa | d_delay_groups | table | lily | heap
faa | d_distance_groups | table | lily | heap
faa | d_wac | table | lily | heap
(6 rows)
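You can also verify a table's distribution policy from psql. In Greenplum-derived databases such as Cloudberry, the \d+ meta-command typically ends the table description with a "Distributed by" line (the exact output format can vary by version):

tutorial=> \d+ faa.d_wac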
About the distribution policy
The definition of a table includes the distribution policy for the data, which is critical for query performance. The goals for the distribution policy are to:
- Distribute the volume of data and query execution work evenly among segments.
- Enable segments to accomplish complicated query processing steps locally.
The distribution policy determines how data is distributed among segments. Choosing an effective distribution policy requires an understanding of the data's characteristics, the kinds of queries that will run against the data, and which distribution strategy best exploits the parallel execution capacity of the segments.
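To see how evenly a table's rows are spread, you can count rows per segment. The following sketch assumes the gp_segment_id system column that Greenplum-derived databases such as Cloudberry attach to every row:

-- Count the rows stored on each segment. A roughly equal count per
-- segment indicates an even distribution; heavy skew suggests a poor
-- choice of distribution key. Assumes the gp_segment_id system column
-- available in Greenplum-derived databases.
SELECT gp_segment_id, count(*) AS row_count
FROM faa.d_airports
GROUP BY gp_segment_id
ORDER BY gp_segment_id;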
Use the DISTRIBUTED clause in the CREATE TABLE statement to define the distribution policy for a table. Ideally, each segment stores an equal volume of data and performs an equal share of the work when queries run. There are two kinds of distribution policy syntax in Cloudberry Database, illustrated in the sketch after this list:
- DISTRIBUTED BY (column, ...) defines a distribution key from one or more columns. A hash function applied to the distribution key determines which segment stores each row. Rows with the same distribution key are stored on the same segment. If the distribution key values are unique, the hash function distributes the data evenly. The default distribution policy is a hash on the primary key of the table, or on the first column of the table if no primary key is specified.
- DISTRIBUTED RANDOMLY distributes rows among the segments in round-robin fashion.
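As a quick illustration of the two syntaxes, the following sketch creates two throwaway tables (the faa.demo_hash and faa.demo_random names are made up for this example and are not part of the tutorial scripts):

-- Hash distribution: rows with the same airlineid value are stored
-- on the same segment.
CREATE TABLE faa.demo_hash (airlineid INTEGER, note TEXT)
DISTRIBUTED BY (airlineid);

-- Random distribution: rows are assigned to segments round-robin.
CREATE TABLE faa.demo_random (airlineid INTEGER, note TEXT)
DISTRIBUTED RANDOMLY;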
When tables that use the same (or compatible) columns as their distribution key are joined on those columns, the join can be performed locally on each segment, which is much faster than redistributing rows across segments and then joining. Randomly distributed tables cannot benefit from such local joins, so it is generally better to define a distribution key for a table.
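To make this concrete, suppose a fact table were also distributed on airlineid; a join with faa.d_airlines on that key could then run segment-locally, with no rows moving between segments. The faa.demo_flights table below is hypothetical and only for illustration:

-- Both tables are hash-distributed on airlineid, so matching rows are
-- co-located on the same segment and the join requires no data movement.
CREATE TABLE faa.demo_flights (flightid BIGINT, airlineid INTEGER)
DISTRIBUTED BY (airlineid);

SELECT a.airline_desc, count(*) AS flights
FROM faa.demo_flights f
JOIN faa.d_airlines a ON a.airlineid = f.airlineid
GROUP BY a.airline_desc;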
What's next
After creating tables in the database, you can continue by loading data into them. See Lesson 4: Data Loading.