Load Data from Amazon S3 Using the s3
Protocol
The s3
protocol is used in a URL that specifies the location of an Amazon S3 bucket and a prefix to use for reading or writing files in the bucket.
Amazon Simple Storage Service (Amazon S3) provides secure, durable, highly-scalable object storage. For information about Amazon S3, see Amazon S3.
You can define read-only external tables that use existing data files in the S3 bucket for table data, or writable external tables that store the data from INSERT
operations to files in the S3 bucket. Apache Cloudberry uses the S3 URL and prefix specified in the protocol URL either to select one or more files for a read-only table, or to define the location and filename format to use when uploading S3 files for INSERT
operations to writable tables.
The s3
protocol also supports Dell Elastic Cloud Storage (ECS), an Amazon S3 compatible service.
The pxf
protocol can access data in S3 and other object store systems such as Azure, Google Cloud Storage, and Minio. The pxf
protocol can also access data in external Hadoop systems (HDFS, Hive, HBase), and SQL databases. See pxf://
protocol.
Configure the s3 protocol
You must configure the s3
protocol before you can use it. Perform these steps in each database in which you want to use the protocol:
-
Create the read and write functions for the
s3
protocol library:CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer AS
'$libdir/gps3ext.so', 's3_export' LANGUAGE C STABLE;CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
'$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE; -
Declare the
s3
protocol and specify the read and write functions you created in the previous step:To allow only Apache Cloudberry superusers to use the protocol, create it as follows:
CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
If you want to permit non-superusers to use the
s3
protocol, create it as aTRUSTED
protocol andGRANT
access to those users. For example:CREATE TRUSTED PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
GRANT ALL ON PROTOCOL s3 TO user1, user2;noteThe protocol name
s3
must be the same as the protocol of the URL specified for the external table that you create to access an S3 resource.The corresponding function is called by every Apache Cloudberry segment instance.
Use s3 external tables
Follow these basic steps to use the s3
protocol with Apache Cloudberry external tables. Each step includes links to relevant topics from which you can obtain more information.
-
Create the
s3
protocol configuration file:-
Create a template
s3
protocol configuration file using thegpcheckcloud
utility:gpcheckcloud -t > ./mytest_s3.config
-
(Optional) Edit the template file to specify the
accessid
andsecret
authentication credentials required to connect to the S3 location.
-
-
Apache Cloudberry can access an
s3
protocol configuration file when the file is located on each segment host or when the file is served up by anhttp/https
server. Identify where you plan to locate the configuration file, and note the location and configuration option (if applicable).If you are relying on the AWS credential file to authenticate, this file must reside at
~/.aws/credentials
on each Apache Cloudberry segment host. -
Use the
gpcheckcloud
utility to validate connectivity to the S3 bucket. You must specify the S3 endpoint name and bucket that you want to check.For example, if the
s3
protocol configuration file resides in the default location, you would run the following command:gpcheckcloud -c "s3://<s3-endpoint>/<s3-bucket>"
gpcheckcloud
attempts to connect to the S3 endpoint and lists any files in the S3 bucket, if available. A successful connection ends with the message:Your configuration works well.
You can optionally use
gpcheckcloud
to validate uploading to and downloading from the S3 bucket. -
Create an s3 external table by specifying an
s3
protocol URL in theCREATE EXTERNAL TABLE
command,LOCATION
clause.
Create a readable S3 external table
For reading data from S3, specify the S3 location and file pattern:
CREATE EXTERNAL TABLE sales_data (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/sales-data/ config=/path/to/s3.config')
FORMAT 'CSV' (HEADER);
Create a writable S3 external table
For writing data to S3:
CREATE WRITABLE EXTERNAL TABLE sales_export (LIKE sales_data)
LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/exports/ config=/path/to/s3.config')
FORMAT 'CSV'
DISTRIBUTED BY (transaction_id);
For read-only s3 tables, the URL defines the location and prefix used to select existing data files that comprise the s3 table. For example:
CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ config=/home/gpadmin/aws_s3/s3.conf')
FORMAT 'csv';
For writable s3 tables, the protocol URL defines the S3 location in which Apache Cloudberry writes the data files that back the table for INSERT
operations. You can also specify a prefix that Apache Cloudberry will add to the files that it creates. For example:
CREATE WRITABLE EXTERNAL TABLE S3WRIT (LIKE S3TBL)
LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ config=/home/gpadmin/aws_s3/s3.conf')
FORMAT 'csv';
About the s3 Protocol LOCATION URL
When you use the s3
protocol, you specify an S3 file location and optional configuration file location and region parameters in the LOCATION
clause of the CREATE EXTERNAL TABLE
command. The syntax follows:
's3://<S3_endpoint>[:<port>]/<bucket_name>/[<S3_prefix>] [region=<S3_region>] [config=<config_file_location> | config_server=<url>] [section=<section_name>]'
The s3
protocol requires that you specify the S3 endpoint and S3 bucket name. Each Apache Cloudberry segment host must have access to the S3 location. The optional S3_prefix value is used to select files for read-only S3 tables, or as a filename prefix to use when uploading files for s3 writable tables.
The Apache Cloudberry s3
protocol URL must include the S3 endpoint hostname.
To specify an ECS endpoint (an Amazon S3 compatible service) in the LOCATION
clause, you must set the s3
protocol configuration file parameter version
to 2
. The version
parameter controls whether the region
parameter is used in the LOCATION
clause. You can also specify an Amazon S3 location when the version
parameter is 2.
Although the S3_prefix is an optional part of the syntax, you should always include an S3 prefix for both writable and read-only s3 tables to separate datasets as part of the CREATE EXTERNAL TABLE
syntax.
For writable s3 tables, the s3
protocol URL specifies the endpoint and bucket name where Apache Cloudberry uploads data files for the table. The S3 file prefix is used for each new file uploaded to the S3 location as a result of inserting data to the table.
For read-only s3 tables, the S3 file prefix is optional. If you specify an S3_prefix, then the s3
protocol selects all files that start with the specified prefix as data files for the external table. The s3
protocol does not use the slash character (/
) as a delimiter, so a slash character following a prefix is treated as part of the prefix itself.
For example, consider the following 5 files that each have the S3_endpoint named s3-us-west-2.amazonaws.com
and the bucket_name test1
:
s3://s3-us-west-2.amazonaws.com/test1/abc
s3://s3-us-west-2.amazonaws.com/test1/abc/
s3://s3-us-west-2.amazonaws.com/test1/abc/xx
s3://s3-us-west-2.amazonaws.com/test1/abcdef
s3://s3-us-west-2.amazonaws.com/test1/abcdefff
- If the S3 URL is provided as
s3://s3-us-west-2.amazonaws.com/test1/abc
, then theabc
prefix selects all 5 files. - If the S3 URL is provided as
s3://s3-us-west-2.amazonaws.com/test1/abc/
, then theabc/
prefix selects the filess3://s3-us-west-2.amazonaws.com/test1/abc/
ands3://s3-us-west-2.amazonaws.com/test1/abc/xx
. - If the S3 URL is provided as
s3://s3-us-west-2.amazonaws.com/test1/abcd
, then theabcd
prefix selects the filess3://s3-us-west-2.amazonaws.com/test1/abcdef
ands3://s3-us-west-2.amazonaws.com/test1/abcdefff
Wildcard characters are not supported in an S3_prefix; however, the S3 prefix functions as if a wildcard character immediately followed the prefix itself.
All of the files selected by the S3 URL (S3_endpoint/bucket_name/S3_prefix) are used as the source for the external table, so they must have the same format. Each file must also contain complete data rows. A data row cannot be split between files.
You use the config
or config_server
parameter to specify the location of the required s3
protocol configuration file that contains AWS connection credentials and communication parameters.
Use the section
parameter to specify the name of the configuration file section from which the s3
protocol reads configuration parameters. The default section
is named default
. When you specify the section name in the configuration file, enclose it in brackets (for example, [default]
).