Skip to main content
Version: Next

Load Data Securely Using gpfdists

The gpfdists protocol is a secure version of the gpfdist protocol that enables encrypted communication between Apache Cloudberry and the gpfdist file server. When you use gpfdists, all data transfer is encrypted using SSL, protecting against eavesdropping and man-in-the-middle attacks.

gpfdists provides the same high-performance parallel data loading capabilities as gpfdist, but with additional security features essential for production environments handling sensitive data.

Security features

  • All data transmitted between Apache Cloudberry segments and gpfdist servers is encrypted using SSL/TLS protocols, protecting against eavesdropping and data interception.
  • Mutual authentication is enforced through client certificates, ensuring that both Apache Cloudberry and gpfdist servers verify each other's identities before establishing connections.
  • The implementation uses TLSv1 protocol with AES_128_CBC_SHA encryption algorithm to provide strong cryptographic protection for data in transit.
  • Secure server identification mechanisms prevent unauthorized systems from masquerading as legitimate gpfdist servers, protecting against man-in-the-middle attacks.

Before you begin

To use gpfdists, make sure:

  • SSL certificates configured on all segment hosts.
  • gpfdist utility available on the file server host.
  • Network connectivity between segment hosts and the gpfdist server.
  • Appropriate SSL certificate files in the correct locations.

Step 1. Set up SSL certificates

Required certificate files

The following certificate files must be present in the $PGDATA/gpfdists directory on each Apache Cloudberry segment host:

  • client.key - Client private key file
  • client.crt - Client certificate file
  • root.crt - Trusted certificate authorities file

Certificate requirements by configuration:

verify_gpfdists_cert--ssl_verify_peerRequired Certificate Files
on (default)on (default)client.key, client.crt, root.crt
onoffroot.crt
offonclient.key, client.crt
offoffNone

Install certificates

  1. Create the gpfdists directory on each segment host:

    mkdir -p $PGDATA/gpfdists
  2. Copy the certificate files to each segment host:

    # Copy to all segment hosts
    scp client.key client.crt root.crt gpadmin@segment-host:$PGDATA/gpfdists/
  3. Set appropriate permissions:

    chmod 600 $PGDATA/gpfdists/client.key
    chmod 644 $PGDATA/gpfdists/client.crt
    chmod 644 $PGDATA/gpfdists/root.crt

Step 2. Start gpfdist with SSL

Start the gpfdist utility with the --ssl option to enable secure connections:

gpfdist -p 8081 -d /data/load_files --ssl /path/to/certificates &

SSL options for gpfdist

  • --ssl <certificates_path>: Enables SSL and specify certificate directory
  • --ssl_verify_peer on|off: Controls peer verification (default: on)

Example: Start multiple secure gpfdist instances

# Starts the first secure gpfdist instance.
gpfdist -d /var/load_files1 -p 8081 --ssl /home/gpadmin/certs \
--ssl_verify_peer on -l /home/gpadmin/log1 &

# Starts the second secure gpfdist instance.
gpfdist -d /var/load_files2 -p 8082 --ssl /home/gpadmin/certs \
--ssl_verify_peer on -l /home/gpadmin/log2 &

Step 3. Create external tables with gpfdists

Use the gpfdists:// protocol in the LOCATION clause to create secure external tables:

Readable external table

CREATE EXTERNAL TABLE secure_sales_data (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdists://etl-server1:8081/sales/*.txt',
'gpfdists://etl-server2:8082/sales/*.txt')
FORMAT 'TEXT' (DELIMITER '|' NULL ' ');

Writable external table

CREATE WRITABLE EXTERNAL TABLE secure_export (
transaction_id int,
product_name text,
sale_date date,
amount decimal(10,2)
)
LOCATION ('gpfdists://etl-server1:8081/exports/sales_data.txt')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (transaction_id);

With error handling

CREATE EXTERNAL TABLE secure_data_with_errors (
id int,
name text,
value decimal(10,2)
)
LOCATION ('gpfdists://etl-server:8081/data/*.csv')
FORMAT 'CSV' (HEADER)
LOG ERRORS SEGMENT REJECT LIMIT 100;

Configuration parameters

Apache Cloudberry parameters

Configure these parameters in postgresql.conf:

# Enables/disables SSL certificate verification (default: on)
verify_gpfdists_cert = on

# Control segment parallelism
gp_external_max_segs = 64

gpfdist SSL parameters

The gpfdist utility supports these SSL-related options:

ParameterDescriptionDefault
--sslEnable SSL and specify certificate pathDisabled
--ssl_verify_peerVerify client certificateson

Security best practices

Certificate management

  • Generate all certificates from a trusted certificate authority to ensure proper validation and trust chain establishment.
  • Implement a regular certificate rotation schedule to enhance security and prevent issues from certificate expiration.
  • Store private keys in secure locations with restricted access permissions, ensuring only authorized personnel can access them.
  • Maintain secure backup copies of all certificate files to enable quick recovery in case of system failures or corruption.

Network security

  • Configure firewall rules to restrict access to gpfdists ports, allowing only authorized Apache Cloudberry segment hosts to connect.
  • Use secure network connections such as VPNs or private networks to prevent unauthorized access to data transmission channels.
  • Implement continuous monitoring of SSL connections and certificate expiration dates to proactively address security issues.

Access control

  • Apply the principle of least privilege by granting only the minimum permissions necessary for users and applications to perform their required functions.
  • Implement robust authentication mechanisms including multi-factor authentication where appropriate to verify user identities.
  • Enable comprehensive audit logging to track all access attempts, successful connections, and security-related events for compliance and security monitoring.

Troubleshooting

SSL connection errors

Check certificate configuration:

# Verifies certificate files exist.
ls -la $PGDATA/gpfdists/

# Checks certificate validity.
openssl x509 -in $PGDATA/gpfdists/client.crt -text -noout

Troubleshoot certificate verification issues

  1. Verify that the root.crt file contains the correct certificate authority chain and that all intermediate certificates are properly included for validation.
  2. Check certificate expiration dates using tools like openssl x509 -dates to ensure that certificates have not expired and plan for renewal well in advance.
  3. Validate that the private key file client.key corresponds exactly to the public certificate in client.crt using certificate validation tools.

Common error messages

ErrorCauseSolution
"SSL certificate verify failed"Invalid or expired certificateCheck certificate validity and CA
"SSL handshake failed"SSL configuration mismatchVerify SSL settings on both sides
"Permission denied"Incorrect file permissionsSet proper permissions on certificate files

Debug SSL connections

Enable verbose logging:

gpfdist -d /data -p 8081 --ssl /certs -V --ssl_verify_peer on

Performance considerations

  • SSL encryption introduces computational overhead during data transmission, which may reduce overall throughput compared to unencrypted connections, especially for large data transfers.
  • Certificate caching mechanisms help reduce the performance impact of SSL handshakes by reusing established secure connections across multiple data transfer operations.
  • Deploy multiple gpfdists instances across different hosts or network interfaces to distribute the load and achieve better aggregate throughput for concurrent data operations.
  • Ensure that your network infrastructure provides adequate bandwidth and low latency between Apache Cloudberry segments and gpfdist servers to minimize performance bottlenecks.

Migration from gpfdist to gpfdists

To migrate existing gpfdist usage to gpfdists:

  • Install and configure the required SSL certificates on all Apache Cloudberry segment hosts, ensuring proper permissions and certificate chain validation.
  • Update all external table definitions to change protocol specifications from gpfdist:// to gpfdists:// in their LOCATION clauses.
  • Restart your gpfdist file servers with the --ssl option enabled, specifying the appropriate certificate directories and SSL verification settings.
  • Thoroughly test secure connections from all Apache Cloudberry segments to verify that data loading operations work correctly with the new encrypted protocol.
  • Update any automation scripts, monitoring tools, and operational procedures to account for the new secure protocol requirements and certificate management tasks.

Example migration

Before (insecure):

LOCATION ('gpfdist://etl-server:8081/data.txt')

After (secure):

LOCATION ('gpfdists://etl-server:8081/data.txt')

Limitations

  • Cannot mix gpfdist:// and gpfdists:// protocols in the same external table definition.
  • All Apache Cloudberry segment hosts must have properly configured SSL certificates and trust relationships.
  • SSL encryption introduces computational overhead that may reduce data transfer throughput compared to unencrypted gpfdist.
  • Ongoing certificate lifecycle management is required, including renewal, rotation, and revocation processes.

Learn more