Daniel Westermann

Loading data from S3 to AWS RDS for PostgreSQL

AWS RDS for PostgreSQL comes with an extension that allows you to fetch data from AWS S3 and to write data back to AWS S3. The use case for this is obvious: either you use other AWS services that write data to S3 and you want to further process that data in PostgreSQL, or you want other AWS services to consume data from PostgreSQL by providing that data in S3. Let’s have a look at how that works.

The extension AWS provides for working with S3 from inside PostgreSQL is called “aws_s3”:

postgres=> select * from pg_available_extensions where name like '%aws%';
    name     | default_version | installed_version |                   comment                   
-------------+-----------------+-------------------+---------------------------------------------
 aws_commons | 1.0             |                   | Common data types across AWS services
 aws_s3      | 1.0             |                   | AWS S3 extension for importing data from S3
(2 rows)

If you try to install the extension you’ll notice that there is a dependency on the “aws_commons” extension:

postgres=> create extension aws_s3;
ERROR:  required extension "aws_commons" is not installed
HINT:  Use CREATE EXTENSION ... CASCADE to install required extensions too.

You can install both extensions in one step using the “CASCADE” option:

postgres=> create extension aws_s3 cascade;
NOTICE:  installing required extension "aws_commons"
CREATE EXTENSION

These extensions provide a couple of helper functions (aws_commons) and the functions to import a file from S3 (aws_s3):

postgres=> \dx+ aws_commons
             Objects in extension "aws_commons"
                     Object description                      
-------------------------------------------------------------
 function aws_commons.create_aws_credentials(text,text,text)
 function aws_commons.create_s3_uri(text,text,text)
 schema aws_commons
 type aws_commons._aws_credentials_1
 type aws_commons._s3_uri_1
(5 rows)

postgres=> \dx+ aws_s3
                                       Objects in extension "aws_s3"
                                            Object description                                             
-----------------------------------------------------------------------------------------------------------
 function aws_s3.table_import_from_s3(text,text,text,aws_commons._s3_uri_1,aws_commons._aws_credentials_1)
 function aws_s3.table_import_from_s3(text,text,text,text,text,text,text,text,text)
 schema aws_s3
(3 rows)
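As the signatures above show, the import function comes in two variants: one that takes plain text parameters (including access key and secret) and one that takes the structured types built by the aws_commons helper functions. Just to illustrate how the helpers fit together when you want to work with access keys instead of an IAM role, here is a minimal sketch (the table, bucket and key values are placeholders, and this variant is not used in the rest of this post):

postgres=> SELECT aws_s3.table_import_from_s3 ( 'my_table'
postgres(>                                     , ''
postgres(>                                     , '(format csv)'
postgres(>                                     , aws_commons.create_s3_uri('my-bucket','my-file.csv','eu-central-1')
postgres(>                                     , aws_commons.create_aws_credentials('MY_ACCESS_KEY_ID','MY_SECRET_ACCESS_KEY','')
postgres(>                                     );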

Having the extension ready, we need a file we can import, so let’s create one (exactly the same file as in the previous post, but with fewer rows):

dwe@dwe:~/Downloads$ cat gen_data.sh 
#!/bin/bash
 
FILE="/home/dwe/Downloads/sample.csv"
rm -rf ${FILE}
 
for i in {1..1000000}; do
    echo "${i},firstname${i},lastname${i},xxx${i}@xxx.com,street${i},country${i},description${i}" >> ${FILE}
done

dwe@dwe:~/Downloads$ chmod +x gen_data.sh 
dwe@dwe:~/Downloads$ ./gen_data.sh 
dwe@dwe:~/Downloads$ head -5 sample.csv 
1,firstname1,lastname1,xxx1@xxx.com,street1,country1,description1
2,firstname2,lastname2,xxx2@xxx.com,street2,country2,description2
3,firstname3,lastname3,xxx3@xxx.com,street3,country3,description3
4,firstname4,lastname4,xxx4@xxx.com,street4,country4,description4
5,firstname5,lastname5,xxx5@xxx.com,street5,country5,description5
dwe@dwe:~/Downloads$ ls -lha sample.csv 
-rw-rw-r-- 1 dwe dwe 96M Nov 10 11:11 sample.csv

We’ll be using a new bucket for this demo, so let’s create one and then upload the file we just generated:

dwe@dwe:~/Downloads$ aws s3 mb s3://s3-rds-demo --region eu-central-1
make_bucket: s3-rds-demo
dwe@dwe:~/Downloads$ aws s3 cp sample.csv s3://s3-rds-demo/
upload: ./sample.csv to s3://s3-rds-demo/sample.csv         
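
If you want to double check that the file really made it into the bucket, a simple listing will do (output omitted here):

dwe@dwe:~/Downloads$ aws s3 ls s3://s3-rds-demo/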

Before we can do anything against S3 from RDS for PostgreSQL, we need to set up the required permissions. You could use security credentials for this, but it is recommended to use IAM roles and policies. The first step is to create a policy that allows listing the bucket as well as reading and writing objects (write access is required for writing data back to S3 later on):

dwe@dwe:~$ aws iam create-policy \
>    --policy-name rds-s3-policy \
>    --policy-document '{
>      "Version": "2012-10-17",
>      "Statement": [
>        {
>          "Sid": "s3import",
>          "Action": [
>            "s3:GetObject",
>            "s3:ListBucket",
>            "s3:PutObject"
>          ],
>          "Effect": "Allow",
>          "Resource": [
>            "arn:aws:s3:::s3-rds-demo", 
>            "arn:aws:s3:::s3-rds-demo/*"
>          ] 
>        }
>      ] 
>    }' 
{
    "Policy": {
        "PolicyName": "rds-s3-policy",
        "PolicyId": "ANPA2U57KX3NFH4HU4COG",
        "Arn": "arn:aws:iam::xxxxxxxx:policy/rds-s3-policy",
        "Path": "/",
        "DefaultVersionId": "v1",
        "AttachmentCount": 0,
        "PermissionsBoundaryUsageCount": 0,
        "IsAttachable": true,
        "CreateDate": "2020-11-10T12:04:34+00:00",
        "UpdateDate": "2020-11-10T12:04:34+00:00"
    }
}

Once the policy is in place, we create an IAM role to which the policy will be attached afterwards:

dwe@dwe:~$ aws iam create-role \
>    --role-name rds-s3-role \
>    --assume-role-policy-document '{
>      "Version": "2012-10-17",
>      "Statement": [
>        {
>          "Effect": "Allow",
>          "Principal": {
>             "Service": "rds.amazonaws.com"
>           },
>          "Action": "sts:AssumeRole"
>        }
>      ] 
>    }'
{
    "Role": {
        "Path": "/",
        "RoleName": "rds-s3-role",
        "RoleId": "AROA2U57KX3NP2XWVCELI",
        "Arn": "arn:aws:iam::xxxxxxxxxx:role/rds-s3-role",
        "CreateDate": "2020-11-10T12:07:20+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "rds.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }
    }
}

Attaching the policy to the role (you will need the ARN of the policy from above):

dwe@dwe:~$ aws iam attach-role-policy \
>    --policy-arn arn:aws:iam::xxxxxxxxxx:policy/rds-s3-policy \
>    --role-name rds-s3-role
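
To confirm that the policy really is attached to the role, you can list the attached policies (output omitted here):

dwe@dwe:~$ aws iam list-attached-role-policies --role-name rds-s3-role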

Finally, you need to attach the IAM role to the RDS instance by providing the ARN of the role and the identifier of your RDS instance:

aws rds add-role-to-db-instance \
   --db-instance-identifier dwe-postgres-helvetia \
   --feature-name s3Import \
   --role-arn arn:aws:iam::xxxxxxxx:role/rds-s3-role   \
   --region eu-central-1
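
You can verify the association by asking for the roles attached to the instance; the role ARN should show up together with the feature name “s3Import” (just a quick check, output omitted here):

aws rds describe-db-instances \
   --db-instance-identifier dwe-postgres-helvetia \
   --query 'DBInstances[0].AssociatedRoles' \
   --region eu-central-1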

Note that your RDS instance needs to be running when you add the role, otherwise you’ll get something like this:

An error occurred (InvalidDBInstanceState) when calling the AddRoleToDBInstance operation: The status for the dwe-postgres DB instance is stopped. The DB instance is not available for s3Import feature.
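
In that case simply start the instance and run the command again once the instance is available, e.g.:

aws rds start-db-instance \
   --db-instance-identifier dwe-postgres-helvetia \
   --region eu-central-1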

Having the IAM role attached to the RDS instance, we can load the CSV file, but first the S3 URI needs to be defined (we do not want to use access keys and credentials):

postgres=> SELECT aws_commons.create_s3_uri('s3-rds-demo'
postgres(>                                 ,'sample.csv'
postgres(>                                 ,'eu-central-1'
postgres(>                                 ) AS s3_uri \gset
postgres=> select :'s3_uri';
               ?column?                
---------------------------------------
 (s3-rds-demo,sample.csv,eu-central-1)
(1 row)

Now we are ready to load the file:

postgres=> create table sample ( id int primary key
postgres(>                              , firstname varchar(20)
postgres(>                              , lastname varchar(20)
postgres(>                              , email varchar(20)
postgres(>                              , street varchar(20)
postgres(>                              , country varchar(20)
postgres(>                              , description varchar(20)
postgres(>                              );
CREATE TABLE
postgres=> SELECT aws_s3.table_import_from_s3 ( 'sample'
postgres(>                                    , ''
postgres(>                                    , '(format csv)'
postgres(>                                    , :'s3_uri'
postgres(>                                    );
                                 table_import_from_s3                                 
--------------------------------------------------------------------------------------
 1000000 rows imported into relation "sample" from file sample.csv of 100222272 bytes
(1 row)
postgres=> select * from sample limit 5;
 id |  firstname  |  lastname  |     email     |  street  |  country  |  description  
----+-------------+------------+---------------+----------+-----------+---------------
 77 | firstname77 | lastname77 | xxx77@xxx.com | street77 | country77 | description77
 78 | firstname78 | lastname78 | xxx78@xxx.com | street78 | country78 | description78
 79 | firstname79 | lastname79 | xxx79@xxx.com | street79 | country79 | description79
  1 | firstname1  | lastname1  | xxx1@xxx.com  | street1  | country1  | description1
  2 | firstname2  | lastname2  | xxx2@xxx.com  | street2  | country2  | description2
(5 rows)

And we’re done. The follow-up post will show the opposite: writing back to S3 from RDS for PostgreSQL.
