Secure S3 paths when reading

Python answer: I use an intermediate processing layer that extracts the S3 requests, passes them to a Python module to do the download, and then rewrites the SQL statement.

Long post, but the point is: I needed my own authentication code, and I find it convenient to insert Python preprocessing steps by extracting and rewriting the SQL. It's an alternative to the HTTPFS extension that lets you insert your own business logic.
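Before the package details, the general idea is just: find the s3() references in the SQL, fetch the objects with boto3, and substitute local paths before the query runs. A minimal sketch of that idea (not the actual iql code; the helper name and regex below are made up):

```
import os
import re

import boto3

def rewrite_s3_calls(sql: str, dest_dir: str = "/tmp") -> str:
    # Find s3('s3://bucket/key') calls, download each object, and
    # substitute a quoted local path that DuckDB can read directly.
    pattern = re.compile(r"s3\('s3://([^/]+)/([^']+)'\)")

    def download_and_replace(match):
        bucket, key = match.group(1), match.group(2)
        local_path = os.path.join(dest_dir, os.path.basename(key))
        # This is where your own authentication/business logic would go.
        boto3.resource("s3").Bucket(bucket).download_file(key, local_path)
        return f"'{local_path}'"

    return pattern.sub(download_and_replace, sql)
```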

Why not a DuckDB extension? I'll probably switch the flow over to extensions eventually, but the intent is to be able to switch backends, and this method was convenient.

I just threw this up on PyPI (will publish the repo shortly):

```
%pip install iql
import iql
```

To configure it, do one of the following.

Create boto3 resources and register them for your S3 paths:

```
import iql
from iql.extensions import aws_s3_extension as as3e

boto3_resource_1 = ...
boto3_resource_2 = ...
boto3_default = ...

as3e.BOTO3_S3_RESOURCES["s3://bucket1"] = boto3_resource_1
as3e.BOTO3_S3_RESOURCES["s3://anotherbucket"] = boto3_resource_2
as3e.BOTO3_S3_RESOURCE_DEFAULT = boto3_default
```
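The resource objects above are placeholders; purely for illustration, they might be built from different credentials, e.g. separate named profiles (the profile names here are made up):

```
import boto3

boto3_resource_1 = boto3.Session(profile_name="team-a").resource("s3")
boto3_resource_2 = boto3.Session(profile_name="team-b").resource("s3")
boto3_default = boto3.resource("s3")  # default credential chain
```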

Or, monkey patch. Define your function:

```
def my_get_boto3_resource_for_request(url: str) -> object:
    # your implementation
    # return a boto3 resource
    ...
```
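As one hedged example of what that implementation could look like, here's a sketch that assumes per-bucket IAM roles and builds a resource from temporary credentials (the prefixes and role ARNs are hypothetical):

```
import boto3

ROLE_ARNS = {
    "s3://bucket1": "arn:aws:iam::111111111111:role/bucket1-reader",
    "s3://anotherbucket": "arn:aws:iam::222222222222:role/another-reader",
}

def my_get_boto3_resource_for_request(url: str) -> object:
    # Pick a role based on the URL prefix and build a resource from its
    # temporary credentials; otherwise fall back to the default chain.
    for prefix, role_arn in ROLE_ARNS.items():
        if url.startswith(prefix):
            creds = boto3.client("sts").assume_role(
                RoleArn=role_arn, RoleSessionName="iql-s3-read"
            )["Credentials"]
            return boto3.resource(
                "s3",
                aws_access_key_id=creds["AccessKeyId"],
                aws_secret_access_key=creds["SecretAccessKey"],
                aws_session_token=creds["SessionToken"],
            )
    return boto3.resource("s3")
```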

Then monkey patch it:

```
from iql.extensions import aws_s3_extension as as3e

as3e.get_boto3_resource_for_request = my_get_boto3_resource_for_request
```

This is currently implemented as an overlay, which processes the s3() function within the SQL statement and then passes the rewritten query to DuckDB:

```
import iql

iql.DEFAULT_EXT_DIRECTORY = '/tmp'
db_conn = ...  # your duckdb connection
iql.execute("select * from s3('s3://......./myfile.parquet')", db_conn)
```

This will download myfile.parquet to /tmp/myfile.parquet, then rewrite the SQL to:
"select * from '/tmp/myfile.parquet'"

The rewritten SQL is then passed to db_conn. More work is needed to support multiple Parquet files/Hive partitioning and some of the other features of HTTPFS.

Feedback encouraged.

/r/DuckDB Thread