feat: support S3 Table Buckets with S3TablesCatalog #1429

Open
wants to merge 55 commits into base: main
Conversation

@felixscherz (Contributor) commented Dec 14, 2024

Hi, this is in regards to #1404.

I created a first draft of an S3TablesCatalog that uses the S3 Table Bucket API for catalog operations.

How to run tests

Since moto does not support mocking the S3 Tables API yet (WIP: getmoto/moto#8470), we have to run tests against a live AWS account. To do that, create an S3 Tables Bucket in one of the supported regions and then set the table bucket ARN and AWS region as environment variables:

AWS_REGION=us-east-2 AWS_TEST_S3_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py

@felixscherz (Contributor, Author)

I was able to work around the issue above by using FsspecFileIO instead of the default PyarrowFileIO. Using FsspecFileIO the catalog is now able to create new tables.
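For context, the workaround boils down to a catalog property override. A minimal sketch of such a configuration (the ARN is a placeholder; the property names match the test setup shown later in this PR):

```python
# Hedged sketch: catalog properties selecting FsspecFileIO instead of the
# default PyArrowFileIO via the py-io-impl property. The ARN below is a
# placeholder, not a real bucket.
properties = {
    "s3tables.table-bucket-arn": "arn:aws:s3tables:us-east-2:123456789012:bucket/example",
    "s3tables.region": "us-east-2",
    # work around the PyArrowFileIO issue by forcing fsspec-based IO
    "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
}
```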

@kevinjqliu (Contributor)

Thanks for working on this @felixscherz. Feel free to tag me when it's ready for review :)

@felixscherz (Contributor, Author)

I think you can now review this PR if you have time @kevinjqliu :)
The biggest issue for now is that testing is only possible against AWS itself, since moto does not support the s3tables API yet. I created an issue on the moto side but have not had the time to implement it myself: getmoto/moto#8422.

I currently run tests by setting the ARN env variable to that of an s3 table bucket I created within my personal AWS account:

https://github.com/felixscherz/iceberg-python/blob/feat/s3tables-catalog/tests/catalog/test_s3tables.py#L24-L31

@felixscherz felixscherz marked this pull request as ready for review December 29, 2024 15:51
@felixscherz felixscherz changed the title WIP: feat: support S3 Table Buckets with S3TablesCatalog feat: support S3 Table Buckets with S3TablesCatalog Dec 29, 2024
@kevinjqliu (Contributor) left a comment

Thanks for the PR, I added a few comments to clarify the catalog behaviors.

I'm a little hesitant to merge this in given that we have to run tests against a production S3 endpoint. Maybe we can mock the endpoint?

Contributor

we'd want to add this to the docs https://py.iceberg.apache.org/configuration/#catalogs

@kevinjqliu (Contributor)

I ran the tests locally ARN=arn:aws:s3tables:us-east-2:... poetry run pytest tests/catalog/test_s3tables.py
had to manually add s3tables.region to the catalog config

    properties = {
        "s3tables.table-bucket-arn": table_bucket_arn,
        "s3tables.region": "us-east-2",
        "py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO",
    }

And these 3 tests failed; everything else is ✅

FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_conflicting_version_tokens - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_s3tables_api_raises_on_preexisting_table - botocore.exceptions.NoRegionError: You must specify a region.
FAILED tests/catalog/test_s3tables.py::test_creating_catalog_validates_s3_table_bucket_exists - botocore.exceptions.NoRegionError: You must specify a region.

@felixscherz (Contributor, Author)

Thank you for the review!

I removed tests related to boto3 and set the AWS region explicitly for the test run.
I agree with you that we should not merge this as long as the only option for running tests is to run them against a live AWS account. I'm currently working on supporting s3tables with moto: getmoto/moto#8470.
Should we hold off on this PR until moto supports s3tables, so that we can run tests against a mock endpoint?

@kevinjqliu (Contributor) left a comment

Added a few more comments.

I was able to run the test locally

AWS_REGION=us-east-2 ARN=... poetry run pytest tests/catalog/test_s3tables.py

after making a few local changes

  • poetry update boto3
  • add aws_region fixture
  • pass aws_region to catalog

Could you update the PR description so others can test this PR out?

try:
    self.s3tables = session.client("s3tables", endpoint_url=properties.get(S3TABLES_ENDPOINT))
except boto3.session.UnknownServiceError as e:
    raise S3TablesError(f"'s3tables' requires boto3>=1.35.74. Current version: {boto3.__version__}.") from e
Contributor

running poetry update boto3 will bump boto3 version to 1.35.88

Contributor

actually this is already merged, can you rebase?
#1476

Contributor Author

I rebased my branch. I think we need to keep this check however since as per pyproject.toml the requirement for boto3 is >=1.24.59. We could bump that to >=1.35.74 but since that is only required for S3 Tables API I'm not sure about forcing everyone to upgrade their boto3 version just to support S3 Tables. What do you think?

Contributor

My two cents: I like this check. I do not think we should enforce dependency version upgrade for adding "optional" new service like s3tables, as it may reduce compatibility and cause additional version conflict.

@felixscherz felixscherz force-pushed the feat/s3tables-catalog branch from 398e2d7 to 05e4dfd Compare January 6, 2025 16:27
@HonahX (Contributor) left a comment

@felixscherz Thanks for the great contribution! Looking forward to adding this to PyIceberg! I left some comments. Please let me know what you think.

@@ -0,0 +1,227 @@
import pytest
Contributor

Shall we rename this to integration_test_s3tables.py? We use this naming convention for tests involving real endpoints, like integration_test_glue.py. I think it will be great to keep a version that tests against real endpoints even after we have moto s3tables available.

Contributor

That's a good point, we have integration tests marked for gcs

@pytest.mark.gcs

Contributor Author

I renamed it to integration_test_s3tables.py and also changed the S3 table bucket arn environment variable to AWS_TEST_S3_TABLE_BUCKET_ARN to be more explicit, similar to glue: https://github.com/apache/iceberg-python/blob/main/tests/conftest.py#L2090-L2092

Tests are now run with

AWS_REGION=us-east-2 AWS_TEST_S3_TABLE_BUCKET_ARN=... poetry run pytest tests/catalog/integration_test_s3tables.py

I'll adjust the PR description. Let me know what you think:)

Contributor

Cool! I'd also add a @pytest.mark for now since we don't want this test to run with the other pytests. For example, the current tests will run with make test.
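A minimal sketch of what that marker setup could look like, following the gcs example above; the marker name, its registration, and the test body are assumptions, not part of this PR:

```python
# Sketch: register a custom "s3tables" pytest marker (conftest.py) so the
# integration tests can be deselected by default, e.g. `pytest -m "not s3tables"`.
import pytest


def pytest_configure(config):
    # marker name "s3tables" is an assumption for illustration
    config.addinivalue_line("markers", "s3tables: tests that hit the live S3 Tables API")


# In a test module, the marker is then applied like the gcs marker above:
@pytest.mark.s3tables
def test_roundtrip_placeholder():
    # placeholder body; a real test would exercise the live catalog
    assert True
```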

Contributor

We can merge the code with just the integration tests marked, but I'm on the fence for this one and would like to hear what others think.

@HonahX (Contributor) commented Jan 8, 2025

Since the file's name starts with integration_* instead of test_*, it won't be collected by make test, so I think we're good for now even without any marks : ).

I am in favor of having both unit tests and integration tests for service integration like s3tables and glue. Although we would need to manually trigger these integration tests every time, they are still very helpful in case when bugs are not caught by mocked tests and when verifying the release.

Comment on lines +71 to +75
def commit_table(
    self, table: Table, requirements: Tuple[TableRequirement, ...], updates: Tuple[TableUpdate, ...]
) -> CommitTableResponse:
Contributor

I did not find the logic for the case when the table does not exist, which means create_table_transaction will not be supported in the current version.

def create_table_transaction(
    self,
    identifier: Union[str, Identifier],
    schema: Union[Schema, "pa.Schema"],
    location: Optional[str] = None,
    partition_spec: PartitionSpec = UNPARTITIONED_PARTITION_SPEC,
    sort_order: SortOrder = UNSORTED_SORT_ORDER,
    properties: Properties = EMPTY_DICT,
) -> CreateTableTransaction:
    return CreateTableTransaction(
        self._create_staged_table(identifier, schema, location, partition_spec, sort_order, properties)
    )

We do not have to support everything in the initial PR. But it will be good to override create_table_transaction as "Not Implemented" for the s3tables
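A minimal sketch of the suggested "Not Implemented" override, using a stand-in class rather than the real S3TablesCatalog (signature simplified for illustration):

```python
# Illustrative stand-in, not the real catalog class: fail fast on
# create_table_transaction until staged-table commits are supported.
class S3TablesCatalogSketch:
    def create_table_transaction(self, identifier, schema, location=None,
                                 partition_spec=None, sort_order=None, properties=None):
        raise NotImplementedError(
            "create_table_transaction is not yet supported for S3 Tables"
        )
```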

try:
    self.s3tables.create_table(
        tableBucketARN=self.table_bucket_arn, namespace=namespace, name=table_name, format="ICEBERG"
    )
Contributor

If anything goes wrong after this point, I think we should clean up the created s3 table by s3tables' delete_table endpoint.

Contributor Author

I added a try/except to delete the s3 table in case something goes wrong with writing the initial metadata.
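The cleanup pattern described here can be sketched as follows; `client` stands in for the boto3 s3tables client, and `write_metadata` is a hypothetical callback representing the initial metadata write:

```python
# Hedged sketch of the rollback: if writing the initial metadata fails after
# create_table succeeded, delete the half-created table before re-raising,
# so the bucket is not left with a table that has no metadata.
def create_table_with_cleanup(client, table_bucket_arn, namespace, name, write_metadata):
    client.create_table(
        tableBucketARN=table_bucket_arn, namespace=namespace, name=name, format="ICEBERG"
    )
    try:
        write_metadata()
    except Exception:
        # roll back the freshly created table, then propagate the error
        client.delete_table(tableBucketARN=table_bucket_arn, namespace=namespace, name=name)
        raise
```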

S3TABLES_ACCESS_KEY_ID = "s3tables.access-key-id"
S3TABLES_SECRET_ACCESS_KEY = "s3tables.secret-access-key"
S3TABLES_SESSION_TOKEN = "s3tables.session-token"

Contributor

Is there any case where the s3tables catalog will need a different set of credentials than the underlying FileIO for S3 buckets? It seems we could just re-use the s3. prefix here to avoid adding a new set of configuration names. I feel that, after all, s3tables still belongs to S3, so it is still intuitive. WDYT?

cc @kevinjqliu

Contributor

I feel after all s3tables still belong to s3 so it is still intuitive.

I was thinking about this too. I think it's a good idea to keep s3 (FileIO) separate from s3tables (Catalog).
It's likely that the credentials are the same, but keeping the logical constructs separate is important.

Maybe we can just support falling back to s3. and client.
https://github.com/apache/iceberg-python/pull/1429/files#diff-c4084a5be80f826e488a804050557a54d6ecb0307af07ff57028482fb75b7d76R58

Contributor Author

Using client. properties should be supported through get_first_property_value. If we want to support fallbacks to both s3. and client. we would need to determine an order. Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?

Contributor

Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?

I agree, especially since this separates FileIO properties from catalog properties. We should treat S3 Tables as S3TablesCatalog.

@HonahX (Contributor) commented Jan 8, 2025

s3 (FileIO) separate from s3tables (Catalog).

Good point. I agree that we need to reserve the s3tables prefix for S3TablesCatalog. My concern is that when configuring credentials, users will probably always use the client.* prefix, since technically the s3tables catalog and the s3 file io never use different sets of credentials. In this case, do we really need another set of keys to configure credentials for s3tables only? Would love to hear what you think.

Maybe it's more straightforward to stick with the fallback to client. like the dynamodb and glue catalog implementations?

I also agree. Thinking back, bringing s3. here seems to create more confusion/complexities than value. It is more clear that we stick to use client.* to represent shared properties across clients (glue, s3, s3tables).

Contributor

users will probably always use client.* prefix since technically s3tables catalog and s3 file io never use different set of credentials

Yeah, the only feature of having s3tables. is the ability to assign a credential different from the s3 file io. I think we should allow this use case, but we can recommend that users just set client. to take care of both file io and catalog configs, similar to the glue "Client-specific Properties" section https://py.iceberg.apache.org/configuration/#glue-catalog
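The fallback order discussed in this thread can be sketched with a self-contained re-implementation; pyiceberg ships a similar helper named get_first_property_value, but this standalone version is for illustration only and may differ from it in detail:

```python
# Sketch: prefer the catalog-specific "s3tables.*" key, then fall back to
# the shared "client.*" key; return None when neither is configured.
def get_first_property_value(properties, *keys):
    for key in keys:
        if key in properties:
            return properties[key]
    return None

# Example: only the shared client.* key is set, so it is used as the fallback.
props = {"client.region": "us-east-2"}
region = get_first_property_value(props, "s3tables.region", "client.region")
```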

@kevinjqliu (Contributor)

Can you run poetry lock --no-update for CI?

felixscherz and others added 28 commits January 8, 2025 09:05
@felixscherz felixscherz force-pushed the feat/s3tables-catalog branch from 894cbc9 to 2e1c383 Compare January 8, 2025 08:08