[590] Add RFC for XCatalogSync - Synchronize tables across catalogs #605

vinishjail97
Contributor

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

Adds RFC for XCatalogSync - Synchronize tables across catalogs.

Brief change log

(for example:)

  • Add RFC for XCatalogSync - Synchronize tables across catalogs

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

## Abstract

Users of Apache XTable (Incubating) can today translate metadata across table formats (Iceberg, Hudi, and Delta) and use the tables on the platforms of their choice.
Some usability friction remains, however, because users must explicitly [register](https://xtable.apache.org/docs/catalogs-index) the tables in the catalog of their choice (Glue, HMS, Unity, BigLake, etc.).
Contributor

Hi @vinishjail97, thanks for sharing the details. I think catalog sync is a useful feature. One of the key value adds of catalogs is governance, particularly access control. All the catalogs mentioned here provide the ability to grant different privileges to roles. The proposed catalog sync in XTable replicates the table across catalogs. What are your thoughts about porting the governance features?

Contributor

I think this is a good long term vision but deserves its own RFC to draw up the entities since it will expand beyond the tables now.

Contributor Author

+1 we can use the RunCatalogSync utility to support governance sync as well in the future but yeah would need a separate RFC.
#591 (comment)

## Implementation

Introducing the following interfaces. [[PR]](https://github.com/apache/incubator-xtable/pull/603)
1. `CatalogSyncClient`: This interface contains the methods responsible for creating a table, refreshing table metadata, dropping a table, etc. in the target catalog. Consider this interface a translation layer between InternalTable and the catalog's table object.
Contributor

I don't think table DDL operations are related to InternalTable. But you bring up an important point. The table format and the catalog layer are two different layers in the analytics stack. Currently, XTable only supports conversion of table-format-level metadata, which is captured by the current InternalTable. However, the proposed feature extends to the catalog layer, where catalog-level metadata translation takes place. So, in effect, this feature syncs an InternalCatalog object. What are your thoughts?

Contributor

Currently the feature allows you to sync to a catalog without a source catalog by just registering those tables in the target catalogs. I think in its current form this is still syncing at the table level. If we want to support new entities to sync like access control and permissions level metadata, we are in a position to expand what the clients can do.

Contributor Author

InternalCatalog will probably take shape when we start supporting access control and permissions level synchronization. For now this feature is still synchronizing table level metadata from source catalog to target catalog.

1. `catalogId`: A user-defined unique identifier for the catalog. It allows the user to sync a table to multiple catalogs of the same name/type, e.g. an HMS catalog with url1 and an HMS catalog with url2.
2. `catalogType`: The type of the source catalog. This might be a specific type understood by XTable, such as Hive or Glue.
3. `catalogSyncClientImpl` (optional): The fully qualified name of a class that implements the `CatalogSyncClient` interface. It can be used if no implementation for the catalogType exists in XTable.
4. `catalogConversionSourceImpl` (optional): The fully qualified name of a class that implements the `CatalogConversionSource` interface. It can be used if no implementation for the catalogType exists in XTable.
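
To make the four fields above concrete, here is a minimal Java sketch of how they might be held and how the optional `catalogSyncClientImpl` could override a built-in lookup by `catalogType`. The class `CatalogConfigSketch`, the `BUILT_IN_SYNC_CLIENTS` registry, and every class name in it are hypothetical and are not taken from the RFC or from XTable; treating both `catalogId` and `catalogType` as required is also an assumption.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical holder for the four fields listed above; not XTable's actual config class.
final class CatalogConfigSketch {
  // Made-up registry of built-in sync clients keyed by catalogType (illustrative names only).
  private static final Map<String, String> BUILT_IN_SYNC_CLIENTS =
      Map.of(
          "HMS", "example.HmsCatalogSyncClient",
          "GLUE", "example.GlueCatalogSyncClient");

  final String catalogId;
  final String catalogType;
  final String catalogSyncClientImpl; // optional, may be null
  final String catalogConversionSourceImpl; // optional, may be null

  CatalogConfigSketch(
      String catalogId,
      String catalogType,
      String catalogSyncClientImpl,
      String catalogConversionSourceImpl) {
    // Assumption: catalogId and catalogType are both mandatory.
    if (catalogId == null || catalogId.trim().isEmpty()) {
      throw new IllegalArgumentException("catalogId is required");
    }
    if (catalogType == null || catalogType.trim().isEmpty()) {
      throw new IllegalArgumentException("catalogType is required");
    }
    this.catalogId = catalogId;
    this.catalogType = catalogType;
    this.catalogSyncClientImpl = catalogSyncClientImpl;
    this.catalogConversionSourceImpl = catalogConversionSourceImpl;
  }

  // An explicit class name wins; otherwise fall back to the built-in entry for catalogType.
  String resolveSyncClientClass() {
    if (catalogSyncClientImpl != null) {
      return catalogSyncClientImpl;
    }
    String builtIn = BUILT_IN_SYNC_CLIENTS.get(catalogType.toUpperCase(Locale.ROOT));
    if (builtIn == null) {
      throw new IllegalArgumentException(
          "No sync client known for catalogType " + catalogType + "; set catalogSyncClientImpl");
    }
    return builtIn;
  }

  public static void main(String[] args) {
    // Two catalogs of the same type are told apart by catalogId.
    CatalogConfigSketch hms1 = new CatalogConfigSketch("hms-1", "HMS", null, null);
    CatalogConfigSketch hms2 = new CatalogConfigSketch("hms-2", "HMS", null, null);
    // A type with no built-in entry supplies its own implementation class.
    CatalogConfigSketch custom =
        new CatalogConfigSketch("my-catalog", "CUSTOM", "com.example.MySyncClient", null);
    for (CatalogConfigSketch c : new CatalogConfigSketch[] {hms1, hms2, custom}) {
      System.out.println(c.catalogId + " -> " + c.resolveSyncClientClass());
    }
  }
}
```

The same fallback would apply to `catalogConversionSourceImpl`; it is left out here to keep the sketch short.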
Contributor

It is not clear what the difference is between the roles of CatalogConversionSource and CatalogSyncClient. Could you please clarify? In the case of table sync, only TableSource exists; there is no TableClient.

Contributor Author

Yes, I can. XTable's current functionality is to synchronize SourceTable and TargetTable for table format metadata; the user needs to input details such as the storage base path and format to define the SourceTable object. CatalogConversionSource is the interface that converts a catalog table object to a SourceTable, so to sync tables between source and target catalogs the user only needs to provide the table identifier in the source catalog. CatalogSyncClient is the interface responsible for synchronizing InternalTable to the target catalogs.

Table format sync likewise has two interfaces, ConversionSource and ConversionTarget.
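
A rough Java sketch of the split described in this reply: one interface reads a table from the source catalog by identifier, the other writes it to each target catalog. All names and signatures here (`TableInfo`, `CatalogConversionSourceSketch`, `CatalogSyncClientSketch`, `InMemoryCatalog`) are hypothetical and do not reflect the interfaces in the linked PR. As a simplification, a single `TableInfo` type stands in for both SourceTable and InternalTable, and the table format conversion step between them is omitted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for both SourceTable and InternalTable (illustrative only).
final class TableInfo {
  final String basePath;
  final String format;

  TableInfo(String basePath, String format) {
    this.basePath = basePath;
    this.format = format;
  }
}

// Source side: resolves a table identifier in the source catalog to table details.
interface CatalogConversionSourceSketch {
  TableInfo getSourceTable(String tableIdentifier);
}

// Target side: creates or refreshes the table in one target catalog.
interface CatalogSyncClientSketch {
  void createOrRefreshTable(String tableIdentifier, TableInfo table);
}

// Toy catalog backed by a map, so it can play either role in the demo.
final class InMemoryCatalog implements CatalogConversionSourceSketch, CatalogSyncClientSketch {
  private final Map<String, TableInfo> tables = new HashMap<>();

  @Override
  public TableInfo getSourceTable(String tableIdentifier) {
    TableInfo table = tables.get(tableIdentifier);
    if (table == null) {
      throw new IllegalArgumentException("Unknown table: " + tableIdentifier);
    }
    return table;
  }

  @Override
  public void createOrRefreshTable(String tableIdentifier, TableInfo table) {
    tables.put(tableIdentifier, table);
  }

  boolean hasTable(String tableIdentifier) {
    return tables.containsKey(tableIdentifier);
  }
}

final class CatalogSyncSketch {
  // Read once from the source catalog, then write to every target catalog.
  static void sync(
      CatalogConversionSourceSketch source,
      List<CatalogSyncClientSketch> targets,
      String tableIdentifier) {
    TableInfo table = source.getSourceTable(tableIdentifier);
    for (CatalogSyncClientSketch target : targets) {
      target.createOrRefreshTable(tableIdentifier, table);
    }
  }

  public static void main(String[] args) {
    InMemoryCatalog source = new InMemoryCatalog();
    source.createOrRefreshTable("db.orders", new TableInfo("s3://bucket/orders", "HUDI"));
    InMemoryCatalog targetA = new InMemoryCatalog();
    InMemoryCatalog targetB = new InMemoryCatalog();
    // The user supplies only the table identifier, not the base path or format.
    sync(source, List.of(targetA, targetB), "db.orders");
    System.out.println(targetA.hasTable("db.orders") && targetB.hasTable("db.orders"));
  }
}
```

Keeping the read and write roles in separate interfaces means a catalog can be supported as a target only, which matches the case mentioned earlier in the thread of syncing to a catalog without a source catalog.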
