[590] Add RFC for XCatalogSync - Synchronize tables across catalogs #605
## Abstract

Users of Apache XTable (Incubating) today can translate metadata across table formats (Iceberg, Hudi, and Delta) and use the tables on the platforms of their choice.
Some usability friction remains because users need to explicitly [register](https://xtable.apache.org/docs/catalogs-index) the tables in the catalog of their choice (Glue, HMS, Unity, BigLake, etc.).
Hi @vinishjail97, thanks for sharing the details. I think catalog sync is a useful feature. One of the key value adds of catalogs is governance, particularly access control. All the catalogs mentioned here provide the ability to grant different privileges to roles. The proposed catalog sync in XTable replicates the table across catalogs. What are your thoughts about porting the governance features?
I think this is a good long-term vision, but it deserves its own RFC to draw up the entities, since it will expand beyond tables.
+1, we can use the `RunCatalogSync` utility to support governance sync as well in the future, but yeah, that would need a separate RFC.
#591 (comment)
## Implementation

Introducing the following interfaces. [[PR]](https://github.com/apache/incubator-xtable/pull/603)
1. `CatalogSyncClient`: This interface contains the methods responsible for creating a table, refreshing table metadata, dropping a table, etc. in the target catalog. Consider this interface a translation layer between `InternalTable` and the catalog's table object.
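
To make the "translation layer" idea concrete, here is a minimal sketch of what such a client could look like. Everything in it is an assumption for illustration: `InternalTableSketch`, `CatalogSyncClientSketch`, the method names, and the in-memory catalog are hypothetical and are not the API proposed in the linked PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, trimmed-down stand-in for XTable's InternalTable. */
final class InternalTableSketch {
    final String name;
    final String basePath;
    final String tableFormat;

    InternalTableSketch(String name, String basePath, String tableFormat) {
        this.name = name;
        this.basePath = basePath;
        this.tableFormat = tableFormat;
    }
}

/** Assumed shape of the client; T is the target catalog's own table object. */
interface CatalogSyncClientSketch<T> {
    boolean hasTable(String tableIdentifier);
    T getTable(String tableIdentifier);
    void createTable(InternalTableSketch table, String tableIdentifier);
    void refreshTable(InternalTableSketch table, String tableIdentifier);
    void dropTable(String tableIdentifier);
}

/** Toy target catalog that stores each table as a map of properties. */
final class InMemoryCatalogSyncClient
        implements CatalogSyncClientSketch<Map<String, String>> {
    private final Map<String, Map<String, String>> tables = new HashMap<>();

    // Translation layer: internal representation -> this catalog's table object.
    private static Map<String, String> toCatalogTable(InternalTableSketch table) {
        Map<String, String> props = new HashMap<>();
        props.put("name", table.name);
        props.put("location", table.basePath);
        props.put("table_type", table.tableFormat);
        return props;
    }

    @Override
    public boolean hasTable(String tableIdentifier) {
        return tables.containsKey(tableIdentifier);
    }

    @Override
    public Map<String, String> getTable(String tableIdentifier) {
        return tables.get(tableIdentifier);
    }

    @Override
    public void createTable(InternalTableSketch table, String tableIdentifier) {
        tables.put(tableIdentifier, toCatalogTable(table));
    }

    @Override
    public void refreshTable(InternalTableSketch table, String tableIdentifier) {
        if (!tables.containsKey(tableIdentifier)) {
            throw new IllegalStateException("Table not found: " + tableIdentifier);
        }
        tables.put(tableIdentifier, toCatalogTable(table));
    }

    @Override
    public void dropTable(String tableIdentifier) {
        tables.remove(tableIdentifier);
    }
}

final class CatalogSyncSketch {
    /** Creates the table if the target catalog lacks it; otherwise refreshes it. */
    static <T> void syncTable(
            CatalogSyncClientSketch<T> client, InternalTableSketch table, String tableIdentifier) {
        if (client.hasTable(tableIdentifier)) {
            client.refreshTable(table, tableIdentifier);
        } else {
            client.createTable(table, tableIdentifier);
        }
    }
}
```

A real client for Glue or HMS would replace the in-memory map with calls to that catalog's SDK; the create-or-refresh decision in `syncTable` is one plausible way to drive it, not a confirmed design.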
I don't think table DDL operations are related to InternalTable. But you bring up an important point. The table format and the catalog layer are two different layers in the analytics stack. Currently, XTable only supports conversion of table format level metadata, which is captured by current InternalTable. However, the proposed feature extends to the catalog layer where catalog level metadata translation takes place. So, in effect, this feature syncs InternalCatalog object. What are your thoughts?
Currently the feature allows you to sync to a catalog without a source catalog by just registering those tables in the target catalogs. I think in its current form this is still syncing at the table level. If we want to support new entities to sync like access control and permissions level metadata, we are in a position to expand what the clients can do.
`InternalCatalog` will probably take shape when we start supporting access control and permissions level synchronization. For now this feature is still synchronizing table level metadata from source catalog to target catalog.
1. `catalogId`: A user-defined unique identifier for the catalog. It allows the user to sync a table to multiple catalogs of the same name/type, e.g. HMS catalog with url1 and HMS catalog with url2.
2. `catalogType`: The type of the source catalog. This might be a specific type understood by XTable, such as Hive, Glue, etc.
3. `catalogSyncClientImpl` (optional): A fully qualified name of a class that implements the `CatalogSyncClient` interface. It can be used if an implementation for `catalogType` doesn't exist in XTable.
4. `catalogConversionSourceImpl` (optional): A fully qualified name of a class that implements the `CatalogConversionSource` interface. It can be used if an implementation for `catalogType` doesn't exist in XTable.
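
As a rough illustration of how these four fields could interact, the sketch below models the configuration as a small Java class and resolves which sync client class to use. The class names, the built-in registry entries (`org.example...`), and the rule that an explicit `catalogSyncClientImpl` takes precedence over a built-in lookup by `catalogType` are all assumptions made for this example, not behavior confirmed by the RFC.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical holder for the four catalog configuration fields. */
final class CatalogConfigSketch {
    final String catalogId;
    final String catalogType;
    final Optional<String> catalogSyncClientImpl;
    final Optional<String> catalogConversionSourceImpl;

    CatalogConfigSketch(
            String catalogId,
            String catalogType,
            String catalogSyncClientImpl,
            String catalogConversionSourceImpl) {
        if (catalogId == null || catalogId.isEmpty()) {
            throw new IllegalArgumentException("catalogId is required");
        }
        if (catalogType == null || catalogType.isEmpty()) {
            throw new IllegalArgumentException("catalogType is required");
        }
        this.catalogId = catalogId;
        this.catalogType = catalogType;
        // The two *Impl fields are optional, so null is accepted here.
        this.catalogSyncClientImpl = Optional.ofNullable(catalogSyncClientImpl);
        this.catalogConversionSourceImpl = Optional.ofNullable(catalogConversionSourceImpl);
    }
}

final class CatalogImplResolverSketch {
    // Hypothetical registry of built-in clients; these class names are placeholders.
    private static final Map<String, String> BUILT_IN_SYNC_CLIENTS = Map.of(
            "HMS", "org.example.catalog.HmsCatalogSyncClient",
            "GLUE", "org.example.catalog.GlueCatalogSyncClient");

    /** Returns the sync client class name: explicit impl first, else built-in by type. */
    static String resolveSyncClient(CatalogConfigSketch config) {
        return config.catalogSyncClientImpl.orElseGet(() -> {
            String impl = BUILT_IN_SYNC_CLIENTS.get(config.catalogType.toUpperCase(Locale.ROOT));
            if (impl == null) {
                throw new IllegalArgumentException(
                        "No built-in client for catalogType '" + config.catalogType
                                + "'; set catalogSyncClientImpl");
            }
            return impl;
        });
    }
}
```

With this shape, two HMS catalogs that differ only in `catalogId` resolve to the same built-in client, while a catalog type unknown to the registry works only when the user supplies `catalogSyncClientImpl`.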
It is not clear what the difference is between the roles of `CatalogConversionSource` and `CatalogSyncClient`. Could you please clarify? In the case of table sync, only TableSource exists; there is no TableClient.
Yes I can. XTable's current functionality is to synchronize `SourceTable` and `TargetTable` for table format metadata; the user needs to input details related to the storage base path, format, etc. and define the `SourceTable` object. `CatalogConversionSource` is the interface that converts a catalog table object to a `SourceTable`, so for syncing tables between source and target catalogs, the user just needs to provide the table identifier in the source catalog. `CatalogSyncClient` is the interface responsible for synchronizing `InternalTable` to the target catalogs.
In the case of table format sync as well, we have two interfaces, `ConversionSource` and `ConversionTarget`.
What is the purpose of the pull request
Adds RFC for XCatalogSync - Synchronize tables across catalogs.
Verify this pull request
This pull request is a trivial rework / code cleanup without any test coverage.