forked from wjohnson/pyapacheatlas
Excel Samples for All Features (wjohnson#53)
All tabs / features of the Excel reader now have standalone samples that are reproducible / re-runnable (i.e. for the type defs, we flip the force_update flag to True). I demonstrate the following features:
* Complex table and column level lineage in the Hive Bridge style.
* Bulk entity uploads, including the `[Relationship]` feature to relate columns with a given table.
* Entity type definitions and upload of an entity using that custom type.
* Updating lineage between two existing assets by creating a linking 'Process' entity.

In addition, updated the README to reflect all of the features of the package.
Showing 8 changed files with 400 additions and 37 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1 @@ | ||
__version__ = "0.0b18" | ||
__version__ = "0.0b19" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
# Excel Samples for PyApacheAtlas | ||
|
||
There are four key features of the PyApacheAtlas package with respect to the Excel frontend. | ||
|
||
## Features and Samples | ||
* **Bulk upload entities** | ||
* [Bulk Entities Excel Sample](./excel_bulk_entities_upload.py) | ||
* You want to dump entity information into excel and upload. | ||
* You want to provide some simple relationship mapping (e.g. the columns of a table). | ||
* Your entities may exist and you want them updated or they do not exist and you want them created. | ||
* **Hive Bridge Style Table and Column Lineage** | ||
* [Custom Table and Column Lineage Excel Sample](./excel_custom_table_column_lineage.py) | ||
* You are willing to use a custom type to capture more data about lineage. | ||
* You are interested in capturing more complex column level lineage. | ||
* None of the entities you want to upload exist in your catalog. | ||
* **Creating Custom DataSet Types** | ||
* [Custom Type Excel Sample](./excel_custom_type_and_entity_upload.py) | ||
* You have a custom dataset type you want to create with many attributes. | ||
* You want to upload an entity using that custom type as well. | ||
* **Create Lineage Between Two Existing Entities** | ||
* [Update / Create Lineage Between Existing Entities](./excel_update_lineage_upload.py) | ||
* You have two existing entities (an input and output) but there is no lineage between them. | ||
* You want to create a "process entity" that represents the process that ties the two tables together. | ||
|
||
Each sample linked above is standalone: it creates an Excel spreadsheet with all of the data to be uploaded, parses that spreadsheet, and uploads the result to your data catalog. | ||
|
||
I would strongly encourage you to run this in a dev / sandbox environment since it's a bit frustrating to find and delete the created entities. | ||
|
||
## Requirements | ||
|
||
* Follow the steps in the main README to install the latest version of PyApacheAtlas and its dependencies. | ||
* You'll need a Service Principal (for Azure Data Catalog) with the Catalog Admin role. | ||
* You'll need to set the following environment variables (shown in Windows `set` syntax; use `export` in a POSIX shell). | ||
|
||
``` | ||
set TENANT_ID=YOUR_TENANT_ID | ||
set CLIENT_ID=YOUR_SERVICE_PRINCIPAL_CLIENT_ID | ||
set CLIENT_SECRET=YOUR_SERVICE_PRINCIPAL_CLIENT_SECRET | ||
set ENDPOINT_URL=https://YOURCATALOGNAME.catalog.babylon.azure.com/api/atlas/v2 | ||
``` | ||
|
||
## Deleting Demo Entities and Types | ||
|
||
* Entities must be deleted by GUID; types can be deleted by name. | ||
* If you're following along with the built-in demos, search for 'pyapacheatlas' to find the majority of the entities. | ||
* To find the GUID, select the asset from your search results and copy the GUID from the URL. | ||
|
||
``` | ||
# Delete an Entity | ||
client.delete_entity(guid=["myguid1","myguid2"]) | ||
# Delete a Type Definition | ||
client.delete_type(name="mytypename") | ||
``` |
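Every sample reads its credentials with `os.environ.get(NAME, "")`, so a missing variable silently becomes an empty string and only surfaces later as an authentication failure. A minimal sketch of an up-front check follows. The helper name `missing_settings` is hypothetical and not part of PyApacheAtlas; the variable names are the four listed in the Requirements section of the README.

```python
import os

# Variable names taken from the Requirements section of the README.
REQUIRED_VARS = ("TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "ENDPOINT_URL")


def missing_settings(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to os.environ; passing a plain dict makes the
    check easy to exercise without touching the real environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_settings()` before constructing the client and stopping when the returned list is non-empty would make the cause of a failed run obvious.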
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
import json | ||
import os | ||
|
||
import openpyxl | ||
from openpyxl import Workbook | ||
from openpyxl import load_workbook | ||
|
||
# PyApacheAtlas packages | ||
# Connect to Atlas via a Service Principal | ||
from pyapacheatlas.auth import ServicePrincipalAuthentication | ||
from pyapacheatlas.core import AtlasClient # Communicate with your Atlas server | ||
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader | ||
|
||
def fill_in_workbook(file_path, excel_config): | ||
    # You can safely ignore this function as it just | ||
    # populates the excel spreadsheet. | ||
    wb = load_workbook(file_path) | ||
    bulkEntity_sheet = wb[excel_config.bulkEntity_sheet] | ||
|
||
    # BULK Sheet SCHEMA | ||
    # "typeName", "name", "qualifiedName", "classifications" | ||
    # Adding a couple columns to show the power of this sheet | ||
    # [Relationship] table, type | ||
    entities_to_load = [ | ||
        ["DataSet", "exampledataset", "pyapacheatlas://dataset", None, | ||
         None, None], | ||
        ["hive_table", "hivetable01", "pyapacheatlas://hivetable01", None, | ||
         None, None], | ||
        ["hive_column", "columnA", "pyapacheatlas://hivetable01#colA", None, | ||
         'pyapacheatlas://hivetable01', 'string'], | ||
        ["hive_column", "columnB", "pyapacheatlas://hivetable01#colB", None, | ||
         'pyapacheatlas://hivetable01', 'long'], | ||
        ["hive_column", "columnC", "pyapacheatlas://hivetable01#colC", None, | ||
         'pyapacheatlas://hivetable01', 'int'] | ||
    ] | ||
|
||
    # Need to adjust the default header to include our extra attributes | ||
    bulkEntity_sheet['E1'] = '[Relationship] table' | ||
    bulkEntity_sheet['F1'] = 'type' | ||
|
||
    # Populate the excel template with samples above | ||
    table_row_counter = 0 | ||
    for row in bulkEntity_sheet.iter_rows(min_row=2, max_col=6, | ||
                                          max_row=len(entities_to_load) + 1): | ||
        for idx, cell in enumerate(row): | ||
            cell.value = entities_to_load[table_row_counter][idx] | ||
        table_row_counter += 1 | ||
|
||
    wb.save(file_path) | ||
|
||
|
||
if __name__ == "__main__": | ||
    """ | ||
    This sample provides an end to end sample of reading an excel file, | ||
    generating a batch of entities, and then uploading the entities to | ||
    your data catalog. | ||
    """ | ||
|
||
    # Authenticate against your Atlas server | ||
    oauth = ServicePrincipalAuthentication( | ||
        tenant_id=os.environ.get("TENANT_ID", ""), | ||
        client_id=os.environ.get("CLIENT_ID", ""), | ||
        client_secret=os.environ.get("CLIENT_SECRET", "") | ||
    ) | ||
    client = AtlasClient( | ||
        endpoint_url=os.environ.get("ENDPOINT_URL", ""), | ||
        authentication=oauth | ||
    ) | ||
|
||
    # SETUP: This is just setting up the excel file for you | ||
    file_path = "./demo_bulk_entities_upload.xlsx" | ||
    excel_config = ExcelConfiguration() | ||
    excel_reader = ExcelReader(excel_config) | ||
|
||
    # Create an empty excel template to be populated | ||
    excel_reader.make_template(file_path) | ||
    # This is just a helper to fill in some demo data | ||
    fill_in_workbook(file_path, excel_config) | ||
|
||
    # ACTUAL WORK: This parses our excel file and creates a batch to upload | ||
    entities = excel_reader.parse_bulk_entities(file_path) | ||
|
||
    # This is what is getting sent to your Atlas server | ||
    # print(json.dumps(entities,indent=2)) | ||
|
||
    results = client.upload_entities(entities) | ||
|
||
    print(json.dumps(results,indent=2)) | ||
|
||
    print("Completed bulk upload successfully!\nSearch for hivetable01 to see your results.") |
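The bulk sheet in this sample is a header row followed by positional rows, with the extra `[Relationship] table` column tying each column entity to its table by qualified name. The sketch below illustrates how such rows pair up with the header. It is an illustration only, not the `ExcelReader` implementation; `rows_to_records` and the `attributes` / `relationships` record layout are hypothetical names chosen for this example.

```python
# Header as written by the sample: the four default columns plus the
# two extra columns added in cells E1 and F1.
HEADER = ["typeName", "name", "qualifiedName", "classifications",
          "[Relationship] table", "type"]

RELATIONSHIP_PREFIX = "[Relationship] "


def rows_to_records(header, rows):
    """Pair each row with the header, skipping empty (None) cells.

    Columns whose header starts with the relationship prefix are
    collected separately from plain attributes.
    """
    records = []
    for row in rows:
        attributes, relationships = {}, {}
        for column, value in zip(header, row):
            if value is None:
                continue
            if column.startswith(RELATIONSHIP_PREFIX):
                relationships[column[len(RELATIONSHIP_PREFIX):]] = value
            else:
                attributes[column] = value
        records.append({"attributes": attributes,
                        "relationships": relationships})
    return records


rows = [
    ["hive_table", "hivetable01", "pyapacheatlas://hivetable01",
     None, None, None],
    ["hive_column", "columnA", "pyapacheatlas://hivetable01#colA",
     None, "pyapacheatlas://hivetable01", "string"],
]
records = rows_to_records(HEADER, rows)
```

Here the table row ends up with no relationships, while the column row carries a `table` relationship pointing at the table's qualified name.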
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
import json | ||
import os | ||
|
||
import openpyxl | ||
from openpyxl import Workbook | ||
from openpyxl import load_workbook | ||
|
||
# PyApacheAtlas packages | ||
# Connect to Atlas via a Service Principal | ||
from pyapacheatlas.auth import ServicePrincipalAuthentication | ||
from pyapacheatlas.core import AtlasClient # Communicate with your Atlas server | ||
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader | ||
|
||
from pyapacheatlas.core import TypeCategory | ||
|
||
def fill_in_type_workbook(file_path, excel_config): | ||
    # You can safely ignore this function as it just | ||
    # populates the excel spreadsheet. | ||
    wb = load_workbook(file_path) | ||
    entityDef_sheet = wb[excel_config.entityDef_sheet] | ||
|
||
    # ENTITYDEF Sheet SCHEMA | ||
    # "Entity TypeName", "name", "description", | ||
    # "isOptional", "isUnique", "defaultValue", | ||
    # "typeName", "displayName", "valuesMinCount", | ||
    # "valuesMaxCount", "cardinality", "includeInNotification", | ||
    # "indexType", "isIndexable" | ||
    attributes_to_load = [ | ||
        ["pyapacheatlas_custom_type", "fizz", "This will be the optional fizz attribute", | ||
         None, None, None, | ||
         None, None, None, | ||
         None, None, None, | ||
         None, None | ||
         ], | ||
        ["pyapacheatlas_custom_type", "buzz", "This will be the REQUIRED buzz attribute", | ||
         False, None, None, | ||
         None, None, None, | ||
         None, None, None, | ||
         None, None | ||
         ], | ||
    ] | ||
|
||
    # Populate the excel template with samples above. | ||
    # Only the first six columns are written; the remaining values are None. | ||
    table_row_counter = 0 | ||
    for row in entityDef_sheet.iter_rows(min_row=2, max_col=6, | ||
                                         max_row=len(attributes_to_load) + 1): | ||
        for idx, cell in enumerate(row): | ||
            cell.value = attributes_to_load[table_row_counter][idx] | ||
        table_row_counter += 1 | ||
|
||
    wb.save(file_path) | ||
|
||
def fill_in_entity_workbook(file_path, excel_config): | ||
    # You can safely ignore this function as it just | ||
    # populates the excel spreadsheet. | ||
    wb = load_workbook(file_path) | ||
    bulkEntity_sheet = wb[excel_config.bulkEntity_sheet] | ||
|
||
    # BULK Sheet SCHEMA | ||
    # "typeName", "name", "qualifiedName", "classifications" | ||
    # Adding a couple columns to show the power of this sheet | ||
    # fizz, buzz | ||
    entities_to_load = [ | ||
        ["pyapacheatlas_custom_type", "custom_type_entity", | ||
         "pyapacheatlas://example_from_custom_type", None, | ||
         "abc", "123" | ||
         ], | ||
    ] | ||
|
||
    # Need to adjust the default header to include our extra attributes | ||
    bulkEntity_sheet['E1'] = 'fizz' | ||
    bulkEntity_sheet['F1'] = 'buzz' | ||
|
||
    # Populate the excel template with samples above | ||
    table_row_counter = 0 | ||
    for row in bulkEntity_sheet.iter_rows(min_row=2, max_col=6, | ||
                                          max_row=len(entities_to_load) + 1): | ||
        for idx, cell in enumerate(row): | ||
            cell.value = entities_to_load[table_row_counter][idx] | ||
        table_row_counter += 1 | ||
|
||
    wb.save(file_path) | ||
|
||
|
||
if __name__ == "__main__": | ||
    """ | ||
    This sample provides an end to end sample of reading an excel file, | ||
    creating a custom type and then uploading an entity of that custom type. | ||
    """ | ||
|
||
    # Authenticate against your Atlas server | ||
    oauth = ServicePrincipalAuthentication( | ||
        tenant_id=os.environ.get("TENANT_ID", ""), | ||
        client_id=os.environ.get("CLIENT_ID", ""), | ||
        client_secret=os.environ.get("CLIENT_SECRET", "") | ||
    ) | ||
    client = AtlasClient( | ||
        endpoint_url=os.environ.get("ENDPOINT_URL", ""), | ||
        authentication=oauth | ||
    ) | ||
|
||
    # SETUP: This is just setting up the excel file for you | ||
    file_path = "./demo_custom_type_and_entity_upload.xlsx" | ||
    excel_config = ExcelConfiguration() | ||
    excel_reader = ExcelReader(excel_config) | ||
|
||
    # Create an empty excel template to be populated | ||
    excel_reader.make_template(file_path) | ||
    # This is just a helper to fill in some demo data | ||
    fill_in_type_workbook(file_path, excel_config) | ||
    fill_in_entity_workbook(file_path, excel_config) | ||
|
||
    # ACTUAL WORK: This parses our excel file and creates a batch to upload | ||
    typedefs = excel_reader.parse_entity_defs(file_path) | ||
    entities = excel_reader.parse_bulk_entities(file_path) | ||
|
||
    # This is what is getting sent to your Atlas server | ||
    # print(json.dumps(typedefs,indent=2)) | ||
    # print(json.dumps(entities,indent=2)) | ||
|
||
    # force_update=True keeps the sample re-runnable when the type already exists | ||
    type_results = client.upload_typedefs(typedefs, force_update=True) | ||
    entity_results = client.upload_entities(entities) | ||
|
||
    print(json.dumps(type_results,indent=2)) | ||
    print("\n") | ||
    print(json.dumps(entity_results,indent=2)) | ||
|
||
    print("Completed type and bulk upload successfully!\nSearch for custom_type_entity to see your results.") |
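In the EntityDef sheet each row describes one attribute, and rows sharing the same "Entity TypeName" belong to the same custom type. The sketch below shows one way such rows could be grouped into a single definition per type. It is a rough illustration under stated assumptions, not what `parse_entity_defs` actually emits: the function name `group_attribute_rows`, the `DataSet` super type, the `string` attribute type, and the rule that a blank `isOptional` cell means optional are all assumptions made for this example (the last one mirrors the sample's "optional fizz" / "REQUIRED buzz" descriptions).

```python
def group_attribute_rows(rows):
    """Group (type name, attribute name, description, isOptional) rows
    into one entity definition per type name.

    Assumptions for this sketch: every type extends DataSet, every
    attribute is a string, and a blank (None) isOptional means optional.
    """
    typedefs = {}
    for type_name, attr_name, description, is_optional in rows:
        entity_def = typedefs.setdefault(type_name, {
            "category": "ENTITY",
            "name": type_name,
            "superTypes": ["DataSet"],
            "attributeDefs": [],
        })
        entity_def["attributeDefs"].append({
            "name": attr_name,
            "description": description,
            "typeName": "string",
            "isOptional": True if is_optional is None else bool(is_optional),
        })
    return {"entityDefs": list(typedefs.values())}


# First four columns of the two rows used in the sample above.
sample_rows = [
    ["pyapacheatlas_custom_type", "fizz",
     "This will be the optional fizz attribute", None],
    ["pyapacheatlas_custom_type", "buzz",
     "This will be the REQUIRED buzz attribute", False],
]
payload = group_attribute_rows(sample_rows)
```

Both rows collapse into one `pyapacheatlas_custom_type` definition with two attributes, with `fizz` optional and `buzz` required.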