Excel Samples for All Features (wjohnson#53)
All tabs / features of the Excel reader now have standalone samples that are reproducible / re-runnable (i.e. for the type defs, we flip the force_update flag to True).

I demonstrate the following features:
* Complex table and column level lineage in the Hive Bridge Style.
* Bulk Entity uploads including demonstrating the `[Relationship]` feature to relate columns with a given table.
* Entity Type Definitions and upload of an entity using that custom type.
* Updating Lineage between two existing assets by creating a linking 'Process' entity.

In addition, updated the README to reflect all of the features of the package.
wjohnson authored Nov 6, 2020
1 parent 129d1ee commit 0860421
Showing 8 changed files with 400 additions and 37 deletions.
59 changes: 25 additions & 34 deletions README.md
@@ -3,20 +3,27 @@
A python package to work with the Apache Atlas API and support bulk loading from different file types.

The package currently supports:
* Creating a column lineage scaffolding as in the [Hive Bridge style](https://atlas.apache.org/0.8.3/Bridge-Hive.html).
* Creating and reading from an excel template file
* From Excel, constructing the defined entities and column lineages.
* Table entities
* Column entities
* Table lineage processes
* Column lineage processes
* Supports Azure Data Catalog ColumnMapping Attributes.
* Bulk upload of entities.
* Bulk upload of type definitions.
* Creating custom lineage between two existing entities.
* Creating custom table and complex column level lineage in the [Hive Bridge style](https://atlas.apache.org/0.8.3/Bridge-Hive.html).
* Supports Azure Data Catalog ColumnMapping Attributes.
* Creating a column lineage scaffolding as in the Hive Bridge Style.
* Performing "What-If" analysis to check if...
* Your entities are valid types.
* Your entities are missing required attributes.
* Your entities are using undefined attributes.
* Working with the glossary.
* Uploading terms.
* Downloading individual or all terms.
* Working with relationships.
* Able to create arbitrary relationships between entities.
* e.g. associating a given column with a table.
* Able to upload relationship definitions.
* Deleting types (by name) or entities (by guid).
* Search (only for Azure Data Catalog advanced search).
* Authentication to Azure Data Catalog via Service Principal.
* Authentication using basic authentication of username and password.
* Authentication using basic authentication of username and password for open source Atlas.

## Quickstart

@@ -27,7 +34,7 @@ Create a wheel distribution file and install it in your environment.
```
python -m pip install wheel
python setup.py bdist_wheel
python -m pip install ./dist/pyapacheatlas-0.0b18-py3-none-any.whl
python -m pip install ./dist/pyapacheatlas-0.0b19-py3-none-any.whl
```

### Create a Client Connection
@@ -81,35 +88,19 @@ upload_results = client.upload_entities([ae.to_json()])

### Create Entities from Excel

Read from a standardized excel template to create table, column, table process, and column lineage entities. Follows / Requires the hive bridge style of column lineages.
Read from a standardized excel template that supports...

```
from pyapacheatlas.core import TypeCategory
from pyapacheatlas.scaffolding import column_lineage_scaffold
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader
file_path = "./atlas_excel_template.xlsx"
# Create the Excel Template
ExcelReader.make_template(file_path)
# Populate the excel file manually!
* Bulk uploading entities into your data catalog.
* Creating custom table and column level lineage.
* Creating custom type definitions for datasets
* Creating custom lineage between existing assets / entities in your data catalog.

# Generate the base atlas type defs
all_type_defs = client.get_typedefs(TypeCategory.ENTITY)
See end to end samples for each scenario in the [excel samples](./samples/excel/README.md).

# Create objects for
ec = ExcelConfiguration()
excel_reader = ExcelReader(ec)
# Read from excel file and convert to
entities = excel_reader.parse_lineage(file_path, all_type_defs)
upload_results = client.upload_entities(entities)
print(json.dumps(upload_results, indent=1))
```
Learn more about the Excel [features and configuration in the wiki](https://github.com/wjohnson/pyapacheatlas/wiki/Excel-Template-and-Configuration).

## Additional Resources

* Learn more about this package in the github wiki.
* Learn more about this package in the [github wiki](https://github.com/wjohnson/pyapacheatlas/wiki/Excel-Template-and-Configuration).
* The [Apache Atlas client in Python](https://pypi.org/project/pyatlasclient/)
* The [Apache Atlas REST API](http://atlas.apache.org/api/v2/)
2 changes: 1 addition & 1 deletion pyapacheatlas/__init__.py
@@ -1 +1 @@
__version__ = "0.0b18"
__version__ = "0.0b19"
8 changes: 6 additions & 2 deletions pyapacheatlas/core/client.py
@@ -90,8 +90,12 @@ def delete_type(self, name):
            atlas_endpoint,
            headers=self.authentication.get_authentication_headers())

        results = self._handle_response(deleteType)

        try:
            deleteType.raise_for_status()
        except requests.RequestException:
            raise Exception(deleteType.text)

        results = {"message":f"successfully deleted {name}"}
        return results

    def get_entity(self, guid=None, qualifiedName=None, typeName=None):
54 changes: 54 additions & 0 deletions samples/excel/README.md
@@ -0,0 +1,54 @@
# Excel Samples for PyApacheAtlas

There are four key features of the PyApacheAtlas package with respect to the Excel frontend.

## Features and Samples
* **Bulk upload entities**
  * [Bulk Entities Excel Sample](./excel_bulk_entities_upload.py)
  * You want to dump entity information into excel and upload.
  * You want to provide some simple relationship mapping (e.g. the columns of a table).
  * Your entities may exist and you want them updated or they do not exist and you want them created.
* **Hive Bridge Style Table and Column Lineage**
  * [Custom Table and Column Lineage Excel Sample](./excel_custom_table_column_lineage.py)
  * You are willing to use a custom type to capture more data about lineage.
  * You are interested in capturing more complex column level lineage.
  * None of the entities you want to upload exist in your catalog.
* **Creating Custom DataSet Types**
  * [Custom Type Excel Sample](./excel_custom_type_and_entity_upload.py)
  * You have a custom dataset type you want to create with many attributes.
  * You want to upload an entity using that custom type as well.
* **Create Lineage Between Two Existing Entities**
  * [Update / Create Lineage Between Existing Entities](./excel_update_lineage_upload.py)
  * You have two existing entities (an input and output) but there is no lineage between them.
  * You want to create a "process entity" that represents the process that ties the two tables together.

Each sample linked above is standalone: it creates an Excel spreadsheet with all of the data to be uploaded, parses that spreadsheet, and then uploads the result to your data catalog.

I would strongly encourage you to run these samples in a dev / sandbox environment, since finding and deleting the created entities afterwards is a bit frustrating.
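To make the "simple relationship mapping" idea more concrete, here is a minimal sketch of how a header such as `[Relationship] table` could be told apart from ordinary attribute columns. This is not the package's implementation; `REL_PREFIX` and `split_row` are hypothetical names used only for illustration, and the header and row values mirror the bulk entities sample.

```python
# Hypothetical illustration only: not how PyApacheAtlas parses the sheet internally.
REL_PREFIX = "[Relationship]"


def split_row(headers, row):
    """Separate plain attributes from relationship references for one row."""
    attributes, relationships = {}, {}
    for header, value in zip(headers, row):
        if value is None:
            continue  # empty cells carry no information
        if header.startswith(REL_PREFIX):
            # "[Relationship] table" -> relationship named "table"
            relationships[header[len(REL_PREFIX):].strip()] = value
        else:
            attributes[header] = value
    return attributes, relationships


headers = ["typeName", "name", "qualifiedName", "classifications",
           "[Relationship] table", "type"]
row = ["hive_column", "columnA", "pyapacheatlas://hivetable01#colA", None,
       "pyapacheatlas://hivetable01", "string"]

attributes, relationships = split_row(headers, row)
print(attributes)
print(relationships)
```

In this sketch the column row ends up with a `table` relationship pointing at the table's qualified name, while `type` stays an ordinary attribute.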

## Requirements

* Follow the steps in the main README to install the latest version of PyApacheAtlas and its dependencies.
* You'll need a Service Principal (for Azure Data Catalog) with the Catalog Admin role.
* You'll need to set the following environment variables.

```
set TENANT_ID=YOUR_TENANT_ID
set CLIENT_ID=YOUR_SERVICE_PRINCIPAL_CLIENT_ID
set CLIENT_SECRET=YOUR_SERVICE_PRINCIPAL_CLIENT_SECRET
set ENDPOINT_URL=https://YOURCATALOGNAME.catalog.babylon.azure.com/api/atlas/v2
```
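The samples read these settings with `os.environ.get(NAME, "")`, so a missing variable quietly becomes an empty string and is likely to surface only later as a failed request. A minimal pre-flight check, assuming the four variable names above, might look like the following; `missing_settings` is a hypothetical helper and not part of PyApacheAtlas.

```python
import os

# The four settings the samples expect (names taken from the list above).
REQUIRED_VARS = ("TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "ENDPOINT_URL")


def missing_settings(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_settings()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

The `set NAME=value` syntax shown above is for the Windows command prompt; in bash-like shells the equivalent is `export NAME=value`.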

## Deleting Demo Entities and Types

* Entities must be deleted by GUID; types can be deleted by name.
* If you're following along with the built-in demos, search for 'pyapacheatlas' to find the majority of the entities.
* To find the guid, select the asset from your search and grab the guid from the URL.

```
# Delete an Entity
client.delete_entity(guid=["myguid1","myguid2"])
# Delete a Type Definition
client.delete_type(name="mytypename")
```
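If you are collecting several guids from asset URLs, a small helper can pull them out before passing them to `delete_entity`. The sketch below assumes the guid appears in the URL in the usual 8-4-4-4-12 hexadecimal form; the URL shown is a made-up example, and `guids_in` is a hypothetical helper, not part of the package.

```python
import re

# Assumption: guids look like standard UUIDs (8-4-4-4-12 hex digits).
GUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def guids_in(text):
    """Return every guid-shaped substring found in the given text."""
    return GUID_PATTERN.findall(text)


# Made-up URL purely for illustration.
url = "https://example.invalid/asset?guid=3f2b8c1e-5d47-4a9b-8e21-0c6d9f7a1b23"
print(guids_in(url))
```

The resulting list can then be passed as the `guid` argument shown in the delete example above.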
90 changes: 90 additions & 0 deletions samples/excel/excel_bulk_entities_upload.py
@@ -0,0 +1,90 @@
import json
import os

import openpyxl
from openpyxl import Workbook
from openpyxl import load_workbook

# PyApacheAtlas packages
# Connect to Atlas via a Service Principal
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import AtlasClient # Communicate with your Atlas server
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

def fill_in_workbook(file_path, excel_config):
    # You can safely ignore this function as it just
    # populates the excel spreadsheet.
    wb = load_workbook(file_path)
    bulkEntity_sheet = wb[excel_config.bulkEntity_sheet]

    # BULK Sheet SCHEMA
    # "typeName", "name", "qualifiedName", "classifications"
    # Adding a couple columns to show the power of this sheet
    # [Relationship] table, type
    entities_to_load = [
        ["DataSet", "exampledataset", "pyapacheatlas://dataset", None,
         None, None],
        ["hive_table", "hivetable01", "pyapacheatlas://hivetable01", None,
         None, None],
        ["hive_column", "columnA", "pyapacheatlas://hivetable01#colA", None,
         'pyapacheatlas://hivetable01', 'string'],
        ["hive_column", "columnB", "pyapacheatlas://hivetable01#colB", None,
         'pyapacheatlas://hivetable01', 'long'],
        ["hive_column", "columnC", "pyapacheatlas://hivetable01#colC", None,
         'pyapacheatlas://hivetable01', 'int']
    ]

    # Need to adjust the default header to include our extra attributes
    bulkEntity_sheet['E1'] = '[Relationship] table'
    bulkEntity_sheet['F1'] = 'type'

    # Populate the excel template with samples above
    table_row_counter = 0
    for row in bulkEntity_sheet.iter_rows(min_row=2, max_col=6,
                                          max_row=len(entities_to_load) + 1):
        for idx, cell in enumerate(row):
            cell.value = entities_to_load[table_row_counter][idx]
        table_row_counter += 1

    wb.save(file_path)


if __name__ == "__main__":
    """
    This sample provides an end to end sample of reading an excel file,
    generating a batch of entities, and then uploading the entities to
    your data catalog.
    """

    # Authenticate against your Atlas server
    oauth = ServicePrincipalAuthentication(
        tenant_id=os.environ.get("TENANT_ID", ""),
        client_id=os.environ.get("CLIENT_ID", ""),
        client_secret=os.environ.get("CLIENT_SECRET", "")
    )
    client = AtlasClient(
        endpoint_url=os.environ.get("ENDPOINT_URL", ""),
        authentication=oauth
    )

    # SETUP: This is just setting up the excel file for you
    file_path = "./demo_bulk_entities_upload.xlsx"
    excel_config = ExcelConfiguration()
    excel_reader = ExcelReader(excel_config)

    # Create an empty excel template to be populated
    excel_reader.make_template(file_path)
    # This is just a helper to fill in some demo data
    fill_in_workbook(file_path, excel_config)

    # ACTUAL WORK: This parses our excel file and creates a batch to upload
    entities = excel_reader.parse_bulk_entities(file_path)

    # This is what is getting sent to your Atlas server
    # print(json.dumps(entities,indent=2))

    results = client.upload_entities(entities)

    print(json.dumps(results,indent=2))

    print("Completed bulk upload successfully!\nSearch for hivetable01 to see your results.")
File renamed without changes.
128 changes: 128 additions & 0 deletions samples/excel/excel_custom_type_and_entity_upload.py
@@ -0,0 +1,128 @@
import json
import os

import openpyxl
from openpyxl import Workbook
from openpyxl import load_workbook

# PyApacheAtlas packages
# Connect to Atlas via a Service Principal
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import AtlasClient # Communicate with your Atlas server
from pyapacheatlas.readers import ExcelConfiguration, ExcelReader

from pyapacheatlas.core import TypeCategory

def fill_in_type_workbook(file_path, excel_config):
    # You can safely ignore this function as it just
    # populates the excel spreadsheet.
    wb = load_workbook(file_path)
    entityDef_sheet = wb[excel_config.entityDef_sheet]

    # ENTITYDEF Sheet SCHEMA
    # "Entity TypeName", "name", "description",
    # "isOptional", "isUnique", "defaultValue",
    # "typeName", "displayName", "valuesMinCount",
    # "valuesMaxCount", "cardinality", "includeInNotification",
    # "indexType", "isIndexable"
    attributes_to_load = [
        ["pyapacheatlas_custom_type", "fizz", "This will be the optional fizz attribute",
         None, None, None,
         None, None, None,
         None, None, None,
         None, None
         ],
        ["pyapacheatlas_custom_type", "buzz", "This will be the REQUIRED buzz attribute",
         False, None, None,
         None, None, None,
         None, None, None,
         None, None
         ],
    ]

    # Populate the excel template with samples above
    table_row_counter = 0
    for row in entityDef_sheet.iter_rows(min_row=2, max_col=6,
                                         max_row=len(attributes_to_load) + 1):
        for idx, cell in enumerate(row):
            cell.value = attributes_to_load[table_row_counter][idx]
        table_row_counter += 1

    wb.save(file_path)

def fill_in_entity_workbook(file_path, excel_config):
    # You can safely ignore this function as it just
    # populates the excel spreadsheet.
    wb = load_workbook(file_path)
    bulkEntity_sheet = wb[excel_config.bulkEntity_sheet]

    # BULK Sheet SCHEMA
    # "typeName", "name", "qualifiedName", "classifications"
    # Adding a couple columns to show the power of this sheet
    # fizz, buzz
    entities_to_load = [
        ["pyapacheatlas_custom_type", "custom_type_entity",
         "pyapacheatlas://example_from_custom_type", None,
         "abc", "123"
         ],
    ]

    # Need to adjust the default header to include our extra attributes
    bulkEntity_sheet['E1'] = 'fizz'
    bulkEntity_sheet['F1'] = 'buzz'

    # Populate the excel template with samples above
    table_row_counter = 0
    for row in bulkEntity_sheet.iter_rows(min_row=2, max_col=6,
                                          max_row=len(entities_to_load) + 1):
        for idx, cell in enumerate(row):
            cell.value = entities_to_load[table_row_counter][idx]
        table_row_counter += 1

    wb.save(file_path)


if __name__ == "__main__":
    """
    This sample provides an end to end sample of reading an excel file,
    creating a custom type and then uploading an entity of that custom type.
    """

    # Authenticate against your Atlas server
    oauth = ServicePrincipalAuthentication(
        tenant_id=os.environ.get("TENANT_ID", ""),
        client_id=os.environ.get("CLIENT_ID", ""),
        client_secret=os.environ.get("CLIENT_SECRET", "")
    )
    client = AtlasClient(
        endpoint_url=os.environ.get("ENDPOINT_URL", ""),
        authentication=oauth
    )

    # SETUP: This is just setting up the excel file for you
    file_path = "./demo_custom_type_and_entity_upload.xlsx"
    excel_config = ExcelConfiguration()
    excel_reader = ExcelReader(excel_config)

    # Create an empty excel template to be populated
    excel_reader.make_template(file_path)
    # This is just a helper to fill in some demo data
    fill_in_type_workbook(file_path, excel_config)
    fill_in_entity_workbook(file_path, excel_config)

    # ACTUAL WORK: This parses our excel file and creates a batch to upload
    typedefs = excel_reader.parse_entity_defs(file_path)
    entities = excel_reader.parse_bulk_entities(file_path)

    # This is what is getting sent to your Atlas server
    # print(json.dumps(typedefs,indent=2))
    # print(json.dumps(entities,indent=2))

    type_results = client.upload_typedefs(typedefs, force_update=True)
    entity_results = client.upload_entities(entities)

    print(json.dumps(type_results,indent=2))
    print("\n")
    print(json.dumps(entity_results,indent=2))

    print("Completed type and bulk upload successfully!\nSearch for custom_type_entity to see your results.")