Excel Samples for All Features (wjohnson#53)

All tabs / features of the excel reader now have standalone samples that are reproducible / re-runnable (i.e. for the type defs, we flip the force_update flag to be True). I demonstrate the following features:

* Complex table and column level lineage in the Hive Bridge Style.
* Bulk Entity uploads, including the `[Relationship]` feature to relate columns with a given table.
* Entity Type Definitions and upload of an entity using that custom type.
* Updating Lineage between two existing assets by creating a linking 'Process' entity.

In addition, updated the README to reflect all of the features of the package.
MPRS-Labs · Nov 6, 2020 · 0860421
1 parent 129d1ee
commit 0860421
Showing 8 changed files with 400 additions and 37 deletions.
diff --git a/README.md b/README.md
@@ -3,20 +3,27 @@
 A python package to work with the Apache Atlas API and support bulk loading from different file types.
 
 The package currently supports:
-* Creating a column lineage scaffolding as in the [Hive Bridge style](https://atlas.apache.org/0.8.3/Bridge-Hive.html).
-* Creating and reading from an excel template file
-* From Excel, constructing the defined entities and column lineages.
-   * Table entities
-   * Column entities
-   * Table lineage processes
-   * Column lineage processes
-* Supports Azure Data Catalog ColumnMapping Attributes.
+* Bulk upload of entities.
+* Bulk upload of type definitions.
+* Creating custom lineage between two existing entities.
+* Creating custom table and complex column level lineage in the [Hive Bridge style](https://atlas.apache.org/0.8.3/Bridge-Hive.html).
+  * Supports Azure Data Catalog ColumnMapping Attributes.
+* Creating a column lineage scaffolding as in the Hive Bridge Style.
 * Performing "What-If" analysis to check if...
    * Your entities are valid types.
    * Your entities are missing required attributes.
    * Your entities are using undefined attributes.
+* Working with the glossary.
+  * Uploading terms.
+  * Downloading individual or all terms.
+* Working with relationships.
+  * Able to create arbitrary relationships between entities.
+  * e.g. associating a given column with a table.
+  * Able to upload relationship definitions.
+* Deleting types (by name) or entities (by guid).
+* Search (only for Azure Data Catalog advanced search).
 * Authentication to Azure Data Catalog via Service Principal.
-* Authentication using basic authentication of username and password.
+* Authentication using basic authentication of username and password for open source Atlas.
 
 ## Quickstart
 
@@ -27,7 +34,7 @@ Create a wheel distribution file and install it in your environment.
 ```
 python -m pip install wheel
 python setup.py bdist_wheel
-python -m pip install ./dist/pyapacheatlas-0.0b18-py3-none-any.whl
+python -m pip install ./dist/pyapacheatlas-0.0b19-py3-none-any.whl
 ```
 
 ### Create a Client Connection
@@ -81,35 +88,19 @@ upload_results = client.upload_entities([ae.to_json()])
 
 ### Create Entities from Excel
 
-Read from a standardized excel template to create table, column, table process, and column lineage entities.  Follows / Requires the hive bridge style of column lineages.
+Read from a standardized excel template that supports...
 
-```
-from pyapacheatlas.core import TypeCategory
-from pyapacheatlas.scaffolding import column_lineage_scaffold
-from pyapacheatlas.readers import ExcelConfiguration, ExcelReader
-
-file_path = "./atlas_excel_template.xlsx"
-# Create the Excel Template
-ExcelReader.make_template(file_path)
-
-# Populate the excel file manually!
+* Bulk uploading entities into your data catalog.
+* Creating custom table and column level lineage.
+* Creating custom type definitions for datasets.
+* Creating custom lineage between existing assets / entities in your data catalog.
 
-# Generate the base atlas type defs
-all_type_defs = client.get_typedefs(TypeCategory.ENTITY)
+See end to end samples for each scenario in the [excel samples](./samples/excel/README.md).
 
-# Create objects for 
-ec = ExcelConfiguration()
-excel_reader = ExcelReader(ec)
-# Read from excel file and convert to 
-entities = excel_reader.parse_lineage(file_path, all_type_defs)
-
-upload_results = client.upload_entities(entities)
-
-print(json.dumps(upload,results,indent=1))
-```
+Learn more about the Excel [features and configuration in the wiki](https://github.com/wjohnson/pyapacheatlas/wiki/Excel-Template-and-Configuration).
 
 ## Additional Resources
 
-* Learn more about this package in the github wiki.
+* Learn more about this package in the [github wiki](https://github.com/wjohnson/pyapacheatlas/wiki/Excel-Template-and-Configuration).
 * The [Apache Atlas client in Python](https://pypi.org/project/pyatlasclient/)
 * The [Apache Atlas REST API](http://atlas.apache.org/api/v2/)
diff --git a/pyapacheatlas/__init__.py b/pyapacheatlas/__init__.py
@@ -1 +1 @@
-__version__ = "0.0b18"
+__version__ = "0.0b19"
diff --git a/pyapacheatlas/core/client.py b/pyapacheatlas/core/client.py
@@ -90,8 +90,12 @@ def delete_type(self, name):
             atlas_endpoint,
             headers=self.authentication.get_authentication_headers())
 
-        results = self._handle_response(deleteType)
-
+        try:
+            deleteType.raise_for_status()
+        except requests.RequestException:
+            raise Exception(deleteType.text)
+
+        results = {"message": f"successfully deleted {name}"}
         return results
 
     def get_entity(self, guid=None, qualifiedName=None, typeName=None):
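
The `delete_type` change above replaces JSON response handling with a plain status check and a fixed success message. A minimal sketch of that pattern follows, written so it runs without a network or the `requests` package. `StubResponse` and `StubHTTPError` are hypothetical stand-ins for a `requests` response and `requests.RequestException`, and `summarize_delete` is an illustrative name that is not part of pyapacheatlas.

```python
class StubHTTPError(Exception):
    """Hypothetical stand-in for requests.RequestException."""


class StubResponse:
    """Hypothetical stand-in for a requests.Response object."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self):
        # Mirror the requests behaviour: raise on 4xx / 5xx status codes.
        if self.status_code >= 400:
            raise StubHTTPError(f"{self.status_code} error")


def summarize_delete(response, name):
    """Return a success message, or raise with the server's error text."""
    try:
        response.raise_for_status()
    except StubHTTPError as err:
        # Surface the response body, keeping the original error as the cause.
        raise RuntimeError(response.text) from err
    return {"message": f"successfully deleted {name}"}


if __name__ == "__main__":
    print(summarize_delete(StubResponse(204), "mytypename"))
```

Chaining with `from err` keeps the original HTTP error visible in the traceback, which the bare `raise Exception(...)` in the diff does not do. Whether that is worth adopting in the client is a design choice.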

diff --git a/samples/excel/README.md b/samples/excel/README.md
@@ -0,0 +1,54 @@
+# Excel Samples for PyApacheAtlas
+
+There are four key features of the PyApacheAtlas package with respect to the Excel frontend.
+
+## Features and Samples
+* **Bulk upload entities**
+  * [Bulk Entities Excel Sample](./excel_bulk_entities_upload.py)
+  * You want to dump entity information into excel and upload.
+  * You want to provide some simple relationship mapping (e.g. the columns of a table).
+  * Your entities may exist and you want them updated or they do not exist and you want them created.
+* **Hive Bridge Style Table and Column Lineage**
+  * [Custom Table and Column Lineage Excel Sample](./excel_custom_table_column_lineage.py)
+  * You are willing to use a custom type to capture more data about lineage.
+  * You are interested in capturing more complex column level lineage.
+  * None of the entities you want to upload exist in your catalog.
+* **Creating Custom DataSet Types**
+  * [Custom Type Excel Sample](./excel_custom_type_and_entity_upload.py)
+  * You have a custom dataset type you want to create with many attributes.
+  * You want to upload an entity using that custom type as well.
+* **Create Lineage Between Two Existing Entities**
+  * [Update / Create Lineage Between Existing Entities](./excel_update_lineage_upload.py)
+  * You have two existing entities (an input and output) but there is no lineage between them.
+  * You want to create a "process entity" that represents the process that ties the two tables together.
+
+Each sample linked above is standalone and creates an excel spreadsheet with all of the data to be uploaded. It then parses that spreadsheet and uploads the results to your data catalog.
+
+I strongly encourage you to run these samples in a dev / sandbox environment, since finding and deleting the created entities afterwards is tedious.
+
+## Requirements
+
+* Follow the steps in the main README to install the latest version of PyApacheAtlas and its dependencies.
+* You'll need a Service Principal (for Azure Data Catalog) with the Catalog Admin role.
+* You'll need to set the following environment variables.
+
+```
+set TENANT_ID=YOUR_TENANT_ID
+set CLIENT_ID=YOUR_SERVICE_PRINCIPAL_CLIENT_ID
+set CLIENT_SECRET=YOUR_SERVICE_PRINCIPAL_CLIENT_SECRET
+set ENDPOINT_URL=https://YOURCATALOGNAME.catalog.babylon.azure.com/api/atlas/v2
+```
+
+## Deleting Demo Entities and Types
+
+* Entities must be deleted by GUID; types can be deleted by name.
+* If you're following along with the built-in demos, search for 'pyapacheatlas' to find the majority of the entities.
+* To find the guid, select the asset from your search and grab the guid from the URL.
+
+```
+# Delete an Entity
+client.delete_entity(guid=["myguid1","myguid2"])
+
+# Delete a Type Definition
+client.delete_type(name="mytypename")
+```
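
The samples read the four environment variables listed in the requirements with `os.environ.get(name, "")`, so a missing variable silently becomes an empty string. A small pre-flight check along these lines can catch that before any request is made. This is only a sketch: `REQUIRED_VARS` and `missing_settings` are hypothetical names and are not part of pyapacheatlas.

```python
import os

# The variable names come from the requirements list in the samples README.
REQUIRED_VARS = ("TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "ENDPOINT_URL")


def missing_settings(environ=None):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```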
diff --git a/samples/excel/excel_bulk_entities_upload.py b/samples/excel/excel_bulk_entities_upload.py
@@ -0,0 +1,90 @@
+import json
+import os
+
+import openpyxl
+from openpyxl import Workbook
+from openpyxl import load_workbook
+
+# PyApacheAtlas packages
+# Connect to Atlas via a Service Principal
+from pyapacheatlas.auth import ServicePrincipalAuthentication
+from pyapacheatlas.core import AtlasClient  # Communicate with your Atlas server
+from pyapacheatlas.readers import ExcelConfiguration, ExcelReader
+
+def fill_in_workbook(file_path, excel_config):
+    # You can safely ignore this function as it just
+    # populates the excel spreadsheet.
+    wb = load_workbook(file_path)
+    bulkEntity_sheet = wb[excel_config.bulkEntity_sheet]
+
+    # BULK Sheet SCHEMA
+    #"typeName", "name", "qualifiedName", "classifications"
+    # Adding a couple columns to show the power of this sheet
+    # [Relationship] table, type
+    entities_to_load = [
+        ["DataSet", "exampledataset", "pyapacheatlas://dataset", None,
+            None, None],
+        ["hive_table", "hivetable01", "pyapacheatlas://hivetable01", None,
+            None, None],
+        ["hive_column", "columnA", "pyapacheatlas://hivetable01#colA", None,
+            'pyapacheatlas://hivetable01', 'string'],
+        ["hive_column", "columnB", "pyapacheatlas://hivetable01#colB", None,
+            'pyapacheatlas://hivetable01', 'long'],
+        ["hive_column", "columnC", "pyapacheatlas://hivetable01#colC", None,
+            'pyapacheatlas://hivetable01', 'int']
+    ]
+
+    # Need to adjust the default header to include our extra attributes
+    bulkEntity_sheet['E1'] = '[Relationship] table'
+    bulkEntity_sheet['F1'] = 'type'
+
+    # Populate the excel template with samples above
+    table_row_counter = 0
+    for row in bulkEntity_sheet.iter_rows(min_row=2, max_col=6,
+                                     max_row=len(entities_to_load) + 1):
+        for idx, cell in enumerate(row):
+            cell.value = entities_to_load[table_row_counter][idx]
+        table_row_counter += 1
+
+    wb.save(file_path)
+
+
+if __name__ == "__main__":
+    """
+    This is an end to end sample of reading an excel file,
+    generating a batch of entities, and then uploading the entities to
+    your data catalog.
+    """
+
+    # Authenticate against your Atlas server
+    oauth = ServicePrincipalAuthentication(
+        tenant_id=os.environ.get("TENANT_ID", ""),
+        client_id=os.environ.get("CLIENT_ID", ""),
+        client_secret=os.environ.get("CLIENT_SECRET", "")
+    )
+    client = AtlasClient(
+        endpoint_url=os.environ.get("ENDPOINT_URL", ""),
+        authentication=oauth
+    )
+
+    # SETUP: This is just setting up the excel file for you
+    file_path = "./demo_bulk_entities_upload.xlsx"
+    excel_config = ExcelConfiguration()
+    excel_reader = ExcelReader(excel_config)
+
+    # Create an empty excel template to be populated
+    excel_reader.make_template(file_path)
+    # This is just a helper to fill in some demo data
+    fill_in_workbook(file_path, excel_config)
+
+    # ACTUAL WORK: This parses our excel file and creates a batch to upload
+    entities = excel_reader.parse_bulk_entities(file_path)
+
+    # This is what is getting sent to your Atlas server
+    # print(json.dumps(entities,indent=2))
+
+    results = client.upload_entities(entities)
+
+    print(json.dumps(results,indent=2))
+
+    print("Completed bulk upload successfully!\nSearch for hivetable01 to see your results.")
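
To check the demo data before it is written to the spreadsheet, each row can be paired with the sheet header. The sketch below uses the header and two of the rows from `fill_in_workbook`. It is only a way to eyeball the data and does not reproduce how `ExcelReader.parse_bulk_entities` interprets the sheet. `preview_rows` is a hypothetical helper name.

```python
# Default bulk sheet columns plus the two extra columns the sample adds.
HEADER = ["typeName", "name", "qualifiedName", "classifications",
          "[Relationship] table", "type"]

# Two rows taken from the sample's entities_to_load list.
ROWS = [
    ["hive_table", "hivetable01", "pyapacheatlas://hivetable01", None,
     None, None],
    ["hive_column", "columnA", "pyapacheatlas://hivetable01#colA", None,
     "pyapacheatlas://hivetable01", "string"],
]


def preview_rows(header, rows):
    """Pair each cell with its column header, skipping empty cells."""
    return [
        {column: value for column, value in zip(header, row) if value is not None}
        for row in rows
    ]


if __name__ == "__main__":
    for record in preview_rows(HEADER, ROWS):
        print(record)
```

In the second row, the `[Relationship] table` cell holds the qualified name of `hivetable01`, which is how the sample ties a column to its table.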
diff --git a/samples/end_to_end_excel_sample.py → ...xcel/excel_custom_table_column_lineage.py b/samples/end_to_end_excel_sample.py → ...xcel/excel_custom_table_column_lineage.py
diff --git a/samples/excel/excel_custom_type_and_entity_upload.py b/samples/excel/excel_custom_type_and_entity_upload.py
@@ -0,0 +1,128 @@
+import json
+import os
+
+import openpyxl
+from openpyxl import Workbook
+from openpyxl import load_workbook
+
+# PyApacheAtlas packages
+# Connect to Atlas via a Service Principal
+from pyapacheatlas.auth import ServicePrincipalAuthentication
+from pyapacheatlas.core import AtlasClient  # Communicate with your Atlas server
+from pyapacheatlas.readers import ExcelConfiguration, ExcelReader
+
+from pyapacheatlas.core import TypeCategory
+
+def fill_in_type_workbook(file_path, excel_config):
+    # You can safely ignore this function as it just
+    # populates the excel spreadsheet.
+    wb = load_workbook(file_path)
+    entityDef_sheet = wb[excel_config.entityDef_sheet]
+
+    # ENTITYDEF Sheet SCHEMA
+    # "Entity TypeName", "name", "description",
+    # "isOptional", "isUnique", "defaultValue",
+    # "typeName", "displayName", "valuesMinCount",
+    # "valuesMaxCount", "cardinality", "includeInNotification",
+    # "indexType", "isIndexable"
+    attributes_to_load = [
+        ["pyapacheatlas_custom_type", "fizz", "This will be the optional fizz attribute",
+        None, None, None,
+        None, None, None,
+        None, None, None,
+        None, None
+        ],
+        ["pyapacheatlas_custom_type", "buzz", "This will be the REQUIRED buzz attribute",
+        False, None, None,
+        None, None, None,
+        None, None, None,
+        None, None
+        ],
+    ]
+
+    # Populate the excel template with samples above
+    table_row_counter = 0
+    for row in entityDef_sheet.iter_rows(min_row=2, max_col=6,
+                                     max_row=len(attributes_to_load) + 1):
+        for idx, cell in enumerate(row):
+            cell.value = attributes_to_load[table_row_counter][idx]
+        table_row_counter += 1
+
+    wb.save(file_path)
+
+def fill_in_entity_workbook(file_path, excel_config):
+    # You can safely ignore this function as it just
+    # populates the excel spreadsheet.
+    wb = load_workbook(file_path)
+    bulkEntity_sheet = wb[excel_config.bulkEntity_sheet]
+
+    # BULK Sheet SCHEMA
+    #"typeName", "name", "qualifiedName", "classifications"
+    # Adding a couple columns to show the power of this sheet
+    # fizz, buzz
+    entities_to_load = [
+        ["pyapacheatlas_custom_type", "custom_type_entity", 
+        "pyapacheatlas://example_from_custom_type", None,
+        "abc", "123"
+        ],
+    ]
+
+    # Need to adjust the default header to include our extra attributes
+    bulkEntity_sheet['E1'] = 'fizz'
+    bulkEntity_sheet['F1'] = 'buzz'
+
+    # Populate the excel template with samples above
+    table_row_counter = 0
+    for row in bulkEntity_sheet.iter_rows(min_row=2, max_col=6,
+                                     max_row=len(entities_to_load) + 1):
+        for idx, cell in enumerate(row):
+            cell.value = entities_to_load[table_row_counter][idx]
+        table_row_counter += 1
+
+    wb.save(file_path)
+
+
+if __name__ == "__main__":
+    """
+    This is an end to end sample of reading an excel file,
+    creating a custom type and then uploading an entity of that custom type.
+    """
+
+    # Authenticate against your Atlas server
+    oauth = ServicePrincipalAuthentication(
+        tenant_id=os.environ.get("TENANT_ID", ""),
+        client_id=os.environ.get("CLIENT_ID", ""),
+        client_secret=os.environ.get("CLIENT_SECRET", "")
+    )
+    client = AtlasClient(
+        endpoint_url=os.environ.get("ENDPOINT_URL", ""),
+        authentication=oauth
+    )
+
+    # SETUP: This is just setting up the excel file for you
+    file_path = "./demo_custom_type_and_entity_upload.xlsx"
+    excel_config = ExcelConfiguration()
+    excel_reader = ExcelReader(excel_config)
+
+    # Create an empty excel template to be populated
+    excel_reader.make_template(file_path)
+    # This is just a helper to fill in some demo data
+    fill_in_type_workbook(file_path, excel_config)
+    fill_in_entity_workbook(file_path, excel_config)
+
+    # ACTUAL WORK: This parses our excel file and creates a batch to upload
+    typedefs = excel_reader.parse_entity_defs(file_path)
+    entities = excel_reader.parse_bulk_entities(file_path)
+
+    ## This is what is getting sent to your Atlas server
+    # print(json.dumps(typedefs,indent=2))
+    # print(json.dumps(entities,indent=2))
+
+    type_results = client.upload_typedefs(typedefs, force_update=True)
+    entity_results = client.upload_entities(entities)
+
+    print(json.dumps(type_results,indent=2))
+    print("\n")
+    print(json.dumps(entity_results,indent=2))
+
+    print("Completed type and bulk upload successfully!\nSearch for custom_type_entity to see your results.")