
ELT with Mapping Dataflows – Practice exercises

Dataflow ELT for FoodDim and FoodNutrientsDim

Contents

ELT with Mapping Dataflows

Task 1: Create Data warehouse tables for SmartFoods in Azure SQLDB

Task 2: Practice exercise – Create Dataflow ELT for FoodDim

Task 3: Create Dataflow ELT for foodNutrientsDim

ELT with Mapping Dataflows

Create Data warehouse tables for SmartFoods in Azure SQLDB

If you already completed the previous exercise, you can skip creating the schema and tables and continue to creating the dataflow.

Here is the initial star schema we are building for SmartFoods DW. Later we will also introduce some aggregate tables for easier reporting.

Connect to your Azure SQL DB, either using the Query Editor in the Azure Portal or using SSMS, and create a schema for the SmartFoods DW and all the tables by running the following SQL script.

Note: You may need to add your Client IP Address to your SQL DB through “Set Firewall” page.

CREATE SCHEMA SmartFoodsDW;
GO
CREATE TABLE [SmartFoodsDW].[customerDim](
	[CustomerKey] [bigint],
	[LoyaltyNum] [nvarchar](max),
	[FirstName] [nvarchar](max) NULL,
	[LastName] [nvarchar](max) NULL,
	[City] [nvarchar](max) NULL,
	[State] [nvarchar](max) NULL,
	[Email] [nvarchar](max) NULL,
	[Address] [nvarchar](max) NULL,
	[PostCode] [nvarchar](max) NULL,
	[MemberSince] [date] NULL,
	[Dob] [date] NULL,
	[RecInsertDt] [date] NULL,
	[RecStartDt] [date] NULL,
	[RecEndDt] [date] NULL,
	[RecCurrInd] [bit] NULL,
	[sourceLineage] [nvarchar](max),
	[RecMd5Hash] [nvarchar](max) 
) ;
GO
CREATE TABLE [SmartFoodsDW].[foodDim](
	[sku] [nvarchar](max),
	[foodKey] [bigint],
	[desc] [nvarchar](max) NULL,
	[foodGroup] [nvarchar](max) NULL,
	[RecInsertDt] [date] NULL,
	[RecStartDt] [date] NULL,
	[RecEndDt] [date] NULL,
	[RecCurrInd] [bit] NULL,
	[sourceLineage] [nvarchar](max),
	[RecMd5Hash] [nvarchar](max) 
) ;
GO
CREATE TABLE [SmartFoodsDW].[foodNutDim](
	[foodKey] [bigint],
	[nutrientId] [nvarchar](max),
	[nutritionValue] [float] NULL,
	[desc] [nvarchar](max) NULL,
	[nutUnit] [nvarchar](60) NULL,
	[RecInsertDt] [date] NULL
);
GO

CREATE TABLE [SmartFoodsDW].[invoiceLineTxn](
	[invoiceNumber] [nvarchar](max),
	[lineNumber] [int],
	[foodKey] [bigint],
	[itemDesc] [nvarchar](max) NULL,
	[itemFoodGroup] [nvarchar](max) NULL,
	[uPrice] [float],
	[qty] [bigint],
	[gst] [float],
	[lineTotalGstExc] [float],
	[lineTotalGstInc] [float],
	[sourceLineage] [nvarchar](max),
	[recInsertDt] [date] 
);
GO
CREATE TABLE [SmartFoodsDW].[invoiceTxn](
	[invoiceNumber] [nvarchar](max),
	[loyaltyNum] [nvarchar](max) NULL,
	[CustomerKey] [bigint] NULL,
	[store] [nvarchar](max) NULL,
	[State] [nvarchar](max) NULL,
	[lineItemCount] [bigint],
	[invoiceTotalGSTinc] [float],
	[invoiceTotalGSTexc] [float],
	[InvoiceGst] [float],
	[timestamp] [datetime2](7),
	[sourceFileLineage] [nvarchar](max),
	[recInsertDt] [date] 
) ;
GO

Practice exercise – Create Dataflow ELT for FoodDim

Note: This is a practice exercise and as such the instructions are minimal. The dataflow follows the same pattern as the previous part. If you don’t feel confident about any part of this exercise, refer to the previous part for full explanations.

Below is the schema of foodDim table:

Sku* -> Natural key
Foodkey*
Desc*
Foodgroup*
RecInsertDt
RecStartDt
RecEndDt
RecCurrInd
sourceLineage
RecMd5Hash

Columns marked with * are available in source dataset and the rest will be generated by the ELT process.

  1. Create a new Mapping Dataflow and rename it to “SmartFoodFoodELT”

  2. Create two Dataflow parameters named “MaxSurrogateKey” and “BatchDt”

  3. Add a “Source” transform and rename it to “SmartFoodsFoodStagingBlob”

    1. Set it up using the screenshots as a guide

  4. Go to debug settings and fill in the parameters

  5. In the source transform run “Import projection”

  6. Add a “Select” transform and rename it to “FixColumnNames”

    1. Set it up using the screenshots as a guide:

  7. Add a “Derived column” transform to calculate the MD5 hash and rename it to “MD5Hash”

    1. Set it up using the screenshot as a guide. As a rough sketch, the derived column (assumed here to be named iRecMd5Hash so it stays distinct from the DW’s RecMd5Hash, with idesc and ifoodGroup as the renamed staging columns) hashes the concatenated descriptive attributes:

md5(concat(iif(isNull(idesc), '', toString(idesc)), iif(isNull(ifoodGroup), '', toString(ifoodGroup))))

  8. Add a second “Source” transform to connect to the DW table and rename it to “SmartFoodsFoodSQLDW”

    1. Set it up using the screenshot as a guide:

  9. Fill in the dataset parameters in debug settings

  10. Import the projection for the new data source

  11. Add a “Filter” transform after the DW source to filter active records only and rename it to “FilterCurrentRows”

    1. Set it up using the screenshots as a guide; an example filter condition is sketched below:
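As a rough guide (the exact setup is in the screenshots), the filter only needs to keep the rows that are currently active in the dimension:

RecCurrInd == true()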

  12. After the “MD5Hash” transform add a “Join” transform and rename it to “JoinStagingToDWDim”

    1. Set it up using the screenshots as a guide; a sketch of the join settings is below:
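A minimal sketch of the join, assuming FixColumnNames renamed the staging natural key to isku (adjust to your own column names): a left outer join from the MD5Hash stream to FilterCurrentRows, matching the staging key to the dimension’s natural key:

isku == sku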

  13. Add a “Conditional Split” transform and rename it to “SDC2Split”

    1. Set it up using the screenshot as a guide; example split conditions are sketched below:
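Purely as an illustration (the real conditions are in the screenshot), an SCD2 split typically sends staging rows that found no current dimension row to the “New” stream, and matched rows whose hash differs to the “Changed” stream. Column names follow the assumptions above (foodKey comes from the DW side of the join, iRecMd5Hash from the MD5Hash transform):

New stream:     isNull(foodKey)
Changed stream: iRecMd5Hash != RecMd5Hash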

  14. After the “Changed” stream add a “New branch” transform

  15. Add a “Select” transform after one of the branches and rename it to “SelectChangedUpdate”

    1. Set it up using the screenshot as a guide

  16. After “SelectChangedUpdate” add a “Derived column” transform and rename it to “UpdateRecsBatchColumns”

    1. Set it up; an example of the derived columns is sketched below:
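As an illustrative sketch only (assuming BatchDt is passed in as a date string, as in the previous part), the update branch closes off the existing dimension row at the batch date:

RecEndDt   : toDate($BatchDt)
RecCurrInd : false()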

  17. After the other stream of the “New branch” add a “Select” transform and rename it to “SelectChangedInsert”

    1. Set it up using the screenshot as a guide:

  18. After the “New” stream of the “Conditional Split” add a “Select” transform and rename it to “SelectNewInsert”

    1. Set it up using the screenshot as a guide:

  19. Add a “Union” transform after “SelectNewInsert” and rename it to “AllInserts”

    1. Set it up using the screenshot as a guide:

  20. Add a “Surrogate Key” transform and rename it to “SurrogateKey”

    1. Set it up using the screenshot as a guide:

  21. Add a “Derived Column” transform to update the surrogate key with the maximum SK and rename it to “AddMaxFoodKey”

    1. Set it up using the screenshot as a guide; an example expression is sketched below:
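The “Surrogate Key” transform numbers rows from 1, so this derived column offsets the new keys by the highest key already in the dimension. A minimal sketch, assuming MaxSurrogateKey is an integer parameter (wrap it in toLong() if you defined it as a string):

foodKey : foodKey + $MaxSurrogateKey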

  22. Add another “Derived column” transform to generate the batch columns and rename it to “InsertRecsBatchColumns”

    1. Set it up using the screenshot as a guide; an example of the batch columns is sketched below:
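An illustrative set of expressions for the batch columns on brand-new rows, following the SCD2 pattern from the previous part (an open-ended RecEndDt and the current-row flag set):

RecInsertDt : toDate($BatchDt)
RecStartDt  : toDate($BatchDt)
RecEndDt    : toDate('9999-12-31')
RecCurrInd  : true()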

  23. Add a “Union” transform to put updates and inserts together and rename it to “UnionInsertUpdates”

    1. Set it up using the screenshot as a guide:

  24. Add an “Alter Row” transform to mark each row with the type of DB action and rename it to “MarkRow”

    1. Set it up using the screenshot as a guide; an example of the alter-row policies is sketched below:
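A hedged example of the alter-row policies, assuming the batch-column expressions sketched above, so that rows from the insert branch arrive with RecCurrInd = true() and rows from the update branch with RecCurrInd = false():

Insert if : RecCurrInd == true()
Update if : RecCurrInd == false()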

  25. Finally, add a “Sink” transform and rename it to “FoodDim”

    1. Set it up using the screenshots as a guide:

  26. Your flow should now look like this:

  27. Validate and publish your flow.

Create Dataflow ELT for foodNutrientsDim

The next dimension we need to create is for food nutrients. For practice purposes we are building this dimension as a slowly changing dimension type 1 (this method overwrites old data with new data and therefore does not track history). As this method overwrites the data, the ELT dataflow is much simpler.

The other thing to note is that we are combining two source files to build this dimension: 1. Food_nut.csv and 2. NutDim.csv. As the relationship between food and nutrients is many-to-many, the source system provided the data in full 3rd normal form, which needs to be transformed into the dimensional star schema.

The schema for FoodNutDim is as below:

foodKey -> SK from foodDim
nutrientId*
nutritionValue*
Desc*
nutUnit*
RecInsertDt

Columns marked with * are available in source dataset and the rest will be generated by the ELT process.

  1. Create a new Mapping dataflow and rename it to “SmartFoodFoodNutritionELT”

  2. Create a parameter named “BatchDt”

  3. Add a “Source” transform and rename it to “SmartFoodsFoodNutrientsStagingBlob”

    1. Set it up using the screenshot as a guide:

  4. Add a second “Source” transform and rename it to “SmartFoodsNutrientsStaging”

    1. Set it up using the screenshot as a guide:

  5. Add a third “Source” transform and rename it to “SmartFoodsFoodSQLDW”

    1. Set it up using the screenshot as a guide:

  6. Go to Debug settings and fill in all dataset parameters:

  7. Run “Import projection” on all three “Source” transforms:

Make sure the data types match the screenshots below.

  8. After the “SmartFoodsFoodSQLDW” transform add a “Filter” transform and rename it to “CurrentRecsOnly”

    1. Set it up using the screenshot as a guide:

  9. After the “SmartFoodsFoodNutrientsStagingBlob” transform add a “Filter” transform and rename it to “FilterZeroNutrients”

The source dataset has a record for every nutrient for every food, even if the food does not contain that nutrient. As such we need to remove the nutrients with a value of zero from our dataset.

    1. Set it up using the screenshot as a guide; an example filter condition is sketched below:
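For example, assuming the value column in Food_nut.csv is named nutritionValue (check the actual header after Import projection), the filter condition could be:

nutritionValue != 0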

  10. Add a “Select” transform and rename it to “FixColumnNames”

    1. Set it up using the screenshot as a guide:

  11. Add a “Join” transform and rename it to “JoinStagingToDWDim”

  12. Add another “Join” transform and rename it to “JoinNutrients”.

The right-hand side of this join is “SmartFoodsNutrientsStaging”; a sketch of both join conditions is below.
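A rough sketch of the two join conditions, assuming FixColumnNames keeps the staging food key as sku and the nutrient key as nutrientId (adjust to your own column names):

JoinStagingToDWDim : staging sku == foodDim sku                  (picks up foodKey)
JoinNutrients      : staging nutrientId == NutDim nutrientId     (picks up desc and nutUnit)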

  13. Add a “Select” transform and rename it to “removeExtraCols”

  14. Add a “Derived column” transform and rename it to “InsertRecsBatchColumns” (an example is sketched below)
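Since this is an SCD1 truncate-and-load, only the load date is needed, for example:

RecInsertDt : toDate($BatchDt)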

  15. Add a “Sink” transform and rename it to “DBSink”

Note 1: As this is a truncate-and-load pattern, we change the table action to “Truncate table”.

Note 2: The Sink transform in this dataflow only performs inserts, so there is NO NEED for an “Alter Row” transform.

  16. Your final dataflow should look like this: