
Technical Tips for a Successful Azure Data & Analytics Implementation

4/27/2019

This is my second attempt to articulate lessons learned from a recent global Azure implementation for data and analytics.  My first blog post ended up centered on project management; you can find those project management tips here.  While writing it, it became apparent to me that the tool stack isn't the most important thing.  I'm pretty hard-core Microsoft, but in the end, success was determined by how well we coached and trained the team -- not what field we played on.

Turning my attention in this blog post to technical success points, please allow me to start off by saying that with over twenty countries dropping data into a shared Azure Data Lake, my use of "global" (above) is no exaggeration.  I am truly not making this all up by compiling theories from Microsoft Docs.  Second, the most frequent question people ask me is "what tools did you use?", because migrating from on-prem SSIS or Informatica to the cloud can feel like jumping off the high dive at the community pool for the first time.  Consequently, I'm going to provide the tool stack list right out of the gate.  You can find a supporting diagram for this data architecture here.

Microsoft Azure Tool Stack for Data & Analytics
Hold on, your eyes have already skipped to the list, but before you make an "every man for himself" move and bolt out of the Microsoft community pool, please read my migration blog post.  There are options!  For example, I just stood up an Azure for Data & Analytics solution that has no Logic Apps, Azure Functions, Event Hubs, blob storage, Databricks, HDInsight, or data lake.  The solution is not event-driven and takes an ELT (extract, load, and then transform) approach.  It reads from sources via Azure Data Factory, writes to an Azure SQL Database, and logs the ELT activities in an Azure SQL Database as well.  Now, how simple is that?  Kindly note, you don't have to build the Taj MaSolution to be successful.  You do have to fully understand your customer's reporting and analysis requirements, and who will be maintaining the solution on a long-term basis.

If you still wish to swan dive into the pool, here's the list!
Disclaimer: This blog post will not address Analysis Services or Power BI as these are about data delivery and my focus today is data ingestion.

Technical Critical Success Points (CSP)
Every single line item above has CSPs, but how long do you really want to hang out here reading?  I'm with you!  Consequently, here are my top three CSP areas.

Azure Data Factory (ADF)
  1. Parameterize the living bejeebers out of everything.  Assuming an ELT methodology, have one dataset for every source, one dataset for every destination, and one pipeline for every destination.  Even better, have one linked service for each type (SQL Server, SFTP, Cosmos DB, etc.).  Reality check: this is actually impossible as not everything in ADF can be parameterized, but get as close as you can!
  2. Understand volume (file size and number) vs velocity (frequency and speed) of your files.  This will help you make the ADF Data Flow vs Databricks vs HDI decision.
  3. Incorporate logging right at the start.  Log every pipeline start, success, and failure.  Also log the success or failure of every data copy.  I've gone overboard on this and logged the success or failure of every lookup and stored procedure activity.  I encourage you to find your own balance.
  4. Use ADF for your data lake data validation QA process.  Minimally, let's assume you are verifying source-to-target row counts after every ingestion.  Put this in an ADF pipeline.
  5. Use ADF for orchestration, but don't smash everything into one pipeline -- there is a 40-activity limit which you will quickly exceed, especially with a good logging methodology.  It also becomes unmanageable to troubleshoot multi-process pipelines.  Have a grandparent orchestrator (like an SSIS master package) which calls subject or process area orchestrators, which in turn call copy pipelines.  Your DevOps team will thank you profusely!
  6. Use a Cosmos DB document, SQL Server table, or TXT file for environment properties.  Remember environments in SSIS?  It is pretty much the same idea.  There should be no hard-coded URLs, blob storage or data lake locations, SQL database names, etc. in ADF parameters.  Any parameter value that changes between DEV, STG, UAT, and PRD should come from the environment lookup activity.  There should never be a need for a global find and replace at deployment time.  I've attached a sample environment properties JSON document at the bottom of this post; a minimal sketch of the lookup idea follows this list.
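To make the environment properties idea concrete, here is a minimal Python sketch.  The file name, property names, and values are illustrative assumptions of mine, not the schema of the attached sample document, and in ADF itself this would be a Lookup activity reading the same document rather than Python code.

```python
# Minimal sketch only -- file name, property names, and values below are
# illustrative assumptions, not the schema of the attached sample document.
import json

def load_environment_properties(path: str, environment: str) -> dict:
    """Return the property block for one environment (DEV, STG, UAT or PRD)."""
    with open(path, "r", encoding="utf-8") as f:
        all_environments = json.load(f)
    return all_environments[environment]

if __name__ == "__main__":
    # One block per environment; the pipelines themselves hold no hard-coded
    # URLs, storage locations, or database names.
    sample = {
        "DEV": {
            "dataLakeUrl": "https://mylakedev.dfs.core.windows.net",
            "sqlServerName": "sql-dev.database.windows.net",
            "sqlDatabaseName": "dw_dev"
        },
        "PRD": {
            "dataLakeUrl": "https://mylakeprd.dfs.core.windows.net",
            "sqlServerName": "sql-prd.database.windows.net",
            "sqlDatabaseName": "dw_prd"
        }
    }
    with open("environmentproperties.json", "w", encoding="utf-8") as f:
        json.dump(sample, f, indent=2)

    props = load_environment_properties("environmentproperties.json", "DEV")
    print(props["sqlServerName"])
```

Because every environment-specific value comes from this one document, promoting from DEV to PRD never requires a find and replace across pipelines.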

Azure Data Lake (ADL)
  1. Plan your security methodology before you load your first file.  Map it out in Excel or your tool of choice.  Data lake security cannot be an afterthought.  If you skip this, there is a high probability that all of your pipelines will need rework.
  2. Don't leave a bunch of junk in your trunk!  Clean up.  Tomorrow never comes -- oddly, it is always tomorrow.
  3. Assuming your data lake represents your persisted storage for reporting, use Azure Blob Storage (ABS) for temporary staging and pre-processing.  This keeps your ingestion work area separate from your long-term storage.  ABS is cheap, and a lifecycle management policy will automatically clean up temporary work areas.
  4. If at all possible, divide your data lake into two areas: deltas and current.  For each file ingestion, deltas hold one file for adds, one file for updates, and one file for deletes.  This is like file-level change data capture.  "Current" is what the source system looks like right now.  There is no history in "current"; instead, "current" is created by rolling up all of the deltas.  Because of that ingestion dependency, you only need one QA process, on "current".  (A sketch of rolling deltas into "current" follows this list.)
  5. Use parquet files for data scientists as they will know how to use them and will want point-in-time analysis (deltas).  Provide *.CSV (current snapshot) files for non-technical users.  Be honest with me here, have you ever known a non-technical self-service user to NOT want to open a data lake file in Excel?  Point made.
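Here is a minimal Python sketch of rolling delta files up into "current".  It assumes CSV deltas and a business key column named "id" -- both of which are my illustrative assumptions -- and in the actual solution this work would land in ADF, Databricks, or HDI rather than standalone Python.

```python
# Minimal sketch -- CSV deltas and an "id" business key are illustrative assumptions.
import csv

def apply_deltas(current: dict, adds_path: str, updates_path: str, deletes_path: str) -> dict:
    """Apply one ingestion's add/update/delete files to the current snapshot."""
    with open(adds_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            current[row["id"]] = row              # brand new rows
    with open(updates_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            current[row["id"]] = row              # changed rows overwrite
    with open(deletes_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            current.pop(row["id"], None)          # removed rows drop out
    return current

# Replaying every delta set in ingestion order rebuilds "current" from scratch,
# which is why one QA process on "current" also vouches for the deltas.
```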

Data-Driven Ingestion Methodology
  1. A data-driven ingestion methodology derives all of its source, destination, transform, and load information from a file, table or document.  (If you have worked with BIML for SSIS, this is a very similar concept.)   The properties in the metadata object become ADF parameter values, are used by Databricks or HDI for file ingestion, and are used to create correct data types in SQL Server, just to name a few uses.  Metadata runs the whole show.  I have attached a sample file metadata JSON document at the bottom of this post.
  2. Auto generate the metadata.  Either some poor soul has to type all of this out, or a Visual Studio Python project can read each source file and spit out the metadata needed for each environment.  The file metadata document I've supplied below was auto generated...along with hundreds of others.  Can you even imagine creating this by hand when one source table has over a hundred columns?  (A minimal sketch of this approach follows this list.)
  3. If your transform and load follow a repeatable process, you can also auto generate the transform views and load stored procedures needed by an Azure SQL Data Warehouse (ADW), where CTAS() is your bff (best friend forever).  This will require an extremely accurate source-to-target mapping document.  I would consider the auto-gen process successful if it correctly turns out 90% of what is needed for an ADW load.
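For illustration, here is a minimal Python sketch of the auto-generation idea: read a source CSV, sample some rows, infer a rough data type per column, and write out a metadata document.  The type rules, property names, and file names are illustrative assumptions and are far simpler than what a production metadata generator would need.

```python
# Minimal sketch -- type rules, property names, and file names are illustrative assumptions.
import csv
import json

def infer_type(values) -> str:
    """Pick a rough SQL-ish type that fits every sampled value."""
    try:
        for v in values:
            int(v)
        return "INT"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "FLOAT"
    except ValueError:
        return "NVARCHAR(255)"

def generate_metadata(source_path: str, destination_table: str) -> dict:
    """Read a source CSV and emit a data-driven metadata document."""
    with open(source_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        sample_rows = [row for _, row in zip(range(100), reader)]  # sample up to 100 rows
    columns = []
    for i, name in enumerate(header):
        values = [row[i] for row in sample_rows if i < len(row) and row[i] != ""]
        columns.append({"name": name, "dataType": infer_type(values) if values else "NVARCHAR(255)"})
    return {"sourceFile": source_path, "destinationTable": destination_table, "columns": columns}

if __name__ == "__main__":
    # "customers.csv" and "stg.Customers" are placeholder names for this sketch.
    metadata = generate_metadata("customers.csv", "stg.Customers")
    with open("data_driven_metadata.json", "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
```

Run something like this over every source file and you have the hundreds of metadata documents that the pipelines, notebooks, and SQL DDL can all be driven from.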

Wrapping It Up
Supporting Applications

It is only honest to share three supporting applications that help to make all of this possible.
Link to creating a Python project in Visual Studio.
Link to Azure SQL Data Warehouse Data Tools (Schema Compare) preview information.  At the time of this writing, it still has to be requested from Microsoft, and direct feedback to Microsoft is expected.
Link to integrate Azure Data Factory with GitHub.
Link to integrate Databricks with GitHub.

Sample Files
Below are examples of data-driven metadata and environment properties.  Both are in JSON format.
environmentproperties.json (JSON, 2 KB) - Download File

data_driven_metadata.json (JSON, 2 KB) - Download File

A Bit of Humor Before You Go
Just for fun, here are some of my favorite sayings, in no particular order:
  1. Now, how easy was that!
  2. There is no reward for complexity.
  3. For everything, there is an appropriate place.
  4. We can, but we won't.
  5. Into every project a little rain must fall ...
... I said "A  LITTLE" !!!!

Transitioning from Traditional to Azure Data Architectures (SSIS to Azure Data Factory, SQL Database to SQL Data Warehouse, Scheduled Extracts to Event-Based Extracts)

12/17/2018

Confession: I put a lot of subtitles in this blog post in an attempt to catch how people may be describing their move from SSIS to ADF, from SQL DBs to SQL DWs, or from scheduled to event-based data ingestion.  The purpose of this post is to give you a visual picture of how our well-loved "traditional" tools of on-prem SQL Databases, SSIS, SSAS, and SSRS are being replaced by the Azure tool stack.  If you are moving from "Traditional Microsoft" to "Azure Microsoft" and need a road map, this post is for you.

Summary of the Matter: If you only read one thing, please read this: transitioning to Azure is absolutely "doable", but do not let anyone sell you "lift and shift".  Azure data architecture is a new way of thinking.  Decide to think differently.

First Determine Added Value:  Below are snippets from a slide deck I shared during Pragmatic Works' 2018 Azure Data Week.  (You can still sign up for the minimal cost of $29 and watch all 40 recorded sessions; just click here.)  However, before we begin, let's have a little chat.  Why in the world would anyone take on an Azure migration if their on-prem SQL database(s) and SSIS packages are humming along with optimum efficiency?  The first five reasons given below are my personal favorites.
  1. Cost (scale up, scale down)
  2. Event Based File Ingestion
  3. File based history (SCD2 equivalent but in your Azure Data Lake)
  4. Support for Near Real Time Requirements
  5. Support for Unstructured Data
  6. Large Data Volumes
  7. Offset Limited Local IT Resources
  8. Data Science Capabilities
  9. Development Time to Production
  10. Support for large audiences
  11. Mobile
  12. Collaboration
Each of the reasons given above is worth a minimum one-hour working session on its own, but I'm sharing my thoughts in brief in an effort to help you get started compiling your own list.  Please also look at the following diagram (Figure 1) and note two things: a.) the coinciding "traditional" components and b.) the value add boxed in red.
Figure 1 - Value Added by an Azure Data Architecture

If you compare my Traditional Data Architecture diagram first posted on this blog site in 2015 and the Azure Data Architecture diagram posted in 2018, I hope you see that what makes the second superior to the first is the value add available from Azure.  In both diagrams we are still moving data from "source" to "destination", but what we have with Azure is an infrastructure built for events (i.e. when a row or file is added or modified in a source), near real time data ingestion, unstructured data, and data science.

In my thinking, if Azure doesn't give us added value, then why bother?  A strict 1:1 "traditional" vs "Azure" data architecture would look something like this (blue boxes only) -->
Figure 2 - Traditional Components Aligned with Azure Components

It is the "white space" showing in Figure 2 that gives us the added value for an Azure Data Architecture.  A diagram
 that is not from Azure Data Week, but I sometime adapt to explain how to move from "traditional" to "Azure" data architectures is Figure 3.  It really is the exact same story as Figure 2, but I've stacked "traditional" and "Azure" in the same diagram.
Figure 3 - Traditional Components Aligned with Azure Components (Second Perspective)

Tips for Migration: Having worked with SQL Server and data warehouses since 1999 (the Microsoft tool stack specifically), I am well aware of the creative solutions to get "near real time" from a SQL Agent job into an on-prem SQL Server, or to query large data sets effectively with columnstore indexing.  For the sake of argument, let's say that nothing is impossible in either architecture.  The point I'm trying to make here, however, is rather simple:
  1. Transitioning from "traditional" to "Azure" is absolutely possible, but it is very helpful to identify the value added in order to justify the effort and budget.  The "sell" is what you are gaining: file-based data storage, change data capture within your Azure Data Lake, support for true unstructured data, and much (much!) better support for true data science.  Be very clear on what you are gaining, because some days, you'll need to be reminded (!).
  2. When migrating, it is critical to see how the new architecture is replacing the old, but you DO NOT want to work your little heart out to make Azure look, feel and act like SSIS or an on-prem SQL Server data store.  Consider acquiring a technical Azure lead (or fantastic consultant!) who has learned to capitalize on features of the Azure tool stack and will help your BI staff to begin thinking differently.  
  3. Azure requires new skills, but please don't give up on your current BI staff; instead, give them a chance with Pragmatic Works training courses, both free and on demand.  (Check out courses and free webinars offered here.)  It frustrates me that Microsoft expects everyone to now be Python and PySpark developers, but if you bring on a consultant who builds the framework of these Databricks and HDInsight components, there is a very high probability that the T-SQL and C# developers you have on staff will adapt.
  4. There are many 1:1 similarities, especially with the coming of (what I'll call) Azure Data Factory v3, which will offer row-level data manipulation, much like SSIS.  There are also options for each of my "pipelines" (Figure 4).  You need to decide what is best for your company.
Figure 4
In Figure 4, we have made the following substitutions to simplify migration:
1.  We have selected Azure Data Factory version 3 to replace the Python of Databricks or the PySpark of HDInsight.
2.  We have removed the change data capture files in Azure Data Lake and are keeping simple "is most recent" files.
3.  Unless you have data volumes to justify a data warehouse, which should have a minimum of 1 million rows for each of its 60 distributions, go with an Azure SQL Database!  You'll avoid the many creative solutions that Azure Data Warehouse requires to offset its unsupported table features and stored procedure limitations.

Every company I work with has a different motivation for moving to Azure, and I'm surely not trying to put you into my box.  The diagrams shared on my blog site change with every engagement, as no two companies have the same needs and business goals.  Please allow me to encourage you to start thinking about what your Azure Data Architecture might look like, and what your true value-add talking points for a migration to Azure might be.

Moving onward and upward, my technical friend,


~Delora Bradish

p.s. If you have arrived at this blog post looking for a Database vs Data Warehouse, or Multidimensional vs Tabular discussion, those really are not Azure discussion points as much as they are data volume discussion points; consequently, I did not blog about them here.  Please contact me via www.pragmaticworks.com to schedule on-site working sessions in either area.
