Category: Azure Data Warehouse

Transitioning from Traditional to Azure Data Architectures (SSIS to Azure Data Factory, SQL Database to SQL Data Warehouse, Scheduled extract to event-based extracts)

12/17/2018

Confession: I put a lot of subtexts in this blog post in an attempt to catch how people may be describing their move from SSIS to ADF, from SQL DBs, to SQL DWs or from scheduled to event-based data ingestion. The purpose of this post is to give you a visual picture of how our well loved "traditional" tools of on-prem SQL Databases, SSIS, SSAS and SSRS are being replaced by the Azure tool stack. If you are moving form "Traditional Microsoft" to "Azure Microsoft" and need a road map, this post is for you.

Summary of the Matter: If you only read one thing, please read this: transitioning to Azure is absolutely "doable", but do not let anyone sell you "lift and shift". Azure data architecture is a new way of thinking. Decide to think differently.

First Determine Added Value: Below are snippets from a slide deck I shared during Pragmatic Work's 2018 Azure Data Week. (You can still sign up for the minimal cost of $29 and watch all 40 recorded sessions, just click here.) However, before we begin, let's have a little chat. Why in the world would anyone take on an Azure migration if their on-prem SQL database(s) and SSIS packages are humming along with optimum efficiency? The first five reasons given below are my personal favorites.

Cost (scale up, scale down)
Event Based File Ingestion
File based history (SCD2 equivalent but in your Azure Data Lake)
Support for Near Real Time Requirements
Support for Unstructured Data
Large Data Volumes
Offset Limited Local IT Resources
Data Science Capabilities
Development Time to Production
Support for large audiences
Mobile
Collaboration

Each of the reasons given above are a minimum one hour working session on their own, but I'm sharing my thoughts in brief in an effort to help you to get started compiling our own list. Please also look at the following diagram (Figure 1) and note two things: a.) the coinciding "traditional" components and b.) the value add boxed in red.

Figure 1 - Value Added by an Azure Data Architecture

If you compare my Traditional Data Architecture diagram first posted on this blob site in 2015 and the Azure Data Architecture diagram posted in 2018, I hope that you see what makes the second superior to the first is the value add available from Azure. In both diagrams we are still moving data from "source" to "destination", but what we have with Azure is an infrastructure built for events (i.e. when a row or file is added or modified in a source), near real time data ingestion, unstructured data, and data science.

In my thinking, if Azure doesn't give us added value, then why bother. A strict 1:1 "traditional" vs "Azure" data architecture would look something like this (blue boxes only) -->

Figure 2 - Traditional Components Aligned with Azure Components

It is the "white space" showing in Figure 2 that gives us the added value for an Azure Data Architecture. A diagram that is not from Azure Data Week, but I sometime adapt to explain how to move from "traditional" to "Azure" data architectures is Figure 3. It really is the exact same story as Figure 2, but I've stacked "traditional" and "Azure" in the same diagram.

Figure 3 - Traditional Components Aligned with Azure Components (Second Perspective)

Tips for Migration: Having worked with SQL Server and data warehouses since 1999 (Microsoft tool stack specifically), I am well aware of the creative solutions to get "near real time" from a SQL Agent job into an on-prem SQL Server, or to query large data sets effectively with column store indexing. For the sake of argument, let's say that nothing is impossible in either architecture. The point I'm trying to make here, however, is rather simple:

Transitioning from "traditional" to "Azure" is absolutely possible, but it is very helpful to identify the value added in order to justify the effort and budget. The "sell" is what you are gaining: file-based data storage, change data capture within your Azure Data Lake, support for true unstructured data, and much (much!) better support for true data science. Be very clear on what you are gaining, because some days, you'll need to be reminded (!).
When migrating, it is critical to see how the new architecture is replacing the old, but you DO NOT want to work your little heart out to make Azure look, feel and act like SSIS or an on-prem SQL Server data store. Consider acquiring a technical Azure lead (or fantastic consultant!) who has learned to capitalize on features of the Azure tool stack and will help your BI staff to begin thinking differently.
Azure requires new skills, but please don't give up on your current BI staff; instead, give them a chance with Pragmatic Works training courses, both free and on demand. (Check out courses and free webinars offered here.) It frustrates me that Microsoft expects everyone to now be Python and PySpark developers, but if you bring on a consultant that builds the framework of these Databricks and HDInsight components, there is a very high probability that the t-sql and C# developers you have on staff will adapt.
There are many 1:1 similarities, especially with the coming of (what I'll call) Azure Data Factory v3 which will offer row-level data manipulation, much like SSIS. There are also options to each of my "pipelines" (Figure 4). You need to decide what is best for your company.

Figure 4

In Figure 4, we have made the following substitutions to simplify migration:
1. We have selected Azure Data Factory version 3 to replace the Python of Databricks or the PySpark of HDInsight.
2. We have removed the change data capture files in Azure Data Lake and are keeping simple "is most recent" files.
3. Unless you have data volumes to justify a data warehouse, which should have a minimum of 1 million rows for each of its 60 partitions, go with an Azure Database! You'll avoid the many creative solutions that Azure Data Warehouse requires to offset its unsupported table features, and stored procedures limitations

Every company I work with has a different motivation for moving to Azure, and I'm surely not trying to put you into my box. The diagrams shared on my blob site change with every engagement, as no two companies have the same needs and business goals. Please allow me to encourage you to start thinking as to what your Azure Data Architecture might look like, and what your true value add talking points for a migration to Azure might be.

Moving onward and upward, my technical friend,

~Delora Bradish

p.s. If you have arrived at the blog post looking for a Database vs Data Warehouse, or Multidimensional vs Tabular discussion, those really are not Azure discussion points as much as they are data volume discussion points; consequently, I did not blog about these talking points here. Please contact me via www.pragmaticworks.com to schedule on site working sessions in either area.

2 Comments

ADW PolyBase Error: Error converting RCFile Type string to Sql type NVARCHAR

8/10/2018

0 Comments

When testing Azure Data Lake (ADL) to Azure Data Warehouse (ADW) file ingestion, this error continued to come up on various external table SELECTs. The confusion was that the ADL only contained parquet files. There was only one external file format defined, and that too was obviously for parquet. From where was an RCFile error originating? The bottom line, in this particular engagement scenario, was that this error is actually a truncation error.

Things to Verify:

The external file column list is defined in the right order. If all your parquet file columns are defined as string, then this will not be evident until you look at the returned values from your SELECT query. However, if you have defined actual data types in the file store, those data types must align with SQL data types in your external table definition.
All NVARCHAR() columns in the external table definition are long enough.

Solution:

Align your external table columns properly with the file store
Enlarge the NVARCHAR() column length in your external table definition. External tables do not store data, they just point to an external data source. Therefore, consider going with NVARCHAR(4000) to avoid this error in the future. The ELT data transforms is where you should CAST() all data to correct formats and lengths anyway.

Supporting t-sql Scripts
If you are new to PolyBase and external tables in SQL Server environments, here are a three t-sql scripts that are supporting references to the error resolution given above.

Example CREATE EXTERNAL DATA SOURCE t-SQL. Click here for more information.
CREATE EXTERNAL DATA SOURCE [MyDataSourceName]
WITH (TYPE = HADOOP,
LOCATION = N'adl://MyDataLakeName.azuredatalakestore.net',
CREDENTIAL = [MyCredential])
GO

Example CREATE EXTERNAL FILE FORMAT t-SQL. Click here for more information.
CREATE EXTERNAL FILE FORMAT [MyFileFormatName]
WITH (FORMAT_TYPE = PARQUET
, DATA_COMPRESSION = N'org.apache.hadoop.io.compress.SnappyCodec')
GO

Example CREATE EXTERNAL TABLE t-sql script. Click here for more information.
BEGIN TRY DROP EXTERNAL TABLE [ext].[MyExternalTableName] END TRY BEGIN CATCH END CATCH

CREATE EXTERNAL TABLE [ext].[MyExternalTableName]
(
[ColumnName1] bigint NULL
,[ColumnName2] nvarchar(4000) NULL -- if this value is too small, you will get the conversion error
,[ColumnName3] bit NULL
,[ColumnName4] datetime NULL
,[ADLcheckSum] nvarchar(64) NULL -- if this value is too small, you will get the conversion error
,[ADFIngestionId] nvarchar(64) NULL -- if this value is too small, you will get the conversion error
)
WITH (DATA_SOURCE = [MyDataSourceName]
, LOCATION = N'/Folder1/Folder2/'
, FILE_FORMAT = [MyFileFormatName]
, REJECT_TYPE = VALUE
,REJECT_VALUE = 0)
GO

0 Comments

All Things Azure

Transitioning from Traditional to Azure Data Architectures (SSIS to Azure Data Factory, SQL Database to SQL Data Warehouse, Scheduled extract to event-based extracts)

ADW PolyBase Error: Error converting RCFile Type string to Sql type NVARCHAR

Categories