Category: Modeling

Data Architecture for Azure BI Programs

7/27/2018

A Bit of Intro
If I recall correctly, I completed the first version of this data architecture diagram in 2012 when we used terms like "road map" and "blueprint" Back then, along with different terms, we were also using traditional SSIS, SSAS-MultiD and SSRS tools. Now we live in the world of cloud everything, although we are still driving from SRC-to-DST (source to destination). I'm up for whatever terminology you want to use, but can we agree that we are surely on a different highway? For my classical BI Blueprint, click here, but to see an Azure road map for BI, please take a look below.

Disclaimer: I create a different diagram for every engagement, so think of this as a suggestion, not a mold.

Azure Data Architecture BI Talking Points:

Start thinking "event", "file based ingestion", "streaming" and "near real time" replacing the former batch mode thought process.
Adopt an "I can, but I won't" methodology as you reach for optimal Azure solutions. Identify the intended purpose for each Azure tool and stick with it. There shouldn't be data transforms happening in every column, and stand alone semantic layers growing in every Power BI report.
Simplistic repeatability is the key to successful CI/CD (continuous integration, continuous delivery). Data enters ABS (Azure Blob Storage) in different ways, but all data moves through the remainder of the ingestion pipeline in a uniform process.
Consider hiring a former web developer. More and more Azure offerings are coming with a GUI, but many will always require .NET, R, Python, Spark, PySpark, and JSON developer skills (just to name a few). You will need these skills for columns #2 and #5 above.
Be prepared to replace SSIS functionality with ADFv2, v3, v4 (eventually), and Sql Server User Stored Procedures. I think of Azure Data Factory v2 as an "orchestrator" right now, but I ADF Data Flows is Microsoft's next step in replacing SSIS.
Build current and future state data architectures. This foresight helps to ensure a solid foundation as you build your BI house in increments. It is generally a misstep to plan a Taj Madashboard, that needs information for every system in your company, as your first deliverable. This is also contrary to an Agile manifesto. Keep something in your back pocket. Deliver business value in consistent increments.
Model your data stores around reporting and security requirements, not what is easiest for data ingestion.
Start small and scale up with all your Azure resources. This is the premise of an Azure cost-effective solution.

BI Advice from the University of Hard Knocks:

Every decision point should be what is best for reporting and analytics, not data transform and load. Please see my BI Wheel for success. The wheel spokes change with Azure, but the theory does not.
Require uniformity and avoid one-off creative solutions. Make your first exception on the last day of the 5th year after you have gone to production.
Don't complicate your data architecture just because it's new, challenging and fun. Always produce a finished product that can be handed off to an entry-level developer. More than one person in the BI team should be able to service each part of the architecture.
Don't design your data architecture around 10% of your user base i.e. the number of people who may hold a master's degree in statistics at your company.
Design a BI solution without end-user input, and they will not come. Attitude is 50% of the success of your BI solution. Give strategic company users an investment in the project, and then they will adopt it.

Conclusion of the Matter: I am not explaining every column in the data architecture because the columns in the above diagram are not applicable to everyone. For example, almost everyone needs a semantic layer, but not everyone needs a logical data store for operational reporting. Column #5 can be done in Spark as well as Data Bricks; instead of my telling you what the best solution is, let's talk about it. For every column there is a good, better and best solution, and good heavens (!) not everyone needs a thirteen point data architecture! All things in moderation, right?

I am asking, if you have taken the time to read this, please start planning before you start building! Opening Power BI and mashing up data from three different sources is generally not a scalable solution. Get started with a data architecture diagram and build a better BI house!

3 Comments

Substituting Integers for Source System Primary Key Varchar() Data Types

2/9/2017

0 Comments

Situation: Many CRM data source use varchar() GUID-looking values for primary keys. This blog post applies to any source system for a reporting and analytics project that uses text/string/character values to join transactional tables. Below are example PKs from the SalesForce Opportunity table.

If you have the privilege of a data warehouse, the extract, transform and load (ETL) process often, as best practice, replaces source system PKs with data warehouse identity seed integer values. However, with the trends in data mashups (Excel Power Query and Power BI Query Editor), this may not be happening. Also, some PKs, like the ones pictured above, are often brought forward into a data warehouse as a secondary "business key" and users are pulling them into their report data sources for drill-down / source system lookup capabilities.

Problem: String values do not compress as well as integer values, so when using these varchar() PKs in multidimensional cubes, tabular models, Excel Power Pivot and Power BI (PBI) Desktop, file or memory sizes increase exponentially. As of January 2017, PBI in memory files have a maximum file size of 250MB. This can be highly problematic as explained by my Pragmatic Works colleague, Rachael Martino, in her SQL Saturday presentation Tips and Techniques for Power BI. (You can find a corresponding blog post from Rachael here.) With her permission, I have borrowed the following screen print which shows the problem and resolution result clearly.

Summary Resolution: Assign a unique integer value to each varchar() primary key value. This may be easier said then done, but look at the result above. On the left is memory consumption "Before" by a SalesForce varchar() PK. When an integer value was substituted "After", memory size dropped from 25,563.63KB to 0.12KB.

Resolution Illustrated: For the next screen print I totally cheated and used the t-sql ROW_NUMBER() and RANK() functions to illustrate my point and assign a unique integer to each varchar() value. However, there are at least three potential problem here:
1. NewAccountID and NewRecordTypeID share the same integer value. This may be okay -- it depends how your ETL is written.
2. If you are working in Excel Power Pivot, SSAS data source or Power BI query editor, you do not have the ETL capabilities that will push these same integer values into multiple child tables.
3. If you are working in Azure DW, as of January 2017 Azure DW did not have auto-incrementing identity seed capabilities, but that is a semi-related topic for another day.

Creative Problem Solving, Please: When I find myself in a bit of a fix like this, the answer is always the same: What is best for reporting and analytics (R&A)? ETL (or ELT) is not the spoke of my BI Wheel. In fact, data transformation frequently writes a check payable to Father Time to make a better R&A experience. This is another one of those instances. You should handle this in your source-to-data warehouse data integration step.

As a last resort, you can play ROW_NUMBER() and RANK() games inside your data source views. You can also continue to use these varchar() PK values for table relationships inside of SSAS tabular models, but be sure to 'Hide from Client Tools' so they don't end up being pulled into PBI memory or used as slicers or column values. If you are using tabular models and include these columns in your design, there is no way around paying the memory price in your SSAS processed model. Multidimensional cubes will throw a warning for bad cardinality of a dimension attribute, but if you do not place them inside any *.dim, and only use them for relationships in your DSV, you should be okay. There really is no happy ending here if you cannot get rid of these things from within medium to large sized data sets.

Let's remember, each MS BI tool is designed for a specific purpose. SSRS is a reporting tool although it can also provide dashboards. Power BI Desktop is designed for analysis of aggregated data -- not paginated granular reporting. Consequently, if we use each MS BI tool for what it does best, a SSAS Action or Power BI link to a granular SSRS report can be a good solution here. "Simply" pass a set of input parameters to SSRS and present to the user only the varchar() values needed.

0 Comments

Data Modeling for BI Analytics vs Reporting

11/6/2015

1 Comment

To create data models for business intelligence, you first need to understand your BI Blueprint, then it is pretty critical to truly understand the difference between reporting and analytics (R&A) data models. Modeling data for R&A happens in pipes #5, #7 and sometimes (but not optimally) in the DSVs (data source views) found in pipe #8.

Keeping in mind that tables arranged in a circle does not a star schema make, below is a slide that articulates in part the difference between the two. Think of reporting as a pile of Tinker Toys -- you SELECT tables and JOIN...JOIN...JOIN to a bunch more. Analytics is about flattened "denormalized" data arranged into subject area dimensions and measure groups, preferably with pre-processed totals, stored like Rubik's Cube.

If you are serious about data modeling for business intelligence, dig deep into the following concepts each which are worthy of individual blog posts.

Subject area dimensions vs tables that contain codes and relate to transactional records
Slowly changing dimensions
Correct modeling of deleted rows
Denormalization techniques
Degenerate dimension and techniques for handling large dimensions in cubes
Modeling for many-to-many relationships
Modeling for predictive analytics
Modeling for ABC (audit, balance and control) aka metadata and data verification

When I get my blogging juices on, I'd like to post a bit about each one. In the interim, you can contact me in About.

1 Comment

Modeling for BI

Data Architecture for Azure BI Programs

Substituting Integers for Source System Primary Key Varchar() Data Types

Data Modeling for BI Analytics vs Reporting

Categories