Microsoft Data & AI

SSIS 2016: An efficient way to handle transform and load of SCD2 (type 2 slowly changing dimensions)

8/23/2017

Quick Review:
This blog post is about type two slowly changing dimensions (SCD2). This is when an attribute change in row 1 results in SSIS expiring the current row and inserting a new dimension table row like this -->

SSIS comes packaged with a SCD2 task, but just because it works, does not mean that we should use it. Think of the pre-packaged Microsoft supplied SCD2 task as a suggestion. If we really want to get the job done, we will want to use Cozy Roc or Pragmatic Works Task Factory (TF). I strongly suggest Task Factory’s [Dimension Merge Slowly Changing Dimension] add-in to SSIS for the following reasons:

Performance
Easy of use
Rich features

The key to success, in my case, was two-fold

Sort the two OleDB source components (#1 and 3) both in the SELECT statement, and under the Advanced tab of the OleDB source task.
Be sure to choose the right keys in the TF Upsert destination (#6 and #8).

You can read about additional performance tips from Pragmatic Works on line help (performance tab) and they have an instructional video here. The point of this blog post is to share a screen print that would have been helpful to me the first time I setup this component.

Now for the dissection...

This is your SELECT transform statement from your source system. It must contain an ORDER BY clause. The [OleDB SRC vwDimCustomer] task then must be sorted. Right mouse click --> Show Advanced Editor --> Input and Output Properties tab --> Ole DB Source Output --> IsSorted property = True. On this same tab click on Ole DB Source Output --> Output Columns --> [Your Business Key Column Name]--> SortKeyPosition = 1
This is an optional TF component, but best practice is to not allow NULL values, assuming your data warehouse is modeled and optimized for analytics. Although all of the columns in vwDimCustomer have IsNull() functions, all the lookup values will come into the SCD2 task as NULL, so this step replaces NULL values with 'Unknown', '-1', 'N/A', or whatever be your preference.
In order to know if something is new or changed, we need a comparison table and this is the function of #3. Just like #1, this source data must be sorted. Example: SELECT * FROM edw.DimCustomer ORDER BY Customer BK.
This is the hinge pin of the SCD2 process. I will leave you to read up on how to configure the component as I don't wish to rewrite Pragmatic Works Help. One word of advice: Do not rest until all warnings have been corrected. Where it is easy to get lost is configuring the outputs.
The easiest of the four: a BK (business key) does not exist. Insert it. This is a traditional OleDB insert and we don't map the PK (primary key) or SCD2 ExpirationDateTime column. The PK is an identity seed and is taken care of by SQL Server. The SCD2 expiration date should be NULL. If you want a audacious default, like 12/31/9999, then this column must, of course, be mapped.
This TF Upsert takes care of SCD1 rows: Rows that do not have a SCD2 change and are therefore updated completely, including historical columns. This is a standard t-sql UPDATE. My personal preference is to use the TF Upsert Column Compare update method so rows that have no actual change, do not have an unnecessary update and meaningless [RowLastUpdated] timestamp. TF Upsert Column Compare works much like a hash value in many ETL methodologies. "If exists, go down SCD1 pipeline, but don't update the row unless there is an actual change." Critical success point: key on the BK, not the PK!
We finally get to SCD2 results with #7. This is where new rows are created because of an identified SCD2 column change.
When a new row is added, the old row must be expired. This is the functionality of #8. Because we definitely have a change in the row, there is no purpose to spend the time doing a TF Upsert Column Compare. Set this TG Upsert to bulk update. Critical success point: key on PK, not BK!

Conclusion: Having fulling understood SCD2 concepts, this TF component took me a little bit to configure -- I had to actually think about things vs the rhythmic repetition of common data flow tasks. On first pass I skipped past the sort of the two source components and out of habit, picked up the PK in both TF Upsert components. I didn't pay attention to the <New> OleDB mapping and tried to insert my edw.DimCustomer.PK (hello?!) . My advice is to SLOW DOWN and get the first SCD2 dimension SSIS package built and aggressively tested, then fall back into the rinse and repeat rhythm of SSIS package development.

If you get stuck, reach out to Pragmatic Works Product Support. I highly recommend their online chat.

0 Comments

SSIS 2016, An efficient way to mark deleted source system rows in a persisted data warehouse table

8/23/2017

2 Comments

Even though many people think data warehouses ETLs (extract, transform and load) should contain insert data flows only, the vast majority of people I work with also have to deal with updates. Many have to also handle marking data warehouse rows as IsDeleted = "Y" in their ODS and EDW data repositories.

If you are working with a dimension tables with less than 500K rows (an estimate), a traditional lookup task might work just fine. The data flow would look like this:

(TF = Task Factory. This is a great SSIS add-in for a nominal cost available from Pragmatic Works.)

The problem with the above methodology is that although it may work, it is not efficient. I am a firm believer in performance tuning SSIS packages attempting to shave off minutes, and often seconds. Consequently, just because we get a green check mark, that is NOT indicative of “all is well”. In the above data flow, every ods.Customer has to be looked up in the src.Customer table. P-a-i-n-f-u-l!

As with everything SSIS, there are multiple ways to get the same data flow accomplished, and I tip my hat to those of you who like to write C#. In my experience, C# seems to be able to complete a data flow task faster than many comparable SSIS components, but C# is not my go-to solution. An entry-level developer will probably be maintaining and enhancing the package, so I try to find an alternative. Keeping with OOP (object oriented programming) techniques, I tested two alternative options this week that I thought were worth a share.

MergeJoin
Temp (stage) table

Working with a dimension table containing 5 million rows (it was “people” and there really were that many people so there was no trimming down the dimension row count), MergeJoin took 3 minutes. The standard lookup took over 20 minutes.

Assumptions:

Using a JOIN between tables located in two different servers is not considered as a solution. Although a linked server in Sql Server would allow you to do this, I consider this bad form.
The source system is not keeping a change data capture or log file of deleted rows
Truncating and reloading our data warehouse table is not an option. This is another ETL methodology that I see frequently, but personally consider bad form.

MergeJoin Solution:
It isn’t that complex. Add two OleDB source tasks to a data flow: one for your actual source that is deleting records (and in my opinion, behaving badly, but sometimes that just cannot be helped) and a second source component for the data warehouse. Use a MergeJoin to bring the two together, use a conditional split to send (now) missing PKs to an upsert or stage destination. I like to use TF (Task Factory’s) Upsert component as it is super easy to configure.
A MergeJoin solution for marking data warehouse records as deleted in the source will look something like this:

Key to performance success in the above data flow is sorting both the ods.Customer and src.Customer tables including setting the sort properties of the Advanced Editor. Right mouse click on the OleDB source task --> Show Advanced Editor --> Input and Output Properties tab --> Ole DB Source Output --> IsSorted property = True. On this same tab click on Ole DB Source Output --> Output Columns --> [Your Business Key Column Name]--> SortKeyPosition = 1

For those new to SSIS, the Merge Join and Conditional Split components are pictured below to fill in the blanks of the MergeJoin data flow.

Temp Table Solution:
I won’t spend too much time here because this option was a bit slower than the MergeJoin, and requires a stage table in the same database as the ODS. The basic concept is to stage the unique BKs (business keys) of the source into your ODS database using a TRUNCATE and full reINSERT. Then perform the lookup between your ODS database and the newly loaded stage table. This does work, and avoids a lookup to a 2nd server or source, but it is not my personal preference.

A temp table solution (stg.Customer) will look something like this:

Conclusion: We might not like it, but sometimes we must make a full comparison between our incrementally loaded data warehouse table and a source system to find business keys that no longer exist in the source. SSIS is built to handle this, but we still need to try several options to find the best performing solutions for our unique environments.

This blog post just dealt with the ODS; deleted source records still have to be handled in the EDW. I loath seeing IsDeleted columns in true data warehouses. Is there a reporting requirement for deleted records? Are you going to put an index on that column? Will you have two sets of views, one with deleted records and one without? A much better way to handle deleted source records already persisted to an EDW is to create a DEL (deleted) schema and move the edw.Customer rows to del.Customer. It takes more ETL effort, but once developed, always done. ROI (return on investment) is not having to wade through deleted rows in the EDW. I feel another blog post coming ...

2 Comments

Error Handling in SSIS, Redirecting Error Rows, Report and Pass

7/12/2017

0 Comments

I classify SSIS errors into two groups:

Report and pass -- non-fatal errors that should be reported but not stop the ETL process like (my opinion) truncation or data type mismatch from a text file.
Report and fail -- fatal errors that should fail a package, like "cannot acquire connection from connection manager". (Honestly, no package is going anywhere after this one!)

This blog post is about "Report and Pass" which I first talked about in Top 5 Best Practices for SSIS. There are hundreds of SSIS potential errors, and to start with, I STRONGLY recommend BI xPress from Pragmatic Works. SSIS package deployment and monitoring aside, however, what I'd like to talk to you about here is error handling inside of a data flow so that we are not sacrificing the permanent, the completion of your master-master package, on the alter of the immediate, a dateTime data type that came through last night in a text file's integer column space.

Last month I dusted off an old SQL 2008 SSIS package written about that same time. I was hoping that error handling in SSIS through VS 2015 had some cool new task to "report and pass". Not finding anything, I was hoping that Task Factory, a second "don't deploy without it" SSIS tool, had picked up the slack. Nothing doing, as some say. Here then is a review of an old proven SSIS methodology for recording data of redirected errors in a SSIS data flow.

Process Overview
First we need a SQL Server data table to hold our error rows. You might consider something like this:

CREATE TABLE [etl].[SSISErrorHandlingLog](
[RowID] [int] IDENTITY(1,1) NOT NULL,
[SSISPackageId] [varchar](50) NOT NULL,
[SSISTaskName] [varchar](50) NOT NULL,
[SSISTableName] [varchar](50) NULL,
[ErrorCode] [varchar](25) NULL,
[ErrorColumnID] [varchar](10) NULL,
[ErrorColumnName] [varchar](250) NULL,
[ErrorDescription] [varchar](1200) NOT NULL,
[RowData] [varchar](1500) NULL,
[InsertedDateTime] [datetime] NOT NULL,

CONSTRAINT [PK_SSISErrorHandlingLog] PRIMARY KEY CLUSTERED
([RowID] ASC) ON [PRIMARY]) ON [PRIMARY]
GO
ALTER TABLE [etl].[SSISErrorHandlingLog] ADD CONSTRAINT [DF_SSISErrorHandlingLog_InsertedDateTime] DEFAULT (getdate()) FOR [InsertedDateTime]
GO

We then need to add three or four tasks to each data flow: a union all (optional), a derived column, a script task (optional), and finally an OLE DB destination to our etl.SSISErrorHandlingLog table. All things completed, a data flow will look something like this (on the left):

The script component has two inputs: ErrorCode and ErrorColumn. It has one output: ErrorText. (You can also add addition C3 for ErrorColumnName.) Other than that, the script component exposes the error description.
  public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
Row.ErrorDescription= ComponentMetaData.GetErrorDescription(Row.ErrorCode);
    }

The DC (derived column) task has an expression for [RowData].
"My Table's PK=\"" + REPLACENULL((DT_WSTR,50)IntKey," ") + "\"" + " My User BK=\"" + REPLACENULL(MyStringColumn," ") + "\""

Additional derived columns might be as shown below: (I did not do myself any favors by making my table's string data types VARCHAR() instead of NVARCHAR(), so please follow your own DB data type standards.)

If you are redirecting error from multiple places, like both your SRC and DST, use a Union All task before the script task making sure to only include needed columns -- you may get SSIS error when trying to union large string values, so only union needed PKs, FKs and BKs consumed by the RowData expression.

IMPORTANT: All things in moderation! Here is an error handling example of what NOT to do.

Where Things Go Wrong:

Your error handling was NOT done in moderation
The RowData expression is incorrect. Work in small pieces building the script one small part at a time.
The data types of the derived columns or script output do not match the etl.SSISErrorHandlingLog. Either convert your SSIS output columns or change or destination data types.
Your script component follows a DOT NET component and there is no error description to be had. The script task will "hang".
The binary code for the script is not found. I'm still trying to figure this out as it seem intermittent. Open the script component, edit the script, save the script, save the package, open and close the package. If you have a more technical solution, I'd love to hear from you.

Conclusion:
There is more than one way to trap and report SSIS errors, but in the end, we are all traditionally using a package event handler or redirecting errors in the data flow. I like to see the data that caused the problem, so I tend to go a little overboard with the information concatenated for the etl.SSISErrorHandlingLog.RowData column. I also hook up my error table to a SSRS report and SSRS subscription in an effort to actually act on the data collected, not just store it.

0 Comments

Performance Testing OLE DB vs ADO.NET in SSIS

7/12/2017

3 Comments

I always "knew" that ADO.NET was slower than OLE DB destination SSIS components, but I had never tested it. This month I was testing a 3rd party SSIS destination component that had been built on ADO.NET and, oh my (!!). The performance was substantially slower than what I could get done with a traditional OLE DB upsert process. However, before I told the client, "Abandon ship! Every SSIS package for itself!", I decided to try a few simple tests. Here are my results of my second test, an upsert (insert new rows and update existing rows):

Environment

(local) SQL Server 2016 v.13.0.1782.2
(local) Visual Studio 2015 v.14.0.25431.01
Source table contains 12,243,275 rows and was 31.1 MB on disc
Source and destination tables had identical column structures
- Tables contained 33 columns with total max row length of 181 bytes (information taken from sys.columns)
- There was a PK in both the dbo.SourceTable and dbo.DestinationTable. dbo.DestinationTableForUpdate contained no PK
- There was no FK, additional constraints or indexes on any table
dbo.DestinationTable was manually truncated (test #1) or half the rows were deleted (test #2) before each test and was not part of the test time results
All tests were executed through Visual Studio 2015 unless noted otherwise

Test: Insert and Bulk Update, or Merge()
Setup: 6,121,637 rows were removed from dbo.DestinationTable prior to each test. Consequently, these tests inserted 6,121,637 rows and updated 6,121,638 existing rows.

Results:

1. This was a typical t-sql MERGE() statement with no DELETE and no output into a result table.
2. The same MERGE() statement executed through SSMS in #1, was executed through a SSIS Execute Sql task
3. An old-fashioned OLE DB upsert process: If key found, insert to DestinationTableForUpdate, if new, insert into DestinationTable. An execute Sql task followed the data flow to UPDATE DestinationTable with values found in DestinationTableForUpdate.
4. Rinse and repeat #3 using ADO.NET components
5. Third party upsert component which handles the insert and update in one beautiful simple ... easy ... low maintenance ... any-entry-level-SSIS-developer-can-do-it task. (I really wanted to use this SSIS add-in!)

Conclusion:
OLE DB has better performance, and because the nifty 3rd upsert task is built on ADO.NET, I won't be using it nor recommending it for use in anything other than a small to mid-size company with small MB incremental loads. Bummer! The good news is that I now have a benchmark and if Microsoft improves ADO.NET in future releases on SQL Server, I'll be pulling out my POC database and hoping for better results. (p.s. If you are wondering why I will not recommend sacrificing the permanent daily ETL load on the alter of an immediate faster ETL development, please see my BI Wheel. ETL is not the center of the wheel.)

3 Comments

Top 5 Best Practices for SSIS Performance

5/24/2015

0 Comments

Microsoft has built in multiple data flow performance features that you can read about here (http://msdn.microsoft.com/en-us/library/ms141031.aspx). The following list is not all-inclusive, but the following best practices will help you to avoid the majority of common SSIS oversights and mistakes.

1. Give your SSIS process its own server. The resources needed for data integration, primary memory and lots of it, are different than for data storage. Granted, if your entire ETL process runs in just a few hours during the night when no end users are connecting, the case can be made to share servers; however, more often, real-time requirements and on-demand transformation are reasons to give your ETL process dedicated hardware.

2. Only update rows in your data warehouse that have been changed or deleted from your source system(s).

Use SQL Server’s change data capture abilities whenever possible if working with many updates and deletes in the source system. This works well in a one-source-to-one-destination transform, but can become quite complex in a multi-source-to-one-destination ETL package
Consider the use of hash values when your source system does not indicate last update date time, or when you transforms involve multi-source-to-one-destination transforms.

3. Install and test for adequate RAM on the SSIS server. Sorting 2TB of data requires 2TB of RAM, and SSIS will start to write the data to disc when all available memory is taken. As part of your Test / QA methodology, you should use Performance Monitor and have your network team notify you whenever Buffers Spooled goes above zero or Avg. Disk sec/Transfer gets above 10. Test and monitor the following PerfMon counters.

% Processor Time
Avg Disk sec/Transfer
Available Mbytes
Buffer memory
Buffers in use
Buffers spooled
Private buffer memory
Private buffers in use

4. Take note of your MaxConcurrentExecutables package property. This defines how many tasks can run simultaneously. The default value is negative one (-1), which means the number of physical or logical processors plus two (2) is the number of control flow items that can be executed in parallel. It is generally recommended that you leave this setting at the default unless you are absolutely sure that parallelism is causing an issue.

5. Adequately test and update data flow properties that impact performance.

EngineThreads
DefaultBufferMaxRows
DefaultBufferSize

You can find a plethora of information on these package properties on the Internet, and so their definitions will not reiterated here; however, the important thing is not to push a large ETL project into production with the SSIS defaults. Best practice is to develop small, mid-sized and large data transforms with the SSIS package defaults and then take the time to test changes to each of these properties based on volume. Unfortunately, there is no exact science.

0 Comments

Top 5 Best Practices for SSIS Design

5/23/2015

3 Comments

1. A good SSIS package design will be repeatable. If you find yourself adding new tasks and data flow exceptions to your packages, you need to stop and reevaluate the original layout. One SSIS project will have several “templates” that are reused without a single design change for the life of the data warehouse. You should only have one or two template variations for each of these areas:

Staging package
Fact package
dimension package
Slowly changing dimension type 2 package
Control package

The key to manageable QA and the longevity of your ETL process is using a repeatable process. The same thing should happen over and over again as data is moved from staging to your IDS (Information Data Store) to your EDW (Enterprise Data Warehouse). Consider using t-SQL in USPs to work out complex business logic.

2. Plan for restartability. As of SQL 2014, SSIS checkpoint files still did not work with sequence containers. (The whole sequence container will restart including successfully completed tasks.) The solution is to build Restartability into your ABC framework.

3. Verify your ETL process. Just because your ETL finished without error – or you successfully handled your errors, doesn’t necessarily mean the SUM() of SourceSystemNet equals the SUM() of SSAScubeNet. Use your verification process this way.
a. It should be the final step of your ETL / ELT process
b. It should confirm that strategic SUM() and row COUNT() are accurate
c. It should report on dropped rows discarded during the ETL from an INNER JOIN or WHERE clause
d. It should automatically send emails of errors that have been allowed to “report and pass”.

4. Collect Metadata! Audit, balance and control (ABC) should be planned for and implemented from the very first day. You should store this ABC data in a separate SQL Server database, and at any point in time be able to determine the following:

What packages are currently running? (Your SQL Agent will kick off a master package but won’t tell you what child packages / sequence containers are in process.)
When did a package last successfully execute?
How many records were selected vs. inserted, updated or deleted from any given task?
At what data flow process did a package fail, and where should it restart?
How long did each task take to run?
How long did each package take to run?
What tasks are taking the longest to execute?

5. Trap for Errors both through On Error events and through precedence constraints. There are two types of errors to successfully handle in an ETL / ELT process

Report and fail. Use Event Handles and your package properties for this.
Report and pass. Use your precedence constraints to allow for errors, but always keep row-level reporting so that someone can be notified and the problem can be researched and corrected

With everything else, there must be a balance. It is not appropriate to have every single task with an Error Output. Choose wisely. It is appropriate often to ‘report and fail’, but when there are hundreds of packages and tables churning through your ETL system, you cannot have it breaking constantly. (This is often the case when source systems keep sending through data quality “surprises”.) Don’t jeopardize the reliability of your data warehouse by not handling a source system that continues to behave badly. The DW is your responsibility. The source system is often outside of your control.

For a full list of MS BI Best practices, download the following file:

ms_bi_best_practices.docx
File Size:	2137 kb
File Type:	docx

Download File

3 Comments

SQL Server Integration Services

SSIS 2016: An efficient way to handle transform and load of SCD2 (type 2 slowly changing dimensions)

SSIS 2016, An efficient way to mark deleted source system rows in a persisted data warehouse table

Error Handling in SSIS, Redirecting Error Rows, Report and Pass

Performance Testing OLE DB vs ADO.NET in SSIS

Top 5 Best Practices for SSIS Performance

Top 5 Best Practices for SSIS Design

Categories