SSIS - Optimize package for fewer queries

Problem:

I need to associate customers' order with their "level" (Silver, Gold, etc.) when they placed the order:
CRM server::CRM db::CRM table
----------
CustomerID   PreviousLevel    NewLevel    NewLevelGrantedOn

Order server::Order db::Order table
----------
OrderID     CustomerID    OrderPlacedOn
In an SSIS package I did this:
  1. Execute SQL against Order db to extract the orders and put them in an object variable;
  2. Loop through each order using "foreach container", in which I put a data flow task select top 1 * where CustomerID = ? and LevelGrantedOn < ? order by LevelGrantedOn desc that extract data from CRM db (both parameters come from step one), derive some columns and write the output to another table.
There are more then twenty thousand records in the order db, which means the data flow task will be executed for more than twenty thousand times. The CRM db will be queried for more than twenty thousand times, too. It takes more than an hour to do these.
Can I utilize some built-in features to speed up these (or do it in a "smart" way)? And, is an hour a long time, in the context of ETL and / or SSIS?

Solution:

You can achieve this without using Foreach enumerator, just use one data flow task to achieve this.

Building the Package

First of all, add a Data Flow Task to the control Flow
1. OLEDB Sources
In the DataFlow task add an OLEDB Source that Read from Orders Table (the same command used in the Execute SQL Task (first step in your question)
Select * FROM Order
Also add a second OLEDB Source that read from customer table:
select * FROM CRM
2. Sort
After each OLEDB Source add a source component:
  1. The First (order table) you have to select CustomerID and OrderID columns for sorting and select the sort type as ascending
  2. The Second (customer table) you have to select CustomerID (sort type = ascending) and LevelGrantedOn (sort type = descending) columns for sorting
3. Merge Join
Add a merge Join component to join both sorted outputs. and select CustomerID column from both outputs as Join Key and select the output columns you need from both tables
4. Conditional Split
After merge Join add a conditional split to filter rows only that match the following expression
[LevelGrantedOn] < [orderdate]
5. Script Component
The final component we have to use is a script component to get only the first row for each customerID (because both source are well sorted just grabbing the first row is similar for Select top 1 ... ORDER BY LevelGrantedOn desc)
In the script component add an Output Column OutFlag of type DT_BOOL and Use the following script:
enter image description here
This script will set OutFlag to True when the customerID occur for the first time (similar to TOP 1)
    Public Class ScriptMain
        Inherits UserComponent


        Dim lstCustomerID As New System.Collections.Generic.List(Of Integer)
        Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

            If lstCustomerID.Contains(Row.ID) Then

                Row.OutFlag = False
            Else

                lstCustomerID.Add(Row.ID)
                Row.OutFlag = True


            End If



        End Sub

    End Class
6. Conditional Split
Add a second Conditional Split to filter Top 1 rows:
[OutFlag] == True
7.OLEDB Destination
Connect the Conditional Split Output top the OLEDB Destination and map the columns.
The Dataflow task must looks like
enter image description here

Original Post: https://dba.stackexchange.com/questions/176886/optimize-ssis-package-for-fewer-queries/176939#176939

Comments

Popular posts from this blog

SSIS - Script Task error: Exception has been thrown by the target of an invocation.

Don’t install Hadoop on Windows!

SSIS - script component DateTime Formats Implicit Conversion