Finally an Enterprise Class ETL Tool From Microsoft

                By Chris Becker MCDBA MCSE OCP

With the release of SQL Server 2005, SQL Server Integration Services (SSIS) is included, and it is an enterprise-strength Extract, Transform, Load (ETL) tool. No longer a side-note add-on as DTS was in SQL Server 2000, this tool will compete aggressively against the leaders in the ETL market. If you have a licensed version of MS SQL Server 2005 on your network, you already own an enterprise-class ETL tool.

 

For those of you not familiar with the purpose of an ETL tool, it is a way for developers and administrators to pull data from one data source, massage or transform the data while it is in transit (such as combining FirstName and LastName fields), and then load it to a destination data source. This is a very effective method for moving large data sets around your network, loading data warehouses, or extracting even small data sets to present to users or customers in Excel or text files, for example.
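To make the three stages concrete, here is a minimal sketch of the pattern in plain Python. The field names FirstName, LastName, and FullName follow the example above; the data and function names are made up purely for illustration.

```python
def extract(source):
    """Pull raw records from a source (a list standing in for a table)."""
    return list(source)

def transform(rows):
    """Massage data in transit: combine FirstName and LastName into FullName."""
    return [{"FullName": f"{r['FirstName']} {r['LastName']}"} for r in rows]

def load(rows, destination):
    """Write the transformed rows to a destination."""
    destination.extend(rows)

source = [{"FirstName": "Ada", "LastName": "Lovelace"}]
destination = []
load(transform(extract(source)), destination)
print(destination)  # [{'FullName': 'Ada Lovelace'}]
```

In SSIS, each of those three functions corresponds to a box you drag onto the designer rather than code you write.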

 

In this article I will introduce you to just a few transformations that would have required many lines of VBScript code in DTS. I will demonstrate how to pull data from multiple sources and bring it together with the Union All transformation, eliminate duplicates with the Aggregate transformation, perform a Lookup at the destination to determine which rows should be added, and finally send the data to multiple destinations with the Multicast transformation.

 

First, an introduction to the development environment. If you are familiar with DTS, you will notice some dramatic changes to the entire environment and architecture. You no longer open Enterprise Manager to develop your ETL packages; instead, you launch SQL Server Business Intelligence Development Studio. Another major architectural change: you do NOT need a connection to a SQL Server instance on your network to develop packages.


For those who develop on Microsoft technologies using Visual Studio, you will feel right at home with the Toolbox, Solution Explorer, Server Explorer, and development workbench area; you are using Visual Studio 2005 to develop your ETL packages. For those who do not develop code, or are not familiar with a programming language, no worries: most of the projects you will encounter can be solved without knowing or writing a single line of code. As you will see, much can be accomplished by setting properties on the transformations and tasks available out of the box.


The ETL package I will create pulls data from multiple SQL Servers, brings the records together, and removes duplicates. It then performs a lookup at the destination to determine whether each record already exists: if not, I want to prepare it for an INSERT; if the record does exist, I will load it into a separate table for logging purposes. And for documentation purposes, I want the inserted records logged to their own audit table, so I can track exactly what was inserted and when.

 

The first step is to create a Data Flow Task, as seen above, by dragging the icon from our Toolbox onto our workbench, the 'Control Flow' tab area. Double-click the Data Flow Task to go to the 'Data Flow' tab. This is where we define our data sources, our transformations, and the destinations for the completed data.


From the Toolbox we click and drag three OLE DB Source icons onto the Data Flow work area. These three sources represent three different SQL Servers on our network. Of course, SSIS is not limited to Microsoft data sources; anything ODBC or OLE DB compliant, or even text file sources, can be used with ease.


Now we configure each data source to point to a specific server. Double-click any source icon above to specify the details of the connection. Here we are connecting to a SQL Server named DEVLAB3 and a database named AdventureWorks, and we are executing a SQL command to pull our data. In this case we are selecting from the Employee table every record modified on or after 1/1/2007.
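The source query itself is ordinary SQL. As a rough stand-in for the DEVLAB3 server, here is the same date filter run against Python's built-in sqlite3 with a few made-up Employee rows; the table and column names mirror the query described above.

```python
import sqlite3

# Hypothetical in-memory stand-in for the source server; rows are invented
# purely to show which ones survive the ModifiedDate filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER, ModifiedDate TEXT)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [(1, "2006-12-31"), (2, "2007-01-01"), (3, "2007-06-15")],
)
rows = conn.execute(
    "SELECT EmployeeID FROM Employee "
    "WHERE ModifiedDate >= '2007-01-01' ORDER BY EmployeeID"
).fetchall()
print(rows)  # [(2,), (3,)]
```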


Each of the three sources contains different (and sometimes the same) employee records. The next step is to bring all of these records together into one location to massage, transform, and prepare for our destination. So here we create a 'Union All' transformation to bring together the data from the three separate sources.


Next we need to remove duplicates; unfortunately, the same employee record can exist in multiple data sources in this environment. So here we want to eliminate the chance of loading duplicates into the destination data source by creating an Aggregate transformation. Here is a look at the editor; notice the 'Group By' operation.
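Grouping on every column collapses identical rows into one, the same effect as a SELECT DISTINCT. A rough Python equivalent, with illustrative rows:

```python
# Rows arriving from the Union All; the second and third are identical.
rows = [{"EmployeeID": 1, "Name": "Ann"},
        {"EmployeeID": 2, "Name": "Bob"},
        {"EmployeeID": 2, "Name": "Bob"}]

seen, deduped = set(), []
for r in rows:
    key = tuple(sorted(r.items()))   # the 'Group By' key: every column
    if key not in seen:              # keep only the first copy of each group
        seen.add(key)
        deduped.append(r)
print(len(deduped))  # 2
```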


Now that the duplicates have been removed, we need to determine which records are to be loaded: which are to be INSERTED, and which already exist at the destination. This is performed with a Lookup transformation. A data source, or Connection Manager, must be defined on the Reference Table tab; in this case it points to the actual destination table. Instead of referencing the entire table, I use SELECT EmployeeID FROM Employee against the destination to eliminate all unnecessary columns. In effect, all we need to check is the EmployeeID field; if that EmployeeID already exists, we do not want to insert the record. As defined in the Lookup Transformation Editor, an Output Alias is created. This is an extra column passed downstream so the next transformation can determine whether each individual record exists at the destination.
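In rough terms, the lookup joins each incoming row against the destination's set of EmployeeIDs and appends the aliased column, with NULL marking rows not yet at the destination. (In the designer, the transformation must be configured to ignore lookup failures so unmatched rows flow through with NULL instead of raising an error.) A plain-Python sketch with illustrative IDs:

```python
# IDs already at the destination: SELECT EmployeeID FROM Employee
destination_ids = {1, 2}

# Deduplicated rows arriving from the Aggregate transformation.
incoming = [{"EmployeeID": 2}, {"EmployeeID": 5}]

# The lookup appends the Output Alias column; None plays the role of NULL.
for row in incoming:
    matched = row["EmployeeID"] in destination_ids
    row["EmployeeID_Lookup"] = row["EmployeeID"] if matched else None

print(incoming)
# [{'EmployeeID': 2, 'EmployeeID_Lookup': 2},
#  {'EmployeeID': 5, 'EmployeeID_Lookup': None}]
```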


Now that the lookup has taken place, we evaluate the Output Alias from above and check for a NULL value. If EmployeeID_Lookup is NULL, the record did not exist at the destination when the lookup was performed, so it is loaded into the destination table. If the value is not NULL, meaning the record does exist, we want to do something else with it. For this, the appropriately named Conditional Split transformation is used.
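The routing itself is a one-line test per output. A plain-Python sketch of the split, using the same illustrative rows as above:

```python
# Rows arriving from the Lookup, each carrying the EmployeeID_Lookup column.
rows = [{"EmployeeID": 2, "EmployeeID_Lookup": 2},
        {"EmployeeID": 5, "EmployeeID_Lookup": None}]

# Conditional Split: NULL means 'new, insert me'; NOT NULL means 'already there'.
new_rows = [r for r in rows if r["EmployeeID_Lookup"] is None]
existing = [r for r in rows if r["EmployeeID_Lookup"] is not None]

print(len(new_rows), len(existing))  # 1 1
```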


So our ETL package has gone through the Extract and Transform stages, and now we are ready for the Load to the destination. The package so far consists of three data sources brought together by a Union All transformation; the Aggregate then removes any duplicates, followed by a Lookup to determine 'do I exist or not?', with a Conditional Split to identify the records for loading.

The next transformation used is the Multicast, which takes a single input and sends it to as many different destinations as necessary. In the example here, I take only those records from the Conditional Split transformation where EmployeeID_Lookup is NULL and send them to the Multicast transformation; where the records are NOT NULL, I send them to their own destination. The purpose of using the Multicast in my example is to load one copy into an ETLLog table on my destination server, DEVLAB1, while the other copy goes to the production table, AdventureWorks.Employee.
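Conceptually, the Multicast just hands each output an independent copy of the same row set. A quick Python sketch (the table names in the comments follow the example; the row is made up):

```python
import copy

# Rows selected for insert by the Conditional Split.
to_insert = [{"EmployeeID": 5}]

# Multicast: every output receives its own full copy of the input.
etl_log = copy.deepcopy(to_insert)     # ETLLog audit table on DEVLAB1
production = copy.deepcopy(to_insert)  # AdventureWorks Employee table

print(etl_log == production)  # True: both destinations see identical rows
```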

 

Now, the purpose of this article is to display not only the power of SSIS, but its user-friendliness. We created a rather complex package, one that is very common in many production environments today, and without a single line of VB or Java script our package is doing some pretty amazing stuff. For those DTS users out there, I know you can appreciate how much development time SSIS will save you. If you have SQL Server 2005 in your environment, this truly enterprise-class tool is just a click away.