Finally, an Enterprise-Class ETL Tool
By Chris Becker, MCDBA, MCSE, OCP
With the release of SQL Server 2005, SQL Server Integration Services (SSIS) is included: an enterprise-strength Extract, Transform, Load (ETL) tool. No longer a side-note add-on as DTS was in SQL Server 2000, this tool will compete aggressively against the leaders in the ETL market. If you have a licensed version of MS SQL 2005 on your network, you already own an enterprise-class ETL tool.
For those of you not familiar with the purpose of an ETL tool, it gives developers and administrators a way to pull data from one data source, massage or transform the data while it is in transit (such as combining FirstName and LastName fields), then load it to a destination data source. This is a very effective method for moving large data sets around your network, loading data warehouses, or extracting even small data sets to present to users or customers in Excel or text files, for example.
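To make the pattern concrete, here is a minimal sketch of those three stages in plain Python. SSIS does all of this declaratively in a designer; the function names, sample rows, and CSV destination below are invented purely for illustration.

```python
import csv
import io

def extract():
    # Extract: rows as they might arrive from a source table.
    return [
        {"FirstName": "Ada", "LastName": "Lovelace"},
        {"FirstName": "Alan", "LastName": "Turing"},
    ]

def transform(rows):
    # Transform in transit: combine FirstName and LastName into one field.
    return [{"FullName": f"{r['FirstName']} {r['LastName']}"} for r in rows]

def load(rows, out):
    # Load: write the finished rows to a text/CSV destination.
    writer = csv.DictWriter(out, fieldnames=["FullName"])
    writer.writeheader()
    writer.writerows(rows)

destination = io.StringIO()
load(transform(extract()), destination)
```

The same source-to-destination pipeline in SSIS is built by wiring components together, with no code at all.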
In this article I will introduce you to just a few transformations that would have required many lines of VBScript code in DTS. I will demonstrate how to pull data from multiple sources and combine it with the Union All transformation, then eliminate duplicates with the Aggregate transformation, perform a Lookup at the destination to determine which rows should be added, and finally send the data to multiple destinations with the Multicast.
First, an introduction to the development environment. If you are familiar with DTS, you will notice some dramatic changes to the entire environment and architecture. You no longer open Enterprise Manager to develop your ETL packages; you launch SQL Server Business Intelligence Development Studio. Another major architectural change: you do NOT need a connection to a SQL Server instance on your network to develop packages.
For those who develop on Microsoft technologies using Visual Studio, you will feel right at home with the Toolbox, Solution Explorer, Server Explorer, and the development workbench area; you are using Visual Studio 2005 to develop your ETL packages. For those who do not develop code, or are not familiar with a programming language, don't worry: most of the projects you will encounter can be solved without writing a single line of code. As you will see, much can be accomplished by setting properties of the transformations and tasks already available out of the box.
The ETL package I will create pulls data from multiple SQL Servers, brings the records together, and removes duplicates. It then performs a lookup at the destination to determine whether each record already exists: if not, the record is prepared for an INSERT; if it does exist, the record is loaded into a separate table for logging purposes. And for documentation purposes, I want the records that were inserted logged in their own audit table, so I can track exactly what is inserted and when.
The first step is to create a Data Flow Task by dragging its icon from the Toolbox onto the workbench, the 'Control Flow' tab area. Double-click the Data Flow Task to open the 'Data Flow' tab. Here is where we define our data sources, our transformations, and the destinations for the completed data.
From the Toolbox we click and drag three OLE DB Source icons onto the Data Flow work area. These three sources represent three different SQL Servers on our network. Of course, SSIS is not limited to Microsoft data sources; anything ODBC or OLE DB compliant, or even text file sources, can be used with ease.
Now we configure each data source to point at a specific server. Double-click a source icon and specify the details of the connection. Here we are connecting to a SQL Server named DEVLAB3 and a database named AdventureWorks, and we are executing a SQL command to pull our data. In this case we are selecting from the Employee table where any record was modified on or after 1/1/2007.
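As a sketch, the SQL command for that source might look like the following. The ModifiedDate column name is an assumption about the Employee table's schema, made here only for illustration; substitute whatever last-modified column your table actually has.

```python
# Hypothetical source query for the OLE DB Source: pull Employee rows
# modified on or after 1/1/2007 (column name assumed for illustration).
source_query = """
SELECT *
FROM Employee
WHERE ModifiedDate >= '20070101'
"""
```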
Each of the three sources contains different (and sometimes the same) employee records. The next step is to bring all of these records together into one location, where they can be massaged, transformed, and prepared for the destination. So here we create a 'Union All' transformation to combine the data from the three separate sources.
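Conceptually, Union All just concatenates the row streams. A rough Python equivalent, with made-up EmployeeID values, looks like this:

```python
# Rough equivalent of the Union All transformation: the rows from all
# three sources are concatenated into a single stream. Duplicates are
# NOT removed at this stage, just as in SSIS.
source_a = [{"EmployeeID": 1}, {"EmployeeID": 2}]
source_b = [{"EmployeeID": 2}, {"EmployeeID": 3}]
source_c = [{"EmployeeID": 4}]

union_all = source_a + source_b + source_c
```

Note that EmployeeID 2 now appears twice in the combined stream; dealing with that is the next step.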
Next we need to remove duplicates; unfortunately, the same employee record can exist in multiple data sources in this environment. Here we want to eliminate the chance of loading duplicates into the destination by creating an Aggregate Transformation. Here is a look at the editor; notice the 'Group By' operation.
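Grouping by every column has the effect of collapsing identical rows into one. A small sketch of that behavior, with invented rows:

```python
# Rough equivalent of the Aggregate transformation with 'Group By' on
# every column: identical rows collapse into one, removing duplicates.
rows = [
    {"EmployeeID": 2, "LastName": "Smith"},
    {"EmployeeID": 2, "LastName": "Smith"},  # same record from another source
    {"EmployeeID": 3, "LastName": "Jones"},
]

seen = set()
deduped = []
for row in rows:
    key = tuple(sorted(row.items()))  # the group-by key: all columns
    if key not in seen:
        seen.add(key)
        deduped.append(row)
```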
Now that the duplicates have been removed from our data, we need to determine which records are to be loaded: which are to be INSERTED, and which are to be UPDATED because they already exist. This is performed with a Lookup Transformation.
A data source, or Connection Manager, must be defined on the Reference Table tab, in this case pointing to the actual destination table. Instead of referencing the entire table, I use the query SELECT EmployeeID FROM Employee against the destination to eliminate all unnecessary columns. In effect, all we need to check is the EmployeeID field: if that EmployeeID already exists, we do not want to insert the record. As defined in the Lookup Transformation Editor, an Output Alias is created. This is an extra field passed to the next transformation downstream to determine whether each individual record exists at the destination.
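The lookup's effect can be sketched like this (the IDs below are invented; note that the real Lookup component must be configured to ignore lookup failures rather than fail, so that unmatched rows flow through with a NULL alias):

```python
# Rough equivalent of the Lookup transformation: match each incoming row
# against the destination's EmployeeID list and attach the Output Alias
# (EmployeeID_Lookup), which stays None/NULL when there is no match.
destination_ids = {2, 3}  # result of: SELECT EmployeeID FROM Employee

incoming = [{"EmployeeID": 1}, {"EmployeeID": 2}]
for row in incoming:
    row["EmployeeID_Lookup"] = (
        row["EmployeeID"] if row["EmployeeID"] in destination_ids else None
    )
```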
Now that the lookup has taken place, we evaluate the Output Alias from above and check for a NULL value. If EmployeeID_Lookup is NULL, the record is loaded into the destination table; NULL means the record did not exist at the destination when the lookup was performed. If the value is not NULL, meaning the record did exist, we want to do something else with those records. For this, the appropriately named 'Conditional Split Transformation' is used.
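In effect, the split routes each row down one of two paths based on that NULL check. A minimal sketch, continuing the invented rows from the lookup example:

```python
# Rough equivalent of the Conditional Split: rows where the Output Alias
# is NULL (None here) go to the insert path; the rest go to a separate
# output for records that already exist at the destination.
rows = [
    {"EmployeeID": 1, "EmployeeID_Lookup": None},  # new at destination
    {"EmployeeID": 2, "EmployeeID_Lookup": 2},     # already exists
]

to_insert = [r for r in rows if r["EmployeeID_Lookup"] is None]
already_exists = [r for r in rows if r["EmployeeID_Lookup"] is not None]
```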
So our ETL package has gone through the Extract and Transform stages, and now we are ready to Load to the destination. The package so far consists of three data sources brought together by a Union All Transformation; the Aggregate then removes any dupes, followed by a Lookup to determine 'do I exist or not,' with a Conditional Split to identify the records for loading.
The next transformation used is the Multicast, which takes a single input and sends it to as many different destinations as necessary. In the example here, I take the records from the Conditional Split Transformation where EmployeeID_Lookup is NULL and send only those to the Multicast Transformation; where the records are NOT NULL, I send them to their own destination. The purpose of using the Multicast in my example is to load one copy into an ETLLog table on my destination server, DEVLAB1, while the other copy goes to the production table, AdventureWorks.Employee.
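The Multicast's behavior boils down to delivering identical copies of each row to every attached output. A small sketch, where two plain lists stand in for the two destination tables:

```python
# Rough equivalent of the Multicast transformation: one input stream,
# identical copies delivered to several destinations. The two lists
# stand in for the ETLLog audit table and the production Employee table.
import copy

new_rows = [{"EmployeeID": 1, "LastName": "Smith"}]

etl_log = []         # audit copy (the ETLLog table on DEVLAB1)
employee_table = []  # production copy (AdventureWorks.Employee)

for row in new_rows:
    etl_log.append(copy.deepcopy(row))         # copy 1: audit trail
    employee_table.append(copy.deepcopy(row))  # copy 2: production load
```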
The purpose of this article is to show not only the power of SSIS, but its user-friendliness. We created a fairly complex package of a kind that is very common in production environments today, and without a single line of VBScript or JScript, our package is doing some pretty amazing stuff.
For those DTS users out there, I know you can appreciate how much development time SSIS will save you. If you have SQL 2005 in your environment, this truly enterprise-class tool is just a click away.