SSIS Best Practices | Microsoft Sql Server | Oracle Database
This document has leveraged best practices from multiple sources including technical papers, blogs and experts in the to...
SSIS – Best Practices Introduction About This Document This document outlines some best practices for developing ETL solutions in SQL Server Integration Services (SSIS). This document has leveraged best practices from multiple sources including technical papers, blogs and experts in the tool. The purpose of establishing best practices will provide the following benefits: Reusable and flexible ETL architecture solution Solutions to common problems Tips and tricks
Audience The primary audience for this document is developers who will be involved in the detailed design and implementation of the ETL solution. Secondary audiences are the architects who will be involved in validating the guiding principles used in the design and implementation. Pre-Requisites The audience should be familiar with SSIS. Scope The scope of this document covers the various best practices to be followed when developing ETL solutions in SSIS. This document does not include: Introduction to ETL, Business Intelligence or Data Warehouse concepts Best practices for an ETL solution (e.g. architecture, logging, auditing etc.) Introduction to SSIS How to use a specific component of SSIS Package Versions This document uses: Microsoft SQL Server Integration Services Designer version 9.00.3033.00 Microsoft Visual Studio 2005 Version 8.0.50727.762 Microsoft .NET Framework Version 2.0.50727 SP1 Installed Edition: IDE Standard Microsoft Visual Studio 2005 Premier Partner Edition - ENU Service Pack 1 (KB926601)
Architecture and Setup Master-Child Packages In most cases, the problem is large enough to separate the solution into multiple packages based on a logical property or module (e.g. dimensions and facts, line of business, helper packages etc.). Having multiple packages will also be efficient as developers can work in parallel. The master package should have minimal to none business extraction logic. Its main purpose is to call
other packages based on dependencies of the data (dimensions before facts). The master package would also be responsible for cycle run date range, auditing, global variable values, database connection strings, error recovery etc. to pass to the child packages. Master package example below:
Templates Use templates to standardize package design, configuration, event handling and auditing. An important item to remember when using a template package to create a new package is to generate a new ID for the package. A new ID can be generated from the Properties window of the package. The property is under Identification->ID->Generate New ID. SSIS may not log the package execution correctly if they are sharing the same IDs. To read more about the benefits of using templates, visit http://blogs.conchango.com/jamiethomson/archive/2007/03/11/SSIS_3A00_-Package-Template.aspx
Configurations Package Configurations Package configurations are a mechanism for dynamically changing properties of your IS objects and components at run-time using values that are stored externally to the package. Properties for variables (e.g. values, name, read-only, expression etc.), event handlers (e.g. error handling, system error variables etc.), connection managers (connection string, security parameters etc.), logging (e.g. events, description) and other general (e.g. checkpoint usage, failure recovery, version control, access etc.) are easily configurable in SSIS by using the package configuration wizard. During run-time, SSIS applies the configurations to the package based on the method and its order created in the wizard. SSIS provides a number of methods to store and pass these values to a package:
Parent Package Variable Each method should be used as described below. SQL Server This is a form of a direct configuration method. It is the recommended way to store most of your package configuration values. SSIS stores the values in a table with a default schema at a package, property and value grain. The developer has flexibility and control over the configurations as it can easily modify the configuration values prior to run-time (e.g. date range for which the package should run for). The package itself has control over the configurations during run-time as it can easily modify the property values (e.g. the global error count variable whenever an error is encountered by any packages in the solution) with an update to the configuration table. Use a variable to store the connection string to the configuration table database for a connection manager. You would then only need to update this variable value by using an indirect configuration method to specify the table connection properties. By storing values in a SQL Server table, we have a central repository to review and modify variable values and package properties without performing the tedious task of opening the package, modifying the values or property within the package and re-deploying the package. XML File An effective direct method for storing configurations as well but more complicated to modify configurations during run-time. Each package of a solution can have its configurations in one file allowing a central location to change package properties. One can also use the XML configuration file as an indirect configuration (storing the direct configuration table connection value) by setting the file location as a static configuration entry in the package configuration wizard. You will need to ensure that when deploying the solution to different environments, you must deploy the configuration file in the same structure in those environments. Environment Variable Leverage indirect configuration by using it to store the XML configuration file location or the SQL Server
connection of the configuration table. This will allow you to have and easily switch between multiple configuration sets by changing the environment variable value rather than modifying it at the package level and re-building the package. Also, when deploying solutions to different environments, only the environment variable would need to be changed to identify the new location or database connection for the direct configuration. Registry Settings Similar usage as an environment variable except the direct configuration value is set in a registry key. Parent Package Variable Use it to pass global variable values from the parent package to the child package. Use it to pass database connection string values, cycle run ID, audit data and other useful information across packages. Ensure that you create a variable with the same data type in the child package else data value compatibility errors may arise. Configurable Packages As read above, SSIS allows multiple ways to dynamically configure packages. This is a powerful feature that can be used in multiple ways to make a package extensible and flexible. One should store most of the values of variables used and expression property of a variable in the SQL Server configuration table. Additionally, store the source queries used to extract data in a variable and leverage the use of expressions to dynamically generate a query for a set of criteria. An example is provided in the DynamicQuery package of the SSIS_Best_Practices solution. Recommended Reading SSIS Package Configuration in SQL Server 2005 http://www.sql-server-performance.com/articles/dba/package_configuration_2005_p1.aspx
Connection Managers For an overview of the plethora of connection managers available in SSIS, review the MS Connection Manager white paper.
Native Providers Experience tells us to use native providers. However, in the world of SSIS, native providers are usually not the most efficient method to use. The white paper ‘SSIS 2008 Connectivity Options’ highlights some of the third-party options available. You may need to install additional client connectors to effectively connect to a database from SSIS (view the frequent error you may receive to connect to an Oracle DB below). Oracle Oracle’s own provider for an Oracle DB is quite slow and inefficient for extracting data. The better option is using the SSIS’ provider with a few performance tweaks to the registry of the client that will host the packages. In the registry structure - HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\OLEDB, change the values of the following keys: FetchSize to ‘200’, CacheType to ‘File’, PLSQLRSet to‘1’. For writing to an Oracle DB, the provider from Persistent has great reviews and is supposedly 100 times faster than other providers. Visit
http://www.persistentsys.com/ServiceOfferings/TimeToMarketAccelerators/Connectors/SSISOracleBulk LoadConnector/tabid/195/Default.aspx for more information. Microsoft has released new Oracle providers (included in SSIS 2008) which are supposedly much faster and efficient than their previous counterparts. You can download them here: http://www.microsoft.com/downloads/details.aspx?FamilyID=d9cb21fe-32e9-4d34-a3816f9231d84f1e&DisplayLang=en A frequent error received while connecting to an Oracle DB from SSIS is “client and networking components were not found”. The fix for this is to install the Oracle client for the machine you are running SSIS from. This will install the correct libraries and drivers needed to connect to an Oracle source. Visit http://www.oracle.com/technology/software/tech/oci/instantclient/index.html to install the correct client. Recommended Reading Connectivity and SQL Server 2005 Integration Services http://technet.microsoft.com/en-us/library/bb332055.aspx How to Choose the Best Connectors for SSIS http://www.sqlservercentral.com/articles/SQL+Server+2005+-+SSIS/3209/ SSIS Data Sources Categories and Compatibility (highly recommended) http://ssis.wik.is/Data_Sources
Data Flow Task The Data Flow Task is the most useful and important task. It facilitates the movement of data between sources to destinations allowing complex transformations to take place within the flow. Eliminating unnecessary work that the data flow task does is the single most important method to improve its performance. Using the highly recommended reference, http://www.simple-talk.com/sql/sql-server-2005/sql-server2005-ssis-tuning-the-dataflow-task/, I have categorized the performance improvement tips below. Do note that I have added or removed some tips based on experience. General Increase DefaultBufferMaxSize and DefaultBufferMaxRows - Increasing the values for these two properties can boost performance by decreasing the number of buffers moving through the data flow. However, you should avoid increasing the values too much to the point where the Execution Engine starts swapping out buffers to disk. That would defeat the purpose. Use match indexes for repeat data cleansing sessions- When the package runs again, the transformation can either use an existing match index or create a new index. If the reference table is static, the package can avoid the potentially expensive process of rebuilding the index for repeat sessions of data cleaning. If you choose to use an existing index, the index is created the first time that the package runs. If multiple Fuzzy Lookup transformations use the same reference table, they can all use the same index. To reuse the index, the lookup operations must be identical; the lookup must use the same columns. You can name the index and select the connection to the SQL Server database that saves the index. Implement Parallel Execution - Both the Execution Engine for the Dataflow Task and the
Execution Engine for the Control Flow are multithreaded. o Use EngineThreads Property – controls the number of worker threads the Execution Engine will use. The default for this property is 5. However, as you now know, by simply adding a few components, data flow thread requirements will quickly exceed the default. Be aware of how many threads the data flow naturally needs and try to keep the EngineThreads value reasonably close to it. o Set MaxConcurrentExecutables – if there are multiple Dataflow Tasks in the Control Flow, say 10 and MaxConcurrentExecutables is set to 4, only four of the Dataflow Tasks will execute simultaneously. Set, test and measure various value combinations of this property and the EngineThreads property to determine the optimal setting for your packages. Data Flow Sources Other online links would urge you to use the DataReader Source as opposed to the OLE DB Source for performance improvements. However, experience suggests that no credible improvement is seen and one should use the OLE DB Source as the flexibility it provides to move between environments and servers outweighs the negligible performance degradation. Remove unnecessary columns - unneeded columns are columns that never get referenced in the data flow. The Execution Engine emits warnings for unused columns, so they are easy to identify. This makes the buffer rows narrower. The narrower the row, the more rows that can fit into one buffer and the more efficiently the rows can be processed. o NOTE: Buffers are objects with associated chunks of memory that contain the data to be transformed. As data flows through the Dataflow Task, it lives in a buffer from the time that the source adapter reads it until the time that the destination adapter writes it. Binary Large Objects (BLOBs) are particularly burdensome to the Dataflow Task and should be eliminated if at all possible. Use the queries in the source adapters to eliminate unnecessary columns. Use a SQL select statement to retrieve data from a view- avoid using the “Table or view” access mode in the OLE DB Source Adapter. It is not as fast as using a SELECT statement because the adapter opens a rowset-based on the table or view. Then it calls OpenRowset in the validation phase to retrieve column metadata, and later in the execution phase to read out the data. o Testing has shown that using a SELECT statement can be at least an order of magnitude faster, because the adapter issues the specified command directly through the provider and fetches the data using sp_prepare without executing the command, avoiding the extra roundtrip and a possibly inappropriate cached query plan. Optimize source queries - using traditional query optimization techniques optimize the source adapter SQL query. SSIS doesn't optimize the query on your behalf, but passes it on verbatim. Set based operations - when possible, perform set based operations in SQL Server. For example, SQL Server can generally sort faster than the sort transform, especially if the table being sorted is indexed. Set based operations such as joins, unions, and selects with ORDER BY and GROUP BY tend to be faster on the server. If you know that data in a source is sorted, set IsSorted=TRUE on the source adapter output. This may save unnecessary SORTs later in the pipeline which can be expensive. Setting this value does not perform a sort operation; it only indicates that the data it sorted. If you need a dynamic SQL statement in an OLE DB Source component, set AccessMode="SQL Command from variable" and build the SQL statement in a variable that has EvaluateAsExpression=TRUE. If you can, filter your data in the Source Adapter rather than filter the data using a Conditional
Split transform component. This will make your data flow perform quicker. Break out all tasks requiring the Jet engine (Excel or Access data sources) into their own packages that do nothing but that data flow task. Load the data into Staging tables if necessary. This will ensure that solutions can be migrated to 64bit with no rework as there is no 64bit Jet driver.
Flat File and Other File Sources Retrieving data from file sources presents its own set of performance challenges because the data is typically in some format that requires conversion. For example, the Jet Provider only supports a limited set of data types when reading Excel files and flat file data are always of type string until converted. Here are a few hints on how to eliminate unnecessary data flow work: Combine Adjacent Unneeded Flat File Columns – to eliminate unnecessary parsing Leave Unneeded Flat File Columns as Strings – don't convert them to dates etc. unless absolutely necessary Eliminate Hidden Operations – mostly the Dataflow Task is explicit about what it is doing. However, there are some components that perform hidden or automatic conversions. For example, the Flat File Source Adapter will attempt to convert external column types to their associated output column types. Use the Advanced Editor to explore each column type so that you know where such conversions occur. Only Parse or Convert Columns When Necessary – Reorganize the data flow to eliminate the Type Conversion Transform if possible. Even better, if possible, modify the source column data type to match the type needed in the data flow. Use the FastParse Option in Flat File Source – Fastparse is a set of optimized parsing routines that replace some of the SSIS locale-specific parsing functions. Eliminate Unneeded Logging – logging is useful for debugging and troubleshooting but, when deploying completed packages to production, be mindful and careful about the log entries you leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat verbose. When using raw files and your Raw File Source Component and Raw File Destination Component are in the same package, configure your Raw File Source and Raw File Destination to get the name of the raw file from a variable. This will avoid hardcoding the name of the raw file into the two seperate components and running the risk that one may change and not the other. Variables that contain the name of a raw file should be set using an expression. This will allow you to include the value returned from System::PackageName in the raw file name. This will allow you to easily identify which package a raw file is to be used by. o Note: This approach will only work if the Raw File Source Component and Raw File Destination Component are in the same package. Data Flow Transformations Be mindful of transforms with internal File IO - Some of the stock dataflow transforms perform internal file Input/Output. For example, the Raw Source and Destination, Import/Export Column Transforms, Flat File Source and Destination and Excel File Source and Destination Adapters are all directly impacted by the performance of the file system. File IO isn't always a bottleneck, but when combined with low memory conditions, causing spooling or with other disk intense operations, it can significantly impact performance. Components that read and write to disk should be scrutinized carefully, and if possible, configured to read and write to dedicated hard drives. Look for ways to optimize the performance of the hard drives using RAID, defragmentation and/or correct partitioning.
Monitor memory intensive transforms - if your package is memory bound, look for ways to eliminate the memory intensive transforms or shift them to another package. Some transforms such as the Aggregate, Lookup and Sort use a lot of memory. The Sort, for example, holds all buffers until the last buffer and then releases the sorted rows. If memory runs low, these transforms may spool to disk, causing expensive hard page faults. o Monitor other memory intensive applications - when running on the same machine as other memory intensive applications, the data flow can become memory starved, even if there is plenty of memory on the machine. This is typically true when running packages on the same machine with SQL Server. SQL Server is aggressive about using memory. You can use the sp_configure system stored procedure instruct SQL Server to limit its memory usage.
Lookup Transformation Pare down the Lookup reference data - the default lookup query for the Lookup Transform is SELECT * FROM TABLE(S). Instead, select the option to use the results of a query for the reference data. Generally, the reference data should only contain the key and the desired lookup column. So, for a dimension table lookup, that would be the natural key and the surrogate key. Use Lookup Partial or Full Cache mode - depending on the requirements and the data, you should choose one of these two modes to speed up the Lookup. Partial cache mode is useful when the incoming data is repetitive and only references a small percentage of the total reference table. Full cache mode is useful when the reference table is relatively small and the incoming data references the full spectrum of reference table rows. LOOKUP components will generally work quicker than MERGE JOIN components where the 2 can be used for the same task (Case study: http://blogs.conchango.com/jamiethomson/archive/2005/10/21/2289.aspx) Slowly Changing Dimension (SCD) Transformation It is interesting that SSIS does provide an out-of-the-box solution for SCDs however for most practical purposes, the performance is extremely poor. Some other cons of this transform are: o It can be slow when applied to large dimensions, i.e. anything over 1,000 rows. o You can’t configure the implied comparison/lookup between the two tables. It is strictly a .NET comparison for equality. That means, for example, that a varchar(10) may not compare correctly to a char(10), even though to the human eye they are identical. o If the results of any part of the comparison results in a NULL, the transformation throws an exception, which may or may not cause the entire package to abort. It depends on how you have “programmed” the package. o It can’t handle any “special cases”. Like updating Oracle tables. An alternative approach to using SSIS’s SCD transformation would be to consider using an open source solution such as the Kimbal Method SSIS SCD Component (http://www.codeplex.com/kimballscd). Another alternative approach would be to use existing SSIS components and build a solution for the SCD types as shown below:
Data Flow Destination Other online links would urge you to use the DataReader Destination as opposed to the OLE DB Destination for performance improvements. However, experience suggests that no credible improvement is seen and one should use the OLE DB Destination as the flexibility it provides to move between environments and servers outweighs the negligible performance degradation. Set the Commit Size - The Commit Size option allows you to set a larger buffer commit size for loading into SQL Server. This setting is only available in the OLEDB Destination when using the SQL Server OLEDB driver. A setting of zero indicates that the adapter should attempt to commit all rows in a single batch. Turn on Table Lock - This option is available in the OLEDB Destination Editor. Selecting "Table Lock" also enables fast load, which tells the adapter to use the IRowsetFastload bulk insert interface for loading. o NOTE: Fastload delivers much better performance, however it does not provide as much information if there is an error. Generally, for development you should turn it off and then turn it on when deploying to production. Disable Constraints - This option is also available in the OLEDB Destination Editor by unchecking the "Check constraints" option. Trash Destination During development, it is quite useful to have a destination that consumes rows but does not necessarily provide any functionality. There is an adapter developed by Konesans (along with many other useful adapters), that does exactly that. Using it with a Data Viewer task allows you to monitor rows that could have been potentially rejected by the previous component in the data flow. You can download the adapter here: http://www.sqlis.com/post/Trash-Destination-Adapter.aspx
http://www.sqlis.com/post/Easy-Package-Configuration.aspx http://www.sqlis.com/post/Trash-Destination-Adapter.aspx http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/ http://www.whitworth.org/2009/04/14/todd-mcdermids-blog-an-alternative-to-ssiss-scd-wizard/