Sunday, 25 November 2012

IBM InfoSphere DataStage (ETL) Introduction

Hi everyone,
 
This is a deep overview of what DataStage is and its components.

What is DataStage

InfoSphere DataStage is a powerful data integration/ETL tool. It was acquired by IBM in 2005 and has become a part of the IBM Information Server platform. It uses a client/server design in which jobs are created and administered via a Windows client against a central repository on a server. IBM InfoSphere DataStage can integrate high volumes of data on demand across many data sources and target applications using a high-performance parallel framework. InfoSphere DataStage also facilitates extended metadata management and enterprise connectivity.
 

Basic Overview
It is part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition, the Enterprise Edition, and the MVS Edition.
The following are the various editions:
  • Enterprise Edition (PX): the name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
  • Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
  • MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
  • DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
  • DataStage TX: for processing complex transactions and messages, formerly known as Mercator. Now known as WebSphere Transformation Extender.
  • ISD (Information Services Director, formerly DataStage RTI): the Real Time Integration pack, which can turn server or parallel jobs into SOA services.

ETL – Understanding

Extract, transform and load (ETL) refers to a process in database usage and especially in data warehousing that involves:
  1. Extracting data from outside sources
  2. Transforming it to fit operational needs (which can include quality levels)
  3. Loading it into the end target (database, more specifically, operational data store, data mart or data warehouse)
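
To make the three steps concrete, here is a minimal sketch in plain Python (not DataStage code); the file names source.csv and target.csv and the columns name and salary are just illustrative assumptions.

```python
import csv

def extract(path):
    """Read rows from a source flat file (a CSV is assumed here)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Keep only rows that have a salary and derive an uppercase name."""
    out = []
    for row in rows:
        if row.get("salary"):                   # drop records where salary is null/empty
            row["name"] = row["name"].upper()   # a simple derived value
            out.append(row)
    return out

def load(rows, path):
    """Write the transformed rows to the target (another flat file here)."""
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract("source.csv")), "target.csv")
```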

EXTRACTION

The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent processes will go.
Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization/format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or screen-scraping. The streaming of the extracted data source and load on-the-fly to the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format which is appropriate for transformation processing.

An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check if the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
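
A hedged sketch of that parse-and-validate step, again in plain Python and assuming a made-up expected structure of three columns (roll_no, age, salary):

```python
import csv

EXPECTED_COLUMNS = {"roll_no", "age", "salary"}   # hypothetical expected structure

def extract_with_validation(path):
    """Parse a source file and split rows into accepted and rejected sets."""
    accepted, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Reject rows (entirely or in part) that do not match the expected pattern.
            if set(row) == EXPECTED_COLUMNS and row["roll_no"].isdigit():
                accepted.append(row)
            else:
                rejected.append(row)
    return accepted, rejected
```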

TRANSFORM

The transform stage applies a series of rules or functions to the data extracted from the source in order to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database (a small sketch of a few of them follows the list):
  • Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), for example roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
  • Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
  • Encoding free-form values (e.g., mapping "Male" to "1")
  • Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
  • Sorting
  • Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
  • Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
  • Generating surrogate-key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
  • Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
  • Lookup and validate the relevant data from tables or referential files for slowly changing dimensions.
  • Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
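
A few of these transformation types sketched in plain Python (the gender codes and the qty, unit_price and salary columns are hypothetical; this only illustrates the ideas, not how DataStage stages implement them):

```python
# Hypothetical lookup: translate coded gender values (1 -> M, 2 -> F).
GENDER_CODES = {"1": "M", "2": "F"}

def transform_row(row):
    """Apply a handful of the transformation types listed above to one record."""
    if not row.get("salary"):          # selection: ignore records where salary is null/empty
        return None
    return {
        "roll_no": row["roll_no"],     # column selection: keep only the columns we need
        "gender": GENDER_CODES.get(row["gender"], "U"),               # code translation with a default
        "sale_amount": float(row["qty"]) * float(row["unit_price"]),  # derived calculated value
    }

rows = [
    {"roll_no": "1", "gender": "1", "qty": "2", "unit_price": "9.5", "salary": "100"},
    {"roll_no": "2", "gender": "2", "qty": "1", "unit_price": "4.0", "salary": ""},
]
transformed = [t for t in (transform_row(r) for r in rows) if t is not None]
print(transformed)   # only the first record survives, with gender translated to "M"
```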


LOAD

The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly or monthly basis. Other DWs (or even other parts of the same DW) may add new data in a historicized form, for example, hourly. To understand this, consider a DW that is required to maintain sales records of the last year. This DW will overwrite any data that is older than a year with newer data; however, the entry of data for any one-year window will be made in a historicized manner. The timing and scope of replacing or appending data are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW.
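
A rough sketch of the overwrite-versus-historicized-append choice, using in-memory Python dictionaries rather than a real warehouse table (the business key roll_no is an assumption):

```python
from datetime import datetime

warehouse = {}   # current snapshot of the target, keyed by a business key
history = []     # historicized rows, each stamped with its load time

def load_overwrite(rows, key="roll_no"):
    """Overwrite the existing record for each key with the newest data."""
    for row in rows:
        warehouse[row[key]] = row

def load_append_historicized(rows):
    """Append every incoming row with a load timestamp, keeping full history."""
    stamp = datetime.now().isoformat()
    for row in rows:
        history.append({**row, "loaded_at": stamp})
```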

DataStage – Deep Dive

DataStage is actually two separate things.

• In production (and, of course, in development and test environments) DataStage is just another application on the server, an application which connects to data sources and targets and processes ("transforms") the data as they move through the application. Therefore DataStage is classed as an "ETL tool", the initials standing for extract, transform and load respectively. DataStage "jobs", as they are known, can execute on a single server or on multiple machines in a cluster or grid environment. Like all applications, DataStage jobs consume resources: CPU, memory, disk space, I/O bandwidth and network bandwidth.


• DataStage also has a set of Windows-based graphical tools that allow ETL processes to be designed, the metadata associated with them managed, and the ETL processes monitored. These client tools connect to the DataStage server because all of the design information and metadata are stored on the server. On the DataStage server, work is organized into one or more "projects". There are also two DataStage engines, the "server engine" and the "parallel engine".


• The server engine is located in a directory called DSEngine whose location is recorded in a hidden file called /.dshome (that is, a hidden file called .dshome in the root directory) and/or as the value of the environment variable DSHOME. (On Windows-based DataStage servers the folder name is Engine, not DSEngine, and its location is recorded in the Windows registry rather than in /.dshome.)

• The parallel engine is located in a sibling directory called PXEngine whose location is recorded in the environment variable APT_ORCHHOME and/or in the environment variable PXHOME.
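
As a hedged illustration, a small administrative script could locate both engines on a UNIX server using exactly those conventions (the DSHOME, APT_ORCHHOME and PXHOME variables and the hidden /.dshome file); this is only a sketch, not an IBM-supplied utility:

```python
import os

def find_server_engine():
    """Return the server engine (DSEngine) directory, if it can be located."""
    home = os.environ.get("DSHOME")            # first preference: the DSHOME variable
    if not home and os.path.exists("/.dshome"):
        with open("/.dshome") as f:            # fall back to the hidden /.dshome file
            home = f.read().strip()
    return home

def find_parallel_engine():
    """Return the parallel engine (PXEngine) directory, if it can be located."""
    return os.environ.get("APT_ORCHHOME") or os.environ.get("PXHOME")

print("Server engine:", find_server_engine())
print("Parallel engine:", find_parallel_engine())
```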


DataStage Engines

The server engine is the original DataStage engine and, as its name suggests, is restricted to running jobs on the server. The parallel engine results from the acquisition of Orchestrate, a parallel execution technology developed by Torrent Systems, in 2003. This technology enables work (and data) to be distributed over multiple logical "processing nodes", whether these are in a single machine or in multiple machines in a cluster or grid configuration. It also allows the degree of parallelism to be changed without any change to the design of the job.
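
The parallel engine's internals are not shown here, but the idea of spreading work over logical processing nodes, with the degree of parallelism changed without touching the job design, can be illustrated with a small Python analogy (NODE_COUNT, the key/value columns and the hash partitioning are all assumptions of the sketch, not Orchestrate itself):

```python
from multiprocessing import Pool

NODE_COUNT = 4   # "degree of parallelism": change this without touching the job logic below

def partition(rows, nodes):
    """Hash-partition rows across logical processing nodes by a key column."""
    buckets = [[] for _ in range(nodes)]
    for row in rows:
        buckets[hash(row["key"]) % nodes].append(row)
    return buckets

def job(bucket):
    """The per-node 'job' logic; identical regardless of how many nodes run it."""
    return sum(row["value"] for row in bucket)

if __name__ == "__main__":
    rows = [{"key": str(i), "value": i} for i in range(1000)]
    with Pool(NODE_COUNT) as pool:
        partials = pool.map(job, partition(rows, NODE_COUNT))
    print(sum(partials))   # 499500, whatever NODE_COUNT is set to
```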
 
Design-Time Architecture

At its simplest, there is a DataStage server and a local area network to which one or more DataStage client machines may be connected. When clients are remote from the server, a wide area network may be used, or some form of tunnelling technology (such as Citrix MetaFrame) may be used instead.

DataStage Client/Server Connectivity

Connection from a DataStage client to a DataStage server is managed through a mechanism based upon the UNIX remote procedure call mechanism. DataStage uses a proprietary protocol called DataStage RPC, which consists of an RPC daemon (dsrpcd) listening on TCP port number 31538 for connection requests from DataStage clients. Before dsrpcd gets involved, the connection request goes through an authentication process. Prior to version 8.0, this was the standard operating system authentication based on a supplied user ID and password (an option existed on Windows-based DataStage servers to authenticate using Windows LAN Manager, supplying the same credentials as those used on the DataStage client machine; this option was removed in version 8.0). With effect from version 8.0, authentication is handled by the Information Server through its login and security service.
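
A hedged sketch of a basic connectivity check against that dsrpcd listener, just a TCP connect to port 31538 from plain Python; the host name is a placeholder, and this says nothing about the authentication step described above:

```python
import socket

DSRPC_PORT = 31538           # default dsrpcd port mentioned above
HOST = "datastage-server"    # placeholder host name

def dsrpcd_reachable(host, port=DSRPC_PORT, timeout=5):
    """Return True if a TCP connection to the dsrpcd port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(dsrpcd_reachable(HOST))
```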

Understanding tiers and components

You install IBM InfoSphere Information Server product modules in logical tiers. A tier is a logical group of components within InfoSphere Information Server and the computers on which those components are installed. Each tier includes a subgroup of the components that make up the InfoSphere Information Server product modules. The tiers provide services, job execution, and metadata and other data storage for your product modules.

Tier Descriptions

Client

The client programs and consoles that are used for development and administration and the computers where they are installed.

Engine 

The logical group of components (the InfoSphere Information Server engine components, service agents, and so on) and the computer where those components are installed. The engine runs jobs and other tasks for product modules.

Services 

The application server, common services, and product services for the suite and product modules and the computer where those components are installed. The services tier provides common services (such as metadata and logging) and services that are specific to certain product modules. On this tier, IBM WebSphere Application Server hosts the services. The services tier also hosts InfoSphere Information Server applications that are web-based.

Metadata repository

The metadata repository and, if installed, the IBM InfoSphere Information Analyzer analysis database, and the computer where these components are installed. The metadata repository database contains the shared metadata, data, and configuration information for InfoSphere Information Server product modules. The analysis database stores extended analysis data for InfoSphere Information Analyzer.

To install InfoSphere Information Server, you first design the hardware and software topology for each tier. Your installation can range from simple single-computer configurations, to multiple-computer highly available configurations, to complex and powerful clusters with many computers. You optimize your topology design for the product modules that you install and your requirements for high availability, performance and throughput, and user concurrency.

“Effort only fully releases its reward after a person refuses to quit.”
Regards,
Akhilesh B. Humbe
