Hi everyone,
This is a deep overview of what DataStage is and its components.
What is DataStage
InfoSphere DataStage is a powerful data integration/ETL tool. It was
acquired by IBM in 2005 and has become a part of the IBM Information Server
platform. It uses a client/server design where jobs are created and
administered via a Windows client against a central repository on a server. IBM InfoSphere DataStage can
integrate data on demand across multiple, high-volume data sources and
target applications using a high-performance parallel framework. InfoSphere
DataStage also facilitates extended metadata management and enterprise
connectivity.
Basic Overview
It is part of the IBM Information
Platforms Solutions suite and IBM InfoSphere. It uses a graphical
notation to construct data integration solutions and is available in various
versions such as the Server Edition, the Enterprise Edition, and the MVS
Edition.
Following are the various editions:
- Enterprise Edition (PX): a name given to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.
- Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.
- MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.
- DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arrangement with PeopleSoft and Oracle Corporation.
- DataStage TX: for processing complex transactions and messages, formerly known as Mercator. Now known as WebSphere Transformation Extender.
- ISD (Information Services Director, formerly DataStage RTI): a Real Time Integration pack that can turn server or parallel jobs into SOA services.
Understanding ETL
Extract, transform and load (ETL) refers to a process in database usage and
especially in data warehousing that involves:
- Extracting data from outside sources
- Transforming it to fit operational needs (which can include quality levels)
- Loading it into the end target (database, more specifically, operational data store, data mart or data warehouse)
EXTRACTION
The first part of an ETL process involves extracting the data from the
source systems. In many cases this is the most challenging aspect of ETL, as
extracting data correctly will set the stage for how subsequent processes will
go.
Most data warehousing projects consolidate data from different source
systems. Each separate system may also use a different data
organization/format. Common data source formats are relational databases and
flat files, but may include non-relational database structures such as
Information Management System (IMS) or other data structures such as Virtual
Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or
even fetching from outside sources such as through web spidering or
screen-scraping. Streaming the extracted data and loading it on the fly into
the destination database is another way of performing ETL when no
intermediate data storage is required. In general, the goal of the extraction
phase is to convert the data into a single format which is appropriate for
transformation processing.
An intrinsic part of the extraction involves parsing the extracted data to
check whether it meets an expected pattern or structure. If not, the data may
be rejected entirely or in part.
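As a hedged illustration of this parsing step (a Python sketch, not DataStage itself), the following reads a hypothetical flat file and rejects, entirely or in part, any record that does not match the expected structure; the file name, column names and pattern are assumptions made for the example.

```python
import csv
import re

# Hypothetical flat-file source with the layout: roll_no,age,salary
EXPECTED_COLUMNS = ["roll_no", "age", "salary"]
ROLL_NO_PATTERN = re.compile(r"^\d+$")  # assume roll_no must be purely numeric

def extract(path):
    """Read the source file, keeping rows that match the expected
    pattern/structure and setting aside the rest."""
    accepted, rejected = [], []
    with open(path, newline="") as src:
        reader = csv.DictReader(src)
        if reader.fieldnames != EXPECTED_COLUMNS:
            # Rejected entirely: the whole file fails the structure check
            raise ValueError("source does not match the expected structure")
        for row in reader:
            if ROLL_NO_PATTERN.match(row["roll_no"] or ""):
                accepted.append(row)
            else:
                rejected.append(row)  # rejected in part: only the bad rows
    return accepted, rejected

# Usage, assuming an employees.csv file exists next to the script:
# good_rows, bad_rows = extract("employees.csv")
```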
TRANSFORM
The transform stage applies a series of rules or functions to the data
extracted from the source to derive the data for loading into the end
target. Some data sources will require very little or even no manipulation of
data. In other cases, one or more of the following transformation types may be
required to meet the business and technical needs of the target database:
- Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called attributes), for example roll_no, age, and salary, then the extraction may take only roll_no and salary. Similarly, the extraction mechanism may ignore all those records where salary is not present (salary = null).
- Translating coded values (e.g., if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female)
- Encoding free-form values (e.g., mapping "Male" to "1")
- Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
- Sorting
- Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
- Aggregation (for example, rollup — summarizing multiple rows of data — total sales for each store, and for each region, etc.)
- Generating surrogate-key values
- Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
- Splitting a column into multiple columns (e.g., putting a comma-separated list specified as a string in one column as individual values in different columns)
- Disaggregation of repeating columns into a separate detail table (e.g., moving a series of addresses in one record into single addresses in a set of records in a linked address table)
- Looking up and validating the relevant data from tables or referential files for slowly changing dimensions.
The transform stage may also apply any form of simple or
complex data validation. If validation fails, it may result in a full, partial
or no rejection of the data, and thus none, some or all the data is handed over
to the next step, depending on the rule design and exception handling. Many of
the above transformations may result in exceptions, for example, when a code
translation parses an unknown code in the extracted data.
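To make a few of these transformation types concrete, here is a minimal Python sketch (an illustration only, not DataStage code) that selects columns, translates a coded value, derives a calculated value and hands rows that fail validation to an exception list. The field and code names reuse the examples above or are assumed for the sketch.

```python
GENDER_CODES = {"1": "M", "2": "F"}  # source stores 1/2, warehouse stores M/F

def transform(rows):
    """Apply column selection, code translation, a derived value and
    simple validation; rows that fail are collected as exceptions."""
    transformed, exceptions = [], []
    for row in rows:
        # Validation: ignore records where salary is not present (salary = null)
        if not row.get("salary"):
            exceptions.append(row)
            continue
        gender = GENDER_CODES.get(row.get("gender_code", ""))
        if gender is None:
            # A code translation met an unknown code in the extracted data
            exceptions.append(row)
            continue
        transformed.append({
            "roll_no": row["roll_no"],                                    # selected column
            "gender": gender,                                             # translated coded value
            "sale_amount": float(row["qty"]) * float(row["unit_price"]),  # derived value
        })
    return transformed, exceptions
```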
LOAD
The load phase loads the data into the end target, usually the data
warehouse (DW). Depending on the requirements of the organization, this process
varies widely. Some data warehouses may overwrite existing information with
cumulative information; updating the extracted data is frequently done on a daily, weekly
or monthly basis. Other DWs (or even other parts of the same DW) may add new
data in a historicized form, for example, hourly. To understand this, consider
a DW that is required to maintain sales records of the last year. Then, the DW
will overwrite any data older than a year with newer data. However, the
entry of data for any one-year window will be made in a historicized manner.
The timing and scope to replace or append are strategic design choices
dependent on the time available and the business needs. More complex systems
can maintain a history and audit trail of all changes to the data loaded in the
DW.
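The two loading strategies described above (overwriting existing information versus appending data in a historicized form) can be sketched as follows; the use of SQLite and the table and column names are assumptions made purely for illustration.

```python
import sqlite3
from datetime import datetime, timezone

def load(rows, mode="append", db_path="warehouse.db"):
    """Load transformed rows into a hypothetical sales_fact table.
    mode="overwrite" replaces existing content; mode="append" keeps
    history by stamping every batch with its load time."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                        roll_no TEXT, sale_amount REAL, loaded_at TEXT)""")
    if mode == "overwrite":
        conn.execute("DELETE FROM sales_fact")          # replace existing information
    loaded_at = datetime.now(timezone.utc).isoformat()  # historicized batch timestamp
    conn.executemany(
        "INSERT INTO sales_fact (roll_no, sale_amount, loaded_at) VALUES (?, ?, ?)",
        [(r["roll_no"], r["sale_amount"], loaded_at) for r in rows],
    )
    conn.commit()
    conn.close()
```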
DataStage – Deep Dive
DataStage is actually two
separate things.
• In production (and, of
course, in development and test environments) DataStage is just another
application on the server, an application which connects to data sources and
targets and processes ("transforms") the data as they move through
the application. Therefore DataStage is
classed as an "ETL tool", the initials standing for extract,
transform and load respectively. DataStage "jobs", as they are known,
can execute on a single server or on multiple machines in a cluster or grid
environment. Like all applications, DataStage jobs consume resources: CPU,
memory, disk space, I/O bandwidth and network bandwidth.
• DataStage also has a set
of Windows-based graphical tools that allow ETL processes to be designed, the
metadata associated with them managed, and the ETL processes monitored. These
client tools connect to the DataStage server because all of the design
information and metadata are stored on the server. On the DataStage server,
work is organized into one or more "projects". There are also two
DataStage engines, the "server engine" and the
"parallel
engine".
• The server engine is
located in a directory called DSEngine whose location is recorded in a hidden
file called /.dshome (that is, a hidden file called .dshome in the root
directory) and/or as the value of the environment variable DSHOME. (On
Windows-based DataStage servers the
folder name is Engine, not DSEngine, and its location is recorded in the
Windows registry rather than in /.dshome.)
• The parallel engine is
located in a sibling directory called PXEngine whose location is recorded in
the environment variable APT_ORCHHOME and/or in the environment variable
PXHOME.
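The locations mentioned above can be read programmatically; the following Python sketch simply restates what the text says (DSHOME or the hidden /.dshome file for the server engine, APT_ORCHHOME or PXHOME for the parallel engine) and is a convenience illustration rather than an official utility.

```python
import os

def server_engine_home():
    """Locate the DSEngine directory on a UNIX-style DataStage server via
    the DSHOME environment variable or the hidden /.dshome file."""
    home = os.environ.get("DSHOME")
    if home:
        return home
    try:
        with open("/.dshome") as f:   # hidden file in the root directory
            return f.read().strip()
    except OSError:
        return None                   # e.g. Windows servers record it in the registry

def parallel_engine_home():
    """Locate the PXEngine directory via APT_ORCHHOME or PXHOME."""
    return os.environ.get("APT_ORCHHOME") or os.environ.get("PXHOME")
```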
DataStage Engines
The server engine is the
original DataStage engine and, as its name suggests, is restricted to running
jobs on the server. The parallel engine results from the acquisition of Orchestrate,
a parallel execution technology developed by Torrent Systems, in 2003. This
technology enables work (and data) to be distributed over multiple logical
"processing nodes" whether these are in a
single machine or multiple machines in a cluster or grid configuration. It also
allows the degree of parallelism to be changed without change to the design of
the job.
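The point that the degree of parallelism can change without touching the job design can be illustrated, purely as an analogy in Python (not the parallel engine itself), by keeping the per-row logic fixed and varying only the number of worker processes:

```python
from multiprocessing import Pool

def process_row(row):
    """The 'job design': fixed per-row logic that knows nothing about parallelism."""
    return row * 2  # placeholder transformation

def run_job(rows, degree_of_parallelism):
    """The 'configuration': only this number changes when scaling out."""
    with Pool(processes=degree_of_parallelism) as pool:
        return pool.map(process_row, rows)

if __name__ == "__main__":
    data = list(range(100))
    # The same job runs unchanged on 2, 4 or more logical "processing nodes".
    print(run_job(data, degree_of_parallelism=4))
```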
Design-Time Architecture
At its simplest, there is a
DataStage server and a local area network to which one or more DataStage client
machines are connected. When clients are remote from the server, a wide area
network or some form of tunnelling protocol (such as Citrix MetaFrame) may be
used instead.
DataStage Client/Server Connectivity
Connection from a DataStage
client to a DataStage server is managed through a mechanism based upon the UNIX
remote procedure call mechanism. DataStage uses a proprietary protocol called
DataStage RPC which consists of an RPC daemon (dsrpcd) listening on TCP port
number 31538 for connection requests from DataStage clients. Before dsrpcd gets
involved, the connection request goes through an authentication process. Prior
to version 8.0, this was the standard operating system authentication based on
a supplied user ID and password (an option existed on Windows-based DataStage
servers to authenticate using Windows LAN Manager, supplying the same
credentials as those used on the DataStage client machine; this option was
removed in version 8.0). From version 8.0 onward, authentication is
handled by the Information Server through its login and security service.
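As a quick connectivity check (a hedged sketch that assumes the default dsrpcd port 31538 mentioned above and uses a placeholder host name), the following Python only tests whether the port accepts TCP connections; it does not speak the proprietary DataStage RPC protocol or perform any authentication.

```python
import socket

DSRPC_PORT = 31538  # default dsrpcd listener port

def dsrpcd_reachable(host, port=DSRPC_PORT, timeout=5.0):
    """Return True if a TCP connection to the dsrpcd port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage with a placeholder server name:
# print(dsrpcd_reachable("datastage-server.example.com"))
```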
Understanding tiers and components
You install IBM InfoSphere
Information Server product modules in logical tiers. A tier is a logical group of
components within InfoSphere Information Server and the computers on which those
components are installed. Each tier includes a subgroup of the components that
make up the InfoSphere Information Server product modules. The tiers provide
services, job execution, and metadata and other data storage for your product
modules.
The tiers are described below.
Client
The client programs and
consoles that are used for development and administration and the computers where
they are installed.
Engine
The logical group of
components (the InfoSphere Information Server engine components, service agents,
and so on) and the computer where those components are installed. The engine runs
jobs and other tasks for product modules.
Services
The application server,
common services, and product services for the suite and product modules and the
computer where those components are installed. The services tier provides common
services (such as metadata and logging) and services that are specific to certain product
modules. On this tier, IBM WebSphere Application Server hosts the services. The
services tier also hosts InfoSphere Information Server applications that are
Web-based.
Metadata repository
The metadata repository
and, if installed, the IBM InfoSphere Information Analyzer analysis database and the
computer where these components are installed. The metadata repository
database contains the shared metadata, data, and configuration information for InfoSphere
Information Server product modules. The analysis
database stores extended
analysis data for InfoSphere Information Analyzer.
To install InfoSphere
Information Server, you first design the hardware and software topology for each
tier. Your installation can range from simple single-computer
configurations, to multiple-computer highly available configurations, to complex
and powerful clusters with many computers. You optimize your topology
design for the product modules that you install and your requirements for high
availability, performance and throughput, and user concurrency.
“Effort only fully releases its reward
after a person refuses to quit.”
Regards,
Akhilesh B. Humbe