Wednesday, April 6, 2011

Data warehouse Fundamentals

Data warehouse Fundamentals

1. Ralph Kimball Definition
2. Inmon Definition
3. Time Variant
4. Non-Volatile
5. Integrated
6. Subject – Oriented
7. Difference between OLTP Vs DWH
8. ETL Fundamentals
9. Data Acquisition
10. Data Extraction
11. Data Loading
12. Data Transformation
I. Data Cleansing
II. Data Scrubbing
III. Data Aggregation
IV. Data Merging
13. Data Loading
14. ETL Client Server Architecture.
15. Data warehouse – Database design
Star Schema
Snow Flake Schema
Galaxy Schema
16. Types of Dimensions
Slowly changing dimensions
Type I Dimension
Type II Dimension
Type III Dimension
Junk Dimension
Dirty Dimension
17. Types of Facts
i)Additive facts
ii)Semi Additive facts
iii)Non-Additive fact
18. OLAP (Online Analytical Processing)
19. Desktop OLAP
20. Data Marts
Top-down DWH approach
Bottom-Up DWH approach



Ralph Kimball Definition:
It is RDBMS specifically designed for analysing the business operations to make decisions to achieve business goals. But it is not for running business operations.

DSS[ Decision Support System]: Its analyse the business and make to support the decision on the business.

Historical DB’s:
It is designed to store massive / maintain decades/years of data to analyse the business and make to decision.

Inmon Definition:
Data warehouse have following 4 characteristic features as per the Inmons definition.
- Time variant
- Non-Volatile
- Integrated
- Subject-Oriented

Time Variant:
A Data warehouse is time variant database supports business community and comparing business with different time periods.



Non-Volatile:
Data warehouse is a non-volatile database. Once data entered into the Data warehouse it should not change when the change takes place OLTP databases and data is static in DWH.

Integrated:
A data warehouse is a collection of multiple OLTP databases.



Subject – Oriented:
Data warehouse is subject oriented database which supports business of specific area.

Ex: Insurance [Health, Life, General and Wealth]
Sales, HR, Accounts, Loans, Share Marketing etc



ETL Fundamentals:
Data Acquisition:
A data acquisition is the process of extracting relevant information, processing into required format and loading into data warehouse.



There are 2 types of ETL’s
1) Code based ETL
2) GUI based ETL

1) Code Based ETL: An ETL application can be developed using programming languages such as SQL, PL/SQL.
Ex: SAS Base, SAS Access
2) GUI Based ETL: An ETL application can be developed using the GUI, point and Click techniques.
Ex: Informatics, Data Stage, Ab-Initio

Data Extraction:
It is the process of extracting the data from various types of source systems. The following are the different types of source systems used for running the business.
SAP/ Oracle App/ People soft/ JD Edward / IBM main frames/ Cobol files/ XML Files/ Flat Files.


Data Transformation: It is a process of converting the data and cleansing the data into required format.
There are 4 types of data transformation activities takes place in the buffer (Temporary memory).
1) Data Cleansing
2) Data Scrubbing
3) Data Aggregation
4) Data Merging

Data Cleansing: Data Cleansing is process of changing inconsistent data into
Consistent format (or) it’s a process removing unwanted data (Filtering).

Data Scrubbing: Data Scrubbing is a process of deriving new attributes in DWH.



Data Aggregation: Data aggregation is a process of calculating business summaries from detailed data using the aggregate functions.

Data Merging: Data Merging is a process of Integrating data from multiple OLTP DB’s.

Data Loading: Data loading is a process of inserting data into target system. There are 2 types of data loads.
1. Initial Load
2. Incremental Load

Initial Load: It is a process of inserting the data into empty target tables. An initial load is also known as Full load.
Incremental Load: It is a process of inserting data only new records after initial load.

ETL Client Server Architecture.

ETL client:
An ETL client is a GUI component which allows you to design plan of ETL process know as “Mapping” or “Job”.
An ETL plan is designed with the following types of objects.
1. Source Definition[E]
2. Transformation[T]
3. Target definition [L].
ETL server: An ETL server is engine that defines, execute, transformations and Load.
An ETL plan executes in ETL server.
ETL repository: An ETL Repository is brain of the ETL system which stores meta data. i.e required to perform ETL.



Data warehouse – Database design:-

A DWH is designed with the following types of schemas.
1. Star Schema
2. Snow Flake Schema
3. Galaxy Schema
- A Data base is designed with GUI based tool called ER Win.
- A Data base architect or data modular designs the data ware house from business requirements.

Star Schema:

A star schema is database design which is contains centrally located FACT table which is surrounded by multiple dimensional tables.

-Since the data base schema looks like star hence it is called Star schema data base design.

-A FACT table contains facts, Facts are numeric.
-Not every numeric is a fact , but numeric which are of type “key performance Indicator” are known as facts.
-Facts are business measures which are used to evaluate the performance of enterprise.
-A fact table contains the facts at lowest level granularity.
-A Fact granularity defines the level of details.
-Dimensions are descriptive data which as describes the key performance indicators known as facts.
-Dimensions are organized in dimension table. Which are de-normalized?
-Dimensions provide answers to the following business questions.
Who?
What?
When?
Where?
Note: Dimension tables are created with artificial keys are called surrogate keys.



Note: 1) Fact tables are normalized
2) Fact tables contain normalized data as a unique combination of surrogate
keys.


Snow Flake Schema:

A Very large demoralized dimension table is spitted into one or more dimensional tables its results in the normalization in dimensions.
Advantage: Query performance by saving a quite bit of table space.
Disadvantage: Number of joins increases.

Note: In a star schema dimension tables doesn’t contain any parents where as snow flake schema contains parents.


Galaxy Schema: Its is also known as
1) Multi Star Schema
2) Hybrid Schema
3) Constellation Schema
4) Bus Schema
5) Integrated Schema
Fact constellation: It is a process of joining to facts tables.
Confirmed Dimensions:
A dimension table which can be shared by multiple fact tables is known as confirmed dimensions



Types of Dimensions:
Slowly changing dimensions: A dimension which can be changed over a period of time is known as slowly changing dimensions. Following are the types of dimensions.

i) Type I Dimension
ii) Type II Dimension
iii) Type III Dimension

Type I Dimension: A Type I dimension store only current data into the target table, It doesn’t maintain history.

Type II Dimension: A Type II dimension stores full history data in to the target table, for each update in OLTP it insert a new record in the target table.

Type III Dimension: A Type III dimension stores partial history in the target table.
Hybrid slowly changing dimension: A hybrid dimension is combination of Type I and Type II dimensions.



2) Junk Dimension: A dimension which can’t be used to describe key performance indicators (Facts) is known as Jun Dimensions.
Note: Junk dimensions are provides Additional Information



3) Dirty Dimension: A dimension which exists more than once is called Dirty dimensions.

Types of fact tables:
There are 3 types of fact tables.
i) Detailed fact tables: Fact tables stores details of transactions are known as detailed fact table.
ii) Summary or aggregate tables: A table which stores summary or aggregate data is known as known as summary or aggregate tables.
iii) Fact less fact tables: A table without facts are known as fact less fact table.
Note: Fact les fact table are derived in Galaxy schema.

Types of Facts: There are 3 types of facts
i) Additive facts
ii) Semi Additive facts
iii) Non-Additive facts

Additive facts: A fact which can be summarised for any dimension is known as Additive facts. Ex: QTY, REVENUE



ii) Semi Additive facts: A fact which can be summarised for few dimensions but not for all dimensions is known as Semi Additive facts.
Ex: Current balance
iii) Non-Additive fact: A fact can’t be summarised for any of the dimension is known as Non-Additive facts.
Ex: Profit percent.


OLAP (Online Analytical Processing): An OLAP is a set of specification which provides an interface between user community and data ware house.
Types of OLAP’s
i) Desktop OLAP
ii) Relational OLAP
iii) Multidimensional OLAP
iv) Hybrid OLAP
Desktop OLAP: A OLAP which can query data from desk top databases like FoxPro, DBase, clipper etc… is know as DOLAP.



Data Marts:
A Data mart is subject oriented data base which supports business needs of multiple management in enterprise.

A Data mart is subset of enterprise data warehouse. A data mart is also known as high performance query structure (HPQS). There are 2 types of data marts.
i) Dependents Data mart
ii) Independent Data mart
Top-down DWH approach: Accordingly the Inmons first build on enterprise specific database known as DWH, from enterprise DWH designed a development specific DB’s known as Data mart.



Dependent Data mart: In top-down approach a data mart development depends on as enterprise data warehouse such data marts known as dependent data marts.

Bottom-Up DWH approach: According to Ralp Kimball first build department specific data base known as data mart (DM) integrate data marts into an enterprise DWH.




Data Modelling or Dimensional modelling:
It is a process of designing DWH with following phases.
i) Conceptual Modelling
ii) Logical Modelling
iii) Physical Modelling
Conceptual Modelling:
1) Study the business requirements from BRS.
2) Identify entities (Tables)
3) Identify Attributes( Columns)
4) Identify relations
Logical Modelling: Design the dimensional tables and fact tables with required attributes.
Create relationship between dimension tables and fact tables using GUI based DB design tool called ERWin.
Physical Modelling: Use forward engineering technique to create tables in DB, To physically store data.
Note: 1) Target load order
Dimension tables ------ Fact tables.
2) If all dimension tables are loads is success then load data into fact tables.

Continue next .... By Arun Mannepula

1 comment: