DWH-Performance Tuning Using Metadata Driven ...

DWH-Performance Tuning Using Metadata Driven Approach

A Dissertation submitted in partial fulfillment for the award of the Degree of Master of Technology in the Department of Computer Science & Engineering (with specialization in Software Engineering)

Supervisor: Naveen Hemrajani, Assoc. Professor, SGVU, Jaipur

Submitted By: Sandeep Bhargava, Enroll. No: SGVU091513431

Department of Computer Science & Engineering
Suresh Gyan Vihar University, Mahal, Jagatpura, Jaipur
September 2011


Candidate’s Declaration

I hereby declare that the work which is being presented in the Dissertation entitled "DWH-Performance Tuning Using Metadata Driven Approach", in partial fulfillment for the award of the Degree of "Master of Technology" in the Department of Computer Science & Engineering with specialization in Software Engineering, submitted to the Department of Computer Science & Engineering, Suresh Gyan Vihar University, is a record of my own investigations carried out under the guidance of Mr. Naveen Hemrajani, Department of Computer Science & Engineering.

I have not submitted the matter presented in this Dissertation anywhere for the award of any other Degree.

(Name and Signature of Candidate) Sandeep Bhargava Enrolment No: SGVU091513431

Countersigned by: Name(s) of Supervisor(s)

Naveen Hemrajani Associate Professor, CSE Dept Suresh Gyan Vihar University


DETAILS OF CANDIDATE, SUPERVISOR (S) AND EXAMINER

Name of Candidate: Sandeep Bhargava

Roll No. 102615

Department of Study: Computer Science & Engineering
Enrolment No.: SGVU091513431
Thesis Title: DWH-Performance Tuning Using Metadata Driven Approach

Supervisor(s) and Examiners Recommended (with office address including contact numbers, email ID):

Supervisor: 1. Mr. Naveen Hemrajani, Associate Professor, SGVU, Jaipur, India ([email protected])
Co-Supervisor:
Examiner 1:
Examiner 2:
Examiner 3:

Signature with Date:
Ms. Ruchi Dave, Program Coordinator
Mr. Naveen Hemrajani, Vice Principal, SGVU, Jaipur


ABSTRACT

Today's world increasingly relies on the intelligence hidden in large volumes of transactional data to predict industry growth patterns, and organizations are therefore building in-house data warehouses. A data warehouse (DWH) deals with huge amounts of data, and a large number of analytical queries run against base data measured in thousands of gigabytes in order to unveil the hidden patterns of the business. The response time of a query grows steeply with the volume of base data involved; the rule of thumb is "more base data, more accurate results", but performance degrades if the warehouse is not managed carefully. At the same time, users do not want to wait: every moment counts, and there is constant pressure to improve existing processes. A considerable amount of work has already been done around the globe on DWH performance tuning, with frameworks proposed around data quality, metadata management and related areas. In this thesis, an effort has been made to discuss real industry problems and to show how the performance of a data warehouse can be improved by using the existing CPU cycles wisely through a metadata driven approach.


CONTENTS

Abstract  iv
List of Tables  viii
List of Figures  ix
Acknowledgements  x

1. Introduction  2
1.1 Early History  2
1.2 Recent History  3
1.3 Data Warehouse  4
1.3.1 What is a data warehouse  4
1.3.2 Need of data warehouse  5
1.3.3 Benefits of operational/informational systems separation  5
1.3.4 Advantages of data warehouse  6
1.3.5 Data warehouse is different from database  6
1.4 Basic architecture of warehouse  7
1.4.1 Processes in data warehouse  7
1.4.2 Basic components of data warehouse  8
1.4.2.1 Load Manager  8
1.4.2.1.1 Extract and load  9
1.4.2.1.2 Fast Load  9
1.4.2.1.3 Simple transformations  9
1.4.2.2 Warehouse Manager  10
1.4.2.2.1 Using Temporary Destination Tables  11
1.4.2.2.2 Complex Transformations  11
1.4.2.2.3 Transform into a Starflake Schema  11
1.4.2.2.4 Create Indexes and Views  11
1.4.2.3 Query Manager  12
1.4.2.4 Operational Data Sources  12
1.5 Classification of Data Warehouse Architecture  13
1.5.1 Generic Two-Level Architecture  13
1.5.2 Independent data mart architecture  14
1.5.3 Dependent Data Mart and Operational Data Store  15
1.5.4 Three-layer data architecture  16
1.5.5 Logical data mart and real-time data warehouse architecture  17
1.6 ETL vs. ELT  18
1.6.1 ETL – Extract, Transform and Load  18
1.6.1.1 Strengths  19
1.6.1.2 Weaknesses  19
1.6.2 ELT – Extract, Load and Transform  20
1.6.2.1 Strengths  21
1.6.2.2 Weaknesses  21
1.6.3 Staging  22
1.7 OLTP  22
1.8 OLAP  23
1.8.1 Classification of OLAP  23
1.8.2 Comparison among MOLAP, ROLAP  24
2. Data warehouse design and implementation issues  25
2.1 Tool integration  26
2.2 Extraction  26
2.3 Scalability  27
2.4 Database design  27
2.5 Domain issues  28
2.6 Metadata management  28
2.7 Performance and monitoring  29
3. Meta Data  30
3.1 Technical Meta data  31
3.2 Business Meta data  31
3.3 Operational Meta data  32
3.3.1 Incorporation of Operational Meta Data  32
3.3.2 Operational Use Strategies  36
3.4 Metadata primary goals  40
3.5 Benefits of Meta data  41
4. Literature Review  42
5. Performance of Data Warehouse  52
5.1 General  53
5.2 Performance Tuning  53
5.2.1 Proactive Monitoring  54
5.2.2 Bottleneck Elimination  54
5.3 Problem Area  55
5.3.1 ETL  55
5.3.2 Query Processing  55
5.3.3 Reporting Response Time  55
5.4 Factors Affecting Reporting Response Time  55
5.5 Statistics  56
5.5.1 Traditional Methodology for Statistics Refresh  57
6. Proposed Methodology  59
6.1 What alerted us to a solution  60
6.2 Proposed Solution  60
6.2.1 Flow Chart  61
6.2.2 Process Flow Model  62
7. Experiment/Result  63
7.1 Teradata software  64
7.2 Basic Set Up Required  67
7.3 Result  68
8. Conclusions and Summary  69
References  71
Paper Publications


List of Tables

Table 1.1 Operational vs. informational systems  6
Table 3.1 Benefits of metadata  41
Table 6.1 Teradata software  65


List of Figures

Figure 1.1: Architecture of data warehouse  8
Figure 1.2: Load Manager Architecture  9
Figure 1.3: Architecture of warehouse manager  10
Figure 1.4: Generic two-level data warehousing architecture  14
Figure 1.5: Independent data mart data warehousing architecture  15
Figure 1.6: Dependent Data Mart and Operational Data Store  16
Figure 1.7: Three-layer data architecture  16
Figure 1.8: Logical data mart and real-time data warehouse architecture  17
Figure 1.9: ETL Architecture  18
Figure 1.10: ELT Architecture  20
Figure 3.1: Metadata Primary Goals  40
Figure 5.1: Available Traditional Methodology  58
Figure 6.1: Flow Chart of Proposed Solution  61
Figure 6.2: Proposed model using STATS  62
Figure 7.1: Typical CPU consumption chart for an application in a DWH  68


Acknowledgements

I would like to thank everyone who helped me write this thesis. In particular, I would like to thank my guide, Mr. Naveen Hemrajani, and the HOD, Mr. Dinesh Goyal, for their moral support and guidance in completing this report. I would also like to take this opportunity to thank Ms. Ruchi Dave, Program Coordinator, Suresh Gyan Vihar University, Jaipur. I express my special gratitude to Mr. Subhash Gander and Ms. Savita Shivani, and to all my friends and classmates for their support and help. Last but not least, I wish to express my gratitude to God Almighty for his abundant blessings, without which this work would not have been successful.


CHAPTER-1 Introduction


Introduction

In today's era of global competition backed by powerful IT services, simply selling large quantities of products and services is not what indicates the so-called "green" health of a business. What makes a product or service stand out is selling and offering it smartly; in layman's terms, this is the difference between working very hard and working very smartly. So what exactly is this smart way of offering, and how is industry capitalizing on it? The answer is data warehousing (DWH), a relatively complex discipline that has become a key ingredient for leading industries to identify hidden trends and unveil overall trends in their business. Many successful companies have been investing large sums of money in business intelligence and data warehousing tools and technologies, believing that up-to-date, accurate and integrated information about their supply chain, products and customers is critical for their very survival. For example, Walmart, the giant retail chain, has invested billions of dollars to build a DWH from its daily transactional data and is presently among the top five DWHs in the world in terms of data volume and computational power. The power of Walmart's data warehousing can be gauged from the fact that it generates reports across the globe on parameters such as the following:

• What were the sales of orange juice by brand, by store, by location and by season? (see the sketch after this list)
• What was the profitability of the Nevada region compared to the LA area in terms of market capitalization vs. sales?
• How many additional dollars have been invested in a gallon of milk on its way from farm to store?
• What is the ratio of operating cost to capital cost for small, mid-size and big stores?
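As a purely illustrative, hedged rendering of the first question above, such a report could be expressed as an analytical SQL query against a hypothetical star schema; none of the table or column names below come from the thesis or from Walmart's actual systems.

```sql
-- Hypothetical analytical query: orange juice sales by brand, store, location, season
SELECT p.brand, s.store_cd, s.location, t.season,
       SUM(f.sales_amt) AS total_sales
FROM   fact_sales f
JOIN   dim_prod  p ON f.prod_key  = p.prod_key
JOIN   dim_store s ON f.store_key = s.store_key
JOIN   dim_time  t ON f.time_key  = t.time_key
WHERE  p.category = 'Orange Juice'
GROUP BY p.brand, s.store_cd, s.location, t.season;
```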

The base data from which the above metrics can be derived runs into hundreds of petabytes. Walmart derives these metrics for higher management to plan business strategy for the future.

1.1 EARLY HISTORY
To really understand business intelligence (BI) and data warehouses (DW), it is necessary to look at the evolution of business and technology. In the 1970s and 1980s, computer hardware was expensive and computer processing power was limited. A medium-sized business typically operated a handful of large mainframe-based application systems that were designed for operational, data-entry purposes. Information needs were addressed through paper reports. Report programs, however, were expensive to write and generally inflexible. A computer programmer was required


to write the report programs. Fortunately, salaries for programmers were relatively low during this period. In the 1980s, relational databases became the rage. Data was stored in tables with rows and columns, not unlike the Excel spreadsheets of today. Although relational databases were much more intuitive for end users, complex logic was often needed to join multiple tables to obtain the information that was needed. Although it was possible for end users to write simple reports, the queries were often inefficient and had to be run after normal business hours in order not to impact online transactions. In the late 1980s, many businesses migrated from mainframe computers to client-server systems. Business people were assigned personal computers, and office applications such as Microsoft Word, Excel and Access became popular. The personal computer empowered end users and allowed them to develop their own applications and present data in a manner that was meaningful to them, such as in grid or graph format. Excel spreadsheets could easily be tweaked as business needs changed, without assistance from the IT department. Unfortunately, corporate data remained centralized and was generally not directly accessible to end users.

1.2 RECENT HISTORY
The need for improved business intelligence and data warehousing accelerated in the 1990s. During this period, huge technological changes occurred and competition increased as a result of free trade agreements, globalization, computerization and networking. In the early 1990s, the Internet took the world by storm. Companies rushed to develop e-Business and e-Commerce applications in the hope of reducing their staffing needs and providing 24-hour service to customers. The volume of application systems mushroomed during this period as a parallel set of Internet applications was deployed. Back-end 'bridges' were built to try to integrate the 'self service' application systems with the legacy 'full service' applications. Unfortunately, integration was often messy and corporate data remained fragmented or inconsistent. As the demand for programmers increased and salaries climbed, businesses looked for alternatives to custom-built application systems. In the hope of reducing costs and remaining competitive, companies purchased software packages from third parties. These packages were designed for generic business requirements and often did not integrate well with the existing legacy systems.


By the end of the millennium, businesses discovered that the number of application systems and databases had multiplied, that their systems were poorly integrated and that their data was inconsistent across the systems. More importantly, businesses discovered that they had lots of fragmented data, but not the integrated information that was required for critical decision making in a rapidly changing, competitive, global economy. Companies began building data warehouses to consolidate data from disparate databases and to better support their strategic and tactical decision-making needs.

1.3 DATA WAREHOUSE
1.3.1 What is a Data Warehouse?
A Data Warehouse (DW) is simply a consolidation of data from a variety of sources that is designed to support strategic and tactical decision making. Its main purpose is to provide a coherent picture of the business at a point in time. Using various data warehousing toolsets, users are able to run online queries and 'mine' their data. Inmon and Kimball, pioneers of data warehousing, state that "a DWH is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes."

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept; a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change; historical data in a data warehouse should never be altered.

1.3.2 Need of a data warehouse:
1. Integrated, company-wide view of high-quality information (from disparate databases).

Data in operational systems are often fragmented and inconsistent, distributed over a variety of incompatible hardware and software platforms. Problems of consolidating all this data include:
• Inconsistent key structures
• Synonyms
• Free-form versus structured fields
• Inconsistent data values
• Missing data

2. Separation of operational and informational systems.
• Operational systems run the business in real time, based on current data; examples are sales order processing, reservation systems and patient registration.
• Informational systems support decision making based on historical, point-in-time and prediction data; examples are sales trend analysis, customer segmentation and human resources planning.

1.3.3 Benefits of operational/informational systems separation
• A data warehouse can logically centralize data scattered throughout separate operational systems.
• A well-designed data warehouse can improve data quality and consistency.
• A separate data warehouse greatly reduces contention for data resources between operational and informational systems [6].


Operational vs. informational systems:

Table 1.1 Operational vs. informational systems [6]

Characteristics | Operational System | Informational System
Primary purpose | Run the business on a current basis | Support managerial decision making
Types of data | Current representation of the state of the business | Historical point-in-time (snapshots) and predictions
Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
Scope of usage | Narrow, planned and simple updates and queries | Broad, ad-hoc, complex queries and analysis
Design goal | Performance: throughput, availability | Ease of flexible access and use
Volume | Many constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows

1.3.4 Advantages of Warehousing:
• High query performance
• Queries not visible outside the warehouse
• Local processing at sources unaffected
• Can operate when sources are unavailable
• Can query data not stored in a DBMS
• Extra information at the warehouse
• Modify, summarize (store aggregates)
• Add historical information

1.3.5 A Warehouse is different from a Database
A database is designed (and optimized) to record, while a data warehouse is designed to respond to the analysis questions that are critical for the business. Application databases are OLTP (online transaction processing) systems, where every transaction has to be recorded; it is not acceptable, for example, for a bank ATM to disburse cash to a customer but be unable to record this event in the bank's records. A data warehouse, in contrast, is a database designed to facilitate querying and analysis. Often designed as OLAP (online analytical processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than in OLTP systems: OLTP systems are designed to be write-optimized, while OLAP systems are designed to be read-optimized. A well-managed database ensures that business intelligence is scalable, better documented and managed, and can answer questions more efficiently and frequently, while a data warehouse leads to a direct increase in the quality of analysis because the table structures are simpler, standardized and often denormalized. A database is a structured collection of records or data stored in a computer system, while a data warehouse is a repository of stored data designed to facilitate reporting and analysis. A database typically holds megabytes to gigabytes of data, while a data warehouse holds gigabytes to terabytes. A database serves thousands of users, while a data warehouse serves hundreds of users (e.g., decision-makers and analysts) [6].

1.4 Basic architecture of a data warehouse
1.4.1 Processes in a data warehouse
The various processes that occur in a data warehouse are [6]:
• Extract and load: the load manager extracts and loads the data, performing simple transformations before and during the load.
• Cleaning and transformation: before loading, transformation of the data is required. The purpose of this exercise is to work through all the transformations that do not require complex logic or the use of relational set operators.
• Backup and archive: the warehouse manager incrementally backs up and archives the data warehouse.
• Query management: the query manager directs queries to the appropriate tables and schedules the execution of user queries.

[Figure: operational data sources feed the load manager; the warehouse manager maintains summary information, detailed information and metadata; the query manager serves data-dipper and OLAP tools that turn data into information for decisions.]

Fig. 1.1 Architecture of data warehouse [6]

1.4.2 Basic components of a data warehouse
1.4.2.1 Load Manager: This is the system component that performs all the operations necessary to support the extract and load process. The size and complexity of the load manager will vary between specific data warehouse solutions, but as an indication, the larger the degree of overlap between the source systems, the larger the load manager will be. The effort needed to develop the load manager should be planned for in the production phase. Although the entire system does not have to be built at once, a significant proportion of the functionality should be provided in the first phase [6].


[Figure: the load manager's controlling process uses a copy management tool and a fast loader to move data from the source file structures through a temporary data store (with stored procedures) into the warehouse structure.]
Fig. 1.2: Load Manager Architecture [6]

A load manager performs the following functions:
• Extraction of data from the source systems
• Fast-loading the extracted data into a temporary data store
• Performing simple transformations into a structure similar to the one in the data warehouse (see the sketch below)
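To make the last function concrete, the following is a minimal, hedged sketch of the kind of simple, set-based transformation a load manager might apply while moving freshly extracted rows from a raw staging table into a structure that mirrors the warehouse. All table and column names are hypothetical, and the statement is generic Teradata-style SQL rather than the author's actual implementation.

```sql
-- Hypothetical staging step: only simple transformations (trimming, type
-- casting, code standardisation) -- no complex business logic yet.
INSERT INTO stg_sales_clean (sale_id, sale_dt, store_cd, amount)
SELECT CAST(TRIM(src.sale_id) AS INTEGER),
       CAST(src.sale_dt       AS DATE),
       UPPER(TRIM(src.store_cd)),
       CAST(src.amount        AS DECIMAL(12,2))
FROM   stg_sales_raw src
WHERE  src.sale_id IS NOT NULL;   -- reject obviously unusable rows early
```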

1.4.2.1.1 Extract data from source
To be extracted, the source data has to be transferred from the source systems and stored in the data warehouse environment.

1.4.2.1.2 Fast Load
Data should be loaded into the warehouse in the fastest possible time in order to minimize the total load window. This becomes critical as the number of data sources increases and the time window shrinks. The speed at which data is processed into the warehouse is affected by the kind of transformations taking place. It is often more effective to load the data into the relational database first and then apply transformations and checks, because most database vendors have very performant facilities for fast-loading data from ASCII flat files.

1.4.2.1.3 Simple Transformations
Before loading the data, some transformation is required. The purpose of this exercise is to work through all the transformations that do not require complex logic or the use of relational set operators; once this has been completed, the complex logic can be tackled. This step is significant because complex data has to be simplified before it is loaded into the data warehouse or data mart [6].

1.4.2.2 Warehouse Manager
This is the system component that performs all the operations necessary to support the warehouse management process. It is typically constructed using third-party management software together with, for example, C programs and shell scripts.

The warehouse manager performs the following operations:
• Analyzes the data to perform consistency and referential integrity checks
• Transforms and merges the source data in the temporary data store into the published data warehouse
• Creates indexes and partitioned views against the base data
• Generates denormalizations if appropriate
• Generates any new aggregations that may be required
• Updates all existing aggregations
• Backs up, incrementally or completely, the data within the data warehouse
• Archives data that has reached the end of its capture period
• Analyzes query profiles to determine which indexes and aggregations are appropriate

[Figure: the warehouse manager's controlling process uses stored procedures, SQL scripts and a backup/recovery tool to move data from the temporary data store into the starflake schema and summary tables.]
Fig. 1.3: Architecture of warehouse manager [6]


1.4.2.2.1 Using Temporary Destination Tables
When data is present in temporary tables, users can perform simple conversions to start getting it into shape. The next phase is to create permanent destination tables in the data warehouse that mirror the temporary tables. After this phase, and before the data is loaded into the data warehouse, consistency checks are applied to the information; if these checks fail, the data remains in the temporary tables and is not loaded into the data warehouse. Because the temporary tables are structured just like the destination tables, it becomes a simple exercise to first transform the data from the temporary to the destination tables and then load it into the data warehouse.

1.4.2.2.2 Complex Transformations
The populated set of temporary tables matches the destination tables in the data warehouse. These tables have been populated with the source data, but still without any of the complex transformations that need to take place. The data in both sets of tables is compared before being stored in the data warehouse; the complex data is transformed first and then stored or loaded into the warehouse.

1.4.2.2.3 Transform into a Starflake Schema
Once the source data is in the temporary data store and has been cleaned up, the warehouse manager can transform it into a form suitable for decision-support queries: a form in which the bulk of the actual data lies in the center, surrounded by the reference data. Variations of this form are used in star, snowflake and starflake schemas.

1.4.2.2.4 Create Indexes and Views
The warehouse manager has to create indexes against the information in the fact and dimension tables. With a large number of rows, the overhead of inserting rows into a table and its indexes can be far higher than the overhead of re-creating the indexes once the rows have been inserted. It is therefore often more effective to drop all indexes against a table prior to inserting a large number of rows. Fact data, which tends to have a large amount of data inserted on a daily basis, is a prime candidate for dropping indexes before the data load; this method is effective unless the partition being loaded already contains a substantial amount of data. Dimension data tends to change, or be added to, in far smaller volumes than fact data, so unless the dimension is being changed wholesale, existing indexes on the dimension tables are retained.
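A minimal, hedged sketch of the drop-and-recreate pattern just described; the table and index names are hypothetical, and the statements are generic Teradata-style SQL rather than the author's actual scripts.

```sql
-- Hypothetical daily fact load: drop a secondary index, bulk-insert, re-create.
DROP INDEX sales_dt_idx ON dwh.fact_sales;

INSERT INTO dwh.fact_sales
SELECT * FROM stg.fact_sales_today;            -- the day's new fact rows

CREATE INDEX sales_dt_idx (sale_dt) ON dwh.fact_sales;
```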

1.4.2.2.5 Summary Generation
After the data has been loaded and staged in the data warehouse comes the creation of summaries. Summaries are generated by the warehouse manager automatically, without manual intervention; SQL is used to generate them, and metadata can be used to drive the summary generation.

1.4.2.2.6 Using Query Statistics
The aim here is to maintain the responsiveness of the system even when the query profile changes over time. This can be achieved by gathering statistics on the queries being physically executed against the data warehouse and deducing the appropriate summaries to speed up common queries. Query statistics can be collected by the query manager as it intercepts every query hitting the database. This process ensures that only relevant summaries are retained by the system on an ongoing basis: new summaries are created to match the changing query profile, and summaries that speed up popular queries are maintained (a sketch follows the query manager's functions below).

1.4.2.3 Query Manager
This is the system component that performs all the operations necessary to support the query management process. It uses tools such as C programs and scripts. In some cases the query manager also stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate. It performs the following operations:

• Directs queries to the appropriate tables
• Schedules the execution of user queries
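Tying sections 1.4.2.2.5, 1.4.2.2.6 and 1.4.2.3 together, the hedged sketch below shows a summary table that might be generated once query statistics reveal that monthly sales by store is a popular aggregation, and the much cheaper query the query manager could then direct at it. All object names are hypothetical and the syntax is Teradata-style SQL used purely for illustration.

```sql
-- Hypothetical summary built by the warehouse manager from query statistics
CREATE TABLE dwh.sum_sales_month_store AS
( SELECT store_cd,
         EXTRACT(YEAR  FROM sale_dt) AS sale_yr,
         EXTRACT(MONTH FROM sale_dt) AS sale_mth,
         SUM(amount)                 AS total_amount,
         COUNT(*)                    AS txn_count
  FROM   dwh.fact_sales
  GROUP BY 1, 2, 3
) WITH DATA;

-- A popular report query the query manager can now direct at the summary
SELECT store_cd, total_amount
FROM   dwh.sum_sales_month_store
WHERE  sale_yr = 2010 AND sale_mth = 12;
```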

1.4.2.4 Operational Data Sources
Operational data for the data warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems [6].


1.5 Classification of Data Warehouse Architecture
Data warehouses can be architected in many different ways. The main physical architectures are:
• Generic two-level architecture
• Independent data mart
• Dependent data mart and operational data store
• Three-layer data architecture
• Logical data mart and real-time data warehouse

1.5.1 Generic Two-Level Architecture
This architecture is basically for entry-level data warehouses, and there are four basic steps in it. In the first phase, data is extracted from various external and internal legacy systems. In the data staging area, data from the various legacy systems is integrated, updated and transformed, and then loaded into the data warehouse; the data is also made error-free at this stage. The data and metadata storage stage comprises both detailed and summarized data. Finally, many end users access information from the data warehouse using a variety of query tools, shape it into summarized report formats using reporting tools, and this formatted output can be stored back into the operational database or into the data warehouse itself. The ETL process, i.e. extraction, transformation and loading, is covered mainly by the first three phases: extracting information from the source data systems, the data staging area, and the data and metadata storage stage. Extraction and loading of data are periodic in nature, but the intermediate process, transformation, is done on a day-to-day basis.


Fig. 1.4: Generic two-level data warehousing architecture

1.5.2 Independent Data Mart Architecture
Data mart: a data warehouse that is limited in scope and supports decision making for a particular business function or end-user group.

• Independent: filled with data extracted directly from the operational environment
• Dependent: filled with data derived from the enterprise data warehouse

Why an independent data mart?
• Firms faced with short-term, expedient business objectives
• Lower cost and time commitment
• Organizational and political difficulties in reaching a common view of enterprise data in a central data warehouse
• Technology constraints


[Figure annotations: a separate ETL process for each independent data mart; data access complexity due to multiple data marts.]
Fig. 1.5 Independent data mart data warehousing architecture

Limitations of independent data marts:
• A separate ETL process is developed for each data mart
• Data marts may not be consistent with one another
• No capability to drill down into greater detail or into related facts in other data marts
• Scaling costs are high: each new application creates a new data mart and repeats the ETL steps

1.5.3 Dependent Data Mart and Operational Data Store
A dependent data mart is basically a combination of an EDW and reconciled data. The EDW is a central, integrated data warehouse that makes information available to end users for decision-support applications. The dependent data mart and operational data store architecture is called a "hub and spoke" approach, where the EDW acts as the hub and the source data systems and data marts are the input and output spokes. An operational data store (ODS) is an integrated, subject-oriented, updatable, current-valued, detailed database designed to serve operational users as they do decision-support processing. An ODS acts as a relational database and is distinct from an organization's individual ERP (Enterprise Resource Planning) databases. This type of architecture is also known as a corporate information factory (CIF). The ODS provides the option of obtaining current data.

Fig. 1.6 Dependent Data Mart and Operational Data Store

1.5.4 Three-Layer Data Architecture
It has three layers:
• Source data systems (operational databases)
• Operational data store
• Data and metadata storage (enterprise data warehouse and dependent data marts)

[Figure: three layers of data, each with its own metadata: derived data (data marts), reconciled data (EDW and ODS), and operational data (operational systems).]
Fig. 1.7 Three-layer data architecture [6]


1.5.5 Logical Data Mart and Real-Time Data Warehouse Architecture
A real-time data warehouse accepts a near-real-time feed of transactional data from the operational systems, analyzes the warehouse data, and in near real time relays business rules to the data warehouse and operational systems so that immediate action can be taken. Characteristics:
• Logical data marts are not physically separate databases, but different relational views of a data warehouse.
• Data are moved into the data warehouse rather than into a separate staging area.
• New data marts can be created quickly.
• Data marts are always up to date.
• The ODS and the warehouse are one and the same.

Fig. 1.8 Logical data mart and real-time data warehouse architecture


1.6 ETL VS. ELT
1.6.1 ETL – Extract, Transform and Load
The following diagram illustrates each of the individual stages in the process.
[Diagram stages: Extract, Staging Database, Transform, Load, Data Warehouse]

Fig. 1.9 ETL Architecture [5]

Data is "extracted" from the data sources (line-of-business applications) using a data extraction tool via whatever data connectivity is available. It is then transformed using a series of transformation routines. This transformation process is largely dictated by the data format of the output. Data quality and integrity checking is performed as part of the transformation process, and corrective actions are built into the process; transformations and integrity checking are performed in the data staging area. Finally, once the data is in the target format, it is loaded into the data warehouse ready for presentation. The process is often designed from the end backwards, in that the required output is designed first. In so doing, this informs exactly what data is required from the source. The routines designed and developed to implement the process are written specifically for the purpose of achieving the desired output, and only the data required for the output is included in the extraction process. In addition, the output design must incorporate all the facts and dimensions required to present both the aggregation levels required by the BI solution and any possible future requirements. Business rules that define how aggregations are achieved, and the relationships between the various entities in both the source and the target, are designed and therefore coded into the routines that implement the ETL process. This approach leads to tight dependencies in the routines at each stage of the process. In spite of the fact that there are effectively three stages, the design is often characterized as a monolithic process, since the target is predefined and clear. In addition to the tools provided by the database manufacturer, there are a prolific number of tools available on the market that implement this approach and provide a solid platform, productivity, and flexibility [5].

1.6.1.1 Strengths
Development Time: Designing from the output backwards ensures that only data relevant to the solution is extracted and processed, potentially reducing development, extract and processing overhead, and therefore time.
Targeted Data: Due to the targeted nature of the load process, the warehouse contains only data relevant to the presentation.
Administration Overhead: Reduced warehouse content simplifies the security regime implemented and hence the administration overhead.
Tools Availability: The prolific number of tools available that implement ETL provides flexibility of approach and the opportunity to identify the most appropriate tool. The proliferation of tools has, however, led to a competitive functionality war, which often results in a loss of maintainability [5].

1.6.1.2 Weaknesses [5]

Flexibility: Targeting only relevant data for output means that any future requirement that needs data not included in the original design will have to be added to the ETL routines. Due to the tight dependencies between the routines developed, this often leads to a need for fundamental re-design and development, increasing the time and costs involved.
Hardware: Most third-party tools utilize their own engine to implement the ETL process. Regardless of the size of the solution, this can necessitate investment in additional hardware to run the tool's ETL engine.
Skills Investment: The use of third-party tools to implement ETL processes compels the learning of new scripting languages and processes.
Learning Curve: Implementing a third-party tool that uses foreign processes and languages entails the learning curve that is implicit in all technologies new to an organization, and can often lead to following blind alleys due to lack of experience [5].


1.6.2 ELT – Extract, Load and Transform
Whilst this approach to the implementation of a warehouse appears on the surface to be similar to ETL, it differs in a number of significant ways. The following diagram illustrates the process.

Fig. 1.10 ELT Architecture [5]

Data is "extracted" from the data sources (line-of-business applications) into the staging database using a data extraction tool via whatever data connectivity is available. Whilst in the staging area, integrity and business-rule checks can be applied and relevant corrections made. The source data is then loaded into the warehouse; in effect this provides a validated and cleaned offline copy of the source data in the data warehouse. Once in the warehouse, transformations are performed to re-shape the data into its target output format. The extract and load process can be isolated from the transformation process, and this has a number of benefits. Isolating the load process from the transformation process removes an inherent dependency between these stages. In addition to including the data necessary for the transformations, the extract and load process can include elements of data that may be required in the future; indeed, the load process could take the entire source and load it into the warehouse. Separating the processes enables the project to be broken down into smaller chunks, making it more predictable and manageable. Performing the data integrity checks in the staging area enables a further stage of the process to be isolated and dealt with at the most appropriate point, and also helps to ensure that only cleaned and checked data is loaded into the warehouse for transformation. Isolating the transformations from the load process encourages a more staged approach to warehouse design and implementation, which embraces the ongoing, changing nature of the warehouse build. Tools are available that inherently implement the ELT process; however, their availability is sparser, since this is a more emergent approach to warehouse process design [5].

1.6.2.1 Strengths
Project Management: Being able to split the warehouse process into specific and isolated tasks enables the project to be designed on a smaller task basis, so it can be broken down into manageable chunks.
Flexible and Future-Proof: In general, in an ELT implementation all data from the sources is loaded into the warehouse as part of the extract and load process. This, combined with the isolation of the transformation process, means that future requirements can easily be incorporated into the warehouse structure [5].
Risk Minimization: Removing the close interdependencies between each stage of the warehouse build process enables the development process to be isolated, and the individual process designs can likewise be isolated. This provides an excellent platform for change, maintenance and management.
Utilize Existing Hardware: In implementing ELT as a warehouse build process, the tools provided with the database engine can be used. Alternatively, the vast majority of the third-party ELT tools available employ the database engine's own capability, and hence the ELT process runs on the same hardware as the database engine underpinning the data warehouse, using the hardware already deployed.
Utilize Existing Skill Sets: By using the functionality provided by the database engine, the existing investment in database skills is re-used to develop the warehouse. No new skills need be learned, and the full weight of the experience in developing with the engine's technology is utilized, further reducing the cost and risk of the development process [5].

1.6.2.2 Weaknesses [5]

Against the Norm: ELT is an emergent approach to data warehouse design and development.

Whilst it has proven itself many times over through its abundant use in implementations throughout the world, it does require a change in mentality and design approach compared with traditional methods. Though some principles remain true to all approaches, ELT is different in many ways, and getting the best from an ELT approach requires an open mind.
Tools Availability: Being an emergent technology approach, ELT suffers from a limited availability of tools, particularly for sources such as the internet, commercially available databases, or databases associated with an organization's suppliers or customers [5].

1.6.3 Staging
Due to varying business cycles, data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time. For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract "customer" data from a database in Singapore at noon eastern standard time, but this would not be feasible for "customer" data in a Chicago database. Data in the data warehouse can be either persistent (i.e. it remains around for a long period) or transient (i.e. it only remains around temporarily). The data warehouse staging area is thus a temporary location where data from the source systems is copied. A staging area is mainly required in a data warehousing architecture for timing reasons: in short, all required data must be available before data can be integrated into the data warehouse. Note that not every business requires a data warehouse staging area; for many businesses it is feasible to use ETL to copy data directly [5].

1.7 OLTP
Online transaction processing refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions. OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application. Likewise, when we purchase a railway ticket, either at a counter or online, an entry is made in some database. That entry goes into a collection of tables in a database, usually Oracle, SQL Server, a mainframe system or another DBMS that works on binary-file logic close to the operating-system layer. So in this example, whenever a transaction happens it makes an entry in the system; that is why it is called OLTP, and the end user can retrieve the corresponding entry at any given point of time.

1.8 OLAP
Online analytical processing is an approach to swiftly answering multi-dimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. OLAP is the use of a set of graphical tools that provide users with multidimensional views of their data and allow them to analyze the data using simple windowing techniques. The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures, which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database; measures are derived from the records in the fact table, and dimensions are derived from the dimension tables.

1.8.1 Classification of OLAP
OLAP systems have traditionally been categorized using the following taxonomy:

• MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores the data in optimized multi-dimensional array storage rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube, an operation known as processing.
• ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information; it depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement (see the sketch at the end of this section).
• HOLAP: There is no clear agreement across the industry as to what constitutes "hybrid OLAP", except that such a database will divide data between relational and specialized storage. For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data and specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data.

1.9 Comparison
Each type has certain benefits, although there is disagreement about the specifics of the benefits between providers [6]. Some MOLAP implementations are prone to database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: a high number of dimensions, pre-calculated results and sparse multidimensional data. MOLAP generally delivers better performance due to specialized indexing and storage optimizations; it also needs less storage space than ROLAP because the specialized storage typically includes compression techniques [6]. ROLAP is generally more scalable [14]; however, large-volume pre-processing is difficult to implement efficiently, so it is frequently skipped, and ROLAP query performance can therefore suffer tremendously. Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use. HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP; it can generally pre-process swiftly, scale well, and offer good function support [6].
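To illustrate the ROLAP point above, that each slice-and-dice action effectively adds a WHERE clause, the hedged sketch below runs an aggregate against a hypothetical star schema and then slices it to one year and one region. All table and column names are illustrative only.

```sql
-- Aggregate over the full "cube" (hypothetical star-schema tables)
SELECT d.region, t.sale_yr, SUM(f.amount) AS total_sales
FROM   fact_sales f
JOIN   dim_store  d ON f.store_key = d.store_key
JOIN   dim_time   t ON f.time_key  = t.time_key
GROUP BY d.region, t.sale_yr;

-- Slicing to one year and dicing to one region simply adds predicates
SELECT d.region, t.sale_mth, SUM(f.amount) AS total_sales
FROM   fact_sales f
JOIN   dim_store  d ON f.store_key = d.store_key
JOIN   dim_time   t ON f.time_key  = t.time_key
WHERE  t.sale_yr = 2010
  AND  d.region  = 'Nevada'
GROUP BY d.region, t.sale_mth;
```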


CHAPTER – 2 Data warehouse design and implementation issues


Data warehouse design and implementation issues [7]:
1. Tool integration
2. Extraction
3. Scalability
4. Database design
5. Domain issues
6. Metadata management
7. Performance and monitoring
8. Hyper-growth issues

2.1 Tool integration
There are two kinds of data warehouse vendors:
• full-solution vendors (e.g. IBM), providing multiple components of the data warehouse architecture, integrated for the customer
• component vendors (e.g. Sequent), providing specific components of the data warehouse architecture
The biggest problem with both kinds of vendors is that the tools may not work together, despite vendors' claims that their products are compatible [7].

2.2 Extraction

• In data extraction, routines read the source data, convert it to an intermediate schema, and dump it to a staging area.
• Once the source data is gathered into the staging area, several utilities are used to clean the data, transform it to an intermediate schema, aggregate it, load it into fact tables, and finally refresh it.
Two typical and common problems at this stage are:


• The data extraction tools don't meet the complex needs of the company, resulting in the need to write customized code for extraction.
• The quantity of data to be extracted adds further complications if, for example, a database stored in a very high-speed database manager/database communications application (e.g. IMS) has to be loaded into a Unix-based data warehouse server.

2.3. Scalability (a critical success factor)
• As the number of users and applications and the size of the warehouse increase, the system should be able to meet the extra load by adding computing resources incrementally as needed, without having to change the applications and without the users noticing any degradation in performance.
• The system should be able to deliver scalable performance.
• This issue is also relevant in creating data marts. However, pulling data marts together to create a coordinated, enterprise-wide data warehouse can be a painful experience. Reasons include ignoring the creation of an enterprise-wide infrastructure for data marts, excessive data redundancy across data marts, data marts running on different servers, and very large quantities of data (in terabytes) spread over several data marts.

2.4. Database design

• Entity-relationship diagrams, their translation to relations, and the associated normalization techniques common in transaction processing are not useful for data warehousing.
• In data warehousing projects in which data-query efficiency is important, either a star schema or a snowflake schema is typically used (see the sketch below).
• A data warehouse model aligns with the business structure rather than with the data model of a typical application.
• As most database designers are experienced enterprise modelers, a common problem they face is how to convert the enterprise schema (usually normalized) into a data warehouse schema.
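For the star-schema point above, here is a minimal, hedged DDL sketch of what such a design might look like; the tables, columns and types are purely illustrative and are not taken from the thesis.

```sql
-- Hypothetical minimal star schema: one fact table surrounded by dimensions
CREATE TABLE dim_time  (time_key  INTEGER NOT NULL, sale_dt  DATE,
                        sale_yr   SMALLINT,         sale_mth SMALLINT);
CREATE TABLE dim_store (store_key INTEGER NOT NULL, store_cd VARCHAR(10),
                        region    VARCHAR(30));
CREATE TABLE dim_prod  (prod_key  INTEGER NOT NULL, prod_cd  VARCHAR(10),
                        brand     VARCHAR(30));

CREATE TABLE fact_sales
( time_key  INTEGER NOT NULL,   -- references dim_time
  store_key INTEGER NOT NULL,   -- references dim_store
  prod_key  INTEGER NOT NULL,   -- references dim_prod
  amount    DECIMAL(12,2),
  qty       INTEGER
);
```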


2.5. Domain issues
• The objective is for the warehouse to work synergistically with decision-support tools to meet the decision-support needs of a wide cross-section of users within an organization.
• Therefore, a warehouse needs to be able to address questions covering a wide domain area. For example, in the health care arena, the warehouse may provide information to users in areas ranging from research (by, say, facilitating measurement of the effectiveness of treatment, where relationships between causality and treatment protocols for systemic diseases and conditions are not established) to management (by, say, creating breakdowns of cost and charge information or forecasting demand to manage resources).

2.6. Metadata management
• Metadata is data about data, or an abstraction of data that is instrumental in defining raw data. For example, the definition of the data tells the applications using it that a given stream of bits is a customer's address.
• The need for metadata is acute in data warehouse projects, as different branches of an organization may be using different ways of defining, say, an employee number. Unless these differences are reconciled, the chances of success for any application running on the warehouse are quite low.
• Today, three kinds of metadata are being used: operations-level metadata defines the structure of data in operational databases; warehouse-level metadata defines the way the transformed data is interpreted; and business-level metadata maps the warehouse metadata to business concepts [13] (a sketch of such a mapping follows below).
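As a minimal, hedged illustration of the warehouse-level and business-level metadata just described, the sketch below shows a tiny mapping table of the kind a metadata repository might hold. The schema and names are hypothetical and are not taken from the thesis or from any particular product.

```sql
-- Hypothetical metadata-repository table reconciling source and business definitions
CREATE TABLE meta.column_mapping
( target_table   VARCHAR(60),
  target_column  VARCHAR(60),
  source_system  VARCHAR(30),
  source_table   VARCHAR(60),
  source_column  VARCHAR(60),
  transform_rule VARCHAR(200),   -- e.g. 'TRIM + UPPER', 'divide by 100'
  business_name  VARCHAR(100)    -- business-level concept this column represents
);
```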


2.7. Performance and monitoring
• Performance depends on various factors, including the speed of the server; online transaction processing (OLTP) issues, such as locks, latches and tables accessed; and online analytical processing (OLAP) issues, such as monitoring runaway queries, managing free space in the database by archiving infrequently accessed columns, and adjusting the extraction process by spotting specific areas in the database that are queried only at certain times during the month.
• Loading, cleaning, auditing and analyzing also have to be optimized for high performance.

2.8. Hyper-growth issues
• The work really begins once the data warehouse is up and running.
• User demand for more capabilities typically creates problems in maintaining and managing corporate warehouses. Such demand is due to difficulties in understanding and using the supposedly intuitive decision-support tools bought along with the data warehouse software.
• This demand has to be handled on top of the daily and weekly processing, disaster recovery planning, security scares, and other responsibilities involved in warehouse administration.
• The need to satisfy user information requirements can also lead to scalability issues within the warehouse [7].


CHAPTER-3 Meta Data


Meta Data: Most individuals with some level of involvement with data warehouses, from either a technical or a business perspective, know of the term "metadata". When asked to define what metadata is, most of these individuals can reiterate the common definition "data about data". Some have an even deeper understanding of this subject and of the various categories of metadata. These categories often break down into two areas, technical and business [8].

3.1 Technical Metadata
Technical metadata includes load statistics such as the last date a table was updated, known data quality problems, performance statistics, etc. Technical metadata primarily serves the data warehouse support staff. Examples include:

• Physical table and column names
• Data mapping and transformation logic
• Source systems
• Foreign keys and indexes
• Security
• ETL process names

3.2 Business Metadata [8]
Business metadata includes descriptive information about the warehouse, such as full names and synonyms for warehouse attributes and tables, business definitions, and the calculations used for derived attributes. Server-based business metadata allows this information to be accessed with any data access tool. Examples include [8]:
• Business terms and definitions for tables and columns
• Subject area names
• Query and report definitions
• Report mappings
• Data stewards, etc.


3.3 Operational Metadata
Even with a standardized metadata repository, a data warehouse architect can still add a further level of detail into the warehouse data model and ETL process designs through the incorporation of operational metadata. This technique can be used as an extension of the design and architecture of the data warehouse that provides processing optimizations in data acquisition design, maintenance activities, and end-user reconciliation and auditing of information [9]. Operational metadata provides an additional bridge between the metadata repository and the data warehouse through the addition of physical columns into the design, for ease of use by both business and technical users. Operational metadata, unlike information stored in the metadata repository, is referenced at a row level of granularity in the data warehouse; it provides a detailed row-level explanation of the actual information content of the data warehouse. This direct association with each row of information in the warehouse is operational metadata's key design distinction [8].

3.3.1 Incorporation of Operational Metadata
There is a variety of operational metadata columns that can be utilized in the design of the data warehouse to enhance automated support. Depending on the business requirements of the warehouse project, inclusion of certain columns may make more or less sense. Incorporation of these fields would normally occur during the transformation of the business logical data model into the dimensional or data warehouse data model. Some of the more prevalent columns used are listed below, followed by a sketch of how they might appear on a warehouse table.

Load Cycle Identifier



Current Flag Indicator



Load Date



Update Date



Operational System(s) Identifier



Active in Operational System Flag



Confidence Level Indicator



Load Cycle Identifier:


One of the operational meta data columns a data warehouse architect can incorporate is the load cycle identifier. This column is a sequential identifier assigned during each load cycle to the data warehouse, regardless of the refresh frequency. As a sequential identifier it can be used to remove data from a particular load cycle run if data corruption or other data quality issues are discovered. The load cycle identifier is typically used in conjunction with a lookup or meta data repository table that describes other operational statistics about the load cycle[2].

Current Flag Indicator: The current flag indicator column is used to identify the latest version of a row in a table. It allows the latest version of a row to be identified quickly, without performing date comparisons. This flag is especially useful for managing the processing of slowly changing dimensions (SCD) type 2, where the history of a production record is maintained.

Load Date: The most commonly used and understood operational meta data field in data warehouse designs is the load date column. It denotes when (date and/or time) a row of information was loaded into the data warehouse or, in some business cases, when the data was extracted from the operational source system. This snapshot date is used to maintain the temporal integrity of the data in the warehouse as new information is added during each refresh cycle. The column can be referenced by warehouse administrators to identify candidate rows for archival or purge processing. End users can also use it to reconcile and audit information in the data warehouse against the operational source systems.

Update Date: Another common operational meta data column is the update date. This column indicates when a row was last updated in the warehouse during a refresh cycle. Like the load date, it is used to maintain the temporal integrity of information in the data warehouse. It is routinely used in dimension table designs that implement SCD type 1, 2 or 3 processing to identify when the row was refreshed. Like the load date, it can also be utilized in administration activities such as archival/purge processing or reconciliation/audit by end users.
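As a concrete illustration of how these columns might sit alongside the business attributes of a dimension, the following sketch adds them to a hypothetical customer dimension; the table and column names are illustrative only and are not prescribed by any particular design.

-- Illustrative sketch only: a hypothetical customer dimension carrying
-- the operational meta data columns discussed above.
CREATE TABLE customer_dim (
    customer_sk       INTEGER     NOT NULL,   -- surrogate key
    customer_nk       VARCHAR(30) NOT NULL,   -- natural/production key
    customer_name     VARCHAR(100),
    load_cycle_id     INTEGER,                -- sequential load cycle identifier
    current_flag      CHAR(1),                -- 'Y' = latest version of the row
    load_date         DATE,                   -- when the row entered the warehouse
    update_date       DATE,                   -- when the row was last refreshed
    source_system_id  SMALLINT,               -- originating operational system(s)
    active_in_source  CHAR(1),                -- 'Y' if the key is still active at source
    confidence_level  SMALLINT,               -- ETL rule-conformance ranking
    PRIMARY KEY (customer_sk)
);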


Operational System(s) Identifier: One of the most useful operational meta data fields for both the warehouse administrator and the end user is the operational system(s) identifier. This column is used to track the originating source or sources of a data row in the data warehouse. For example, where a row of data was integrated from more than one operational source system, a column value denoting the combination of these systems can be assigned (see table example below). End users who question the quality and/or validity of data in the warehouse can use it to trace information back to the operational source system that furnished it. In certain cases, administrators can use this column to identify and remove corrupt data loaded from a particular operational source system.

Active Operational System Flag: This flag indicates whether the production keys in a dimension table are still active in the originating operational system. The active operational system flag provides intriguing analysis alternatives for queries posed to the data warehouse. It can be used effectively in a variety of analysis activities to identify dormant data or data that should be constrained in reporting (e.g., out of bounds, products no longer supported, previous customers).

Confidence Level Indicator: One of the more controversial operational meta data fields is the confidence level indicator. This column indicates how business rules or assumptions were applied during the ETL processes for a particular row of data. It gives the end user a measure of the credibility of a data row based on the transformation processing performed. It is often used to identify potential problems with data quality from operational source systems and to facilitate correcting those issues; each warehouse organization and/or project will vary in how the business requires this particular operational meta data field to be ranked. The following table illustrates one approach to ranking.


Operational Meta Data Column Assignment: With the importance of operational meta data now realized, the architect can go forward and incorporate these columns into the design of the data warehouse data model and the ETL processes. The values assigned to these columns are produced in various ways. Some examples of the methods used to assign the various operational meta data columns include:



Load Cycle Identifier: The value for this column is assigned through insertion of a new row in a meta data table or lookup table added to the data model to capture operational statistics. The primary key on the table consists of a serial or sequence data type. The value is incremented during each refresh cycle to the data warehouse regardless of the completion status.



Current Flag Indicator: The value of this column is assigned through a comparison of data currently in the data warehouse versus data in a new load cycle. The two data files are sorted and then consecutively read to determine the current record. The current record has a “Y” value assigned to it while historical records are assigned an “N”.



Load Date: The value of this column is determined at the beginning of the load cycle. If a Load Cycle table is available in the meta data repository or has been added to the model, the value for this column can be extracted from that source. In some cases the value assigned may be the extraction date from the operational system rather than the actual date the information was loaded into the data warehouse.



Update Date: The value of this column is determined, like load date, at the beginning of the load cycle but is applied only to historical rows.



Operational System Identifier: The value of this column is assigned based on which operational system provided the information to the data warehouse. If two or more sources are integrated to construct a single row in a data warehouse table, the value assigned should represent the multiple operational systems.



Active Operational System Flag: The value of this column is determined by comparing the natural or production keys of the operational system to data rows in the data warehouse. A value of “Y” is assigned to a record if it currently exists in the operational system.



Confidence Level Indicator: The value of this column is assigned during ETL processing based on conformance to the business rules associated with the data warehouse target table.


If business rule validation indicates that transformation occurred correctly with no unknown or missing information, the confidence level can be set high. If derived or estimated data is inserted during ETL processing, the confidence level may need to be set lower[9].

3.3.2 Operational Use Strategies: A variety of processing optimizations in data acquisition design, maintenance activities and end-user reconciliation and auditing of information can now be performed against the data warehouse once operational meta data is incorporated into the architecture. Some examples of the tasks that can now be performed include[9]:

Extraction of current data

Identification of rows for archive/purge processing



Load cycle rollback



Processing of slowly changing dimensions type 2 records

These are not the only tasks that can take advantage of operational meta data columns, but they demonstrate an initial starting set.

Extraction of Current Data: The typical data warehouse data model requires the load date column in order to maintain a unique perspective of history for a row of data in a table. While referential integrity is maintained through this column, it causes a high degree of processing overhead when a user attempts to determine which row in a dimension table is the most current: the RDBMS must determine, by reading each row of data in the table, which row has the most current load date. An alternative to this brute-force method is to utilize the current flag indicator column. Through ETL processing and comparison of natural or production keys, this column lets business users quickly identify which row of information from the operational systems is the most current. The last row added to a data warehouse table for a particular production key is assigned "Y" for the current indicator flag, while historical records are set to "N" or null. The business user can use this column as a query constraint to retrieve only the most up-to-date information. For certain reporting requirements, RDBMS views can be established that automatically constrain on the current indicator column (value "Y") to avoid errors in user queries.

Load Cycle Rollback: Suppose a monthly load cycle runs against the data warehouse. During ETL processing of the load cycle an error is detected in the database, or data from an operational source system is suspected of being corrupt, or some other data quality issue is discovered. Before the incorporation of operational meta data columns, the data warehouse administrator had only limited methods of isolating and removing this corrupt or suspect information. Now the administrator can be more selective in removing the erroneous data from the database. The first rollback method is to utilize the load cycle identifier to completely remove the last refresh cycle from the data warehouse. This is accomplished simply by constraining on the value of the load cycle in question and removing the affected rows in each table. The same method can be further tailored, in certain circumstances, to remove rows from a particular load cycle for a specific operational system by additionally constraining on the operational system identifier in question. For either method, the rollback process will need to re-assign the current indicator, once the suspect rows are removed, for rows that were flagged with a "Y" prior to the last load cycle.

Archiving and Purging: The data warehouse has been up and running, with monthly load cycles processed successfully, for some period of time. Through your business users or a database monitoring tool, you become aware of data that has become dormant over time or inactive for some other reason. Since this data is not being used on any regular basis, a decision is made either to archive it off to a near-line storage device or to purge it from the data warehouse for size and performance reasons. The operational meta data columns again give the data warehouse administrator a variety of options for isolating and flagging candidate rows for archiving or purging. First, the load date can be interrogated to isolate rows of data from a particular period; this method again imposes additional overhead on the RDBMS to analyze the load date. The second option is to constrain on rows of data from a particular set of load cycles. This method provides a more efficient means of flagging candidate rows and avoids interrogating a date column to identify a particular period.

Slowly Changing Dimensions (SCD) Type 2: Data warehouses or marts that use some variant of a star schema data model can utilize the current flag indicator to aid SCD type 2 processing for dimension table loads. The established method for processing SCD type 2 records is to sort both sources of information, the operational system and the data warehouse, on the production or natural key of the dimension table. The current flag indicator can be used to constrain which records from the dimension table will be compared to the current load cycle data being processed into the warehouse: only dimension records containing a "Y" in the current indicator column need to be compared to the load cycle data, since they contain the most up-to-date information for the specific production key(s). A consecutive comparison is made between the two data files based on these key column(s). Current load cycle records with production key values not found in the dimension table are loaded as new rows; these rows receive new surrogate keys for the dimension and have their current indicator flag set to "Y", because no other occurrence of the production key combination exists in the dimension table to date. Production keys found in both the load cycle data and the dimension table are further interrogated based on the columns the business users deem relevant for tracking changes. The two sources are sorted on the key and relevant columns and a consecutive comparison is made. Current load cycle records whose relevant columns do not match their corresponding dimension table rows are flagged as new rows; these are inserted with new surrogate keys and have the current flag indicator set to "Y", because a relevant column has been updated since the last load cycle. Previously loaded dimension rows that have a matching production key in the current load cycle file but differences in relevant columns are assigned a current indicator of "N", because new information on relevant columns of the dimension has been received in the most recent load cycle.
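Once such columns exist, the use strategies described above reduce to very short SQL statements. The statements below are only a sketch; the table, view and column names follow the hypothetical customer dimension sketched earlier, and the load cycle values are made up.

-- Extraction of current data: a view that constrains on the current flag
CREATE VIEW customer_dim_current AS
SELECT *
FROM   customer_dim
WHERE  current_flag = 'Y';

-- Load cycle rollback: remove every row loaded by a suspect refresh cycle
DELETE FROM customer_dim
WHERE  load_cycle_id = 1042;      -- hypothetical suspect cycle

-- Archiving/purging: flag candidate rows by constraining on old load cycles
SELECT *
FROM   customer_dim
WHERE  load_cycle_id <= 900;      -- hypothetical cutoff cycle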


This same process of constraining on the current flag indicator and comparing production keys between the dimension table and the load cycle is repeated during each update of the data warehouse. The incorporation of operational meta data columns into the data warehouse architecture also allows administrative and data quality measurement questions to be answered easily. Some examples of the questions that can now be answered are:

What operational system(s) provided this particular row of information to the data warehouse? Purpose: Data quality measurement and publishing of ETL load cycle results.



When was this particular row of information last refreshed by the operational system? Purpose: Data quality measurement.



How many rows of information will be affected by an archive or purge of data? Purpose: Administrative and maintenance.



What is the growth rate for a table per load cycle? Purpose: Administrative, maintenance and publishing of ETL load cycle results.



What effective growth rate is being experienced from a particular operational source system on the data warehouse? Purpose: Administrative, maintenance and publishing of ETL results.



What is the relative confidence level of the data currently loaded into the data warehouse? Purpose: Data quality measurement.



What percentage effect on confidence level would be measured if dirty data from an operational system were corrected at the source? Purpose: Data quality measurement.

Additionally, data warehouse administrators benefit from the incorporation of operational meta data by gaining increased options and flexibility in accomplishing administrative tasks[9]. A couple of the questions above are sketched as queries below.
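As an illustration, the growth-rate and source-system questions reduce to simple aggregate queries against the hypothetical dimension used in the earlier sketches; the object names are illustrative only.

-- Growth per load cycle (administrative/maintenance, ETL load cycle reporting)
SELECT load_cycle_id, COUNT(*) AS rows_loaded
FROM   customer_dim
GROUP BY load_cycle_id
ORDER BY load_cycle_id;

-- Rows contributed by each operational source system (data quality measurement)
SELECT source_system_id, COUNT(*) AS row_cnt
FROM   customer_dim
GROUP BY source_system_id;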


3.4 Metadata Primary Goals
Metadata has three primary goals:

Fig. 3.1 Periodic warehouse data[8]

Define: Manage the components necessary to provide an understanding of the DW and of the enterprise through its content.

Build and Administer: Manage the components necessary to efficiently define, build and operate the DW. DW metadata includes the structure and content of the operational data that is used to execute DW jobs and to perform the audit, balance and control aspects of the DW environment. This goal requires well-defined metadata about both data and processes.

Navigate: Manage the components necessary to enable self-service use of the DW that is efficient, accurate and repeatable[8].


3.5 Benefits of metadata:

Table 3.1 Benefits of metadata[8]

Improved Productivity and Availability: Downstream developers can achieve greater productivity and responsiveness by retrieving physical table and file definitions for the development of data movement and transformation processes. Information about data can be accessed on demand.

Better Consistency: The same language and understanding of data can be shared across business and technical users; better consistency can lead to reduced complexity.

Quality Improvement: A consistent language for data that combines business and technical metadata will lead to more reliable interpretations of data.

Smoother Development Techniques: Change management can be aided by providing metadata inventory and impact analysis reports. Taking a metadata-driven approach can smooth the transition from design to development.

Impact Analysis: Impact analysis allows business and technical users to see the potential results of a change. In business-driven environments where changes occur frequently, this is instrumental in managing the impact to a complex, federated environment.


CHAPTER-4 Literature Review


Literature Review

1. "Data Warehouse Performance Management Techniques": By Andrew Holdsworth, Oracle Services, Advanced Technologies, Data Warehousing Practice, 2/9/96. This document describes the management and technical issues involved in a Data Warehouse project. Eventually these techniques will be refined through time and experience to become part of Oracle's overall Data Warehousing strategy.

2. "Performance Tuning Mechanisms for Data Warehouse: Query cache": International Journal of Computer Applications (0975-8887), Volume 2, No. 2, May 2010. Data warehousing encompasses architectures, algorithms, and tools for bringing together selected data from multiple databases or other information sources into a single repository, called a data warehouse, suitable for direct querying or analysis. In recent years data warehousing has become a prominent buzzword in the database industry, but attention from the database research community has been limited. At the warehouse, queries can be answered and data analysis can be performed quickly and efficiently since the information is directly available, with model and semantic differences already resolved. The primary goal of a data warehouse is to free the information locked up in the operational database so that decision makers and business analysts can query, analyze and plan regardless of the data changes in the operational database. As the number of queries is large, there is in certain cases a reasonable probability that the same query is submitted by one or more users at different times; each time a query is executed, all the data of the warehouse is analyzed to generate its result. The paper examines the common problems faced by data warehouse administrators, aiming to minimize response time and improve the overall efficiency of the data warehouse, particularly when the warehouse is updated at regular intervals, and proposes strategies that make for a superior data warehouse.

3. "Near Real Time ETL": By Panos Vassiliadis, Alkis Simitsis, Springer journal Annals of Information Systems, AoIS Vol. 3, 2008. Near real time ETL deviates from the traditional conception of data warehouse refreshment, which is performed off-line in a batch mode, and adopts the strategy of propagating changes that take place

M.Tech Thesis-Data Warehouse Performance Tuning In Better Reporting

53

in the sources towards the data warehouse to the extent that both the sources and the warehouse can sustain the incurred workload. In this article, we review the state of the art for both conventional and

near real time ETL, we discuss the background, the architecture, and the technical issues that arise in the area of near real time ETL, and we pinpoint interesting research challenges for future work.

4. "DATA WAREHOUSE PROCESS MANAGEMENT": By Panos Vassiliadis, Christoph Quix, Yannis Vassiliou, Matthias Jarke. Previous research has provided metadata models that enable the capturing of the static components of a data warehouse architecture, along with information on different quality factors over these components. This paper complements this work with the modeling of the dynamic parts of the data warehouse. The proposed meta model of data warehouse operational processes is capable of modeling complex activities, their interrelationships, and the relationship of activities with data sources and execution details. Moreover, the metamodel complements the existing architecture and quality models in a coherent fashion, resulting in a full framework for quality-oriented data warehouse management, capable of supporting the design, administration and especially the evolution of a data warehouse. Finally, we exploit our framework to revert the widespread belief that data warehouses can be treated as collections of materialized views. We have implemented this meta model using the language Telos and the metadata repository system ConceptBase.

5. "Seven Steps to Optimizing Data Warehouse Performance": By Rob Armstrong, NCR, 0018-9162/01, December 2001, IEEE. The evolution of data warehousing has followed a pattern of compromising the project to overcome technical barriers at the expense of business value. One such compromise is the automatic use of summary data or data models geared toward a single application. Initially, companies simply extracted data from operational systems and put it in another database, typically the one used for online transaction processing. This kept longer-running and more complex analysis workloads from interfering with more mission-critical OLTP work. Most operational systems store data in a normalized model in which certain rules eliminate redundancy and simplify data relationships. While beneficial for the OLTP workload, this model can inhibit those same OLTP databases from running analytical queries effectively. As data volume grew on analytical systems, queries took more time. Because the analytical systems did not need to support the OLTP workload, many


developers began preplanning for the answer sets. This preplanning included creating summary tables of preaggregated data; placing indexes in the system to eliminate scanning large volumes of data; denormalizing the model, putting data into one table instead of having tables that need to be joined together; and storing the data in sorted order, thus avoiding any sort activity. For any of these four optimizations to occur, developers must first understand the analysis and reports being requested and the data demographics in detail. In the early days, many of those in charge of developing analytical systems did not have a background working in a data warehouse environment. They were trained to optimize performance by manipulating database parameters and data models, which was considered a step forward because previously the databases yielded little or no information. As user communities increased in size and queries became more complex, users often could not define all the analysis they wanted to run, and even if they could, the time required to process and store answer sets became unmanageable. The solution was to once again extract data, this time to online analytical processing systems that specialized in "slicing and dicing" data.

6. "Measuring the Success of a Data Warehouse": Whitepaper, Threshold Consulting Services. It is difficult to determine success metrics for a data warehouse. Certainly there is evidence that return on investment (ROI) is frequently used to determine warehouse success, but there are many potential metrics that may be used. For example, it is possible to define financial measurements, usage measurements, customer satisfaction, availability, performance, response time, etc., as reasonable warehouse success indicators. However, many of the possible measurements can be interpreted in different ways. For example, a decrease in queries against a data mart may indicate that use is decreasing, or it may indicate that the users are so happy with the data that they find it more efficient to download it to a cube on their desktop for easier access, thus increasing usage. Taking action based simply on query volume may lead to the wrong actions being taken. In order to obtain as objective a view of warehouse success as possible, the Balanced Scorecard, as described by Robert Kaplan and David Norton, can be modified to measure the data warehouse. The metrics presented in this article are not the only ones that can be used to measure the warehouse. Rather, these should be seen as a set of example measurements that can be refined over time as the organization's perspective on the data warehouse changes and matures.


7. "ETL vs ELT": By Robert J Davenport, June 2008, Insource IT Consultancy Ltd. With 80% of a Data Management (DM) or Business Intelligence (BI) project cost being attributable to the design and development of the data warehouse, it is crucial that the right decision is made over the architecture to be employed. Historically, we have had little choice but to use the ETL standard. Emergent technologies are providing toolsets and approaches that challenge this tradition. At least now we have a choice and, as with all choices, it is imperative that the choice be an informed one. A wrong decision can be costly and potentially fail to meet the objectives set out by the project. In accepting the fact that change is a principle of data warehousing, it is imperative that the approach employed in the design and development of a data warehouse respects, and hopefully embraces, this need. Whilst both ETL and ELT routines can be changed, the process of change in ELT routines involves less cost, time and risk. Many organizations have recognized the benefits that can be gained from the employment of ELT. In spite of having already deployed warehouse solutions that use ETL, they have embarked upon a process of re-engineering their solutions using ELT through the use of effective tools. Whilst data warehouse development traditionalists will undoubtedly pour scorn on the ELT approach, the clear and obvious benefits cannot afford to be ignored.

8. "Teradata® RDBMS Performance Optimization": By NCR Corporation, B035-1097-122A, December 2003. Teradata RDBMS Performance Optimization provides information that helps you ensure that Teradata operates at peak performance based on your applications and processing needs, and describes release-specific performance enhancements that require no user intervention. The book contains sixteen chapters.

9. "Tuning Servers, Storage and Database for Energy Efficient Data Warehouses": By Meikel Poess, Raghunath Othayoth Nambiar, 978-1-4244-5446-4/10, 2010 IEEE, ICDE Conference 2010. Undoubtedly, reducing power consumption is at the top of the priority list for system vendors and data center managers, who are challenged by customers, analysts, and government agencies to implement green initiatives. Hardware and software vendors have developed an array of power-preserving


techniques. On-demand-driven clock speeds for processors, energy efficient power supplies, and operating-system-controlled dynamic power modes are just a few hardware examples. Software vendors have contributed to energy efficiency by implementing power-efficient coding methods, such as advanced compression, and by enabling applications to take advantage of large memory caches. However, adoption of these power-preserving technologies in data centers is not straightforward, especially for large, complex applications such as data warehouses. Data warehouse workloads typically have oscillating resource utilizations, which makes identifying the largest power consumers difficult. Most importantly, while preserving power remains a critical consideration, performance and availability goals must still be met with systems using power-preserving technologies. This paper evaluates the tradeoffs between existing power-saving techniques and their performance impact on data warehouse applications. Our analysis will guide system developers and data center managers in making informed decisions regarding adopting power-preserving techniques.

10. "TERADATA RDBMS": By Roland Pfeffer, NCR, Teradata Division. The Teradata system software is the foundation of the relational database server. The purpose of the software is to support SQL manipulations of the database. Teradata software includes: client software, which includes UNICODE support; server software used to support the Teradata RDBMS; Structured Query Language (SQL) and its uses for application programming and interactive database queries; stored procedures; and third-party software.

11. "Active Data Warehousing - the Ultimate Fulfillment of the Operational Data Store": By Dr. Claudia Imhoff, Intelligent Solutions, Inc. Over the years, data warehousing has gone through a number of evolutions, from a relatively simple reporting database to sophisticated analytical applications such as analyzing customer lifetime values, market basket analyses, potentially defecting customers, fraud patterns, inventory churns, and so on. In all, though, these static sets of data could not give us the most current and recent changes necessary to ACT upon the results of Business Intelligence analyses. For example, once we could identify a customer likely to go to a competitor, we still could not view their current situation, e.g., what products does the customer have with the company, is the customer a VIP requiring special treatment, where are they in the sales cycle? The reason for this lack of insight was that the warehouse was set up to give us static snapshots of data, perhaps as recent as last week. But last week's (or even last night's) data is often not sufficient to react to current situations. Things


change rapidly in today's e-business economy, and the company with the best set of integrated, current data is the one that will not only survive but will actually thrive in this economy. Unfortunately most enterprises today do not have any integrated data other than the snapshots found in their data warehouses. This is where the need for the Operational Data Store (ODS) was generated. Fortunately, you can now have integrated data both in the static snapshots and in live, current records, an environment in which both types of data and requirements can co-exist. This concept is called the Active Data Warehouse. To better understand this advance in technology, let us examine the characteristics that make the ODS so very different from the traditional data warehouse. To do this, you must understand the difference between analytical and operational applications. We classify the analytical applications as Business Intelligence, noting that they consist of the data warehouse supplying data to the various analytical applications in the data marts. The applications running against these components use decision support interfaces (DSIs) and give us great insight into our customers' demographics, buying habits, profitability, lifetime value, and so on. But insight into customer behaviour is not enough. As stated, you also need the ability to ACT upon these findings by having ready access, from anywhere in the enterprise, to integrated, comprehensive and urgent information about customers, products, inventories, orders, and so on, as well as quick access to some analytical results. The operational applications are called Business Management components. These applications give the enterprise the ability to take action using its collective intelligence and subject knowledge. They also provide the organization with an enterprise-wide understanding of its situation, which facilitates a transition away from the silo business unit or functional viewpoint. For example, many of us have had the experience where we update our contact information (e.g., new address, phone and fax number) through a web site, only to find ourselves repeating that same information when we talk to a call center representative later that day. This is a prime example of an enterprise with severely isolated systems.

12. "Discover Teradata Meta Data Services": You can't effectively use the abundance of data in your data warehouse if you don't have a clear understanding of the data, what they represent, how they interrelate, and therefore how to best use them. The data warehouse, by its very nature, is a complex and dynamic environment containing vast amounts of data. Arguably the data in the warehouse and the knowledge derived from it are your organization's most valuable asset, and metadata is essential for understanding the information in the data warehouse. The most effective use of the data warehouse is made through the use of metadata. It helps an organization identify, describe, link, and disseminate the proper information to


all users of the warehouse. Metadata is increasingly noted as a critical factor for understanding the warehouse, as it helps uncover the data that are useful and, just as importantly, the data that are not. A metadata management solution can provide: standardized data definitions; information sharing across the enterprise; impact analysis that reveals relationships between data elements; information about redundancies before they occur, eliminating duplication; data lineage information; consistency and quality; adherence to compliance and governance requirements; and employee self-service for search and discovery of information.

13. "Data Warehouse Performance": Beixin (Betsy) Lin (Montclair State University, USA), Yu Hong (BearingPoint Inc., USA), and Zu-Hsu Lee (Montclair State University, USA). A data warehouse is a large electronic repository of information that is generated and updated in a structured manner by an enterprise over time to aid business intelligence and to support decision making. Data stored in a data warehouse is non-volatile and time variant and is organized by subjects in a manner to support decision making (Inmon et al., 2001). Data warehousing has been increasingly adopted by enterprises as the backbone technology for business intelligence reporting, and query performance has become the key to the successful implementation of data warehouses. According to a survey of 358 businesses on reporting and end-user query tools, conducted by Appfluent Technology, data warehouse performance significantly affects the Return on Investment (ROI) of Business Intelligence (BI) systems and directly impacts the bottom line of the systems (Appfluent Technology, 2002). Even though in some circumstances it is very difficult to measure the benefits of BI projects in terms of ROI or dollar figures, management teams are still eager to have a "single version of the truth," better information for strategic and tactical decision making, and more efficient business processes by using BI solutions (Eckerson, 2003). Dramatic increases in data volumes over time and the mixed quality of data can adversely affect the performance of a data warehouse. Some data may become outdated over time and can be mixed with data that are still valid for decision making. In addition, data are often collected to meet potential requirements, but may never be used. Data warehouses also contain external data (e.g. demographic, psychographic, etc.) to support a variety of predictive data mining activities. All these factors contribute to the massive growth of data volume. As a result, even a simple query may become burdensome to


process and cause overflowing system indices (Inmon et al., 1998). Thus, exploring the techniques of performance tuning becomes an important subject in data warehouse management.

14. "A Metadata-Driven Approach for Data Warehouse Refreshment": Data warehousing technology supports information management for decision making by integrating data from operational systems and external sources in a separate database, the data warehouse. In contrast to operational systems which store detailed, atomic and current data accessed by OLTP (online transactional processing) applications, data warehousing technology aims at providing integrated, consolidated and historical data for OLAP (online analytical processing) applications. As time advances and the sources from which warehouse data is extracted change, data warehouse contents must be "synchronized" with the underlying sources. Thus, after an initial loading phase, warehouse data must be regularly refreshed, i.e., modifications of operational data since the last warehouse refreshment must be propagated into the warehouse such that warehouse data reflect the state of the underlying systems. Data warehouse refreshment is a complex process comprising several tasks like monitoring, extracting, transforming, integrating and cleaning data, deriving new data, building histories and loading the data warehouse. This thesis proposes an approach which allows for the modeling and execution of the data warehouse refreshment process at an intermediate layer between operational sources and the target data warehouse, and provides support for several tasks related to warehouse refreshment in data warehouse environments consisting of a wide variety of heterogeneous sources, and independently of how target warehouse data is stored persistently. The contribution of the thesis is twofold. First, the thesis proposes a metamodel which includes a set of modeling constructs for the definition of metadata required for executing the data warehouse refreshment process, such as the description of operational and target warehouse data, the definition of mappings between operational and warehouse data, including refreshment tasks like transformation, cleaning, integration, process scheduling, etc. Second, the thesis describes a methodology and an appropriate operational infrastructure for defining and executing concrete refreshment processes based on the above mentioned metamodel. The definition of the refreshment process is supported by a metadata administration tool which provides facilities for creating, modifying and checking metamodel instances stored in an object-oriented metadata repository. Furthermore, the thesis describes the way various monitoring techniques can be integrated in a data warehouse environment in order to detect updates in operational systems and refresh the warehouse incrementally. An advanced key management concept is provided which ensures the correctness of keys propagated into the target warehouse, thus improving data quality in the target warehouse.


Finally, target warehouse mapping and history management techniques are introduced which consider the specific database design techniques used for data warehouses, as well as the multidimensional and temporal character of warehouse data.


CHAPTER-5 Data Warehouse Performance Tuning


5. Data Warehouse Performance Tuning: Performance in general can be defined as the accomplishment of a given task measured against preset, known standards of accuracy, completeness, cost, and speed. In a contract, performance is deemed to be the fulfillment of an obligation in a manner that releases the performer from all liabilities under the contract. In the context of a data warehouse, the performance of a data warehouse is largely a function of the quantity and type of data stored within the database and the query/data-loading workload placed upon the system[9]. Over the last decade, the largest data warehouses have increased from 5 to 100 terabytes, according to Winter Corp., and by 2010, most of today's data warehouses will be 10 times larger, according to The Data Warehouse Institute (TDWI). As data warehouses grow in size to accommodate regulatory requirements and competitive pressures, ensuring adequate database performance will be a big challenge in order to answer more ad hoc queries from more people[9].

Definition of Acceptable Performance: The simplest definition of acceptable performance is that the data warehouse is able to meet all of the business requirements in the required time scales[9].

5.1 Performance Tuning
Tuning is a process to identify and tune target applications for performance improvements and to proactively prevent application performance problems[10]. The goal of performance tuning is to minimize the response time for each query and to maximize the throughput of the entire database server by reducing network traffic, disk I/O, and CPU time. This goal is achieved through understanding application requirements, the logical and physical structure of the data, and the tradeoffs between conflicting uses of the database, such as online transaction processing (OLTP) versus decision support[12].

There are two distinct types of tuning:

Proactive Monitoring



Bottleneck Elimination


5.1.1 Proactive Monitoring
Proactive monitoring usually occurs on a regularly scheduled interval, where a number of performance statistics are examined to identify whether the system behavior and resource usage have changed. Proactive monitoring can also be considered proactive tuning. Usually, monitoring does not result in configuration changes to the system, unless the monitoring exposes a serious problem that is developing. In some situations, experienced performance engineers can identify potential problems through statistics alone, although accompanying performance degradation is usual. Experimenting with or tweaking a system when there is no apparent performance degradation, as a proactive action, can be a dangerous activity, resulting in unnecessary performance drops. Tweaking a system should be considered reactive tuning, and the steps for reactive tuning should be followed. Monitoring is usually part of a larger capacity planning exercise, where resource consumption is examined to see changes in the way the application is being used, and in the way the application is using the database and host resources[10].

5.1.2 Bottleneck Elimination
Tuning usually implies fixing a performance problem. However, tuning should be part of the life cycle of an application, through the analysis, design, coding, production, and maintenance stages. Oftentimes the tuning phase is left until the system is in production; at that point, tuning becomes a reactive fire-fighting exercise, where the most important bottleneck is identified and fixed. Usually, the purpose of tuning is to reduce resource consumption or to reduce the elapsed time for an operation to complete. Either way, the goal is to improve the effective use of a particular resource. In general, performance problems are caused by the over-use of a particular resource; that resource is the bottleneck in the system. There are a number of distinct phases in identifying the bottleneck and the potential fixes, and these are discussed in the sections that follow[10].


5.2 Problem Area: Areas where a data warehousing system can use a little performance tuning for better performance:

5.2.1 ETL - Given that the data load is usually a very time-consuming process (and hence typically relegated to a nightly load job) and that data warehousing-related batch jobs are typically

of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes only just on time is going to have problems, simply because the jobs often do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.

5.2.2 Query Processing - Sometimes, especially in a ROLAP environment or in a system where reports are run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. Experience has shown that ROLAP reports, or reports that run directly against the RDBMS, often exceed this limit, and it is hence ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. We present a number of query optimization ideas.

5.2.3 Report Delivery - It is also possible that end users experience significant delays in receiving their reports due to factors other than query performance. For example, network traffic, server setup, and even the way the front end was built sometimes play significant roles. It is important for the data warehouse team to look into these areas for performance tuning. The focus here is on report response time, as discussed above, and the proposed solution will help to improve it. Many factors can contribute, positively or negatively, to better reporting, depending on how they are considered and implemented.

5.3 Factors Affecting Reporting Layer Response Time

Tools Selection




Physical Design



Normalized vs. de-normalized



Relational vs. dimensional



Hybrid



Reporting frequency



Concurrent Users



Amount of data



Indexing



Statistics

All the above-mentioned factors can impact reporting response time, but this thesis focuses on how to keep statistics up to date, with the help of available metadata, for better reporting.

5.4 Statistics
In a DWH, statistics are milestone directives that help the database optimizer/engine come up with the most cost-effective execution plan. So what information is held in statistics? The following information can be stored in the DWH, ensuring that the data required to produce an effective execution plan is available in advance. For example:

Number of nulls



Number of Unique values



Average rows per value



Number of intervals



Number of rows updated.



Mode Frequency etc…

With this precompiled information available, the optimizer/database engine uses it rather than calculating or estimating the same values at run time, which would otherwise increase response time unnecessarily[10].
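On Teradata, the platform used for the pilot in Chapter 7, the demographics collected for a table can be listed directly; the database and table names below are placeholders only.

HELP STATISTICS sales_db.daily_sales_fct;
-- Lists, for each collected statistic, the collection date and time, the
-- number of unique values and the column(s) the statistic covers.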


How are statistics defined, and do they need to be defined every time? Statistics can be defined by developers, DBAs or architects to ensure smooth execution of queries on the box. They can be defined on a column or on a combination of columns, and they need to be defined only once: a statistic is a physical attribute of a database table. However, statistics require frequent refreshes, at table level, based on how often the data changes and by how much.

Are statistics necessary? No, having statistics is not strictly necessary, but depending on the filter and join conditions in queries they help to avoid full table scans and over-consumption of CPU cycles. If statistics do exist on the database tables, then it is essential to keep them up to date; otherwise they feed false information to the optimizer/database engine, which turns into a bad execution plan. For example, if statistics are not up to date, they may say there are 1,000,000 records in a table when the table has actually grown by 10 million rows; this kind of inconsistency leads to costly, CPU-consuming execution plans.

Where is this information saved, and how is it used? The system stores this information in system tables; depending on the database software used, it can be a string field or even a CLOB or BLOB object. When a query is submitted to the database, the optimizer/database engine uses this precompiled information from the system tables to produce a cost-effective execution plan.
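As a sketch of this one-time definition and subsequent refresh, the Teradata statements below use placeholder database, table and column names; the EXPLAIN at the end is simply one way to confirm that the optimizer is picking up the collected demographics.

-- Define statistics once, on a single column and on a column combination
COLLECT STATISTICS ON sales_db.daily_sales_fct COLUMN (sale_date);
COLLECT STATISTICS ON sales_db.daily_sales_fct COLUMN (store_id, product_id);

-- Refresh: re-collecting without a column list refreshes every statistic
-- already defined on the table
COLLECT STATISTICS ON sales_db.daily_sales_fct;

-- Inspect the resulting execution plan and its row estimates
EXPLAIN
SELECT store_id, SUM(sale_amt)
FROM   sales_db.daily_sales_fct
WHERE  sale_date = DATE '2011-06-30'
GROUP BY store_id;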

5.5 TRADITIONAL METHODOLOGY FOR STATISTICS REFRESH
At many sites, the statistics refresh process is addressed with a weekly, fortnightly or monthly job that simply refreshes all available statistics on the box, regardless of the functional data load frequency. We analyzed this traditional solution; a description of the findings follows.


5.5.1 What was done in the investigation? During the analysis we plotted various graphs and trends to identify the nature of the wastage, and we found that the absence of intelligence in the weekly stats collection process was causing this wastage of CPU cycles.

Evident Traits



Every time, stats were refreshed at the database level



No separation for static data



No special consideration on heavily floating facts



Requires a large chunk of CPU once a week, reducing other processes' slices



No Aging information available



No auditing of what was, what is, and what should be

Fig. 5.1 Available Traditional Methodology


CHAPTER-6 Proposed Methodology


6. Proposed Methodology
What alerted us to the solution? Vilfredo Pareto, an Italian economist, observed that 80% of the land in Italy was owned by 20% of the population; he developed the principle further by observing that 20% of the pea pods in his garden contained 80% of the peas[11]. Later the same observation was generalized by Joseph M. Juran, a business management consultant, and named after Vilfredo Pareto as the Pareto Principle: "80% of effects come from 20% of causes." If we allow the same extension in the BI life cycle, we see that more than 20% of CPU cycles were consumed by roughly 80% of the system management tasks, and when CPU cycles are scarce, every bit saved is an addition to the pocket, which indirectly converts into savings. As Warren Edward Buffett, the American investor, put it, "what we save is also a kind of earning." Along similar lines, when we started analyzing systems we found that, of the CPU cycles assigned to system management, approximately 80% were consumed by stats management. Upon further investigation we found that, of that 80%, approximately more than 50% was wastage, caused by unnecessary repetition of the process even when it was not required[11].

6.1 Flow Chart
Step 1: Because this process is tightly integrated with the ETL at table level, to initiate it we pass the database name and table name to the process.

Step 2: In this step the process checks the existence of the database object, to make sure the table actually exists in the database as a table. If the table exists, the process goes on to Step 3; otherwise it breaks out and exits.

Step 3: Having confirmed existence in Step 2, the process makes sure that the table has data. If the table is populated, it goes on to the next level; otherwise it breaks out and exits.


Fig.6.1 Flow Chart of Proposed Solution

Step 4: Having confirmed in Step 3 that the table is populated, the process makes sure that stats exist on the table. If stats exist, it proceeds to the next step; otherwise it breaks out and exits.

Step 5: The process now knows, from the previous steps, that stats exist; the next question is whether the data load is incremental or truncate-and-load. Because this cannot be identified from metadata, the end user passes this information as an input to the process. If the table is truncate-and-load, the process refreshes the whole set of stats on the table; if not, it simply proceeds to the next step.

Step 6: In this step the process determines by what percentage the data of the base table has been modified. Ideally, if the data change is more than 9%, the stats are refreshed; otherwise they are left alone (see the sketch after Step 7).

Step 7: In this step the process checks the age of the existing stats; if they are more than 30 days old, it refreshes them, otherwise it simply exits the loop.
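A minimal sketch of the Step 6 test follows, assuming the current row count is compared with a row count recorded at the last statistics refresh in one of the metadata tables introduced in Chapter 7; the column names used here (database_name, table_name, rows_at_last_stats) are hypothetical.

-- Hypothetical sketch of the 9% change test (Step 6)
SELECT CASE
         WHEN ABS(CAST(cur.row_cnt AS FLOAT) - meta.rows_at_last_stats)
              / NULLIFZERO(meta.rows_at_last_stats) > 0.09
         THEN 'REFRESH STATS'
         ELSE 'SKIP'
       END AS stats_action
FROM   (SELECT COUNT(*) AS row_cnt
        FROM sales_db.daily_sales_fct) AS cur,
       xxxx_vm.stats_fct AS meta
WHERE  meta.database_name = 'SALES_DB'
  AND  meta.table_name    = 'DAILY_SALES_FCT';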


6.2 Process Flow Model

Fig. Proposed Model Using STATS


CHAPTER 7 Experiment/Result


Experiment/Result
To verify the results of this approach, pilot testing was done on the Teradata data warehouse software, so the terminology is inclined toward that database software; the framework, however, can be generalized across any platform. In this research work we are proposing a metadata-driven approach for performance tuning using Teradata software.

7.1 Teradata Software
The Teradata system software is the foundation of the relational database server. The purpose of the software is to support SQL manipulations of the database[9]. Teradata software includes:

Client software, which includes UNICODE support



Server software used to support the Teradata RDBMS



Structured Query Language (SQL) and its uses for application programming and interactive database queries



Stored procedures



Third-party software

Client Software
Teradata RDBMS supports the UCS Transformation Format 8-bit form (UTF-8) UNICODE client character set on Teradata clients on a per-session basis. One- to three-byte UTF-8 UNICODE characters from the client are stored on the Teradata RDBMS as two-byte UNICODE characters. By supporting UNICODE, the Teradata RDBMS can:

Support data in multiple languages in the same session



Support non-Latin data, for example, Japanese, Chinese, and so forth, on all systems

Teradata RDBMS supports running with or without a channel- or network-attached client. The following table describes the available client software, recognizing that the "client" may be the system running the Teradata RDBMS itself:


Table 6.1 Teradata client software[9]

BTEQ - Interactive and batch query processor/report generator. Channel- and network-attached clients.

C Preprocessor - Permits embedding SQL in C programs. Channel- and network-attached clients.

COBOL Preprocessor - Permits embedding SQL in COBOL programs. Channel-attached clients; some network-attached clients.

PL/I Preprocessor - Permits embedding SQL in PL/I programs.

CLI - Packages and unpackages requests; can be embedded in application programs using function calls. Channel- and network-attached clients.

TDP - Manages data communication; handles sessions, logging, recovery, restarts, physical I/O from the PEs, and security. Channel-attached clients.

MTDP/MOSI - Manages data communication; handles logging, recovery, restarts, and physical I/O from the PEs. Session and security management are handled by the Gateway software on the server. Network-attached clients.

Archive/Restore - Archives data to tape; restores taped data to the Teradata RDBMS. Channel-attached clients.

ASF2 (MR-RAS products only) - Archives data to tape; restores taped data to the Teradata RDBMS. Note: this software is supported on legacy systems only. Network-attached clients.

FastExport - Extracts large volumes of data from the Teradata RDBMS. Channel- and network-attached clients.

FastLoad - Does high-performance data loading from the client into empty tables. Channel- and network-attached clients.

MultiLoad - Does high-performance data maintenance, including inserts, updates, and deletions to existing tables. Channel- and network-attached clients.

Open Teradata Backup (OTB), including NetBackup and NetVault - Open architecture products that provide backup and restore functions for Windows clients. Network-attached clients.

TPump - Provides continuous update of tables; performs insert, update, and delete operations, or a combination of these operations, on tables using the same source feed. Channel- and network-attached clients.

ODBC - Provides access to the Teradata RDBMS from various tools, increasing the portability of data. Network-attached clients.

Server Software

The server software includes the following:

- Database Window
- Channel and/or network gateway communications support
- Teradata Parallel Database Extensions (PDE)
- Teradata database management software
- Request dispatcher
- Session controller
- Access module process (AMP)
- Teradata file system
- SQL parser (syntaxer, resolver, optimizer, and so forth)

A server may also contain data loading utilities such as MultiLoad and FastLoad, data export utilities such as FastExport, and the SQL data access utility BTEQ [9].
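Because the optimizer inside the SQL parser is the component that consumes the statistics this thesis manages, its dependence on them can be observed with the standard EXPLAIN modifier. The sketch below uses the same hypothetical table as earlier; in the generated plan text, row-count estimates are annotated with phrases such as "with high confidence" when statistics exist and "with no confidence" when they do not.

    EXPLAIN
    SELECT COUNT(*)
    FROM   SALES_DB.FACT_SALES
    WHERE  SALE_DT = DATE '2011-06-30';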


7.2 Basic Setup Required

Table-size collection should be enabled on the box at 24-hour intervals, and the following tables and views should exist in a dedicated database.

Tables:

- XXXX_DBO.IS_TABLE_EMPTY: an almost-always-empty table, populated only while the master procedure executes
- XXXX_DBO.STATS_FCT_STATUS: status of stats collection at a granular level
- XXXX_DBO.STATS_FCT: status of stats collection at a granular level, but from a different perspective
- XXXX_DBO.STATS_METADATA: overrides the default 9% and 30-day thresholds for any table; should be accessible to data architects only (a hypothetical layout is sketched below)
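The thesis states only the purpose of XXXX_DBO.STATS_METADATA (per-table overrides of the 9% change and 30-day age defaults), so the column layout below is a hypothetical sketch rather than the definition used in the pilot.

    CREATE TABLE XXXX_DBO.STATS_METADATA
    ( DatabaseName       VARCHAR(30)  NOT NULL,      -- database owning the table
      TableName          VARCHAR(30)  NOT NULL,      -- table whose defaults are overridden
      ChangePctThreshold DECIMAL(5,2) DEFAULT 9.00,  -- overrides the 9% data-change rule
      StatsAgeDays       INTEGER      DEFAULT 30,    -- overrides the 30-day age rule
      LastUpdatedDt      DATE FORMAT 'YYYY-MM-DD'    -- audit column
    )
    PRIMARY INDEX (DatabaseName, TableName);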

Corresponding views on the above tables, defined with an access lock:

- XXXX_VM.IS_TABLE_EMPTY
- XXXX_VM.STATS_FCT_STATUS
- XXXX_VM.STATS_FCT
- XXXX_VM.STATS_METADATA
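An "access lock" view in Teradata is normally defined with the LOCKING ROW FOR ACCESS request modifier, which lets reporting queries read through the view without blocking concurrent loads. A minimal sketch for one of the views above, assuming a straightforward one-to-one projection, might look like this:

    REPLACE VIEW XXXX_VM.STATS_FCT AS
    LOCKING ROW FOR ACCESS
    SELECT *
    FROM   XXXX_DBO.STATS_FCT;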

Procedures:

- XXXX_APP.INSERT_STATS_FACT
- XXXX_APP.IS_10_PERCN_CHANGE
- XXXX_APP.IS_STATS_30_DAYS_OLDER
- XXXX_APP.IS_STATS_ON_TABLE
- XXXX_APP.IS_TABLE_EMPTY
- XXXX_APP.SMART_STATS_COLLECTION (master procedure)

Syntax for the master procedure:

    CALL XXXX_APP.SMART_STATS_COLLECTION
         (DATABASE_NAME,
          TABLE_NAME,
          IS_TABLE_TRUNCATE_LOAD (Y/N),
          ETL_NAME WITH FOLDER NAME / SCRIPT_NAME WITH PATH,
          OP_1_INTEGER,
          OP_2_VARCHAR)
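For illustration, a call for an incrementally loaded fact table might look like the following. All parameter values are hypothetical, and the last two arguments are the optional integer and VARCHAR slots named in the syntax above.

    CALL XXXX_APP.SMART_STATS_COLLECTION
         ('SALES_DB',                        -- DATABASE_NAME
          'FACT_SALES',                      -- TABLE_NAME
          'N',                               -- not a truncate-and-load table
          'DAILY_LOADS/wf_fact_sales_load',  -- ETL name with folder name
          NULL,                              -- OP_1_INTEGER (optional)
          NULL);                             -- OP_2_VARCHAR (optional)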


7.3 Results

The graph in Fig. 7.1 below shows the CPU utilization of four different users of an application. These users were responsible for stage loads, reporting-table loads, reporting needs, and so on, for sample data of the firm M/s Adsect Technologies, Jaipur, covering two consecutive months in 2011. The process was implemented for user 3 alone, and the CPU-utilization trend for user 3 (green) supports the claim that the process works: in the first three and a half spikes the CPU requirement is higher than in the last three and a half spikes. The graph therefore indicates that the technique introduced in this work reduced CPU utilization considerably, and the same may be applied in practice for better use of the technology.

7.3.1 Cost-Benefit Analysis

Assume the cost of one CPU cycle is 1 unit. In the first week the cost incurred is approximately 4,900 units, and this level continues through the first four weeks; this is the result without using stats. Over the next four weeks, after the STATS approach was applied, the weekly cost falls to about 2,400 units (4,900 - 2,400 = 2,500 units saved per week, a reduction of roughly 51%).

7.3.2 Solution Piloting

As noted at the start of this chapter, the solution was piloted on a Teradata database, so the terminology follows that platform; the framework itself can be generalized to any database platform.

Fig. 7.1 Typical CPU consumption chart for an application in a DWH


CHAPTER 8 Conclusion


8. Conclusion

Today, data warehousing is the backbone of most MNCs and large-scale organizations, yet most of them have not looked closely at database administration costs, especially hardware costs and the cost of data entry and updates. The above work aims to optimize the cost of data warehousing by tuning performance on the OLAP side. Instead of updating all attributes and rows of a table for even a single changed value of an entity, we have proposed a methodology that reduces the effort on the machine side; this is achieved by reducing the number of CPU cycles consumed, using STATS.

8.1 Intended Audience

This thesis involves advanced data warehousing concepts. While writing it, the Tier 3 and Tier 4 population of the IT industry was kept in mind, as per the list below:

- Senior Developers
- Project Managers
- Technical Leads
- DBAs
- Solution Architects


REFERENCES:

[1] Robert J. Davenport, "A Subjective View", part of the Insource Commercial Aspects of BI discussion papers, June 2008.
[2] "Introduction to Teradata® RDBMS V2R4.1", B035-1091-061A, June 2001.
[3] Vishal Gour, "Performance Tuning Mechanisms for Data Warehouse: Query Cache", International Journal of Computer Applications (0975-8887), Vol. 2, No. 2, May 2010.
[4] Teradata Magazine, December 2007.
[5] Robert J. Davenport, "ETL vs ELT", white paper, Insource IT Consultancy Ltd, June 2008.
[6] Gajendra Sharma, "Data Mining, Data Warehousing and OLAP", S.K. Kataria & Sons.
[7] "Clinical Data Repositories and Data Warehouse", Calgary Regional Health Authority, July 19, 1999.
[8] Mike Green, "Metadata Implementation with Ab Initio EME", article, 13 May 2009.
[9] Andrew Holdsworth, "Data Warehouse Performance Management Techniques", Oracle Services, Advanced Technologies, Data Warehousing Practice, 2/9/96.
[10] "Oracle® Database Performance Tuning Guide", 10g Release 2 (10.2), Part Number B14211-03.
[11] http://en.wikipedia.org/wiki/Pareto_principle
[12] "Optimizing Database Performance Overview", SQL Server 2000.
[13] "Teradata® RDBMS Performance Optimization", NCR Corporation.
[14] "Teradata® RDBMS Performance Optimization", NCR Corporation.
