DWH-Performance Tuning Using Metadata Driven Approach
A Dissertation Submitted in Partial Fulfillment of the Requirements for the Award of the Degree of Master of Technology in the Department of Computer Science & Engineering (with specialization in Software Engineering)
Supervisor:
Naveen Hemrajani
Assoc. Professor
SGVU, Jaipur

Submitted By:
Sandeep Bhargava
Enroll. No: SGVU091513431
Department of Computer Science & Engineering
Suresh Gyan Vihar University
Mahal, Jagatpura, Jaipur
September 2011
M.Tech Thesis-Data Warehouse Performance Tuning In Better Reporting
1
Candidate’s Declaration
I hereby declare that the work which is being presented in the Dissertation, entitled "DWH-Performance Tuning Using Metadata Driven Approach", in partial fulfillment for the award of the Degree of "Master of Technology" in the Department of Computer Science & Engineering with specialization in Software Engineering, and submitted to the Department of Computer Science & Engineering, Suresh Gyan Vihar University, is a record of my own investigations carried out under the guidance of Mr. Naveen Hemrajani, Department of Computer Science & Engineering.
I have not submitted the matter presented in this Dissertation anywhere for the award of any other Degree.
(Name and Signature of Candidate)
Sandeep Bhargava
Enrolment No: SGVU091513431
Countersigned by
Name(s) of Supervisor(s)
Naveen Hemrajani Associate Professor, CSE Dept Suresh Gyan Vihar University
DETAILS OF CANDIDATE, SUPERVISOR (S) AND EXAMINER
Name of Candidate: Sandeep Bhargava
Roll No. 102615
Deptt. of Study: Computer Science & Engineering
Enrolment No.: SGVU091513431
Thesis Title: DWH-Performance Tuning Using Metadata Driven Approach
Supervisor(s) and Examiners Recommended (with office address including contact numbers, email ID):

Supervisor:
1. Mr. Naveen Hemrajani, Associate Professor, SGVU, Jaipur, India
   [email protected]

Co-Supervisor:

Examiner 1:
2.
3.

Signature with Date

Ms. Ruchi Dave, Program Coordinator
Mr. Naveen Hemrajani, Vice Principal, SGVU, Jaipur
ABSTRACT

Today's world runs on the intelligence derived from masses of transactional data to predict the growth patterns of an industry, and for this almost everyone is building an in-house data warehouse. A data warehouse (DWH) deals with huge amounts of data, and a lot of analytical queries run against it, covering base data in the range of thousands of gigabytes, to unveil the hidden patterns of the business. The response time of a query therefore grows, metaphorically speaking exponentially, with the volume of base data involved. So we can state the thumb rule as "MORE BASE DATA, MORE ACCURATE RESULTS", but performance will degrade if this is not handled carefully. Also we, as humans, hate to wait; it is a natural phenomenon encoded in our DNA. We do not want to miss a single moment of time, want to utilize every piece of it at its best, and keep looking for opportunities to improve existing processes. Similarly, a lot of work has been done by many researchers around the globe on DWH performance tuning, with many proposed frameworks focusing on areas such as data quality and metadata management. In this thesis, an effort has been made to discuss real industry problems and to show how the performance of a data warehouse can be improved by wisely minimizing the CPU cycles consumed, using a metadata-driven approach.
CONTENTS

Abstract iv
List of Tables viii
List of Figures ix
Acknowledgements x

1. Introduction 2
1.1 Early History 2
1.2 Recent History 3
1.3 Data Warehouse 4
1.3.1 What is a data warehouse 4
1.3.2 Need of data warehouse 5
1.3.3 Benefits of operational/informational systems separation 5
1.3.4 Advantages of data warehouse 6
1.3.5 Data warehouse is different from database 6
1.4 Basic architecture of data warehouse 7
1.4.1 Processes in a data warehouse 7
1.4.2 Basic components of data warehouse 8
1.4.2.1 Load Manager 8
1.4.2.1.1 Extract and load 9
1.4.2.1.2 Fast load 9
1.4.2.1.3 Simple transformations 9
1.4.2.2 Warehouse Manager 10
1.4.2.2.1 Using temporary destination tables 11
1.4.2.2.2 Complex transformations 11
1.4.2.2.3 Transform into a starflake schema 11
1.4.2.2.4 Create indexes and views 11
1.4.2.3 Query Manager 12
1.4.2.4 Operational data sources 12
1.5 Classification of data warehouse architecture 13
1.5.1 Generic two-level architecture 13
1.5.2 Independent data mart architecture 14
1.5.3 Dependent data mart and operational data store 15
1.5.4 Three-layer data architecture 16
1.5.5 Logical data mart and real-time data warehouse architecture 17
1.6 ETL vs. ELT 18
1.6.1 ETL – Extract, Transform, Load 18
1.6.1.1 Strengths 19
1.6.1.2 Weaknesses 19
1.6.2 ELT – Extract, Load, Transform 20
1.6.2.1 Strengths 21
1.6.2.2 Weaknesses 21
1.6.3 Staging 22
1.7 OLTP 22
1.8 OLAP 23
1.8.1 Classification of OLAP 23
1.8.2 Comparison among MOLAP, ROLAP 24
2. Data warehouse design and implementation issues 25
2.1 Tool integration 26
2.2 Extraction 26
2.3 Scalability 27
2.4 Database design 27
2.5 Domain issues 28
2.6 Metadata management 28
2.7 Performance and monitoring 29
3. Metadata 30
3.1 Technical metadata 31
3.2 Business metadata 31
3.3 Operational metadata 32
3.3.1 Incorporation of operational metadata 32
3.3.2 Operational use strategies 36
3.4 Metadata primary goals 40
3.5 Benefits of metadata 41
4. Literature Review 42
5. Performance of Data Warehouse 52
5.1 General 53
5.2 Performance tuning 53
5.2.1 Proactive monitoring 54
5.2.2 Bottleneck elimination 54
5.3 Problem areas 55
5.3.1 ETL 55
5.3.2 Query processing 55
5.3.3 Reporting response time 55
5.4 Factors affecting reporting response time 55
5.5 Statistics 56
5.5.1 Traditional methodology for statistics refresh 57
6. Proposed Methodology 59
6.1 What alerted us to a solution 60
6.2 Proposed Solution 60
6.2.1 Flow chart 61
6.2.2 Process flow model 62
7. Experiment/Result 63
7.1 Teradata software 64
7.2 Basic setup required 67
7.3 Result 68
8. Conclusions and Summary 69
References 71
Paper Publications
List of Tables

Table 1.1 Operational vs. informational systems 6
Table 3.1 Benefits of metadata 41
Table 6.1 Teradata software 65
List of Figures

Figure 1.1 Architecture of data warehouse 8
Figure 1.2 Load Manager Architecture 9
Figure 1.3 Architecture of warehouse manager 10
Figure 1.4 Generic two-level data warehousing architecture 14
Figure 1.5 Independent data mart data warehousing architecture 15
Figure 1.6 Dependent Data Mart and Operational Data Store 16
Figure 1.7 Three-Layer Data Architecture 16
Figure 1.8 Logical data mart and real-time data warehouse architecture 17
Figure 1.9 ETL Architecture 18
Figure 1.10 ELT Architecture 20
Figure 3.1 Metadata Primary Goals 40
Figure 5.1 Available Traditional Methodology 58
Figure 6.1 Flow Chart of Proposed Solution 61
Figure 6.2 Proposed model using STATS 62
Figure 7.1 Typical CPU consumption chart for an application in a DWH 68
Acknowledgements
I would like to thank everyone who helped me write this thesis. In particular, I would like to thank my guide, Mr. Naveen Hemrajani, and the HOD, Mr. Dinesh Goyal, for their moral support and guidance in completing this report. I would also like to take this opportunity to thank Ms. Ruchi Dave, Program Coordinator, Suresh Gyan Vihar University, Jaipur. I express my special gratitude to Mr. Subhash Gander and Ms. Savita Shivani, and also to all my friends and classmates for their support and help. Last but not the least, I wish to express my gratitude to God Almighty for his abundant blessings, without which this thesis would not have been successful.
CHAPTER-1 Introduction
Introduction
In this era of global competition backed by powerful IT services, merely selling large quantities of products and services is not what indicates the so-called "GREEN" health of a business. What makes a product or service more visible is selling or offering it smartly. In layman's terms, this is just the difference between working very hard and working very smartly. So what exactly is this smart way of offering, and how is industry capitalizing on it? The answer is the DWH (data warehouse), a relatively complex term and a key ingredient that leading industries use to identify hidden trends and unveil overall trends in business. Many successful companies have been investing large sums of money in business intelligence and data warehousing tools and technologies. They believe that up-to-date, accurate and integrated information about their supply chain, products and customers is critical for their very survival. For example, Walmart, the giant retail chain, has invested billions of dollars to build a DWH from its daily transactional data. Presently Walmart operates one of the top five DWHs in the world in terms of data volume and computational power. The power of Walmart's data warehousing can be imagined from the fact that Walmart generates reports across the globe on parameters such as the following.
What were the sales of orange juice by brand, by store, by location and by season?
What was the profitability of the Nevada region compared to the LA area, in terms of market capitalization vs. sales?
How many additional dollars were spent on a gallon of milk on its way from farm to store?
What is the ratio of operating cost to capital cost for small, mid and big size stores?
The base data out of which the above metrics can be derived runs to hundreds of petabytes. Walmart derives these metrics for higher management to plan business strategy for the coming future.

1.1 EARLY HISTORY
To really understand business intelligence (BI) and data warehouses (DW), it is necessary to look at the evolution of business and technology. In the 1970s and 1980s, computer hardware was expensive and computer processing power was limited. A medium-sized business typically operated a handful of large mainframe-based application systems that were designed for operational, data entry purposes. Information needs were addressed through paper reports. Report programs, however, were expensive to write and generally inflexible. A computer programmer was required
to write the report programs. Fortunately, salaries for programmers were relatively low during this period. In the 1980s, relational databases became the rage. Data was stored in tables with rows and columns, not unlike the Excel spreadsheets of today. Although relational databases were much more intuitive for end users, complex logic was often needed to join multiple tables to obtain the information that was needed. Although it was possible for end users to write simple reports, the queries were often inefficient and had to be run after normal business hours, in order not to impact online transactions. In the late 1980s, many businesses migrated from mainframe computers to client servers. Business people were assigned personal computers. Office applications such as Microsoft Word, Excel and Access became popular. The personal computer empowered end users and allowed them to develop their own applications and present data in a manner that was meaningful to them, such as in grid or graph format. Excel spreadsheets could easily be tweaked as business needs changed, without assistance from the IT department. Unfortunately, corporate data remained centralized and was generally not directly accessible to end users.

1.2 RECENT HISTORY
The need for improved business intelligence and data warehousing accelerated in the 1990s. During this period, huge technological changes occurred and competition increased as a result of free trade agreements, globalization, computerization and networking. In the early 1990s, the Internet took the world by storm. Companies rushed to develop e-Business and e-Commerce applications with hopes of reducing their staffing needs and providing 24-hour service to customers. The volume of application systems mushroomed during this period as a parallel set of Internet applications was deployed. Back-end 'bridges' were built to try to integrate the 'self service' application systems with the legacy 'full service' applications.
Unfortunately, integration was often messy and corporate data remained fragmented or inconsistent. As the demand for programmers increased and salaries climbed, businesses looked for alternatives to custom built application systems. In hopes of reducing costs and remaining competitive, companies purchased software packages from third parties. These packages were designed for generic business requirements and often did not integrate well with the existing legacy systems.
By the end of the millennium, businesses discovered that the number of application systems and databases had multiplied, that their systems were poorly integrated and that their data was inconsistent across the systems. More importantly, businesses discovered that they had lots of fragmented data, but not the integrated information that was required for critical decision making in a rapidly changing, competitive, global economy. Companies began building data warehouses to consolidate data from disparate databases and to better support their strategic and tactical decision making needs.

1.3 DATA WAREHOUSE
1.3.1 What is a Data Warehouse?
A Data Warehouse (DW) is simply a consolidation of data from a variety of sources that is designed to support strategic and tactical decision making. Its main purpose is to provide a coherent picture of the business at a point in time. Using various data warehousing toolsets, users are able to run online queries and 'mine' their data. Inmon and Kimball, the pioneers of data warehousing, state that "A DWH is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes".

Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse there will be only a single way of identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction system, where often only the most recent data is kept. For example, a transaction system may hold
the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.

Non-Volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.

1.3.2 Need of data warehouse
1. Integrated, company-wide view of high-quality information (from disparate databases). Data in operational systems are often fragmented and inconsistent, distributed on a variety of incompatible hardware and software platforms, which creates problems in consolidating all the data:
o Inconsistent key structures
o Synonyms
o Free-form versus structured fields
o Inconsistent data values
o Missing data
2. Separation of operational and informational systems.
Operational systems run the business in real time, based on current data: sales order processing, reservation systems, patient registration.
Informational systems support decision making based on historical point-in-time and prediction data: sales trend analysis, customer segmentation, human resources planning.
1.3.3 Benefits of operational/informational systems separation
A data warehouse can logically centralize data scattered throughout separate operational systems
A well-designed data warehouse can improve data quality and consistency
A separate data warehouse greatly reduces contention for data resources between operational and informational systems[6]
Operational vs. informational systems:

Table 1.1 Operational vs. informational systems[6]

Characteristic   | Operational System                                           | Informational System
Primary purpose  | Run the business on a current basis                          | Support managerial decision making
Type of data     | Current representation of the state of the business          | Historical point-in-time (snapshots) and predictions
Primary users    | Clerks, salespersons, administrators                         | Managers, business analysts, customers
Scope of usage   | Narrow, planned and simple updates and queries               | Broad, ad-hoc, complex queries and analysis
Design goal      | Performance: throughput, availability                        | Ease of flexible access and use
Volume           | Many constant updates and queries on one or a few table rows | Periodic batch updates and queries requiring many or all rows
1.3.4 Advantages of Warehousing:
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse
Modify, summarize (store aggregates)
Add historical information
1.3.5 Warehouse is different from Database
A database is designed (and optimized) to record, while a data warehouse is designed to respond to analysis questions that are critical for your business. Application databases are OLTP (online transaction processing) systems where every transaction has to be recorded; consider, for example, a bank ATM that has disbursed cash to a customer but was
unable to record this event in the bank's records. A data warehouse, on the other hand, is a database designed to facilitate querying and analysis. Often designed as OLAP (online analytical processing) systems, these databases contain read-only data that can be queried and analyzed far more efficiently than in OLTP systems. OLTP systems are designed to be write-optimized, while OLAP systems are designed to be read-optimized. A database ensures that your business data is scalable, better documented and managed, and can answer questions efficiently and frequently, while a data warehouse leads to a direct increase in the quality of analysis, as the table structures are simpler, standardized and often denormalized. A database is a structured collection of records or data stored in a computer system, while a data warehouse is a repository of stored data designed to facilitate reporting and analysis. The size of a database is in the order of MBs to GBs of data, while a data warehouse holds GBs to TBs. A database has thousands of users, while a data warehouse has hundreds of users (e.g., decision-makers, analysts)[6].
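The write-optimized vs. read-optimized contrast described above can be sketched with a tiny in-memory example. This is only an illustration: SQLite stands in for a production warehouse DBMS such as Teradata, and all table, account and column names are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# OLTP-style table: write-optimized, one row recorded per transaction.
cur.execute("CREATE TABLE atm_txn (txn_id INTEGER PRIMARY KEY, "
            "account TEXT, amount REAL, txn_date TEXT)")
rows = [(1, "A-100", 500.0, "2011-01-05"),
        (2, "A-100", 200.0, "2011-02-10"),
        (3, "B-200", 900.0, "2011-02-11")]
cur.executemany("INSERT INTO atm_txn VALUES (?, ?, ?, ?)", rows)

# OLAP-style table: read-optimized, denormalized monthly summary kept in the warehouse.
cur.execute("""
    CREATE TABLE monthly_summary AS
    SELECT account, substr(txn_date, 1, 7) AS month, SUM(amount) AS total
    FROM atm_txn GROUP BY account, month
""")

# An analytical question answered from the summary by a simple scan, no joins.
cur.execute("SELECT total FROM monthly_summary "
            "WHERE account = 'A-100' AND month = '2011-02'")
print(cur.fetchone()[0])
```

The point of the sketch is the shape of the two tables, not the engine: the OLTP table accepts many small writes, while the summary table is built once and read many times.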
1.4 Basic architecture of data warehouse
1.4.1 Processes in a data warehouse
The various processes that occur in a data warehouse are [6]:
Extract and load: In this process the load manager extracts and loads the data, performing simple transformations before and during the load.
Cleaning and transformation: Before loading the data, transformation of the data is required. The purpose of the exercise is to blitz through all the transformations that do not require complex logic or the use of relational set operators.
Backup and archive: In this process the warehouse manager backs up and archives the data warehouse incrementally.
Query management: In this process the query manager directs queries to the appropriate tables and schedules the execution of user queries.
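The extract, transform and load steps listed above can be sketched end to end in a few lines. This is a minimal illustration, not the thesis's implementation; SQLite stands in for the warehouse DBMS and the source rows are invented.

```python
import sqlite3

# Hypothetical rows as extracted from an operational source system
# (field names and values are invented for illustration).
source_rows = [
    {"prod": " Orange Juice ", "store": "NV-01", "qty": "12"},
    {"prod": "Milk", "store": "NV-01", "qty": "7"},
]

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales (product TEXT, store TEXT, qty INTEGER)")

# Extract and load, with simple transformations applied during the load:
# trim stray whitespace and cast the quantity from text to integer.
for row in source_rows:
    cur.execute("INSERT INTO sales VALUES (?, ?, ?)",
                (row["prod"].strip(), row["store"], int(row["qty"])))

cur.execute("SELECT SUM(qty) FROM sales")
print(cur.fetchone()[0])
```

Trimming and casting are exactly the kind of "simple transformations" the load manager performs; complex, set-oriented logic is deferred to the warehouse manager.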
[Figure: three operational data sources feed the load manager; the warehouse manager maintains detailed information, summary information and metadata; the query manager serves data-dipper and OLAP tools, turning data into information for decisions.]
Fig. 1.1 Architecture of data warehouse[6]

1.4.2 Basic components of data warehouse
1.4.2.1 Load Manager: It is the system component that performs all the operations necessary to support the extract and load process. The size and complexity of the load manager will vary from one data warehouse solution to another, but, as an indication, the larger the degree of overlap between source systems, the larger the load manager will be. The effort to develop the load manager should be planned in the production phase. Although the entire system does not have to be built, a significant proportion of the functionality should be provided in the first phase[6].
[Figure: the load manager's controlling process drives stored procedures, copy management tools and a fast loader, moving data from file structures through a temporary data store into the warehouse structure.]
Fig. 1.2: Load Manager Architecture[6]
A load manager performs the following functions:
Extraction of data from the source systems
Fast-load of the extracted data into a temporary data store
Simple transformations into a structure similar to the one in the data warehouse
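A fast load of the kind listed above can be sketched as one bulk insert from a flat ASCII file into a temporary staging table, with all checks deferred until after the load. SQLite's `executemany` stands in for a vendor fast loader (such as Teradata's), and the file contents are invented for illustration.

```python
import sqlite3
import csv
import io

# A flat ASCII extract such as a fast loader would consume
# (contents invented for illustration).
flat_file = io.StringIO("cust_id,city\n1,Jaipur\n2,Delhi\n3,Jaipur\n")

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TEMP TABLE stg_customer (cust_id INTEGER, city TEXT)")

# Bulk-insert the whole file in one call, applying no checks or
# transformations on the way in, to keep the load window small.
reader = csv.reader(flat_file)
next(reader)  # skip the header row
cur.executemany("INSERT INTO stg_customer VALUES (?, ?)", reader)

cur.execute("SELECT COUNT(*) FROM stg_customer")
print(cur.fetchone()[0])
```

The design choice mirrors the text: loading raw rows first and validating afterwards is usually faster than interleaving checks with the load.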
1.4.2.1.1 Extract data from source
To be extracted, the source data has to be transferred from the source systems and stored in the data warehouse.

1.4.2.1.2 Fast Load
Data should be loaded into the warehouse in the fastest possible time in order to minimize the total load window. This becomes critical as the number of data sources increases and the time window shrinks. The speed at which data is pushed into the warehouse is affected by the kind of transformations taking place. It is more effective to load the data into a relational database first than to apply transformations and checks on the way in, because most database vendors have very performant facilities for fast-loading data from ASCII flat files.

1.4.2.1.3 Simple Transformations
Before loading the data, transformation of the data is required. The purpose of the exercise is to blitz through all the transformations that do not require complex logic or the use of relational set operators. Once this has been completed, we can tackle the complex logic. This transformation is
significant because complex data have to be simplified before being loaded into the data warehouse or data mart[6].

1.4.2.2 Warehouse Manager
It is the system component that performs all the operations necessary to support the warehouse management process. It is constructed using third-party systems management software, for example C programs and shell scripts.
The warehouse manager performs the following operations:
Analyze the data to perform consistency and referential integrity checks
Transform and merge the source data in the temporary data store into the published data warehouse
Create indexes, partitions and views against the base data
Generate denormalizations if appropriate
Generate any new aggregations that may be required
Update all existing aggregations
Back up, incrementally or completely, the data within the data warehouse
Archive data that has reached the end of its capture period
It also analyzes query profiles to determine which indexes and aggregations are appropriate.
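Several of the operations above (index maintenance around a bulk load, followed by an aggregation refresh) can be sketched together. This is an illustrative miniature using SQLite in place of a production warehouse DBMS; the table, index and store names are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE fact_sales (store TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_store ON fact_sales (store)")

# Warehouse-manager step: drop the index before a large daily fact load ...
cur.execute("DROP INDEX idx_store")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [("NV-01", 10.0)] * 1000 + [("LA-07", 25.0)] * 500)

# ... re-create it once after the load, then refresh the aggregation
# (summary) table derived from the base fact data.
cur.execute("CREATE INDEX idx_store ON fact_sales (store)")
cur.execute("CREATE TABLE agg_sales AS "
            "SELECT store, SUM(amount) AS total FROM fact_sales GROUP BY store")

cur.execute("SELECT total FROM agg_sales WHERE store = 'NV-01'")
print(cur.fetchone()[0])
```

Paying the index-build cost once after the load, rather than on every inserted row, is exactly the rationale given in section 1.4.2.2.4 below.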
[Figure: the warehouse manager's controlling process uses a temporary data store, stored procedures, SQL scripts and a backup/recovery tool to populate the starflake schema and summary tables.]

Fig. 1.3: Architecture of warehouse manager[6]
1.4.2.2.1 Using Temporary Destination Tables
While the data is present in temporary tables, users can perform simple conversions to start getting it into shape. The next phase is to create permanent destination tables, matching the temporary tables, in the data warehouse. After this phase, and before loading the data into the data warehouse, consistency checks are applied to the information. If these checks fail, the data remains in the temporary tables and is not loaded into the data warehouse. Because the temporary tables mirror the destination tables, it becomes a simple exercise for users to first transform data from the temporary to the destination tables and then load it into the data warehouse.

1.4.2.2.2 Complex Transformations
The populated set of temporary tables match the destination tables in the data warehouse. These tables have been populated with the source data, but still without any of the complex transformations that need to take place. The data in both sets of tables are compared before being stored in the data warehouse. The complex data is transformed first and then stored or loaded into the warehouse.

1.4.2.2.3 Transform into a Starflake Schema
Once the source data is in the temporary data store and has been cleaned up, the warehouse manager can transform it into a form suitable for decision support queries: a form in which the bulk of the actual data lies in the center, surrounded by the reference data. These variations are used in star, snowflake and starflake schemas.

1.4.2.2.4 Create Indexes and Views
The warehouse manager has to create indexes against the information in the fact or dimension tables. The overhead of inserting rows into a table and its indexes can be far higher, for a large number of rows, than the overhead of re-creating the indexes once the rows have been inserted. Therefore it is often more effective to drop all indexes against a table prior to inserting a large number of rows. Fact data, which tends to have a large amount of data inserted on a daily basis, is a prime candidate for dropping indexes before the data load. This method will be effective unless the partition being loaded already contains a substantial amount of data. Dimension data tends to change, or be added to, in far smaller volumes than fact data, so unless the dimension is being changed, retain the existing indexes on the dimension tables wholesale.
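The temporary-destination-table workflow described in section 1.4.2.2.1 (load into temporary tables, apply consistency checks, publish only the rows that pass, keep the failures behind for inspection) might look like this in miniature. SQLite and all names here are illustrative assumptions, not the thesis's implementation.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TEMP TABLE stg_fact (store TEXT, amount REAL)")
cur.execute("CREATE TABLE fact (store TEXT NOT NULL, amount REAL)")

# Raw rows land in the temporary table first; one row has a missing key.
cur.executemany("INSERT INTO stg_fact VALUES (?, ?)",
                [("NV-01", 10.0), (None, 99.0), ("LA-07", 25.0)])

# Consistency check at publish time: rows that fail stay behind in the
# temporary table instead of entering the warehouse.
cur.execute("INSERT INTO fact "
            "SELECT store, amount FROM stg_fact WHERE store IS NOT NULL")
cur.execute("DELETE FROM stg_fact WHERE store IS NOT NULL")

cur.execute("SELECT COUNT(*) FROM fact")
print(cur.fetchone()[0])       # published rows
cur.execute("SELECT COUNT(*) FROM stg_fact")
print(cur.fetchone()[0])       # rejected rows retained for inspection
```

Keeping rejects in the staging table, rather than silently dropping them, is what makes the failed load recoverable.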
1.4.2.2.5 Summary Generation
After the loading and staging of data in the data warehouse comes the creation of summaries used to answer queries. Summaries are generated by the warehouse manager automatically, without any intervention; SQL is used to generate them, and metadata can be used to drive the generation.

1.4.2.2.6 Using Query Statistics
The aim here is to maintain the responsiveness of the system even when the query profile changes over time. This can be achieved by gathering statistics on the queries being physically executed against the data warehouse and deducing the appropriate summaries to speed up common queries. Query statistics can be collected by the query manager as it intercepts any query hitting the database. This process ensures that only relevant summaries are retained by the system on an ongoing basis: new summaries are created to match the changing query profile, and summaries that speed up popular queries are maintained.

1.4.2.3 Query Manager
It is the system component that performs all the operations necessary to support the query management process. It uses tools like C programs, scripts, etc. In some cases the query manager also stores query profiles to allow the warehouse manager to determine which indexes and aggregations are appropriate. It performs the following operations:
Direct queries to appropriate tables
Schedule the execution of user queries
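A query manager that both records query statistics (section 1.4.2.2.6) and directs queries to the appropriate table can be sketched in plain Python. The profile names and the summary registry below are invented for illustration; a real query manager would intercept SQL, not profile strings.

```python
from collections import Counter

# Statistics gathered as queries are intercepted; the warehouse manager
# would read these to decide which summaries to build or retire.
query_stats = Counter()

# Summaries the warehouse manager currently maintains (hypothetical name).
summaries = {"sales_by_store"}

def route(query_profile: str) -> str:
    """Record the query profile, then direct the query to the cheapest table."""
    query_stats[query_profile] += 1
    if query_profile in summaries:
        return "agg_" + query_profile   # an existing summary answers it directly
    return "fact_detail"                # fall back to the detailed fact table

print(route("sales_by_store"))          # routed to the summary table
print(route("sales_by_product"))        # no summary exists; uses the detail table
print(query_stats.most_common())
```

If `sales_by_product` kept climbing in `query_stats`, the warehouse manager would be justified in creating a summary for it, which is exactly the feedback loop the text describes.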
1.4.2.4 Operational Data Sources
Operational data for the data warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems[6].
1.5 Classification of Data Warehouse Architecture
Data warehouses can be architected in many different ways:
Physical architectures:
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Three-Layer data architecture
Logical Data Mart and Real-Time Data warehouse
1.5.1 Generic Two-Level Architecture
This is basically for entry-level data warehouses. There are four basic steps in the generic two-level architecture. In the first phase, the data is extracted from various external and internal legacy systems. In the data staging area, data from the various legacy systems is integrated, updated and transformed, and then loaded into the data warehouse; the data is also made error-free in this stage. The data and metadata storage stage comprises both detailed and summarized data. Many end users access information from the data warehouse using a variety of query tools, shape it into summarized report format using reporting tools, and this formatted output can be stored back into the operational database or into the data warehouse itself. The ETL process, i.e., extraction, transformation and loading, is covered mostly in the first three phases: extracting information from the source data systems, the data staging area, and the data and metadata storage stage. Extraction and loading of data are periodic in nature, but the intermediate process, transformation, is done on a day-to-day basis.
Fig. 1.4: Generic two-level data warehousing architecture

1.5.2 Independent Data Mart Architecture
Data mart: a data warehouse that is limited in scope, supporting decision making for a particular business function or end-user group.
Independent: filled with data extracted directly from the operational environment
Dependent: filled with data derived from enterprise data warehouse
Why an independent data mart?
Firms faced with short-term, expedient business objectives
Lower cost and time commitment
Organizational and political difficulties in reaching a common view of enterprise data in a central data warehouse
Technology constraints
A separate ETL process runs for each independent data mart, and data access is complex due to the multiple data marts.
Fig. 1.5 Independent data mart data warehousing architecture
Limitations of independent data marts:
A separate ETL process is developed for each data mart
Data marts may not be consistent with one another
No capability to drill down into greater detail or into related facts in other data marts
Scaling costs are high; each new application creates a new data mart and repeats the ETL steps
1.5.3 Dependent Data Mart and Operational Data Store
A dependent data mart is basically a combination of an EDW and reconciled data. The EDW is a central, integrated data warehouse that makes information available to end users for decision support applications. The dependent data mart and operational data store architecture is called a "hub and spoke" approach, where the EDW acts as the hub and the source data systems and data marts are the input and output spokes. An operational data store (ODS) is an integrated, subject-oriented, updatable, current-valued, detailed database designed to serve operational users as they do decision support processing. An
ODS acts simply as a relational database and is distinct from an organization's individual ERP (Enterprise Resource Planning) database. This type of architecture is also known as a corporate information factory, or CIF. The ODS provides an option for obtaining current data.
Fig. 1.6 Dependent Data Mart and Operational Data Store

1.5.4 Three-Layer Data Architecture
It has three layers:
Source data systems (operational databases)
Operational data store
Data and metadata storage (enterprise data warehouse and dependent data mart)
[Figure: three stacked layers, each with its own metadata: derived data held in the data marts, reconciled data held in the EDW and ODS, and operational data held in the operational systems.]

Fig. 1.7 Three-Layer Data Architecture[6]
1.5.5 Logical data mart and real-time data warehouse architecture
A real-time data warehouse accepts a near-real-time feed of transactional data from operational systems, analyzes warehouse data, and in near real time relays business rules to the data warehouse and operational systems so that immediate action can be taken. Characteristics:
Logical data marts are not physically separate databases, but different relational views of a data warehouse
Data are moved into the data warehouse rather than to a separate staging area
New data marts can be created quickly
Data marts are always up to date
The ODS and the warehouse are one and the same
Fig. 1.8 Logical data mart and real-time data warehouse architecture
1.6 ETL VS. ELT
1.6.1 ETL – Extract, Transform and Load
The following diagram illustrates each of the individual stages in the process: Extract, Staging Database, Transform, Load, Data Warehouse.
Fig. 1.9 ETL Architecture[5]
Data is "extracted" from the data sources (line-of-business applications) using a data extraction tool via whatever data connectivity is available. It is then transformed using a series of transformation routines. This transformation process is largely dictated by the data format of the output. Data quality and integrity checking is performed as part of the transformation process, and corrective actions are built into the process. Transformations and integrity checking are performed in the data staging area. Finally, once the data is in the target format, it is loaded into the data warehouse ready for presentation.
The process is often designed from the end backwards, in that the required output is designed first. In so doing, this informs exactly what data is required from the source. The routines designed and developed to implement the process are written specifically for the purpose of achieving the desired output, and only the data required for the output is included in the extraction process. In addition, the output design must incorporate all facts and dimensions required to present both the aggregation levels required by the BI solution and any possible future requirements. Business rules that define how aggregations are achieved, and the relationships between the various entities in both the source and target, are designed and therefore coded into the routines that implement the ETL process. This approach leads to tight dependencies in the routines at each stage of the process. In spite of the fact that there are effectively three stages, the design is often characterized as a monolithic process since the target is predefined and clear. In addition to the tools provided by the database manufacturer, there are a prolific number of tools available on the market that implement this approach and provide a solid platform, productivity, and flexibility[5].
1.6.1.1 Strengths
Development Time: Designing from the output backwards ensures that only data relevant to the solution is extracted and processed, potentially reducing development, extract, and processing overhead, and therefore time.
Targeted Data: Due to the targeted nature of the load process, the warehouse contains only data relevant to the presentation.
Administration Overhead: Reduced warehouse content simplifies the security regime implemented and hence the administration overhead.
Tools Availability: The prolific number of tools available that implement ETL provides flexibility of approach and the opportunity to identify the most appropriate tool. The proliferation of tools has led to a competitive functionality war, which often results in loss of maintainability[5].
1.6.1.2 Weaknesses [5]
Flexibility: Targeting only relevant data for output means that any future requirements that need data not included in the original design will have to be added to the ETL routines. Due to the nature of the tight dependency between the routines developed, this often leads to a need for fundamental re-design and development, which increases the time and costs involved.
Hardware: Most third-party tools utilize their own engine to implement the ETL process. Regardless of the size of the solution, this can necessitate investment in additional hardware to run the tool's ETL engine.
Skills Investment: The use of third-party tools to implement ETL processes compels the learning of new scripting languages and processes.
Learning Curve: Implementing a third-party tool that uses foreign processes and languages results in the learning curve implicit in all technologies new to an organization, and can often lead to following blind alleys in their use due to lack of experience[5].
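The extract-transform-load sequence described in section 1.6.1 can be sketched as a minimal pipeline. The source rows, field names, and cleaning rule below are hypothetical, chosen only to show where each stage sits; a real ETL tool would be far more elaborate.

```python
# Minimal ETL sketch: extract raw rows, transform them into the target
# format (with integrity checks and corrective actions), then load only
# the reshaped rows into the "warehouse". All data is illustrative.

def extract(source_rows):
    # Extract: pull raw rows from the line-of-business source.
    return list(source_rows)

def transform(staging):
    # Transform: enforce the target format in the staging area; rows
    # failing the data-quality check are dropped as a corrective action.
    cleaned = []
    for row in staging:
        if row.get("amount") is None:   # integrity check
            continue                    # corrective action: reject row
        cleaned.append({"customer": row["name"].strip().title(),
                        "amount": round(float(row["amount"]), 2)})
    return cleaned

def load(warehouse, rows):
    # Load: append only target-format rows into the warehouse table.
    warehouse.extend(rows)
    return warehouse

source = [{"name": " alice ", "amount": "10.503"},
          {"name": "bob", "amount": None}]
warehouse = load([], transform(extract(source)))
print(warehouse)  # only the valid, reshaped row survives
```

Note how the transform step encodes the output design: only fields required by the target ever leave staging, which is exactly the tight coupling the weaknesses above describe.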
1.6.2 ELT – Extract, Load and Transform
Whilst this approach to the implementation of a warehouse appears on the surface to be similar to ETL, it differs in a number of significant ways. The following diagram illustrates the process.
Fig. 1.10 ELT Architecture[5]
Data is "extracted" from the data sources (line-of-business applications) into the staging database using a data extraction tool via whatever data connectivity is available. Whilst in the staging area, integrity and business rule checks can be applied and relevant corrections made. The source data is then loaded into the warehouse. In effect this provides a validated and cleaned offline copy of the source data in the data warehouse. Once in the warehouse, transformations are performed to re-shape the data into its target output format.
The extract and load process can be isolated from the transformation process, which has a number of benefits. Isolating the load process from the transformation process removes an inherent dependency between these stages. In addition to including the data necessary for the transformations, the extract and load process can include elements of data that may be required in the future. Indeed, the load process could take the entire source and load it into the warehouse. Separating the processes enables the project to be broken down into smaller chunks, thus making it more predictable and manageable. Performing the data integrity checks in the staging area enables a further stage in the process to be isolated and dealt with at the most appropriate point in the process. This approach also helps to ensure that only cleaned and checked data is loaded into the warehouse for transformation. Isolating the transformations from the load process helps to encourage a more staged approach to the warehouse design and implementation, which embraces the ongoing changing nature of the warehouse build. Tools are available that inherently implement the ELT process; however, their availability is sparser since this is a more emergent approach to warehouse process design[5].
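The key ELT distinction, loading the full source unshaped and letting the database engine do the transformation, can be sketched with an in-memory SQLite database. Table and column names here are illustrative.

```python
import sqlite3

# Minimal ELT sketch: the full source extract is loaded unshaped into
# the warehouse database, and the reshaping is performed afterwards by
# the engine itself with SQL. Names and data are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_sales_raw (region TEXT, amount REAL)")

# Extract + Load: everything is copied, including data not yet needed,
# which is what makes future requirements easy to accommodate.
rows = [("north", 10.0), ("north", 5.0), ("south", 7.5)]
con.executemany("INSERT INTO src_sales_raw VALUES (?, ?)", rows)

# Transform: done inside the database engine, isolated from the load.
con.execute("""CREATE TABLE fact_sales AS
               SELECT region, SUM(amount) AS total
               FROM src_sales_raw GROUP BY region""")
print(con.execute("SELECT * FROM fact_sales ORDER BY region").fetchall())
# [('north', 15.0), ('south', 7.5)]
```

Because the raw copy stays in the warehouse, a new transformation can later be added as just another SQL statement, without touching the extract and load step.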
1.6.2.1 Strengths
Project Management: Being able to split the warehouse process into specific and isolated tasks enables a project to be designed on a smaller task basis, so the project can be broken down into manageable chunks.
Flexible & Future Proof: In general, in an ELT implementation all data from the sources is loaded into the warehouse as part of the extract and load process. This, combined with the isolation of the transformation process, means that future requirements can easily be incorporated into the warehouse structure[5].
Risk Minimization: Removing the close interdependencies between each stage of the warehouse build process enables the development process to be isolated, and the individual process designs can thus also be isolated. This provides an excellent platform for change, maintenance and management.
Utilize Existing Hardware: In implementing ELT as a warehouse build process, the inherent tools provided with the database engine can be used. Alternatively, the vast majority of the third-party ELT tools available employ the database engine's capability, and hence the ELT process runs on the same hardware as the database engine underpinning the data warehouse, using the existing hardware deployed.
Utilize Existing Skill Sets: By using the functionality provided by the database engine, the existing investments in database skills are re-used to develop the warehouse. No new skills need be learned, and the full weight of the experience in developing the engine's technology is utilized, further reducing the cost and risk in the development process[5].
1.6.2.2 Weaknesses [5]
Against the Norm: ELT is an emergent approach to data warehouse design and development. Whilst it has proven itself many times over through its abundant use in implementations throughout the world, it does require a change in mentality and design approach compared with traditional methods. Though some principles remain true to all approaches, ELT is different in many ways, and getting the best from an ELT approach requires an open mind.
Tools Availability: Being an emergent technology approach, ELT suffers from a limited availability of tools[5].
1.6.3 Staging
Due to varying business cycles, data processing cycles, hardware and network resource limitations, and geographical factors, it is not feasible to extract all the data from all operational databases at exactly the same time. For example, it might be reasonable to extract sales data on a daily basis; however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process. Similarly, it might be feasible to extract "customer" data from a database in Singapore at noon eastern standard time, but this would not be feasible for "customer" data in a Chicago database.
Data in the data warehouse can be either persistent (i.e. remains around for a long period) or transient (i.e. only remains around temporarily). Thus the data warehouse staging area is a temporary location where data from source systems is copied. A staging area is mainly required in a data warehousing architecture for timing reasons: in short, all required data must be available before data can be integrated into the data warehouse. Note: not every business requires a data warehouse staging area; for many businesses it is feasible to use ETL to copy data directly[5].
1.7 OLTP
Online transaction processing refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. The term is somewhat ambiguous; some understand a "transaction" in the context of computer or database transactions, while others (such as the Transaction Processing Performance Council) define it in terms of business or commercial transactions. OLTP has also been used to refer to processing in which the system responds immediately to user requests. An automatic teller machine (ATM) for a bank is an example of a commercial transaction processing application. For example, when we purchase a railway ticket from either a counter or online, there will be an entry in some database, and that entry will go into some collection of tables, usually in Oracle, SQL Server, a mainframe, etc., or a DB that works on binary file logic or closest to the operating layer. So, whenever a transaction happens it makes an entry in the system (which is why it is called OLTP), and the end user can retrieve the corresponding entry at any given point of time.
1.8 OLAP
Online analytical processing is an approach to swiftly answer multi-dimensional analytical queries. OLAP is part of the broader category of business intelligence, which also encompasses relational reporting and data mining. OLAP is the use of a set of graphical tools that provide users with multidimensional views of their data and allow them to analyze the data using simple windowing techniques. The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table, and dimensions are derived from the dimension tables.
1.8.1 Classification of OLAP
OLAP systems have been traditionally categorized using the following taxonomy:
MOLAP: This is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in an optimized multi-dimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube, an operation known as processing.
ROLAP: This works directly with relational databases. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information; it depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
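The point that each ROLAP slice or dice is just an added WHERE predicate can be sketched against a small relational fact table; the schema, data, and the `slice_cube` helper below are illustrative, not part of any real ROLAP engine.

```python
import sqlite3

# ROLAP sketch: the "cube" is just a relational fact table, and each
# slice/dice operation becomes a WHERE clause on the SQL sent to the
# engine. Schema and data are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (year INT, region TEXT, sales REAL)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                [(2010, "north", 10), (2010, "south", 20),
                 (2011, "north", 30)])

def slice_cube(con, **dims):
    # Each dimension constraint is appended as a WHERE predicate.
    where = " AND ".join(f"{d} = ?" for d in dims)
    sql = "SELECT SUM(sales) FROM fact" + (f" WHERE {where}" if dims else "")
    return con.execute(sql, tuple(dims.values())).fetchone()[0]

print(slice_cube(con))                             # whole cube: 60.0
print(slice_cube(con, year=2010))                  # slice on year: 30.0
print(slice_cube(con, year=2010, region="north"))  # dice: 10.0
```

No aggregates are stored here; every answer is computed at query time, which is why ROLAP performance depends so heavily on the underlying database.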
HOLAP: There is no clear agreement across the industry as to what constitutes "Hybrid OLAP", except that a database will divide data between relational and specialized storage. For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data, and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data.
1.9 Comparison
Each type has certain benefits, although there is disagreement about the specifics of the benefits between providers[6]. Some MOLAP implementations are prone to database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: a high number of dimensions, pre-calculated results, and sparse multidimensional data. MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques[6].
ROLAP is generally more scalable[14]. However, large-volume pre-processing is difficult to implement efficiently, so it is frequently skipped; ROLAP query performance can therefore suffer tremendously. Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use. HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It can generally pre-process swiftly, scale well, and offer good function support[6].
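The MOLAP "processing" step, pre-computing aggregates for every combination of dimensions so that queries become lookups, can be sketched with plain dictionaries. The facts and dimensions below are illustrative; note how the number of stored keys grows with each added dimension, which is the root of the database-explosion phenomenon described above.

```python
from itertools import combinations

# MOLAP sketch: aggregates over every combination of dimensions are
# pre-computed ("processing") and stored, so queries become lookups.
# Facts and dimensions are illustrative.
facts = [{"year": 2010, "region": "north", "sales": 10},
         {"year": 2010, "region": "south", "sales": 20},
         {"year": 2011, "region": "north", "sales": 30}]
dims = ("year", "region")

cube = {}
for r in range(len(dims) + 1):
    for subset in combinations(dims, r):      # every level of the cube
        for f in facts:
            key = (subset, tuple(f[d] for d in subset))
            cube[key] = cube.get(key, 0) + f["sales"]

# A query is now a dictionary lookup instead of a table scan:
print(cube[(("year",), (2010,))])             # sales for 2010: 30
print(cube[((), ())])                         # grand total: 60
```

With many sparse dimensions, most pre-computed cells hold data for very few facts, yet each still occupies storage; that trade of space for query speed is the MOLAP bargain.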
CHAPTER – 2 Data warehouse design and implementation issues
Data warehouse design and implementation issues[7]
1. Tool integration
2. Extraction
3. Scalability
4. Database design
5. Domain issues
6. Metadata management
7. Performance and monitoring
8. Hyper-growth issues
2.1 Tool integration. There are two kinds of data warehouse vendors:
full-solution (e.g. IBM) providing multiple components of the data warehouse architecture (integrated for the customer)
component (e.g. Sequent) providing specific components of the data warehouse architecture
The biggest problem with both kinds of vendors is that tools may not work together, despite vendors' claims that their products are compatible[7].
2.2 Extraction
In data extraction, routines read the source data, convert it to an intermediate schema, and dump it to a staging area.
Once the source data is gathered into the staging area, several utilities are used to clean the data, transform it to an intermediate schema, aggregate it, load it into fact tables, and finally refresh it.
Two typical and common problems at this stage are:
The data extraction tools don't meet the complex needs of the company, resulting in the need to write customized code for extraction.
The quantity of the data to be extracted adds further complications if, for example, the database, stored in a very high-speed database manager/database communications applications (e.g. IMS), has to be loaded to a Unix-based data warehouse server.
2.3. Scalability (a critical success factor)
As the number of users and applications and the size of the warehouse increase, the system should be able to meet the extra load by adding computing resources as needed incrementally without having to change the applications and without the users noticing any degradation in performance.
The system should be able to deliver scalable performance.
This issue is also relevant in creating data marts. However, pulling data marts together to create a coordinated enterprise-wide data warehouse can be a painful experience.
Reasons include ignoring the creation of an enterprise-wide infrastructure for data marts, excessive data redundancy due to data marts, data marts running on different servers, and very large quantities of data (in terabytes) spread over several data marts.
2.4. Database design
Entity-relationship diagrams, their translation to relations, and the associated normalization techniques common in transaction processing are not useful for data warehousing
In data warehousing projects in which data-query efficiency is important, either a star schema or a snowflake schema is typically used
A data warehouse model aligns with the business structure rather than with the data model in a typical application
As most database designers are experienced enterprise modelers, a common problem they face is how to convert the enterprise schema (usually normalized) to a data warehouse schema
2.5. Domain issues
The objective is for the warehouse to work synergistically with decision-support tools to meet the decision-support needs of a wide cross section of users within an organization.
Therefore, a warehouse needs to be able to address questions covering a wide domain area.
For example, in the health care arena, the warehouse may provide information to users in areas ranging from research (by, say, facilitating measurement of the effectiveness of treatment, in which relationships between causality and treatment protocols for systemic diseases and conditions are not established) to management (by, say, creating breakdowns of cost and charge information or forecasting demand to manage resources).
2.6. Metadata management
Metadata is data about data, or an abstraction of data that is instrumental in defining raw data.
For example, the definition of data tells the applications using the data that a given stream of bits is a customer's address.
The need for metadata is acute in data warehouse projects, as different branches of an organization may be using different ways of defining, say, an employee number.
Unless these differences are reconciled, the chances of success for any application running on the warehouse are quite low.
Today, three kinds of metadata are being used:
– Operations-level metadata defines the structure of data in operational databases
– Warehouse-level metadata defines the way the transformed data is interpreted
– Business-level metadata maps the warehouse metadata to business concepts[13]
2.7. Performance and monitoring
Performance depends on various factors, including the speed of the server; online transaction processing (OLTP) issues, such as locks, latches, and tables accessed; and online analytical processing (OLAP) issues, such as monitoring runaway queries, managing free space in the
database by archiving infrequently accessed columns, and adjusting the extraction process by spotting specific areas in the database that are queried only at certain times during the month.
Loading, cleaning, auditing, and analyzing also have to be optimized for high performance.
2.8. Hyper-growth issues
The work really begins once the data warehouse is up and running.
User demand for more capabilities typically creates problems in maintaining and managing corporate warehouses.
Such demand is due to difficulties in understanding and using supposedly intuitive decision-support tools bought along with the data warehouse software.
This demand has to be handled on top of the daily and weekly processing, disaster recovery planning, security scares, and other responsibilities involved in warehouse administration.
The need to satisfy user information requirements can also lead to scalability issues within the warehouse[7].
CHAPTER-3 Meta Data
Meta Data: Most individuals with some level of involvement with data warehouses, from a technical or business perspective, know of the term "meta data". When asked "what is meta data?", most of these individuals can reiterate the common definition, "data about data". Still, some have an even deeper understanding of this subject and the various categories of meta data. These categories often break into two areas: technical and business[8].
3.1 Technical Metadata
Technical metadata includes load statistics such as the date a table was last updated, known data quality problems, performance statistics, etc. Technical metadata primarily serves the data warehouse support staff. Examples include:
Physical table and column names
Data mapping and transformation logic
Source systems
Foreign keys and indexes
Security
ETL process names
3.2 Business Metadata [8]
Business metadata includes descriptive information about the warehouse, such as full names and synonyms for warehouse attributes and tables, business definitions, and the calculations used for derived attributes. Server-based business metadata allows this information to be accessed with any data access tool. Examples include[8]:
Business terms and definitions for tables and columns
Subject area names
Query and report definitions
Report mappings
Data Stewards
etc.
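A server-based business metadata repository of the kind listed above can be sketched as a simple mapping from physical warehouse names to business terms. All table names, terms, and definitions below are hypothetical examples.

```python
# Sketch of server-based business metadata: a repository mapping
# physical warehouse names to business terms, definitions, synonyms,
# and derivations, so any data access tool can resolve them.
# All names and definitions are illustrative.
business_metadata = {
    "CUST_DIM.CUST_NM": {
        "business_term": "Customer Name",
        "definition": "Full legal name of the customer",
        "synonyms": ["Client Name", "Account Holder"],
    },
    "SALES_FACT.NET_AMT": {
        "business_term": "Net Sales Amount",
        "definition": "Gross sales minus returns and discounts",
        "derivation": "GROSS_AMT - RETURN_AMT - DISCOUNT_AMT",
    },
}

def describe(physical_name):
    # Any access tool can resolve a physical column to its meaning.
    meta = business_metadata.get(physical_name, {})
    return meta.get("business_term", "undocumented")

print(describe("SALES_FACT.NET_AMT"))  # Net Sales Amount
```

In practice this mapping lives in the metadata repository on the server, so every reporting tool shows the same business terms for the same physical columns.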
3.3 Operational Meta Data
Even with a standardized meta data repository, a data warehouse architect can still add a further level of detail to their warehouse data model and ETL process designs through the incorporation of operational meta data. This technique can be used as an extension of the design and architecture of the data warehouse that provides processing optimizations in data acquisition design, maintenance activities, and end-user reconciliation and auditing of information[9].
Operational meta data provides an additional bridge between the meta data repository and the data warehouse through the addition of physical columns into the design for ease of use by both business and technical users. Operational meta data, unlike information stored in the meta data repository, is referenced at a row level of granularity in the data warehouse. It provides a detailed row-level explanation of actual information content in the data warehouse. This direct association with each row of information in the warehouse is operational meta data's key design distinction[8].
3.3.1 Incorporation of Operational Meta Data
There is a variety of operational meta data columns that can be utilized in the design of the data warehouse to enhance automated support. Depending on the business requirements of the warehouse project, inclusion of certain columns may make more or less sense. Incorporation of these fields would normally occur during the transformation of the business logical data model into the dimensional or data warehouse data model. Some of the more prevalent columns used are listed below.
Load Cycle Identifier
Current Flag Indicator
Load Date
Update Date
Operational System(s) Identifier
Active in Operational System Flag
Confidence Level Indicator
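A dimension table carrying the operational metadata columns listed above, at row-level granularity, might look like the following sketch; the table name, column names, and sample row are illustrative only.

```python
import sqlite3

# Sketch of a dimension table carrying operational metadata columns
# alongside the business columns. Names and data are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE customer_dim (
    customer_key     INTEGER PRIMARY KEY,
    customer_name    TEXT,
    load_cycle_id    INTEGER,  -- sequential id of the load run
    current_flag     TEXT,     -- 'Y' on the latest version of a row
    load_date        TEXT,     -- when the row entered the warehouse
    update_date      TEXT,     -- when the row was last refreshed
    source_system_id TEXT,     -- originating operational system(s)
    active_in_source TEXT,     -- 'Y' if the production key is still live
    confidence_level INTEGER   -- ranking of ETL business-rule conformance
)""")
con.execute("""INSERT INTO customer_dim VALUES
    (1, 'Alice', 42, 'Y', '2011-09-01', '2011-09-01', 'CRM', 'Y', 3)""")
print(con.execute("SELECT current_flag, load_cycle_id "
                  "FROM customer_dim").fetchone())  # ('Y', 42)
```

Because these columns sit physically next to each data row, both administrators and end users can query them directly, with no round trip to a separate metadata repository.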
Load Cycle Identifier:
One of the operational meta data columns a data warehouse architect can incorporate is the load cycle identifier. This column is a sequential identifier assigned during each load cycle to the data warehouse, regardless of the refresh frequency. As a sequential identifier, it can be used to remove data from a particular load cycle run if data corruption or other data quality issues are discovered. The load cycle identifier is typically used in conjunction with a lookup or meta data repository table that describes other operational statistics about the load cycle[2].
Current Flag Indicator:
The current flag indicator column is used to identify the latest version of a row in a table. It facilitates quick identification of the latest version of a row as compared to performing date comparisons. This flag is especially useful for managing the processing of slowly changing dimensions (SCD) type 2, where the history of a production record is maintained.
Load Date:
The most commonly used and understood operational meta data field in data warehouse designs is the load date column. It denotes when (date and/or time) a row of information was loaded into the data warehouse, or in some business cases when the data was extracted from the operational source system. This snapshot date is used to maintain temporal integrity of the data in the warehouse as new information is added during each refresh cycle. The column can be referenced by warehouse administrators to identify candidate rows for archival or purge processing. End users can also use this column to reconcile and audit information in the data warehouse against the operational source systems.
Update Date:
Another common operational meta data column is the update date. This column indicates when a row was last updated in the warehouse during a refresh cycle. This column, like load date, is used to maintain the temporal integrity of information in the data warehouse. It is routinely used in dimension table designs that implement SCD type 1, 2 or 3 processing methods to identify when the row was refreshed. The column, like load date, can be utilized in administration activities such as archival/purge processing or reconciliation/audit by end users.
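The interplay of the current flag, load date, and update date in SCD type 2 processing can be sketched as follows; the customer key, city attribute, and dates are hypothetical.

```python
from datetime import date

# SCD type 2 sketch: a changed source row retires the current version
# (current_flag 'N', update_date stamped) and inserts a new current
# version with its own load_date. Keys and values are illustrative.
dim = [{"cust_id": 7, "city": "Jaipur", "current_flag": "Y",
        "load_date": "2011-08-01", "update_date": None}]

def apply_scd2(dim, cust_id, new_city, today):
    for row in dim:
        if row["cust_id"] == cust_id and row["current_flag"] == "Y":
            if row["city"] == new_city:
                return dim                 # no change, nothing to do
            row["current_flag"] = "N"      # retire the old version
            row["update_date"] = today
    dim.append({"cust_id": cust_id, "city": new_city,
                "current_flag": "Y", "load_date": today,
                "update_date": None})
    return dim

apply_scd2(dim, 7, "Mumbai", str(date(2011, 9, 1)))
print([(r["city"], r["current_flag"]) for r in dim])
# [('Jaipur', 'N'), ('Mumbai', 'Y')]
```

History is preserved (the Jaipur row remains, flagged 'N'), while the current flag spares every query from comparing load dates to find the latest version.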
Operational System(s) Identifier:
One of the most useful operational meta data fields for both the warehouse administrator and the end user is the operational system(s) identifier. This column is used to track the originating source or sources of a data row in the data warehouse. For example, in cases where a row of data was integrated from more than one operational source system, a column value denoting the combination of these systems can be assigned (see table example below). It can be used by end users who question the quality and/or validity of data in the warehouse to trace information back to the operational source system that furnished it. In certain cases, this column can be used by administrators to identify and remove corrupt data loaded from a particular operational source system(s).
Active Operational System Flag:
This flag is used to indicate whether the production keys in a dimension table are still active in the originating operational system. The active operational system flag provides intriguing analysis alternatives for queries posed to the data warehouse. This column can be used effectively in a variety of analysis activities to identify dormant data or data that should be constrained in reporting (e.g., out of bounds, products no longer supported, previous customers).
Confidence Level Indicator:
One of the more controversial operational meta data fields is the confidence level indicator. This column is used to indicate how business rules or assumptions were applied during the ETL processes for a particular row of data. It provides a measure for the end user of the credibility of a data row based on the transformation processing performed. Often used to identify potential problems with data quality from operational source systems and to facilitate correcting these issues, each warehouse organization and/or project will vary in how the business requires ranking of this particular operational meta data field. The following table illustrates one approach to ranking.
Operational Meta Data Column Assignment:
With the importance of operational meta data now realized, the architect can go forward incorporating these columns into the design of the data warehouse data model and the ETL processes. The values assigned to these columns are determined through various means. Some examples of the methods used for assignment of the various operational meta data columns include:
Load Cycle Identifier: The value for this column is assigned through insertion of a new row in a meta data table or lookup table added to the data model to capture operational statistics. The primary key on the table consists of a serial or sequence data type. The value is incremented during each refresh cycle to the data warehouse regardless of the completion status.
Current Flag Indicator: The value of this column is assigned through a comparison of data currently in the data warehouse versus data in a new load cycle. The two data files are sorted and then consecutively read to determine the current record. The current record has a “Y” value assigned to it while historical records are assigned an “N”.
Load Date: The value of this column is determined at the beginning of the load cycle. If a Load Cycle table is available in the meta data repository or has been added to the model, the value for this column can be extracted for this source. In some cases the value assigned may be the extraction date from the operational system versus the actual date the information was loaded into the data warehouse.
Update Date: The value of this column is determined, like load date, at the beginning of the load cycle but is applied only to historical rows.
Operational System Identifier: The value of this column is assigned based on which operational system provided the information to the data warehouse. If two or more sources are integrated to construct a single row in a data warehouse table, the value assigned should represent the multiple operational systems.
Active Operational System Flag: The value of this column is determined by comparing the natural or production keys of the operational system to data rows in the data warehouse. A value of “Y” is assigned to a record if it currently exists in the operational system.
Confidence Level Indicator: The value of this column is assigned during ETL processing based on conformance to the business rules associated with the data warehouse target table.
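Two of the assignment methods above, incrementing the load cycle identifier by inserting a row into a load-cycle table, and reassigning the current flag by comparing production keys, can be sketched together; table and column names are illustrative.

```python
import sqlite3

# Sketch of column assignment during a load: a load_cycle row is
# inserted to obtain a new sequential identifier, and the current flag
# is reassigned by production-key comparison. Names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE load_cycle (cycle_id INTEGER PRIMARY KEY, "
            "run_date TEXT)")
con.execute("CREATE TABLE dim (prod_key INT, val TEXT, "
            "current_flag TEXT, load_cycle_id INT)")

def run_load(con, run_date, incoming):
    # Load Cycle Identifier: incremented once per refresh cycle.
    cur = con.execute("INSERT INTO load_cycle (run_date) VALUES (?)",
                      (run_date,))
    cycle_id = cur.lastrowid
    for prod_key, val in incoming:
        # Current Flag Indicator: retire the previous current row.
        con.execute("UPDATE dim SET current_flag = 'N' "
                    "WHERE prod_key = ? AND current_flag = 'Y'",
                    (prod_key,))
        con.execute("INSERT INTO dim VALUES (?, ?, 'Y', ?)",
                    (prod_key, val, cycle_id))
    return cycle_id

run_load(con, "2011-08-01", [(1, "old")])
run_load(con, "2011-09-01", [(1, "new")])
print(con.execute("SELECT val, current_flag, load_cycle_id FROM dim "
                  "ORDER BY load_cycle_id").fetchall())
# [('old', 'N', 1), ('new', 'Y', 2)]
```

Because every warehouse row records its load cycle, a suspect cycle can later be rolled back simply by deleting rows with that `load_cycle_id`.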
If business rule validation indicates that transformation occurred correctly, with no unknown or missing information, the confidence level can be set high. If derived or estimated data is inserted during ETL processing, the confidence level may need to be set lower[9].
3.3.2 Operational Use Strategies
There is a variety of processing optimizations in data acquisition design, maintenance activities, and end-user reconciliation and auditing of information that can now be performed against the data warehouse through the incorporation of operational meta data into the architecture. Some examples of the tasks that can now be performed include[9]:
Extraction of current data
Identification of rows for archive/purge processing
Load cycle rollback
Processing of slowly changing dimensions type 2 records
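Each of these tasks is discussed in turn below. As a first sketch, the current-flag constraint behind extraction of current data can be contrasted with the brute-force scan of load dates. Table and column names here are hypothetical, and SQLite stands in for the warehouse RDBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_id TEXT, name TEXT, load_date TEXT, current_flag TEXT)""")
conn.executemany("INSERT INTO customer_dim VALUES (?,?,?,?)", [
    ("C100", "Acme Ltd",  "2011-07-01", "N"),
    ("C100", "Acme Corp", "2011-08-01", "N"),
    ("C100", "Acme Inc",  "2011-09-01", "Y"),  # most recent load cycle
])

# Brute-force method: scan the history to find the latest load date per key.
brute = conn.execute("""
    SELECT name FROM customer_dim
    WHERE load_date = (SELECT MAX(load_date) FROM customer_dim
                       WHERE customer_id = 'C100')
""").fetchone()

# Current-flag method: a simple equality constraint, no scan of the history.
flagged = conn.execute(
    "SELECT name FROM customer_dim WHERE customer_id = 'C100' AND current_flag = 'Y'"
).fetchone()

print(brute, flagged)  # both return ('Acme Inc',)
```

A reporting view such as `CREATE VIEW customer_current AS SELECT * FROM customer_dim WHERE current_flag = 'Y'` can then hide the constraint from end users entirely.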
These are not the only tasks that can take advantage of the operational meta data columns, but they demonstrate an initial starting set.

Extraction of Current Data: The typical data warehouse data model requires the load date column in order to maintain a unique perspective of history for a row of data in a table. While referential integrity is maintained through this column, it causes a high degree of processing overhead when a user attempts to determine which row in a dimension table is the most current: the RDBMS must sequentially read each row of data in the table to find the row with the most recent load date. An alternative to this brute-force method is the current flag indicator column. Through ETL processing and comparison of natural or production keys, this column lets business users quickly identify which row of information from the operational systems is the most current. The last row added to a data warehouse table for a particular production key is assigned a current indicator flag of “Y”, while historical records are set to “N” or null. Business users can use this column as a query constraint to retrieve only the most up-to-date information. For certain reporting requirements, RDBMS views can be established that
are constructed to constrain automatically on the current indicator column (value “Y”), avoiding errors in user queries.

Load Cycle Rollback: Suppose a monthly load cycle runs against the data warehouse. During ETL processing of the load cycle, an error is detected in the database, data from an operational source system is suspected of being corrupt, or some other data quality issue is discovered in the data. Before the incorporation of operational meta data columns, the data warehouse administrator had only limited methods of isolating and removing this corrupt or suspect information. Now the administrator can be more selective in removing the erroneous data from the database. The first rollback method is to use the load cycle indicator to completely remove the last refresh cycle from the data warehouse. This can be accomplished simply by constraining on the load cycle value in question and removing the affected rows from each table in the data warehouse. In certain circumstances, this method can be further tailored to remove rows from a particular load cycle for a specific operational system by additionally constraining on the operational system indicator in question. With either method, once the suspect rows are removed, the rollback process must re-assign the current indicator for rows that were flagged “Y” prior to the last load cycle.

Archiving and Purging: The data warehouse has now been up and running, with monthly load cycles processed successfully into it for some period of time. Through your business users or a database monitoring tool, you become aware of data that has become dormant over time or inactive for some other reason.
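The two load cycle rollback methods just described reduce to constrained deletes over the operational meta data columns. The sketch below uses hypothetical table and column names, with SQLite standing in for the warehouse RDBMS:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales_fact (
    sale_id INTEGER, amount REAL, load_cycle_id INTEGER, source_system TEXT)""")
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?)", [
    (1, 10.0, 41, "POS"),
    (2, 20.0, 42, "POS"),
    (3, 30.0, 42, "WEB"),   # cycle 42 is found to contain suspect data
])

# Method 1: roll back the entire last refresh cycle.
conn.execute("DELETE FROM sales_fact WHERE load_cycle_id = 42")

# Method 2 (alternative): constrain further to one operational system, e.g.
#   DELETE FROM sales_fact WHERE load_cycle_id = 42 AND source_system = 'WEB'
# After either delete, the current indicator on the related dimension rows
# that were flagged 'Y' before the last cycle would need to be re-assigned.

remaining = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
print(remaining)  # 1
```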
Since this data is not currently being utilized on any type of regular basis, a decision is made to either archive it off to a near-line storage device or to purge it from the data warehouse for size and performance reasons. The incorporation of operational meta data columns again provides the data warehouse administrator with a variety of options on how to isolate and flag candidate rows for archiving or purging. First, the load date can be interrogated to isolate rows of data from a particular period. This
method again requires additional overhead on the RDBMS to analyze the load date. The second option is to constrain on rows of data from a particular set of load cycles. This method provides a more efficient means of flagging candidate rows and avoids tokenizing a date column to identify a particular period.

Slowly Changing Dimensions (SCD) Type 2: Data warehouses or marts that use some variant of a star schema data model can utilize the current flag indicator to aid in SCD type 2 processing for dimension table loads. The established method for processing SCD type 2 records is to sort both sources of information, the operational system and the data warehouse, on the production or natural key of the dimension table. The current flag indicator can be used to constrain which records from the dimension table will be compared to the current load cycle data being processed into the warehouse. Only dimension records containing “Y” in the current indicator column need to be compared to the load cycle data, since they contain the most recent information for the specific production key(s). A consecutive comparison is made between the two data files based on these key column(s). Current load cycle records with production key values not found in the dimension table are loaded as new rows; these rows receive new surrogate keys for the dimension and have their current indicator flag set to “Y”, since no other occurrence of the production key(s) combination yet exists in the dimension table. Production key(s) found in both the load cycle data and the dimension table are further interrogated based on the columns the business users have deemed relevant for change tracking. The two sources are sorted by the key and the relevant columns, and a consecutive comparison is made against these files.
Current load cycle records whose relevant columns do not match their corresponding dimension table rows are flagged as new rows. These rows are inserted with new surrogate keys and have the current flag indicator set to “Y”, reflecting an update to a relevant column since the last load cycle. Previously loaded dimension rows that match on production key(s) but differ in relevant columns are assigned a current indicator of “N”, since newer information for those relevant columns has arrived in the most recent load cycle.
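The SCD type 2 comparison logic described above can be sketched as a small routine. This is an illustrative sketch only, with hypothetical column names; in practice this logic runs inside the ETL tool or the RDBMS itself:

```python
def scd2_apply(dimension, load_cycle, tracked_cols, next_sk):
    """Apply one load cycle to a type-2 dimension, maintaining current_flag."""
    # Only rows flagged 'Y' need comparison against the incoming load cycle.
    current = {r["prod_key"]: r for r in dimension if r["current_flag"] == "Y"}
    for rec in load_cycle:
        old = current.get(rec["prod_key"])
        changed = old is not None and any(old[c] != rec[c] for c in tracked_cols)
        if old is None or changed:
            if changed:
                old["current_flag"] = "N"   # expire the previous version
            dimension.append({**rec, "surrogate_key": next_sk, "current_flag": "Y"})
            next_sk += 1
    return dimension

dim = [{"prod_key": "C100", "city": "Jaipur", "surrogate_key": 1, "current_flag": "Y"}]
dim = scd2_apply(dim, [{"prod_key": "C100", "city": "Delhi"}], ["city"], next_sk=2)
print([(r["surrogate_key"], r["city"], r["current_flag"]) for r in dim])
# [(1, 'Jaipur', 'N'), (2, 'Delhi', 'Y')]
```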
This same process of constraining on the current flag indicator and performing comparison on production keys between the dimension table and load cycle is repeated during each update process to the data warehouse. The incorporation of operational meta data columns into the data warehouse architecture allows administrative and data quality measurement questions to be easily answered. Some examples of the questions that can now be answered are:
What operational system(s) provided this particular row of information to the data warehouse? Purpose: Data quality measurement and publishing of ETL load cycle results.
When was this particular row of information last refreshed by the operational system? Purpose: Data quality measurement.
How many rows of information will be affected by an archive or purge of data? Purpose: Administrative and maintenance.
What is the growth rate for a table per load cycle? Purpose: Administrative, maintenance and publishing of ETL load cycle results.
What effective growth rate is being experienced from a particular operational source system on the data warehouse? Purpose: Administrative, maintenance and publishing of ETL results.
What is the relative confidence level of the data currently loaded into the data warehouse? Purpose: Data quality measurement.
What percentage effect on confidence level would be measured if dirty data from an operational system were corrected at the source? Purpose: Data quality measurement.
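Two of the administrative questions above, growth per load cycle and growth per operational source system, reduce to simple aggregations over the operational meta data columns. A sketch with hypothetical names, using SQLite for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_id INTEGER, load_cycle_id INTEGER, source_system TEXT)"
)
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?)", [
    (1, 41, "POS"), (2, 41, "POS"),
    (3, 42, "POS"), (4, 42, "WEB"), (5, 42, "WEB"),
])

# Growth rate per load cycle (administrative/maintenance question above).
growth = conn.execute("""
    SELECT load_cycle_id, COUNT(*) FROM sales_fact
    GROUP BY load_cycle_id ORDER BY load_cycle_id
""").fetchall()
print(growth)       # [(41, 2), (42, 3)]

# Growth attributable to each operational source system.
by_source = conn.execute("""
    SELECT source_system, COUNT(*) FROM sales_fact
    GROUP BY source_system ORDER BY source_system
""").fetchall()
print(by_source)    # [('POS', 3), ('WEB', 2)]
```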
Additionally, data warehouse administrators benefit from the incorporation of operational meta data by gaining increased options and flexibility in accomplishing administrative tasks[9].
3.4 Metadata Primary Goals
Metadata has three primary goals:
Fig. 3.1 Periodic warehouse data[8]
Define: Manage the components necessary to provide an understanding of the DW and the enterprise through its content.
Build and Administer: Manage the components necessary to efficiently define, build and operate the DW. DW metadata includes the structure and content of the operational data that is used to execute DW jobs and to perform the audit, balance and control aspects of the DW environment. This goal requires well-defined metadata about both data and processes.
Navigate: Manage the components necessary to enable self-service use of the DW that is efficient, accurate and repeatable[8].
3.5 Benefits of Metadata:
Table 3.1 Benefits of metadata[8]

Improved Productivity and Availability: Downstream developers can achieve greater productivity and responsiveness by retrieving physical table and file definitions for the development of data movement and transformation processes. Information about data can be accessed on demand.
Better Consistency: The same language and understanding of data can be shared across business and technical users; better consistency can lead to reduced complexity.
Quality Improvement: A consistent language for data that combines business and technical metadata will lead to more reliable interpretations of data.
Smoother Development Techniques: Change management can be aided by providing metadata inventory and impact analysis reports. Taking a metadata-driven approach can smooth the transition from design to development.
Impact Analysis: Impact Analysis allows business and technical users to see the potential results of a change. In business-driven environments where changes occur frequently, this is instrumental in managing the impact to a complex, federated environment.
CHAPTER-4 Literature Review
Literature Review

1. “Data Warehouse Performance Management Techniques”: By Andrew Holdsworth, Oracle Services, Advanced Technologies, Data Warehousing Practice, 2/9/96. This document describes the management and technical issues involved in a Data Warehouse project. Eventually these techniques will be refined through time and experience to become part of Oracle’s overall Data Warehousing strategy.

2. “Performance Tuning Mechanisms for Data Warehouse: Query Cache”: International Journal of Computer Applications (0975 – 8887), Volume 2, No. 2, May 2010. Data warehousing encompasses architectures, algorithms, and tools for bringing together selected data from multiple databases or other information sources into a single repository, called a data warehouse, suitable for direct querying or analysis. In recent years data warehousing has become a prominent buzzword in the database industry, but attention from the database research community has been limited. At the warehouse, queries can be answered and data analysis can be performed quickly and efficiently since the information is directly available, with model and semantic differences already resolved. The primary goal of a data warehouse is to free the information locked up in the operational database so that decision makers and business analysts can query, analyze and plan regardless of the data changes in the operational database. As the number of queries is large, there is in certain cases a reasonable probability that the same query is submitted by one or multiple users at different times. Each time a query is executed, all the data of the warehouse is analyzed to generate the result of that query. The paper examines the common problems faced by data warehouse administrators and proposes strategies that minimize response time and improve the overall efficiency of the data warehouse, particularly when the data warehouse is updated at regular intervals.

3. “Near Real Time ETL”: By Panos Vassiliadis, Alkis Simitsis, Springer journal Annals of Information Systems, AoIS Vol. 3, 2008. Near real time ETL deviates from the traditional conception of data warehouse refreshment, which is performed off-line in batch mode, and adopts the strategy of propagating changes that take place
in the sources towards the data warehouse to the extent that both the sources and the warehouse can sustain the incurred workload. In this article, we review the state of the art for both conventional and
near real time ETL, we discuss the background, the architecture, and the technical issues that arise in the area of near real time ETL, and we pinpoint interesting research challenges for future work.

4. “Data Warehouse Process Management”: By Panos Vassiliadis, Christoph Quix, Yannis Vassiliou, Matthias Jarke. Previous research has provided metadata models that enable the capturing of the static components of a data warehouse architecture, along with information on different quality factors over these components. This paper complements this work with the modeling of the dynamic parts of the data warehouse. The proposed meta model of data warehouse operational processes is capable of modeling complex activities, their interrelationships, and the relationship of activities with data sources and execution details. Moreover, the metamodel complements the existing architecture and quality models in a coherent fashion, resulting in a full framework for quality-oriented data warehouse management, capable of supporting the design, administration and especially evolution of a data warehouse. Finally, the authors exploit the framework to revisit the widespread belief that data warehouses can be treated as collections of materialized views. The meta model has been implemented using the language Telos and the metadata repository system ConceptBase.

5. “Seven Steps to Optimizing Data Warehouse Performance”: By Rob Armstrong, NCR, 0018-9162/01, December 2001, IEEE. The evolution of data warehousing has followed a pattern of compromising the project to overcome technical barriers at the expense of business value. One such compromise is the automatic use of summary data or data models geared toward a single application. Initially, companies simply extracted data from operational systems and put it in another database, typically the one used for online transaction processing. This kept longer-running and more complex analysis workloads from interfering with more mission-critical OLTP work.
Most operational systems store data in a normalized model in which certain rules eliminate redundancy and simplify data relationships. While beneficial for the OLTP workload, this model can inhibit those same OLTP databases from running analytical queries effectively. As data volume grew on analytical systems, queries took more time. Because the analytical systems did not need to support the OLTP workload, many
developers began preplanning for the answer sets. This preplanning included:
• creating summary tables of preaggregated data;
• placing indexes in the system to eliminate scanning large volumes of data;
• denormalizing the model, putting data into one table instead of having tables that need to be joined together; and
• storing the data in sorted order, thus avoiding any sort activity.
For any of these four optimizations to occur, developers must first understand, in detail, the analysis and reports being requested and the data demographics. In the early days, many of those in charge of developing analytical systems did not have a background working in a data warehouse environment. They were trained to optimize performance by manipulating database parameters and data models, which was considered a step forward because previously the databases yielded little or no information. As user communities increased in size and queries became more complex, users often could not define all the analysis they wanted to run, and even if they could, the time required to process and store answer sets became unmanageable. The solution was to once again extract data, this time to online analytical processing systems that specialized in “slicing and dicing” data.

6. “Measuring the Success of a Data Warehouse”: Whitepaper, Threshold Consulting Services. It is difficult to determine success metrics for a data warehouse. Certainly there is evidence that return on investment (ROI) is frequently used to determine warehouse success, but there are many potential metrics that may be used. For example, it is possible to define financial measurements, usage measurements, customer satisfaction, availability, performance, response time, etc., as reasonable warehouse success indicators. However, many of the possible measurements can be interpreted in different ways.
For example, a decrease in queries against a data mart may indicate that use is decreasing, or it may indicate that the users are so happy with the data that they find it more efficient to download it to a cube on their desktop for easier access, thus increasing usage. Taking action based simply on query volume may lead to the wrong actions being taken. In order to obtain as objective a view of warehouse success as possible, the Balanced Scorecard, as described by Robert Kaplan and David Norton, can be modified to measure the data warehouse. The metrics presented in this article are not the only ones that can be used to measure the warehouse. Rather, these should be seen as a set of example measurements that can be refined over time as the organization’s perspective on the data warehouse changes and matures.
7. “ETL vs ELT”: By Robert J Davenport, June 2008, Insource IT Consultancy Ltd. With 80% of a Data Management (DM) or Business Intelligence (BI) project cost being attributable to the design and development of the data warehouse, it is crucial that the right decision is made over the architecture to be employed. Historically, we have had little choice but to use the ETL standard. Emergent technologies are providing toolsets and approaches that challenge this tradition. At least now we have a choice and, as with all choices, it is imperative that the choice be an informed one. A wrong decision can be costly and may fail to meet the objectives set out by the project. In accepting that change is a principle of data warehousing, it is imperative that the approach employed in the design and development of a data warehouse respects, and hopefully embraces, this need. Whilst both ETL and ELT routines can be changed, the process of change in ELT routines involves less cost, time and risk. Many organizations have recognized the benefits that can be gained from the employment of ELT. In spite of having already deployed warehouse solutions that use ETL, they have embarked upon a process of re-engineering their solutions using ELT through the use of effective tools. Whilst data warehouse development traditionalists will undoubtedly pour scorn on the ELT approach, the clear and obvious benefits cannot afford to be ignored.

8. “Teradata® RDBMS Performance Optimization”: By NCR Corporation, B035-1097-122A, December 2003. Teradata RDBMS Performance Optimization provides information that helps you ensure that Teradata operates at peak performance based on your applications and processing needs, and describes release-specific performance enhancements that require no user intervention. This book contains sixteen chapters.

9. “Tuning Servers, Storage and Database for Energy Efficient Data Warehouses”: By Meikel Poess, Raghunath Othayoth Nambiar, 978-1-4244-5446-4/10, 2010 IEEE, ICDE Conference 2010. Undoubtedly, reducing power consumption is at the top of the priority list for system vendors and data center managers, who are challenged by customers, analysts, and government agencies to implement green initiatives. Hardware and software vendors have developed an array of power-preserving
techniques. On-demand-driven clock speeds for processors, energy efficient power supplies, and operating-system-controlled dynamic power modes are just a few hardware examples. Software vendors have contributed to energy efficiency by implementing power efficient coding methods, such as advanced compression, and by enabling applications to take advantage of large memory caches. However, adoption of these power-preserving technologies in data centers is not straightforward, especially for large, complex applications such as data warehouses. Data warehouse workloads typically have oscillating resource utilizations, which makes identifying the largest power consumers difficult. Most importantly, while preserving power remains a critical consideration, performance and availability goals must still be met with systems using power-preserving technologies. This paper evaluates the tradeoffs between existing power-saving techniques and their performance impact on data warehouse applications. The analysis will guide system developers and data center managers in making informed decisions regarding the adoption of power-preserving techniques.

10. “Teradata RDBMS”: By Roland Pfeffer, NCR, Teradata Division. The Teradata system software is the foundation of the relational database server. The purpose of the software is to support SQL manipulations of the database. Teradata software includes: client software, which includes UNICODE support; server software used to support the Teradata RDBMS; Structured Query Language (SQL) and its uses for application programming and interactive database queries; stored procedures; and third-party software.

11. “Active Data Warehousing: The Ultimate Fulfillment of the Operational Data Store”: By Dr. Claudia Imhoff, Intelligent Solutions, Inc. Over the years, data warehousing has gone through a number of evolutions,
from a relatively simple reporting database to sophisticated analytical applications such as analyzing customer lifetime values, market basket analyses, potentially defecting customers, fraud patterns, inventory churns, and so on. In all, though, these static sets of data could not give us the most current and recent changes necessary to ACT upon the results of Business Intelligence analyses. For example, once we could identify a customer likely to go to a competitor, we still could not view their current situation: what products does the customer have with the company, is the customer a VIP requiring special treatment, where are they in the sales cycle? The reason for this lack of insight was that the warehouse was set up to give us static snapshots of data, perhaps as recent as last week. But last week’s (or even last night’s) data is often not sufficient to react to current situations. Things
change rapidly in today’s e-business economy, and the company with the best set of integrated, current data is the one that will not only survive but will actually thrive in this economy. Unfortunately most enterprises today do not have any integrated data other than the snapshots found in their data warehouses. This is where the need for the Operational Data Store (ODS) was generated. And fortunately now, you can have integrated data both in static snapshots and in live, current records: an environment in which both types of data and requirements can co-exist. This concept is called the Active Data Warehouse. To better understand this advance in technology, let’s examine the characteristics that make the ODS so very different from the traditional data warehouse. To do this, you must understand the difference between analytical and operational applications. We classify the analytical applications as Business Intelligence, noting that they consist of the data warehouse supplying data to the various analytical applications in the data marts. The applications running against these components use decision support interfaces (DSIs) and give us great insight into our customers: demographics, buying habits, profitability, lifetime value, and so on. But insight into customer behaviour is not enough. As we stated, you also need the ability to ACT upon these findings by having ready access, from anywhere in the enterprise, to integrated, comprehensive and urgent information about customers, products, inventories, orders, and so on, as well as quick access to some analytical results. The operational applications are called Business Management components. These applications give the enterprise the ability to take action using its collective intelligence and subject knowledge. They also provide the organization with an enterprise-wide understanding of their situation, which facilitates a transition away from the silo business unit or functional viewpoint.
For example, many of us have had the experience where we update our contact information (e.g., new address, phone and fax number) through a web site, only to find ourselves repeating that same information when we talk to a call center representative later that day. This is a prime example of an enterprise with severely isolated systems.

12. “Discover Teradata Meta Data Services”: You can’t effectively use the abundance of data in your data warehouse if you don’t have a clear understanding of the data: what they represent, how they interrelate, and therefore how to best use them. The data warehouse, by its very nature, is a complex and dynamic environment containing vast amounts of data. Arguably the data in the warehouse and the knowledge derived from it are your organization’s most valuable asset, and metadata is essential for understanding the information in the data warehouse. The most effective use of the data warehouse is made through the use of metadata. It helps an organization identify, describe, link, and disseminate the proper information to
all users of the warehouse. Metadata is increasingly noted as a critical factor for understanding the warehouse, as it helps uncover the data that are useful and, just as importantly, the data that are not. A metadata management solution can provide: standardized data definitions; information sharing across the enterprise; impact analysis revealing relationships between data elements; information about redundancies before they occur, eliminating duplication; data lineage information; consistency and quality; adherence to compliance and governance requirements; and employee self-service for search and discovery of information.

13. “Data Warehouse Performance”: By Beixin (Betsy) Lin (Montclair State University, USA), Yu Hong (BearingPoint Inc., USA), and Zu-Hsu Lee (Montclair State University, USA). A data warehouse is a large electronic repository of information that is generated and updated in a structured manner by an enterprise over time to aid business intelligence and to support decision making. Data stored in a data warehouse is non-volatile and time variant and is organized by subjects in a manner to support decision making (Inmon et al., 2001). Data warehousing has been increasingly adopted by enterprises as the backbone technology for business intelligence reporting, and query performance has become the key to the successful implementation of data warehouses. According to a survey of 358 businesses on reporting and end-user query tools, conducted by Appfluent Technology, data warehouse performance significantly affects the Return on Investment (ROI) of Business Intelligence (BI) systems and directly impacts the bottom line of the systems (Appfluent Technology, 2002).
Even though in some circumstances it is very difficult to measure the benefits of BI projects in terms of ROI or dollar figures, management teams are still eager to have a “single version of the truth,” better information for strategic and tactical decision making, and more efficient business processes by using BI solutions (Eckerson, 2003). Dramatic increases in data volumes over time and the mixed quality of data can adversely affect the performance of a data warehouse. Some data may become outdated over time and can be mixed with data that are still valid for decision making. In addition, data are often collected to meet potential requirements, but may never be used. Data warehouses also contain external data (e.g. demographic, psychographic, etc.) to support a variety of predictive data mining activities. All these factors contribute to the massive growth of data volume. As a result, even a simple query may become burdensome to
process and cause overflowing system indices (Inmon et al., 1998). Thus, exploring the techniques of performance tuning becomes an important subject in data warehouse management.

14. “A Metadata-Driven Approach for Data Warehouse Refreshment”: Data warehousing technology supports information management for decision making by integrating data from operational systems and external sources in a separate database, the data warehouse. In contrast to operational systems, which store detailed, atomic and current data accessed by OLTP (online transaction processing) applications, data warehousing technology aims at providing integrated, consolidated and historical data for OLAP (online analytical processing) applications. As time advances and the sources from which warehouse data is extracted change, data warehouse contents must be “synchronized” with the underlying sources. Thus, after an initial loading phase, warehouse data must be regularly refreshed, i.e., modifications of operational data since the last warehouse refreshment must be propagated into the warehouse such that warehouse data reflect the state of the underlying systems. Data warehouse refreshment is a complex process comprising several tasks like monitoring, extracting, transforming, integrating and cleaning data, deriving new data, building histories and loading the data warehouse. This thesis proposes an approach which allows for the modeling and execution of the data warehouse refreshment process at an intermediate layer between operational sources and the target data warehouse, and provides support for several tasks related to warehouse refreshment in data warehouse environments consisting of a wide variety of heterogeneous sources, independently of how target warehouse data is stored persistently. The contribution of the thesis is twofold.
First, the thesis proposes a metamodel which includes a set of modeling constructs for the definition of metadata required for executing the data warehouse refreshment process, such as the description of operational and target warehouse data, the definition of mappings between operational and warehouse data, including refreshment tasks like transformation, cleaning, integration, process scheduling, etc. Second, the thesis describes a methodology and an appropriate operational infrastructure for defining and executing concrete refreshment processes based on the above-mentioned metamodel. The definition of the refreshment process is supported by a metadata administration tool which provides facilities for creating, modifying and checking metamodel instances stored in an object-oriented metadata repository. Furthermore, the thesis describes the way various monitoring techniques can be integrated in a data warehouse environment in order to detect updates in operational systems and refresh the warehouse incrementally. An advanced key management concept is provided which ensures the correctness of keys propagated into the target warehouse, thus improving data quality in the target warehouse.
60
Finally, target warehouse mapping and history management techniques are introduced which consider the specific database design techniques used for data warehouses, as well as the multidimensional and temporal character of warehouse data.
CHAPTER-5 Data Warehouse Performance Tuning
5. Data Warehouse Performance Tuning
Performance: Performance in general can be defined as the accomplishment of a given task measured against preset, known standards of accuracy, completeness, cost, and speed. In a contract, performance is deemed to be the fulfillment of an obligation in a manner that releases the performer from all liabilities under the contract. In the context of a data warehouse:
The performance of a data warehouse is largely a function of the quantity and type of data stored within the database and the query/data-loading workload placed upon the system[9]. Over the last decade, the largest data warehouses have increased from 5 to 100 terabytes, according to Winter Corp., and by 2010 most of today's data warehouses will be 10 times larger, according to The Data Warehouse Institute (TDWI). As data warehouses grow in size to accommodate regulatory requirements and competitive pressures, ensuring adequate database performance will be a big challenge in answering more ad hoc queries from more people[9].
Definition of Acceptable Performance - The simplest definition of acceptable performance is that the data warehouse is able to meet all of the business requirements in the required time scales[9].
5.1 Performance Tuning
Tuning is a process to identify and tune target applications for performance improvements and to proactively prevent application performance problems[10]. The goal of performance tuning is to minimize the response time for each query and to maximize the throughput of the entire database server by reducing network traffic, disk I/O, and CPU time. This goal is achieved through understanding application requirements, the logical and physical structure of the data, and trade-offs between conflicting uses of the database, such as online transaction processing (OLTP) versus decision support[12].
There are two distinct types of tuning:
Proactive Monitoring
Bottleneck Elimination
5.1.1 Proactive Monitoring
Proactive monitoring usually occurs on a regularly scheduled interval, where a number of performance statistics are examined to identify whether the system behavior and resource usage has changed. Proactive monitoring can also be considered as proactive tuning. Usually, monitoring does not result in configuration changes to the system, unless the monitoring exposes a serious problem that is developing. In some situations, experienced performance engineers can identify potential problems through statistics alone, although accompanying performance degradation is usual. Experimenting with or tweaking a system when there is no apparent performance degradation as a proactive action can be a dangerous activity, resulting in unnecessary performance drops. Tweaking a system should be considered reactive tuning, and the steps for reactive tuning should be followed. Monitoring is usually part of a larger capacity planning exercise, where resource consumption is examined to see changes in the way the application is being used, and the way the application is using the database and host resources[10].
5.1.2 Bottleneck Elimination
Tuning usually implies fixing a performance problem. However, tuning should be part of the life cycle of an application, through the analysis, design, coding, production, and maintenance stages. Oftentimes, the tuning phase is left until the system is in production. At this time, tuning becomes a reactive fire-fighting exercise, where the most important bottleneck is identified and fixed. Usually, the purpose of tuning is to reduce resource consumption or to reduce the elapsed time for an operation to complete. Either way, the goal is to improve the effective use of a particular resource. In general, performance problems are caused by the over-use of a particular resource; that resource is the bottleneck in the system. There are a number of distinct phases in identifying the bottleneck and the potential fixes. These are discussed in the sections that follow[10].
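The proactive-monitoring idea can be sketched as a scheduled comparison of resource-usage snapshots against a baseline. The metric names and the 25% drift threshold below are illustrative assumptions for the example, not values taken from any specific DBMS:

```python
# Illustrative sketch of proactive monitoring: compare a new snapshot of
# resource-usage statistics against a baseline and flag metrics whose
# relative change exceeds a threshold. Metric names and the 25% threshold
# are assumptions for this example.

def drifted_metrics(baseline, snapshot, threshold=0.25):
    """Return metrics whose relative change from the baseline exceeds threshold."""
    flagged = {}
    for metric, base_value in baseline.items():
        current = snapshot.get(metric, base_value)
        if base_value and abs(current - base_value) / base_value > threshold:
            flagged[metric] = (base_value, current)
    return flagged

baseline = {"cpu_busy_pct": 40.0, "disk_reads_per_s": 1200.0, "sessions": 85.0}
snapshot = {"cpu_busy_pct": 58.0, "disk_reads_per_s": 1250.0, "sessions": 90.0}

# Only cpu_busy_pct moved by more than 25%: worth a closer look, but per the
# text, no configuration change is made unless a real problem is developing.
print(drifted_metrics(baseline, snapshot))
```

Note that, consistent with the text, the sketch only reports drift; acting on it is a separate, reactive-tuning decision.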
5.2 Problem Area
Areas where a data warehousing system can use a little performance tuning for better performance:
5.2.1 ETL - Given that the data load is usually a very time-consuming process (and hence is typically relegated to a nightly load job), and that data warehousing-related batch jobs are typically of lower priority, the window for data loading is not very long. A data warehousing system whose ETL process finishes right on time is going to have a lot of problems, simply because the jobs often do not get started on time due to factors that are beyond the control of the data warehousing team. As a result, it is always an excellent idea for the data warehousing group to tune the ETL process as much as possible.
5.2.2 Query Processing - Sometimes, especially in a ROLAP environment or in a system where the reports run directly against the relational database, query performance can be an issue. A study has shown that users typically lose interest after 30 seconds of waiting for a report to return. Our experience has been that ROLAP reports, or reports that run directly against the RDBMS, often exceed this time limit, and it is hence ideal for the data warehousing team to invest some time in tuning the queries, especially the most popular ones. We present a number of query optimization ideas.
5.2.3 Report Delivery - It is also possible that end users experience significant delays in receiving their reports due to factors other than query performance. For example, network traffic, server setup, and even the way the front end was built can play significant roles. It is important for the data warehouse team to look into these areas for performance tuning.
Here the focus will be on the report response time as discussed above, and the proposed solution will help to improve it. Many factors can contribute to better reporting, positively or negatively, depending on how they are considered and implemented.
5.3 Factors Affecting Reporting Layer Response Time
Tools Selection
Physical Design
Normalized vs. de-normalized
Relational vs. dimensional
Hybrid
Reporting frequency
Concurrent Users
Amount of data
Indexing
Statistics
All the above-mentioned factors can impact reporting response time, but in this thesis we will explain how to keep statistics up to date, with the help of available metadata, for better reporting.
5.4 Statistics
In a DWH, statistics are precomputed data summaries that help the database optimizer / database engine come up with the most cost-effective execution plan. So what information is there in statistics? The following information can be stored in the DWH, ensuring the pre-availability of the data required to come up with an effective execution plan. For example:
Number of nulls
Number of unique values
Average rows per value
Number of intervals
Number of rows updated
Mode frequency, etc.
With this precompiled information available, the optimizer / database engine will use it rather than calculating or estimating the same at run time, which would otherwise increase the response time unnecessarily[10].
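The statistics listed above can be illustrated with a short sketch that computes them once, at collection time, so nothing has to be rescanned per query. The column data and the exact set of summaries are illustrative, not a real DBMS's internal format:

```python
from collections import Counter

# Sketch of the per-column statistics the text lists, computed once at
# collection time rather than estimated at query time. The sample column
# values are hypothetical.
def collect_column_stats(values):
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    mode_value, mode_frequency = freq.most_common(1)[0]
    return {
        "num_rows": len(values),
        "num_nulls": len(values) - len(non_null),
        "num_unique": len(freq),
        "mode_frequency": mode_frequency,
    }

region = ["N", "S", "N", "E", None, "N", "W", "S"]  # sample column data
stats = collect_column_stats(region)
print(stats)
```

At query time the optimizer reads these precompiled numbers from its system tables instead of scanning the column again.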
How can statistics be defined, and do we need to define them every time?
Statistics can be defined by developers / DBAs / architects to ensure the smooth execution of queries on the box. They can be defined at the level of a column or a combination of columns. And no, statistics need to be defined only once: a statistic is a physical attribute of a database table, defined on a column or combination of columns, but it requires a frequent refresh at the table level, based on the frequency and amount of data change.
Is it necessary?
No, having statistics is not necessary, but depending on the filter and join conditions in queries, it will help to avoid full table scans and the over-consumption of CPU cycles. However, if statistics exist on the database tables, then it is essential to keep them up to date; otherwise they will provide false information to the optimizer / database engine, which in turn feeds into a bad execution plan. For example, suppose the statistics are not up to date: the statistics say there are 10,00,000 (1 million) records in a table, but the count has actually increased by 10 million. This type of inconsistency will lead to costly, CPU-consuming execution plans.
Where will this information be saved, and how will it be used?
The system stores this information in system tables; depending on the database software used, it can be a string field or even a CLOB or BLOB object. When a query is submitted to the database, the optimizer / database engine uses this precompiled information from the system tables to build a cost-effective execution plan.
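The stale-statistics example above can be made concrete with the textbook uniform-distribution selectivity estimate (estimated rows = table rows / distinct values); the 1,000 distinct values are an assumed figure, while the 1 million vs. 10 million row counts mirror the text:

```python
# Stale statistics feed the optimizer a wrong cost estimate. The optimizer
# still believes the table has 1,000,000 rows while it has grown to
# 10,000,000, so its row estimate for an equality filter on a column with
# 1,000 distinct values (an assumed figure) is off by a factor of ten.

def estimated_filter_rows(table_rows, distinct_values):
    # Classic uniform-distribution selectivity estimate: rows / distinct.
    return table_rows // distinct_values

stale_estimate = estimated_filter_rows(1_000_000, 1_000)   # what the plan is built on
actual_rows = estimated_filter_rows(10_000_000, 1_000)     # what really flows through

print(stale_estimate, actual_rows)
```

A plan chosen for roughly a thousand rows (say, a nested-loop join) can be disastrous when ten thousand actually flow through it, which is exactly the costly, CPU-consuming plan the text warns about.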
5.5 Traditional Methodology for Statistics Refresh
At many sites, to address the statistics refresh process, a weekly / fortnightly / monthly job runs on the box which simply refreshes all available statistics, regardless of the functional data load frequency. We analyzed this traditional solution; a more detailed description follows.
5.5.1 What was done in the investigation?
During the analysis we plotted various graphs and trends to identify the nature of the wastage, and we found that the absence of intelligence in the weekly stats collection process was causing this wastage of CPU cycles.
Evident Traits
Every time, stats were refreshed at the database level
No separation for static data
No special consideration for heavily floating facts
Requires a large chunk of CPU once a week, reducing other processes' slices
No aging information available
No auditing of what was, what is, and what should be
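The traditional job criticized above can be sketched in a few lines: it refreshes every table's statistics unconditionally, which is where the wasted CPU comes from. The table names and the `refresh_stats` callable are hypothetical stand-ins for the DBMS's collect-statistics command:

```python
# Sketch of the traditional approach: a scheduled job that refreshes stats
# on every table, with no check for static data, load frequency, data
# change, or stats age. `refresh_stats` is a stand-in for the real DBMS
# command; the table names are hypothetical.

def blind_weekly_refresh(tables, refresh_stats):
    refreshed = []
    for table in tables:       # every table, every run,
        refresh_stats(table)   # even if nothing has changed
        refreshed.append(table)
    return refreshed

tables = ["STG_ORDERS", "DIM_DATE", "FCT_SALES"]
done = blind_weekly_refresh(tables, refresh_stats=lambda t: None)
print(done)
```

Static dimension tables like the hypothetical DIM_DATE are refreshed just as often as heavily loaded facts, which is the repetition the proposed methodology eliminates.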
Fig. 5.1 Available Traditional Methodology
CHAPTER-6 Proposed Methodology
6. Proposed Methodology
What alerted us to a solution? Vilfredo Pareto, an Italian economist, observed that 80% of the land in Italy was owned by 20% of the population; he developed the principle further by observing that 20% of the pea pods in his garden contained 80% of the peas[11]. The study was later generalized by Joseph M. Juran, a business management consultant, and named the Pareto Principle after Vilfredo Pareto: "The Pareto Principle states that 80% of effects come from 20% of causes." If we extend this to the BI life cycle, we see that more than 20% of CPU cycles were consumed by the roughly 80% of tasks that are system management tasks, and if we are running low on CPU cycles then every bit saved is an addition to the pocket, which indirectly converts into savings. Basically, "what we save is also a kind of earning", as quoted by Warren Edward Buffett, an American investor. Along similar lines, when we started analyzing systems we found that, out of 100% of the CPU cycles assigned to system management, approximately 80% were consumed in stats management. Upon further investigation we found that, of that 80%, approximately more than 50% was wastage, caused by unnecessary repetition of the process even when it was not required[11].
6.1 Flow Chart
Step 1: As this process is tightly integrated with ETL at the table level, to initiate it we pass the database name and table name to the process.
Step 2: In this step the process checks the existence of the database object, to make sure the table actually exists in the database. If the table is there, the process goes on to step 3; otherwise it breaks and exits.
Step 3: After confirming existence in step 2, the process makes sure the table has data. If the table is populated, the process goes on to the next step; otherwise it breaks and exits.
Fig.6.1 Flow Chart of Proposed Solution
Step 4: After confirming data population in step 3, the process makes sure that stats exist on the table. If stats are there, it proceeds to the next step; otherwise it breaks and exits.
Step 5: The process now knows, from the previous steps, that stats exist; the next question is whether the data load is incremental or truncate-and-load. Because this cannot be identified from the metadata, the end user passes this information as an input to the process. If the table is truncate-and-load, the process refreshes all the stats on the table; if not, it simply proceeds to the next step.
Step 6: In this step the process determines by what percentage the data of the base table has been modified. If the data change is more than 9%, it goes ahead with a refresh; otherwise it leaves the stats alone.
Step 7: In this step the process checks the age of the existing stats; if it is more than 30 days, it refreshes the stats, otherwise it simply exits the loop.
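Steps 1-7 can be sketched as a single decision function. The metadata lookups are injected as methods on a repository object so the logic stays database-agnostic; the `FakeMeta` class and its return values are hypothetical stand-ins, while the 9% change and 30-day age thresholds come from the text:

```python
# Sketch of the proposed decision flow (steps 1-7). The metadata lookups
# are supplied by a repository object; FakeMeta below is a hypothetical
# stand-in for those queries.

def smart_stats_collection(db, table, is_truncate_load, meta,
                           change_threshold=9.0, max_age_days=30):
    """Return 'refresh', 'skip', or an abort reason, per steps 2-7."""
    if not meta.table_exists(db, table):          # step 2: object exists?
        return "abort: table missing"
    if not meta.table_has_rows(db, table):        # step 3: populated?
        return "abort: table empty"
    if not meta.has_stats(db, table):             # step 4: stats defined?
        return "abort: no stats defined"
    if is_truncate_load:                          # step 5: user-supplied flag
        meta.refresh_stats(db, table)
        return "refresh"
    if meta.pct_data_changed(db, table) > change_threshold:  # step 6: > 9%?
        meta.refresh_stats(db, table)
        return "refresh"
    if meta.stats_age_days(db, table) > max_age_days:        # step 7: > 30 days?
        meta.refresh_stats(db, table)
        return "refresh"
    return "skip"

class FakeMeta:
    """Hypothetical stand-in for the metadata repository queries."""
    def table_exists(self, db, table): return True
    def table_has_rows(self, db, table): return True
    def has_stats(self, db, table): return True
    def pct_data_changed(self, db, table): return 4.0  # below the 9% threshold
    def stats_age_days(self, db, table): return 45     # older than 30 days
    def refresh_stats(self, db, table): pass

print(smart_stats_collection("DWH", "FCT_SALES", False, FakeMeta()))  # refresh
```

In this example the data change (4%) alone would not trigger a refresh, but the stats are 45 days old, so step 7 refreshes them anyway.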
6.2 Process Flow Model
Fig. 6.2 Proposed Model Using STATS
CHAPTER 7 Experiment/Result
7. Experiment/Result
To verify the results of this approach, pilot testing was done on the Teradata data warehouse software, so the terminology is inclined toward that database software; the framework, however, can be generalized across any platform. In this research work we are proposing a metadata-driven approach for performance tuning using Teradata software.
7.1 Teradata Software
The Teradata system software is the foundation of the relational database server. The purpose of the software is to support SQL manipulations of the database[9]. Teradata software includes:
Client software, which includes UNICODE support
Server software used to support the Teradata RDBMS
Structured Query Language (SQL) and its uses for application programming and interactive
database queries
Stored procedures
Third-party software
Client Software
The Teradata RDBMS supports the UCS Transformation Format 8-bit (UTF-8) UNICODE client character set on Teradata clients on a per-session basis. One- to three-byte UTF-8 UNICODE characters from the client are stored on the Teradata RDBMS as two-byte UNICODE characters. By supporting UNICODE, the Teradata RDBMS can:
Support data in multiple languages in the same session
Support non-Latin data, for example, Japanese, Chinese, and so forth, on all systems
The Teradata RDBMS supports running with or without a channel- or network-attached client. The following table describes the available client software, recognizing that the "client" may be the system running the Teradata RDBMS itself:
Table 7.1 Teradata software[9]

Software | Function | Supported Access
BTEQ | Interactive and batch query processor/report generator | Channel- and network-attached clients
C Preprocessor | Permits embedding SQL in C programs | Channel- and network-attached clients
COBOL Preprocessor | Permits embedding SQL in COBOL programs | Channel-attached clients; some network-attached clients
PL/I Preprocessor | Permits embedding SQL in PL/I programs | Channel-attached clients
CLI | Packages and unpackages requests. Can be embedded in application programs using function calls | Channel- and network-attached clients
TDP | Manages data communication. Handles sessions, logging, recovery, restarts, physical I/O from the PEs, and security | Channel-attached clients
MTDP/MOSI | Manages data communication. Handles logging, recovery, restarts, and physical I/O from the PEs. Session and security management are handled by the Gateway software on the server | Network-attached clients
Archive/Restore | Archives data to tape; restores taped data to the Teradata RDBMS | Channel-attached clients
ASF2 (MR-RAS products only) | Archives data to tape; restores taped data to the Teradata RDBMS. Note: this software is supported on legacy systems only | Network-attached clients
FastExport | Extracts large volumes of data from the Teradata RDBMS | Channel- and network-attached clients
FastLoad | Does high-performance data loading from the client into empty tables | Channel- and network-attached clients
MultiLoad | Does high-performance data maintenance, including inserts, updates, and deletions to existing tables | Channel- and network-attached clients
Open Teradata Backup (OTB), includes NetBackup and NetVault | Open-architecture products that provide backup and restore functions for Windows clients | Network-attached clients
TPump | Provides for continuous update of tables; performs insert, update, and delete operations, or a combination of these operations, on tables using the same source feed | Channel- and network-attached clients
ODBC | Provides access to the Teradata RDBMS from various tools, increasing the portability of data | Network-attached clients
Server Software
The server software includes the following:
Database Window
Channel and/or network gateway communications support
Teradata Parallel Database Extensions (PDE)
Teradata database management software
Request dispatcher
Session controller
Access module process (AMP)
Teradata file system
SQL parser (syntaxer, resolver, optimizer, and so forth)
A server may also contain data-loading utilities such as MultiLoad and FastLoad, the data export utility FastExport, and the SQL data access utility BTEQ[9].
7.2 Basic Setup Required
Table size collection should be enabled on the box at 24-hour intervals. The following tables and views should exist in the specified database.
Tables
XXXX_DBO.IS_TABLE_EMPTY: An almost-empty table that is populated only when the master procedure executes
XXXX_DBO.STATS_FCT_STATUS: Status of stats collection at a granular level
XXXX_DBO.STATS_FCT: Status of stats collection at a granular level, but from a different perspective
XXXX_DBO.STATS_METADATA: For overriding the default 9% and 30-day settings for any table. This table should be accessible to data architects only.
Corresponding views on the above tables, with access locking:
XXXX_VM.IS_TABLE_EMPTY
XXXX_VM.STATS_FCT_STATUS
XXXX_VM.STATS_FCT
XXXX_VM.STATS_METADATA
Procedures:
XXXX_APP.INSERT_STATS_FACT
XXXX_APP.IS_10_PERCN_CHANGE
XXXX_APP.IS_STATS_30_DAYS_OLDER
XXXX_APP.IS_STATS_ON_TABLE
XXXX_APP.IS_TABLE_EMPTY
XXXX_APP.SMART_STATS_COLLECTION (Master procedure)
Syntax for the master procedure:
CALL XXXX_APP.SMART_STATS_COLLECTION (DATABASE_NAME, TABLE_NAME, IS_TABLE_TRUNCATE_LOAD (Y/N), ETL_NAME WITH FOLDER NAME / SCRIPT_NAME WITH PATH, OP_1_INTEGER, OP_2_VARCHAR)
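From an ETL step, the master procedure might be invoked through a Python DB-API cursor as sketched below. A real deployment would obtain the cursor from the site's Teradata/ODBC driver; here a minimal fake cursor stands in so the sketch is self-contained, and the database, table, and ETL names are hypothetical:

```python
# Hypothetical invocation of the master procedure from an ETL step via a
# DB-API-style cursor. FakeCursor records executed statements instead of
# talking to a database; parameter order mirrors the syntax above.

class FakeCursor:
    """Records executed statements instead of talking to a database."""
    def __init__(self):
        self.calls = []
    def execute(self, sql, params):
        self.calls.append((sql, params))

def run_smart_stats(cursor, db_name, table_name, is_truncate_load, etl_name):
    cursor.execute(
        "CALL XXXX_APP.SMART_STATS_COLLECTION (?, ?, ?, ?, ?, ?)",
        (db_name, table_name, "Y" if is_truncate_load else "N",
         etl_name, None, None),  # OP_1_INTEGER / OP_2_VARCHAR placeholders
    )

cur = FakeCursor()
run_smart_stats(cur, "DWH_DB", "FCT_SALES", False, "LOAD_SALES/daily.ksh")
print(cur.calls[0][1])
```

The exact parameter-marker style (`?` vs. `:name`) depends on the database driver in use.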
7.3 Results
The graph in Fig. 7.1 shows the CPU utilization of four different users of an application. These users were responsible for operating on stage loads, reporting table loads, reporting needs, etc., on sample data of the firm M/s Adsect Technologies, Jaipur, for two consecutive months in the year 2011. The process was implemented for user 3 alone, and the trend of CPU utilization for user 3 (green) supports the fact that the process is working: in the first three and a half spikes it required more CPU than in the last three and a half spikes. The graph shows that the technique introduced in this work has reduced CPU utilization drastically, and the same may be implemented in practice for further better usage of the technology.
7.3.1 Cost-Benefit Analysis
Let us assume the cost of 1 CPU cycle is 1 unit. Then in the first week the cost incurred is approximately 4,900 units, and this continues through the first four weeks; this result is obtained without using stats. Over the next four weeks, when the STATS approach was applied, the cost is reduced to less than half: the new cost is just 2,400 units per week.
7.3.2 Solution Piloting
As noted above, pilot testing was done on a Teradata database, so the terminology is inclined toward that database software, but the framework can be generalized across any platform.
Fig. 7.1 Typical CPU consumption chart for an application in a DWH
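The cost-benefit figures in Section 7.3.1 can be checked with a few lines of arithmetic; the 4,900 and 2,400 units/week costs are the ones stated in the text (1 CPU cycle = 1 unit):

```python
# Cost-benefit arithmetic from Section 7.3.1: four weeks at ~4,900
# units/week without the approach, then four weeks at ~2,400 units/week
# with the STATS approach applied.

weeks = 4
cost_without = 4900 * weeks
cost_with = 2400 * weeks
savings_pct = 100 * (cost_without - cost_with) / cost_without

print(cost_without, cost_with, round(savings_pct, 1))
```

That is, 19,600 units drop to 9,600 units over the four-week window, a saving of roughly 51%, consistent with the "less than half" claim.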
CHAPTER 8 Conclusion
8. Conclusion
Today data warehousing is the backbone of most MNCs and large-scale organizations, but most of them have not looked into database administration cost, especially over hardware, and the data entry and updation cost. The above work aims at optimizing the cost of data warehousing by tuning performance over OLAP. Instead of updating all attributes and rows of a table for even a single value of an entity, we have proposed a methodology to reduce the effort on the machine side. Here the effort is reduced by cutting the number of CPU cycles used for editing the value, by using STATS.
8.1 Intended Audience
This thesis involves very advanced concepts of data warehousing. While writing it, the Tier 3 and Tier 4 population of the IT industry was kept in mind, as per the list below.
Senior Developers
Project Managers
Technical Leads
DBA
Solution Architects
REFERENCES:
[1] "A Subjective View", part of the Insource Commercial Aspects of BI discussion paper series, Robert J. Davenport, June 2008.
[2] Introduction to Teradata® RDBMS V2R4.1, B035-1091-061A, June 2001.
[3] "Performance Tuning Mechanisms for Data Warehouse: Query Cache", Vishal Gour, International Journal of Computer Applications (0975-8887), Volume 2, No. 2, May 2010.
[4] Teradata Magazine, December 2007.
[5] "ETL vs ELT", Robert J. Davenport, June 2008, White Paper, © Insource IT Consultancy Ltd.
[6] "Data Mining, Data Warehousing and OLAP", Gajendra Sharma, publisher: S.K. Kataria & Sons.
[7] "Clinical Data Repositories and Data Warehouse", Calgary Regional Health Authority, July 19, 1999.
[8] "Metadata Implementation with Ab Initio EME", article by Mike Green, 13 May 2009.
[9] "Data Warehouse Performance Management Techniques", Andrew Holdsworth, Oracle Services, Advanced Technologies, Data Warehousing Practice, 2/9/96.
[10] "Oracle® Database Performance Tuning Guide", 10g Release 2 (10.2), Part Number B14211-03.
[11] http://en.wikipedia.org/wiki/Pareto_principle
[12] "Optimizing Database Performance Overview", SQL Server 2000.
[13] Teradata® RDBMS Performance Optimization, NCR Corporation.
[14] Teradata® RDBMS Performance Optimization, NCR Corporation.