Operational versus Informational Systems
Why not operational environment for decision-support?
A Data Warehouse Architecture Model
Operational Data versus Warehouse Data
The Multidimensional versus Relational Model
Data Warehousing for Parallel Environments
No. | Operational System | Informational System
---|---|---
1 | Supports day-to-day decisions | Supports long-term, strategic decisions
2 | Transaction driven | Analysis driven
3 | Data constantly changes | Data rarely changes
4 | Repetitive processing | Heuristic processing
5 | Holds current data | Holds historical data
6 | Stores detailed data | Stores summarized and detailed data
7 | Application oriented | Subject oriented
8 | Predictable pattern of usage | Unpredictable pattern of usage
9 | Serves clerical, transactional community | Serves managerial community
Limitations/challenges:
Common Issues:
Two-tier architecture is not scalable and cannot support large numbers of online end users without additional software modifications. It runs into performance problems associated with PC and network limitations.
The template determines the facts and metrics to be viewed, along with their granularity, multidimensional orientation, and formatting.
The filter performs the qualification function, narrowing the amount of information to be viewed so that it is intelligible.
The report is created by combining a template with a filter. The filter selects a certain location in the n-dimensional state space of the data warehouse. The template is the viewing instrument for assessing the terrain.
An agent is essentially a collection of reports, scheduled to execute with some periodicity. The range of functionality of an agent is limited only by the library of templates. Likewise, the intelligence of the agent is directly related to the sophistication of its filters.
For maximum benefit, it is important to allow for the sharing of filters, templates, reports and agents over the network.
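The template/filter/report/agent relationship described above can be sketched in code. This is a minimal illustration, not a real product's API; all class and field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict  # a warehouse row: dimension/metric name -> value

@dataclass
class Template:
    """Determines which facts and metrics are viewed, and their formatting."""
    columns: list

    def render(self, rows):
        return [{c: r[c] for c in self.columns} for r in rows]

@dataclass
class Filter:
    """Qualification function: narrows the rows to an intelligible subset."""
    predicate: Callable[[Row], bool]

    def apply(self, rows):
        return [r for r in rows if self.predicate(r)]

@dataclass
class Report:
    """A report combines a template (the view) with a filter (the location)."""
    template: Template
    filter: Filter

    def run(self, rows):
        return self.template.render(self.filter.apply(rows))

@dataclass
class Agent:
    """Essentially a collection of reports, executed with some periodicity."""
    reports: list

    def run_all(self, rows):
        return [rep.run(rows) for rep in self.reports]

rows = [
    {"region": "East", "product": "TV", "sales": 120},
    {"region": "West", "product": "TV", "sales": 95},
]
east_tv = Report(Template(["product", "sales"]),
                 Filter(lambda r: r["region"] == "East"))
print(east_tv.run(rows))  # [{'product': 'TV', 'sales': 120}]
```

Because filters, templates, reports and agents are separate objects, each can be shared and recombined independently, which is the point of the final remark above.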
Feature | Operational | Data Warehouse
---|---|---
Data content | current values | archival data, summarized data, calculated data |
Data organization | application by application | subject areas across enterprise |
Nature of data | dynamic | static until refreshed |
Data structure, format | complex; suitable for operational computation | simple; suitable for business analysis |
Access probability | high | moderate to low |
Data update | updated on a field-by-field basis | accessed and manipulated; no direct update |
Usage | highly structured repetitive processing | highly unstructured analytical processing |
Response time | subsecond to 2-3 seconds | seconds to minutes |
Operational Data | Warehouse Data
---|---
Short-lived, rapidly changing | Long-lived, static
Requires record-level access | Data is aggregated into sets, similar to relational database |
Repetitive standard transactions and access patterns | Ad hoc queries with some specific reporting |
Updated in real time | Updated periodically with mass loads |
Event driven - process generates data | Data driven - data governs process |
A user's view of the enterprise is multidimensional in nature. Sales can be viewed not only by product but also by region, time period, and so on. OLAP models should therefore be multidimensional as well.
Most approaches to OLAP center on reformulating relational or flat-file data into a multidimensional data store that is optimized for data analysis. This multidimensional data store, known as a hypercube, stores the data along dimensions. Analysis requirements span a spectrum from statistics to simulation. The two popular forms of analysis are 'slice and dice' and 'drill-down'.
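A toy hypercube makes 'slice' and its aggregating counterpart concrete. This is a sketch only: real OLAP engines use optimized storage, and the dimensions (product, region, year) and function names below are invented for illustration.

```python
from collections import defaultdict

# Toy hypercube: each cell is keyed by a (product, region, year) coordinate.
cube = {
    ("TV",    "East", 1998): 120,
    ("TV",    "West", 1998): 95,
    ("Radio", "East", 1998): 40,
    ("TV",    "East", 1999): 150,
}

def slice_cube(cube, axis, value):
    """'Slice': fix one dimension (0=product, 1=region, 2=year) to one value."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def roll_up(cube, axis):
    """Aggregate one dimension away (the inverse of drilling down into it)."""
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[:axis] + key[axis + 1:]] += v
    return dict(totals)

east = slice_cube(cube, 1, "East")           # sales in the East only
by_product = roll_up(roll_up(cube, 2), 1)    # total sales per product
print(by_product)  # {('TV',): 365, ('Radio',): 40}
```

Drill-down is the reverse navigation: starting from `by_product`, a user would re-expand the region or year dimension to see the detailed cells that make up each total.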
Feature | OLTP | OLAP
---|---|---
Purpose | Run day-to-day operations | Information retrieval and analysis
Structure | RDBMS | RDBMS |
Data Model | Normalized | Multidimensional |
Access | SQL | SQL plus data analysis extensions |
Type of Data | Data that runs the business | Data to analyse the business |
Condition of data | Changing, incomplete | Historical, descriptive |
Systems can share disks and main memory. In addition, each processor has local cache memory. These are referred to as tightly coupled or SMP (Symmetric Multi Processing) systems because they share a single operating system instance. SMP looks like a single computer with a single operating system. A DBMS can use it with little, if any, reprogramming.
In a shared-resource environment, each processor executes a task on the required data, which is shipped to it. The problem with data shipping is that the resulting interprocessor communication limits the computer's scalability.
Each processor has its own memory, its own OS, and its own DBMS instance, and each executes tasks on its private data stored on its own disks. Shared-nothing architectures offer the most scalability and are known as loosely coupled or Massively Parallel Processing (MPP) systems. The processors are connected, and messages or functions are passed among them. Shipping tasks to the data, instead of data to the tasks, reduces interprocessor communications. Programming, administration and database design are intrinsically more difficult in this environment than in the SMP environments.
An example is the high-performance switch used in IBM's Scalable Power Parallel Systems 2 (SP2). This switch is a high bandwidth crossbar, just like the one used in telephone switching, that can connect any node to any other node, eliminating transfer through intermediate nodes.
A node failure renders data on that node inaccessible. Therefore, there is a need for replication of data across multiple nodes so that you can still access it even if one node fails, or provide alternate paths to the data in a hybrid shared-nothing architecture.
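The replication idea can be sketched as a placement scheme in which each partition has a primary node and a replica on a neighboring node. The node names and the simple "next node" replication rule below are assumptions for illustration, not any specific product's layout.

```python
NODES = ["node0", "node1", "node2", "node3"]

def placement(partition, nodes=NODES):
    """Map a partition to a primary node and a replica on the next node."""
    primary = partition % len(nodes)
    replica = (primary + 1) % len(nodes)
    return nodes[primary], nodes[replica]

def read_partition(partition, failed=frozenset()):
    """Return a live node holding the partition; fail only if both copies are down."""
    primary, replica = placement(partition)
    if primary not in failed:
        return primary
    if replica not in failed:
        return replica
    raise RuntimeError(f"partition {partition} unavailable")

print(read_partition(2))                    # served by its primary, node2
print(read_partition(2, failed={"node2"}))  # falls back to the replica, node3
```

With this layout a single node failure leaves every partition reachable through its replica, which is exactly the availability argument made above.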
In this type, multiple 'tightly coupled' SMP systems are linked together to form a 'loosely coupled' processing complex. Clustering requires shared resource coordination via a lock manager to preserve data integrity across the RDBMS instances, disks, and tape drives. While clustering SMP systems requires a looser coupling among the nodes, there is no need to replace hardware or rewrite applications.
An example is Sequent's Symmetry 5000 SE100 cluster, which supports more than 100 processors.
A natural benefit of clustered SMP is much greater availability than either MPP or standalone SMP systems.
Every component of an SMP system is controlled by a single executing copy of an OS managing a shared global memory. Because memory in an SMP system is shared among the CPUs, SMP systems have a single address space and run a single copy of the OS and application. All processes are fully symmetric in the sense that any process can execute on any processor at any time. As system loads and configurations change, tasks or processes are automatically distributed among the CPUs - providing a benefit known as dynamic load balancing.
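Dynamic load balancing under symmetric scheduling can be illustrated with a worker pool: any worker (standing in for a CPU) may pick up any task, so short and long tasks spread automatically. This is only an analogy in user-space code; the task sizes and worker count below are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def handle(query_size):
    """Stand-in for real work; larger queries take longer."""
    return sum(range(query_size))

# Four symmetric "CPUs": any worker can execute any task at any time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, [10, 1000, 10, 1000, 10]))

print(results)
```

Because no worker is dedicated to a particular kind of task, a worker that finishes a small query immediately takes the next pending one, mirroring how an SMP operating system redistributes processes as the load changes.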
Early multiprocessing systems were designed around an asymmetric paradigm, in which one master processor handles all operating system tasks while the remaining processors, referred to as slave processors, run only user processes. The disadvantages are that the master becomes a bottleneck as the system scales and that the workload cannot be balanced dynamically across the processors.
For example, the system can buffer data in memory for multiple tasks. It can retrieve data to be scanned and sorted and also retrieve more data for the next transaction. The more disks and controllers the system has, the faster it can feed memory and the CPU.
In practice, there is a combination of simultaneous and sequential SQL operations to be performed. Therefore, partitioned parallelism is typically combined with pipelined parallelism.
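The combination of partitioned and pipelined parallelism can be sketched as parallel partition scans feeding a downstream sort through a queue. The partitioning scheme, predicate, and thread counts below are invented for the example.

```python
import queue
import threading

# Partitioned table: 4 partitions, each scanned by its own worker.
partitions = [list(range(i, 100, 4)) for i in range(4)]
rows_q = queue.Queue()

def scan(part):
    """Partitioned parallelism: each worker scans only its own partition."""
    for row in part:
        if row % 7 == 0:        # apply the WHERE predicate during the scan
            rows_q.put(row)
    rows_q.put(None)            # end-of-partition marker

threads = [threading.Thread(target=scan, args=(p,)) for p in partitions]
for t in threads:
    t.start()

# Pipelined parallelism: the sort stage consumes rows while scans still run.
result, done = [], 0
while done < len(partitions):
    row = rows_q.get()
    if row is None:
        done += 1
    else:
        result.append(row)
for t in threads:
    t.join()
result.sort()
print(result)  # multiples of 7 below 100
```

The scans run simultaneously on disjoint data (partitioned), while the sort overlaps with them by consuming rows as they arrive (pipelined), which is the combination the paragraph above describes.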
There are 3 types of multidimensional OLAP tools:
These notes were compiled by V.V.S.Raveendra in June 1999.