CS614 Short Notes - CS614 Short Questions and Answers
Why are both aggregation and summarization required? - Data Warehousing
Briefly describe the snowflake schema - Data Warehousing
Difference between low granularity and high granularity - Data Warehousing
CDC: timestamping, triggers and partitioning - which is the best? Give the reason
Aggregates or hardware - which is better for enhancing the DWH?
Factors behind poor data quality - Data Warehousing
Differentiate between MOLAP and ROLAP implementations - Data Warehousing
How is a cube created in ROLAP? (CS614 - Data Warehousing)
How does aggregate awareness help users? (CS614 - Data Warehousing)
Timestamps - (CS614 - Data Warehousing)
Classification process and accuracy measurement - (CS614 - Data Warehousing)
Data parallelism - Data Warehousing
Purposes of data profiling - Data Warehousing
Real-life examples of clustering - Data Warehousing
Explain additive and non-additive facts (Data Warehousing)
Clustering and association rules
Reasons for summarization during data transformation
One-to-one transformation and one-to-many transformation
Although summarization and aggregation are sometimes used interchangeably, they are typically applied for the following reasons:
• They are required when the lowest level of detail stored in the data warehouse is at a higher level than the detail arriving from the source. This situation arises when data warehouse queries do not require the lowest level of detail, or when sufficient disk space is not available to store all the detail for the time frame required by the data warehouse.
• They can be used to populate data marts from the data warehouse where the mart does not require the same level of detail as is stored in the warehouse.
• They can be used to roll up detail values when the detail is removed from the warehouse.
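As a rough sketch of the roll-up idea, the following example summarizes hypothetical daily sales detail (the data, column names, and month grain are made up for illustration) into the coarser level a data mart might store:

```python
from collections import defaultdict

# Hypothetical daily sales detail rows arriving from a source system:
# (date, product, amount).
detail = [
    ("2000-06-01", "P1", 100.0),
    ("2000-06-02", "P1", 50.0),
    ("2000-06-01", "P2", 75.0),
    ("2000-07-03", "P1", 20.0),
]

def summarize_to_month(rows):
    """Roll daily detail up to (month, product) totals - the kind of
    summary kept when the warehouse or mart does not need daily grain."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        month = day[:7]                 # keep only the YYYY-MM part
        totals[(month, product)] += amount
    return dict(totals)

summary = summarize_to_month(detail)
```

Once the monthly summary exists, the daily detail can be archived or dropped without losing the ability to answer month-level queries.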
Sometimes a pure star schema might suffer performance problems.
This can occur when a de-normalized dimension table becomes very large and penalizes
the star join operation.
Conversely, sometimes a small outer-level dimension table does
not incur a significant join cost because it can be permanently stored in a memory buffer.
Furthermore, because a star structure exists at the center of a snowflake, an efficient star
join can be used to satisfy part of a query. Finally, some queries will not access data from
outer-level dimension tables.
These queries effectively execute against a star schema that
contains smaller dimension tables. Therefore, under some circumstances, a snowflake
schema is more efficient than a star schema.
Source: CS614 - Data Warehousing - Handouts
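To make the snowflake idea concrete, here is a minimal sketch (all table contents and names are invented) of a snowflaked product dimension, where the small outer-level category table is normalized out of the product table and resolved at query time:

```python
# Hypothetical snowflaked dimension: the product table keeps only a
# category key; the small outer-level category table is separate and,
# in a real system, could sit permanently in a memory buffer.
category = {1: "Beverages", 2: "Snacks"}            # outer-level table
product = {                                          # inner dimension
    10: {"name": "Cola", "category_id": 1},
    11: {"name": "Chips", "category_id": 2},
}
sales = [(10, 3), (11, 5), (10, 2)]                  # fact rows: (product_id, qty)

def qty_by_category(fact_rows):
    """Resolve the snowflake chain fact -> product -> category,
    then sum quantities per category."""
    totals = {}
    for product_id, qty in fact_rows:
        cat = category[product[product_id]["category_id"]]
        totals[cat] = totals.get(cat, 0) + qty
    return totals
```

A query that only needs product-level facts never touches the category table at all, which is why such queries effectively run against a star schema with smaller dimension tables.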
The higher the level of aggregation of the fact table, the fewer the dimensions that can be attached to the fact records.
The converse of this is surprising:
the more granular the data, the more dimensions make sense.
The most granular data in any organization is the most dimensional.
Source: CS614 - Data Warehousing - Handouts
Q. CDC: timestamping, triggers, and partitioning - which is the best? Also give the reason.
Some source systems might use range partitioning, such that the source tables are
partitioned along a date key, which allows for easy identification of new data.
For example, if you are extracting from an orders table, and the orders table is partitioned by
week, then it is easy to identify the current week's data.
Source: CS614 - Data Warehousing - Handouts, Virtual University of Pakistan
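A minimal sketch of week-partition-based change data capture, assuming invented order rows and using the ISO week number as the hypothetical partition key:

```python
from datetime import date

# Hypothetical orders table; each row is (order_date, order_id).
orders = [
    (date(2000, 6, 1), "A"),
    (date(2000, 6, 2), "B"),
    (date(2000, 6, 9), "C"),   # falls in the next ISO week
]

def partition_key(d):
    """Assumed partitioning scheme: (ISO year, ISO week number)."""
    iso = d.isocalendar()
    return (iso[0], iso[1])

def extract_new(rows, current_week):
    """CDC via range partitioning: pull only the rows that land in
    the current week's partition."""
    return [r for r in rows if partition_key(r[0]) == current_week]
```

In a real partitioned table the database would prune to the one partition directly; the filter here just simulates that selection.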
Q. Aggregates or hardware - which is better for enhancing the DWH?
An aggregate is information stored in a data warehouse in summarized form. Aggregations are used to improve the performance of business intelligence tools: when queries run faster, they take up less processing time and the users get their information back more quickly. Aggregates are typically built in one of two ways:
• By excluding one or more dimensions when summarizing a fact table.
• By replacing one or more dimensions with rolled-up versions of themselves.
• The goal of an aggregate program in a large data warehouse must be more than just improving performance.
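The two ways of building an aggregate can be sketched as follows (the fact rows and dimension names are made up; one call drops a dimension, the other rolls month up to year):

```python
from collections import defaultdict

# Hypothetical base fact rows: (store, product, month, sales).
facts = [
    ("S1", "Cola", "2000-06", 10),
    ("S1", "Chips", "2000-06", 4),
    ("S2", "Cola", "2000-07", 6),
]

def aggregate(rows, keep):
    """Build an aggregate by keeping only the named dimensions:
    omitting a dimension excludes it from the summary, while keeping
    'year' instead of 'month' uses a rolled-up version of that
    dimension."""
    totals = defaultdict(int)
    for store, product, month, sales in rows:
        dims = {"store": store, "product": product,
                "month": month, "year": month[:4]}
        key = tuple(dims[d] for d in keep)
        totals[key] += sales
    return dict(totals)

by_product = aggregate(facts, ["product"])          # store dimension excluded
by_store_year = aggregate(facts, ["store", "year"]) # month rolled up to year
```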
Poor quality data is a pandemic problem that needs addressing. For example, the inability to match payroll records to the official employment records can cost millions in payroll overpayments to retirees, personnel on leave-without-pay status, and "ghost" employees. The inability to correlate purchase orders to invoices can likewise be a major problem, producing unmatched disbursements. The resultant costs, such as payroll overpayments and unmatched disbursements, may be significant enough to warrant extensive changes in processes, systems, policy and procedure, and information system data designs.
Source: CS614 - Data Warehousing - Handouts
MOLAP:
• Implemented with a multi-dimensional data structure.
• It physically builds "cubes" for direct access.
• Data is stored in the proprietary file format of a multi-dimensional database (MDD).
• ANSI SQL is not supported.
ROLAP:
• Relational OLAP provides access to information stored in a relational database.
• Data is stored in a relational database (for example, a star schema).
• The fact table is a way of visualizing an "un-rolled" cube.
• ANSI SQL is supported.
Source: CS614 - Data Warehousing - Handouts
A cube is a logical entity containing the values of a certain fact, at a certain aggregation level, at an intersection of a combination of dimensions.
Source: CS614 - Data Warehousing - Virtual University of Pakistan Handouts
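As a rough illustration of deriving a logical cube from relational fact rows (the two dimensions, the "ALL" roll-up label, and the data are all invented), the following computes the fact value at every intersection of product and region, the way ROLAP would with GROUP BY queries:

```python
from collections import defaultdict

# Hypothetical star-schema fact rows: (product, region, sales).
facts = [
    ("Cola", "North", 10),
    ("Cola", "South", 5),
    ("Chips", "North", 7),
]

def build_cube(rows):
    """Compute totals at every intersection of the two dimensions,
    including the 'ALL' roll-up on each axis - the logical cube."""
    cube = defaultdict(int)
    for product, region, sales in rows:
        for p in (product, "ALL"):
            for r in (region, "ALL"):
                cube[(p, r)] += sales
    return dict(cube)

cube = build_cube(facts)
```

Each `(product, region)` key is one cell of the cube; the `("ALL", "ALL")` cell is the grand total.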
The mere existence of aggregate tables does not help user queries: the summary tables must actually be used, ideally automatically, with aggregate awareness built into the BI tool or the RDBMS to force each query to be processed in the most efficient way. In recent years, BI vendors have debated whether aggregate awareness belongs in the BI tool or in the RDBMS. Business Objects is the only BI tool that has built-in functionality to take advantage of RDBMS aggregate tables in a way that is seamless to users.
Timestamp specifies the time and date that a given row was last modified. If the tables in an operational system have columns containing timestamps, then the latest data can easily be identified using the timestamp columns.
Source: CS614 - Data Warehousing - Virtual University of Pakistan Handouts
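A minimal sketch of timestamp-based change capture, using made-up rows and a hypothetical `last_modified` column:

```python
from datetime import datetime

# Hypothetical operational rows, each carrying a last-modified timestamp.
rows = [
    {"id": 1, "last_modified": datetime(2000, 6, 1, 9, 0)},
    {"id": 2, "last_modified": datetime(2000, 6, 2, 14, 30)},
    {"id": 3, "last_modified": datetime(2000, 6, 3, 8, 15)},
]

def changed_since(data, last_extract):
    """Timestamp-based CDC: pick up only the rows modified after the
    previous extract ran."""
    return [r for r in data if r["last_modified"] > last_extract]

delta = changed_since(rows, datetime(2000, 6, 2, 0, 0))
```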
In the classification process, the data set is divided into two parts: one is called the test set and the other is called the training set. We pick the training set and construct a model based on known facts, historical data, and class properties, since we already know the number of classes. After building the classification model, every record of the test set is posed to the model, which decides the class of the input record.
Source: CS614 - Virtual University Handouts
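The train/test split and accuracy measurement can be sketched with a toy nearest-neighbour classifier (the data, the 1-D feature, and the nearest-neighbour rule are all assumptions made for illustration; the handouts do not prescribe a particular model):

```python
def classify(train, value):
    """Toy model built from the training set: assign the class of the
    nearest training record along a single numeric feature."""
    nearest = min(train, key=lambda t: abs(t[0] - value))
    return nearest[1]

# Hypothetical labelled data, split into training and test sets.
train = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test = [(1.5, "low"), (8.5, "high"), (2.2, "low")]

# Pose every test record to the model and measure accuracy.
correct = sum(1 for x, label in test if classify(train, x) == label)
accuracy = correct / len(test)
```

Accuracy is simply the fraction of test records whose predicted class matches the known class.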
Data parallelism is the simplest form of parallelization. The idea is to have parallel execution of a single data operation across multiple partitions of data.
Suppose we want to count the number of accounts where the balance is greater than $5,000 and the open date is after the first of June 2000. The accounts table is partitioned, so we send the query to each query server, and each query server then runs the query against its particular partition.
Source: CS614 - Data Warehousing - Handouts
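The account-counting example can be sketched as follows, with invented partition contents and a thread pool standing in for the query servers:

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Hypothetical accounts table split into three horizontal partitions;
# each row is (balance, open_date).
partitions = [
    [(6000.0, date(2000, 7, 1)), (4000.0, date(2000, 7, 2))],
    [(8000.0, date(2000, 5, 1)), (9500.0, date(2000, 8, 9))],
    [(5500.0, date(2001, 1, 1))],
]

def count_matching(partition):
    """The per-query-server work: count accounts with balance > 5000
    opened after 1 June 2000, within one partition."""
    return sum(1 for balance, opened in partition
               if balance > 5000 and opened > date(2000, 6, 1))

# The same operation runs against every partition in parallel; the
# coordinator then sums the partial counts.
with ThreadPoolExecutor() as pool:
    total = sum(pool.map(count_matching, partitions))
```

Because each partition is processed by the same operation independently, adding partitions (and servers) scales the work out with no change to the query logic.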