Total Number of Female students in BS Telecom
SELECT COUNT(DISTINCT r.SID) AS Expr1
FROM Registration r
INNER JOIN Student s ON r.SID = s.SID
WHERE s.[Last Degree] IN ('F.Sc.', 'FSc', 'HSSC', 'A-Level', 'A level')
  AND r.Discipline = 'TC'
  AND s.Gender = '1'
Describe the lessons learned during the agri-data warehouse case study.
What are the fundamental strengths and weaknesses of k-means clustering?
Data profiling is a process of gathering information about columns. What purposes must it fulfill? Describe briefly.
Data profiling is a process which involves gathering information about columns through the execution of certain queries, with the intention of identifying erroneous records. We run different SQL queries to answer such profiling questions (e.g. how many records there are, which distinct values a column holds, and which values are invalid or missing), and during this process we identify the erroneous records. Whenever we come across an erroneous record, we copy it into an error or exception table and set the dirty bit of the record in the actual student table. We then correct the records in the exception table. After this profiling process we transform the records and load them into a new table, Student_Info.
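A minimal sketch of such profiling queries (the table and column names Student, Gender, Dirty_Bit and Student_Exceptions are assumed here for illustration, not taken from the actual lab schema):

-- Profile a column: how many rows, how many non-NULL and distinct values?
SELECT COUNT(*) AS total_rows,
       COUNT(Gender) AS non_null_gender,
       COUNT(DISTINCT Gender) AS distinct_gender_values
FROM Student;

-- List the values actually present, to spot erroneous codes
SELECT Gender, COUNT(*) AS cnt
FROM Student
GROUP BY Gender;

-- Copy erroneous records into the exception table and set the dirty bit in the source
INSERT INTO Student_Exceptions
SELECT * FROM Student WHERE Gender NOT IN ('0', '1');

UPDATE Student SET Dirty_Bit = 1 WHERE Gender NOT IN ('0', '1');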
Define additive and non-additive facts
Additive facts are those facts which give the correct result by an addition operation. Examples of such facts could be number of items sold, sales amount etc. Non-additive facts can also be added, but the addition gives incorrect results. Some examples of non-additive facts are average, discount, ratios etc.
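A hedged SQL illustration (the Sales fact table and its columns are assumed for the example): summing an additive fact across a dimension is meaningful, while summing a stored average gives a wrong result; the average must instead be recomputed from additive components.

-- Additive fact: sales_amount can be summed across any dimension
SELECT store_id, SUM(sales_amount) AS total_sales
FROM Sales
GROUP BY store_id;

-- Non-additive fact: adding pre-stored averages gives an incorrect result
SELECT store_id, SUM(avg_unit_price) AS meaningless_total
FROM Sales
GROUP BY store_id;

-- Correct approach: recompute the average from its additive components
SELECT store_id, SUM(sales_amount) / SUM(units_sold) AS avg_unit_price
FROM Sales
GROUP BY store_id;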
What are three fundamental reasons for warehousing web data?
1. Searching the web (web mining).
2. Analyzing web traffic.
3. Archiving the web.
First, web warehousing can be used to mine the huge web content when searching for information of interest; it is like searching for a needle in a haystack. The second reason for web warehousing is to analyze the huge volume of web traffic; this can be of interest to web site owners, for e-commerce, for e-advertisement and so on. The last, but not least, reason for web warehousing is to archive the huge web content because of its dynamic nature.
What are the two basic data warehousing implementation strategies and their suitability conditions?
Top Down & Bottom Up approach: A Top Down approach is generally useful for projects where the technology is mature and well understood, as well as where the business problems that must be solved are clear and well understood. A Bottom Up approach is useful, on the other hand, in making technology assessments and is a good technique for organizations that are not leading edge technology implementers. This approach is used when the business objectives that are to be met by the data warehouse are unclear, or when the current or proposed business process will be affected by the data warehouse.
Bitmap Indexes: Concept
List and explain the fundamental advantages of bitmap indexing
Bitmap Index: Advantages
An obvious advantage of this technique is the potential for dramatic reductions in storage overhead. Consider a table with a million rows and a column with four distinct values: storing the column itself at 4 bytes per row takes about 4 MB, whereas the bitmaps indicating which rows hold which of the four values require only about 500 KB (4 values × 1,000,000 bits ≈ 500,000 bytes).
More importantly, the reduction in the size of index "entries" means that the index can sometimes be processed with no I/O and, more often, with substantially less I/O than would otherwise be required. In addition, many index-only queries (queries whose responses are derivable through index scans without searching the database) can benefit considerably.
Database retrievals using a bitmap index can be more flexible and powerful than a B-tree in that a bitmap can quickly obtain a count by inspecting only the index, without retrieving the actual data. Bitmap indexing can also use multiple columns in combination for a given retrieval.
Finally, you can use low-level Boolean logic operations at the bit level to perform predicate evaluation at increased machine speeds. Of course, the combination of these factors can result in better query performance.
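A small hedged sketch in Oracle syntax (the Student table and simplified column names are assumed): bitmap indexes on two low-cardinality columns let such a count be answered by ANDing the bitmaps, i.e. by inspecting only the indexes.

-- Bitmap indexes on low-cardinality columns (Oracle syntax)
CREATE BITMAP INDEX idx_student_gender ON Student (Gender);
CREATE BITMAP INDEX idx_student_degree ON Student (Last_Degree);

-- Index-only query: the count is derived by ANDing the two bitmaps,
-- without retrieving the actual data rows
SELECT COUNT(*)
FROM Student
WHERE Gender = 'F'
  AND Last_Degree = 'HSSC';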
List and explain the fundamental disadvantages of bitmap indexing
Bitmap Index: Disadvantages
Row locking: A potential drawback of bitmaps involves locking. Because a page in a bitmap contains references to so many rows, changes to a single row inhibit concurrent access for all other referenced rows in the index on that page.
Low cardinality: Bitmap indexes effectively contain a cell for each row times each possible value (the product of the number of rows and the number of unique values). Therefore, a bitmap is practical only for low-cardinality columns that divide the data into a small number of categories, such as "M/F", "T/F", or "Y/N" values.
Keyword parsing: Bitmap indexes can parse multiple values in a column into separate keywords. For example, the title "Mary had a little lamb" could be retrieved by entering the word "Mary" or "lamb" or a combination. Although this keyword parsing and lookup capability is extremely useful, textual fields tend to contain high-cardinality data (a large number of values) and are therefore not a good choice for bitmap indexes.
What are the major operations of data mining?
What will be the effect if we program a package by using DTS object model?
A DTS package is very much like a computer program. Like a computer program, a DTS package is prepared to achieve some goal. A computer program contains a set of instructions, whereas a DTS package contains a set of tasks that are logically related to each other. When a computer program is run, some instructions are executed in sequence and some in parallel; likewise, when a DTS package is run, some tasks are performed in sequence and some in parallel. The intended goal of a computer program is achieved when all instructions are successfully executed; similarly, the intended goal of a package is achieved when all tasks are successfully accomplished.
A package can also be programmed using the DTS object model instead of the graphical tools, but DTS programming is rather complicated.
Write down the steps of handling skew in range partitioning?
There are a number of ways to handle skew in data that has been partitioned by range; dates are a good example, with data distributed by quarter across four processors. One solution is to sort the data; sorting identifies the "clusters" within it, and based on those clusters, more or less equal partitions can be created, eliminating or reducing the skew.
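A hedged sketch (SQL Server 2005+ window-function syntax; the Sales table and order_date column are assumed): sorting on the partitioning key and cutting the sorted data into equal-depth slices, e.g. with NTILE, gives roughly equal partitions even when fixed quarterly ranges would be skewed.

-- Equal-depth repartitioning: sort on the skewed key and cut into 4 equal slices
SELECT order_id,
       order_date,
       NTILE(4) OVER (ORDER BY order_date) AS partition_no
FROM Sales;
-- Each partition_no now covers roughly 25% of the rows, regardless of how
-- the dates cluster, which eliminates or reduces the skew.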
Q12: What type of anomalies exist if a table is in 2NF but not in 3NF?
Tables that are in 2NF but not in 3NF contain modification anomalies, caused by transitive dependencies on the key.
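A brief hedged illustration (hypothetical table): a transitive dependency SID -> Discipline -> HOD keeps the table out of 3NF and is what causes the modification anomalies.

-- 2NF but not 3NF: HOD depends on Discipline, not directly on the key SID
CREATE TABLE Student_2NF (
    SID        INT PRIMARY KEY,
    Name       VARCHAR(50),
    Discipline VARCHAR(10),   -- e.g. 'TC', 'CS'
    HOD        VARCHAR(50)    -- transitive: SID -> Discipline -> HOD
);
-- Update anomaly: if the HOD of 'TC' changes, every TC student's row must be updated.
-- Deletion anomaly: deleting the last 'TC' student also loses who the TC HOD is.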
What are three methods for creating a DTS package?
–DTS Import/Export Wizard
–DTS Designer
–Programming DTS applications
Write down the two extremes of Technical Architecture Design.
Attacking the problem from two extremes, neither is correct.
Q1: Explain analytic data application specification in Kimball 5 marks
Analytic applications development
Q2: Business rules are validated using student database in LAB 5 marks
Data profiling is a process which involves gathering information about columns through the execution of certain queries, with the intention of identifying erroneous records. Part of this process is the validation of business rules against the student database.
Q3: 2 real life examples of clustering 5 marks
Examples of Clustering Applications
Marketing: Discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls and don't have a job. Who are they? Students. Marketers use this knowledge to develop targeted marketing programs.
Insurance: Identifying groups of crop insurance policy holders with a high average claim rate, e.g. farmers who destroy their own crops when it is "profitable" to claim.
Land use: Identification of areas of similar land use in a GIS database.
Seismic studies: Identifying probable areas for oil/gas exploration based on seismic data.
Q5: What issues may occur during data acquisition and cleansing in the agriculture case study? 3 marks
Q6: What is meant by the classification process, and how is the accuracy of classification measured? 3 marks
First, the available data set is divided into two parts: one is called the test set and the other is called the training set. We pick the training set and construct a model based on known facts, historical data and class properties, since we already know the number of classes. After building the classification model, every record of the test set is fed to the classification model, which decides the class of the input record. It should be noted that the class of each record in the test set is already known, and this fact is used to measure the accuracy or confidence level of the classification model. Accuracy is found as:
Accuracy (confidence level) = number of correct assignments / total number of assignments
In simple words, accuracy is obtained by dividing the number of correct assignments by the total number of assignments made by the classification model. For example, if the model classifies 90 of 100 test records correctly, its accuracy is 90/100 = 0.9.
Q7: Explain data parallelism with an example. 3 marks
Data parallelism is the simplest form of parallelization: a single data operation is executed in parallel across multiple partitions of the data. The partitions may be defined statically or dynamically, but the same operator runs across all of them concurrently. The idea of data parallelism has existed for a very long time. Parallelization is obtained because the partitions are manipulated semi-independently, and as long as the required coordination stays small, very good speedups can be achieved. A query coordinator keeps the distributed query working and collects its results; it can become a bottleneck, because any work it does is serial execution. The coordination must therefore remain a very small amount of work, otherwise the overhead and the serialized portion of the workload grow.
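A hedged SQL sketch of the idea (the Sales table, its partitions Sales_P1..Sales_P4 and the Partial_Results staging table are all assumed for illustration): the same aggregation runs concurrently on each partition, and the coordinator only merges the small partial results.

-- The same operator runs concurrently on every data partition...
SELECT region, SUM(amount) AS partial_sum FROM Sales_P1 GROUP BY region;
SELECT region, SUM(amount) AS partial_sum FROM Sales_P2 GROUP BY region;
SELECT region, SUM(amount) AS partial_sum FROM Sales_P3 GROUP BY region;
SELECT region, SUM(amount) AS partial_sum FROM Sales_P4 GROUP BY region;

-- ...and the query coordinator does only a small amount of serial work:
-- merging the four small partial-result sets into the final answer.
SELECT region, SUM(partial_sum) AS total_amount
FROM Partial_Results
GROUP BY region;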
Q8: Under what conditions can an operation be executed in parallel? 3 marks
An operation can be executed in parallel when its work can be divided, either with reference to size (splitting the data into partitions) or with reference to divide and conquer (decomposing the task into sub-tasks).
Q9: What sorts of objective metrics are used by companies, and what are the possible issues in formulating these metrics? 2 marks
Q10: Which scripting languages are used to perform complex transformations in a DTS package? 2 marks
Complex transformations are achieved through VBScript or JavaScript that is loaded into the DTS package.
Q11: Cleansing can be broken down into how many steps? Write their names. 2 marks
One can break down cleansing into six steps: elementizing, standardizing, verifying, matching, householding, and documenting.
Q12: What is meant by "keep competition hot" in the context of product selection and transformation while designing a data warehouse? 2 marks
Q13: How is the merge column selected in the case of a sort-merge join? 2 marks
The Sort-Merge join requires that both tables to be joined are sorted on those columns that are identified by the equality in the WHERE clause of the join predicate. Subsequently the tables are merged based on the join columns.
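A hedged example using the Student and Registration tables from the earlier query: the equality column in the join predicate (SID) is the merge column; both inputs are sorted on it and then merged.

-- SID, the column in the equality of the join predicate, is the merge column:
-- both inputs are sorted on SID and then merged.
SELECT s.SID, s.[Last Degree], r.Discipline
FROM Student s
INNER JOIN Registration r ON s.SID = r.SID;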
3) What is the difference between non-keyed and keyed data access? 2
Non-keyed access uses no index. Each record of the database is accessed sequentially, beginning with the first record, then second, third and so on. This access is good when you wish to access a large portion of the database (greater than 85%). Keyed access provides direct addressing of records. A unique number or character(s) is used to locate and access records. In this case, when specified records are required (say, record 120, 130, 200 and 500), indexing is much more efficient than reading all the records in between.
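A hedged SQL illustration (Student table assumed): the first query reads every record sequentially, while the second locates the requested records directly through the key.

-- Non-keyed (sequential) access: every record is read in turn;
-- good when a large portion (> 85%) of the table is needed.
SELECT * FROM Student;

-- Keyed access: the unique key (index) locates specific records directly,
-- without reading all the records in between.
SELECT * FROM Student WHERE SID IN (120, 130, 200, 500);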
4) "Be a diplomat, not a technologist"? 2
The biggest problem you will face during a warehouse implementation will be people, not the technology or the development. You’re going to have senior management complaining about completion dates and unclear objectives. You’re going to have development people protesting that everything takes too long and why can’t they do it the old way? You’re going to have users with outrageously unrealistic expectations, who are used to systems that require mouse-clicking but not much intellectual investment on their part. And you’re going to grow exhausted, separating out Needs from Wants at all levels. Commit from the outset to work very hard at communicating the realities, encouraging investment, and cultivating the development of new skills in your team and your users (and even your bosses).
5) Dirty bit? 2
The dirty bit is a flag set on a record in the actual table (e.g. the student table) to mark it as erroneous once a copy of that record has been placed in the error/exception table during data profiling; the record is then corrected through the exception table.
6) What problems did the industry face when the growth in the usage of master files increased? 3
The spreading of master files and the massive redundancy of data presented some very serious problems.
7) Why is indexing needed (the I/O bottleneck)? 3
Need For Indexing: I/O Bottleneck
Throwing more hardware at the problem doesn't really help, either. Expensive and multiprocessing servers can certainly accelerate the CPU-intensive parts of the process, but the bottom line of database access is disk access, so the process is I/O bound and I/O doesn't scale as fast as CPU power. You can get around this by putting the entire database into main memory, but the cost of RAM for a multi-gigabyte database is likely to be higher than the server itself! Therefore we index. Although DBAs can overcome any given set of query problems by tuning, creating indexes, summary tables, and multiple data marts, or forbidding certain kinds of queries, they must know in advance what queries users want to make and would be useful, which requires domain-specific knowledge they often don't have. While 80% of database queries are repetitive and can be optimized, 80% of the ROI from database information comes from the 20% of queries that are not repetitive. The result is a loss of business or competitive advantage because of the inability to access the data in corporate databases in a timely fashion.
9) What is clickstream data? What are its limitations? 3
Clickstream data has many issues:
1. Identifying the Visitor Origin
2. Identifying the Session
3. Identifying the Visitor
4. Proxy Servers
5. Browser Caches
10) What tasks are performed through the Import/Export Wizard? 3
1. Connect to source Text files
2. Connect to Destination SQL Server
3. Create new database ‘Lahore_Campus’
4. Create two tables Student & Registration
5. Load data from the text files containing student information into Student table
6. Load data from the text files containing registration records into Registration table
11) What is the problem with using SQL to fill up the tables of a ROLAP cube? 3
Problem with simple approach
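The heading above is not expanded in these notes; as a hedged illustration only, one commonly cited problem with the simple SQL approach is that each combination of dimensions needs its own GROUP BY query (and its own scan), so the number of queries explodes with the number of dimensions (the Sales table below is assumed):

-- Plain GROUP BY: one separate query/scan per aggregation combination,
-- so n dimensions require up to 2^n queries to fill all aggregate tables.
SELECT product, region, SUM(amount) AS total FROM Sales GROUP BY product, region;
SELECT product, SUM(amount) AS total FROM Sales GROUP BY product;
SELECT region, SUM(amount) AS total FROM Sales GROUP BY region;
SELECT SUM(amount) AS total FROM Sales;   -- grand total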
12) How is data mining different from statistics? Which one is better? 5
Data Mining Vs. Statistics
13) Persistent cookies limitations? 5
Using Persistent Cookies
The Web site may establish a persistent cookie in the visitor's PC, one that is not deleted by the browser when the session ends.
Misconceptions about data quality
1. You can fix the data
2. Data quality is an IT problem
3. All problems are in the data sources or data entry
4. The data warehouse will provide a single source of truth
5. Comparing with the master copy will fix the problem
Issues of data cleansing
Major issues of data cleansing arose due to data processing and handling at four levels by different groups of people:
1. Hand recordings by the scouts at the field level.
2. Typing hand recordings into data sheets at the DPWQCP office.
3. Photocopying of the typed sheets by DPWQCP personnel.
4. Data entry or digitization by hired data entry operators.
Classification and estimation
Estimation, as opposed to the discrete outcome of classification (i.e. YES or NO), deals with continuous-valued outcomes.
Star Schema: A star schema is generally considered to be the most efficient design for two reasons. First, a design with de-normalized tables encounters fewer join operations. Second, most optimizers are smart enough to recognize a star schema and generate access plans that use efficient "star join" operations. It has been established that a "standard template” data warehouse query directly maps to a star schema.
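A hedged sketch of such a "standard template" query (the fact and dimension table names are assumed): each de-normalized dimension joins to the fact table exactly once, a shape the optimizer can recognize and execute as a star join.

-- Star join: the central fact table joined to de-normalized dimension tables
SELECT d.calendar_year, p.category, st.region, SUM(f.sales_amount) AS total_sales
FROM Sales_Fact f
INNER JOIN Date_Dim d ON f.date_key = d.date_key
INNER JOIN Product_Dim p ON f.product_key = p.product_key
INNER JOIN Store_Dim st ON f.store_key = st.store_key
WHERE d.calendar_year = 2004
GROUP BY d.calendar_year, p.category, st.region;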
Why is a pilot project strategy highly recommended in DWH construction? 5
A pilot project strategy is highly recommended in data warehouse construction, as a full blown data warehouse construction requires significant capital investment, effort and resources. Therefore, the same must be attempted only after a thorough analysis, and a valid proof of concept.
Q2: Define nested-loop join; list and describe its variants. 5
Traditionally Nested-Loop join has been and is used in OLTP environments, but for many reasons, such a join mechanism is not suitable for VLDB and DSS environments. Nested loop joins are useful when small subsets of data are joined and if the join condition is an efficient way of accessing the inner table.
Nested-Loop Join: Variants
1. Naive nested-loop join
2. Index nested-loop join
3. Temporary index nested-loop join
Define dense and sparse indexes, their advantages and disadvantages. (3)
For each record, store the key and a pointer to the record in the sequential file. Why? The index uses less space than the data file, hence less time to search: the time (in I/Os) is logarithmic in the number of blocks used by the index. Such an index can also be used as a secondary index, i.e. with another ordering of the records.
Dense Index: Every key in the data file is represented in the index file
Pro: A dense index, if fits in the memory, costs only one disk I/O access to locate a record given a key
Con: A dense index, if too big to fit into memory, will be expensive when used to find a record given its key
Sparse Index: Concept
Figure-26.2: Sparse index concept
In this case, normally only one key per data block is kept. A sparse index uses less space at the expense of somewhat more time to find a record given its key.
What happens when record 35 is inserted?
Sparse Index: Adv & Dis Adv
• Store first value in each block in the sequential file and a pointer to the block.
• Uses even less space than dense index, but the block has to be searched, even for unsuccessful searches.
• Time (I/Os) logarithmic in the number of blocks used by the index.
Sparse Index: Multi level
What should be done in the case where dates are missing? Do we need to consult the golden copy?
If the dates are missing, we must consult the golden copy. If gender is missing, we are not required to consult the golden copy; in many cases the name can help us identify the gender of the person.
Tasks performed through import/export data wizard
The tasks are those listed under question 10 above: connect to the source text files, connect to the destination SQL Server, create the database and the Student and Registration tables, and load the data from the text files into those tables.
You can't tell when the visitor returns to the site at a later time in a new session.
What is the value validation process?
Value validation is the process of ensuring that each value that is sent to the data warehouse is accurate.
What is the difference between training data and test data?
The existing data set is divided into two subsets: one is called the training set and the other is called the test set. The training set is used to form the model and the associated rules. Once the model is built and the rules are defined, the test set is used for grouping. It must be noted that the groupings of the test set are already known, but its records are put through the model to test the model's accuracy.
Clustering vs. Cluster Detection (5 marks)
In one-way clustering, reordering of rows (or columns) assembles clusters.
If the clusters are NOT assembled, they are very difficult to detect.
Once clusters are assembled, they can be detected automatically, using classical techniques such as K-means.
The K-Means Clustering
PROS AND CONS OF KMEANS
12) Define the project planning task?
Lifecycle Key Steps
Lifecycle begins with project planning during which we assess the organization's readiness for a data warehouse initiative, establish the preliminary scope and justification, obtain resources, and launch the project.
Write down the limitations of pipelining parallelism?
Pipeline parallelism is a good fit for data warehousing (where we are working with lots of data), but it makes no sense for OLTP because OLTP tasks are not big enough to justify breaking them down into subtasks.