Data Warehouse & Mining Viva Questions PDF

Summary

This document contains a set of viva questions on data warehousing and data mining from the Data Warehousing & Mining course at the University of Mumbai. The questions cover data warehousing concepts, ETL, OLAP and OLTP, star and snowflake schemas, and common data mining techniques. It appears to be a study guide for students taking a data warehousing and data mining class.

Full Transcript

Data Warehouse & Mining Viva Questions
Data Warehousing & Mining (University of Mumbai)

1. What is data warehousing?
Data warehousing (DW) is the process of collecting and managing data from varied sources to provide meaningful business insights. A data warehouse is typically used to connect and analyze business data from various sources.

2. What is a data warehouse?
A data warehouse is an electronic store of an organization's historical data, kept for the purpose of reporting, analysis, and data mining or knowledge discovery.

3. What is data purging?
Data purging is the process of permanently removing junk data from a database, for example columns filled with unnecessary NULL values. It is usually done when the database grows too large.

4. What are the different problems that data mining can solve?
• It helps analysts make faster business decisions, which increases revenue and lowers costs.
• It helps to understand, explore, and identify patterns in data.
• It automates the process of finding predictive information in large databases.
• It helps to identify previously hidden patterns.

5. What is a dimension table?
A dimension table is a table in the star schema or snowflake schema of a data warehouse. It stores attributes, or dimensions, that describe the objects in a fact table.

6. What is a fact table?
A fact table is the central table in the star schema or snowflake schema of a data warehouse. It contains the measurements of business processes, along with foreign keys to the dimension tables.

7. What is data mining?
Data mining is the process of sorting through large data sets to identify patterns and relationships that can help solve business problems through data analysis.

8. Difference between OLAP and OLTP

OLAP                                                        OLTP
OLAP is an acronym for Online Analytical Processing.        OLTP is an acronym for Online Transaction Processing.
Consists of historical data from various databases.         Consists only of current operational data.
OLAP has long transactions.                                 OLTP has short transactions.
Based on SELECT commands to aggregate data for reporting.   Based on INSERT, UPDATE, and DELETE commands.
Complex queries.                                            Simpler queries.

9. What is ETL?
ETL stands for Extract, Transform, and Load. An ETL tool reads data from a specified data source and extracts the desired subset of it. Next, it transforms the data using rules and lookup tables to convert it to the desired state. Finally, the load step writes the resulting data to the target database.

10. What is a data mart?
A data mart is a subset of the data stored in the overall data warehouse, built for the needs of a specific team, section, or department within the enterprise. Data marts make it much easier for individual departments to reach key data insights quickly, and they help prevent departments from interfering with each other's data.

11. What is the difference between a data warehouse and OLAP?
A data warehouse is the place where all the data is stored for analysis, whereas OLAP is used for analyzing that data and managing aggregations.

12. What is a star schema?
A star schema is a data warehousing architecture model in which one fact table references multiple dimension tables. Viewed as a diagram, it looks like a star, with the fact table in the center and the dimension tables radiating from it. It is the simplest of the data warehousing schemas and is in wide use.

13. What is a snowflake schema?
The snowflake schema is an extension of the star schema. The main difference is that in this architecture each dimension table can itself be linked to one or more other dimension tables. The aim is to normalize the data.

14. What is metadata?
Metadata is defined as data about the data. It contains information such as the number of columns used, fixed width and limited width, the ordering of fields, and the data types of the fields.

15. What is a decision tree algorithm?
A decision tree is a supervised learning algorithm used mainly for classification. It uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts at a root node and ends with a decision made at the leaves.

16. What is the Naïve Bayes algorithm?
Naïve Bayes is a supervised learning algorithm that is based on Bayes' theorem and used for solving classification problems. It is mainly used in text classification, which involves high-dimensional training datasets. It is one of the simplest and most effective classification algorithms.

17. Explain the clustering algorithm.
A clustering algorithm groups sets of data with similar characteristics; the groups are called clusters. These clusters help in making faster decisions and exploring data. The algorithm first identifies relationships in a dataset and then generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm redefines the groupings to create clusters that better represent the data.

18. Explain the association algorithm in data mining.
Association rule mining is a procedure for finding frequent patterns, correlations, associations, or causal structures in data sets held in various kinds of databases, such as relational databases, transactional databases, and other data repositories. Given a set of transactions, association rule mining aims to find rules that let us predict the occurrence of a specific item based on the occurrences of the other items in the transaction.

19. Differentiate star schema and snowflake schema

Star Schema                                                 Snowflake Schema
It contains a fact table surrounded by dimension tables.    One fact table surrounded by dimension tables, which are in turn surrounded by further dimension tables.
Simple DB design.                                           Very complex DB design.
High level of data redundancy.                              Very low level of data redundancy.
Denormalized data structure; queries also run faster.       Normalized data structure.
Single dimension table contains aggregated data.            Data is split into different dimension tables.

20. What are the characteristics of a data warehouse?
• Subject oriented
  o A data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations.
  o These subjects can be products, customers, suppliers, sales, revenue, etc. A data warehouse does not focus on ongoing operations; rather, it focuses on modelling and analysis of data for decision making.
• Integrated
  o A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files, etc.
  o This integration enhances the effective analysis of data.
• Time variant
  o The data collected in a data warehouse is identified with a particular time period.
  o The data in a data warehouse provides information from a historical point of view.
• Non-volatile
  o Non-volatile means that previous data is not erased when new data is added.

21. What are typical data mining techniques?
  1. Classification: This analysis is used to retrieve important and relevant information about data and metadata. This data mining method helps to classify data into different classes.
  2. Clustering: Clustering analysis is a data mining technique for identifying data items that are similar to each other. This process helps to understand the differences and similarities between the data.
  3. Regression: Regression analysis is the data mining method of identifying and analyzing the relationship between variables. It is used to identify the likelihood of a specific variable, given the presence of other variables.
  4. Association rules: This data mining technique helps to find the association between two or more items. It discovers hidden patterns in the data set.
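Worked example for questions 5, 6, 8, and 12: the sketch below builds a tiny star schema with Python's built-in sqlite3 module and runs one OLAP-style aggregate query against it. The table names (dim_product, dim_store, fact_sales), the column names, and the sample rows are all hypothetical and chosen only for illustration; a real warehouse would be far larger and would normally run on a dedicated analytical database.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table and two dimension tables (hypothetical names)."""
    conn.executescript("""
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name       TEXT,
            category   TEXT
        );
        CREATE TABLE dim_store (
            store_id INTEGER PRIMARY KEY,
            city     TEXT
        );
        -- The fact table holds the measurements plus foreign keys
        -- that point at the dimension tables.
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            store_id   INTEGER REFERENCES dim_store(store_id),
            units      INTEGER,
            revenue    REAL
        );
    """)
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Pen", "Stationery"), (2, "Notebook", "Stationery"), (3, "Lamp", "Electrical")],
    )
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(10, "Mumbai"), (20, "Pune")])
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
        [(1, 10, 5, 50.0), (2, 10, 2, 80.0), (3, 20, 1, 300.0), (1, 20, 3, 30.0)],
    )


def revenue_by_category(conn):
    """OLAP-style query: join the fact table to a dimension and aggregate."""
    rows = conn.execute("""
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()
    return dict(rows)


conn = sqlite3.connect(":memory:")
build_star_schema(conn)
print(revenue_by_category(conn))
```

The query is a read-only SELECT that aggregates over history, which is the OLAP side of question 8; the INSERT statements that populate the tables are the kind of short write that an OLTP system handles.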
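Worked example for question 9: the three ETL steps can be sketched with the Python standard library alone. Everything specific here is an assumption made for illustration: the inline CSV text stands in for a real data source, COUNTRY_LOOKUP is a made-up lookup table, the "skip rows with no amount" rule is an invented transformation rule, and the orders table is a hypothetical target.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real job would read a file or a database.
RAW_CSV = """order_id,country_code,amount
1,IN,100.50
2,US,20
3,IN,
4,XX,15.25
"""

# Hypothetical lookup table used during the transform step.
COUNTRY_LOOKUP = {"IN": "India", "US": "United States"}


def extract(text):
    """Extract: read the source and return its rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: apply rules and lookups to reach the desired state."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # rule (assumed): drop rows with a missing amount
        country = COUNTRY_LOOKUP.get(row["country_code"], "Unknown")
        cleaned.append((int(row["order_id"]), country, float(row["amount"])))
    return cleaned


def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
print(conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Keeping extract, transform, and load as separate functions mirrors the three stages named in the answer and makes each stage testable on its own.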
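Worked example for question 18: association rules are usually scored with support (how often an itemset appears) and confidence (how often the rule holds when its left-hand side appears). The transcript does not name these measures, so treat them as the standard textbook definitions rather than something taken from this document. The transactions below are invented sample data.

```python
# Hypothetical market-basket transactions, one set of items per basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


print("support(bread)            =", support({"bread"}, TRANSACTIONS))
print("support(bread, milk)      =", support({"bread", "milk"}, TRANSACTIONS))
print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}, TRANSACTIONS))
```

In this sample, bread appears in three of four baskets and bread with milk in two, so the rule "bread implies milk" holds in two of the three baskets that contain bread.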
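Worked example for question 17: the iterative regrouping described there can be illustrated with k-means, one common clustering algorithm. The transcript does not say which algorithm it has in mind, so k-means is an assumption here, and the one-dimensional points and starting centroids are invented for the sketch.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign points to the nearest centroid, then recompute."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # groupings stopped changing, so the process has converged
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 12.0])
print(centroids, clusters)
```

Each pass through the loop is one round of the "redefine the groupings" step from the answer; the loop stops early once a pass leaves the centroids unchanged.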
