Data Analytics Assignments Solutions PDF

Summary

This document provides solutions to data analytics assignments dated December 5, 2024. It covers database architecture, sources and types of data, data preprocessing, data quality assessment, data transformation, data integration, types of data analytics, analytics tools, and relational and NoSQL databases.

Full Transcript

# Data Analytics Assignments Solutions December 5, 2024

## 1. Explain the design of database architecture for managing data

Database architecture is the design of how data is stored, accessed, and managed in a system. It consists of several layers and components, each with a specific role in organizing and handling data.

### Key Components of Database Architecture

1. **Physical Layer:** Deals with how data is physically stored on devices (e.g., hard drives).
2. **Logical Layer:** Defines the structure of the data (e.g., tables, columns, and relationships between them) without worrying about how it is stored physically.
3. **View Layer:** Provides different ways to view the data for users, depending on their needs.

### Core Components

- **DBMS (Database Management System):** Software that manages the database. It includes:
  - **Query Processor:** Interprets and runs queries.
  - **Transaction Manager:** Ensures data changes are safe and consistent.
  - **Storage Manager:** Handles data storage and retrieval.
- **Data Models:** Abstractions that describe how data is organized. Common models include:
  - **Relational Model:** Data in tables.
  - **Object-Oriented Model:** Data in objects, as in programming.
- **Data Integrity & Security:** Ensures data is accurate and protected. This includes enforcing rules (such as primary keys) and using security measures (such as passwords).
- **Database Connectivity:** Enables applications to interact with the database using APIs or other methods.

### Types of Database Architecture

1. **Single-Tier:** All components reside in one system.
2. **Two-Tier:** A client-server setup where the client interacts with a server that stores the database.
3. **Three-Tier:** A more layered system with:
   - **Presentation Layer:** User interface.
   - **Application Layer:** Business logic.
   - **Data Layer:** Database server.

## 2. List and discuss various sources of data

Collected data is known as raw data. Raw data is not directly useful; once the impure data is cleaned and used for further analysis, it becomes information, and the insight obtained from that information is known as knowledge. Knowledge takes many forms, such as business knowledge, knowledge of an enterprise's product sales, or knowledge of disease treatment. The main goal of data collection is to gather information-rich data.

### Primary Data

Primary data refers to raw, original data collected directly from first-hand sources. This type of data is collected using techniques such as questionnaires, interviews, and surveys. Some methods of collecting primary data are:

- **Interview Method:** Data is collected through interviews, where questions are asked and responses are recorded in notes, audio, or video format. Interviews can be conducted face-to-face, by phone, by email, or online.
- **Survey Method:** Data is collected through surveys, online or offline, by asking a set of questions and recording the responses. Examples include market research surveys and product feedback surveys.
- **Observation Method:** Data is collected by observing the behaviors and actions of a target group. The data can be recorded as text, audio, or video. Examples include observing consumer behavior in a store or on a website.
- **Experimental Method:** Data is collected by performing experiments. Common experimental designs include Completely Randomized Design (CRD), Randomized Block Design (RBD), Latin Square Design (LSD), and Factorial Design (FD).

### Secondary Data

Secondary data refers to data that has already been collected and is reused for a different purpose.
It is derived from two sources:

- **Internal Sources:** Data available within an organization, such as sales records, transaction data, market records, and customer feedback.
- **External Sources:** Data obtained from outside the organization, which can be more expensive and time-consuming to gather. Examples include government publications, news reports, and data from international organizations.

### Other Sources of Data

In addition to primary and secondary data, other sources are widely used in data analysis:

- **Sensor Data:** Data collected through IoT (Internet of Things) devices. Examples include data from smart home devices, wearable health trackers, and environmental sensors.
- **Satellite Data:** Data collected from satellites, primarily in the form of images. It is used in fields such as weather forecasting, remote sensing, and geospatial analysis.
- **Web Traffic Data:** Data collected from user interactions on websites, including search queries, clicks, and browsing behavior. Examples include data from search engines, web analytics platforms, and social media platforms.

| Type of Data | Sources |
|---|---|
| Primary Data | Interviews, Surveys, Observations, Experiments |
| Secondary Data | Internal Sources (organization's records), External Sources (government publications, news, etc.) |
| Other Data | Sensor Data (IoT devices), Satellite Data, Web Traffic Data |

## 3. List the different types of data with examples

### 1. Qualitative Data (Categorical Data)

Qualitative data represents non-numerical information and is used to describe qualities or characteristics.

#### a) Nominal Data

Nominal data categorizes data into distinct groups or labels without any specific order or numerical significance.

- Colors (red, blue, green, orange)
- Fruits (apples, bananas, grapes)
- Gender (male, female, other)
- Marital status (single, married, divorced, widowed)
- Blood type (A, AB, O, B)
- Days of the week (Monday, Tuesday, Wednesday, etc.)

#### b) Ordinal Data

Ordinal data represents categories with a meaningful order, but the differences between categories are not measurable.

- Reviews (excellent, good, fair, poor)
- Educational qualification (high school, undergraduate, postgraduate)
- Grades in exams (A, B, C, D)
- Economic background (below poverty line, middle class, rich)

### 2. Quantitative Data (Numerical Data)

Quantitative data represents numerical information that can be counted or measured.

#### a) Discrete Data

Discrete data refers to distinct, separate numerical values, usually counted in whole numbers or integers.

- Total number of students in a college
- Number of cars in a parking area
- Number of members in a family
- Number of wheels on a car

#### b) Continuous Data

Continuous data refers to numerical values that can take any value within a range and can be expressed in fractions or decimals.

- Height of a person
- Temperature in Celsius or Fahrenheit
- Weight in pounds or kilograms
- Distance in meters or kilometers
- Share price in the market

## 4. Discuss the various steps in data preprocessing

Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and make it more suitable for the specific data mining task.
- **Data Cleaning:** Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually performed as an iterative two-step process consisting of discrepancy detection and data transformation.
- **Data Integration:** Data integration combines data from multiple sources, which may involve dealing with inconsistencies such as different names for the same attribute (e.g., "customer id" vs. "cust id"). The resolution of semantic heterogeneity, metadata, correlation analysis, tuple duplication detection, and data conflict detection contribute to smooth data integration.
- **Data Transformation:** Data transformation routines convert the data into forms appropriate for mining. For example, in normalization, attribute data are scaled so as to fall within a small range such as 0.0 to 1.0. Data discretization transforms numeric data by mapping values to interval or concept labels. Such methods can be used to automatically generate concept hierarchies for the data, which allows mining at multiple levels of granularity. Discretization techniques include binning, histogram analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept hierarchies may be generated based on schema definitions as well as the number of distinct values per attribute.
- **Data Reduction:** Data reduction techniques obtain a reduced representation of the data while minimizing the loss of information content. These include dimensionality reduction, numerosity reduction, and data compression.

## 5. How is the quality of data assessed?

Data quality is defined in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability. These qualities are assessed with respect to the intended use of the data; a small programmatic sketch of such checks follows the list below.

1. **Accuracy:** The data must be correct and free of errors. Inaccuracy can arise from faulty collection tools, human error, or inconsistencies in naming conventions. For instance, if sales records contain incorrect pricing information, decisions based on this data could be flawed.
2. **Completeness:** All necessary values must be recorded.
3. **Consistency:** The data must be uniform and free from discrepancies.
4. **Timeliness:** Outdated or incomplete data that is not updated promptly can negatively impact decision-making. For example, calculating sales bonuses based on incomplete monthly data could result in incorrect bonus distributions.
5. **Believability:** Believability is about how much users trust the data. Even if data is accurate now, past errors might lead users to distrust it. For example, if a database had numerous errors previously, users might still doubt its reliability.
6. **Interpretability:** Interpretability refers to how easily the data can be understood. If a database uses complex accounting codes that users do not understand, even accurate and complete data can be perceived as low quality.
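As a rough illustration of how some of these quality dimensions can be checked programmatically, the sketch below uses pandas on a hypothetical sales table; the column names, example values, and rules are assumptions made for the example, not part of the assignment.

```python
import pandas as pd

# Hypothetical sales records used only to illustrate basic quality checks.
sales = pd.DataFrame({
    "order_id": [101, 102, 102, 104],
    "price": [9.99, None, 19.50, -5.00],            # one missing and one negative price
    "order_date": ["2024-11-01", "2024-11-03", "2024-11-03", "2023-01-15"],
})
sales["order_date"] = pd.to_datetime(sales["order_date"])

# Completeness: count missing values per column.
print(sales.isna().sum())

# Accuracy (a simple rule-based check): prices should be non-negative.
print("suspect prices:", (sales["price"] < 0).sum())

# Consistency: duplicated order ids suggest duplicate or conflicting records.
print("duplicate order_id rows:", sales.duplicated(subset="order_id").sum())

# Timeliness: how old is the most recent record?
print("days since last order:", (pd.Timestamp.today() - sales["order_date"].max()).days)
```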
## 6. Explain the various methods for handling missing data

Data can contain many irrelevant and missing parts. To handle this, data cleaning is performed, which involves handling missing data, noisy data, and so on.

### Missing Data

This situation arises when some values in the data are missing. It can be handled in various ways, as illustrated in the sketch after this list:

- **Ignore the Tuple:** Discard the entire record (tuple) if it contains any missing values. This approach is generally ineffective, particularly if the tuple has other valuable attributes or if the missing values are unevenly distributed across attributes.
- **Fill in the Missing Value Manually:** A labor-intensive method where missing values are filled in by hand. It can be impractical for large datasets with numerous missing values.
- **Use a Global Constant:** Replace all missing values with a constant value, such as "Unknown" or a specific number (e.g., -∞). While simple, this approach can lead to misleading interpretations since the constant might be treated as a valid, meaningful value by the mining algorithm.
- **Use a Measure of Central Tendency:** Replace missing values with the mean or median of the attribute. The mean is used for symmetric distributions, while the median is better for skewed distributions.
- **Use the Mean or Median of the Same Class:** Similar to the previous approach, but computed within a specific class. For example, in a classification problem, missing values might be filled in with the average value of all samples in the same class.
- **Use the Most Probable Value:** Estimate the missing value using more sophisticated methods such as regression, Bayesian inference, or decision tree induction, which predict the value from the other attributes in the dataset.
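A minimal pandas sketch of several of these strategies, using a made-up column (`income`) and class label (`segment`) purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income":  [52000, None, 61000, None, 58000],
})

# Ignore the tuple: drop rows that contain any missing value.
dropped = df.dropna()

# Use a global constant.
constant_filled = df.fillna({"income": -1})

# Use a measure of central tendency (mean of the attribute).
mean_filled = df.assign(income=df["income"].fillna(df["income"].mean()))

# Use the mean of the same class: fill within each segment separately.
class_filled = df.assign(
    income=df.groupby("segment")["income"].transform(lambda s: s.fillna(s.mean()))
)

print(class_filled)
```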
## 7. Mention the techniques used to detect noisy data

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, and so on. It can be handled in the following ways:

- **Binning Method:** This method works on sorted data in order to smooth it. The data is divided into segments (bins) of equal size, and each segment is handled separately. All values in a segment can be replaced by the segment's mean, or the bin boundary values can be used instead.
- **Regression:** Data can be smoothed by fitting it to a regression function. The regression may be linear (one independent variable) or multiple (several independent variables).
- **Clustering:** This approach groups similar data into clusters. Outliers either go undetected or fall outside the clusters.

## 8. Identify the strategies for data transformation. Explain why normalization of data is necessary. Apply the normalization techniques for the following data. X = {10, 23, -1, 14, -67, 89, 35, 78}

Data transformation is the process where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations. The common techniques used for data transformation include:

- **Smoothing:** Removes noise from the data. Methods such as binning, regression, and clustering are often used for smoothing.
- **Attribute Construction (or Feature Construction):** New attributes are constructed and added from the given set of attributes to help the mining process. This is often done to extract more relevant features for analysis.
- **Aggregation:** Summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated to compute monthly or annual totals. This step is commonly used in constructing a data cube for analysis at multiple abstraction levels.
- **Normalization:** The attribute data are scaled to fall within a smaller range, such as [-1.0, 1.0] or [0.0, 1.0], to make the data more uniform and comparable.
- **Discretization:** The raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or conceptual labels (e.g., youth, adult, senior). The labels can then be recursively organized into higher-level concepts, resulting in a concept hierarchy for the numeric attribute.
- **Concept Hierarchy Generation for Nominal Data:** Attributes such as street can be generalized to higher-level concepts, such as city or country. Many hierarchies for nominal attributes are implicit in the database schema and can be automatically defined at the schema definition level.

**Why Normalization is Necessary**

- **Data Scaling:** It brings all data to a similar scale, making the data easier to analyze and compare.
- **Improved Performance:** When data is not normalized, some machine learning algorithms may be biased toward features on a larger scale. Normalization helps to improve algorithm performance.
- **Standard Model Input:** Many models require normalized inputs to provide more reliable results and better performance.

**Applying Min-Max Normalization**

Given data: X = {10, 23, -1, 14, -67, 89, 35, 78}

- **Step 1:** Find the minimum and maximum values: $X_{min} = -67$, $X_{max} = 89$.
- **Step 2:** Apply the Min-Max normalization formula, with new min = 0 and new max = 1:
  $X' = \frac{X - X_{min}}{X_{max} - X_{min}} \times (\text{new max} - \text{new min}) + \text{new min}$
- **Step 3:** Apply the formula to each data point.
  - For X = 10: X' = (10 - (-67)) / (89 - (-67)) ≈ 0.4936
  - For X = 23: X' = (23 - (-67)) / (89 - (-67)) ≈ 0.5769
  - For X = -1: X' = (-1 - (-67)) / (89 - (-67)) ≈ 0.4231
  - For X = 14: X' = (14 - (-67)) / (89 - (-67)) ≈ 0.5192
  - For X = -67: X' = (-67 - (-67)) / (89 - (-67)) = 0
  - For X = 89: X' = (89 - (-67)) / (89 - (-67)) = 1
  - For X = 35: X' = (35 - (-67)) / (89 - (-67)) ≈ 0.6538
  - For X = 78: X' = (78 - (-67)) / (89 - (-67)) ≈ 0.9295
- **Step 4:** The normalized data (cross-checked in the sketch after these steps) is:
  X' = {0.4936, 0.5769, 0.4231, 0.5192, 0, 1, 0.6538, 0.9295}
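The same min-max calculation can be reproduced in a few lines of NumPy; this is just a sketch to verify the hand-computed values above.

```python
import numpy as np

x = np.array([10, 23, -1, 14, -67, 89, 35, 78], dtype=float)

# Min-max normalization to the range [0, 1].
new_min, new_max = 0.0, 1.0
x_scaled = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

print(np.round(x_scaled, 4))
# Expected: [0.4936 0.5769 0.4231 0.5192 0.     1.     0.6538 0.9295]
```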
## 9. Explain the ways in which redundancy between the attributes can be derived. Compute correlation, covariance between the given attributes and discuss the results

Redundancy between attributes occurs when multiple attributes provide the same or similar information, which can lead to inefficiency in analysis. It can be identified through various methods. For example, correlation analysis checks whether two attributes are highly related, indicating redundancy. Principal Component Analysis (PCA) reduces redundant attributes by transforming them into fewer, uncorrelated components. Feature selection techniques, such as information gain or Chi-Square, help identify which attributes contribute the most to predictions, allowing redundant ones to be removed.

### Covariance and Correlation Analysis

| X | Y | Z |
|---|---|---|
| 2 | 4 | 9 |
| 7 | 8 | 2 |
| 5 | 6 | 10 |
| 3 | 3 | 15 |
| 8 | 10 | 16 |
| 1 | 2 | 3 |

### Step 1: Compute the Mean of Each Attribute

- µx = (2 + 7 + 5 + 3 + 8 + 1) / 6 ≈ 4.33
- µy = (4 + 8 + 6 + 3 + 10 + 2) / 6 = 5.5
- µz = (9 + 2 + 10 + 15 + 16 + 3) / 6 ≈ 9.17

### Step 2: Compute Covariance

Covariance measures the extent to which two variables change together. The (sample) covariance between two variables X and Y is:

$Cov(X, Y) = \frac{1}{n-1} \sum_{i=1}^n (X_i - µ_x)(Y_i - µ_y)$

- **Covariance between x and y**
  - $Cov(x, y) = \frac{1}{5} [(2-4.33)(4-5.5) + (7-4.33)(8-5.5) + (5-4.33)(6-5.5) + (3-4.33)(3-5.5) + (8-4.33)(10-5.5) + (1-4.33)(2-5.5)] = 8.4$
- **Covariance between x and z**
  - $Cov(x, z) = \frac{1}{5} [(2-4.33)(9-9.17) + (7-4.33)(2-9.17) + (5-4.33)(10-9.17) + (3-4.33)(15-9.17) + (8-4.33)(16-9.17) + (1-4.33)(3-9.17)] ≈ 3.93$
- **Covariance between y and z**
  - $Cov(y, z) = \frac{1}{5} [(4-5.5)(9-9.17) + (8-5.5)(2-9.17) + (6-5.5)(10-9.17) + (3-5.5)(15-9.17) + (10-5.5)(16-9.17) + (2-5.5)(3-9.17)] ≈ 4.1$

### Step 3: Compute Correlation

Correlation normalizes covariance by the standard deviations of the variables, providing a standardized measure of the strength and direction of the linear relationship between two variables. The formula for correlation is:

$ρ(X, Y) = \frac{Cov(X, Y)}{σ_x σ_y}$

where σx and σy are the standard deviations of X and Y, respectively.

**Compute Standard Deviations**

The formula for the (sample) standard deviation is:

$σ = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - µ)^2}$

- **Standard Deviation of x**
  - σx = √((1/5) [(2-4.33)² + (7-4.33)² + (5-4.33)² + (3-4.33)² + (8-4.33)² + (1-4.33)²]) ≈ 2.80
- **Standard Deviation of y**
  - σy = √((1/5) [(4-5.5)² + (8-5.5)² + (6-5.5)² + (3-5.5)² + (10-5.5)² + (2-5.5)²]) ≈ 3.08
- **Standard Deviation of z**
  - σz = √((1/5) [(9-9.17)² + (2-9.17)² + (10-9.17)² + (15-9.17)² + (16-9.17)² + (3-9.17)²]) ≈ 5.85

### Step 4: Calculate Correlation

- **Correlation between x and y**
  - ρ(x, y) = 8.4 / (2.80 × 3.08) ≈ 0.97
- **Correlation between x and z**
  - ρ(x, z) = 3.93 / (2.80 × 5.85) ≈ 0.24
- **Correlation between y and z**
  - ρ(y, z) = 4.1 / (3.08 × 5.85) ≈ 0.23

#### Interpretation of Covariance and Correlation

- **Covariance:**
  - Cov(x, y) = 8.4: A positive covariance indicates that x and y tend to increase together.
  - Cov(x, z) ≈ 3.93: A positive covariance, but smaller in magnitude.
  - Cov(y, z) ≈ 4.1: A positive covariance, but again smaller in magnitude.
- **Correlation:**
  - ρ(x, y) ≈ 0.97: A very strong positive linear relationship between x and y.
  - ρ(x, z) ≈ 0.24: A weak positive linear relationship between x and z.
  - ρ(y, z) ≈ 0.23: A weak positive linear relationship between y and z.

There is a strong correlation between x and y, suggesting a highly predictable relationship. The relationships between x and z, and between y and z, are weak, indicating that these pairs do not show a strong linear trend.
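These hand calculations can be cross-checked with NumPy, whose `np.cov` and `np.corrcoef` use the same sample (n−1) convention; a quick sketch:

```python
import numpy as np

x = np.array([2, 7, 5, 3, 8, 1], dtype=float)
y = np.array([4, 8, 6, 3, 10, 2], dtype=float)
z = np.array([9, 2, 10, 15, 16, 3], dtype=float)

data = np.vstack([x, y, z])

# Sample covariance matrix (n-1 denominator), rows/columns ordered x, y, z.
print(np.round(np.cov(data), 2))       # Cov(x, y) ≈ 8.4, Cov(x, z) ≈ 3.93, Cov(y, z) ≈ 4.1

# Correlation matrix.
print(np.round(np.corrcoef(data), 2))  # ρ(x, y) ≈ 0.97, ρ(x, z) ≈ 0.24, ρ(y, z) ≈ 0.23
```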
## 10. Discuss the issues in data integration

Data integration combines data from various sources into one unified dataset, reducing redundancies and inconsistencies to improve data mining accuracy and efficiency. For example, when merging customer data from different systems, "customer id" in one database may need to be matched with "cust number" in another. This ensures that the same customer is not misidentified or duplicated.

- **Schema integration** can be challenging when attributes have different names or formats. For instance, one database might record discounts at the order level, while another records them at the item level, potentially leading to errors if not corrected before integration.
- **Redundancies** can be detected using correlation analysis, which measures how one attribute relates to another. For example, if annual revenue can be derived from sales and discounts, the redundancy can be identified.
- **Tuple duplication** also needs to be addressed, such as when a purchase order database contains multiple entries for the same customer with different addresses due to data entry errors.
- **Data value conflicts** arise when the same attribute is represented differently across systems. For example, weight might be recorded in metric units in one system and in British imperial units in another, or room prices might differ due to varying currencies and included services.

Resolving these discrepancies ensures consistent and accurate data integration.
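As a small illustration of the schema-matching issue described above, the following pandas sketch merges two hypothetical customer tables whose key columns are named differently; all table contents and column names here are invented for the example.

```python
import pandas as pd

# System A uses "customer id"; system B uses "cust number" for the same entity.
system_a = pd.DataFrame({"customer id": [1, 2, 3], "city": ["Pune", "Delhi", "Chennai"]})
system_b = pd.DataFrame({"cust number": [2, 3, 4], "weight_kg": [61.0, 75.5, 68.2]})

# Align the schemas by renaming, then integrate with an outer join
# so customers present in only one system are not silently dropped.
unified = system_a.rename(columns={"customer id": "customer_id"}).merge(
    system_b.rename(columns={"cust number": "customer_id"}),
    on="customer_id",
    how="outer",
)

# Drop exact duplicate tuples that may appear after integration.
unified = unified.drop_duplicates()

print(unified)
```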
## 11. Discuss the different types of data analytics

Data analytics involves examining data to uncover useful trends and patterns that can guide decision-making. With the vast amount of data generated today and the powerful computing tools available, businesses can use this information to make informed decisions based on past successes. By analyzing data, companies can gain insights that help improve their operations and predict future outcomes. For example, in manufacturing, data on machine performance and work queues can help optimize production processes, ensuring machines run efficiently. Similarly, gaming companies use data to create engaging reward systems, and content providers analyze user interactions to improve content presentation and engagement. In essence, data analytics helps organizations in various sectors make better decisions and enhance their performance.

### Types of Data Analytics

There are four major types of data analytics:

- **Predictive (forecasting)**
- **Descriptive (business intelligence and data mining)**
- **Prescriptive (optimization and simulation)**
- **Diagnostic analytics**

#### Predictive Analytics

Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring. Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events. Techniques used for predictive analytics include:

- Linear Regression
- Time Series Analysis and Forecasting
- Data Mining

#### Descriptive Analytics

Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It examines past performance by mining historical data to understand the causes of success or failure. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis. The descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model, which focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products. Common examples of descriptive analytics are company reports that provide historic reviews, such as:

- Data Queries
- Reports
- Descriptive Statistics
- Data Dashboards

#### Prescriptive Analytics

Prescriptive analytics automatically synthesizes big data, mathematical science, business rules, and machine learning to make a prediction and then suggests a decision option to take advantage of that prediction. It goes beyond predicting future outcomes by also suggesting actions that benefit from the predictions and showing the decision-maker the implications of each decision option. Prescriptive analytics not only anticipates what will happen and when, but also why it will happen. Further, it can suggest decision options on how to take advantage of a future opportunity or mitigate a future risk, and illustrate the implications of each option. For example, prescriptive analytics can benefit healthcare strategic planning by leveraging operational and usage data combined with data on external factors such as economic data and population demography.

#### Diagnostic Analytics

In this analysis, historical data is generally used to answer a question or diagnose a problem by looking for dependencies and patterns in the historical data of that particular problem. Companies favor this analysis because it gives great insight into a problem, provided they keep detailed information at their disposal; otherwise, data would have to be collected separately for every problem, which would be very time-consuming. Common techniques used for diagnostic analytics are:

- Data discovery
- Data mining
- Correlations

## 12. Explain and differentiate between various tools used in data analytics.

Data analytics is an important aspect of many organizations nowadays. Real-time data analytics is essential for the success of a major organization and helps drive decision-making. This answer covers the various data analytics tools that are in use and how they differ. There are myriad data analytics tools that help us extract important information from the given data, and some of the free and open-source tools can be used even without any coding knowledge. They help derive useful insights from data without too much effort; for example, they could be used to determine the better among some cricket players based on various statistics and yardsticks. They have strengthened the decision-making process by providing useful information that helps reach better conclusions.

Some tools are programming-based and others are non-programming-based. Some of the most popular tools are:

- SAS
- Microsoft Excel
- R
- Python
- Tableau
- RapidMiner
- KNIME

### **SAS**

SAS is a software suite and programming language developed by the SAS Institute for performing advanced analytics, multivariate analyses, business intelligence, data management, and predictive analytics. It is proprietary software written in C, and its software suite contains more than 200 components. Its programming language is considered high level, making it easier to learn. However, SAS was developed for very specific uses, and powerful tools are not added every day to the already extensive collection, making it less scalable for certain applications. It can, however, analyze data from various sources and write the results directly into an Excel spreadsheet. It is used by many companies such as Google, Facebook, Twitter, Netflix, and Accenture. In 2011, SAS brought to market a large set of products for customer intelligence, along with various SAS modules for web, social media, and marketing analytics, used largely for profiling customers and gaining insights about prospective customers. Even though it faces competition from upcoming languages such as R and Python, SAS continues to develop in order to prove that it is still a major stakeholder in the data analytics market.
### **Microsoft Excel**

Microsoft Excel is an important spreadsheet application that can be used for recording expenses, charting data, performing simple manipulation and lookup, and generating pivot tables to provide summarized reports of large datasets that contain significant findings. It is written in C, C++, and the .NET Framework, and its stable version was released in 2016. It includes a macro programming language, Visual Basic for Applications, for building applications, and it has various built-in functions to satisfy statistical, financial, and engineering needs. It is the industry standard for spreadsheet applications. Companies also use it to perform real-time manipulation of data collected from external sources, such as stock market feeds, updating the data in real time to maintain a consistent view. It can handle reasonably complex analyses of data, though not to the extent of tools such as R or Python, and it is a common tool among financial analysts and sales managers for solving business problems.

### **R**

R is one of the leading programming languages for performing complex statistical computations and graphics. It is a free and open-source language that can be run on various UNIX platforms, Windows, and macOS. It has a command-line interface that is easy to use, although the language can be tough to learn, especially for people without prior programming knowledge. It is very useful for building statistical software and for performing complex analyses. It has more than 11,000 packages, which can be browsed category-wise. These packages can also be used with Big Data, the catalyst that has transformed how various organizations view unstructured data. R also provides the tools required to install packages as per user requirements, which makes setup convenient.

### **Python**

Python is a powerful high-level programming language used for general-purpose programming. It supports both structured and functional programming styles. Its extensive collection of libraries makes it very useful for data analysis; knowledge of TensorFlow, Theano, Keras, Matplotlib, and Scikit-learn can bring you a lot closer to becoming a machine learning engineer. Everything in Python is an object, an attribute that makes it highly popular among developers. It is easier to learn than R and can be connected to platforms such as SQL Server or MongoDB, and to JSON (JavaScript Object Notation) data. It is very useful for big data analysis, can be used to extract data from the web, and handles text data very well. Some of the companies that use Python for data analytics include Instagram, Facebook, Spotify, and Amazon.

### **Tableau Public**

Tableau Public is free software developed by the public company Tableau Software that allows users to connect to any spreadsheet or file and create interactive data visualizations. It can also be used to create maps and dashboards with real-time updates for easy presentation on the web. The results can be shared through social media sites or directly with the client, making it very convenient to use, and the resulting files can be downloaded in different formats. This software can connect to any type of data source, be it a data warehouse, an Excel file, or web-based data.
Approximately 446 companies use this software for operational purposes; some of the companies currently using it include SoFi, The Sentinel, and Visa.

### **RapidMiner**

RapidMiner is an extremely versatile data science platform developed by RapidMiner Inc. The software emphasizes fast data science capabilities and provides an integrated environment for data preparation and for applying machine learning, deep learning, text mining, and predictive analytics techniques. It can work with many data source types, including Access, SQL, Excel, Teradata, Sybase, Oracle, MySQL, and dBase, and it lets users control the datasets and formats used for predictive analysis. Approximately 774 companies use RapidMiner, most of them US-based; esteemed companies on that list include the Boston Consulting Group and Domino's Pizza Inc.

### **KNIME**

KNIME, the Konstanz Information Miner, is a free and open-source data analytics platform that is also used for reporting and integration. It integrates various components for machine learning and data mining through modular data pipelining. It is written in Java, developed by KNIME.com AG, and runs on operating systems such as Linux, OS X, and Windows. More than 500 companies currently use this software for operational purposes, including Aptus Data Labs and Continental AG.

## 13. Discuss the different types of databases.

### Relational Databases

Relational databases store data in structured tables with rows and columns, where relationships between data are defined using foreign keys. They use SQL (Structured Query Language) for querying and managing data.

1. **MySQL**
   - Open-source relational database.
   - Widely used in web applications.
   - Supports ACID transactions and has a large community.
2. **PostgreSQL**
   - Advanced open-source relational database.
   - Known for extensibility and standards compliance.
   - Supports complex queries, foreign keys, triggers, and full transactional integrity.
3. **Oracle Database**
   - Enterprise-level relational database.
   - Known for high performance, scalability, and reliability.
   - Extensive features for transaction management, analytics, and data warehousing.
4. **SQL Server (Microsoft)**
   - Microsoft's relational database management system.
   - Integrates well with other Microsoft products.
   - Includes tools for data warehousing, business intelligence, and analytics.
5. **SQLite**
   - Lightweight, file-based relational database.
   - Often used in mobile apps and small applications.
   - Doesn't require a server to operate, making it easy to use and deploy.

### NoSQL Databases

NoSQL databases are designed for unstructured, semi-structured, or structured data that doesn't fit well into a traditional relational database. They are often used in big data and real-time web applications. Unlike traditional relational databases, which store data in tables with predefined schemas, NoSQL databases use flexible data models that can easily adapt to changes. They are particularly well-suited for applications that need to scale horizontally to manage growing amounts of data, such as big data and real-time web applications.

There are four main categories of NoSQL databases: document databases, key-value stores, column-family stores, and graph databases. Document databases store data in formats like JSON or XML, making them flexible for handling varying data structures.
Key-value stores focus on simplicity and speed, storing data as key-value pairs. Column-family stores organize data into columns, allowing for efficient querying of large datasets. Graph databases excel at managing complex relationships between data by representing it as nodes and edges.

NoSQL databases are favored for their scalability, flexibility, and high performance, particularly in scenarios where data is frequently changing and rapidly growing. However, they may not be the best choice for all applications. NoSQL databases generally lack full ACID compliance, which can lead to issues with data consistency. Additionally, they are more complex to manage than traditional relational databases and may not support complex queries as effectively. A short sketch contrasting a relational row with a document-style record follows the examples below.

1. **Cassandra**
   - Distributed NoSQL database.
   - Designed for high availability and scalability.
   - Uses a wide-column store model.
2. **MongoDB**
   - Document-oriented NoSQL database.
   - Stores data in flexible, JSON-like documents.
   - Great for applications needing fast, iterative development.
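To make the relational-versus-document contrast concrete, here is a small, self-contained Python sketch using the standard-library sqlite3 module and a plain JSON-like dict; the customer fields are invented for illustration and do not come from the assignment.

```python
import json
import sqlite3

# Relational model: a fixed schema with rows and columns (SQLite, file-based/embedded).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Asha", "Pune"))
print(conn.execute("SELECT name, city FROM customers WHERE id = 1").fetchone())

# Document model: a flexible, JSON-like record, as stored by document databases
# such as MongoDB. Fields and nesting can vary from one document to the next.
customer_doc = {
    "id": 1,
    "name": "Asha",
    "city": "Pune",
    "orders": [{"order_id": 101, "total": 499.0}],  # nested data, no extra table needed
}
print(json.dumps(customer_doc, indent=2))
conn.close()
```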
