Session 4: Big Data - Computer Science - 3rd Year
Summary
This document contains questions on Big Data for third-year computer science students. The questions cover the data analytics lifecycle, including data preparation, data cleaning, data planning, model building, communicating results, and operationalizing the model.
Full Transcript
Computer Science, Third Year
Big Data

Q1. What are the main stages of the Data Analytics Lifecycle?

Q2. What is the main purpose of Phase 2: Data Preparation?
Prepare Analytics Sandbox:
- Create a workspace for the analytics team.
- Ensure the workspace has 10x+ capacity compared to the Enterprise Data Warehouse (EDW).
Perform ELT (Extract-Load-Transform):
- Identify necessary data transformations.
- Execute Big ELT (Extract-Load-Transform) and Big ETL (Extract-Transform-Load).
- Tools for data transformation and cleansing: SQL, Hadoop, MapReduce, Alpine Miner.
Familiarize yourself with the data thoroughly.
Data Conditioning.
Survey & visualize:
- Tools for visualization: R (base package, ggplot and lattice), GnuPlot, GGobi/rggobi, Spotfire, Tableau.

Q3. What is Data Cleaning?
Data Cleaning, also known as Data Cleansing, is the process of preparing raw data for analysis by addressing errors, inconsistencies, and missing values.
1. Identifying and handling incomplete data.
2. Removing noise, duplicates, and errors.
3. Validating data integrity and consistency.
4. Filling in missing values with:
▪ A default value.
▪ The average/mode.
▪ A random value.

Q4. What is the main purpose of Phase 3: Data Planning?
Determine techniques, workflow, and methods.
Select methods based on hypotheses, data structure, and volume.
Useful tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC.
- Data Exploration.
- Variable Selection.
- Model Selection.
Convert to SQL or another database language for best performance.
Choose the technique based on the end goal.

Q5. What is the main purpose of Phase 4: Model Building?
Develop Data Sets: Prepare datasets for testing, training, and production.
Validate and Experiment: Use smaller test sets to validate approaches.
Optimize Environment: Use fast hardware and parallel processing to streamline model building and workflows.
Tools for Model Building: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner.

Q6. What is the main purpose of Phase 5: Communicate Results?
Did we succeed? Did we fail?
Interpret the results.
Compare to the Initial Hypotheses (IHs) from Phase 1.
Identify key findings.
Quantify business value.
Summarize findings, depending on the audience.

Q7. What is the main purpose of Phase 6: Operationalize?
Run a pilot.
Assess the benefits.
Provide final deliverables.
Implement the model in the production environment.
Define the process to update, retrain, and retire the model, as needed.
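
As an illustration of the missing-value options listed under Q3 (default, average/mode, random), here is a minimal Python sketch. It is not part of the original notes: the function name `fill_missing`, its parameters, and the sample `ages` list are hypothetical, and missing entries are assumed to be represented as `None`. The "random" option is interpreted here as drawing a random value from the observed (non-missing) entries, which is only one possible reading.

```python
import random
import statistics


def fill_missing(values, strategy="mean", default=0, seed=None):
    """Return a copy of `values` with None entries replaced.

    strategy: "default" (use `default`), "mean", "mode", or "random"
    (pick a random observed value). All names here are illustrative.
    """
    observed = [v for v in values if v is not None]
    if strategy != "default" and not observed:
        raise ValueError("no observed values to impute from")

    if strategy == "default":
        pick = lambda: default
    elif strategy == "mean":
        mean_value = statistics.mean(observed)
        pick = lambda: mean_value
    elif strategy == "mode":
        mode_value = statistics.mode(observed)
        pick = lambda: mode_value
    elif strategy == "random":
        rng = random.Random(seed)  # seed makes the draw repeatable
        pick = lambda: rng.choice(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    # Keep observed values untouched; replace only the missing ones.
    return [pick() if v is None else v for v in values]


# Hypothetical sample column with two missing entries.
ages = [20, None, 30, 30, None, 44]

print(fill_missing(ages, "default", default=0))  # [20, 0, 30, 30, 0, 44]
print(fill_missing(ages, "mean"))                # [20, 31, 30, 30, 31, 44]
print(fill_missing(ages, "mode"))                # [20, 30, 30, 30, 30, 44]
print(fill_missing(ages, "random", seed=1))      # gaps filled from {20, 30, 44}
```

The mean is sensitive to outliers while the mode suits categorical or repeated values, so the choice of strategy depends on the column being cleaned.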