Questions and Answers
Which statement correctly describes the role of big data analytics in business?
- It solely automates data preprocessing tasks.
- It mainly concerns itself with data visualization techniques.
- It aids in making informed decisions to improve financial and operational outcomes. (correct)
- It primarily focuses on data storage solutions.
In the context of big data, what distinguishes semi-structured data from structured data?
- Semi-structured data lacks any identifiable markers for semantic elements.
- Semi-structured data is exclusively composed of images, audio, and video files.
- Semi-structured data contains tags or markers to delineate semantic elements but doesn't conform to rigid database structures. (correct)
- Semi-structured data strictly adheres to predefined relational database schemas.
Which of the following is a characteristic of unstructured data?
- It always maintains a standard hierarchy, facilitating easy data retrieval.
- It is characterized by a lack of a predefined format, varying in content and structure. (correct)
- It neatly fits into predefined rows and columns for easy analysis.
- It is easily managed and accessed by humans and computers.
Why is 'veracity' considered a crucial 'V' in the context of big data?
What is the primary goal of the 'value' characteristic within the concept of the 5Vs of big data?
How do transportation sectors utilize live tracking reports derived from big data analytics?
In what way has big data analytics played a crucial role during crisis situations, such as the earthquake in Nepal in April 2015?
In the context of data analytics, why is the life cycle often depicted as an iterative process?
Which of the following activities is part of the initial data discovery phase in the data analytics life cycle?
Why is data preparation considered one of the most labor-intensive phases in the data analytics life cycle?
What is the purpose of an analytical sandbox in the data preparation phase?
The ETL process involves moving financial data from an organization's main database to a computational sandbox. What is one reason for doing this?
What does the 'transformation' step in ETL primarily ensure regarding data?
Why might data scientists prefer to load raw data into a data warehouse rather than immediately transforming it?
What is the primary function of Application Programming Interfaces (APIs) in the context of data preparation?
How does Hadoop assist data scientists in data preparation?
During the model planning phase of data analytics, what should a team consider regarding analytical techniques?
What role do the types of input and output variables play in the model selection substep?
What is one primary use of SAS modules in the context of big data analytics?
In model building, what best describes the role of the testing data?
Which of the following questions is most relevant to consider during the model building phase?
What should the data analytics team do if the model fails to solve the research problem?
Which of the following is an advantage of using Apache Spark?
In the context of the data analytics life cycle, what is addressed during the 'Communicate Results' phase?
Where is a great deal of time spent in the data analytics life cycle?
What is the role of text transformation in text mining?
Which phrase is another name for the Feature Selection step of text mining?
What is one difference between Data Mining and Text Mining?
What is the purpose of Information Retrieval?
What are the R packages for text mining application frameworks and for network analysis?
What are the components of Natural Language Processing (NLP)?
What is the lexicon in lexical analysis?
In the context of data science, what is the primary benefit of using Python?
How are the variable names improved?
What does it mean for lists to be mutable in Python?
How can a tuple be created?
What makes a list unique in Python compared to a dictionary?
What does the use of NumPy allow?
What happens as the dimension increases?
In the text's dataset, what did supervised learning further include?
Scikit-learn is used to implement many models. What does its implementation not include?
Flashcards
What is Big Data?
An umbrella term for the techniques and technologies required to collect, aggregate, process, and gain insights from massive datasets.
What is Volume in Big Data?
Quantity of data
What is Velocity in Big Data?
Speed at which the data is generated
What is Variety in Big Data?
What is Veracity in Big Data?
What is Value in Big Data?
What is Fraud Detection?
What is Live Tracking?
What is Sales Forecasting?
What is Live Data Handling?
What are Alert Generations?
What are Google Analytics Reports?
What is Data Analytics Life-Cycle?
What is Data Discovery?
What is Data Preparation?
What is Analytical Sandbox?
What is Conservation?
What is ETL?
What is ELT?
What is Hadoop?
What is Alpine Miner?
What is OpenRefine?
What is Model Planning?
What is Model Building?
What is SAS?
What is Apache Spark?
What is BigML?
What is MATLAB?
What is Jupyter?
What is Scikit?
What is TensorFlow?
What is Weka?
What is Communicate Results?
What is Operationalization?
What is Text mining?
What is Text Transformation?
What is Data Preprocessing?
What is Feature Selection?
What is Information Retrieval?
What is Data Mining?
What is Natural Language Processing (NLP)?
Study Notes
- Big data refers to the nontraditional techniques and technologies used to collect, aggregate, process, and gain insights from massive datasets
- The big data process involves data acquisition, preprocessing, mining, prediction, and visualization
- Skilled analysts are needed to leverage big data effectively; basic big data analytics activities include data preparation, model planning, and model building
Analytics for Data Science
- Big data's components consist of structured, semi-structured, and unstructured data
- Structured data has a well-defined structure in columns and rows for easy management and accessibility
- Semi-structured data does not adhere to formal structures but uses tags to distinguish semantic elements
- Unstructured data lacks a defined structure or standard hierarchy
- Big data has five characteristics known as the 5 V's: volume, velocity, variety, veracity, and value
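The three data types above can be illustrated with a minimal Python sketch. The sample values and variable names are made up for illustration: a CSV string stands in for structured data, a JSON string for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: fixed columns, and every row has the same fields.
structured_csv = "id,name,amount\n1,Alice,120.50\n2,Bob,75.00\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: keys (tags) label each element, but records may differ in shape.
semi_structured_json = (
    '[{"id": 1, "name": "Alice", "tags": ["vip"]},'
    ' {"id": 2, "name": "Bob", "address": {"city": "Pune"}}]'
)
records = json.loads(semi_structured_json)

# Unstructured: free text with no predefined fields.
unstructured_text = "Alice called on Monday to ask about a late delivery."

column_names = list(rows[0].keys())                  # same for every row
record_keys = [sorted(r.keys()) for r in records]    # differs per record
word_count = len(unstructured_text.split())          # only crude measures apply

print(column_names)
print(record_keys)
print(word_count)
```

The structured rows can be queried by column name right away, the JSON records have to be inspected key by key, and the free text needs further processing (for example text mining) before it can be analyzed.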
The 5 V's of Big Data
- Volume refers to the huge amount of information; the size of the data indicates whether it should be classified as "big"
- Internet traffic in 2016 measured 6.2 exabytes/month; expected to reach 40,000 exabytes by 2020
- Velocity refers to the speed of data generation, from machines, networks, social media, and mobile phones
- More than 3.5 billion searches occur on Google daily, and Facebook users are rising by about 22% yearly
- Variety refers to the nature of the data, which can be structured, semi-structured, or unstructured and can come from inside or outside the enterprise
- Veracity refers to data's vulnerability to inconsistency and uncertainty: data are collected from many sources in huge amounts, which makes monitoring data quality challenging
- Value refers to the knowledge extracted by processing the data; the insights derived are what make the data important
Examples of Data Analytics
- Fraud detection reports identify fraudulent transactions and unauthorized account access
- Live tracking reports are used to track cars, customer requests, and payment processing, and to identify needs and revenue
- Sales forecast and plan analysis are made to assess customers' sales, profits, needs, and evaluate future targets
- Live data are handled to provide stock market data and other real-time reports
- Alerts are generated in response to events, such as data center alerts; these notifications are another example of big data analytics
- Google Analytics reports provide user visit counts, user locations, and client computer specifications
Data analytics life cycle
- The data analytics life cycle is designed for big data problems and data science projects
- Consists of six basic phases
- The process is iterative
- Data discovery involves analyzing the problem, finding data sources, and forming initial hypotheses
- Data preparation includes cleaning, sampling, combining, and aggregating data, often requiring extensive time and labor
- Model planning
- Model building
- Results communication
- Operationalization
- An analytical sandbox is created for data preparation, aggregating relevant data
- The team works with a copy of the financial data in the analytical sandbox rather than with the production version of the organization's main database, because the production database is closely managed and required for financial reporting
- The ETL (Extract, Transform, Load) process is a type of data integration
- The analytical sandbox should be at least 5–10 times the size of the original datasets
- ETL involves extraction, transformation, and loading
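The data preparation activities listed above (cleaning, combining, aggregating, and sampling) can be sketched in a few lines of plain Python. This is a minimal illustration rather than a prescribed workflow; the `orders` and `customers` records are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical raw records; None marks a missing value.
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": None},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 3, "amount": 55.5},
]
customers = {1: "North", 2: "South", 3: "North"}  # customer_id -> region

# Cleaning: drop records with a missing amount.
cleaned = [o for o in orders if o["amount"] is not None]

# Combining: attach each customer's region to the order.
combined = [{**o, "region": customers[o["customer_id"]]} for o in cleaned]

# Aggregating: total amount per region.
totals = defaultdict(float)
for row in combined:
    totals[row["region"]] += row["amount"]

# Sampling: a reproducible random subset for quick exploration.
rng = random.Random(42)
sample = rng.sample(combined, k=2)

print(dict(totals))
print(len(sample))
```

Even in this toy case most of the code deals with preparing the data rather than analyzing it, which is consistent with the note that this phase is time-consuming.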
ETL
- Extraction requires awareness of ODBC/JDBC drivers, an understanding of the data structures, and careful handling of resources; CDC (change data capture) extracts only the data that has changed over time
- Transformation cleans and converts the data so it is ready for further use and data integration
- Loading writes the data into the target structure for further processing
- Data scientists often want the raw data loaded into the warehouse before it is transformed (ELT), an approach sometimes described as big ETL
- Application Programming Interfaces (APIs) are popular for accessing data
- Hadoop allows data scientists to explore data complexities and store data without needing to grasp all specifics
- Alpine Miner develops analytical workflows with a GUI for data manipulation and analytics
- OpenRefine is a standalone open-source tool for cleaning, transforming, and wrangling data
- Trifacta Wrangler enables analysts to get data from various sources to prepare for analytical or visualization tools
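The extract, transform, and load steps described in this section can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the source is a hypothetical in-memory CSV string (in practice it might be a production database reached through an ODBC/JDBC driver), the target is an in-memory SQLite database, and the function names and the `sales` table are invented for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with messy names and one missing amount.
SOURCE_CSV = "id,name,amount\n1, alice ,100.5\n2,BOB,\n3,Carol,42\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and drop rows with a missing amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

An ELT variant would call `load` on the raw rows first and run the cleaning inside the target database afterwards, which keeps the raw data available to analysts.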
Model Planning
- The team selects an analytical model and appropriate variables
- Key consideration is assessing the structure of datasets and ensuring the analytical technique meets objectives
- Data exploration aims to understand the relationships among variables; variable reduction then addresses variables that are strongly correlated with one another
- Model selection chooses an analytical technique based on project goals and revisits analytic challenges
- R manipulates data easily, runs on multiple platforms, has many packages
- Tableau Public connects to any data source, generates visualizations, and allows file sharing
- SAS accesses/handles data, has customer intelligence products/marketing analytics, can predict, monitor, and refine behaviors
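The data exploration and variable reduction steps above can be sketched as a simple correlation check. This is one possible approach, not the method the notes prescribe: it computes Pearson correlations in plain Python and keeps a variable only if it is not strongly correlated with one already kept. The dataset, the 0.9 threshold, and the function names are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical candidate input variables.
data = {
    "price_usd": [10.0, 20.0, 30.0, 40.0, 50.0],
    "price_eur": [9.0, 18.0, 27.0, 36.0, 45.0],  # a rescaled copy of price_usd
    "units_sold": [7.0, 3.0, 8.0, 2.0, 6.0],
}


def reduce_variables(columns, threshold=0.9):
    """Keep a variable only if |r| with every already-kept variable is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


kept = reduce_variables(data)
print(kept)
```

Here `price_eur` carries the same information as `price_usd`, so it is dropped, while `units_sold` is only weakly correlated with price and stays. In practice, tools such as R or SAS offer built-in routines for this kind of exploration.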