Questions and Answers
What is a new mantra in data collection?
Which of the following is a reason for the growth in data collection?
What type of data does Amazon handle millions of each day?
Why is competitive pressure significant in data science?
Which of the following statements is correct regarding large-scale data?
What is the primary goal of market segmentation in clustering?
In document clustering, what is primarily used to measure the similarity between documents?
Which application is NOT related to deviation or anomaly detection?
How can clustering quality be assessed in market segmentation?
What is an example of a deviation detection application?
What is a primary responsibility of a data scientist working in healthcare?
In which of the following industries can a data scientist work?
Which challenge is associated with data science that involves managing varying types of data?
What does scalability refer to in the context of data science?
What type of analysis might non-traditional data scientists employ?
What is the primary goal of churn prediction for telephone customers?
What attributes are analyzed to classify credit card transactions?
What is a significant success identified in the sky survey cataloging application?
Which method is NOT typically used in classifying galaxies?
In regression analysis, what is the main purpose?
How many images are used in the sky survey cataloging?
Which attribute is used in modeling customer loyalty?
What is the data size of the object catalog in the galaxy classification?
What is the primary goal of classification in machine learning?
In the context of NYC Taxi Cab Data, which of the following tasks would most likely involve classification?
Which of the following is NOT a common classification task mentioned in the content?
What type of data is typically used in fraud detection classification tasks?
When using predictive modeling for classification, what is the term for the portion of data used to evaluate the model?
Which of the following best describes the training process in predictive modeling?
In classification tasks, what kind of model would be used to predict whether a taxi ride is a good or bad fare?
In the context of animal or environmental classification, which method is similar to that of detecting fraudulent credit card transactions?
What is one of the responsibilities of a data scientist when predicting crime locations?
Which task do data scientists perform before constructing their models?
What kind of positions can graduates of a data science program pursue?
In the software development field, what issues can individuals face if they lack business knowledge?
What advantage does a CIS graduate have when working in a software development team for healthcare?
What is one of the roles of a business analyst?
How do software developers benefit from understanding business operations?
What does a system implementer do in their role?
Which of the following best describes the CIS program?
Why is data compliance necessary in data collection?
Study Notes
Data Science Overview
- Data science is driven by enormous growth in commercial and scientific databases, made possible by advances in data generation and collection technologies.
- The new mantra is to gather whatever data you can, whenever and wherever possible.
- The expectation is that gathered data will have value, either for the purpose it was collected for or for a purpose not anticipated beforehand.
Why Data Science? (Commercial Viewpoint)
- Large amounts of data are being collected and warehoused, including web data (e.g., Google).
- Social media platforms (e.g., Facebook) have billions of active users.
- E-commerce involves millions of daily visits and transactions (e.g., Amazon).
- Computing power has become more accessible and affordable.
- Competitive pressure pushes companies to provide better, more customized services.
Why Data Science? (Scientific Viewpoint)
- Data is collected and stored at enormous speeds.
- Remote sensors on satellites produce earth science data that is archived at petabyte scale each year (e.g., NASA EOSDIS archives).
- Telescopes scanning the skies capture large volumes of data (e.g., sky survey data).
- High-throughput biological experiments and scientific simulations can generate terabytes of data in a short time.
- Data science helps automate analysis of massive datasets and facilitates hypothesis formation.
Opportunities to Solve Society's Problems
- Data science can improve healthcare and reduce costs.
- Data science can predict the impact of climate change.
- Data science enables the discovery of alternative green energy sources.
- Data science can address hunger and poverty issues by increasing agricultural production.
What is Data Science?
- Data science is an emerging field, not yet fully defined.
- Key elements of data science include exploratory data analysis and visualization, machine learning, and high-performance computing techniques for dealing with large-scale data.
Skill Sets for Data Science
- Data science combines three skill sets: hacking skills (computer science), math and statistics knowledge, and substantive expertise (domain science). Machine learning and traditional research each draw on only two of the three.
Appreciating Data
- Computer scientists may not naturally appreciate the significance of data.
- Data can be used to test and validate algorithms, but obtaining useful data sets requires effort and innovation.
Computer Scientists vs. Real Scientists
- Scientists study the complexity of the natural world, whereas computer scientists create organized, clean virtual worlds.
- Scientific truths are multifaceted, whereas computer science deals in definite, "true" or "false" statements.
- Scientists are data-driven, while computer scientists are algorithm-driven.
- Scientists focus on exploring and discovering things, whereas computer scientists create or invent.
- Scientists readily acknowledge the limitations and errors in data.
Genius vs. Wisdom
- Data science depends more on wisdom (knowing what to avoid) than on genius (knowing the right answer).
- Software developers focus on code production.
- Data scientists focus on creating insights.
Developing Wisdom
- Wisdom comes from experience, general knowledge, listening to others, and the humility to acknowledge mistakes and understand what caused them.
- Data scientists often struggle to achieve accurate predictions, which makes experience crucial to their practice.
Developing Curiosity
- Good data scientists develop curiosity about their domain or application.
- They talk to the people who work with the data.
- They stay informed about the world through daily reading.
Asking Good Questions
- Data scientists should ask questions to extract meaningful insights from data sets.
- Evaluate what questions the users and stakeholders need answered.
- Consider which datasets can provide answers to those questions.
Let's Practice Asking Questions!
- Practice by asking who, what, where, when, and why questions about three datasets.
- The three datasets are Baseball-reference.com, Google Ngrams, and NYC taxi cab records.
Statistical Record of Play
- Baseball-reference.com provides detailed year-by-year batting, pitching, and fielding statistics for baseball players.
- It also covers teams, awards, and other statistics.
Baseball Questions
- Focus on measuring player skill, evaluating trade fairness, analyzing career trajectories, and correlating batting performance with positions.
Demographic Questions
- The baseball records also support demographic questions: whether left-handed people have shorter lifespans than right-handers, how often people return to live where they were born, how salaries relate to performance, and whether human heights and weights have changed over time.
Google Ngrams
- Google Ngrams is a resource that tracks word and phrase frequency over time.
- It covers phrases of one to five words and provides an annual time series of how often each is used.
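To make the idea of one-to-five-word phrases concrete, here is a minimal Python sketch that counts n-grams in a toy sentence. The function name `ngrams` and the sample text are assumptions made up for illustration; this is not how the Google Ngrams dataset itself is built.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Toy text standing in for one year's worth of books (made up for illustration).
sample = "the cat sat on the mat and the cat slept"

# Count every phrase of one to five words.
counts = Counter()
for n in range(1, 6):
    counts.update(ngrams(sample, n))

print(counts[("the",)])        # frequency of the 1-gram "the"
print(counts[("the", "cat")])  # frequency of the 2-gram "the cat"
```

Repeating this count for each year's text would give one time series per phrase.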
Ngram Questions
- Questions relate to changes in cursing over time, lifespans of fame and technology trends, the emergence and persistence of new words, and association patterns in language.
NYC Taxi Cab Data
- The dataset records every NYC taxi trip, including driver/owner, pickup and dropoff locations, and fares. It was obtained through a Freedom of Information Act request.
Taxicab Questions
- Focus on drivers' earnings, travel distances, traffic patterns during rush hours, travel destinations at various times, the tips drivers receive, and optimal pick-up strategies.
Machine Learning Tasks
- Tasks include clustering, predictive modeling, and anomaly detection.
Predictive Modeling: Classification
- Predictive modeling finds a model for a specified class attribute as a function of the values of the other attributes.
- For example, modeling creditworthiness or predicting specific patient treatments.
- Classification techniques are crucial in many applications (e.g., fraud detection).
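The train-then-evaluate workflow behind classification can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the nearest-centroid classifier, the function names, and the synthetic creditworthiness records are all chosen for illustration and are not taken from the notes. The held-out portion used to evaluate the model is the test set.

```python
import random


def train_centroids(rows, labels):
    """Training step: average the attribute vectors of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical records: [income in $1000s, years at current job] -> credit label.
rng = random.Random(0)
rows, labels = [], []
for _ in range(100):
    rows.append([rng.gauss(80, 8), rng.gauss(10, 2)])
    labels.append("good")
    rows.append([rng.gauss(30, 8), rng.gauss(2, 1)])
    labels.append("bad")

# Hold out 30% of the records as a test set; train on the remaining 70%.
order = list(range(len(rows)))
rng.shuffle(order)
cut = int(0.7 * len(order))
train_idx, test_idx = order[:cut], order[cut:]

model = train_centroids([rows[i] for i in train_idx], [labels[i] for i in train_idx])
correct = sum(predict(model, rows[i]) == labels[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(f"test accuracy: {accuracy:.2f}")
```

Evaluating on records the model never saw during training gives a fairer estimate of how it will behave on new data.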
Applications of Classification Tasks
- Classifying credit card transactions as valid or fraudulent
- Identifying land cover types using satellite data
- Determining the category of news stories
- Identifying intruders within cyberspace and predicting outcomes
- Classifying protein secondary structures
Classification: Application 1 (Fraud Detection)
- The goal is to predict fraudulent cases among credit card transactions.
- Credit card transactions and account details serve as the attributes.
- Past transactions are labeled as fraudulent or legitimate; this label is the class attribute used to train a model.
- The trained model is then applied to new transactions to detect fraud.
Classification: Application 2 (Churn Prediction)
- The goal is to predict whether a telephone customer is likely to leave for a competitor.
- Customer behaviors, transaction data, financial profiles, and other factors are key attributes.
Classification: Application 3 (Sky Survey Cataloging)
- The goal is to predict the class (star or galaxy) of sky objects, especially visually faint ones, from survey images taken at the Palomar Observatory.
- The approach is to segment each image, measure attributes of each object (such as light characteristics), and build a classification model on those attributes.
Classifying Galaxies
- The data consists of a large number of images of stars and galaxies, used to build classification models.
- Image data is characterized by attributes (e.g., image features, characteristics of received light).
Regression
- Regression predicts the value of a continuous-valued variable from the values of other variables, assuming a linear or nonlinear model of dependency.
- Examples include projecting sales of a new product from advertising expenditure, or predicting wind speed from temperature and other environmental measurements.
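For the linear case with a single predictor, the fit has a closed form. The sketch below applies ordinary least squares to made-up advertising and sales figures; the function name `fit_line` and the numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up figures: advertising spend vs. sales, both in $1000s.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]  # exactly 2 * spend + 5, to keep the example easy to check

slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 60 + intercept  # projected sales for a spend of 60
print(slope, intercept, forecast)
```

Real data will not lie exactly on a line, so the fitted line then minimizes the sum of squared prediction errors rather than reproducing every point.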
Clustering
- The aim is to group data points so that distances within a cluster are minimized and distances between clusters are maximized.
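One common way to pursue this objective is k-means (Lloyd's algorithm), which alternates between assigning points to the nearest centroid and recomputing the centroids. The notes do not name a specific algorithm, so treat the choice of k-means, the function names, and the toy points below as assumptions for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


# Two visibly separate groups of 2-D points (toy data).
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, assignment = k_means(points, k=2)
print(assignment)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label.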
Applications of Cluster Analysis
- Understanding and targeting customer demographics for improved marketing campaigns
- Clustering related documents in groups for user access
- Grouping genes and proteins that have similar functionality
- Grouping stocks with similar price fluctuations
Clustering: Application 1 (Market Segmentation)
- The goal is to divide a market into customer segments with similar characteristics so that each can be targeted with its own marketing.
- Collect customer attributes (e.g., demographics, purchasing behaviors) and cluster customers on them.
- Assess clustering quality by comparing the buying patterns of customers in the same segment with those of customers in different segments.
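One simple way to compare buying patterns within and across segments is to contrast the average distance between customers in the same segment with the average distance between customers in different segments. The sketch below does that; the function name `segment_cohesion`, the spending figures, and the segment labels are hypothetical.

```python
from itertools import combinations
from math import dist


def segment_cohesion(customers, segments):
    """Return (mean within-segment distance, mean between-segment distance)."""
    within, between = [], []
    for i, j in combinations(range(len(customers)), 2):
        d = dist(customers[i], customers[j])
        (within if segments[i] == segments[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical buying patterns: (monthly grocery spend, monthly electronics spend).
customers = [(200, 10), (220, 15), (210, 5), (30, 300), (40, 280), (25, 310)]
segments = [0, 0, 0, 1, 1, 1]  # labels produced by some earlier clustering step

within, between = segment_cohesion(customers, segments)
print(f"within: {within:.1f}, between: {between:.1f}")
```

A good segmentation shows a small within-segment distance relative to the between-segment distance.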
Clustering: Application 2 (Document Clustering)
- The goal is to find groups of documents with similar content or themes.
- Identify frequently occurring terms in each document and build a similarity measure from their frequencies.
- Documents are then clustered using this similarity measure.
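A term-frequency similarity measure can be sketched as follows. Cosine similarity is used here as one widely used choice; the notes do not say which measure is intended, and the function names and three toy documents are assumptions for illustration.

```python
from collections import Counter
from math import sqrt


def term_frequencies(document):
    """Count how often each word occurs in a document."""
    return Counter(document.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0 = no shared terms)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = sqrt(sum(v * v for v in tf_a.values()))
    norm_b = sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents: two about sports, one about finance.
docs = {
    "sports_1": "the team won the match after a late goal",
    "sports_2": "a late goal decided the match for the home team",
    "finance_1": "the bank raised interest rates on new loans",
}
tfs = {name: term_frequencies(text) for name, text in docs.items()}

sim_sports = cosine_similarity(tfs["sports_1"], tfs["sports_2"])
sim_mixed = cosine_similarity(tfs["sports_1"], tfs["finance_1"])
print(f"sports vs sports: {sim_sports:.2f}, sports vs finance: {sim_mixed:.2f}")
```

In practice very common words such as "the" are usually down-weighted or removed, since they make unrelated documents look more similar than they are.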
Deviation/Anomaly/Change Detection
- Detecting significant deviations from normal patterns.
- Applications include credit card fraud detection, network intrusion detection, changes in sensor networks, and monitoring/tracking changes in global forest cover.
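A very simple form of deviation detection flags values that lie far from the mean, measured in standard deviations (a z-score). The sketch below is only one basic technique among many; the function name `find_anomalies`, the threshold, and the spending figures are assumptions for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return (index, value) pairs more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical daily card spend in dollars; index 7 holds an unusually large charge.
daily_spend = [42, 38, 45, 40, 39, 44, 41, 950, 43, 37, 40, 42]

# A single extreme value inflates the standard deviation in a short series,
# so a threshold below 3 is used here.
anomalies = find_anomalies(daily_spend, threshold=2.5)
print(anomalies)
```

Because the outlier itself distorts the mean and standard deviation, more robust variants often use the median and median absolute deviation instead.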
Motivating Challenges
- Key challenges are scalability (handling large datasets), high dimensionality (many attributes), heterogeneous and complex data, data ownership and distribution across multiple sources, and the need for non-traditional analysis methods.
DS Career Path
- Data Science (DS) graduates can find diverse career paths.
Introduction
- Graduates of data science programs most often take data scientist positions.
- Data scientists can work in organizations like private companies, government agencies, and non-profit organizations.
Industries
- Data science is relevant to a wide range of industries (e.g., finance, government, healthcare, online platforms, large retailers, agriculture).
Data Scientist Responsibilities
- Data scientists build and validate data models that their employers use to predict outcomes, make recommendations, and evaluate future business decisions.
- Data scientists are responsible for preparing and cleaning the data for these models.
- They also handle data management during data collection, making sure the data complies with applicable rules and legal standards.
More Opportunities
- Graduates may also take software development roles or specialized roles building business intelligence dashboards that present results to users through charts and reports.
CIS Career Path
- CIS graduates often pursue careers in software development, business analysis, and system implementation.
Introduction (CIS)
- CIS (Computer Information Systems) programs generally combine technology and business aspects, which equip graduates with a broad set of skills valuable in diverse areas.
- Graduates of these programs are usually adept at applying technology, together with knowledge of existing business systems, to improve efficiency.
Introduction (continued)
- In software development, technical knowledge alone may leave business standards, international business standards, and organizational constraints unaddressed.
Example
- Exposure to healthcare systems gives CIS graduates an understanding of the features and functionality of electronic health records (EHR).
You as a Business Analyst
- Business analysts usually define requirements for software systems that cater to customer needs.
- Familiarity with existing systems is often helpful in understanding customer needs.
You as a Software Developer
- Software developers create software systems based on functional requirements and business objectives.
- Business awareness allows developers to choose proper architectures that can support future business standards & requirements.
You as a System Implementer
- System implementers guide users in how to use software systems properly.
- Experience and understanding of how businesses operate will lead to useful guidance for system usage.
Description
Test your knowledge on key concepts in data science, covering topics like data collection, clustering, market segmentation, and anomaly detection. This quiz explores various aspects of the data science field, including the responsibilities of data scientists and the challenges they face. Put your understanding to the test and see how well you know data science!