Introduction to Data Science - Part 1 PDF

What is Data?

Data is the foundation of data science; it is the material on which all the analyses are based. In the context of data science, data refers to raw facts, figures, observations, or information collected, stored, and analyzed for the purpose of gaining insights, making decisions, and supporting various applications. Data can come from diverse sources such as sensors, surveys, social media, business transactions, or scientific experiments.

Raw facts: facts are specific pieces of information derived from data analysis.

Examples:
1. The average age of customers in a specific dataset is 35 years old. This fact is derived from calculating the mean of the ages of all customers in the dataset.
2. The sales of a product increased by 20% last quarter compared to the previous quarter. This fact is obtained by comparing the sales data from two different time periods and calculating the percentage increase.
3. 80% of users prefer using a mobile app over a website for online shopping. This fact is determined by conducting a survey or analyzing user behavior data, and it represents the majority preference among users.

Figures: a figure typically refers to a visual representation of data, often used to illustrate patterns, trends, or relationships within the data.

Example: A Scatter Plot Showing the Relationship Between Study Hours and Exam Scores. To understand the relationship between study hours and exam performance, the data scientist creates a scatter plot. This figure provides a clear visual representation of the relationship between study hours and exam scores, making it easier for stakeholders to understand the data and draw insights from it.

Observations: an observation refers to a specific instance or data point collected or recorded during an experiment, survey, study, or any other research activity. Observations are raw data and serve as the basis for analysis.
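The first two example facts listed earlier (a mean age and a percentage increase) are small calculations over raw observations. The sketch below shows one way to compute them; the ages and sales totals are made-up values chosen only so the results match the examples, and `percent_change` is a helper name introduced here for illustration.

```python
from statistics import mean

# Hypothetical raw observations: customer ages (assumed values).
customer_ages = [28, 31, 35, 39, 42]

# Hypothetical sales totals for two consecutive quarters (assumed values).
previous_quarter_sales = 50_000
last_quarter_sales = 60_000


def percent_change(old: float, new: float) -> float:
    """Return the percentage change from old to new."""
    return (new - old) * 100 / old


# Fact 1: the average age is the mean of all recorded ages.
average_age = mean(customer_ages)

# Fact 2: the sales increase compares two time periods.
sales_growth = percent_change(previous_quarter_sales, last_quarter_sales)

print(f"Average customer age: {average_age} years")
print(f"Sales change vs previous quarter: {sales_growth:.0f}%")
```

The third example fact (a majority preference) would be computed the same way, as a share of survey responses.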
Examples of observations, from a survey on customer satisfaction:

Observation 1:
Customer ID: 12345
Date of Purchase: January 15, 2023
Satisfaction Rating (on a scale of 1 to 5): 4
Comments: "The product arrived earlier than expected. Very satisfied with the service."

Observation 2:
Customer ID: 67890
Date of Purchase: January 18, 2023
Satisfaction Rating (on a scale of 1 to 5): 2
Comments: "The product quality was not as expected. Dissatisfied with the purchase."

Observation 3:
Customer ID: 23456
Date of Purchase: January 20, 2023
Satisfaction Rating (on a scale of 1 to 5): 5
Comments: "Exceptional customer service. Product exceeded my expectations. Highly satisfied."

In this example, each observation represents feedback from a specific customer after their purchase. The observations include unique identifiers (such as Customer ID), the date of the interaction, a numerical satisfaction rating, and additional comments provided by the customers. These observations can be analyzed collectively to identify patterns, trends, or areas for improvement in customer satisfaction, which can inform business decisions and strategies. Data scientists often work with large sets of observations to draw meaningful insights and conclusions.

Data can come from diverse sources: computers, mobile devices, cameras, sensors, and watches and other wearable technologies. It is generated in every social media interaction we make, every file we save, every picture we take, and every query we submit. It is even generated when we do something as simple as getting directions to the closest ice cream shop from Google.

Types of Data?

Data can come in many forms or formats, including numbers, text, images, audio, video, and more. In the context of data science, there are two types of data: structured and unstructured data.

Structured Data

Structured data refers to data that is organized in a specific format, making it easily searchable and analyzable by algorithms and understandable by both humans and machines.
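The three survey observations listed earlier already follow a fixed format of this kind, since each one has the same four fields. As a minimal sketch, they could be held as records and analyzed collectively; the `Observation` class, its field names, and the "rating of 2 or lower" threshold are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One survey response; field names are illustrative."""
    customer_id: str
    purchase_date: date
    rating: int  # satisfaction on a scale of 1 to 5
    comment: str


observations = [
    Observation("12345", date(2023, 1, 15), 4,
                "The product arrived earlier than expected. "
                "Very satisfied with the service."),
    Observation("67890", date(2023, 1, 18), 2,
                "The product quality was not as expected. "
                "Dissatisfied with the purchase."),
    Observation("23456", date(2023, 1, 20), 5,
                "Exceptional customer service. Product exceeded "
                "my expectations. Highly satisfied."),
]

# Analyze the observations collectively: an overall average rating...
average_rating = sum(o.rating for o in observations) / len(observations)

# ...and the customers whose rating suggests an area for improvement
# (the threshold of 2 is an assumption).
low_rating_customers = [o.customer_id for o in observations if o.rating <= 2]

print(f"Average rating: {average_rating:.2f}")
print(f"Customers to follow up with: {low_rating_customers}")
```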
Computers can effectively process structured data for insights due to its quantitative nature. It is typically tabular with rows and columns that clearly define data attributes.

Examples of structured data:
1. Databases: Information stored in relational databases, where data is organized into tables with rows and columns. Each column represents a specific attribute (such as name, age, or address), and each row represents a record. SQL databases (e.g., MySQL, PostgreSQL) are common examples of structured data storage.
2. Spreadsheets: Data organized in rows and columns within software like Microsoft Excel or Google Sheets. Each column represents a different data type, and each row represents a unique entry.
3. CSV (Comma-Separated Values) Files: Plain text files where values are separated by commas. Each line represents a new record, and the commas indicate different fields or attributes.
4. JSON (JavaScript Object Notation) Data: Data represented as key-value pairs, which can be nested to create a hierarchical structure.
5. XML (eXtensible Markup Language) Data: Data represented using tags in a hierarchical structure, allowing for complex and nested relationships.

Unstructured Data

Unstructured data refers to information that does not have a predefined data model or format. Unlike structured data, which fits neatly into databases and spreadsheets with rows and columns, unstructured data lacks a specific structure and is not easily organized into a tabular format. Unstructured data can come in various forms, including text documents, images, audio recordings, videos, social media posts, emails, and more.

It can be textual or non-textual. It can be human-generated or machine-generated.
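By contrast, the structured formats listed earlier (CSV, JSON, and XML) can be parsed directly into named fields. The sketch below reads the same record from each format using only Python's standard library; the customer record and its field names are invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical customer record in three structured formats.
csv_text = "name,age,city\nAda,36,London\n"
json_text = '{"name": "Ada", "age": 36, "address": {"city": "London"}}'
xml_text = (
    "<customer><name>Ada</name><age>36</age>"
    "<address><city>London</city></address></customer>"
)

# CSV: each line is a record, commas separate the fields.
csv_record = next(csv.DictReader(io.StringIO(csv_text)))

# JSON: key-value pairs, which may be nested.
json_record = json.loads(json_text)

# XML: tags arranged in a hierarchy.
xml_root = ET.fromstring(xml_text)

csv_city = csv_record["city"]
json_city = json_record["address"]["city"]
xml_city = xml_root.findtext("address/city")

print(csv_city, json_city, xml_city)
```

Note that CSV values arrive as plain strings (the age is `"36"`), while JSON keeps the number as a number. A free-text review or an image has no such fields to look up, which is why unstructured data needs other techniques.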
Unstructured data is usually stored in non-relational systems such as Hadoop or NoSQL databases and processed by unstructured data analytics programs like OpenText IDOL.

Examples of unstructured data in data science include:
1. Text Data: Textual information from sources such as books, articles, emails, social media posts, customer reviews, and surveys. Natural language processing (NLP) techniques are often used to extract insights from unstructured text data.
2. Images: Photographs, diagrams, satellite imagery, and other visual data. Image recognition and computer vision techniques are used to analyze and interpret unstructured image data.
3. Audio Data: Recordings of voice conversations, podcasts, music, or other audio sources. Speech recognition and audio processing methods are employed to extract useful information from unstructured audio data.
4. Video Data: Moving images and scenes captured from cameras or videos. Video analysis techniques, including object detection and motion tracking, are used to process unstructured video data.
5. Social Media Feeds: Unstructured data from social media platforms, including posts, comments, and multimedia content. Social media analytics tools are used to gain insights from unstructured social media data.
6. Web Pages: Content from websites, including articles, blogs, and forum posts. Web scraping and text mining techniques can be applied to extract valuable information from unstructured web data.

Structured vs Unstructured Data

Big Data?

Big data, on the other hand, is bigger than traditional data, and not in the trivial sense.
It differs in variety (numbers, text, but also images, audio, mobile data, etc.), in velocity (retrieved and computed in real time), and in volume (measured in tera-, peta-, and exabytes). Big data is usually distributed across a network of computers.

Quantitative vs qualitative data

Quantitative and qualitative data are two fundamental types of data used in data science and research. They differ in their nature, measurement, and the types of insights they can provide.

Quantitative data

Quantitative data is data that can be counted or measured. Usually, that means it is measured in numbers.

1. Nature
Numerical: Quantitative data consists of numerical values that can be measured and counted.
Example: Age: 25 years; Weight: 70 kg; Revenue: $5000; Number of products sold: 100 units.
2. Measurement
Exact: Quantitative data is measured using precise and standardized methods.
Units: It often includes units of measurement (e.g., kilograms, meters, dollars).
3. Analysis
Statistical Methods: such as mean, median, standard deviation.
Visualizations: Histograms, bar charts, line graphs, and scatter plots.

Quantitative data is used for making numerical calculations and statistical inferences. It is highly valuable for understanding patterns, trends, and relationships within datasets.

Qualitative data

Qualitative data is information that cannot be counted, measured, or easily expressed using numbers. It is collected using questionnaires, interviews, or observation, and frequently appears in narrative form.

1. Nature
Descriptive: Qualitative data describes qualities or characteristics.
Non-Numerical: It does not consist of numerical values but rather categorical or textual information.
Examples: Marital Status: Married; Customer Feedback: "Satisfied"; Colors: Red, Blue, Green; Interview Responses: "Agree," "Disagree".
2. Measurement
Subjective: Qualitative data is often based on opinions, attitudes, or perceptions and can be subjective.
Categories: Data is usually categorized into groups or classes.
3. Analysis
Narrative Analysis: Qualitative data can involve analyzing narratives or textual information for insights.

Qualitative data is valuable for exploring complex phenomena, understanding social behaviors, and uncovering underlying motivations.
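The difference in how the two kinds of data are analyzed can be sketched in a few lines. Quantitative values support the statistical methods named above (mean, median, standard deviation), while qualitative values are grouped into categories and counted. The sample ages and feedback labels below are made-up values for illustration, and counting categories is only one simple way to summarize qualitative data, not a full narrative analysis.

```python
from collections import Counter
from statistics import mean, median, stdev

# Quantitative: numerical values that can be measured (assumed sample).
ages = [25, 32, 41, 25, 37]

# Qualitative: categorical labels (assumed sample).
feedback = ["Satisfied", "Dissatisfied", "Satisfied", "Satisfied", "Neutral"]

# Statistical methods apply to the quantitative values.
age_mean = mean(ages)
age_median = median(ages)
age_stdev = stdev(ages)  # sample standard deviation

# Qualitative values are grouped into categories and counted.
feedback_counts = Counter(feedback)
top_label, top_count = feedback_counts.most_common(1)[0]

print(f"Age: mean={age_mean}, median={age_median}, stdev={age_stdev:.2f}")
print(f"Most common feedback: {top_label} ({top_count} of {len(feedback)})")
```

Taking the mean of the feedback labels would be meaningless, which is the practical sense in which qualitative data "cannot be counted or measured" as numbers.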
