Introduction to Data Science.pptx
Document Details
Uploaded by HappierElder6060
Introduction to Data Science
Unit 1

Contents
– Introduction
– Need for Data Science
– Components of Data Science
– Data Acquisition and Data Science Life Cycle
– Basic Tools of Data Science
– Difference between BI and Data Science
– Applications of Data Science
– Role of Data Scientist

What is Data Science?
Data science is an interdisciplinary field that uses scientific techniques, procedures, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines elements of statistics, mathematics, programming, and domain expertise to transform data into actionable insights.

Need for Data Science
1. Informed Decision Making:
– Empowers data-driven decisions
– Enhances forecasting and planning
2. Competitive Advantage:
– Optimizes operations
– Improves customer experience
3. Efficiency and Automation:
– Streamlines routine tasks
– Boosts operational efficiency
4. Personalization:
– Tailors products and services
– Increases customer satisfaction
5. Risk Management:
– Assesses and mitigates risks
– Detects fraud and anomalies
6. Healthcare Improvements:
– Enables predictive diagnostics
– Enhances patient care
7. Scientific Research:
– Accelerates discoveries
– Validates hypotheses
8. Social Good:
– Accelerates discoveries
– Validates hypotheses
9. Customer Insights:
– Understands customer behavior
– Enhances retention strategies
10. Innovation and Development:
– Identifies market gaps
– Drives product development

Components of Data Science
1. Statistics: Statistics is one of the most important components of data science. It is a way to collect and analyze large amounts of numerical data and find meaningful insights in it.
2. Domain Expertise: Domain expertise binds data science together. It means specialized knowledge or skills in a particular area. Data science is applied in many areas, and each of them needs its own domain experts.
3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. It also includes adding metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. Visuals make large amounts of data easy to take in.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity, structure, space, and change. A good knowledge of mathematics is essential for a data scientist.
7. Machine learning: Machine learning is the backbone of data science. It is about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.

Data Acquisition
Data acquisition is the comprehensive process of systematically collecting, measuring, and recording data from various sources to facilitate analysis and decision-making. This process encompasses a wide range of techniques and tools designed to gather raw data from different origins, ensuring that the data is accurate, relevant, and suitable for further analysis.
Data acquisition, also known as the process of collecting data, relies on specialized software that quickly captures, processes, and stores information. It enables scientists and engineers to perform in-depth analysis for scientific or engineering purposes. Data acquisition systems are available in handheld and remote versions to cater to different measurement requirements.
Handheld systems are suitable for direct interaction with subjects, while remote systems excel at distant measurements, providing versatility in data collection.

Components of Data Acquisition
– Sensors: Devices that gather information about physical or environmental conditions, such as temperature, pressure, or light intensity.
– Signal Conditioning: Preprocesses the raw sensor data to filter out noise and scale it appropriately, so that measurements are accurate.
– Data Logger: Hardware or software that records and stores the conditioned data over time.
– Analog-to-Digital Converter (ADC): Converts analog sensor signals into digital data that computers can process.
– Interface: Connects the data acquisition system to a computer or controller for data transfer and control.
– Power Supply: Provides the necessary electrical power to operate the system and sensors.
– Control Unit: Manages the overall operation of the data acquisition system, including triggering, timing, and synchronization.
– Software: Allows users to configure, monitor, and analyze the data collected by the system.
– Communication Protocols: Govern the transmission and reception of data between the system and external devices or networks.
– Storage: Holds the recorded data. Options include memory cards, hard drives, and cloud storage, which provide both temporary and permanent storage.
– User Interface: Allows users to interact with and control the data acquisition system effectively.
– Calibration and Calibration Standards: The sensors and system are periodically calibrated against known standards to ensure accuracy.
– Real-time Clock (RTC): Maintains accurate timing for synchronized data acquisition and timestamping.
– Triggering Mechanism: Initiates data capture based on predefined events or specific conditions.
– Data Compression: Reduces the size of collected data for storage and transmission in remote or resource-limited applications.

Key Elements of Data Acquisition

Sources of Data                | Techniques                | Tools
Sensors and IoT devices        | Manual data entry         | Data acquisition systems
Databases and data warehouses  | Automated data collection | ETL tools
Web scraping and APIs          | Streaming data collection | Data loggers and web scraping tools
Surveys and forms              | Batch processing          |
Social media platforms         |                           |

Importance                                                    | Challenges
Provides the raw data necessary for analysis                  | Data quality and integrity
Ensures data is accurate and up-to-date                       | Handling large volumes of data
Facilitates real-time decision making                         | Ensuring data privacy and security
Supports predictive analytics and machine learning models     | Integrating data from diverse sources

Advantages of Data Acquisition
– Advancing Scientific Exploration: Researchers across fields such as physics, biology, and environmental science rely on data acquisition to collect information for experiments, simulations, and observations, facilitating breakthroughs and new insights.
– Enhancing Industrial Efficiency: Data acquisition systems play a pivotal role in industrial settings by monitoring manufacturing processes, supporting quality assurance, and optimizing overall efficiency.
– Fostering Environmental Insights: Environmental monitoring uses data acquisition to track critical factors such as air quality, water levels, and soil conditions, contributing to effective environmental management and timely disaster prediction.
– Revolutionizing Healthcare and Biomedical Studies: Medical devices use data acquisition to monitor vital signs and acquire physiological data, improving diagnostic accuracy and advancing biomedical research.
– Elevating Automotive Evaluation: Within the automotive industry, data acquisition serves as an indispensable tool for testing vehicle performance and safety features.

Data Science Life Cycle
1. Identifying problems and understanding the business: Identifying the problem is one of the major steps in the data science process, because it sets a clear objective around which all the following steps are formulated. It is important to understand the business objective early, since it decides the final goal of the analysis. This phase should examine business trends, analyse case studies of similar analyses, and study the industry's domain. The team assesses in-house resources, infrastructure, total time, and technology needs. Once these aspects are identified and evaluated, the team prepares an initial hypothesis for resolving the business challenge in the current scenario. The phase should:
✔ Clearly state the problem that requires a solution and why it should be resolved at once
✔ Define the potential value of the business project
✔ Find risks, including ethical aspects, involved in the project
✔ Build and communicate a highly integrated, flexible project plan
2. Data Collection/Data Gathering: Data collection is the next stage in the data science life cycle, in which raw data is gathered from relevant sources. The data captured can be structured or unstructured. Sources include website logs, social media data, online repositories, data streamed from online sources via APIs, web scraping, and data held in Excel files or other stores. The person performing the task should know the differences between the available data sets and the organisation's data investment strategy. A major challenge in this step is tracking where each piece of data comes from and whether it is up to date.
It is important to keep track of this information throughout the entire life cycle of a data science project.
3. Data processing: In this phase, data scientists analyse the collected data for biases, patterns, ranges, and distribution of values. This is done to determine the suitability of the data and to predict its usage in regression, machine learning, and deep learning algorithms. The phase also involves inspecting the different types of data, including nominal, numerical, and categorical data. Data visualisation is used to highlight critical trends and patterns, often with simple bar and line charts. Data processing may be the most time-consuming phase of the life cycle, but it is arguably also the most critical: the quality of the model depends on it.
4. Data analysis: Data analysis, or exploratory data analysis, is another critical step for gaining ideas about the solution and the factors that affect it. There are no set guidelines for this methodology, and it has no shortcuts. The key point to remember is that your input determines your output. In this step, the data prepared in the previous stage is explored further to examine its features and their relationships, which supports better feature selection for the model. Experts use statistical measures such as the mean and median to understand the data better. They also plot the data and assess its distribution patterns using histograms, spectrum analysis, and population distribution. The analysis performed depends on the problem at hand.
5. Data modelling: Modelling the data is one of the major phases of the process and is often described as the heart of data analysis. A model should use the prepared and analysed data to provide the desired output.
The environment needed to run the model is chosen and set up according to the project's specific requirements. In this phase, the team works together to develop datasets for training and testing the model for production purposes. It also involves tasks such as choosing the appropriate model type and determining whether the problem is a classification, regression, or clustering problem. After choosing the model family, you must choose the algorithms that implement it. This has to be done carefully, since extracting the necessary insights from the prepared data is extremely important.
6. Model deployment: This is the final stage of the data science life cycle. After a rigorous evaluation process, the model is prepared for deployment in the desired format and through the preferred channel. A machine learning model has no value until it is deployed to production, so models have to be recorded before the deployment process. In general, these models are integrated and coupled with products and applications. Model deployment involves creating the delivery mechanism needed to get the model out to users or to another system. Machine learning models are also increasingly deployed on devices, an approach that is gaining adoption and popularity. This step differs between projects: it can range from a simple model output in a Tableau dashboard to something as complex as scaling the model in the cloud in front of millions of users.

Basic Tools of Data Science
1. Programming Languages
– Python: Widely used for its simplicity and rich ecosystem of libraries for data analysis, visualization, and machine learning.
– R: Popular in the statistics and data analysis community, with strong visualization capabilities.
2. Libraries and Frameworks
– Pandas: A Python library for data manipulation and analysis, providing data structures like DataFrames.
– NumPy: A Python library for numerical computing, essential for handling arrays and performing mathematical operations.
– Matplotlib: A plotting library for Python, useful for creating static, interactive, and animated visualizations.
– Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
– SciPy: A Python library used for scientific and technical computing.
– Scikit-learn: A machine learning library for Python, offering simple and efficient tools for data mining and data analysis.
– TensorFlow and PyTorch: Popular frameworks for deep learning.
3. Data Management Tools
– SQL: A language for managing and querying relational databases.
– MySQL, PostgreSQL: Commonly used relational database management systems.
– MongoDB: A NoSQL database for handling unstructured data.
– Hadoop: A framework for distributed storage and processing of large datasets using the MapReduce programming model.
4. Data Visualization Tools
– Tableau: A powerful tool for creating interactive and shareable dashboards.
– Power BI: A business analytics tool by Microsoft for visualizing data and sharing insights.
– D3.js: A JavaScript library for producing dynamic, interactive data visualizations in web browsers.
5. Integrated Development Environments (IDEs) and Notebooks
– Jupyter Notebook: An open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text.
– Spyder: An open-source IDE for scientific programming in Python.
– RStudio: An IDE for R that provides a user-friendly interface for data analysis.
6. Data Cleaning and Preprocessing Tools
– OpenRefine: A powerful tool for working with messy data, cleaning it, and transforming it from one format into another.
– Trifacta: A data wrangling tool for exploring and preparing diverse data for analysis.
7. Big Data Tools
– Apache Spark: An open-source unified analytics engine for large-scale data processing.
– Apache Hive: Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage.
8. Version Control Systems
– Git: A distributed version control system for tracking changes in source code during software development.
– GitHub, GitLab, Bitbucket: Platforms for hosting and collaborating on Git repositories.
9. Data Acquisition Tools
– Beautiful Soup: A Python library for pulling data out of HTML and XML files.
– Scrapy: An open-source and collaborative web crawling framework for Python.
– APIs: Application Programming Interfaces used for retrieving data from online sources.

Difference between BI and Data Science

Sr. No. | Factor | Data Science | Business Intelligence
1 | Concept | A field that uses mathematics, statistics, and various other tools to discover hidden patterns in data. | A set of technologies, applications, and processes that enterprises use for business data analysis.
2 | Focus | Focuses on the future. | Focuses on the past and present.
3 | Data | Deals with both structured and unstructured data. | Mainly deals with structured data only.
4 | Flexibility | Much more flexible, as data sources can be added as required. | Less flexible, as data sources need to be pre-planned.
5 | Method | Makes use of the scientific method. | Makes use of the analytic method.
6 | Complexity | Has a higher complexity than business intelligence. | Much simpler than data science.
7 | Expertise | Its expert is the data scientist. | Its expert is the business user.
8 | Questions | Deals with the questions of what will happen and what if. | Deals with the question of what happened.
9 | Storage | The data to be used is distributed across real-time clusters. | A data warehouse is used to hold the data.
10 | Integration of data | The ELT (Extract-Load-Transform) process is generally used to integrate data for data science applications. | The ETL (Extract-Transform-Load) process is generally used to integrate data for business intelligence applications.
11 | Tools | Its tools include SAS, BigML, MATLAB, Excel, etc. | Its tools include InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.
12 | Usage | Companies can harness their potential by using data science to anticipate future scenarios, in order to reduce risk and increase income. | Business intelligence helps in performing root cause analysis on a failure or in understanding the current status.
13 | Business Value | Greater business value is achieved with data science than with business intelligence, as it anticipates future events. | Business intelligence has lesser business value, as business value is extracted statically by plotting charts and KPIs (Key Performance Indicators).
14 | Handling data sets | Technologies such as Hadoop are available, and others are evolving, for handling large data sets. | Sufficient tools and technologies are not available for handling large data sets.

Applications of Data Science
1. Image recognition and speech recognition: Data science is currently used for image and speech recognition. When you upload an image on Facebook, you get suggestions to tag your friends. This automatic tagging suggestion uses an image recognition algorithm, which is part of data science. When you speak to "Ok Google", Siri, Cortana, and similar assistants and the device responds to your voice, a speech recognition algorithm makes this possible.
2. Gaming world: In the gaming world, the use of machine learning algorithms is increasing day by day.
EA Sports, Sony, and Nintendo widely use data science to enhance the user experience.
3. Internet search: When we want to search for something on the internet, we use search engines such as Google, Yahoo, Bing, and Ask. All of these search engines use data science to improve the search experience, returning results within a fraction of a second.
4. Transport: The transport industry is also using data science to create self-driving cars.
5. Healthcare: In the healthcare sector, data science provides many benefits. It is used for tumor detection, drug discovery, medical image analysis, virtual medical bots, and more.
6. Recommendation systems: Companies such as Amazon, Netflix, and Google Play use data science to improve the user experience with personalized recommendations. For example, when you search for something on Amazon and then start getting suggestions for similar products, that is data science at work.
7. Risk detection: The finance industry has always faced fraud and the risk of losses, and data science helps to reduce both. Most finance companies look for data scientists to avoid risk and losses while increasing customer satisfaction.

Role of Data Scientist
Data scientist roles and responsibilities include:
– Data mining, or extracting usable data from valuable data sources
– Using machine learning tools to select features and to create and optimize classifiers
– Carrying out preprocessing of structured and unstructured data
– Enhancing data collection procedures to include all relevant information for developing analytic systems
– Processing, cleansing, and validating the integrity of data to be used for analysis
– Analyzing large amounts of information to find patterns and solutions
– Developing prediction systems and machine learning algorithms
– Presenting results in a clear manner
– Proposing solutions and strategies to tackle business challenges
– Collaborating with business and IT teams
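To make the responsibility of "processing, cleansing, and validating the integrity of data" more concrete, here is a minimal Python sketch that uses only the standard library. It is an illustration under stated assumptions, not a prescribed method: the records, the field names (customer_id, age, monthly_spend), and the validation rules are all hypothetical and do not come from the slides. In practice a library such as Pandas would usually handle this step.

```python
import statistics

# Hypothetical raw records, as they might arrive from a data collection step.
raw_records = [
    {"customer_id": "C001", "age": "34", "monthly_spend": "120.50"},
    {"customer_id": "C002", "age": "", "monthly_spend": "80.00"},     # missing age
    {"customer_id": "C003", "age": "29", "monthly_spend": "abc"},     # non-numeric spend
    {"customer_id": "C001", "age": "34", "monthly_spend": "120.50"},  # duplicate row
    {"customer_id": "C004", "age": "41", "monthly_spend": "95.25"},
]


def clean(records):
    """Drop duplicate customers and rows that fail basic validation."""
    seen = set()
    cleaned = []
    for row in records:
        if row["customer_id"] in seen:
            continue  # already kept a valid row for this customer
        try:
            age = int(row["age"])
            spend = float(row["monthly_spend"])
        except ValueError:
            continue  # missing or non-numeric field
        if not 0 < age < 120 or spend < 0:
            continue  # value outside the assumed plausible range
        seen.add(row["customer_id"])
        cleaned.append(
            {"customer_id": row["customer_id"], "age": age, "monthly_spend": spend}
        )
    return cleaned


def summarize(records):
    """Compute simple descriptive statistics on the cleaned data."""
    spends = [r["monthly_spend"] for r in records]
    return {
        "rows": len(records),
        "mean_spend": statistics.mean(spends),
        "median_spend": statistics.median(spends),
    }


cleaned = clean(raw_records)
summary = summarize(cleaned)
print(summary)
```

Keeping cleaning and summarizing in separate functions mirrors the split between the data processing and data analysis stages of the life cycle: the summary statistics are only computed on rows that passed validation.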