Data Science Lecture Notes PDF

Summary

This document is a lecture on data science that focuses on data wrangling and preprocessing techniques and their applications in finance. It covers using Python for financial data analysis and handling missing values. The document is likely part of an introductory-to-intermediate course in data science.

Full Transcript

Data Science

Agenda
- Introduction
- Data Science in Finance
- Python for Financial Data Analysis
- Data Wrangling and Preprocessing Financial Data
- Exploratory Data Analysis (EDA) in Finance

What is Data?

Data Science
Data science is the study of data to extract meaningful insights for business.

What is Data Science?
Data science is about data gathering, analysis, and decision-making. It is about finding patterns in data through analysis and making predictions about the future. By using data science, companies are able to make:
- Better decisions (should we choose A or B?)
- Predictive analysis (what will happen next?)
- Pattern discoveries (finding patterns, or perhaps hidden information, in the data)

Where is Data Science Needed?
Data science is used in many industries today, e.g. banking, consultancy, healthcare, and manufacturing.

Examples
- Route planning: discovering the best routes to ship
- Foreseeing delays for flights, ships, trains, etc. (through predictive analysis)
- Creating promotional offers
- Finding the best time to deliver goods
- Forecasting next year's revenue for a company
- Analyzing the health benefits of training
- Predicting who will win elections

What is data science used for in finance?
Data science is widely used in the finance industry to improve decision-making, reduce risk, and increase efficiency.

Benefits of Data Science in the Finance Industry
1. Fraud detection and prevention (e.g. someone creating a fake bank account)
2. Credit allocation
3. Risk management and analysis (e.g. loans)
4. Customer analytics and segmentation (needs, desires, and expectations)
5. Pricing optimization

Python for Financial Data Analysis
Python is one of the most popular programming languages in finance. It is open source and supports object-oriented programming, and it is used by many large corporations, including Google, for a variety of projects. Python can be used to import financial data such as stock quotes using the pandas library.

What is Data Wrangling?
Data wrangling is the process of cleaning, structuring, and transforming raw data into a usable format for analysis. Also known as data munging, it involves tasks such as handling missing or inconsistent data, formatting data types, and merging different datasets to prepare the data for further exploration and modeling in data analysis or machine learning projects.
1. Discover: This involves identifying data sources, assessing data quality, and gaining insights into the structure and format of the data.
2. Structure: Structuring typically involves reshaping data, handling missing values, and converting data types.
3. Clean: This involves removing or correcting inaccurate data, handling duplicates, and addressing any anomalies that could impact the reliability of analyses. The focus of cleaning is on improving data accuracy.
4. Enrich: This can include merging datasets, extracting relevant features, or incorporating external data sources.
5. Validate: Validation ensures the quality and reliability of your processed data.
6. Publish: This involves documenting data lineage and the steps taken during the entire wrangling process, sharing metadata, and preparing the data for storage or integration into data science and analytics tools.

Data Wrangling and Preprocessing Financial Data Roadmap

Data Preprocessing
Data preprocessing is a part of data wrangling that focuses specifically on preparing data for analysis or modeling. It's often the first step in machine learning projects, and it can include:
- Scaling/Normalization: Adjusting the values of numerical data so they are within a certain range (e.g., between 0 and 1).
- Encoding: Converting categorical data (e.g., "male" or "female") into a numerical format that algorithms can work with (e.g., using 1 and 0).
- Handling Missing Values: Deciding how to deal with data that's missing (e.g., removing rows, filling in missing values with averages, or predicting missing data).
- Splitting Data: Dividing data into training and testing sets for machine learning models.

In short, data preprocessing is about getting the data ready for analysis, and data wrangling is a broader process that involves cleaning and structuring it. Both are important steps to ensure high-quality, usable data for any project.

Thank you
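
The preprocessing steps listed above (handling missing values, encoding, scaling, and splitting) can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not the lecture's own code: the loan table, its column names, and the helper functions preprocess and split_train_test are all hypothetical, and mean imputation and min-max scaling are just one choice among the options the notes mention.

```python
import pandas as pd

# Hypothetical loan records; every column name and value is made up for illustration.
raw = pd.DataFrame({
    "income": [52000.0, None, 61000.0, 48000.0, 75000.0],
    "loan_amount": [10000.0, 15000.0, 12000.0, 8000.0, 20000.0],
    "gender": ["male", "female", "female", "male", "female"],
    "defaulted": [0, 1, 0, 0, 1],
})


def preprocess(df):
    """Return a cleaned copy of df; the input frame is left unchanged."""
    out = df.copy()

    # Handling missing values: fill missing incomes with the column average.
    out["income"] = out["income"].fillna(out["income"].mean())

    # Encoding: convert the categorical column to 1 and 0.
    out["gender"] = out["gender"].map({"male": 1, "female": 0})

    # Scaling/normalization: min-max scale numeric columns into [0, 1].
    for col in ["income", "loan_amount"]:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)

    return out


def split_train_test(df, test_fraction=0.2, seed=0):
    """Splitting data: randomly hold out a fraction of the rows for testing."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


clean = preprocess(raw)
train, test = split_train_test(clean)

print(clean)
print(len(train), len(test))  # 4 training rows, 1 test row
```

In a real project, the fill value and the scaling range would normally be computed from the training set only and then reused on the test set, so that no information from the test data leaks into the model.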
