Summary

This document covers introductory concepts of data science, explaining its purpose and applications in various fields. It details types of data commonly used in data science, such as CSV and spreadsheets, and describes data access methods using Python libraries like NumPy and Pandas. The document concludes with a brief overview of data visualization techniques.

Full Transcript

Introduction

What is data science? As the name indicates, it is the practice of extracting meaningful insights from data. Let us understand data science using a simple example. Consider a small retail shop near your house. How does its owner know whether the shop is profitable, which items he needs to stock more of, what quantity of stock is required for smooth operation of the shop, and so on? He would note down all the transactions happening in the shop. As it is a small shop, he may also memorize some of the things needed for its operation. He would be able to answer the questions above using his experience and basic mathematical skills. This is basically data science.

Data Science

Artificial Intelligence is a technology which depends completely on data. It is the data fed into the machine that makes it intelligent. Depending upon the type of data we have, AI can be classified into three broad domains.

Applications of data science

Fraud and Risk detection: Financial institutions use data science to predict the chances of bad debts while processing loan documents. AI makes it much easier to analyze customer behaviour.

Genetics and Genomics: Data science can be applied to advanced treatments like gene therapy. As you already know, the structure of human DNA is very complex. Data science techniques allow different kinds of data to be integrated with genomic data in disease research, which provides a deeper understanding of how genetic factors affect reactions to particular drugs and diseases.

Internet search: The Internet has become a part of our lives. We use search engines every now and then for different needs like clearing doubts, shopping, entertainment, etc. Search engines bring us relevant information from every nook and corner of the world in a fraction of a second. This is an example of the use of data science in our daily lives.
Targeted advertising and product recommendations: Data science helps companies decide on the mode, type, location and time of advertising to maximize results. You can see that the advertisements displayed in your browser depend on your buying and search patterns.

TYPES OF DATA

For Data Science, the data is usually collected in the form of tables. These tabular datasets can be stored in different formats. Some of the commonly used formats are:

1. CSV: CSV stands for comma-separated values. It is a simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. Because the values in each record are separated by commas, these files are known as CSV files.

2. Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows and columns into which information can be entered. Microsoft Excel is a program which helps in creating spreadsheets.

3. SQL: SQL, or Structured Query Language, is a domain-specific language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful in handling structured data.

Many other database formats also exist; you can explore them online!

DATA ACCESS

After collecting the data, to be able to use it for programming purposes, we should know how to access it in Python code. To make our lives easier, there are various Python packages which help us access structured (tabular) data inside the code. Let us take a look at some of these packages:

1. NumPy
2. Pandas
3. Matplotlib

NumPy

NumPy, which stands for Numerical Python, is the fundamental package for mathematical and logical operations on arrays in Python. It is a commonly used package when it comes to working with numbers.
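As a minimal sketch of what mathematical and logical operations on arrays look like in practice, the example below reuses the retail shop idea from the introduction. The variable names and the sales figures are made up for illustration and do not come from any real dataset.

```python
import numpy as np

# Hypothetical figures for four items in the shop (made-up numbers).
units_sold = np.array([12, 7, 20, 5])
unit_price = np.array([2.5, 10.0, 1.0, 4.0])

# Arithmetic operations apply element by element, with no explicit loop.
revenue = units_sold * unit_price

# Logical (comparison) operations are also element-wise and give a boolean array.
high_volume = units_sold > 10

# Aggregations such as sum() reduce an array to a single number.
total_revenue = revenue.sum()

print(revenue)
print(high_volume)
print(total_revenue)
```

Here one line of array arithmetic replaces a loop over the items, which is the main convenience NumPy offers when working with numbers.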
NumPy offers a wide range of arithmetic operations, giving us an easier way of working with numbers. NumPy also works with arrays, which are homogeneous collections of data. An array is a set of multiple values of the same datatype. The values can be numbers, characters, booleans, etc., but a single array can hold only one datatype. In NumPy, the arrays used are known as ND-arrays (N-dimensional arrays), as NumPy lets us create arrays with any number of dimensions in Python.

Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Pandas is well suited for many different kinds of data:

1. Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
2. Ordered and unordered (not necessarily fixed-frequency) time series data
3. Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
4. Any other form of observational / statistical data sets. The data need not be labelled at all to be placed into a Pandas data structure

The two primary data structures of Pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries.
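The following sketch shows one way the CSV format and the two Pandas data structures fit together. The CSV text, column names and values are hypothetical, and an in-memory string stands in for a file on disk so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV text: one record per line, fields separated by commas.
csv_text = """item,price,quantity
pen,1.5,10
notebook,3.0,4
eraser,0.5,25
"""

# read_csv accepts a file path or a file-like object; StringIO plays the file here.
df = pd.read_csv(io.StringIO(csv_text))

# A DataFrame is 2-dimensional, and its columns can have different types.
print(df.dtypes)

# Selecting a single column gives a Series, the 1-dimensional structure.
prices = df["price"]

# The values of a numeric column can be obtained as a NumPy array with one dtype.
price_array = prices.to_numpy()

# Column arithmetic works element-wise, much like NumPy arrays.
df["value"] = df["price"] * df["quantity"]
print(df)
```

With a real file, the same call would take a path instead, for example `pd.read_csv("sales.csv")`, where `sales.csv` is an assumed file name.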
Here are just a few of the things that Pandas does well:

1. Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
2. Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
3. Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
4. Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
5. Intuitive merging and joining of data sets
6. Flexible reshaping and pivoting of data sets

Matplotlib

Matplotlib is a multi-platform data visualization library in Python, built on NumPy arrays, for 2D plots of arrays. One of the greatest benefits of visualization is that it presents huge amounts of data in easily digestible visuals. Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make correlations. They are typically instruments for reasoning about quantitative information.
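As a rough illustration of plotting with Matplotlib, the sketch below draws a line plot and a bar chart from the same small array. The monthly figures and the output file name `sales_plots.png` are assumptions made for this example, and the two plot types are chosen only as common cases, not as the list from the original material.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display window is needed

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical monthly sales figures (made-up numbers for illustration).
months = ["Jan", "Feb", "Mar", "Apr"]
sales = np.array([120, 135, 90, 160])

# One figure with two side-by-side axes.
fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 3))

# A line plot makes the trend over time easy to see.
ax_line.plot(months, sales, marker="o")
ax_line.set_title("Line plot")
ax_line.set_ylabel("Units sold")

# A bar chart makes it easy to compare individual months.
ax_bar.bar(months, sales)
ax_bar.set_title("Bar chart")

fig.tight_layout()
fig.savefig("sales_plots.png")  # assumed file name; written to the working directory
```

In an interactive session, `plt.show()` could be used instead of saving the figure to a file.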
