Introduction to Data Science PDF

IT2302

INTRODUCTION TO DATA SCIENCE

Overview (Campbell, 2021)

Let us first understand what data science is before we delve into its various aspects. In simple terms, data science applies mathematics and statistics to raw data or information to obtain useful and meaningful insights and trends. Using programming, business, and analytical skills, you can process and manage the data set. This sounds tough, and most people do not know how to work with data science or how to develop the necessary skills effectively.

Data science has its roots in statistics, but the field today is a combination of programming, business acumen, and statistics. Learning more about each of these topics will give you an idea of how to approach the learning process.

The art of finding hidden insights and trends in a data set goes way back. Ancient Egyptians analyzed census data to help them collect taxes efficiently. They also used data analysis to forecast when the Nile might flood. Learning from past data to identify trends or insights helps a business make informed decisions.

Regardless of the industry, every company is looking for ways to manage and store large volumes of data. This was a challenge for most companies until 2010. The introduction of Hadoop (a software framework for distributed storage and processing of big data) and other platforms has given organizations an easier way to store large volumes of data. Now, companies can focus on methods and solutions to process that information. This is the work of data science, which is why it is often described as the future of technology.

Data science is a mix of numerous algorithms, tools, principles, and languages used to identify the hidden patterns among the variables in the data set. This may lead you to wonder how this is different from what has been done with data for years.
The answer is that earlier, we could only use tools and algorithms to explain the variables in the data set, but with data science, it becomes easier to predict outcomes. A data analyst uses a historical data set only to explain what is happening in the present. A data scientist, on the other hand, does not only look at the data to obtain insights. They also use advanced algorithms to estimate the probability that an event will occur, and they look at the data from various angles.

Data science is used to make informed decisions based on predictions made from the existing data set. It is centered on building, cleaning, and organizing data sets. Data analytics, on the other hand, pertains to analyzing data to answer questions, extract insights, and identify trends. This is accomplished using different tools, techniques, and frameworks that vary depending on the type of analysis being conducted (Stobierski, 2021). Thus, you can apply several kinds of analytics to a data set to obtain information. We will discuss these briefly in the sections that follow.

Analytics

Analytics is the systematic investigation of data or statistics. It is used to discover, interpret, and communicate meaningful patterns in data.

Predictive causal. If you want to develop a model to predict the possible outcomes of a future event, you need to use predictive causal analytics. Let us assume you work for a credit company, and you lend people money based on their credit. You will be concerned with your customers' ability to repay the amount you have lent them. You can develop models that perform predictive analysis on their payment history. This can help you determine whether a customer will pay you on time.

09 Handout 1 *Property of STI

Prescriptive. You may need a model that makes the required decisions and modifies its parameters based on the data set or question.
To do this, you need to use prescriptive analytics. This form of analytics is about providing the right information to make an informed decision. You can also use it to predict a range of associated outcomes and prescribe actions. An example is a self-driving car. You can run numerous algorithms (an algorithm is a procedure or formula for solving a problem, based on conducting a sequence of specified actions) on the data collected from the cars and use the results to make the car more intelligent. This makes it easier for the car to decide when to turn, slow down, or speed up, and which direction to take.

Data science and machine learning are both popular buzzwords today. These two (2) terms are often thrown together but should not be used interchangeably. Data science includes machine learning, but it is a vast field with many other tools.

Machine Learning

Machine learning is a group of computational algorithms that perform pattern recognition, classification, and prediction by learning from existing data.

Make predictions. Numerous machine learning algorithms allow you to make predictions using unstructured, semi-structured, and structured data sets. Let us assume you work for a finance company and you have transactional data available. You need to develop a model to determine the trend of future transactions. To perform this analysis, you need to use a supervised machine learning algorithm, which trains the machine on an existing data set. You can also use supervised machine learning algorithms to develop and train a model that detects future fraud based on historical information.

Pattern discovery. Not every data set has variables you can use directly to make the necessary predictions. This does not mean that predictions are impossible. There is a hidden pattern in every data set, and you need to find those patterns to make the required predictions.
To do this, you need to use an unsupervised model, since the data set has no pre-defined labels you could use to group the variables. One (1) of the most common algorithms used to identify patterns is clustering. Let us assume you work for a phone company, and you are tasked with identifying where to set up cell towers in an area to establish a network. You can use a clustering algorithm to identify tower locations that ensure every user in the area receives the optimum signal strength.

Why Use Data Science? (Campbell, 2021)

In the past, organizations managed small volumes of data. It was easy to analyze and understand the data and the relationships within it using business intelligence tools. Most traditional business intelligence tools only worked on structured data sets, but most of the data collected today is semi-structured or unstructured. Simple business intelligence tools cannot process this type of data, especially since large volumes of it are collected from many different instruments. For this reason, we need advanced and complex analytical algorithms and tools to process, analyze, and draw insights from the data.

This is not the only reason data science has gained popularity. Let us look at how data science is used in different domains:

Customer Service. How great would it be if you could know exactly what your customers want? You can use existing data, such as purchase history, browsing history, income, and age, to learn more about your customers. You may have had this data in the past as well. Now, using different mathematical and statistical models, you can work effectively with large volumes of data and identify the right products to recommend to your customers.
This is a great way to bring more business to your firm.

Self-Driven Cars. How would you feel if your car could drive you home? Numerous companies are trying to develop and improve self-driven cars. The cars collect live information from various sensors, such as lasers, radars, and cameras, to create a map of the surrounding environment. The algorithm in the car uses this data to decide when to speed up, slow down, park, stop, overtake, and so on. These algorithms are often machine learning algorithms.

Predictions. Let us now consider how you can use data science in predictive analytics. Consider weather forecasting. Data collected from aircraft, satellites, radars, ships, and other sources is analyzed to build the required models. You can use these models to predict the occurrence of natural calamities. Using this information, you can take the necessary measures to save lives.

Who is a Data Scientist? (Campbell, 2021)

If you look up "data scientist" on the Internet, you may come across numerous definitions. A data scientist uses data science to answer business questions and concerns. The term was coined when people recognized that a data scientist uses data, various mathematical or statistical functions and operations, and other scientific fields and applications to make sense of the data in the database.

Functions Performed by Data Scientists

Data scientists crack various data problems using their expertise in specific scientific disciplines. They work with different mathematical, statistical, and computer science elements, but they do not necessarily have to be experts in all of these fields. They use various technologies and methods to develop the right solutions and reach conclusions crucial for the organization's development and growth. A data scientist finds a way to present the data in a more useful form than the raw data available in the data set. They work with both structured and unstructured data.

Differences between Data Science and Business Intelligence (Campbell, 2021)

Before we look at the differences between data science and business intelligence, let us understand these terms better. Using business intelligence (BI), an organization can find insight and hindsight in the existing data set to describe various trends. Through BI, businesses can take data from internal and external sources, prepare that data, and run queries on it to obtain the required information. They can then create dashboards to answer different questions or identify solutions to various business problems. BI can also help businesses evaluate certain future events.

Data science, on the other hand, is a different approach to looking at data. It is forward-looking: besides explaining the information or insights in the data set, you analyze current or past data to predict outcomes. This is one (1) way organizations make informed decisions.

Now that you have an idea of what data science is, let us look at the data science lifecycle. Most people rush into using the models they develop on the data sets without understanding the basics of data science. You need to understand these basics and assess the business requirements before you use a model. Make sure to follow the phases of the data science lifecycle to ensure your results are accurate.

Lifecycle

This section gives you a brief overview of the phases in the data science lifecycle.

Phase One: Discovery. Before you work on the project, you need to understand the following: business requirements, specifications, required or approved budget, and priorities. If you want to pursue a career in data science, you need to possess the ability to ask important questions.
You need to assess whether you have the right resources, people, technology, data, and time to support the work on the project. This phase also involves framing the problem and identifying the initial hypothesis you want to test.

Phase Two: Data Preparation. Once you identify the resources needed for the analysis, you need to develop or identify an analytical sandbox where you can test and analyze the data. Before modeling the data, you need to process, explore, and condition it. You also need to perform extract-transform-load-transform operations to move the data into the sandbox environment. Programming languages can be used to clean, transform, and visualize the data used in the analysis. They help you identify the outliers in the data and develop or identify relationships between variables. Once the data is cleaned and prepared, you can perform different types of analysis on it.

Phase Three: Plan the Model. During this phase, you need to identify the techniques and methods that help you draw the relationships between the different variables in the data set. These relationships will help you determine the algorithms you can use in the next phase of the lifecycle. To do this, you apply exploratory data analytics methods and tools using various formulas and visualization methods. Some of the tools used for this are:

o R: This programming language has various modeling capabilities. It is also a good platform for developing the right models if you are a beginner.
o SQL: This provides a set of methods to perform analysis within the database using different predictive models and mining functions.
o ACCESS or SAS: These tools can be used to access data from various storage platforms, like Hadoop, and use that data to create a reusable and repeatable model.

The market has numerous tools for developing models, but R is commonly used.
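
The exploratory checks mentioned in Phases Two and Three (identifying outliers and looking for a relationship between variables) can be sketched in a few lines of code. The handout points to R for this work; the sketch below uses Python and only its standard library, purely for illustration. The sample figures, the variable names income and loan, and the helper functions iqr_outliers and pearson are hypothetical and do not come from the handout. The 1.5 x IQR rule used here is one common rule of thumb for flagging outliers, not the only one.

```python
import statistics

# Hypothetical sample: income and loan amount (in thousands) for eight customers.
income = [32, 35, 38, 41, 45, 47, 52, 150]
loan = [10, 11, 12, 13, 15, 15, 17, 48]


def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Flag unusual incomes, then measure how strongly income and loan move together.
print("Outliers in income:", iqr_outliers(income))
print("Correlation between income and loan:", round(pearson(income, loan), 3))
```

With this sample, the income of 150 is flagged as an outlier. A correlation close to 1 would suggest a strong linear relationship between the two variables, which is the kind of finding that helps you choose an algorithm for the next phase.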
At the end of this phase, you will have the insights into your data that help you determine the algorithm to use. The next phase is where you apply this algorithm and develop the model.

Phase Four: Build the Model. Now that you have decided which algorithm to use, you must split the data set into training and testing data sets. In this phase, you need to consider the existing tools and determine whether they are sufficient for building the model. Make sure you identify a robust environment in which to run the models. To develop the model, you need to analyze different techniques, such as clustering, classification, and association.

Phase Five: Operate the Model. In this phase, you run the data through the model and deliver the reports and necessary technical documents. You may also need to run the model in the production environment to test whether it works as intended. This gives you an idea of how the model performs on real-time data. You can also determine any constraints in the model.

Phase Six: Communicate the Results. It is important to evaluate whether the model has given you the results you need. You can do this by revisiting your hypotheses. This is the last phase of the data science lifecycle, where you identify the key findings and communicate them to the organization. You judge the results of the model against the criteria you identified in the first phase.

References:

Campbell, A. (2021). Data science for beginners: Comprehensive guide to most important basics in data science. Alex Campbell.
Stobierski, T. (2021). What's the difference between data analytics & data science? https://online.hbs.edu/blog/post/data-analytics-vs-data-science
