introduction-to-data-science.pdf
Document Details
Uploaded by ProfuseNirvana
Università Cattolica del Sacro Cuore
Tags
Related
- ICT922 Digital Transformation and Cloud Computing Lecture 1 PDF
- Learning And Information Technology - Computing Essentials (PDF)
- Lesson 1 Overview of Information and Communications Technology PDF
- Chapter 1: Information Technology, the Internet, and You PDF
- Information Technology, the Internet, and You (Chapter 1) PDF
- Information Technology PDF
Full Transcript
lOMoARcPSD|11350337 Introduction TO DATA Science introduction to data science (Università Cattolica del Sacro Cuore) Studocu is not sponsored or endorsed by any college or university Downloaded by Chiara Davol...
lOMoARcPSD|11350337 Introduction TO DATA Science introduction to data science (Università Cattolica del Sacro Cuore) Studocu is not sponsored or endorsed by any college or university Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 INTRODUCTION TO DATA SCIENCE CHAPTER I INTRODUCTION Lesson 1 14/02/2023 What is data science and why we study it? To understand what data science is, we have at first to understand what data is. Data items refers to an elementary description of thing events, activities and transactions that are recorded, classified and stored but are not organized to convey any specific meaning. For example: These are data, pure data doesn’t provide any knowledge. Information instead is data after having organized and process the data itself. From raw data… to information… We are in the era of ‘’Big Data’’, Big Data is a huge amount of data, typically this term is use also to explain what we can do with this amount of data: we can predict things and understand our reality. Why in the last few years the amount of data produce has increase so much? Where does this data come from? For example: if you buy something in a physical shop, basically no one keep track of that, and physical shop was the standard until few years ago, if you buy something in an online shop (like amazon) data is produced: the fact that you bought something is track even your age and gender are tracked. Not only we produce data, but we also consume a lot of data, for example liking a video on Instagram both consume data and produce data. Having this amount of data allows us to do many things, but this scenario also have some risks: The main risk is about Privacy, when we talk about privacy is not just the personal information, the one that are shared directly (ex. When you register in a web site, you provide first name, last name, email, phone number... if there is a privacy violation someone can know your phone number), also with the information provided someone can Infor other information that were not directly provided. The five steps of data science: 1. Asking an interesting question 2. Obtaining the data 3. Exploring the data 4. Modeling the data 5. Communicating and visualizing the result Explain this five-steps using Netflix example: with Netflix there’s a trial period (30 days) after that you have to pay, can I predict after a few days of trial if this user will become a register user? one thing you can do is analyzing the data about people having in the Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 database and compare the data related to non-subscribers users, our systems need to track all the data we need, we have to explore the data, so maybe creating classes of users, like people that watch TV series typically are more tempt to subscribe, respect people that watch more movies. So, you have to extract some knowledge from the data, then you model the data, After have look to our data you will realized that there are three features: - Age - Country - Number of hours those users watch Netflix during the trial You can predict if this user will subscribe or not, then you create a model that having in input this data can predict if the users will subscribe or not. Where the information is? We can say that the information is inside the information system, we can define an organization’s information system as the network of people, hardware, software and procedures needed by the organization. CHAPTER II RELATIONAL DATABASES Lesson 2 17/02/2023 A database can be defined as an organized collection of data, used to represent the set of information useful for the information system. Difference between database and DBMS: A DBMS (Data Base Management systems) is a software which provides user and other applications the access to a database, giving the tools for (mainly) viewing, adding, deleting and modifying the collection of data. Between the users and the DBMS, you can have also applications, which is any piece of software in terns use the DBMS, for example: e-commerce website, the site you use to search for any product is the application, the website in terms uses the DBMS, which in terms look at the data in the database, the same source of information, can be used by many different applications. Each database has several tables, so the whole idea about relational databases is that you store information into tables, a table in the context of relational databases is called a relation (table relation) For example, ID_customer can be Primary key that it’s a field in a table that allows you to uniquely identify each row that are called records or tuple (row tuple or record) and columns are called fields or attributes (columns fields or attribute) But for example, firstname_customer it can’t be a primary key because there can be people with the same name. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 The two foundations of the relational model are first in primary key and the second one in foreign keys, is the constraint I put on another field saying this fields must contain only values coming from the primary keys of another table, because, for example ID_customer and ID_brand in this case can be considered as foreign keys. In database design we try to represent information only once because if you need to update an information, you have to update only in one piece of database. DBMS have some properties: 1. Data integrity, we want to preserve the integrity, we want to avoid as much as possible mistakes of our data. 2. Access authorization, the DBMS in order to access a database asks you to provide credentials, typically username and password, otherwise you cannot access, based on these credentials, the data you can see, but also the operation you can do on the data itself might change. 3. Concurrent access control, the idea here is that especially when you have a lot of users and you have an application using a database that is very intense from the point of view of the user access, you need concurrent access control, basically you have to handle situation where 2 or more users are trying to access the same piece of information and maybe they trying to modify that piece of information, this can cause a situation of conflict (for example two users that are trying to buy concert ticket in the main time). For example, there are the so-called locking mechanism, if I am modifying a record of my database the DBMS lock the record so the other users cannot modifying while I’m doing it, but in general we are talking about any mechanism that allows to handle this concurrent access. Lesson 3 21/02/2023 So, a relational model is a data model provides structures to represent data, a very popular data model is the relational model, based on the concept of relation, a relation can be represented as a 2-dimesional table, typically a relational database is composed by several table linked to each other through the values of some fields. Example: This Is a simplifying database used by a university to register an exam. 1. Students table 2. Courses table 3. Exam table Primary keys of these three tables are: For students table ID For courses table Code For exam tables student and course Student and course in exam table are foreign keys. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 The primary key it can also be compose about 2 or multiple fields, so the primary key here would be a combination of student and course (student + course) allows me to identify each single records in the exams table we won’t have duplication, is it true that we won’t have duplication only if the student accept the vote, I the student don’t pass the exam or don’t accept the vote a good primary key become (student + course + date). Is not the name of the fields that makes the foreign keys, like student and ID, they don’t necessarily to be the same name, you have a so-called referential integrity constraint between the two fields. Every time you have an information that is unknown or unavailable you have a special value that is called the NULL value, we need it because sometimes you don’t have all the information about something. Database integrity constraints we have 3 categories: 1. Entity integrity constraints Is the primary keys constraints, this constrains requires there is no duplicate record within a relation (you cannot have 2 records that are completely equal in the table), and the existence of a field or a set of fields which uniquely identify each records, that fields must therefore be unique but also not NULL. How we represent relations and fields on a piece of paper? Example: - Courses (code, name, credits) - Invoices (number, year, date, customer) Between parenthesis are listed the fields of the table separated by comma, the primary keys must be underlined, in case you have to underline two primary keys you underline both fields (number + year). IN EVERY SITUATION WE DON’T HAVE MORE PRIMARY KEYS BUT ONE PRIMARY KEYS COMPOSED BY 2 OR MORE FIELDS. 2. Domain integrity constraints It’s a set of rules we might apply to our database to preserve integrity, second class of rules, here we are setting rules for its specific fields. For example here I could set a rule saying the grade field must contains values in the range 18-30. If by mistakes someone try to register a clearly wrong grade like 33 the system will throw an error. We are going to limits the possibility to have error, the one that we mention in the example is a custom domain rule. But we have other rules for example, the not NULL constraint, in which for the field we are considered, the null value is not allowed, which by the way this is by default with primary keys, but also other fields can have the not NULL constraints. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 The length, for example, you can say that this fields must contains 50 characters, and finally we have the data type, is something you set for each fields that in some way allows you to define the big domain the values are coming from. 3. Tuple integrity constraints This are constraints that instead of being verify on the single fields, they are verify on the tuple level. 4. Referential integrity constraints this is the most important, it requires that the value of an attribute of a relation exist as a value of the primary key of another relation. For example we have a referential integrity constrains between students in exam table and ID in students’ table. Any values in the students fields must be available must be available in the ID fields. Referential integrity constraints are like the Drop-down menu when you are in a form select the value from a menu, you cannot write for some fields any values, but just the one choosing from a list. In our example, basically there is 1 field, student fields in the exams table and for that field we are constraint, we cannot set here any free value, we can only set value from the ID fields in the student table. Everything that I have in the student field must be available in the ID field, because otherwise I will refer to a student that does not exist. ID primary key Students foreign key But is not correct the opposite: everything that I have in the ID field must be available in the student field, why? There is no constraint about the fact that a student must take an exam. Referential integrity cnstraints example: - Courses (code_course, name, credits) - Students (code_students, first_name, last_name, email) - Exams (code_exam, course, student, date, grade) Primary key code exam which is an additional field We set the following referential integrity constraints: 1. Between exams.course and courses.code_course 2. Between exams.student and students.code_students These dot (.) means the name of the table and field name, so exams. course is like saying the field course in the exams table. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 RECOMMENDER SYSTEMS Recommender system are one of the applications of data science, the concept of Information, is important because nowadays we have a lot of information, and what we can do with information? One of the things that we can do is to implemented Recommender systems. Recommender systems are used every day, is what today is called ‘’algorithm’’ but is not properly correct called it algorithm because this one is a much more generic term. Amazon was one of the very first company implementing recommender systems, because even many years ago on amazon you got in the home page suggestion about things to buy. We can have for example: music recommendation systems, shopping recommendation systems. Recommendation system are always work? No, but the more information they have the better they work, still there is rules for improvement. There is a big trade off here: because on one side the more information you give about yourself to the platforms the better will be the recommendation, on the other side you are sharing your personal information, so your privacy. there’s an ethical open question that is about what we called the ‘’filter bubble’’ so the more the content you get is filtered by a very struct algorithm the more you will see thing that are in your comfort zone, and you don’t try something different that you may be like or not. The amount of content we have available is so huge that without a proper filter is nowadays impossible to use this platform, for example the very first version of Facebook, the one of 2007/2008, where’s there was no algorithm, but was not so hard to use without algorithm because people had 30 max 40 friends on Facebook, it was not so popular, nowadays maybe you follows thousands of people on Facebook and Instagram, so there’s the need of an algorithm that filter in some way thing for you. DATA AND INFORMATION Data items refer to an elementary description of things, events, activities and transactions that are recorded, classified and stored but are not organized to convey any specific meaning. Information refers to data that have been organized so that they have meaning and value recipient. So, information is data after having been processed (or organized, structured...) while data alone is useless information produces knowledge. We are in the era of ‘Big data’, which means that we have a lot of data right now, it is also used as referring to the activity you can do on the data to recognize patterns and produce knowledge, an example of big data in this second assertions is sentiment analysis, for example apple can do sentiment analysis related to a specific model of iPhone that is launched to see the general sentiment people have about that product positive or negative sentiment. The whole concept of data science is extracting knowledge for data, we care about extracting knowledge to predict the future, I can see trends, I can see what is happening in a few hours in a few days or in a few months or at least I can try, this was very huge during the pandemic when people try to predict trends to understand what to do. Do these things always work? No, we cannot predict future with an accuracy of 100%. Data science involves skills in three areas: - Computer programming - Statistics and math - Domain knowledge Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 THE FIVE STEP OF DATA SCIENCE 1. Asking an interesting question 2. Obtain the data 3. Exploring the data 4. Modeling the data 5. Communicating and visualizing the results For example, how Netflix will be able to know if a user will renew the subscription every month? Taking information as: the age, the number of hours spent on the platform during the free trial, the gender, the country. We could analyze all the data and see if there is a correlation between these data and the subscription every month. We want to build a machine learning classifier, for example let’s say that the two feature we decide to use are: - Age of the subscriber 34 - Number of hours watching Netflix 12 According to these features the black box give the answer 0 or 1, 0 is no, she doesn’t become a subscriber, 1 is yes, she will. In data science another important thing is data visualization, once you have the data and you have analyze the data the way you show the data thought charts and graphs can have a huge impact especially if you want to report data to people, like managers, who have limited amount of time available if you are able to summarize concepts visually in an effective way, that obviously is a plus. We can define an organization’s information system as the network of people, hardware, software and procedures which produces and manages the information needed by the organization. DATABASES A database can be defined as an organized collection of data used to represent the set of information useful for the information system. A database management system (DBMS) is a software which provides users and other applications the access to a database, giving the tools for (mainly) viewing, adding, deleting and modifying the collection of data. We have to note that database and DBMS are often confused, so we can hear for example that MySQL is a database, while is actually a DBMS. DBMS, access to databases is usually provided through query languages (for example SQL), there are some properties provided by data base management system: - Data integrity, our data must be correct, this can be misleading, because this constraint, that we are going to study, cannot prevent 100% the mistakes that people can do with the data. - Access authorization, when you access a DBMS typically provide you the so called ‘’credential’’ (username and password) this credential allows you to access the database. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 - Current access control, many users are trying to access the same piece of information in the data base at the same time. The RELATIONAL MODEL is the model we use to represent information as data, inside a database is based on the concept of relation (a relation is a table, and a database is composed by several tables). Tables has a primary keys and foreign keys, primary keys which is the field that uniquely identify each record. Foreign keys are for example in the exams table; course and student, because we have same data in the first table under ‘’code’’ (course code) and in the third table under ‘’ID’’ (student ID). Sometimes our tables are always full of data, you might have a value that is unknown or unavailable value so every time you ended in this situation, the relational model has a very specific concept to denote that and is the ‘’NULL value’’. The DATABASE INTEGRITY a DBMS provides some properties including data integrity (physical integrity and logical integrity), to preserve logical integrity, we have integrity constraints: 1. Entity integrity constraints This is about the primary keys, we don’t want to duplicate records within a relationship, and therefore, we need a field or a set of fields that uniquely identify each single records. That field (primary key) must be unique and not null, so the two properties of a primary key are: 1. no duplication, 2. Not null, it means that a primary key cannot get the null value. Each relation can have just one primary key and in the schema is represented underlined. A primary key can also be composed by set of fields that together are the primary keys, but we cannot say that we have two primary keys, we can have only one primary key, is only compose by more fields. 2. Domain integrity constraints 3. Tuple integrity constraints 4. Referential integrity constraints Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 With a referential integrity constraint, I’m constraining the value of a column saying the value I have in this column requires to be available as a value in another column. Referential integrity constraints are like drop down menu. It requires that the value of an attribute (or set of attributes) of a relation exists as a value of the (primary or candidate) key of another relation. We have set a referential integrity constraint between student (in exams) and ID (in students). Mistakes that we can do when we have referential integrity constraints: 1. We think that there is also an additional referential integrity constraint the other way around instead the referential integrity constraints is always one way. 2. We define the referential integrity constraints the other way around, so we actually mean that the value in the student field must be available in the ID field but instead we wrote the other way around. Referential integrity constraints example, we have the following databases: Course (code_course, name, credits) Students (code_students, first_name, last_name, email) Exams (code_exam, course student date, grade) We decide to have as primary key code_exams an additional field that identify each single exam. We set referential integrity constraints between: Exams.course and courses.code_course Exams.students and students.code_student ‘’.’’ It means name of the table, field name, so exams.course is exactly like saying the field course in the exam table. So here I can say that there is a referential integrity constraint between the field course in the table exams and the field code_course in the table courses. The second one means that there is a referential integrity constraint between the field student in the table exams and the field code_students in the table students. In practice this mean that the field ‘’course’’ in the table ‘’exams’’ must be equal to the field ‘’code_course’’ in the table ‘’courses’’. Let’s formalize a bit: the terms ‘’being equal’’ doesn’t mean so much because the field is not equal to another field because we have several values/records, so is more correct to say that: the value we have in the course field must be available in the primary key code course of the courses table. Is like saying in the table ‘’exams’’ for the field ‘’course’’ we don’t have a normal text box but we have a drop down menu we can only choose courses available in the courses table any other courses we try to insert are wrong and we are not allowed to do it. Why doesn’t the opposite referencing integrity constraint make sense here? Since we might have courses for which no students taken exam yet then the constraint wouldn’t be correct, because the constraint between courses.code_course and exam.course means any value I have in code course here must be available in the exam table (but this is not true). Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Schema of database is the name of the tables, the name of the fields or our constraints. Instance is the actual data we have. Operations that can violate the referential integrity constraints at first we have to focus on one of the two integrity constraints in the example, let’s focus on the first one between ‘’exam.course’’ and ‘’courses.code_course’’, how can I violate this constraints? We have 4 main class of operations on a database and typically they are summarized with the acronym CRUD, create (I adding a record), Read (I read records), Update (I modified records) and Delete (I delete records) we can see for each of these operations how they can violate our constraints. except for reading which is the only class that cannot produce a violation the other can instead produce a violation: First big class of operations create: I violate the constraint if I insert a new tuple/record in exams with a value for course that does not exist in code_course. This is our initial instance; I’ll try to insert a new tuple having course 500, grade 22, date 22/02/2021 and student 100342, this is a violation because 500 does not exist in courses. The result of this operation is that the DBMS doesn’t allow it. Second class of operations update: if the update is done in the exam table is very similar, if I try on the exams table to modify the record with code_exam 3 changing the course from 335 to 336, the result is that it cannot work and we get an error and nothing will change. Third class of operation is delete: what happen if I delete the tuple/record having code_course 200, so mathematics? we can have three different result depending on how I set my referential integrity constraints, the three option are: Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 1. Restrict 2. Cascade 3. Set NULL Restrict if you try to delete mathematics this will not happened, and the database remains exactly how was before, no changes. Cascade it means that on courses the system lets you delete the course however automatically the system deletes the exams related to that course. A we can see the mathematics course missing but the related number in exams miss. In this way the referencing integrity constraints is preserved. Set NULL the course Is deleted automatically the system instead of deleting the exams it keeps those exams but for the course field it set the values as NULL. Is like saying we don’t know more the course that these exams are related to. Going back to our ‘’update’’ class of operations, we can add an additional situation that could arise: We have the modification of the tuple having code_course = 200 (course table) and we have setting code_course = 205, and similarly on wat we said before we can have 3 options: - Restrict, no modification is allowed. - Cascade, the system allows me to modify. - Set NULL CHAPTER III DATABASE DESIGN How a database is design? How can we design a database? INDEXES Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 An index is a data structure we use to make some operations on our database faster, in particular with indexing we increase the speed of the select commands, because we have this additional data structure order by a particular field where we have for each record, we have a link that tells us where exactly the record is in the databases storage. Indexing in databases works. So SELECT OPERATIONS are faster, but INSERT, UPDATE and DELETE operations are slower if you add indexes, because every time we add modified delete records we have to rebuilt the indexes, so it takes time, so the more indexes we add the slower will be those operations. On which field we should define an index? If we have fields that appear very often ‘’where clauses’’ (SELECT statements) it means that very often we do search on those fields so it makes sense to create an index on that field. So if we have fields in where clauses and those queries are executed often, then we ant to add that field as an index, for eg. We have the table customer with several fields including age and for some reason we never search customer by age, so it doesn’t makes to add an indexes on age if we never search by age, we had only disadvantages because anyway insert, update and delete operations are slower. Also fields that appear in ‘’order by clauses’’ (SELECT statements) that’s another candidate to because an index. And finally, fields which appear in ‘’on clauses’’ (SELECT statements with join). However, for primary key formally we always have an indexed already set, so the DBMS when you declare a field as primary key that’s an index. We can have Standard index and Unique index it imposes that we don’t have duplication for that field We should go through THREE STEPS METHODS starting from our information need going through the so-called CONCEPTUAL DESIGN , LOGICAL DESIGN and PHYSICAL DESIGN. EXAMPLE 1: The very first step is the CONCEPTUAL DESIGN, using this model called the ER model (the one in the figure) the Entity and Relationship model we describe in a very abstract high-level case, our use case. Let’s says that in our example we want to represent the information related to a Championships so to a group of teams (football teams or any other sports) competing for a price for example the Serie A in Italy. We have the entities, in the example PLAYER and TEAM, each entity represents a class of objects having similar characteristics, we detect which are the entity in our use case (player and team) we describe each entity using some attributes, for example: - Code, first name and last name PLAYER - Code and name TEAM The two attributes wrote in black (code in player and code in team) are very similar to the concept of primary keys we see in the case of recommender system, in the ER model in Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 fact they are called keys which is an attribute that can identify uniquely each instance of the entity. The word PLAY, which is between the two entities, is called a relationship that is something that links tougher the entities, in our example player and team, we have to think about relationship as something that take one object from the entity A and one object from the entity B and create a payer. Think about Player and Team as a set we take two objects: one from player and one from team and we create a payer through the play relationship. (1, 1) and (1, N) are the so-called cardinality (minimum cardinality, maximum cardinality) and it means: each player in how many teams can play minimum 1 and maximum 1. The only two values available for the minimum cardinality are: - 0 a player minimum play in zero team, which means hat is not compulsory that a player plays in a team. - 1 which means that is compulsory that a player plays in a team. For the maximum cardinality the two option we have are: - 1 - N it means many. PLAYER SIDE: the question is: ‘’the player can play in just one team or in many teams?’’ ‘’just one’’ at least if we are considering only the current situation. TEAM SIDE: the question is ‘’for each team how many players we have minimum and maximum?’’ ‘’minimum we have 1 player maximum N each team can have many players’’. EXAMPLE 2: Suppose we have many courses, and we can enroll in this course, respect to the previous example, here we have an attribute also on the relationships (DATE): which is the date, a particular students enroll a particular course. Schema for ER MODEL: ENTITIES RELATIONSHIP, it creates a PAYER. KEYS ATTRIBUTES CARDINALITY (MINIMUM: 1 or 0, MAXIMUM: 1 or N) I cannot put the attributes ‘’date’’ in the entity ‘’students’’ or in the entity ‘’course’’. About the cardinalities, STUDENT SIDE (0, N): - 0 is the minimum number of courses a student can enroll in, in this case the student is allowed to enroll minimum 0 courses. - N is the maximum number of courses a student can enroll in, the student can enroll many courses, as far is more than 1, it doesn’t if it’s 5, 10, 100 we put N. COURSE SIDE (0, N): - 0 is the minimum number of students which can enroll a course, in our database we accept the fact that the courses is available but not students have enroll that. - N is the maximum number of students which can enroll a course, in this case many students can enroll in each course. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Once we set minimum and maximum cardinality then we focus just on the maximum and we can have 3 possible cases: 1. ( /, N) ( /, N) is called a MANY TO MANY RELATIONSHIPS. 2. ( /, 1) ( /, N) is called a ONE-TO-MANY RELATIONSHIP, it doesn’t matter if the one is on the right or left side. 3. ( /, 1) ( /, 1) is called a ONE-TO-ONE RALATIONSHIP. EXAMPLE 3 We represent our department in our company with our employees, we want to represent the fact that some employees are directors of a department, for example we have the marketing department, the research and development department and so on, and we have many employees, some of them are director of a department. EMPLOYEE SIDE (0, 1) 0 means we don’t have any constraint about the fact that an employee must direct a department, we have some employees not directing a department. In other words, we can say that: ‘’the participation of the employee to the relationship heading is optional’’. 1 means: ‘’for each employee how many departments the employee can direct maximum?’’ ‘’maximum 1, a person cannot be the directors of 2 departments’’. DEPARTMENT SIDE (1, 1) ‘’how many employees are directing each department?’’ or ‘’for which department how many employees we have?’’ ‘’minimum 1 we cannot have department without director and maximum 1.’’ So, the CONCEPTUAL DESIGN is a formal representation of the real-world problem, DBMS-independent (this means that it doesn’t matter the DBMS we will use MySQL, PostgreSQL and so on, the ER model is so abstract that can work with any DBMS, is very high logical representation) we produce a conceptual schema using the ER model. After the conceptual design we can go through the LOGICAL DESIGN once we create conceptual schema then applying some rules, we can translate that schema into a logical schema, so in a set of tables (the ones that we have seen in the previous slides). The representation is still independent from physical details, but the model used is available in DBMS. Finally, we create the PHYSICAL DESIGN once we have our logical schema, we can take the tables and implement them into a real database with a software that allows us to create the table and set some parameter. ENTITY-RELATIONSHIP MODEL (ER MODEL) We have many constructs that compose our model: Entity, Relationship, Attributes, Cardinality, Key and Generalization. ENTITY can be divided in : - An entity type is a class of things having common properties and independent existence, for example student and course. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 - An entity occurrence is a uniquely identified object of an entity type, for example the specific course ‘’information systems’’ occurrence of the course entity. RELATIONSHIP is the link among two or more entity types for example enrollment (among students and courses). A relationship occurrence is a uniquely identified association among entity occurrence, for example student john with ID 35876, enrolls in the ‘’information systems’’ course, this specific enrollment is a relationship occurrence. ATTRIBUTES are elementary properties which describe entities and relationships, for each entity or relationship occurrence, a value belonging to a domain is associated to each attribute, for example ‘’ID number’’ or ‘’last name’’ can both be student attributes. CARDINALITY describes minimum and maximum number of relationship occurrences to which an entity occurrence can participate, for example student can enroll in 0 or N courses, where N is the maximum number of courses a student is allowed to take. What we often need to know is: - Minimum cardinality: 0 (is optional for the entity to participate to the relationship) or 1 (is compulsory)? - Maximum cardinality: 1 (we have maximum one) or N (no limit) ? Looking at the relationship at the maximum cardinality we can derive three relationship types: One to one, One to many and Many to many. FEOM CONCEPTUAL SCHEMA TO LOGICAL SCHEMA: PHYSICAL DESIGN Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 We take the logical schema and implement in real database, in particular for each field we have to choose a data type, for example this are data type we have in MySQL: Here we have 3 big classes: numeric (number) , data & time and string (means that you can store everything in there: letter, numbers …). How choosing one data type or the other might impact on our database and the applications the database we will use? Why should I choose numeric or string? Choosing the right format allows me to execute some automatic validation on the values, so if I expect numbers on a field, it’s correct to choose a numeric field otherwise the DBMS would take even letters. The second thing we have to decide when we do the physical step, is defining on a field if we have the indexes, an index is an additional structure the DBMS creates in addition to the table where the records are sorted by that particular field, so it’s easier to reach the record by approximation, basically Indexes allows us to select information faster when we search for something. Since when we search for something having an index can increase the performance, so, why we don’t put an index in every field? For two reasons: 1. For each index we need a storage, so we need more storage space. 2. Every time we create, modify or delete data in table the index needs to reflect the new situation, and so need to be reorganize. That’s why if one side select operations are faster, insert update and delete are slower. Note that indexes are defined on one or more attributes: if we define an index on ‘’first_name’’ just a search on ‘’first_name’’ can benefit from that index. NETFLIX EXAMPLE CONCEPTUAL SCHEMA First things we have to do when we start a new database is check which are the ENTITIES (class of objects having common properties), so for example Netflix entities are: - Content: title, director, genre, category, languages, ID - Actors: country, ID, name, genre, birthday Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 This is a many to many relationships (N, N). Now, if we might want to represent the fact that: a particular actor plays in particular movie a particular role, we have to represent this additional point in ‘’PLAY.’’ Now we add a new entity: - Users: ID, first name, last name, credit card, subscription, email, password, country But the schema depends on what we want to represent: If it’s enough for us to represent the fact that users, for example, Anna played a specific content, this representation works, however, if we want to represent the fact that Anna play that content for example 1 month ago and then again yesterday, this representation is not enough, because in the entity relationship model a relationship is not allowed to represent the fact that the same payer of object is duplicated. In other words: - If we want to represent in our schema if a user at some point in the history watched that content, this representation is fine. - If we want to represent the possibly multiple times a user plays that content, a relationship doesn’t work. A relationship technically is a payer of object, 1 content 1 user, we cannot have another payer with the same object. So now, we create a new entity: - Stream: date time start, date time end, ID Stream basically represents each act of streaming, for example, it represents the fact that at some point, someone click on the play button on a content, and we will link that with both content and users. So, when we have this kind of situation, we create a new entity, we link that with the other two with two relationship that for sure there will be One-to-Many (1,N). Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 There’s something we are missing, which is TV series, because we have episodes, we might need to create a new entity: - Episode: number of episodes, id, title, season, duration which is link to content, you need to know which episode belong with each content. Now, we are tracking another piece of information, typically when we watch or listening something, we can give an implicit or explicit feedback to the system, so we have to track also this kind of explicit feedback, so the user can like or dislike a movie. We add a relationship: - Feedback: type (which contains both like and dislike), date time. Why in this case adding a relationship is good, and we don’t add an entity, like stream? Because, with a relationship we cannot track the same pear of objects twice. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Now, the last thing we are going to represent is the billing, so the invoices we produce every month. We add another entity: - Invoice: date, number, amount The primary key here is composed by data + number, in this case the representation is different. LOGICAL SCHEMA So, we have to translate the schema, how we can translate entity? An entity becomes a table or a relation, first translation of our schema: - Contents (ID_content, director_content, genre_content, category_content, lenguages_content) - Episodes (ID_episode, title_episode, season_episode, duration_episode, number_episode, ID_content(FK)) How can we translate a relationship? The way we translate a relationship depends on the kind of relationship we have one-to-one, one-to-many and many-to-many. relationship ‘’belong’’, one-to-many (1,N). RULE ONE-TO-MANY RELATIONSHIP: I represent a one-to many relationships in the table deriving from the entity having as maximum cardinality 1, in our case is ‘’episode’’. 1° thing is taking a primary key from one table, putting in the other table. 2° thing is to set the referential integrity constraint. Referential integrity constraint between episode.ID_content and content. ID_content. The meaning of this referential integrity constraint is that everything I have in ID_content tables episode must be available in ID_content tables contents. Second translation: - Actors (country_actor, ID_actor, birthday_actor, name_actor, gender_actor) - Contents (ID_content, director_content, genre_content, category_content, lenguages_content) Relationship ‘’play’’, many-to-many (N,N) RULE MANY-TO-MANY RELATIONSHIP: Every time we have a many-to-many relationship, we have a new table representing that relationship having as primary keys the entity having maximum cardinality N is always one primary key, but it contains two fields, so in this case together ID_actors and ID_content. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 - Play (ID_actor (FK), ID_content (FK), role_play) Referential integrity constraints between play.ID_actor and actors.ID_actors Between play.ID_content and contents.ID_content. Third translation: - Contents (ID_content, director_content, genre_content, category_content, lenguages_content) - Users (ID_user, email_user, password_user) Relationship ‘’feedback’’, many-to-many (N, N) - Feedback (ID_users (FK), ID_content (FK), type_feedback, datatime_feedback) Referential integrity constraints between feedback.ID_user and users.ID_user Between feedback.ID_content and content.ID_content Fourth translation: - Invoices (amount_invoice, data_invoice, number_invoice, ID_user(FK)) - Users (ID_user, email_user, password_user) Relationship ‘’issue’’, one-to-many (1, N) Referential integrity constraints between invoices.ID_user and Users.ID_user Fifth translation: - Streams (DTstart_stream, DTend_stream, ID_stream, ID_user(FK)) - Users (ID_user, email_user, password_user) Relationship ‘’play’’, one-to-many (1,N) Referential integrity constraints between streams.ID_user and users.ID_user Sixth translation: - Streams (DTstart_stream, DTend_stream, ID_stream) - Contents (ID_content, director_content, genre_content, category_content, lenguages_content, ID_stream (FK)) Relationship ‘’include’’, one-to-many (1,N) Referential integrity constraints between contents.ID_stream and streams.ID_stream CHAPTER IV RECOMMENDER SYSTEMS What is a recommender system? Are personalized information agents that provide recommendations, so every time we see a system, a piece of software, an algorithm, automatically suggest us something based on some criteria, that’s a recommender system, for eg. Amazon suggest something to buy, Netflix suggest something to watch, Instagram suggest user to follow, these are all recommender system, everything a recommender is recommending is technically called Item. The very first company using recommender system were: amazon, Netflix, last.fm and pandora, nowadays almost all the company are using recommender system. How does recommender system work? Let’s take as example Netflix or Spotify, we have two big categories: Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 1. Content based filtering; recommendations are based on characteristics of the items to recommend. In practice this means: if I listen only techno music then the system will suggest me other music belonging to the same genre, so the system analyses or know the kind of content I’m listening to and according to that it suggests other items having the same characteristics, it could be the genre, but it could be even more specific. 2. Collaborative filtering it uses a completely different approach, they don’t know anything about the content, the characteristics, they just give recommendation based on the preferences of other users, there’s no need for content analysis. User based, it means that the system analyzes my taste, analyses the taste of all the other user in the system it checks, who are the people having a taste similar to mine and uses their opinion to suggest me something I haven’t seen yet. It looks for similarities among users. Item based, it looks at the similarities among items. We actually don’t know if it’s content base or collaborative filtering unless Netflix expose their algorithm, and this is something they won’t do. This are two techniques we can use to do the same thing, so to try to suggest to a user something we think the user will like. Content based requires that we know the characteristics of the item we are recommending, collaborative filtering uses instead a sort of collective brain. For collaborative filtering, the first thing we need is references, because how we know the taste of our users, we need some explicit or implicit rating: - Explicit means that I ask to a user to rate a content, for eg. Like/dislike or rate a content from 1 star to 5 star. Some popular methods for explicit rating are the rating with the stars from 1-5 with or without semantic explanation, the binary ratings (with like or dislike) and unary ratings (like). The more precise is our scale, probably there is a higher cost of rating. - Implicit means infer preferences from behavior analysis, for eg. If I listen the same song every day 20 times, probably I like that song, even if I don’t click like. Implicit ratings instead we have purchase data, consumption data (so songs listened, articles read, movie watched) according to the behavior we can know if the user for example like pop music or rock music. When we talk about implicit rating, we have to consider a physiological aspect, which is that sometime people are not completely honest when they rate content, so we can have some false negative and false positive, ratings. The time factor is another factor that we have to consider because our taste can change over the years. EXAMPLE, COLLABORATIVE FILTERING Our input data can be summarized in a User-Items-Matrix, basically we have each user of our system in a row, each Items of my system in a column and each cell represents in this case the vote a user gave to an item. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Step 1 production of the users, construction of the matrix. In our example we have only 5 users and 4 TV shows, for example if we assume that there is a 1 to 5 range, where 5 means ‘’I like that TV shows so much’’ than we can say that user ‘’Bob’’ likes so much the ‘’Big Bang Theory’’, while instead user ‘’John’’ doesn’t like ‘’Friends’’. How we know these data? Either explicitly (the users rate the TV series) and implicitly (we infer that according to the habits of the user). For example, the cell that is empty, we don’t know if the user Tom likes The Big Bang Theory, we want to know that because in order to give good suggestion, we want to guess this vote, how we can do it? Basically, the idea is use people having similar taste and compute this prediction based on the vote that those people gave, for example if Bob is very similar to Tom in terms of taste, than we can assume that, if Bob likes The Big Bang Theory also tom will like it. Step 2 production of the user distances matrix Mathematically, how can we solve this problem? We produce another matrix containing, on the rows and on the columns the users and we put in the cells the distance between each payer of users. We consider users as vectors, and mathematically we have several methods to do the distances between vectors, in our context ‘’distance’’ means that they don’t have the same taste, so the lower is the distance the more similar is the taste. So, one typical method is computing the Euclidean distance: Basically, this Euclidean distance takes in input 2 rows and produces in our input a value that measures how distance are those two vectors. From this matrix we can notice: - The diagonal which is compose by 0. - The matrix is symmetric. Step 3 normalize the distances dividing each distance by the higher distance available in the matrix. For example, the distance between John and Bob becomes: , we want to transform the values in the cells into something that is easier to read, we have to reach a value which is between 0 and 1. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 - 0 is minimum similarity. - 1 maximum similarity. Step 4 production of the user similarities matrix. Since we are more interested in the similarity respect to the distance, we have to compute the similarity computing 1 – normalized distance. We have computed the similarity between john and Bob. Step 5 take the neighbors of our target user (in our case is ‘’Tom’’) and use the neighbors to compute the guess. The neighbors are user having similar taste with tom. Who are the two-user having highest similarity? Bob and Anna because we have to check the highest value in the column and row. Step 6 compute the (guessed) vote and commend the item if it is high enough. now we can use Bob’s and Anna’s rates to predict Tom’s rate: Our prediction is that ‘’Tom’’ would give 4 as a rate for the ‘’Big Bang Theory.’’ ‘’How many neighbors should we select?’’ Because in this example we have only 5 users but in real world we can have millions of users, so we should have many neighbors, the more neighbors we have the more robust the prediction will be. CONTENT BASED VS COLLABORATIVE FILTERING The advantages of collaborative filtering respect to the content based is that collaborative filtering is easier to implement, because we don’t need that content analysis step, because for content-based algorithms, we need to analyze the content of our items and based on that we will do our suggesting, with collaborative filtering we didn’t do any content analysis, we just use the ratings available in our matrix. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Collaborative Filtering has the disadvantage of the data sparsity which in turns has some related consequences: 1. Cold-start problem: it means that the system has some difficulties when starting we can see two sides of the problems: Difficult to serve a new user, for the system is difficult to suggest me something because we can’t compute the neighbors, since you cannot produce the neighbors, you cannot produce the suggestion. The system starts to produce suggestion when it has some data, so when the user stars to watching things and giving rates to the items. Difficult to suggest a new item, since we don’t know the vote the neighbors gave to that new Item, then we cannot do the operation to predict rates, so no votes received by this new item so difficult to suggest a new item. This produces the so-called ‘’rich-get-richer effect’’ for mainstream items, basically with this system items that are mainstream tent to become even more popular, while items that are more ‘’Indi’’ so not very mainstream is difficult they get recommended. The difficult to suggest a new item is not a problem for content-based filtering even if the new items haven’t received any votes yet but still, since we did our content analysis, we know their characteristics, so we can suggest that items based on its characteristics for user who like those characteristics. But in content-based filtering we can have problems related to the difficult to serve a new user, because if we have a new user if he/she hasn’t expressed any preferences yet the system doesn’t know him/her, so is not possible the system suggest to this new user the items. another approach which is more related to the collaborative filtering is trying to grab our preferences from social media, if we log-in without using email and password, but using the social media account, some social platforms allow other systems to grab our preferences, so for example the pages we like, the user is informed about what is happening, because when we use this kind of system a pop up window appears. PRIVACY AND TRUST Tradeoff between personal information and recommendation effectiveness (how good are the recommendation) the more information we have about the user, the better will be our suggestion, but on the other side the more information we have, the worst can be the problem privacy wise. Both user and companies using this services should take into consideration, on the user side we can say that we should be aware of what is going on, so we should be aware of the fact that we are giving away some personal information and that maybe according to this personal information, the system can infer other information related to our taste, opinion and so on… and this should be on one side, pretty bad, on the other side this allows to propose us something that we might like, so we need to find a balance between this two elements. We see that there are business rules that needs to be taken into consideration In the whole process, for eg let’s says that amazon predicts that I like so much a product but they know the product is sold out and that product won’t be available for the next 3 week, ‘’does it makes sense to suggest this product?’’ no because the systems give a suggesting for a product that the costumer cannot buy, so even if according to the recommendation algorithm still for business related decision the system don’t recommend it. Between business rules and trust there’s a kind of trade off, if we realize that the business rules have a huge impact on the suggestion I received, maybe we don’t trust the platform anymore. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 And finally, we should always consider that ratings and reviews, we cannot assume that we can have fake reviews, both positive and negative, positive fake reviews, so reviews made by the people who created a particular content or selling a product in order to sell more, or negative fake reviews, maybe done by some competitors who want to decrease the perception, potential costumer, can have about that product. SQL SQL structured query language is a language we use to communicate with a DBMS, in order to execute an SQL query on our database we have to click ‘’MYSQL’’ in the upper line. SELECT … name of the columns … FROM … name of the table … WHERE … where filter, a filter that we can apply to grab just some of the rows … These are 3 main parts of a select statement, which is used to retrieve/read rows from 1 or more table. For example, let’s say that we want to see first and last name of all the costumer from Germany, according to this we should specify: SELECT … firstname_customer, lastname_customer. FROM … customers WHERE … contry_customer = ‘Germany’ SELECT * means all the columns. Let’s make things a bit more complicated, let’s says that we want the German customers over 30, we can do that: SELECT * FROM customers WHERE contry_customer = ‘Germany’, age_customer > 30 When you have numerical fields, we don’t need to use quotes instead when we have an alphanumeric string and dates, we need quotes. So how can we combine this with components of our ‘’where’’ quotes? Using a Boolean operator in particular in our case, an AND Boolean operator. SELECT * FROM customers WHERE country_customer = ‘Germany’ AND age_customer > 30 Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 ‘’AND’’ means that both the conditions must be true. Another comparison operator (beyond AND) is ‘’differ from’’ represent by this symbol. SELECT * FROM customers WHERE contry_customers ‘Germany’ AND age_customer > 30 This mean ‘’give me all the non-German customers over 30. Then we can have ‘’greater or equal to’’ represented by this symbol >= SELECT * FROM customers WHERE contry_customers ‘Germany’ AND age_customer >= 30 This means ‘’give me all the non-German customers which are over 30 or 30’’. The other Boolean operator we will see is ‘’OR’’, with ‘’AND’’ both the condition must be true, with ‘’OR’’ is it enough even if only one condition is it true. So, now suppose that we want to reach all the German and Italian customers, we have to do this: SELECT * FROM customers WHERE country_customer = ‘Germany’ OR country_customer = ‘Italy’ This time we want German and Italian customers over 30, what we do is this: SELECT * FROM customers WHERE country_customer = ‘Germany’ OR country_customer = ‘Italy’ AND age_customer >30 To change priorities, we have to put parenthesis: SELECT * FROM customers WHERE (country_customer = ‘Germany’ OR country_customer = ‘Italy’) AND age_customer >30 When we want to search for some words inside a field, we use a different operators called ‘’LIKE’’ for examples in sales table we want to see all the sales containing the word iPhone in the product fields. SELECT * FROM sales WHERE product sale LIKE ‘%iPhone%’ Here we are saying ‘’take all the rows having in the product sale’’ this kind of patterns so % means any characters, anything is fine. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 If we use percentage % at the beginning and at the end it means ‘’contains’’ SELECT * FROM sales WHERE product sale LIKE ‘iPhone%’ When we select record from a table, we don’t have to assume that we will get any specific order/sorting because a table is just a set of elements, so there’s no implicit order in this element, if we want a specific order we have to specify, with ‘’ORDER BY’’. SELECT * FROM sales WHERE product sale LIKE ‘iPhone%’ ORDER BY price_sale If we want to do the order from the maximum to the minimum, we have to insert the word ‘’DESC’’. SELECT * FROM sales WHERE product sale LIKE ‘iPhone%’ ORDER BY price_sale DESC In SQL we can also compute mathematical operations: SELECT product sale, price_sale, price_sale*1.22 FROM WHERE ORDER BY This would add 22% VAT to our price, the four operations are: *(multiply), +, -, / (divide), we should even specify an alias ‘’AS’’ a name we want to give to this column, like : SELECT product sale, price_sale, price_sale*1.22 AS price_sale_vat FROM sales WHERE ORDER BY We don’t have at the beginning the price sale vat column, but it goes to be created. How to work with NULL values, as we can see, we have some NULL values. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 For some of the product we have the id brand for some other we don’t, the brand is specify inside the name of the product. Every time we need to compare a field with the NULL values, we have to write ‘’IS NULL’’ like this: SELECT product_sale, price_sale, price_sale*1.22 AS price_sale_vat FROM sales WHERE id_brand IS NULL ORDER BY Or if we want to compare the fields ‘’id_brand’’because we want to know which are the rows where this fields ‘’IS NOT NULL’’ we have to wrote: SELECT product_sale, price_sale, price_sale*1.22 AS price_sale_vat FROM sales WHERE id_brand IS NOT NULL ORDER BY Sometimes we need data coming from different table at the same time for eg. We want for all the sales after March 1° 2022 we want to see, the data of the sales, the product sold and the name of the customer. SELECT date_sale, product_sale, firstname_customer, lastname_customer FROM sales INNER JOIN customers WHERE date_sale > ‘2022-03-01’ ORDER BY date_sale When we have to join tow tables we have to wrote ‘’INNER JOIN’’ but then we have to specify how we have to join them, and we do that with ‘’ON’’. SELECT date_sale, product_sale, firstname_customer, lastname_customer FROM sales INNER JOIN customers ON sales.id_customer = customers.id_customer WHERE date_sale > ‘2022-03-01’ ORDER BY date_sale Let’s say that we want sales starting from Jan 1°, 2022, but only by customer which name is Anna: SELECT product_sale, date_sale FROM sales INNER JOIN customers ON sales.id_customer = customers.id_customer WHERE date_sale > ‘2022-01-01’ AND firstname_customer = ‘Anna’ ORDERED BY date_sale Select Date of the sales and the product sold. Date product for all the sale having a price between 300 and 1000 by Italian and German customer. SELECT date_sale, product_sale FROM sales INNER JOIN customers ON sales.id_customer = customers.id_customer Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 WHERE price_sale >= 400 AND price_sale = 18: Print (‘ you have passed the exam! ’ ) Else: Print (‘you did not pass the exam : ((( ‘ ) The concept of elif is an optional, we have to rephrase the entire schema: Grade_exam = 18 If grade_exam > 18 Print (‘you passed the exam! ’) Elif grade_ exam == 18 Print (‘you barely passed the exam …’) Else: Print (you did not pass the exam : ( ( ( ‘) In programming languages = is used to assign a value, == we use this if we want to compare to something else. FUNCTION, LOCAL VARIABLES very often when we have a complex program we want to takes some of our code and put that code in a function, because we want to reuse that code, so instead of re-writing that code, again and again, we write the code in a function and every time we need it, we call the function instead of re-writing the code itself. For example: Def compute_price_vat (price): Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Price_vat = price * 1.22 Print (price_vat) Compute_price_vat (100) I can give to the function different values: Compute_price_vat (50) Compute_price_vat (52) Consider that we can have more than 1 variable so: Def sum_elements (a, b) c=a+b print (c) sum_elements (10, 5) variables inside a function are local, this means it value is there only inside the function, the variable for eg c is available only in the function sum_elements, if I wrote the variable outside the function, python doesn’t this function. First thing first we have to import the CSV file into python: EDA, it stands for exploratory data analysis. IMPORT PANDAS AND MATPLOTLIB.PYPLOT Import pandas as pd df = pd.read_csv (‘formatted_flights.csv’) df pandas provide some data structure called, pandas data frame that are very useful in data science, so there we are saying ‘’import pandas and call it as pd’’, df stands for data frame. What we have finally is a table having 3 columns (the original CSV from the file has only 2) because the first column is automatically added by pandas to indicates the indexes, in pandas indexes started form 0. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 How to set a specific column of the data frame? Import pandas as pd df = pd.read_csv (‘formatted_flights.csv’) df [‘arr_delay’] if we execute the code we should get only the column arrival delay. Now we want to represent graphically this data set, in particular this column using an histogram: Import pandas as pd df = pd.read_csv (‘formatted_flights.csv’) df [‘arr_delay’].hist () if we apply dot and the name of a methods it means we apply the method what we have on the left as we can see, by default 10 bins are created (the columns), the highest bins means that a lot of flights has arrived earlier between -25 and -5. Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 but we can force the number of bins because only 10 are created by default doing this: import pandas as pd df = pd.read_csv (‘formatted_flights.csv’) df [‘arr_delay’].hist (bins=50) We can also specify the bins that we want for example those between -200, 50, doing: import pandas as pd df = pd.read_csv (‘formatted_flights.csv’) df [‘arr_delay’].hist (bins = [-200, -50, 25, 70, 100, 150]) Downloaded by Chiara Davoli ([email protected]) lOMoARcPSD|11350337 Let’s says that we want to show the graph but only for ‘’JetBlue Airways’’, when we want to select rows by tables, we have to do: df.loc [ df [‘name’] == ‘JetBlue Airways’] Downloaded by Chiara Davoli ([email protected])