w02-01-FrameworksAndAlgorithms 2024 PDF

Summary

This document summarises a lecture on frameworks and algorithms for data analysis. It covers the challenges of handling data from diverse sources such as log files and real-time sensor streams, data preparation and feature extraction, and predictive maintenance use cases. It then surveys data lakes and data warehouses, Hadoop/HDFS, Ceph, and Apache Spark, and walks through the MapReduce algorithm (via Hadoop Streaming), MongoDB aggregation pipelines, and Spark RDDs and data frames.

Full Transcript

**w02-01-FrameworksAndAlgorithms-2024** 0:00 So this lecture today is discussing a range of topics. 0:05 We're going to start by talking about different frameworks that we might use. 0:10 We'll mention a few algorithm choices that we could make, or tools that we could use, and hopefully that will illustrate some of the ways that you could solve some of the problems that we were discussing last time. 0:23 So we're not really at this point thinking about databases. 0:26 We're more thinking about files that are distributed across a series of computers. 0:34 OK, so frameworks, hopefully we can click the next button. 0:39 Great. 0:40 So first of all, what are the underlying challenges that we're dealing with? 0:46 Well, the big challenges that I mentioned before are that we may have data from lots of different sources. 0:52 So that means you might have log files from computers that are telling you how they are functioning, how they are responding to queries. 1:01 There may be data streams coming from sensors, real-time data streams that is, that are being recorded somewhere onto a disc, and you need to analyse these and then from that analysis publish some kind of dashboard, show people the data. 1:20 So we need to be able to extract features from the data. 1:23 By a feature, what I mean is some sort of characteristic. 1:27 You may say that this machine, for example, if it's a wind turbine, is about to fail; maybe the gearbox temperature is a little bit higher than it should be. 1:36 And having built up an understanding, maybe using a neural network model or something else, you can then judge that you need to take action. 1:44 So you're extracting features, often using some kind of statistical basis to do this, meaning: is this repeatable? 1:51 Is this outside the specification? 1:53 So you can't necessarily tell straight away if you have something that's of significance. 1:59 And yes, so as I've already mentioned, you may want to use some kind of so-called artificial intelligence approach, where what we really mean is it could be a decision tree, a neural network or some other computer-aided approach where you're making a decision as to whether you have a problem or not. 2:17 And then finally, once you've done all that analysis, typically speaking, the data are presented on a dashboard. 2:23 You often see this done with Tableau or Power BI or something else that is used by the top-level management to understand the health of the company or the product or whatever it is you're monitoring. 2:37 OK, so the input that you have may have problems in it. 2:42 And normally we refer to this whole subject area as data preparation or data quality checks. 2:49 So you need to verify: do the data have the right types in them? 2:52 Now you remember from last week we were talking about types of fields in tables, or relational tables. 3:00 And sometimes you may have a bug somewhere and you end up with the wrong data that somebody's trying to insert into a table, and you don't want that to happen. 3:08 So when we're not dealing with a relational database, we still need to do these checks. 3:12 Now you might have some sort of API where you are allowing people to upload data, or perhaps they are giving you a spreadsheet or a CSV file or something like that, and you have to ingest that. 3:27 It could be a log output from a sensor. 3:31 Now you need to ingest that and you need to be tolerant of potential errors, mistakes.
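
As a rough illustration of the kind of type-checking and error-tolerant ingestion just described, here is a minimal Python sketch; the file name and column names are made up for illustration, and a real pipeline would add range checks, schema unification and so on.

```python
import csv

# Hypothetical expected types for an ingested sensor CSV; the column
# names here are made up for illustration.
EXPECTED = {"sensor_id": int, "timestamp": str, "gearbox_temp_c": float}

def ingest(path):
    """Read a CSV, keep rows whose fields parse to the expected types,
    and collect the bad rows instead of crashing on them."""
    good, bad = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                good.append({col: cast(row[col]) for col, cast in EXPECTED.items()})
            except (KeyError, ValueError):
                bad.append(row)          # tolerate the error, report it later
    return good, bad

rows, rejects = ingest("turbine_log.csv")
print(f"accepted {len(rows)} rows, rejected {len(rejects)}")
```
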
3:35 You need to perhaps unify the schema so that it is actually more useful for a dashboard. 3:41 Now I say this with experience: commercial providers of data are sometimes more difficult to deal with than public sector ones, because what happens here is a company contracts out the work and the contractor produces the data for them. 3:57 They're contracted to do that in that particular way. 3:59 They don't want to do any additional work. 4:01 Maybe the customer forgot to tell them the specific specification, and so somebody upstream has to manhandle these data into some kind of unified format. 4:12 You can have several different consultancies providing different data. 4:14 For example, if you think about the rail tracks around the UK, or any other rail network, in the UK in particular you have different consultancies measuring the data that come from the rail surface measurements, and those data might then be in different formats and stored somewhere, and you have to process them anyway. 4:33 It's a big challenge. 4:36 So you need to potentially perform an analysis of a system behaviour, or some kind of user behaviour of a system. 4:45 And we need to make sure the data are OK. 4:49 And a lot of the work is, as I've already said, the data preparation. 4:53 We might need some kind of log analysis. 4:56 For example, if we have logs from the computer, there might be rare events: if your server is running, it may only allow a remote login every once in a while. 5:06 Mostly it's serving out a web server, but maybe it has a particularly odd log where a batch of IPs from somewhere in the world that's not normally asking for data starts asking for data. 5:18 Something like that. 5:19 We may have the problem that we need to resolve where we have a device that's at distance. 5:25 So I already mentioned the wind farm. 5:28 You could have boats, marine vessels at sea, which are going across to the next port. 5:33 They need to know if there's a problem so the part can be ordered and then the part can be there for when they arrive. 5:40 Complicated machinery, for example big pumps, are sometimes driven by gas turbine engines. 5:48 It's the same kind of engine you have inside an aircraft, and although it's not an aircraft, those engines need to be serviced and you need predictive maintenance really, because you want to reduce costs. 6:02 You're not flying it in an airline. 6:04 Patients. This is an interesting one. 6:07 What we have in the UK, and definitely in Scotland, is the idea that we would like to bring patients to the hospital when they need to be at the hospital. 6:17 It's pre-emptive investigation. 6:20 So this is particularly useful when somebody's got a chronic health concern; maybe they have Crohn's or inflammatory bowel disease or something that's chronic, or heart problems. 6:31 You can detect some of these things by taking measurements of the person. 6:36 And if you detect there to be a flare-up, you can pull them into hospital, do some further checks and maybe change the medication and prevent a relapse or some kind of autoimmune behaviour, which is what you have with Crohn's. 6:50 So anyway, you can instrument people and take their data and process them. 6:56 And the purpose of all this is that you want to essentially reduce your costs, and there potentially may well be some benefit. 7:05 OK, so that's all the background information. 7:08 So now I'm going to go on to data management systems.
7:11 Now bear in mind, we talked about databases last week; this week I'm not talking about databases, I'm more talking about file systems and ways of processing the data on them. 7:23 So generically, there are three types of cloud-based file system. 7:29 All right, so these are the generic categories, and then we'll discuss the particular instances. 7:34 The data lake is a generic storage that you find on the Internet, and it's basically for unstructured data. 7:43 So it's a large collection of files, as we say here. 7:47 They could be coming from an Internet of Things device. 7:50 Maybe it's your fridge logging data back, or your light bulb or whatever it is. 7:55 Social media, logging of computers; it's all shoved in a regular fashion into a data lake. 8:02 Now, a data lake has no schema on write. 8:06 That means that it just allows you to push data into it. 8:10 Now, of course, if you push data in there and they're all in a bit of a mess, when you come to read them later you'll then have to decide how you're going to read them in order to form some kind of relations so that you can then display it in a dashboard. 8:24 So they normally offer you something called schema on read, and we'll come back to that later in the lecture. 8:30 I've got an example of schema on read. 8:32 So what we do here is we're essentially providing a recipe for how we relate those data together when we read them back from the data lake. 8:43 So there are bits of software that help you manipulate data in a data lake. 8:51 So Metacat will allow you to explore, as I've got here, Hive, RDS, Teradata, Redshift, S3, that's the Amazon storage, or Cassandra, which I mentioned last week. 9:01 So that's the generic catalogue, or Atlas. 9:05 So the so-called Atlas, this is specific to MongoDB, which you would have seen in the lab last week. 9:12 Atlas is the cloud offering of MongoDB. 9:15 OK. 9:16 And you can follow the link at the bottom here to learn about data lakes. 9:20 This is a page put together by AWS Amazon. 9:24 So that's one type of generic storage. 9:27 Another type is a data warehouse. 9:31 This is where the data are structured. 9:33 You can think of the data warehouse as being analogous to a relational database. 9:38 You have to have a schema defined in the data warehouse and you write data into there. 9:44 You may or may not know, but copies of the data that are in the back end of Myplace (the back end of Myplace is Oracle, by the way), 9:51 copies of that are pushed every once in a while into a data warehouse, and then those data are processed in the data warehouse to provide tutors with a dashboard that shows them how students are interacting with Myplace; that's a real piece of software and it uses a real data warehouse. 10:09 The benefit of this is that often you don't want to perform lots of analysis on your in-house relational database, because that's your production system. 10:19 For example, if I ran all my analysis on the Oracle back end of Pegasus, Pegasus would slow down, Myplace would slow down. 10:28 And you'd all be saying, oh, why is it going so slow? 10:30 It's Will, he's running all these queries, he's overloading the server. 10:34 So instead of that, they replicate the data off to a data warehouse and that's where the analysis takes place. 10:40 And from there the data are exposed, typically to other people, using dashboards, where Power BI is the Microsoft offering.
10:47 And yes, we do use Power BI with the data warehouse, with the back-end data from Myplace, to understand how students are behaving. 10:57 So with this, you have to define a schema and then you obey that schema when you write into the environment. 11:03 Now, it's clearly not a problem if you're replicating from a relational database, because that thing already has a schema. 11:10 But if you were going to try to import data, however you import it, it has to obey the schema. 11:16 Now the issues here are that, rather like your relational database, you again are going to tend to have to buy expensive hardware. 11:24 It's going to be memory-intensive, high-spec CPUs. 11:28 You may end up with some very fancy IBM mainframe in there or something like that. 11:33 To offer a resilient and effective data warehouse, you are going to have to have special hardware. 11:40 So the two cloud offerings I'm mentioning here are Amazon Redshift and Azure Synapse. 11:46 They're examples if you want to go and have a look at them. 11:48 As far as I'm aware, we probably are using Azure, but I can't remember. 11:54 OK, so that's data warehouses and data lakes. 11:58 And now I'm going to go on to particular software solutions. 12:02 Does anybody want to raise a comment or question or something about warehouses or data lakes before I continue? 12:10 Yeah, OK, I'll carry on. 12:14 I'm going to reference back to them, because the sort of software I'm talking about is, well, most of them are in fact data lakes, but they offer schema on read. 12:25 OK, so first of all, a lot of the software I'm discussing in this talk comes from the Apache Software Foundation, which has been around for a long time. 12:35 I started using software from Apache more than 26 years ago, at least quite a long time ago I started. 12:43 So this foundation has been around for a long time. 12:46 It offers all sorts of pieces of software. 12:49 Some of them, if you look into them carefully, are still there, but nobody's really using them. 12:53 Others are very, very important. 12:55 So inside their offerings there are some things that are more important than others. 13:02 So I'm going to mention Hadoop briefly, because Hadoop is still very important for large data sets. 13:07 So Hadoop, as a file system anyway, is designed to store up to petabytes of data. 13:15 It won't store a vast amount of data, for reasons I'll come to in a minute, but it will store quite a large amount. 13:22 So to give you a scale, the hard disc in my laptop is a TB, and 1024 terabytes is a petabyte if you're being strict and using base-2 units. 13:34 So it's like 1024 hard discs or computers all stuck together. 13:39 OK. 13:39 The idea with Hadoop is it should be able to work on low-cost hardware. 13:44 You don't want to be having a super-fancy computer with super-fancy RAM and everything else; it needs to work on low-cost hardware. 13:53 And we need to allow hardware to be removed from or added to the system. 13:58 So that means we need some kind of replication of data, because if somebody's going to pull the hardware out, potentially we've lost data unless we've replicated it. 14:07 Hadoop was written in Java. 14:09 It's cross-platform, so you can run it on Windows or Linux or whatever, because it's a Java application. 14:16 In principle, most people run it on Linux because that uses fewer resources than anything else. 14:24 OK, so what's in Hadoop? 14:25 Well, Hadoop is a group of tools.
14:28 It's not just one thing. 14:30 We've got some common modules. 14:32 We have the so-called HDFS, or Hadoop Distributed File System. 14:36 This file system follows a way of working which is common to a lot of different distributed file systems, of which I've used several. 14:47 So we can comment on distributed file systems and issues with them. 14:50 YARN is a job scheduler. 14:53 Now why a job scheduler? 14:55 Well, if you've got a distributed system, you then need to hand out the task of processing pieces of it to distributed compute, basically. 15:08 So you've got lots of different computer processors doing your work. 15:12 And then the problem is, well, somebody else is going to want to do that too. 15:15 So you need a scheduler so that you push out the processes and the queue is, you know, being formed and emptied all the time. 15:25 You don't want computers being, should we say, monopolised by one user, or being underused. 15:31 So you end up with a scheduler which also manages the cluster. 15:35 So for example, if a compute node goes offline, it can see it's offline. 15:39 It doesn't send it any more processes. 15:43 It also provides MapReduce, or at least supports the MapReduce algorithm, which I'll comment on later. 15:50 MapReduce is an algorithm which allows parallel processing. 15:54 It is one of many ways of parallel processing. 15:59 It's mentioned in this lecture as an example that's relatively simple. 16:03 There is a second example, which is also simple, which we'll cover near the end of the lecture. 16:08 OK, so what's inside the box with HDFS? 16:11 Well, HDFS comprises a name node and some data nodes. 16:18 The idea here is that the name node is normally one computer. 16:22 You might have a backup name node or something like that, which is operating as a slave. 16:26 It's receiving a copy, but it's basically one computer, and you've got to ask the name node where the data are. 16:33 So the client, the client being a computer that's outside of HDFS, asks for the data. 16:39 So it sends a request to the name node asking where the data are. 16:43 And then the name node works with what we'd call a logical file name. 16:49 So a logical file name is a file name that we humans can understand, right? 16:54 It's some sort of path, in this case something like /home/foo/data, yeah, a file path. 17:01 And then the name node knows where the blocks are. 17:04 So it sends off the request to the data nodes and then the client can get the data back. 17:10 Now the nice thing with this setup is that the data are actually replicated in blocks across these data nodes. 17:18 So typically you have three block replicas. 17:21 So each block is replicated three times, and when you are reading, what happens is that you're reading from the blocks according to where the name node says they are. 17:31 When you're writing, you write once, it's cached, and then after it's filled up the block, it'll replicate the block and then replicate it once more. 17:39 So that redundancy means that you can then lose some hardware. 17:43 It's a bit like RAID. 17:44 I mentioned RAID last time. 17:45 RAID is where we combine several hard discs into one. 17:50 They look like they're one disc, and we normally have so-called parity information across the discs where we have redundancy, depending on the RAID level of course. 17:59 OK, so what's the benefit of HDFS? 18:02 Well, fault detection and recovery.
18:06 If you lose some data and it's set up to have three replicas, it will create another replica. 18:12 With commodity hardware, we expect failure, right? 18:16 Very good hardware will still fail; with commodity hardware, failure is more expected. 18:22 It provides streaming access. 18:24 What do we mean by streams? 18:26 I don't mean a nice trickle of water. 18:29 What we're talking about here, in computer speak, is where we have an open connection to a file and the file data continue to stream into our environment, into our processing programme, whatever it is. 18:42 So here we're talking about: we can open a stream to a data set and read it. 18:50 We don't have to open all these blocks one after another in terms of our own code. 18:54 It can store large data sets. 18:59 Excuse me. 19:00 It only really works with large files, though. 19:05 So if you've got a GB file, or up to terabytes, it's fine. 19:10 If you go very small, like, OK, I've got a little log file that's 10 kilobytes, when you store loads of those on HDFS it will not work very nicely, because what will happen is that the name node will be overwhelmed with requests for where the data files are. 19:27 So you'll be completely dominated by this name node rather than the data nodes. 19:33 And this is common to a lot of distributed file systems that have a name node, of which there are others, OK. 19:44 The idea also is you write once into this and then you read many times, and it's functioning like a data lake. 19:51 All right, so you're writing into it with no schema, but then you're going to read back and you're going to have to enforce the schema when you read. 19:58 The nice thing with HDFS is that it will actually move the computation, using that scheduler, to where the data are. 20:07 So another problem with a distributed system is you don't want to have to move the data around, 20:12 because then you're going to saturate the network, create network bottlenecks. 20:17 So imagine each computer has got a 10 gig link between it or whatever it is. 20:22 If you've got a lot of data being copied around, it doesn't matter how big that bandwidth is, eventually you're saturated. 20:28 So if you put the computation next to the data, that's potentially giving you a faster solution. 20:34 Anyway, there are some assumptions and goals given to you at the bottom in that link. 20:40 Now, as I've already said, the software is written in Java; underlying that, it's normally run on Linux. 20:47 Each node is normally set up on dedicated hardware, meaning it runs on one computer. 20:53 So you've got a computer that runs as a name node, you've got other computers that run as data nodes, and the data nodes hold the data files, and they also do the block creation or deletion, where a block is part of a file. 21:08 So what happens when you write into this is that it will replicate once the cache is filled; it operates what's called staging. 21:16 So it breaks the file up into 64 megabyte chunks normally, and then those are replicated out, normally three times, similar, as I've already said, to RAID. 21:30 So when you're writing, once the 64 megabyte block is filled, it'll immediately replicate that, and then once that replica has been completed, it's replicated again, and so you can carry on writing.
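
As a toy illustration (not real HDFS code) of the staging and replication just described: split a payload into 64 MB blocks, record the block list against the logical file name as a name node would, and place three replicas of each block across the data nodes.

```python
import itertools

BLOCK_SIZE = 64 * 1024 * 1024      # 64 MB, the usual HDFS block size
DATA_NODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]
REPLICAS = 3

name_node = {}                      # logical file name -> list of block ids
placement = {}                      # block id -> the data nodes holding it

def hdfs_style_write(logical_name, payload: bytes):
    """Toy model only: split into blocks and place 3 replicas per block.
    Real HDFS streams each block down a pipeline of data nodes."""
    rotate = itertools.cycle(DATA_NODES)
    blocks = []
    for i in range(0, len(payload), BLOCK_SIZE):
        block_id = f"{logical_name}#blk{i // BLOCK_SIZE}"
        blocks.append(block_id)
        # pipeline: first node receives the block, then it is copied on twice more
        placement[block_id] = [next(rotate) for _ in range(REPLICAS)]
    name_node[logical_name] = blocks

hdfs_style_write("/home/foo/data", b"x" * (150 * 1024 * 1024))
print(name_node["/home/foo/data"])   # three blocks, the last one only partly filled
print(placement)                      # each block sits on three data nodes
```
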
21:42 So you're writing the first 64 megabyte block, you go on, it then creates another one, you start filling that one up, and then when you finally close the last block, it's not filled. 21:51 It will then replicate that, and then replicate it again. 21:53 This is what I mean by a pipeline of replicas. 21:59 It's copying one and then copying it again. 22:01 Otherwise, if you didn't have that check, you might have, like, half a copy and then you're not sure if it's actually all the data. 22:10 OK, so yeah, pipeline replication with write access: the first data node then replicates to the second one, and off to the third. 22:19 Great. 22:19 So that is Hadoop, which is a particular file system for a particular reason. 22:24 It's for very high-speed data access with large files, so GB files. 22:31 Now, if you know anything about computer networking, you'll know that if you have a big file, what happens is that you can actually ramp up to the maximum throughput of the network interface over some period of time. 22:44 You won't get there initially. 22:47 So working with big files is a particular problem, and it's a particular problem that we don't sometimes deal with, because we've got little log files or whatever else. 22:58 So I'm mentioning Ceph here because a lot of people don't use HDFS, because as I say, it's good for one thing, not good for all things. 23:07 Ceph is a general-purpose file system, a distributed file system. 23:11 It's come from people who were working on another file system, which is called Lustre, and it's basically been a gradual progression. 23:21 Now the nice thing with Ceph is that we don't have a head node any more. 23:26 The way Ceph works is there's an algorithm, there's a look-up table; they call it the CRUSH algorithm and there's a paper about it at the bottom of the slide. 23:35 But essentially what happens is that you are told where the next block is by the algorithm. 23:43 So as long as you have the map to start off, you can then jump to the next block. 23:47 So you don't need to go back to the head node in the same way you do with HDFS. 23:52 Now the benefit of this is that you can potentially have access to smaller files and you're not overloaded by logical name lookups. 24:04 Rather like HDFS, Ceph is self-healing, meaning if you've got a few servers in the Ceph cluster and you pull one out, it will again do what HDFS does. 24:15 It will replicate again across the rest and you'll carry on. 24:19 You can put in a new server and it will just replicate so that the data spread over the servers, which is really quite nice. 24:25 Unlike HDFS, it offers you three things. 24:30 It offers you buckets. 24:31 Now a bucket here: if you know anything about databases, there's a type called a binary BLOB, and that's where you just throw some data in. 24:40 They're stored in a bucket. 24:42 You can address them. 24:43 Now you can use MongoDB to do something a bit like that for you. 24:47 If you go onto the cloud you'll see you've got an S3 bucket. 24:53 That would be AWS storage where you can chuck in big files. 24:58 Now a bucket is not actually a file system. 25:02 You can get your data out of there, but you use some kind of URL, like a web address, to get the data. 25:09 But you can't, you know, go in there and move around and push files around in the same way you do with a normal POSIX file system.
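
The following is not the real CRUSH algorithm, just a toy Python sketch of the idea behind it: given a shared cluster map, any client can compute where an object's replicas live from a hash, so there is no central name node to ask.

```python
import hashlib

# A cluster map that every client is assumed to have a copy of.
CLUSTER_MAP = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]
REPLICAS = 3

def place(object_name, cluster_map=CLUSTER_MAP, replicas=REPLICAS):
    """Deterministically map an object name to a set of storage daemons.
    Toy stand-in for CRUSH: a hash picks the primary, the replicas follow
    on the next daemons in the map. Real CRUSH also weights devices and
    respects failure domains."""
    digest = hashlib.sha1(object_name.encode()).hexdigest()
    primary = int(digest, 16) % len(cluster_map)
    return [cluster_map[(primary + i) % len(cluster_map)] for i in range(replicas)]

# Any client with the same map computes the same answer, no central lookup needed.
print(place("bucket1/video.mp4"))
print(place("bucket1/log-0001.txt"))
```
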
25:17 Now the other benefit of Ceph having this CRUSH algorithm and no head node is that it can support so-called block storage. 25:24 Now block storage is what's needed for virtual machines. 25:29 So a virtual machine, when you install it, is installed onto a normal computer, but it's essentially separated from the computer. 25:38 But it needs to store its data in a file system that operates just like your computer's file system, which is a block file system, meaning the blocks are quite small. 25:47 So Ceph can do this. 25:49 It's actually good enough to be the backing storage for virtual machines, which HDFS definitely cannot. 25:55 And then lastly, it does allow you to support a true POSIX file system, where you can mount it in the same way you mount another shared space. 26:05 So sometimes, and more often than not, private cloud providers will use Ceph instead of HDFS, because although for extremely large files it might not be as fast as HDFS, in some cases it is better in that it provides you a generic solution, so you can use your storage for all of these different things. 26:27 OK, its internals are a little bit more complicated than HDFS and I'm just going to skip through them quickly. 26:35 It has so-called storage daemons, where normally there are three or more of these per cluster. 26:42 And yeah, you have to have three or more, but you can have, you know, a thousand. 26:48 So it's massively parallel. 26:50 So what this storage daemon does is it provides information about where the data are. 26:57 So they're sitting there reporting where the data are to the monitor. 27:02 The monitors, you have three to seven of these. 27:05 These are a little bit like the head node, and they contain a map that's then used by this CRUSH algorithm to find the data blocks. 27:16 This map is updated every once in a while if some of the hardware is failing or whatever. 27:22 But given the map and the algorithm, an incoming query can then find the data. 27:28 And then lastly, we've got a couple of other things. 27:30 There's a metadata server that's needed for the POSIX file system. 27:35 Why for the POSIX file system? 27:36 Well, in a POSIX file system, you need a concept of an owner. 27:40 You need a concept of whether you've got read or write permission, whether it's executable. 27:44 So these are metadata to do with the file. 27:48 They're not actually the file data. 27:50 And then lastly, there's normally one manager that just keeps track of everything and provides some sort of dashboard. 27:57 So the design of Ceph implies that you can have, you know, big clusters and an enormous amount of data, because it allows more parallelism. 28:09 OK, so that's Ceph. 28:11 I was going to go on to Spark. 28:12 Does anybody have any questions about HDFS or Ceph before I go into Spark? 28:20 No. 28:21 OK. Now Spark is used for data processing, and it is used more than anything else in the commercial world at the moment for data processing. 28:36 It underpins most of the analysis you'll see. 28:40 It's the market leader. 28:42 It's sort of taken over from Hadoop. 28:44 So Hadoop as a platform for data analysis was very important, and it's still important for some things, but it's more so that Spark is now more important. 28:53 The good thing about Spark, what have you got, right? 28:56 You have more ways to interact with it than you did with Hadoop.
29:00 Hadoop requires you to write in Java to talk to it, or whatever else, whereas Spark actually supports several different languages. 29:10 So you can talk to Spark using Python, SQL, Scala (which is compiled and run on the Java virtual machine), obviously Java, and R. So R is a language used by statisticians to perform analysis. 29:24 Now, the one big thing about Spark is it caches the data in memory as much as it can, and this caching speeds up the data processing; that's crucial to how it works. 29:37 It's used, as I've said here, for iterative and interactive data mining. 29:41 So it's optimised for speed, and it's much faster than Hadoop is for small bits of data. 29:49 Where it came from: well, people were at the time using Hadoop with MapReduce and they were finding that this was slow, they had bottleneck problems. 29:59 And so that's where Spark came from. 30:01 There's actually a presentation at the bottom with some of the ideas, some of the thoughts behind where Spark came from initially. 30:09 All right, so what can Spark do? 30:12 Spark is not a cluster manager and it is not a storage system. 30:17 It's a layer on top that allows you to process and access data in a sensible manner, in a way that you would recognise if you know how to do data analysis in Python. 30:28 OK, so what can it do? 30:30 Well, it can access data that are stored in your local file system. 30:34 Now, you'd occasionally do that, but that's more for testing Spark. 30:38 You wouldn't normally do that if you're really using Spark. 30:40 Spark is designed for analysis of data that are spread across a distributed file system. 30:46 If your data are all on your local machine, you're better off not using Spark, because there's an overhead to starting the Spark manager process, which you don't have if you just run it locally. 30:58 But if your data are spread across lots of computers, then Spark will help you, because it will cache the data, it will bring them back together, and it will do all the processing for you. 31:08 So it can talk to HDFS, which is the Hadoop file system. 31:11 It can also talk to Cassandra, HBase, which is another Apache software project for the storage of data, S3, which we've talked about already, BLOB storage, you can talk to that. 31:24 You can talk to the Azure version as well, the Azure BLOB storage. 31:27 So basically it's giving you a common interface to perform a data analysis and make your life easier. 31:37 It allows you to, as I've said already, perform distributed data processing. 31:42 And how does it do this? 31:44 For testing, it has its own cluster manager, meaning you tell it to do something and then that request goes off to its internal one. 31:52 Again, that's for testing; don't do this normally. 31:55 In the lab, we're going to use the local one because we can explore the concepts and ideas, but normally you'd be connecting to something else. 32:03 It can talk to Apache Mesos, which is a cluster manager, which has actually become deprecated now. 32:10 By deprecated, what I mean is it still works today, but at some point in the future it will stop working. 32:16 So I'm not mentioning that when I come to the lab examples that you'll see. It will talk to YARN, which is the scheduler that comes as part of the Hadoop tools. 32:26 And lastly, and most excitingly, it will work with Kubernetes. 32:30 So now this is the most exciting one, because Kubernetes is a container orchestrator, but what on earth is a container?
32:42 So containers are essentially a computer operating system, thinned down to exactly what you need to run an application. 32:51 So it's an operating system and then layers of code on top, and you can then run this container on another computer. 32:59 Now Kubernetes is an orchestrator, meaning it starts and stops and controls the process of multiple containers on a cloud environment. 33:09 So you have a cloud environment and Kubernetes will start and stop them. 33:14 So if you have Spark talking to Kubernetes, then effectively you can have elastic compute, meaning your compute can just grow and shrink. 33:24 And that compute could be shared with other things that are running in containers on that platform, which then means you could leverage a common platform for many different things, which is why it's the most exciting. 33:36 The others in this list, like for example Hadoop YARN: if you ran YARN on a cluster which was also trying to do something with containers, you'd have a resource conflict, whereas Kubernetes manages this whole thing, which is more exciting anyway. 33:52 Apache Spark has gone through a series of development steps over the years. 33:58 Its first data format, and it's still underpinning Spark, is what's called the resilient distributed dataset (RDD). 34:05 So this is a definition of a data set and it doesn't require a schema. 34:10 Remember the data lake? 34:13 No schema. 34:14 You write into it and then you can read it back. 34:16 Now when you read it back, you can either assume that schema in your code or you can apply the schema by pulling the RDD into a data frame. 34:30 So the data frame interface was written in 2013, it says on the slide. 34:36 The idea here is to give you the same kind of interface that you have with a pandas data frame. 34:42 Hopefully everybody's used pandas, right? 34:44 You load a CSV or whatever it is into pandas and then you can perform the analysis. 34:49 The beauty with Spark is you're performing your analysis on a data frame, but that can be from a distributed system and you don't care. 34:58 Well, maybe you care if, you know, you throw out loads of processes and somebody's paying the AWS bill, because then it could potentially suck up a lot of processing. 35:07 But in terms of you writing code, you're abstracted away from the process of carrying out the analysis across all these different computers. 35:16 Now the last one that was developed, in 2015, is the Dataset API. 35:21 And this essentially gives you a typed way of interacting with the data. 35:28 That one isn't available for the Python interface at the moment, because Python itself allows you to have duck typing, where you can decide what the type is at runtime. 35:38 So Python doesn't need this, whereas things like Java do need it anyway. 35:43 So we will see in the lab how to use the top two, RDDs and data frames, right? 35:50 So I'm going to park Spark for a minute and I'm going to come back to some practical examples of Spark later on in the lecture. 35:57 What's happened with Spark is that Spark has been developed and it's been improved by a bunch of developers who actually produced a commercial bit of software which sits on top of Spark. 36:09 And that commercial bit of software is called Databricks. 36:12 So if you go to some of your friendly public cloud providers such as Azure, they will offer you Databricks.
36:20 And what they are doing is they're offering you a service that, yeah, underneath is Spark. 36:26 And it's essentially giving you an interface which looks like a Jupyter notebook. 36:30 So it's a super-easy way of manipulating lots and lots of data. 36:36 Now the other thing that's inside Databricks is they have created what they like to call a lakehouse, where you've got a mixture of schema and no schema. 36:47 So they're using a bit of software called Delta Lake. 36:50 This again is open source. 36:52 It's not commercial. 36:55 So you could download it, install it and play with it. 36:57 But it's part of the Databricks installation, and this Delta Lake guarantees some kind of ACID transaction. 37:05 So remember, this is to do with the ability to safeguard your data. 37:12 Essentially, are they going to be lost, their integrity and redundancy? 37:18 OK, you can see that we can read from AWS, Azure, Google object store. 37:22 So it's doing the same thing that Spark can do. 37:25 It's able to read from lots of distributed stores. 37:29 If you interact with this thing, it's got an online interface. 37:32 It looks like a Jupyter notebook. 37:34 It allows you to use regular tools like MLflow, scikit-learn, TensorFlow, PyTorch. 37:40 Having said that, MLflow can be used with Spark. 37:45 It actually is integrated with Spark, and Spark offers you that. 37:48 So what they've essentially done here is they've improved on Spark and they're offering it as a commercial solution. 37:57 All right. 37:58 So that's all I was going to say about Apache Spark, and I'm going to come back to it. 38:03 And I was going to then go on to MapReduce briefly, as an example, and then move on. 38:08 Does anybody have any questions or comments so far about anything? 38:15 No? OK, either you need another cup of coffee or I lost you all about 10 minutes ago, I don't know. 38:23 OK, so MapReduce. This is one algorithm you can use for parallel processing. 38:32 Now it's just an algorithm, meaning I can implement it myself. 38:36 You can implement it yourself. 38:38 It's a set of rules and a way of working. 38:41 It's supported by different software frameworks. 38:43 We will use one in the lab to see it working. 38:46 There are many ways of doing these kinds of distributed processing things, but this is just being used for illustration. 38:51 OK, where did this originally come from? 38:53 It was Google who originally came up with it. 38:56 They don't use it any more. 38:57 They've gone to something better for their particular application. 39:01 You can read about it at the bottom, the Google publication; they came up with it because they were thinking about text processing. 39:08 So you've got loads of different web pages on the Internet and you want to extract from them how often words appear. 39:16 Term frequency, it's called. 39:18 We'll cover that in a future week. 39:19 And then you want to basically cobble together an index, and that's a distributed processing job. 39:26 So hence MapReduce. 39:28 So the idea here is that we want to generate key and value pairs from running across the data, and we do that in a distributed way. 39:37 So from one data set we generate key value pairs, from another key value pairs, and then we want to bring them together, we reduce them, and then we output something at the end. 39:48 So we want to be able to partition the data and schedule the processing across different computers.
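
As a quick illustration that MapReduce really is just a set of rules you could implement yourself, here is a toy, single-machine version in Python using the character data from the upcoming slide; in a real framework the map calls run in parallel close to the data and the shuffle happens over the network.

```python
from collections import defaultdict

# Toy, single-machine MapReduce over the character data used on the slide.
splits = ["ABC", "DDC", "ADB"]          # the framework's input splits

def mapper(chunk):
    # emit (key, value) pairs; here the key is the character, the value is 1
    for ch in chunk:
        yield ch, 1

def reducer(key, values):
    # combine all the values that share a key
    return key, sum(values)

# map phase (would run in parallel, close to the data)
pairs = [kv for chunk in splits for kv in mapper(chunk)]

# shuffle and sort phase (the framework groups the pairs by key)
grouped = defaultdict(list)
for key, value in sorted(pairs):
    grouped[key].append(value)

# reduce phase
print([reducer(k, v) for k, v in grouped.items()])
# [('A', 2), ('B', 2), ('C', 2), ('D', 3)]
```
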
39:54 So you've got some processing that's happening close to the data and some that's happening at the aggregation point. 40:03 All right, so this is it graphically, what happens. 40:05 It's shown here in graphical form with a very simple example to illustrate the point. 40:11 So here are the input data. 40:12 We have some text, and so we've got some text characters. 40:16 You can see this: A B C D D C A D B. 40:21 Now it will split the data; if you use, say for example, Hadoop, it will pick up a different mapper per input. 40:32 If you have a very big input file, it should split the input file into chunks. 40:37 So anyway, you've got the split of the input file, which I've illustrated here into chunks, and then you have the mapping. 40:45 So the mapping is where you have some input and you decide you want a key and value pair. 40:51 Now in this trivial case, I've just said the key is the actual value that we had here. 40:56 So you can see the key is A and the value is 1. 41:01 So I'm just saying, OK, I've found the character A once, or here I've found the character D once. 41:08 Now I'm generating here a series of key and value pairs. 41:11 So you can see I've got D 1, D 1, C 1 and so on and so forth. 41:16 You could be a bit more clever at the mapping stage. 41:18 You can basically produce any key and value pair you like at this stage. 41:23 So if you want to do lots of processing, then output the key and value pair, that's up to you. 41:27 You then have the so-called shuffle and sort phase, where it sorts things according to the key. 41:34 So you end up with all the keys together in one chunk or another, and then you end up lastly with the reduce, where you are given the data. 41:44 Now, in terms of the way this works, the thing that supports the MapReduce algorithm does the input phase for you: it splits the data, gives you the data, you then receive the data. 41:57 You have to write the map phase. 42:00 So you have to come up with: what do I want to do to generate a key value pair? 42:03 And you can do whatever you like. The shuffle and sort, that's handled by the framework supporting the algorithm. 42:10 And then you can do whatever you like for the reduce phase. 42:14 So you could output a plot, which you'll see in the lab, or something else. 42:19 All right, so quickly, we've got these different phases. 42:23 We've got an input phase, map, combiner; a combiner I haven't included here. 42:29 What happens is that if you have lots and lots of nodes, imagine you've got vast numbers of nodes, then if we just go straight from the output that's come from a map straight to the reducer, the reducer is going to be overwhelmed by the input data. 42:44 So the combiner is an optional phase for where we have lots of nodes in the system. 42:49 Then shuffle, sort, reduce, and the output phase. 42:51 OK, so how does this thing work? 42:54 It's typically reading data from a distributed file system. 42:57 So hence why it was used with HDFS in the first place. 43:02 It splits the data up into 64 megabyte or 128 megabyte sections which, if you remember what we said about HDFS, is the size of a normal HDFS block. 43:13 Then this is split into a job. 43:17 So you've got 64 megabytes considered by one job. 43:21 It generates key and value pair output; then the map, what it does is it receives those key and value pairs. 43:29 Depending on the framework, you may be allowed to omit the key or the value. 43:34 You can with Hadoop.
43:35 You'll see it in the lab. 43:36 You're actually allowed to omit the key or the value if you think it's useful. 43:40 The map process is run once per input data block and then produces some output, and it's written by the programmer. 43:50 The combine, as I already said, is an optional phase, and it's needed where you've got lots of nodes and you're having to combine them all together and you can't go straight to the reducer. 44:00 So this is again running user-defined code to aggregate the values. 44:04 We won't see this in the lab; it's just that it is possible. 44:08 I'm not going any further than that. Shuffle and sort, that's implemented by the framework, and it shuffles and sorts depending on the keys. 44:20 And then lastly the reducer. This is written by the developer; it receives the grouped key and value pairs and then you can do all sorts of things, whatever you want, in the reducer. 44:31 You can perform some kind of further aggregation or filtering or combining the values together. 44:36 And then at the end you output zero or more key value pairs. 44:40 So you've shrunk the data down. 44:41 So you can see why this was used as a parallel processing algorithm, because the algorithm itself has the idea that it's going to be executed close to the data, combined, and then a combined output produced, instead of pulling all the data and saturating all the network links, which we don't want to do. 45:00 And then finally, yeah, the output. 45:01 So whatever it is you're using normally picks up the standard out, that's the text that goes to the screen, and saves it somewhere for you, which you'll see in the lab. 45:11 OK, so that's MapReduce. 45:12 Does anybody have any quick comments or questions about it before I go on to how you can implement it? 45:21 No? Not very talkative today. 45:26 So, implementing MapReduce. You can implement this thing from scratch. 45:31 As I already said, you can sit at home, look at the rules: I'm going to write this myself in whatever language I want and just obey the rules. 45:39 It is supported by MongoDB, but it's basically being phased out. 45:46 So I'm no longer talking about it in the lab. 45:48 It's become no longer that important for MongoDB, because there's something else which I'll talk about at the end of the lecture. 45:54 It's still supported by Hadoop. 45:57 And yeah, you're going to have to write the map and the reduce steps, but whatever it is, in our case it'd be Hadoop, performs the other steps for you. 46:08 So MapReduce is designed to read text, normally. 46:12 So by text, I mean it's got to be a text file; it could be numbers, it could be letters, but it's a human-readable text file. 46:23 It's not a binary file, normally. 46:24 You can use binary input, but you have to do a bit of extra work to accept binary input, and some of these frameworks don't allow you to do that very easily. 46:35 OK, so here's an example of using MapReduce with Hadoop. 46:40 So Hadoop supports the so-called streaming library. 46:44 So what we can do is we can run a map and reduce, sorry, a map function or a map method, a programme even, on a file, depending on how big it is. 46:57 So what happens here is that we write our map and reduce algorithms in any programming language you like. 47:03 OK, so you could write in Fortran if you're so inclined, or C# or whatever. 47:08 You receive the data input from Hadoop through the standard in, as if somebody typed it.
47:14 And then what happens inside your programme is up to you, as long as you output something. 47:19 So you have to output a key and value. 47:22 You can output a null key or a null value if you want to. 47:28 And yes, you can do numeric processing, where the data that you're reading are actually a bunch of numbers. 47:34 They don't have to be text. 47:37 So here's an example in Python. 47:40 So what am I doing here at the top? 47:43 I'm reading from the standard in. 47:44 Now the standard in is essentially what you have when you type on the keyboard. 47:49 But what's happening here is Hadoop is opening the file and sending the data to one instance of the mapper. 47:57 This is the map programme. 47:59 And here the map programme is very simple. 48:02 All I'm doing is I'm reading all the lines in the input data and then I'm splitting them based on white space. 48:10 That's what this function does. 48:12 And then I end up forcing these to lowercase, and I've just got the word, a tab, and a 1. 48:19 Now the first tab character with this Hadoop streaming library is used to separate the key and value pair. 48:27 So if I have a tab with nothing in front, then the key is null. 48:31 If I have a tab with nothing after, then the value is null. 48:35 All right, you can have more tabs after the first one, but they are assumed to be part of the value. 48:41 So here what I'm doing is exactly what I was saying earlier. 48:44 I've got the word and just a 1. 48:49 So I'm going to end up with a list of words, each with the number one, as my key and value pairs. 48:57 Now the reducer again reads from the standard in, and I need to get back my key value pairs. 49:04 Now I'm using split here. 49:06 Yes, I could be more sensible and, yes, I could split using only the first tab. 49:12 In this case, I only have one tab, so I've just simplified the code, all right, for the bright people. 49:18 So what we're getting back here is the key and value again, and I can then count up the number of times a word exists in a text file. 49:28 Now, using this, I could download lots of Internet pages and execute this. 49:34 I'd have all the keys, and then I reduce them and I'd end up with a term frequency, the number of times each word appears in each document. 49:43 OK, so as I've already said, yeah, you can do numeric processing. 49:47 You can export some value from this if you so wish. 49:51 So here's another example. 49:54 This time the value is not just a word. 49:58 In fact, this time the value is a histogram. 50:02 And I've taken a histogram and I have turned it into JSON. 50:06 So JSON, as you know, is a text format. 50:10 So I'm just saying, OK, there's the histogram with the number of bins, with the entries in the bins, turned into JSON. 50:16 My mapper is now histogramming, with some input, all the data that's given to it. 50:23 And then when it finishes, it's sending the JSON to the reducer. 50:28 And then the reducer receives all of the JSON from all of the files that have been analysed, that got histograms or whatever. 50:36 And then you can go ahead and append all the histograms together. 50:39 So you can have processing happening at the map stage, where you are performing some kind of analysis. 50:46 You then aggregate them back at the reduce stage and then finally output a histogram. 50:51 You'll see that working in the lab when you get to it.
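
A rough sketch of the mapper and reducer just described, written for Hadoop Streaming (word count; the behaviour follows the description above, and the file names are illustrative):

```python
#!/usr/bin/env python3
# mapper.py: read lines from standard in, emit "word<TAB>1" for each word
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: read key<TAB>value lines from standard in, total the counts per word
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")   # only one tab in this simple case
    counts[word] += int(value)

for word, total in counts.items():
    print(f"{word}\t{total}")
```

Locally the pair can be tested with a shell pipeline such as `cat pages.txt | ./mapper.py | sort | ./reducer.py`; on a cluster it would be submitted through the Hadoop streaming jar, roughly `hadoop jar hadoop-streaming-*.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py`.
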
50:55 Again, this makes much more sense if you're processing large amounts of data over a parallel system. 51:02 If you can do this on one computer only, don't bother with Hadoop and the streaming library, because there's an overhead to managing the processes that will slow you down. 51:13 It's not significant when you've got large files across lots of computers. 51:19 All right, so that is the implementation of MapReduce with Hadoop and the streaming library. 51:26 And now I'm going to go on to MongoDB aggregation, which is a kind of, should we say, evolution of MapReduce. 51:33 It's a generalisation of MapReduce. 51:36 Does anybody have any comments or questions about the last few slides and the Python before I go on? 51:44 No. 51:45 OK. So this is a generalisation, and it's essentially taken over from MapReduce in MongoDB. 51:51 So that's why I'm not talking about MapReduce in the lab with MongoDB. 51:56 Now the idea here is we've got a collection. 51:59 You remember with MongoDB you hold a database, or several databases, and inside 52:05 there can be several collections, and you want to go through all those documents in the collection. 52:10 And the documents in the collection could have a lot of data in them. 52:14 They could be quite complicated. 52:16 They're not just a table. 52:17 You can have quite a complicated structure. 52:19 So they came up with this idea of pipelining the analysis of these input data that are in the collections, and then the output from one stage in this pipeline becomes input to the next. 52:33 So you start off by reading the data from the collection, you do something, you output a new collection, and then you're essentially inputting that into the next step of the pipeline, and so on and so forth. 52:44 Now there are a whole bunch of operators, and I'm not expecting you to remember them. 52:48 It's more just the principle I'm hoping you capture. 52:52 If you follow the link at the bottom, you'll find out about the operators. 52:56 We will use some of them in the lab. 52:58 Essentially you can unwind, where you basically take a list that's embedded inside a document record and consider each one as a document, or you can aggregate, where you consider several documents as one document. 53:11 Here's an example. 53:14 So at the top what I've done is I've got... where you could insert some more data, and you'll probably recognise this from MongoDB. 53:21 I'm using the PyMongo interface, so it's Python code. 53:26 I'm saying I'm inserting many documents. 53:28 So in this simple example I've got customer ID and amount 300 into this collection here, and then afterwards I've defined a pipeline. 53:38 Now this pipeline only has one step. 53:41 The syntax is that you have a list, and then inside you have each step in the pipeline. 53:46 So each pipeline step has an operator here where you are doing something. 53:52 In this case, I'm saying I want to group together data, I'm aggregating, and my new collection is going to have the ID of customer ID. 54:00 So I've decided that's what my ID is going to be for the internal collection. 54:04 And then I'm saying I want to sum across the amount for all of the unique IDs. 54:11 So what's going to come out of this is the amount, the total of the amount, for customer ID one or two or three. 54:19 So it's just one step in the pipeline. 54:22 And then I can run this using collection dot aggregate.
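
The PyMongo example being described would look roughly like the sketch below; the connection string, database, collection and field names are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # illustrative connection string
collection = client["shop"]["orders"]               # illustrative database/collection names

# Insert a few simple documents: a customer id and an amount.
collection.insert_many([
    {"customer_id": 1, "amount": 300},
    {"customer_id": 2, "amount": 150},
    {"customer_id": 1, "amount": 50},
])

# A pipeline with a single $group stage: one output document per unique
# customer_id, with the amounts summed across that customer's documents.
pipeline = [
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}}
]

for doc in collection.aggregate(pipeline):
    print(doc)      # e.g. {'_id': 1, 'total': 350}, {'_id': 2, 'total': 150}
```
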
54:27 So this collection is just the variable. 54:29 I call it collection because in my case it is a collection. 54:32 In your case, it could be something like, I don't know, boats or whatever, whatever variable you think is useful. 54:39 So that's the actual MongoDB collection, and we just say aggregate and we give it the pipeline definition, and this will return a series of documents which you can then iterate over as a normal document collection. 54:52 You can serialise them, that means store them, as well if you so choose. 54:58 OK, so that is MongoDB aggregation, which you'll see in the lab. 55:04 Spark. Using Spark. 55:07 Now I've borrowed this diagram from the URL at the bottom here. 55:13 Spark, as I said earlier, allows you to process over a parallel set of compute nodes. 55:22 Now how does it work? 55:23 You have at the very bottom of the Spark processing stack a so-called Spark context. 55:28 And what happens when you start this is it starts a series of other processes, one of which is actually monitoring what's happening to the jobs. 55:38 So it's not just, OK, running your programme; it actually runs a little web server as well, which you can query while the job's running and find out what's happening to the parallel processes. 55:47 Which is one of the reasons why you shouldn't really use this if you're running locally, because there's an overhead to it. 55:52 Anyway, the Spark context is responsible for sending job requests off to these worker nodes. 55:59 And in between, you have this cluster manager, which, depending on the kind of cluster manager you're using, whether it's Kubernetes or something else, will look slightly different. 56:12 They've got different diagrams for the different cluster manager setups if you want to have a look. 56:18 Now, as I've already said, you have some kind of caching as part of this Apache Spark setup, which speeds up your data analysis. 56:29 Now, while you're running that Spark context, what you have is a little web server. 56:34 So you can try this in the lab: you can pause, or use the debugger to pause, the programme, and while the Spark context is still in scope you can go to the local host and check out the jobs that are running. 56:48 So this was me. 56:49 I was running one of the test programmes in the lab and I paused it and had a look, and I could see what was going on with the programmes. 56:57 Now obviously that's far more useful if your analysis is running across lots of different nodes. 57:03 While it's running, you can just go ahead and look at what's going on inside. 57:09 Here is an example of RDDs. 57:11 This is a very, very simple example to hopefully give you a flavour. 57:16 What we have in Python is we start with the library PySpark. 57:21 So the Python support for Spark is actually through a Java wrapper library. 57:28 But it appears in Python as if it is a Python library. 57:35 Unless of course we have a weird error, and then we see lots of Java errors on the screen, because underneath it's still Java. 57:40 So in this case, I create a Spark context. 57:43 Remember, that's the key process that has to be in scope and manages the job. 57:48 Now when I create it, I have to tell it where it's going to run.
57:52 Now, if I don't tell it where it's going to run, I can actually direct where it's going to run using a spark-submit command, which you'll see in the lab. 58:02 But I'm not going to confuse you with that at the moment. 58:04 So here I give it a name as well. 58:06 Why do I give it a name? 58:07 Well, I give it a name because it's running in a batch system. 58:11 So there may be many other applications running in that system. 58:15 So if your application's called 'no name' and, you know, you've got hundreds of different people running an application called 'no name', it doesn't tell you anything. 58:22 So normally you give your app a name so you know what it is that's actually running. 58:28 Here I've said local. 58:29 Now if I say local like this, without any square brackets, it's going to use one thread, one processing thread, on the local scheduler. 58:37 And this is only for testing. 58:39 So here, instead of local, I could give the address of the Kubernetes master node or the YARN scheduler or something else. 58:48 So this is where you change which thing Spark is talking to. 58:54 Now you can then load in data. 58:56 Here I'm loading in the data from a local file system. 58:58 Again, that's for testing; you could use a logical file name which points to a distributed file system, which wouldn't be for testing. 59:07 So here I've got just a bunch of numbers, and then having read the numbers, this is now outputting an RDD. 59:14 This is the resilient distributed dataset. 59:16 I can then operate on the RDD using a map. 59:21 We've seen this concept before, haven't we? 59:23 So the map basically will take a function and it will give each value from the input data, here this RDD, so in this case each line, to this map. 59:36 Now you can implement the function as a real function or you can use lambda syntax. 59:43 Now I'm using lambda syntax here; in the lab I've got a mixture. 59:47 There's a function and lambda syntax, 59:49 so you can see the two things. 59:51 The lambda syntax here, what does it mean? 59:54 It means that for every line in the input file, here numbers, 59:59 it's going to split them according to white space and then it's going to cast that line. 1:00:04 So it's split them. 1:00:06 So for each word, we're assuming here that this thing numbers does really contain numbers, because otherwise that's going to crash, right? 1:00:13 And then once it's done all of that, it's converting it into an array. 1:00:16 So the output of this map step is an RDD which comprises an array of ints. 1:00:26 All right, so we've taken in some data, and you can quite clearly see this is unstructured. 1:00:32 And then we've put in our code some assumptions about its schema, and then we've produced something that's more structured. 1:00:40 OK, so that's one way of working. Here is the data frame way. 1:00:46 Now a data frame sits on top of a Spark context. 1:00:51 So what we do here is we create a so-called Spark session. 1:00:55 Now underneath there's still a Spark context that's running, and we set it up, again, with a name, 1:01:03 because the name, as I said already, is associated with the jobs that are going to run on the parallel system. 1:01:09 We then can have other settings in here, which I haven't put in, or we can provide them using the spark-submit command. 1:01:17 Now with a Spark session, we can then read the data and we can start to assume some schema.
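
The RDD example being described would look roughly like this sketch; the file name numbers.txt is illustrative, and 'local' keeps everything on one thread for testing.

```python
from pyspark import SparkContext

# One local thread, named so it can be identified among other jobs.
sc = SparkContext("local", "rdd-example")

# For testing, read a local text file of whitespace-separated numbers;
# in real use this would be a path on a distributed file system.
numbers = sc.textFile("numbers.txt")

# Schema on read: we assume each line really does contain integers.
ints = numbers.map(lambda line: [int(x) for x in line.split()])

print(ints.collect())    # e.g. [[1, 2, 3], [4, 5, 6]]
sc.stop()
```
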
1:01:25 We can either infer the schema from the data, or when we read them we can require the schema. 1:01:31 Now if you require it and it doesn't match, it's going to fail, or it's going to force it. 1:01:35 Here I'm inferring it, meaning that it will read my CSV file and it will decide whether or not that column is an integer column or whatever else. 1:01:46 So it's performing something a little bit like a pandas CSV read, only it's a little bit cleverer and it's able to do it in a parallel, processed manner. 1:01:57 So here I'm again loading some local file, just for testing. 1:02:01 Now once we've got a data frame, we can use it just like we would a pandas data frame, meaning you can perform all the normal stuff you would on a pandas data frame. 1:02:11 I'm not commenting on that any further; beyond this, in the lab, you'll see me use a Spark data frame in the way that you'd use pandas. 1:02:20 Here I'm going one step further and I'm using SQL. 1:02:24 So as soon as you've got a Spark data frame, you can tell Spark that you want to consider it as a table. 1:02:35 So you want to decide this is a relational table. 1:02:38 So to do this, we essentially say we can have a general view or we can have a temporary view, 1:02:43 but you essentially say this data frame is this table. 1:02:47 Now once you've done that, you can then go ahead and use regular SQL to access it. 1:02:52 And why is this useful, you say? 1:02:55 Well, because once you have a relational database interface, you can then put a dashboard on top, such as our friend Power BI or something else, Tableau. 1:03:05 It's going to read the relational data directly from Spark. 1:03:10 So you can have all of that logic in the background and then provide an interface which can be used for dashboarding. 1:03:18 That's it.
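
Similarly, here is a rough sketch of reading a CSV with an inferred schema, registering it as a temporary view and querying it with SQL; the file, column and view names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-example").getOrCreate()

# Infer the column types from the CSV rather than declaring a schema up front.
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Treat the data frame as a relational table so it can be queried with SQL,
# which is what a dashboarding tool would sit on top of.
df.createOrReplaceTempView("customers")
totals = spark.sql(
    "SELECT customer_id, SUM(amount) AS total FROM customers GROUP BY customer_id"
)
totals.show()
spark.stop()
```
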
