Fundamentals of Software Systems
Exercise 1
Christoph Lofi

Data for AI Systems

• 1.1) Briefly describe why data-related tasks take so much time and effort in typical machine learning projects.
  – There are many data-related tasks, like data identification, data aggregation, data cleaning, data labelling, and data augmentation
  – Most of these tasks are specific to a given problem domain and cannot be solved in a general fashion
    • "Custom-made" solutions are needed in many cases
  – Most of these tasks (still) rely on manual labor

• 1.2) Briefly summarize the 4V's of Big Data (Volume, Velocity, Variety, Veracity).
  – Volume: There is a lot of data to store (size)
    • A lot of storage is needed
  – Velocity: Data comes in at high rates / speeds
    • The ability to ingest data at high rates is needed
  – Variety: Data comes in many formats / forms / data models
    • Systems need to be flexible (and thus cannot easily optimize)
  – Veracity: Data is of uncertain or even dubious quality
    • Systems need to measure and cope with data uncertainty / quality

• 1.3) Briefly assess to what extent the 4V's can be addressed by current technology.
  – Volume/Velocity are mostly fine
    • We have systems which scale well (in a mostly inefficient fashion) and process data streams in real time (with some caveats)
  – Variety is a bit more challenging
    • There are specialized systems for each type of data, but it is hard to use them together flexibly
  – Veracity
    • … this is active research. No real default solutions yet.

• 1.4) Briefly describe what a data lake is.
  – A single store of a wide variety of data
    • Typically big
    • Typically varied with respect to data types, data formats, schemas, and domains
    • Typically without a solid central understanding of data semantics

• 1.5) What does ETL stand for?
  – Extract, Transform, Load
    • The typical pipeline workflow in data engineering for transferring data from one system to another (different) system; a sketch of such a pipeline follows below
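To make the three ETL steps concrete, here is a minimal sketch of an extract-transform-load pipeline in Python. The CSV source file, its column names, and the SQLite target table are all invented for illustration; real pipelines would extract from and load into entirely different systems.

```python
import csv
import sqlite3

# Extract: read raw records from a source system.
# "maps_export.csv" and its columns are hypothetical.
def extract(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Transform: clean and reshape records to fit the target schema.
def transform(rows):
    cleaned = []
    for row in rows:
        cleaned.append({
            "title": row["title"].strip(),
            "year": int(row["year"]),          # normalize types
            "region": row["region"].lower(),   # normalize casing
        })
    return cleaned

# Load: write the transformed records into the target system.
def load(rows, db_path):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS maps (title TEXT, year INTEGER, region TEXT)"
    )
    con.executemany("INSERT INTO maps VALUES (:title, :year, :region)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("maps_export.csv")), "maps.db")
```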
• 1.6) Briefly discuss the advantages and limitations of synthetic training data for AI systems.
  – Advantages
    • Can generate large amounts of training data quickly
    • Can be used for data augmentation (e.g., balancing underrepresented classes)
  – Disadvantages
    • Unclear semantic reliability
      – Worse: reliability is hard to measure and assess
    • Does not work in all domains / scenarios
      – Relies on a well-trained, powerful generator model

TOWARDS SCALABLE NOSQL DATA MANAGEMENT

• 2.1) Briefly describe the main advantages of NoSQL systems over relational databases.
  – Potentially much better scalability, thus allowing global distribution and higher performance and throughput
  – Potential focus on availability and replication
  – Typically a simpler (but limiting) data model
  – Typically no need for predefined schemas
  – Etc.

• 2.2) Briefly describe the main advantages of relational databases over NoSQL systems.
  – ACID transactions
  – Powerful SQL query language
  – Flexible and more powerful data model
    • A lot of data is naturally tabular / relational
  – Query optimization
  – Etc.

• 2.3) Briefly discuss to what extent data management for a large enterprise application using SOA would differ from a traditional enterprise database setup.
  – This would move from having one big central database to having several smaller databases which are directly co-located with their respective applications
  – Communication / synchronization happens via interfaces (and not via a common schema)

• 2.4) Imagine that you are hired by the university library to design the data storage solution for a new project which scans and digitizes its large collection of historic maps in high resolution. These maps are typically not available to the public due to their fragile nature. After scanning, the maps are to be annotated with additional metadata like, for example, the place shown on the map, the type of map, the time period of the map, the authors, or other metadata like descriptions or even map-specific metadata fields as deemed relevant by the annotator. Later, this will be used for a freely accessible web application which allows users to search for a map and then view it online. What kind of data storage solution would you suggest for this scenario? What type of data model would it use? Why do you think that your suggestion is a good one?
  – Many answers are possible here
  – I personally would go with a document store, as this matches the data model and the queries quite well (no complex analytics or joins; all metadata refers to a single item/document)
  – There are also some arguments to be made for not storing the images in the database but putting them into the filesystem, and then using the database only for metadata.

• 2.5) Which data model would have the smallest impedance mismatch in the scenario described in 2.4, assuming that the majority of your application is written in a mix of JavaScript and Python?
  – A document store, as JavaScript / Python are both heavily based on JSON, and that also matches the scenario well (see the sketch below).
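As a small illustration of this low impedance mismatch, a map annotation from the 2.4 scenario can be written directly as a nested Python dict that serializes one-to-one to JSON, and a document store could persist such a document essentially as-is. All field names and values below are invented for the scenario.

```python
import json

# A hypothetical map record; every field name here is made up for illustration.
# Note how optional, annotator-specific metadata fits in without a schema change.
map_doc = {
    "_id": "map-0042",
    "title": "Plan of Delft",
    "place": "Delft, Netherlands",
    "map_type": "city plan",
    "period": {"from": 1649, "to": 1652},
    "authors": ["J. Blaeu"],
    "description": "Hand-coloured copper engraving.",
    "custom_fields": {"scale": "ca. 1:5000", "condition": "fragile"},
    "image_path": "/scans/map-0042.tiff",  # image kept in the filesystem
}

# The dict maps directly to JSON (and to a stored document), so no
# object-relational mapping layer sits between application and storage.
print(json.dumps(map_doc, indent=2))
```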
• 2.6) Briefly describe the (potential) disadvantages of a document store compared to a key-value store.
  – Key-value stores can be much simpler and lighter
  – Key-value stores could potentially be faster / scale better (depending on the use case scenario)

P2P

• 3.1) Briefly describe what the purpose of finger tables is in a DHT implementation like Chord.
  – To quickly locate the physical machine holding the requested data
  – Ensuring that each node only needs to have a node state of O(log n) while maintaining a routing complexity of also O(log n)
    • (node state == knowledge about other nodes)

• 3.2) Why does Amazon DynamoDB not use finger tables?
  – Because Dynamo assumes a more reliable and stable cluster of nodes
    • Chord was designed for P2P systems
  – Thus, it does not care about minimizing the node state, and opts for a node state of O(n) in favor of a routing complexity of O(1)

• 3.3) Imagine a DHT using finger tables with an (unrealistically tiny) hash range from 0 to 255 which is organized as a hash ring (like in Chord). There are already 5 nodes in the network.
  – Node 1 covers the hash range 3-55
  – Node 2 covers the hash range 56-70
  – Node 3 covers the hash range 71-200
  – Node 4 covers the hash range 201-240
  – Node 5 covers the hash range 241-2
  Construct a Chord-style finger table for Node 1.
  – The finger table for Node 1; each finger i targets hash 55 + 2^i, since Node 1's range ends at 55 (a sketch that computes this table follows below):

    i (log distance)   2^i (distance)   Target hash   Node id
    0                  1                56            2
    1                  2                57            2
    2                  4                59            2
    3                  8                63            2
    4                  16               71            3
    5                  32               87            3
    6                  64               119           3
    7                  128              183           3
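The table above can be reproduced mechanically: finger i points to the node responsible for hash (55 + 2^i) mod 256. The following Python sketch assumes, as one common Chord convention, that a node is identified by the upper end of the hash range it covers (so Node 1's id is 55).

```python
# Chord-style finger table construction for exercise 3.3; a sketch, assuming
# each node's id is the upper end of the hash range it covers.
HASH_SPACE = 256

# (node number, covered range from, covered range to), as given in 3.3
nodes = [(1, 3, 55), (2, 56, 70), (3, 71, 200), (4, 201, 240), (5, 241, 2)]

def responsible_node(h):
    """Return the node number whose hash range contains value h."""
    for node, lo, hi in nodes:
        if lo <= hi:                 # normal range, e.g. 3-55
            if lo <= h <= hi:
                return node
        elif h >= lo or h <= hi:     # range wrapping around the ring, e.g. 241-2
            return node

def finger_table(node_id):
    """Finger i targets (node_id + 2^i) mod 256, for i = 0..7."""
    return [(i, 2**i, (node_id + 2**i) % HASH_SPACE) for i in range(8)]

for i, dist, target in finger_table(55):   # Node 1's range ends at 55
    print(f"i={i}  2^i={dist:3}  target={target:3}  node={responsible_node(target)}")
```

Running this prints exactly the eight rows of the table above (targets 56 through 183, resolving to Nodes 2 and 3).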
CONSISTENCY

• 4.1) Briefly outline the difference between consistency in the CAP theorem and consistency in ACID.
  – CAP consistency only refers to the consistency of replicas; ACID consistency refers to overall data consistency when faced with potentially complex transactions.

• 4.2) A new database system claims that it can recover from a fatal hard drive failure by rebuilding all lost data from a log file stored on a different drive, and thus would be available (in the sense of the CAP theorem). Briefly discuss this claim.
  – "Rebuilding" data (as used in database recovery) is a potentially lengthy process
    • While it does protect from data loss (to some extent), the system is not fully available during the rebuild

• 4.3) Can the Two-Phase Commit Protocol recover from a worker who sent a "ready" answer to a "prepare", but never answered with an "acknowledge" after a "commit"? Why (not)?
  – Somewhat. We can treat that in different ways, but the safest would be to wait for a timeout. If there is still no acknowledgement, treat that silence as a failure, send a rollback to all other workers, and consider the transaction aborted.
    • Of course, the worker might have committed and then failed before sending the acknowledgment; we need to check for that when it becomes available again
    • Still, this is a messy situation

• 4.4) Why does a system like Amazon Dynamo use a Vector Clock instead of regular timestamps?
  – Because vector clocks can differentiate between conflicting versions (in case of a partitioning event or a concurrent modify) and one version simply being outdated compared to another (see the first sketch below).

• 4.5) Briefly discuss: In a replicated data storage scenario, a master-slave setup for each replica using locks for write operations would ensure that "read-your-own-write" conflicts will not happen.
  – No. Write locks would only fix write-write conflicts. The client can in theory still read from a slave which did not yet receive the write it already issued before to the master (see the second sketch below).
    • Of course, we can also use read locks or force reads to go to the master to fix this, but that will cost performance / limit scalability
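For 4.4, here is a minimal vector clock comparison in Python; the replica names and update pattern are invented. The key point is that two versions conflict exactly when neither clock dominates the other, a distinction a single timestamp cannot express.

```python
# Vector clock comparison, a minimal sketch for 4.4.
# A clock maps replica name -> number of updates seen from that replica.

def dominates(a, b):
    """True if clock a has seen at least everything clock b has seen."""
    return all(a.get(k, 0) >= b.get(k, 0) for k in set(a) | set(b))

def compare(a, b):
    if dominates(a, b) and dominates(b, a):
        return "identical"
    if dominates(a, b):
        return "a is newer; b is simply outdated"
    if dominates(b, a):
        return "b is newer; a is simply outdated"
    return "conflict: concurrent versions, must be reconciled"

# A version written at replica X, then updated again at X:
v1 = {"X": 1}
v2 = {"X": 2}
print(compare(v2, v1))  # -> a is newer; b is simply outdated

# Two clients update v1 concurrently at different replicas X and Y:
v3 = {"X": 2}
v4 = {"X": 1, "Y": 1}
print(compare(v3, v4))  # -> conflict, which plain timestamps could not detect
```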
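For 4.5, a toy simulation of why write locks alone do not give read-your-own-writes: replication to the slave is asynchronous, so a read routed to the slave can still return stale data even though the client's own write succeeded on the master. The master/slave objects and the explicit replication step are deliberate simplifications.

```python
# Toy asynchronous master-slave replication, a sketch for 4.5.

class Replica:
    def __init__(self):
        self.data = {}

master, slave = Replica(), Replica()
replication_log = []  # writes accepted by the master, not yet on the slave

def write(key, value):
    # Imagine a write lock held here: it serializes writes on the master
    # (fixing write-write conflicts) but says nothing about replication lag.
    master.data[key] = value
    replication_log.append((key, value))

def replicate():
    while replication_log:
        key, value = replication_log.pop(0)
        slave.data[key] = value

write("map-0042.title", "Plan of Delft")
# The client immediately reads "its own" write, but is routed to the slave:
print(slave.data.get("map-0042.title"))  # -> None: read-your-own-write violated
replicate()
print(slave.data.get("map-0042.title"))  # -> "Plan of Delft" (only eventually)
```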