Full Transcript

Edit this cell to enter your name and PI. Then run the JavaScript cell below to identify answer and feedback cells. Name:\ PI: In : %%javascript var head=document.getElementsByTagName('head'),style = document.createElement('style' TMA 03 M269 requires all assignments to be submitted electronically,...

Edit this cell to enter your name and PI. Then run the JavaScript cell below to identify answer and feedback cells. Name:\ PI: In : %%javascript var head=document.getElementsByTagName('head'),style = document.createElement('style' TMA 03 M269 requires all assignments to be submitted electronically, by following the link(s) from your StudentHome page to the online TMA/EMA service. If you foresee any difficulty with submitting your assignment on time then you should contact your tutor well in advance of the cut-off date. For further information about policy, procedure and general submission of assignments please refer to the Assessment Handbook, which can also be accessed via your StudentHome page. The learning outcomes assessed by each questions are outlined at the head of the question. In answering this TMA, remember that you can only use the Python data types, methods and functions listed in the chapter summaries, unless specified otherwise within this TMA. If you use an algorithm, data structure or operation that is not allowed in a part of a question, your answer to that part will be awarded zero marks. You may wish to make use of the allowed tool to help you check for any violations. Your TMA should be opened and edited in Jupyter notebooks. Ensure you put your name and PI at the top of this document. You should run the second cell to highlight the answer cells in yellow. All of TMA 03 is in this document. There are seven questions worth 100 marks. Part 1 covers a range of topics from Chapters 1 to 25. Part 2 focuses on complexity classes and computability from Chapters 26 and 27. PART 1 (59 marks) Introduction DNA strands contain sequences of chemical units known as bases, often referred to by the initial letters of their chemical names, one of A, C, G, or T. This allows a DNA strand to be conveniently represented by a string such as ‘CAGTATGGCCA’, which we can process using Python code the same way as any string. In practice, DNA strands are usually very long and may have millions of bases. The DNA strand examples we use for this question will be much shorter than this. You do not need to understand any more than this about DNA or biology in general. From now on we will usually be discussing string processing, though it is very much applicable to the field of bioinformatics. We are interested in finding the longest common substring of two given character sequences (not the longest common subsequence as in the Chapter 23 examples). For the longest common substring the common characters must be contiguous in the original strings. For example, ‘ACG’ is a subsequence of ‘TATCCG’ but it is not a substring of this. ‘ATC’ is a substring of ‘TATCCG’ because the letters ‘A’, ‘T’ and ‘C’ occur together in the longer string. Make sure you understand the difference! This could be relevant in searching for similar structures within two DNA sequences, such as genes or other structures within chromosomes. We will ask you to approach this problem in two different ways and to compare the expected and actual performance of each approach. Question 1 (20 marks) You should be able to answer this question after you have studied up to Chapter 13 where Big-Oh notation is introduced, though probably you will be able to attempt most of it after studying Chapter 4 on Sequences and iteration. This question assesses the learning outcomes: Develop and apply algorithms and data structures to solve computational problems. Analyse the complexity of algorithms to support software design choices. Explain how an algorithm or data structure works in order to communicate with relevant stakeholders. Write readable, tested, documented and efficient Python code. We wish to develop a function to return the longest common substring of two character strings, which we will call left and right. We will use examples representing DNA sequences, so the strings we use will only contain the characters 'A' , 'C', 'G', 'T' , but the function should work for any valid character string. In this question, we want you to take a brute force (exhaustive search) approach but to build up the function in stages as outlined below. Q1(a) (3 marks) Write and test a Python function common_pair , that checks whether any two consecutive characters of left match any two consecutive characters in right. The function should return the first two matching characters it finds or the empty string if there is no match. Complete the function in the following cell. We have provided some tests; your tutor may use additional tests. In [ ]: %run -i m269_util def common_pair(left: str, right: str) -> str: pass # Delete this and replace with your answer here common_pair_tests = [ # case, ('one is empty', left, 'ACTG', right, '', common pair ''), ('same string', ('nothing common', ('subsequence', 'ACTG', 'ACCA', 'AAAA', 'ACTG', 'GTGT', 'GATTA', 'AC'), ''), '') ] test(common_pair, common_pair_tests) Q1(b) (4 marks) Write and test a new function common_substring , with arguments left and right as in part (a) but also accepting an additional integer argument length. The function should check whether any substring of length characters in left matches a substring of length characters in right. The function should return the first matching substring of length characters or the empty string if there is no match. Complete the function in the following cell. We have provided some tests - your tutor may use additional tests. In [ ]: %run -i m269_util def common_substring(left: str, right: str, length: int) -> str: pass # Delete this and replace with your answer here common_substring_tests = [ # case, left, ('one is empty', 'ACTG', ('same string', 'ACTG', ('nothing common', 'ACCA', ('subsequence', 'AAAA', ] right, '', 'ACTG', 'GTGT', 'GATTA', csl, 2, 3, 3, 1, common substring ''), 'ACT'), ''), 'A') test(common_substring, common_substring_tests) Q1(c) (3 marks) Write and test a new function longest_common_substring , with arguments left and right as in part (a). This function should check for the longest common substring of left and right. It should return the longest common substring or the empty string if there is no common substring. Complete the function in the following cell. We have provided some tests - your tutor may use additional tests. In [ ]: %run -i m269_util def longest_common_substring(left: str, right: str) -> str: pass # Delete this and replace with your answer here longest_common_substring_tests = [ # case, left, right, ('one is empty', 'ACTG', '', ('same string', 'ACTG', 'ACTG', ('nothing common', 'ACCA', 'GTGT', ('subsequence', 'AAAA', 'GATTA', ] longest common substring ''), 'ACTG'), ''), 'A') test(longest_common_substring, longest_common_substring_tests) Q1(d) (5 marks) Explain whether or not you would expect any difference in the best-case and worst-case complexities of your function longest_common_substring. Analyse the worst-case complexity of your function. You may find it convenient to define L = |left| and R = |right| and to express the complexity in terms of those values. We have used uppercase letters to avoid confusing lowercase 'l' with other characters. You should use Big-Theta notation (Section 2.7) or Big-Oh notation (Section 13.3.3) as appropriate. Complexity of sequence operations (which applies to strings) is given in Section 4.9.1. Write your answer here Q1(e)(i) (3 marks) Measure the actual worst-case performance of your function longest_common_substring for a range of suitable inputs. You will need to make a number of measurements to see how the timing varies as L or R increases. You may find it useful to recall that arithmetic operators such as + and * can be applied to strings to conveniently define very long strings. For example: left_1 = 'C' * 100 + 'A' * 200 left_2 = 'GAT' * 100 left = left_1 + left_2 left = left * 2 In [ ]: # Write your timing code here Q1(e)(ii) (2 marks) Discuss whether the performance was in line with your expectations from complexity analysis. There are many reasons why you may or may not see performance data exactly in line with expectations. You can still get full marks on this question as long as you provide appropriate timing results and a clear discussion of what you expected and what you found, but you do not need to provide an explanation of any differences. Write your answer here Question 2 (29 marks) This question can be answered after studying Chapter 23 (Dynamic Programming). It assesses the learning outcomes: Develop and apply algorithms and data structures to solve computational problems. Analyse the complexity of algorithms to support software design choices. Explain how an algorithm or data structure works, in order to communicate with relevant stakeholders. Write readable, tested, documented and efficient Python code. In this question we aim to solve the same problem as in Question 1 – given any two DNA sequences (strings containing only C, G, A or T) find their longest common substring – but using a different approach. You should provide a dynamic programming solution, following the examples in Chapter 23, adapting code as appropriate. You will lose marks if you do not follow an approach similar to Chapter 23. Q2(a) (4 marks) Write and test a recursive Python function lc_leading_substring , that returns the longest common leading substring of two strings, left and right. This means comparing the two strings starting from their first character. The function should return the empty string if there is no match. This function should be useful when you move on to coding part (b). For example, the longest common leading substring of 'ACAGTC' and 'ACATGT' is 'ACA'. The longest common leading substring of 'ACAG' and 'CACAG' is the empty string ''. Complete the function in the following cell. We have provided some tests - your tutor may use additional tests. In [ ]: %run -i m269_util def lc_leading_substring(left: str, right: str) -> str: pass # Delete this and replace with your answer here lc_leading_substring_tests = [ # case, left, right, longest common leading substring ('one is empty', 'ACTG', '', ''), ('same string', 'ACTG', 'ACTG', 'ACTG'), ('nothing common', 'ACCA', 'GTGT', ''), ('subsequence', 'AAAA', 'GATTA', '') ] test(lc_leading_substring, lc_leading_substring_tests) Q2(b) (14 marks) Write and test a function longest_common_substring_DP with string arguments left and right , using a top-down dynamic programming algorithm, to find the longest common substring of the two strings. It should return the empty string if there is no common substring. Chapter 23 shows a variety of dynamic programming approaches to various problems, including the similar (but different!) problem of the longest common subsequence of two strings. We would like you to write your function so as to use top-down dynamic programming using either a dictionary or a matrix approach for the cache, closely following one of the techniques illustrated in Chapter 23 for the longest common subsequence problem. You will lose marks if you do not follow an approach similar to Chapter 23. Complete the function in the editable code cell below. You can use the same tests that were provided for Question 1(c) - your tutor may use additional tests. Here is an outline of a possible recursive solution to the problem, similar to the recursive solution to the longest common subsequence problem in Chapter 23. This is not exactly the algorithm you need as it does not use cacheing, but it may be a useful start. In [ ]: #def lc_substring(left: str, right: str) -> str: #"""Return the longest common substring of left and right.""" # if one or both strings are empty: # return the empty string # else # find the LC leading substring of left and right # compute the LC substring when skipping the right head # compute the LC substring when skipping the left head # return the longest of these three substrings In [ ]: %run -i m269_util def longest_common_substring_DP(left: str, right: str) -> str: pass # Delete this and replace with your answer here # Use the same test cases as in Q1(c) test(longest_common_substring_DP, longest_common_substring_tests) Q2(c) (5 marks) Explain whether or not you would expect any difference in the best-case and worst-case complexities of your dynamic programming function for finding the longest common substring. Analyse the worst-case complexity of your function. Again, you may find it convenient to define L = |left| and R = |right| and to express the complexity in terms of those values. We have used uppercase letters to avoid confusing lowercase 'l' with other characters. You should use Big-Theta notation (Section 2.7) or Big-Oh notation (Section 13.3.3) as appropriate. Complexity of sequence operations (which applies to strings) is given in Section 4.9.1. Write your answer here Q2(d)(i) (4 marks) Measure the actual performance of your dynamic programming function for finding the longest common substring, for a range of suitable inputs. If you find that you get errors because of too much depth of recursion for large values of L and R, you can check and amend the current recursion limit on your system using the following Python code: import sys print('Current recursion limit is', sys.getrecursionlimit()) sys.setrecursionlimit(3000) # Change this '3000' to any sufficiently large integer value print('Check: new recursion limit is', sys.getrecursionlimit()) In [ ]: # Write your timing code here Q2(d)(ii) (2 marks) Discuss whether the performance was in line with your expectations from complexity analysis and how it compares with your 'brute force' function longest_common_substring from Question 1. There are many reasons why you may or may not see performance data exactly in line with expectations. You can still get full marks on this question as long as you provide appropriate timing results and a clear discussion of what you expected and what you found. You do not need to provide an explanation of any differences. Write your discussion answer here Question 3 (10 marks) You should be able to answer most of this question after you have attempted Question 1 (which requires study up to Chapter 4 on Sequences and iteration) and all of it after you have attempted Question 2 (which requires study up to Chapter 23 on Dynamic programming). This question is quite open-ended but assesses some or all of the learning outcomes: Develop and apply algorithms and data structures to solve computational problems. Analyse the complexity of algorithms to support software design choices. Explain how an algorithm or data structure works, in order to communicate with relevant stakeholders. In comparing DNA sequences and sometimes other strings, it can be useful to allow slightly inexact matches. For example, DNA sequences that are essentially the same may differ in a small number of characters due to processes such as mutation, deletion or insertion. In their simplest form these involve a change of one character – a mutation involves a base ('A', 'C', 'G' or 'T') that changes to a different base (e.g. 'ACTG' becomes 'ATTG'), a deletion means one base is removed from the DNA sequence (e.g. 'ACTG' becomes 'ATG') and an insertion means one additional base is added to the sequence at a particular point (e.g. 'ACTG' becomes 'ACTCG'). We do not expect that you will need to use any sources other than the M269 text in answering this question, but if you do use other sources, then make sure that you include appropriate references to these at the end of your answer, to avoid any concerns about plagiarism. Q3(a) (7 marks) Discuss in general terms whether and how it might be possible to adapt your 'brute-force' algorithm from Question 1, to allow for some limited effects of mutation, deletion or insertion in identifying longest common substrings, and whether this could be expected to have an impact on the performance. We do not expect you to write or modify any code for this question. Write your answer here Q3(b) (3 marks) Give one idea for how your dynamic programming solution to Question 2, could be adapted for the same sort of inexact matching to allow for mutation (only). This could be a similar adaptation to what you proposed in Q3(a) for the Question 1 solution, or it might be different. In either case, explain why this is appropriate. Very briefly explain any effect you would expect on the complexity or performance of the dynamic programming solution from your proposed adaptation Write your answer here PART 2 (41 marks) Question 4 (9 marks) This question can be answered after studying Section 26.2. It assesses the learning outcome: Understand the common complexity classes. A workshop organiser is trying to distribute people into groups so that no group has two people who have previously worked together. There's a limit on the number of groups that can be created because each group will be in a separate room. The organiser knows who has worked with whom. Can she distribute the participants as required? The general problem is as follows. Function: workshop groups\ Input: people, an undirected graph; number of groups, an integer\ Preconditions: people has n ≥ 2 nodes; 1 ≤ number of groups ≤ n\ Output: ok, a Boolean\ Postconditions: ok is true if and only if the set of nodes of people can be divided into number of groups subsets so that neighbouring nodes are in different subsets. Here is an example graph, with nodes representing people and edges representing work relations. This figure shows a graph with five nodes, labelled A to E. There are six undirected and unweighted edges: A–B, B–C, C–E, E–D, D–A and D–B. If number of groups is 1, then ok = false, because the 5 people can’t be put in the same group. However, for number of groups = 3, ok is true: A and C are one group; E and B are another; and D is the third group. Another possible grouping would be A and E are the first group, C and D are the second, and B is the third. Prove that this decision problem is in class NP by answering the following parts. Q4(a) (3 marks) Describe a certificate for a problem instance (a graph people and an integer number of groups) that has a true output. Give as example a certificate for the graph above and number of groups = 3. Write your answer here Q4(b) (6 marks) Explain why certificates for this problem can be verified in polynomial time. In other words, briefly explain what the verifier does, given people, number of groups and the corresponding certificate, and why it takes polynomial time. Write your answer here Question 5 (29 marks) This question can be answered after studying Section 27.2. It assesses the learning outcomes: Understand the Turing Machine model of computation. Explain how an algorithm or data structure works, in order to communicate with relevant stakeholders. We want a Turing machine that checks if a string is a palindrome, i.e. is the same as when read backwards. The input tape consists of zero or more zeros and ones, followed by blanks. The output is 1 if the input is a palindrome, otherwise 0. You may modify the input symbols. For example, if the input is [1, 0, 1] then the output is , whereas input [1, 0] leads to output. Q5(a) (17 marks) Write the transition table for the Turing machine. Organise the transitions by state or by the order they're executed (Section 27.1.3). Use descriptive state names. Suggestions: Write and test a machine that handles even-length inputs and then handles odd-length inputs. 'Remove' (i.e. blank out) input symbols as you process them. You should add tests to check your Turing machine, but you won't be awarded any marks for your tests. You don't have to remove your tests before submitting this TMA. In palindrome_tests , set the debug parameter to True if you want to see the configurations your Turing machine goes through. Make sure you set it back to False and run the cell again before submitting your TMA. In [ ]: %run -i m269_util %run -i m269_tm # both files are in the downloaded TMA zip file palindrome = { # write the transitions here in the form # (state, symbol): (new_symbol, LEFT or RIGHT or STAY, new_state), } palindrome_tests = [ # case, ('palindrome', ('not palindrome', ] TM, input tape, debug, palindrome, [1,0,1], False, palindrome, [1,0], False, output tape ), ), test(run_TM, palindrome_tests) Q5(b) (12 marks) Explain how your Turing machine works, describing the main transitions and the purpose of each state. Your answer shouldn't be a direct translation of the transition table to English. You may include diagrams, if you wish, drawn by hand or with a computer. Write your answer here Question 6 (3 marks) This question can be answered after studying Section 27.4. It assesses the learning outcome: Know about the limits of computation and their practical implications. Consider the problem of deciding whether a given function (in Python or some other programming language) is recursive. Is this a computable problem, i.e. a decidable problem? In other words, is there an algorithm that can decide for every function whether that function calls itself? Explain why or why not. Write your answer here Submit this TMA by zipping together the notebook along with all of the helper and data files you have used, and submitting your.zip file using the online TMA/EMA service.

Use Quizgecko on...
Browser
Browser