Assignment 3 ECSE 420: Parallel Computing

Questions

1. (18 marks) Chapter 7, Memory access, Anderson Lock

The following benchmark is designed to measure the average time for reading array A, containing L elements, from memory when the array elements are s words apart:

    for stride s from 4 Bytes (1 word) to L/2 by 2x (s = 1, 2, 4, …, L/2)
        time the following loop (repeat many times and average)
            for i from 0 to L-1
                load A[i*(s+1)] from memory (4 Bytes)

So the benchmark reads L words from memory into cache, where L is a constant; however, these L words are not consecutive memory locations (they are "s" words apart). The results obtained from the benchmark are shown in Fig. 1.

[Figure 1. Average time for reading each array element]

Assuming only one level of cache exists, the cache line size is 4 words, and a single processor is running the benchmark, answer the following questions:

1.1 In Fig. 1, when the total number of elements of the array, L, is smaller than L’, the average time per access remains constant. What does the value of L’ represent? What does time t0 show?

1.2 When L is larger than L’, what does time t1 indicate in Fig. 1?

1.3 For each part of the graph (parts 1, 2, and 3), justify its behaviour.

We know that a phenomenon called false sharing can cause unnecessary invalidations in ALock (Anderson Lock), and one way to avoid false sharing is to pad array elements so that distinct elements are mapped to distinct cache lines.

1.4 Considering the results from Fig. 1, how could the padding technique used in ALock degrade the overall performance of the lock?

2. (18 marks) Chapter 9, examining the fine-grained algorithm

2.1. Provide the code for the contains() method missing from the fine-grained algorithm described in Chapter 9.

2.2. Write a test to verify the contains() method and explain why your implementation is correct.

3. (12 marks) Chapter 10, designing a bounded lock-based queue

3.1. Design a bounded lock-based queue implementation using an array instead of a linked list. Allow parallelism by using two separate locks for the head and the tail.

3.2. Try to transform your algorithm to be lock-free. Where do you run into difficulty?

4. (32 marks) Chapter 16, matrix vector multiplication

4.1. Implement a sequential matrix-vector multiplication.

4.2. Give an efficient and highly parallel multithreaded algorithm for multiplying an n × n matrix A by a length-n vector x that achieves work Θ(n²) and critical path Θ(log n). [Hint: If you use a cached thread pool with the Future interface, to achieve the Θ(log n) critical path you can follow the algorithm for matrix multiplication in Chapter 16. Otherwise, depending on how you divide the problem, you could obtain a critical path different from Θ(log n). This is acceptable; just be sure to show your work and justify the critical path of your implementation in your report.]

4.3. Write a test program that measures the execution time for multiplying a 2,000 × 2,000 matrix with a corresponding vector of length 2,000 using the parallel method and the sequential method, respectively. Discuss the execution times of the two methods and compute the speedup. Be sure to indicate how many threads are being used.
4.4. Analyze and discuss the work and critical-path length of your implementation, and give the parallelism.

Total: 80 marks
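For question 1, a minimal Java sketch of the strided-read benchmark, assuming 4-byte int elements so that one array slot corresponds to one word as in the problem statement; the class name, the element count L, and the repetition count are illustrative choices, not part of the assignment.

    public class StrideBenchmark {
        // Times L strided reads of 'a' and returns the average time per access in nanoseconds.
        static double averageAccessTime(int[] a, int L, int s, int repeats) {
            long sink = 0;                                   // keeps the JIT from removing the loads
            long start = System.nanoTime();
            for (int r = 0; r < repeats; r++) {
                for (int i = 0; i < L; i++) {
                    sink += a[i * (s + 1)];                  // consecutive reads are s words apart
                }
            }
            long elapsed = System.nanoTime() - start;
            if (sink == 123456789L) System.out.print("");    // keeps 'sink' observably live
            return (double) elapsed / ((double) repeats * L);
        }

        public static void main(String[] args) {
            final int L = 1 << 12;                           // illustrative element count
            int[] a = new int[(L - 1) * (L / 2 + 1) + 1];    // large enough for the widest stride
            for (int s = 1; s <= L / 2; s *= 2) {            // stride doubles: 1, 2, 4, ..., L/2
                System.out.printf("stride %d words: %.2f ns/access%n",
                        s, averageAccessTime(a, L, s, 1_000));
            }
        }
    }

Repeating the measurement for a range of L values, one curve per stride, gives data of the kind plotted in Fig. 1.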
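The padding mentioned before question 1.4 is usually realized by spacing the ALock flag slots so that distinct slots fall on distinct cache lines. A sketch of that idea, assuming the array-based ALock of the textbook and the question's 4-word cache line; PaddedALock and PADDING are illustrative names, and a real JVM would need the padding sized to the actual line length (typically 64 bytes) and to the array's element size.

    import java.util.concurrent.atomic.AtomicInteger;

    public class PaddedALock {
        private static final int PADDING = 4;            // slots spaced one 4-word line apart (per the question's model)
        private final ThreadLocal<Integer> mySlotIndex =
                ThreadLocal.withInitial(() -> 0);
        private final AtomicInteger tail = new AtomicInteger(0);
        private volatile boolean[] flag;                  // textbook-style flag array, now padded
        private final int size;

        public PaddedALock(int capacity) {
            size = capacity;
            flag = new boolean[capacity * PADDING];       // only multiples of PADDING are ever used
            flag[0] = true;
        }

        public void lock() {
            int slot = (tail.getAndIncrement() % size) * PADDING;
            mySlotIndex.set(slot);
            while (!flag[slot]) { }                       // spin on a slot no other thread writes except to hand over the lock
        }

        public void unlock() {
            int slot = mySlotIndex.get();
            flag[slot] = false;
            flag[(slot + PADDING) % (size * PADDING)] = true;   // pass the lock to the next padded slot
        }
    }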
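For question 2.1, a sketch of contains() using the hand-over-hand (lock-coupling) traversal of the Chapter 9 fine-grained list; the surrounding FineList class is reduced here to the fields the method needs, and add()/remove() are omitted.

    import java.util.concurrent.locks.ReentrantLock;

    public class FineList<T> {
        // Node layout follows the fine-grained list of Chapter 9: sorted by key,
        // with head/tail sentinels and one lock per node.
        private class Node {
            final int key;
            final T item;
            Node next;
            private final ReentrantLock lock = new ReentrantLock();
            Node(int key, T item) { this.key = key; this.item = item; }
            void lock() { lock.lock(); }
            void unlock() { lock.unlock(); }
        }

        private final Node head;

        public FineList() {
            head = new Node(Integer.MIN_VALUE, null);
            head.next = new Node(Integer.MAX_VALUE, null);   // tail sentinel
        }

        // Hand-over-hand traversal: at least one lock is always held while moving,
        // so no concurrent remove() can unlink the window being inspected.
        public boolean contains(T item) {
            int key = item.hashCode();
            head.lock();
            Node pred = head;
            try {
                Node curr = pred.next;
                curr.lock();
                try {
                    while (curr.key < key) {
                        pred.unlock();
                        pred = curr;
                        curr = curr.next;
                        curr.lock();
                    }
                    return curr.key == key;
                } finally {
                    curr.unlock();
                }
            } finally {
                pred.unlock();
            }
        }
    }

For question 2.2, a test could insert a known set of items (possibly from several threads), then assert that contains() returns true for every inserted item and false for items that were never added or have since been removed.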
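For question 3.1, a sketch of a bounded, array-backed queue guarded by separate enqueue and dequeue locks, with an atomic size counter mediating between the two sides; it follows the structure of the textbook's bounded linked queue, and the class and field names are illustrative.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    public class BoundedArrayQueue<T> {
        private final T[] items;
        private int head = 0, tail = 0;                   // head guarded by deqLock, tail by enqLock
        private final AtomicInteger size = new AtomicInteger(0);
        private final ReentrantLock enqLock = new ReentrantLock();
        private final ReentrantLock deqLock = new ReentrantLock();
        private final Condition notFull = enqLock.newCondition();
        private final Condition notEmpty = deqLock.newCondition();

        @SuppressWarnings("unchecked")
        public BoundedArrayQueue(int capacity) {
            items = (T[]) new Object[capacity];
        }

        public void enq(T x) throws InterruptedException {
            boolean wakeDequeuers = false;
            enqLock.lock();
            try {
                while (size.get() == items.length)        // queue full: wait for a dequeuer
                    notFull.await();
                items[tail] = x;
                tail = (tail + 1) % items.length;
                if (size.getAndIncrement() == 0)          // queue was empty: dequeuers may be asleep
                    wakeDequeuers = true;
            } finally {
                enqLock.unlock();
            }
            if (wakeDequeuers) {
                deqLock.lock();
                try { notEmpty.signalAll(); } finally { deqLock.unlock(); }
            }
        }

        public T deq() throws InterruptedException {
            T result;
            boolean wakeEnqueuers = false;
            deqLock.lock();
            try {
                while (size.get() == 0)                   // queue empty: wait for an enqueuer
                    notEmpty.await();
                result = items[head];
                items[head] = null;
                head = (head + 1) % items.length;
                if (size.getAndDecrement() == items.length)   // queue was full: enqueuers may be asleep
                    wakeEnqueuers = true;
            } finally {
                deqLock.unlock();
            }
            if (wakeEnqueuers) {
                enqLock.lock();
                try { notFull.signalAll(); } finally { enqLock.unlock(); }
            }
            return result;
        }
    }

Since head is touched only under deqLock and tail only under enqLock, an enqueuer and a dequeuer can run concurrently; the shared size counter is the only coupling between them, and removing the locks (question 3.2) is awkward precisely because an operation must then update the slot, its index, and the count as if they were one atomic step.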
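For question 4.1, a straightforward sequential matrix-vector product, assuming a square double[][] matrix in row-major layout.

    public class SequentialMatVec {
        // Computes y = A * x for an n-by-n matrix A and a length-n vector x.
        public static double[] multiply(double[][] A, double[] x) {
            int n = x.length;
            double[] y = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    sum += A[i][j] * x[j];
                }
                y[i] = sum;
            }
            return y;
        }
    }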
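For question 4.2, a sketch using Java's fork/join framework rather than the cached thread pool mentioned in the hint: the row range is split in half recursively, and each row's dot product is itself computed as a parallel reduction, so in the ideal model the work is Θ(n²) and the critical path Θ(log n) (the leaf cutoff and scheduler overhead are ignored in that bound). The cutoff of 64 and the class names are illustrative.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveAction;
    import java.util.concurrent.RecursiveTask;

    public class ParallelMatVec {

        // Computes y[lo..hi) = A[lo..hi) * x by splitting the row range in half.
        private static class RowTask extends RecursiveAction {
            final double[][] A; final double[] x, y; final int lo, hi;
            RowTask(double[][] A, double[] x, double[] y, int lo, int hi) {
                this.A = A; this.x = x; this.y = y; this.lo = lo; this.hi = hi;
            }
            @Override protected void compute() {
                if (hi - lo == 1) {
                    y[lo] = new DotTask(A[lo], x, 0, x.length).compute();
                } else {
                    int mid = (lo + hi) / 2;
                    invokeAll(new RowTask(A, x, y, lo, mid),      // both halves proceed in parallel
                              new RowTask(A, x, y, mid, hi));
                }
            }
        }

        // Parallel reduction for one dot product: span Θ(log n) instead of Θ(n).
        private static class DotTask extends RecursiveTask<Double> {
            final double[] row, x; final int lo, hi;
            DotTask(double[] row, double[] x, int lo, int hi) {
                this.row = row; this.x = x; this.lo = lo; this.hi = hi;
            }
            @Override protected Double compute() {
                if (hi - lo <= 64) {                              // cutoff keeps task overhead reasonable
                    double sum = 0.0;
                    for (int j = lo; j < hi; j++) sum += row[j] * x[j];
                    return sum;
                }
                int mid = (lo + hi) / 2;
                DotTask left = new DotTask(row, x, lo, mid);
                DotTask right = new DotTask(row, x, mid, hi);
                right.fork();                                     // right half runs asynchronously
                return left.compute() + right.join();             // combine the two partial sums
            }
        }

        public static double[] multiply(double[][] A, double[] x) {
            double[] y = new double[x.length];
            ForkJoinPool.commonPool().invoke(new RowTask(A, x, y, 0, x.length));
            return y;
        }
    }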
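For question 4.3, a minimal timing harness assuming the two multiply methods sketched above (SequentialMatVec and ParallelMatVec are illustrative names); the random data, the single measurement per method, and the tolerance in the sanity check are all simplifications, and a careful measurement would add warm-up runs and repetitions.

    import java.util.Random;
    import java.util.concurrent.ForkJoinPool;

    public class MatVecTiming {
        public static void main(String[] args) {
            int n = 2_000;                               // 2,000 x 2,000 matrix, length-2,000 vector
            Random rnd = new Random(42);
            double[][] A = new double[n][n];
            double[] x = new double[n];
            for (int i = 0; i < n; i++) {
                x[i] = rnd.nextDouble();
                for (int j = 0; j < n; j++) A[i][j] = rnd.nextDouble();
            }

            long t0 = System.nanoTime();
            double[] ySeq = SequentialMatVec.multiply(A, x);
            long seqNs = System.nanoTime() - t0;

            long t1 = System.nanoTime();
            double[] yPar = ParallelMatVec.multiply(A, x);
            long parNs = System.nanoTime() - t1;

            System.out.printf("sequential: %.1f ms, parallel: %.1f ms, speedup: %.2f, threads: %d%n",
                    seqNs / 1e6, parNs / 1e6, (double) seqNs / parNs,
                    ForkJoinPool.commonPool().getParallelism());

            // Crude sanity check; the parallel reduction sums in a different order,
            // so only approximate equality is expected.
            if (Math.abs(ySeq[0] - yPar[0]) > 1e-9) System.out.println("results differ");
        }
    }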
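For question 4.4, with the division used in the fork/join sketch above, the work is T1 = Θ(n²) (every matrix entry is combined once, with constant overhead per spawned task), the critical path is T∞ = Θ(log n) (a logarithmic-depth recursion over the rows followed by a logarithmic-depth reduction within a row), and the parallelism is therefore T1/T∞ = Θ(n²/log n). A version that divides only over rows and keeps each dot product sequential would instead have T∞ = Θ(n) and parallelism Θ(n), which the hint in question 4.2 notes is also acceptable if justified.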