Q1 (25 points): Write a script word_analogy.sh that finds D given A, B, and C.

• The command line is: word_analogy.sh vector_file input_dir output_dir flag1 flag2
• vector_file is an input file with the format "w v1 v2 ... vn" (e.g., vectors.txt), where <v1, v2, ..., vn> is the word embedding of the word w.
• input_dir (e.g., question-data) is a directory that contains a list of test files. The lines in each test file have the format "A B C D", the four words as in the word analogy task.
• output_dir is a directory to store the output:
  – For each file under input_dir, your script should create a file with the same name under output_dir.
  – The two files should have exactly the same number of lines and the same content, except that the word D in the files under output_dir is the word selected by the algorithm; that is, you go over all the words in vector_file and find the one that is most similar to y = x_B − x_A + x_C (see the sketch after this question).
• flag1 is an integer indicating whether the word embeddings should be normalized first.
  – If flag1 is non-zero, normalize the word embedding vectors first. That is, if the vector is <v1, v2, ..., vn>, normalize it to <v1/Z, v2/Z, ..., vn/Z>, where Z = sqrt(v1^2 + v2^2 + ... + vn^2).
  – If flag1 is 0, just use the original vectors.
• flag2 is an integer indicating which similarity function to use for calculating sim(x, y):
  – If flag2 is non-zero, use cosine similarity (https://en.wikipedia.org/wiki/Cosine_similarity).
  – If flag2 is 0, use Euclidean distance (https://en.wikipedia.org/wiki/Euclidean_distance).
  – Note that when Euclidean distance is used, the smaller the distance, the more similar the two words are.

In addition to writing output_dir, your script should print to stdout (1) the accuracy for each file under input_dir and (2) the total accuracy. The stdout can then be redirected to a file (see the eval_res files in Q2).

• You should print the following to stdout:

  fileName1 ACCURACY TOP1: acc% (cor/num)
  fileName2 ACCURACY TOP1: acc% (cor/num)
  ...
  Total accuracy: accTotal% (corSum/numSum)

• fileName_i is the ith file in input_dir.
• num is the number of examples in the file.
• cor is the number of examples in the file for which your system's output is correct (i.e., the D in output_dir/fileName is the same as the D in input_dir/fileName).
• acc% = cor / num.
• For the total accuracy line, corSum is the sum of the cor values and numSum is the sum of the num values in the previous lines.
• accTotal% = corSum / numSum.
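The core of the script is the selection step: build y = x_B − x_A + x_C and scan every word in vector_file for the most similar vector under the chosen similarity. Below is a minimal Python sketch of just that step, assuming the vectors are loaded into a dict of NumPy arrays; the function names (load_vectors, find_best) are illustrative choices, not part of the spec, and the directory walking, output writing, and accuracy bookkeeping required above are omitted.

# Minimal sketch of the selection step for word_analogy.sh.
# All names here (load_vectors, find_best) are illustrative, not required by the spec.
import numpy as np

def load_vectors(vector_file, normalize):
    """Read lines of the form "w v1 v2 ... vn"; optionally length-normalize each vector."""
    vecs = {}
    with open(vector_file) as f:
        for line in f:
            parts = line.split()
            v = np.array([float(x) for x in parts[1:]])
            if normalize:                    # flag1 != 0
                v = v / np.linalg.norm(v)    # divide by Z = sqrt(v1^2 + ... + vn^2)
            vecs[parts[0]] = v
    return vecs

def find_best(vecs, a, b, c, use_cosine):
    """Return the word whose vector is most similar to y = x_B - x_A + x_C."""
    y = vecs[b] - vecs[a] + vecs[c]
    best_word, best_score = None, None
    for w, v in vecs.items():
        if use_cosine:                       # flag2 != 0: larger cosine is better
            score = np.dot(y, v) / (np.linalg.norm(y) * np.linalg.norm(v))
            better = best_score is None or score > best_score
        else:                                # flag2 == 0: smaller distance is better
            score = np.linalg.norm(y - v)
            better = best_score is None or score < best_score
        if better:
            best_word, best_score = w, score
    return best_word

A thin word_analogy.sh could then pass its five command-line arguments to a script built around these two functions and add the per-file and total accuracy reporting described above.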
Q2 (15 points): Run the following commands and submit the output_dirs:

• mkdir exp00 exp01 exp10 exp11
• word_analogy.sh vectors.txt question-data exp00 0 0 > exp00/eval_res
• word_analogy.sh vectors.txt question-data exp01 0 1 > exp01/eval_res
• word_analogy.sh vectors.txt question-data exp10 1 0 > exp10/eval_res
• word_analogy.sh vectors.txt question-data exp11 1 1 > exp11/eval_res

Here, vectors.txt and question-data are the ones under /dropbox/18-19/570/hw11/examples/.

Q3 (35 points): Answer the following questions for the Skip-Gram model. Most of the questions were covered in class. For Q3, assume that the vocabulary has 100K words and the word embeddings have 50 dimensions.

3a (4 points): What is the "fake task" used to learn word embeddings? That is, for this fake task, what are the input and the output at test time?

3b (4 points): How many layers are there in the neural network for solving the fake task? How many neurons are there in each layer?

3c (4 points): Not counting the vector for the input word and the output vector for the output layer, how many matrices are there in the network? What are the dimensions of the matrices? How many model parameters are there? That is, how many weights need to be estimated during training?

3d (4 points): Why do we need to create the fake task?

3e (10 points): For any supervised learning algorithm, the training data is a set of (x, y) pairs: x is the input and y is the output. For the Skip-Gram model discussed in class, what is x? What is y? Given a set of sentences, how do you generate (x, y) pairs? Note that my lecture and the blogs give slightly different answers to what y is. You can use either answer; just specify whether your answer is from my lecture or from the blogs.

3f (4 points): What is the one-hot representation? In which layer is it used? Why is it called one-hot?

3g (5 points): Softmax is used in the output layer. Why do we need to use softmax?

Submission: Your submission should include the following:

1. readme.[txt|pdf] with answers to Q3 and any notes that you want the grader to read.
2. hw.tar.gz that includes word_analogy.sh and the output directories created in Q2 (see the complete file list in submit-file-list).