Q1 (55 points): Create a MaxEnt POS tagger, maxent_tagger.sh.

• The command line is: maxent_tagger.sh train_file test_file rare_thres feat_thres output_dir
• The train_file and test_file have the format (e.g., test.word_pos): w1/t1 w2/t2 ... wn/tn
• rare_thres is an integer: any word (in the train_file or the test_file) that appears LESS THAN rare_thres times in the train_file is treated as a rare word, and features such as pref=xx and suf=xx should be used for rare words (see Table 1 in (Ratnaparkhi, 1996)).
• feat_thres is an integer: all the w_i features (i.e., CurrentWord=xx features), regardless of their frequency, should be kept. For all OTHER types of features, if a feature appears LESS THAN feat_thres times in the train_file, that feature should be removed from the feature vectors.
• output_dir is a directory that stores the output files from the tagger.

Your script should create the following files and store them under output_dir:

– train_voc (e.g., ex_train_voc): the vocabulary that includes all the words appearing in train_file. The file has the format “word freq”, where freq is the frequency of the word in the training data. The lines should be sorted by freq in descending order. For words with the same frequency, sort the lines alphabetically.
– init_feats (e.g., ex_init_feats): the features that occur in the train_file. It has the format “featName freq”, and the lines are sorted by the frequency of the feature in the train_file in descending order. For features with the same frequency, sort the lines alphabetically.
– kept_feats (e.g., ex_kept_feats): a subset of init_feats that includes the features that are kept after applying feat_thres.
– final_train.vectors.txt (e.g., ex_final_train.vectors.txt): the feature vectors for the train_file in the Mallet text format. Only features in kept_feats should be kept in this file.
– final_test.vectors.txt: the feature vectors for the test_file in the Mallet text format. The format is the same as that of final_train.vectors.txt.
– final_train.vectors: the binary format of the vectors in final_train.vectors.txt.
– me_model: the MaxEnt model (in binary format) produced by the MaxEnt trainer.
– me_model.stdout and me_model.stderr: the stdout (standard out) and stderr (standard error) produced by the MaxEnt trainer, redirected and saved to those files by running a command such as “mallet train-classifier --trainer MaxEnt --input final_train.vectors --output-classifier me_model > me_model.stdout 2> me_model.stderr”. The training accuracy is displayed at the end of me_model.stdout.
– sys_out: the system output file produced when running the MaxEnt classifier with a command such as “mallet classify-file --input final_test.vectors.txt --classifier me_model --output sys_out”.

Your script maxent_tagger.sh should do the following:

1. Create feature vectors for the training data and the test data. The vector files should be called final_train.vectors.txt and final_test.vectors.txt.
2. Run mallet import-file to convert the training vectors into binary format; the binary file is called final_train.vectors.
3. Run mallet train-classifier to create a MaxEnt model me_model using final_train.vectors.
4. Run mallet classify-file to get the result on the test data final_test.vectors.txt.
5. Calculate the test accuracy.

For steps 2-4, you should use Mallet commands. For step 5, if you don’t want to write code for it, you can use the vectors2classify command, which covers steps 3-5. In that case, you need to convert final_test.vectors.txt to the binary format first.

For the first step, you need to write some code. Features are defined in Table 1 in (Ratnaparkhi, 1996). The following is one way of implementing this step:

1. Create train_voc from the train_file, and use the word frequency in train_voc and rare_thres to determine whether a word should be treated as a rare word. The feature vectors for rare words and non-rare words are different.
2. Form feature vectors for the words in train_file, and store the features and their frequencies in the training data in init_feats.
3. Create kept_feats by using feat_thres to filter out low-frequency features in init_feats. Note that w_i features are NOT subject to filtering with feat_thres, and every w_i feature in init_feats should be kept in kept_feats.
4. Go through the feature vector file for train_file and remove all the features that are not in kept_feats.
5. Create feature vectors for test_file, and use only the features in kept_feats. If a word in the test_file appears LESS THAN rare_thres times (or does not appear at all) in the train_file, the word should be treated as a rare word even if it appears many times in the test_file.
6. In the feature vector files, replace all the occurrences of “,” with “comma”, as Mallet treats “,” as a separator.

Q2 (20 points): Run maxent_tagger.sh with wsj_sec0.word_pos as train_file, test.word_pos as test_file, and the thresholds as specified in Table 1:

• training accuracy is the accuracy of the tagger on the train_file
• test accuracy is the accuracy of the tagger on the test_file
• # of feats is the number of features in the train_file before applying feat_thres
• # of kept feats is the number of features in the train_file after applying feat_thres
• running time is the CPU time (in minutes) of running maxent_tagger.sh.

Table 1: Tagging accuracy with different thresholds

Expt id   rare_thres   feat_thres   training accuracy   test accuracy   # of feats   # of kept feats   running time
1_1       1            1
1_3       1            3
2_3       2            3
3_5       3            5
5_10      5            10

Please do the following:

• Fill out Table 1.
• What conclusion can you draw from Table 1?
• Save the output files of maxent_tagger.sh to res_id/, where id is the experiment id in the first column (e.g., the files for the first experiment will be stored under res_1_1). Submit only the subdirs for the first row and the last row (i.e., res_1_1 and res_5_10).
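The feature-construction steps listed under Q1 (building the vocabulary, deciding which words are rare, counting features, and applying the feature threshold) can be sketched in Python as follows. This is only one possible outline, not a reference solution. Apart from pref=, suf=, and CurrentWord=, which the assignment mentions, every feature name here (prevW=, nextW=, prevT=, prevTwoTags=, containNum, containUC, containHyphen) and the BOS/EOS padding symbols are illustrative assumptions, as are all function names. The sketch also assumes the feature templates of Table 1 in (Ratnaparkhi, 1996) and that every token is well formed as word/tag. It does not write the Mallet vector files.

```python
from collections import Counter

BOS, EOS = "BOS", "EOS"  # assumed padding symbols for sentence boundaries


def parse_line(line):
    """Split 'w1/t1 w2/t2 ...' into (word, tag) pairs, splitting on the LAST '/'."""
    return [tuple(tok.rsplit("/", 1)) for tok in line.split()]


def sort_by_freq(counts):
    """Descending frequency, ties broken alphabetically (train_voc / init_feats order)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def word_features(words, tags, i, vocab, rare_thres):
    """Feature names for position i; vocab is a Counter built from the train file only."""
    w = words[i]
    feats = []
    if vocab[w] < rare_thres:  # rare or unseen in training: spelling features
        for k in range(1, min(4, len(w)) + 1):
            feats.append("pref=" + w[:k])
            feats.append("suf=" + w[-k:])
        if any(c.isdigit() for c in w):
            feats.append("containNum")
        if any(c.isupper() for c in w):
            feats.append("containUC")
        if "-" in w:
            feats.append("containHyphen")
    else:  # non-rare word: the w_i feature
        feats.append("CurrentWord=" + w)

    def word_at(j):
        if j < 0:
            return BOS
        return words[j] if j < len(words) else EOS

    def tag_at(j):
        return tags[j] if j >= 0 else BOS

    # Context features, used for rare and non-rare words alike.
    feats.append("prevW=" + word_at(i - 1))
    feats.append("prev2W=" + word_at(i - 2))
    feats.append("nextW=" + word_at(i + 1))
    feats.append("next2W=" + word_at(i + 2))
    feats.append("prevT=" + tag_at(i - 1))
    feats.append("prevTwoTags=" + tag_at(i - 2) + "+" + tag_at(i - 1))
    # Mallet treats "," as a separator, so spell it out.
    return [f.replace(",", "comma") for f in feats]


def count_features(sentences, vocab, rare_thres):
    """Feature frequencies over the training sentences (the content of init_feats)."""
    counts = Counter()
    for sent in sentences:
        words = [w for w, _ in sent]
        tags = [t for _, t in sent]
        for i in range(len(words)):
            counts.update(word_features(words, tags, i, vocab, rare_thres))
    return counts


def select_features(feat_counts, feat_thres):
    """Keep every CurrentWord= feature; keep the others only if freq >= feat_thres."""
    return {f for f, c in feat_counts.items()
            if f.startswith("CurrentWord=") or c >= feat_thres}


if __name__ == "__main__":
    train = ["the/DT cat/NN sat/VBD", "the/DT dog/NN sat/VBD"]
    sents = [parse_line(line) for line in train]
    vocab = Counter(w for s in sents for w, _ in s)
    init_feats = count_features(sents, vocab, rare_thres=2)
    kept_feats = select_features(init_feats, feat_thres=2)
    for name, freq in sort_by_freq(init_feats):
        print(name, freq, "kept" if name in kept_feats else "dropped")
```

For the test file, the same word_features function would be called with the training vocabulary, so a word that is frequent in the test file but rare or absent in training still receives the rare-word features; only features that appear in kept_feats would then be written to the vector file.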
Submission: Your submission should include the following:

1. readme.[txt|pdf], which includes Table 1 and your answer to Q2.
2. hw.tar.gz, which includes maxent_tagger.sh and the res_1_1/ and res_5_10/ directories created in Q2 (see the complete file list in submit-file-list).