$28
Overview and Objectives. In this homework, we are going to do some exercises about alternative losses for linear regression, practice recall and precision calculations, and implement a logistic regression model to predict whether a tumor is malignant or benign. There is substantial skeleton code provided with this assignment to take care of some of the details you already learned in the previous assignment such as cross-validation, data loading, and computing accuracies. How to Do This Assignment. • Each question that you need to respond to is in a blue "Task Box" with its corresponding point-value listed. • We prefer typeset solutions (LATEX / Word) but will accept scanned written work if it is legible. If a TA can’t read your work, they can’t give you credit. • Programming should be done in Python and numpy. If you don’t have Python installed, you can install it from here. This is also the link showing how to install numpy. You can also search through the internet for numpy tutorials if you haven’t used it before. Google and APIs are your friends! You are NOT allowed to... • Use machine learning package such as sklearn. • Use data analysis package such as panda or seaborn. • Discuss low-level details or share code / solutions with other students. Advice. Start early. There are two sections to this assignment – one involving working with math (20% of grade) and another focused more on programming (80% of the grade). Read the whole document before deciding where to start. How to submit. Submit a zip file to Canvas. Inside, you will need to have all your working code and hw1-report.pdf. You will also submit test set predictions to a class Kaggle. This is required to receive credit for Q8. 1 Written Exercises: Linear Regression and Precision/Recall [5pts] I’ll take any opportunity to sneak in another probability question. It’s a small one. 1.1 Least Absolute Error Regression In lecture, we showed that the solution for least squares regression was equivalent to the maximum likelihood estimate of the weight vector of a linear model with Gaussian noise. That is to say, our probabilistic model was yi ∼ N (µ = wT xi , σ) −→ P(yi |xi , w) = 1 σ √ 2π e − (yi−wT xi ) 2 σ2 (1) and we showed that the MLE estimate under this model also minimized the sum-of-squared-errors (SSE): argmaxY N i=1 P(yi |xi , w) | {z } Likelihood = argmin X N i=1 (yi − wT xi) 2 | {z } Sum of Squared Errors (2) However, we also demonstrated that least squares regression is very sensitive to outliers – large errors squared can dominate the loss. One suggestion was to instead minimize the sum of absolute errors. In this first question, you’ll show that changing the probabilistic model to assume Laplace error yields a least absolute error regression objective. To be more precise, we will assume the following probabilistic model for how yi is produced given xi : yi ∼ Laplace(µ = wT xi , b) −→ P(yi |xi , w) = 1 2b e − |yi−wT xi | b (3) 1 I Q1 Linear Model with Laplace Error [2pts]. Assuming the model described in Eq.3, show that the MLE for this model also minimizes the sum of absolute errors (SAE): SAE(w) = X N i=1 |yi − wT xi | (4) Note that you do not need to solve for an expression for the actual MLE expression for w to do this problem. Simply showing that the likelihood is proportional to SAE is sufficient. 1.2 Recall and Precision y P(y|x) 0 0.1 0 0.1 0 0.25 1 0.25 0 0.3 0 0.33 1 0.4 0 0.52 y P(y|x) 0 0.55 1 0.7 1 0.8 0 0.85 1 0.9 1 0.9 1 0.95 1 1.0 Beyond just calculating accuracy, we discussed recall and precision as two other measure of a classifier’s abilities. Remember that we defined recall and precision as in terms of true positives, false positives, true negatives, and false negatives: Recall = #TruePositives #TruePositives + #FalseNegatives (5) and Precision = #TruePositives #TruePositives + #FalsePositives (6) I Q2 Computing Recall and Precision [3pts]. To get a feeling for recall and precision, consider the set of true labels (y) and model predictions P(y|x) shown in the tables above. We compute Recall and Precision as a specific threshold t – considering any point with P(y|x) > t as being predicted to be the positive class (1) and ≤ t to be the negative class (0). Compute and report the recall and precision for thresholds t = 0, 0.2, 0.4, 0.6, 0.8, and 1. 2 Implementing Logistic Regression for Tumor Diagnosis [20pts] In this section, we will implement a logistic regression model for predicting whether a tumor is malignant (cancerous) or benign (non-cancerous). The dataset has eight attributes – clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bland chromatin, nomral nucleoli, and mitoses – all rated between 1 and 10. You will again be submitting your predictions on the test set via the class Kaggle. You’ll need to download the train_cancer.csv and test_cancer_pub.csv files from the Kaggle’s data page to run the code. 2.1 Implementing Logistic Regression Logistic Regression. Recall from lecture that the logistic regression algorithm is a binary classifier that learns a linear decision boundary. Specifically, it predicts the probability of an example x ∈ R d to be class 1 as P(yi = 1 | xi) = σ(wT xi) = 1 1 + e−wT xi , (7) where w ∈ R d is a weight vector that we want to learn from data. To estimate these parameters from a dataset of n input-output pairs D = {xi , yi}, we assumed yi ∼ Bernoulli(θ = σ(wT xi)) and wrote the negative log-likelihood: −logP(D|w) = − Xn i=1 logP(yi |xi , w) = − Xn i=1 yi logσ(wT xi) + (1 − yi)log(1 − σ(wT xi)) (8) I Q3 Negative Log Likelihood [2pt]. Implement the calculateNegativeLogLikelihood function in logreg.py. The function takes in a n×d matrix X of example features (each row is an example) and a n×1 vector of labels y. It should return the result of computing Eq.8 (which results in a scalar value). Note that np.log and np.exp will apply the log or exponential function to each element of an input matrix. Gradient Descent. We want t find optimal weights w∗ = argminw − logP(D|w). However, taking the gradient of the negative log-likelihood yields the expression below which does not offer a closed-form solution. ∇w(−logP(D|w)) = Xn i=1 (σ(wT xi) − yi)xi (9) Instead, we opted to minimize −logP(D|w) by gradient descent. We’ve provided pseudocode in the lecture but to review the basic procedure is written below (α is the stepsize). 1. Initialize w to some initial vector (all zeros,random, etc) 2. Repeat until max iterations: (a) w = w − α ∗ ∇w(−logP(D|w)) For convex functions (and sufficiently small values of the stepsize α), this will converge to the minima. I Q4 Gradient Descent for Logistic Regression [5pt]. Finish implementing the trainLogistic function in logreg.py. The function takes in a n×d matrix X of example features (each row is an example) and a n×1 vector of labels y. It returns the learned weight vector and a list containing the observed negative log-likelihood after each epoch (uses calculateNegativeLogLikleihood). The skeleton code is shown below. 1 def trainLogistic (X ,y , max_iters =2000 , step_size =0.0001) : 2 # Initialize our weights with zeros 3 w = np . zeros ( (X. shape [1] ,1) ) 4 # Keep track of losses for plotting 5 losses = [] 6 7 # Take up to max_iters steps of gradient descent 8 for i in range ( max_iters ): 9 10 # Make a variable to store our gradient 11 w_grad = np . zeros ( (X . shape [1] ,1) ) 12 13 # Compute the gradient over the dataset and store in w_grad 14 15 # #### TODO : Implement equation 9. 16 17 # This is here to make sure your gradient is the right shape 18 assert ( w_grad . shape == (X . shape [1] ,1) ) 19 20 # Take the update step in gradient descent 21 w = w - step_size * w_grad 22 23 # Calculate the negative log - likelihood with w 24 losses . append ( calculateNegativeLogLikelihood (X ,y ,w )) 25 26 return w , losses To complete this code, you’ll need to implement Eq.9 to compute the gradient of the negative log-likelihood of the dataset with respect to the weights w. Note that an approach that loops over the dataset to compute Eq.9 takes about 15x slow than a fully matrix-form version. The max_iters variable can be set lower (200ish) during development to avoid this slowness being too annoying – but when optimizing performance later you may want to raise it back up. Either solution is fine for this assignment if you’re patient. If you’ve implemented this question correctly, running logreg.py should print out the learned weight vector and training accuracy. You can expect something around 86% for the train accuracy. Provide your weight vector and accuracy in your report. 3 2.2 Playing with Logistic Regression on This Dataset Adding a Bias. The model we trained in the previous section did not have a constant offset (called a bias) in the model – computing wT x rather than wT x + b. A simple way to include this in our model is to add an new column to X that has all ones in it. This way, the first weight in our weight vector will always be multiplied by 1 and added. I Q5 Adding A Dummy Variable [1pt]. Implement the dummyAugment function in logreg.py to add a column of 1’s to the left side of an input matrix and return the new matrix. Once you’ve done this, running the code should produce the training accuracy for both the no-bias and this updated model. Report the new weight vector and accuracy. Did it make a meaningful difference? Observing Training Curves. After finishing the previous question, the code now also produces a plot showing the negative log-likelihood for the bias and no-bias models over the course of training. If we change the learning rate (also called the step size), we could see significant differences in how this plot behaves – and in our accuracies. I Q6 Learning Rates / Step Sizes. [2pt] Gradient descent is sensitive to the learning rate (or step size) hyperparameter and the number of iterations. Does it look like the gradient descent algorithm has converged or does it look like the negative log-likelihood could continue to drop if max_iters was set higher? Different values of the step size will change the nature of the curves in the training curve plot. In the skeleton code, this is originally set to 0.0001. Change the step size to 1, 0.1, 0.01, and 0.00001. Provide the resulting training curve plots and training accuracy. Discuss any trends you observe. Cross Validation. The code will also now print out K-fold cross validation results (mean and standard deviation of accuracy) for K = 2, 3, 4, 5, 10, 20, and 50. This part may be a bit slow, but you’ll see how the mean and standard deviation change with larger K. I Q7 Evaluating Cross Validation [2pt] Come back to this after making your Kaggle submission. The point of cross-validation is to help us make good choices for model hyperparameters. For different values of K in K-fold cross validation, we got different estimates of the mean and standard deviation of our accuracy. How well did these means and standard deviations capture your actual performance on the leaderboard? Discuss any trends you observe. 2.3 Make Your Kaggle Submission Great work getting here. In this section, you’ll submit the predictions of your best model to the class-wide Kaggle competition. You are free to make any modification to your logistic regression algorithm to improve performance; however, it must remain logistic regression! For example, you can change feature representation, adjust the learning rate, and max_steps parameters. 4 I Q8 Kaggle Submission [8pt]. Submit a set of predictions to Kaggle that outperforms the baseline on the public leaderboard. To make a valid submission, use the train set to build your logistic regression classifier and then apply it to the test instances in test_cancer_pub.csv available from Kaggle’s Data tab. Format your output as a two-column CSV as below: id,type 0,0 1,1 2,1 3,0 . . . where the id is just the row index in test_cancer_pub.csv. You may submit up to 10 times a day. In your report, tell us what modifications you made for your final submission. Extra Credit and Bragging Rights [1.25pt Extra Credit]. The TA has made a submission to the leaderboard. Any submission outperforming the TA on the private leaderboard at the end of the homework period will receive 1.25 extra credit points on this assignment. Further, the top 5 ranked submissions will “win HW2” and receive bragging rights. 3 Debriefing (required in your report) 1. Approximately how many hours did you spend on this assignment? 2. Would you rate it as easy, moderate, or difficult? 3. Did you work on it mostly alone or did you discuss the problems with others? 4. How deeply do you feel you understand the material it covers (0%–100%)? 5. Any other comments? 5