CS434 Homework 3: Naïve Bayes and Neural Networks

Overview and Objectives. In this homework, we are going to do some exercises to understand Naïve Bayes a bit better and then get some hands-on experience with simple neural networks in a multi-class classification setting.

How to Do This Assignment.

• Each question that you need to respond to is in a blue "Task Box" with its corresponding point value listed.

• We prefer typeset solutions (LaTeX / Word) but will accept scanned written work if it is legible. If a TA can't read your work, they can't give you credit.

• Programming should be done in Python and numpy. If you don't have Python installed, you can install it from here. This is also the link showing how to install numpy. You can also search the internet for numpy tutorials if you haven't used it before. Google and APIs are your friends!

You are NOT allowed to...

• Use machine learning packages such as sklearn.

• Use data analysis packages such as pandas or seaborn.

• Discuss low-level details or share code / solutions with other students.

Advice. Start early. There are two sections to this assignment – one involving working with math (20% of the grade) and another focused more on programming (80% of the grade). Read the whole document before deciding where to start.

How to submit. Submit a zip file to Canvas. Inside, you will need to have all your working code and hw3-report.pdf. You will also submit test set predictions to a class-wide Kaggle competition.

1 Written Exercises: Analyzing Naïve Bayes [5pts]

1.1 Bernoulli Naïve Bayes As A Linear Classifier

Consider a Naïve Bayes model for binary classification where the features $X_1, \dots, X_d$ are also binary variables. As part of training this Naïve Bayes model, we would estimate the conditional distributions $P(X_i \mid y=c)$ for $c \in \{0,1\}$ and $i = 1, \dots, d$, as well as the prior distribution $P(y=c)$ for $c \in \{0,1\}$. For notational simplicity, let's denote $P(X_i = 1 \mid y=c)$ as $\theta_{ic}$ and $P(X_i = 0 \mid y=c)$ as $1 - \theta_{ic}$. Likewise, let $\theta_1$ be $P(y=1)$ and $\theta_0 = 1 - \theta_1$ be $P(y=0)$. If we write out the posterior of this classifier (leaving out the normalizing constant), we have:

$$P(y=1 \mid x_1, \dots, x_d) = \frac{P(y=1) \prod_{i=1}^{d} P(x_i \mid y=1)}{P(x_1, \dots, x_d)} \propto \theta_1 \prod_{i=1}^{d} \theta_{i1}^{x_i} (1-\theta_{i1})^{1-x_i} \tag{1}$$

$$P(y=0 \mid x_1, \dots, x_d) = \frac{P(y=0) \prod_{i=1}^{d} P(x_i \mid y=0)}{P(x_1, \dots, x_d)} \propto \theta_0 \prod_{i=1}^{d} \theta_{i0}^{x_i} (1-\theta_{i0})^{1-x_i} \tag{2}$$

The classification rule in Naïve Bayes is to choose the class with the highest posterior probability, that is, to predict $y = \arg\max_c P(y=c \mid x_1, \dots, x_d)$ for a new example $x = [x_1, \dots, x_d]$. For the prediction to be class 1, $P(y=1 \mid x_1, \dots, x_d)$ must be greater than $P(y=0 \mid x_1, \dots, x_d)$, or equivalently we can check whether:

$$\frac{P(y=1 \mid x_1, \dots, x_d)}{P(y=0 \mid x_1, \dots, x_d)} > 1 \tag{3}$$

In this setting with binary inputs, we will show that this is a linear decision boundary. This is also true for continuous inputs if they are modelled with a member of the exponential family (including Gaussian) – see the proof here in §3.1.

Q1 Prove Bernoulli Naïve Bayes has a linear decision boundary [4pts]. Prove the Bernoulli Naïve Bayes model described above has a linear decision boundary. Specifically, you'll need to show that class 1 will be predicted only if $b + w^T x > 0$ for some parameters $b$ and $w$. To do so, show that:

$$\frac{P(y=1 \mid x_1, \dots, x_d)}{P(y=0 \mid x_1, \dots, x_d)} > 1 \implies b + \sum_{i=1}^{d} w_i x_i > 0 \tag{4}$$

As part of your report, explicitly write expressions for the bias $b$ and each weight $w_i$. Hints: Expand Eq. (3) by substituting the posterior expressions from Eqs. (1) & (2). Take the log of that expression and combine like terms.
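As a quick numerical sanity check (not part of the required proof), the sketch below builds a random Bernoulli Naïve Bayes model, recovers a candidate bias and weight vector purely from evaluations of the log posterior odds, and verifies the affine relationship on random binary inputs. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta1 = rng.uniform(0.1, 0.9, d)   # P(X_i = 1 | y = 1)
theta0 = rng.uniform(0.1, 0.9, d)   # P(X_i = 1 | y = 0)
prior1 = 0.3                        # P(y = 1)

def log_odds(x):
    """Log of the posterior ratio in Eq. (3), via Eqs. (1) and (2)."""
    ll1 = np.log(prior1) + np.sum(x * np.log(theta1) + (1 - x) * np.log(1 - theta1))
    ll0 = np.log(1 - prior1) + np.sum(x * np.log(theta0) + (1 - x) * np.log(1 - theta0))
    return ll1 - ll0

# Recover b from the all-zeros input and each w_i by flipping coordinate i alone
b = log_odds(np.zeros(d))
w = np.array([log_odds(np.eye(d)[i]) - b for i in range(d)])

for _ in range(100):
    x = rng.integers(0, 2, d).astype(float)
    assert np.isclose(log_odds(x), b + w @ x)   # the log odds is affine in x
```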
1.2 Duplicated Features in Naïve Bayes

Naïve Bayes classifiers assume features are conditionally independent given the class label. When this assumption is violated, Naïve Bayes classifiers may be over-confident or under-confident in their outputs. To examine this more closely, we'll consider the situation where an input feature is duplicated. Duplicated features are maximally dependent – a strong violation of conditional independence. Here, we'll examine a case where the model's confidence increases when the duplicated feature is included, despite no new information actually being added as input.

Q2 Duplicate Features in Naïve Bayes [1pt]. Consider a Naïve Bayes model with a single feature $X_1$ for a binary classification problem. Assume a uniform prior such that $P(y=0) = P(y=1)$. Suppose the model predicts class 1 for an example $x$; then we know:

$$P(y=1 \mid X_1 = x_1) > P(y=0 \mid X_1 = x_1) \tag{5}$$

Now suppose we make a mistake and duplicate the $X_1$ feature in our data – now we have two identical inputs $X_1$ and $X_2$. Show that the predicted probability for class 1 is higher than it was before; that is, prove:

$$P(y=1 \mid X_1 = x_1, X_2 = x_2) > P(y=1 \mid X_1 = x_1) \tag{6}$$

Hints: Use the assumption that $P(y=0) = P(y=1)$ to simplify the posterior expressions in Eq. (5) and Eq. (6). As $X_2$ is an exact copy of $X_1$, $P(X_2 = x_2 \mid y)$ is the same as $P(X_1 = x_1 \mid y)$ for any example.

2 Implementing a Neural Network For Digit Identification [20pts]

Small MNIST. In this section, we will implement a feed-forward neural network model for predicting the value of a drawn digit. We are using a subset of the MNIST dataset commonly used in machine learning research papers. A few examples of these handwritten-then-digitized digits from the dataset are shown below:

[Figure: sample handwritten digit images from the dataset.]

Each digit is a 28 × 28 greyscale image with pixel values ranging from 0 to 255. We represent an image as a row vector $x \in \mathbb{R}^{1 \times 784}$ where the image has been serialized into one long vector. Each digit has an associated class label from 0, 1, 2, ..., 9 corresponding to its value. We provide three dataset splits for this homework – a training set containing 5000 examples, a validation set containing 1000, and our test set containing 4220 (no labels). These datasets can be downloaded from the class Kaggle competition for this homework.

2.1 Cross-Entropy Loss for Multiclass Classification

Unlike the previous classification tasks we've examined, we have 10 different possible class labels here. How do we measure the error of our model? Let's formalize this a little and say we have a dataset $D = \{x_i, y_i\}_{i=1}^{N}$ with $y_i \in \{0, 1, 2, \dots, 9\}$. Assume we have a model $f(x; \theta)$, parameterized by a set of parameters $\theta$, that predicts $P(Y \mid X = x)$ (a distribution over our labels given an input). Let's refer to $P(Y = c \mid X = x)$ predicted from this model as $p_{c|x}$ for compactness. We can write this output as a categorical distribution:

$$P(Y = y \mid X = x) = \begin{cases} p_{0|x} & \text{if } y = 0\\ p_{1|x} & \text{if } y = 1\\ \vdots & \\ p_{9|x} & \text{if } y = 9 \end{cases} \;=\; \prod_{c=0}^{9} p_{c|x}^{\mathbb{I}[y = c]} \tag{7}$$

where $\mathbb{I}[\text{condition}]$ is the indicator function that is 1 if the condition is true and 0 otherwise. Using this, we can write the negative log-likelihood of a single example $(x_i, y_i)$ as:

$$-\log P(y_i \mid x_i; \theta) = -\sum_{c=0}^{9} \mathbb{I}[y_i = c] \log p_{c|x_i} = -\log p_{y_i|x_i} \tag{8}$$

This loss function is also often referred to as the Cross-Entropy loss. In this homework, we will minimize this negative log-likelihood by stochastic gradient descent.
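To make Eq. (8) concrete, here is a tiny numpy sketch computing the per-example negative log-likelihood from predicted class probabilities; the probability values and labels are made up for illustration.

```python
import numpy as np

# Predicted class distributions for a batch of 3 examples over 10 classes;
# each row places 0.55 on one class and 0.05 on the rest (rows sum to 1)
probs = np.full((3, 10), 0.05)
probs[0, 3] = probs[1, 7] = probs[2, 0] = 0.55
y = np.array([3, 7, 0])   # true labels

# Eq. (8): -log p_{y_i | x_i} for each example
nll = -np.log(probs[np.arange(len(y)), y])
print(nll)          # per-example cross-entropy losses
print(nll.mean())   # average over the batch
```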
In the following, we will refer to this negative log-likelihood as

$$L(\theta) = -\log p_{y_i|x_i} \tag{9}$$

Note that we write $L$ as a function of $\theta$ because each $p_{y_i|x_i}$ is produced by our model $f(x_i; \theta)$.

2.2 Implementing Backpropagation for a Feed-forward Neural Network

In this homework, we'll consider feed-forward neural networks composed of a sequence of linear layers $xW_1 + b_1$ and non-linear activation functions $g_1(\cdot)$. As such, a network with 3 of these layers stacked together can be written as

$$b_3 + g_2(b_2 + g_1(b_1 + xW_1)W_2)W_3 \tag{10}$$

Note that this is a series of nested functions, reflecting the sequential feed-forward nature of the computation. To make our notation easier going forward, let's give a name to the intermediate output at each stage and expand this to:

$$z_1 = xW_1 + b_1 \tag{11}$$
$$a_1 = g_1(z_1) \tag{12}$$
$$z_2 = a_1 W_2 + b_2 \tag{13}$$
$$a_2 = g_2(z_2) \tag{14}$$
$$z_3 = a_2 W_3 + b_3 \tag{15}$$

where the $z$'s are intermediate outputs from the linear layers and the $a$'s are post-activation-function outputs. In the case of our MNIST experiments, $z_3$ will have 10 dimensions – one for each of the possible labels. Finally, the output vector $z_3$ is not yet a probability distribution, so we apply the softmax function

$$p_{j|x} = \frac{e^{z_{3j}}}{\sum_{c=0}^{9} e^{z_{3c}}} \tag{17}$$

and let $p_{\cdot|x}$ be the vector of these predicted probability values.

Gradient Descent for Neural Networks. Considering this simple 3-layer neural network, there are quite a few parameters spread throughout the function – weight matrices $W_3, W_2, W_1$ and bias vectors $b_3, b_2, b_1$. Suppose we would like to find parameters that minimize our loss $L$, which measures the error of the network's prediction. How can we update the weights to reduce this error? Let's use gradient descent and start by writing out the chain rule for the gradient of each of these. I'll work backwards from $W_3$ to $W_1$ to expose some structure:

$$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial p_{\cdot|x}} \frac{\partial p_{\cdot|x}}{\partial z_3} \frac{\partial z_3}{\partial W_3} \tag{18}$$

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial p_{\cdot|x}} \frac{\partial p_{\cdot|x}}{\partial z_3} \frac{\partial z_3}{\partial a_2} \frac{\partial a_2}{\partial z_2} \frac{\partial z_2}{\partial W_2} \tag{19}$$

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial p_{\cdot|x}} \frac{\partial p_{\cdot|x}}{\partial z_3} \frac{\partial z_3}{\partial a_2} \frac{\partial a_2}{\partial z_2} \frac{\partial z_2}{\partial a_1} \frac{\partial a_1}{\partial z_1} \frac{\partial z_1}{\partial W_1} \tag{20}$$

As the repeated prefixes above show, we end up reusing the same intermediate terms over and over as we compute derivatives for weights further and further from the output of the network.¹ As discussed in class, this suggests the straightforward backpropagation algorithm for computing these gradients efficiently: compute the intermediate terms starting from the output and work backwards.

¹ I don't repeat this for the bias vectors, as it would simply involve changing the final terms from $\partial z_i / \partial W_i$ to $\partial z_i / \partial b_i$.

Forward-Backward Pass in Backpropagation. One convenient way to implement backpropagation is to consider each layer (or operation) $f$ as having a forward pass that computes the function output normally,

$$\text{output} = f_{\text{forward}}(\text{input}) \tag{21}$$

and a backward pass that takes in the gradient of the loss with respect to its output and returns the gradient of the loss with respect to its input:

$$\frac{\partial L}{\partial \text{input}} = f_{\text{backward}}\!\left(\frac{\partial L}{\partial \text{output}}\right) = \frac{\partial L}{\partial \text{output}} \frac{\partial \text{output}}{\partial \text{input}} \tag{22}$$

The backward operator will also compute any gradients with respect to the parameters of $f$ and store them to be used in a gradient descent update step after the backward pass. The starter code implements this sort of framework – see the FeedForwardNeuralNetwork snippet below, which defines a network like the one described here except that it allows a configurable number of linear layers. Please read the comments and code before continuing with this document.
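First, though, to make Eqs. (11)–(17) concrete, here is a standalone numpy sketch of the forward computation for a single image. The weights are random placeholders, and subtracting the row-max inside the softmax is a standard numerical-stability trick assumed here, not something the assignment requires:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 784))                              # one serialized 28x28 image
W1, b1 = rng.standard_normal((784, 16)) * 0.05, np.zeros((1, 16))
W2, b2 = rng.standard_normal((16, 16)) * 0.05, np.zeros((1, 16))
W3, b3 = rng.standard_normal((16, 10)) * 0.05, np.zeros((1, 10))
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

z1 = x @ W1 + b1          # Eq. (11)
a1 = sigmoid(z1)          # Eq. (12)
z2 = a1 @ W2 + b2         # Eq. (13)
a2 = sigmoid(z2)          # Eq. (14)
z3 = a2 @ W3 + b3         # Eq. (15)

# Eq. (17): softmax over the 10 logits, shifted by the row max for stability
e = np.exp(z3 - z3.max(axis=1, keepdims=True))
p = e / e.sum(axis=1, keepdims=True)
print(p.shape, p.sum())   # (1, 10), sums to 1
```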
To give a concrete example of the forward-backward steps for an operator, consider the Sigmoid (aka logistic) activation function:

$$\text{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{23}$$

The implementation of forward and backward for the Sigmoid is below. forward computes Eq. (23); backward computes and returns

$$\frac{\partial L}{\partial \text{input}} = \frac{\partial L}{\partial \text{output}} \frac{\partial \text{output}}{\partial \text{input}} = \frac{\partial L}{\partial \text{output}} \, \text{Sigmoid}(\text{input})\,(1 - \text{Sigmoid}(\text{input})) \tag{24}$$

The Sigmoid has no parameters, so it does nothing during the step function.

```python
class Sigmoid:

    # Given the input, apply the sigmoid function and store
    # the output value for use in the backwards pass
    def forward(self, input):
        self.act = 1 / (1 + np.exp(-input))
        return self.act

    # Compute the gradient of the output with respect to the input,
    # self.act * (1 - self.act), and multiply by the loss gradient with respect
    # to the output to produce the loss gradient with respect to the input
    def backward(self, grad):
        return grad * self.act * (1 - self.act)

    # The Sigmoid has no parameters, so nothing to do during a gradient descent step
    def step(self, step_size):
        return
```

```python
class FeedForwardNeuralNetwork:

    # Builds a network of linear layers separated by non-linear activations
    # (either ReLU or Sigmoid). Each internal layer has hidden_dim dimensions.
    def __init__(self, input_dim, output_dim, hidden_dim, num_layers, activation="ReLU"):

        if num_layers == 1:  # Just a linear mapping from input to output
            self.layers = [LinearLayer(input_dim, output_dim)]

        else:  # At least two layers

            # Layer to map input to hidden dimension size
            self.layers = [LinearLayer(input_dim, hidden_dim)]
            self.layers.append(Sigmoid() if activation == "Sigmoid" else ReLU())

            # Hidden layers
            for i in range(num_layers - 2):
                self.layers.append(LinearLayer(hidden_dim, hidden_dim))
                self.layers.append(Sigmoid() if activation == "Sigmoid" else ReLU())

            # Layer to map hidden dimension to output size
            self.layers.append(LinearLayer(hidden_dim, output_dim))

    # Given an input, call the forward function of each of our layers,
    # passing the output of each layer to the next one
    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

    # Given a gradient with respect to the network output, call the backward
    # function of each of our layers in reverse order, passing the gradient
    # of each layer to the one before it
    def backward(self, grad):
        for layer in reversed(self.layers):
            grad = layer.backward(grad)

    # Tell each layer to update its weights based on the gradient
    # computed in the backward pass
    def step(self, step_size=0.001):
        for layer in self.layers:
            layer.step(step_size)
```
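The starter code also includes a ReLU activation, referenced by the constructor above but not reproduced in this handout. For reference, a minimal implementation consistent with the Sigmoid's interface might look like the sketch below – an illustration, not the starter code itself:

```python
class ReLU:

    # Forward pass: max(0, x) elementwise; remember which entries were positive
    def forward(self, input):
        self.mask = input > 0
        return input * self.mask

    # Backward pass: the derivative is 1 where the input was positive and 0
    # elsewhere, so zero out the incoming gradient at the non-positive entries
    def backward(self, grad):
        return grad * self.mask

    # No parameters, so nothing to update
    def step(self, step_size):
        return
```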
Operating on Batches. The network described in the equations earlier in this section operates on a single input at a time. In practice, we want to operate on sets of $n$ examples at once, such that a linear layer actually computes $Z = XW + b$ for $X \in \mathbb{R}^{n \times \text{input\_dim}}$ and $Z \in \mathbb{R}^{n \times \text{output\_dim}}$ – call this a batched operation. It is straightforward to change the forward pass to operate on these all at once: the $+b$ becomes a broadcasted addition – this is already done in the code above. On the backward pass, we simply need to aggregate the gradient of the loss of each data point with respect to our parameters. For example,

$$\frac{\partial L}{\partial W_1} = \sum_{i=1}^{n} \frac{\partial L_i}{\partial W_1} \tag{25}$$

where $L_i$ is the loss of the $i$'th datapoint and $L$ is the overall loss.

Deriving the Backward Pass for a Linear Layer. In this homework, we'll implement the backward pass of a linear layer. To do so, we need to compute $\partial Z/\partial b$, $\partial Z/\partial W$, and $\partial Z/\partial X$. For each, we start by considering the problem for a single training example $x$ (i.e., a single row of $X$) and then generalize to the batch setting. In this single-example setting, $z = xW + b$ such that $z, b \in \mathbb{R}^{1 \times c}$, $x \in \mathbb{R}^{1 \times d}$, and $W \in \mathbb{R}^{d \times c}$. Once we solve this case, extending to the batch setting just requires summing the gradient terms over examples.

∂Z/∂b. Considering just the $i$'th element of $z$, we can write $z_i = x w_{\cdot,i} + b_i$, where $w_{\cdot,i}$ is the $i$'th column of $W$. From this equation, it is straightforward to observe that element $b_i$ only affects the corresponding output $z_i$, such that

$$\frac{\partial z_i}{\partial b_j} = \begin{cases} 1 & \text{if } i = j\\ 0 & \text{otherwise} \end{cases} \tag{26}$$

This suggests that the Jacobian $\partial z/\partial b$ is an identity matrix $I$ of dimension $c \times c$. Applying the chain rule and summing over all our datapoints, we see $\partial L/\partial b$ can be computed as a sum of the rows of $\partial L/\partial Z$:

$$\frac{\partial L}{\partial b} = \sum_{k=1}^{n} \frac{\partial L_k}{\partial Z_k} \frac{\partial Z_k}{\partial b} = \sum_{k=1}^{n} \frac{\partial L_k}{\partial Z_k} I = \sum_{k=1}^{n} \frac{\partial L_k}{\partial Z_k} \tag{27}$$

∂Z/∂W. Following the same reasoning from the single-example case, we again write the $i$'th element of $z$ as $z_i = x w_{\cdot,i} + b_i$. When considering the derivative of $z_i$ with respect to the columns of $W$, we see that it is just $x$ for $w_{\cdot,i}$ and $0$ for the other columns, as they don't contribute to $z_i$ – that is to say:

$$\frac{\partial z_i}{\partial w_{\cdot,j}} = \begin{cases} x & \text{if } i = j\\ 0 & \text{otherwise} \end{cases} \tag{28}$$

Considering the loss gradient $\partial L/\partial w_{\cdot,i}$ for a single example, we can write:

$$\frac{\partial L}{\partial w_{\cdot,i}} = \frac{\partial L}{\partial z_i} \frac{\partial z_i}{\partial w_{\cdot,i}} = \frac{\partial L}{\partial z_i} x \tag{29}$$

That is, each column $i$ of $\partial L/\partial W$ is the input $x$ scaled by the loss gradient of $z_i$. As such, we can compute the gradient for the entire $W$ as the product:

$$\frac{\partial L}{\partial W} = x^T \frac{\partial L}{\partial z} \tag{30}$$

Notice that $x^T$ is $d \times 1$ and $\partial L/\partial z$ is $1 \times c$ – resulting in a $d \times c$ gradient that matches the dimensions of $W$. Now let's consider multiple datapoints $x_1, \dots, x_n$ as the matrix $X$, and likewise multiple activation vectors $z_1, \dots, z_n$ as the matrix $Z$. As our loss simply sums each datapoint's loss, the gradient also decomposes into a sum:

$$\frac{\partial L}{\partial W} = \sum_{k=1}^{n} \frac{\partial L_k}{\partial Z_k} \frac{\partial Z_k}{\partial W} = \sum_{k=1}^{n} X_k^T \frac{\partial L_k}{\partial Z_k} \tag{31}$$

We can write this even more compactly as:

$$\frac{\partial L}{\partial W} = X^T \frac{\partial L}{\partial Z} \tag{32}$$

∂Z/∂X. This follows a very similar path to ∂Z/∂W. We again consider the $i$'th element of $z$ as $z_i = x w_{\cdot,i} + b_i$. Taking the derivative with respect to $x$, it is clear that for $z_i$ the result is $w_{\cdot,i}$:

$$\frac{\partial z_i}{\partial x} = w_{\cdot,i} \tag{33}$$

This suggests that the rows of $\partial z/\partial x$ are simply the columns of $W$, such that $\partial z/\partial x = W^T$, and we can write

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial x} = \frac{\partial L}{\partial z} W^T \tag{34}$$

Moving to the multiple-example setting, the above expression gives each row of $\partial L/\partial X$, and the entire matrix can be computed efficiently as

$$\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Z} W^T \tag{35}$$

Q3 Implementing the Backward Pass for a Linear Layer [6pts]. Implement the backward pass function of the linear layer in the skeleton code. The function takes in the matrix $\partial L/\partial Z$ as the variable grad, and you must compute $\partial L/\partial W$, $\partial L/\partial b$, and $\partial L/\partial X$. The first two are stored as self.grad_weights and self.grad_bias, and the third is returned. The expressions for these can be found above in Eq. (27) ($\partial L/\partial b$), Eq. (32) ($\partial L/\partial W$), and Eq. (35) ($\partial L/\partial X$).
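After you fill in backward, a finite-difference check is a standard way to validate it. The sketch below assumes the LinearLayer skeleton shown next with Q3 completed, and uses the ad hoc loss $L = 0.5\sum Z^2$ so that $\partial L/\partial Z = Z$; the function name is illustrative only:

```python
import numpy as np

def grad_check_linear(layer, eps=1e-6):
    """Compare analytic dL/dW from layer.backward against central differences,
    using L = 0.5 * sum(Z**2) so that dL/dZ = Z. Assumes Q3 is completed."""
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, layer.weights.shape[0]))

    Z = layer.forward(X)
    layer.backward(Z)                       # dL/dZ = Z for this loss
    analytic = layer.grad_weights.copy()

    numeric = np.zeros_like(layer.weights)
    for idx in np.ndindex(*layer.weights.shape):
        layer.weights[idx] += eps
        loss_plus = 0.5 * np.sum(layer.forward(X) ** 2)
        layer.weights[idx] -= 2 * eps
        loss_minus = 0.5 * np.sum(layer.forward(X) ** 2)
        layer.weights[idx] += eps           # restore the original weight
        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)

    print("max abs difference:", np.abs(analytic - numeric).max())

# Example (after completing Q3): grad_check_linear(LinearLayer(3, 2))
```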
The linear layer skeleton from the starter code:

```python
class LinearLayer:

    # Initialize our layer with an (input_dim, output_dim) weight matrix
    # and a (1, output_dim) bias vector
    def __init__(self, input_dim, output_dim):
        self.weights = np.random.randn(input_dim, output_dim).astype(np.float64) * np.sqrt(2. / input_dim)
        self.bias = np.ones((1, output_dim)).astype(np.float64) * 0.5

    # During the forward pass, we simply compute XW + b
    def forward(self, input):
        self.input = input
        return self.input @ self.weights + self.bias

    #################################################
    # Q3 Implementing Backward Pass for Linear
    #################################################
    # Inputs:
    #
    # grad dL/dZ -- For a batch size of n, grad is an (n x output_dim) matrix
    #               where the i'th row is the gradient of the loss of example i
    #               with respect to z_i (the output of this layer for example i)
    #
    # Computes and stores:
    #
    # self.grad_weights dL/dW -- An (input_dim x output_dim) matrix storing the
    #                            gradient of the loss with respect to the
    #                            weights of this layer.
    #
    # self.grad_bias dL/db -- A (1 x output_dim) matrix storing the gradient
    #                         of the loss with respect to the bias of this layer.
    #
    # Return value:
    #
    # grad_input dL/dX -- For a batch size of n, grad_input is an (n x input_dim)
    #                     matrix where the i'th row is the gradient of the loss
    #                     of example i with respect to x_i (the input of this
    #                     layer for example i)
    #################################################

    def backward(self, grad):  # grad is dL/dZ
        self.grad_weights = ?  # Compute dL/dW as in Eq. (32)
        self.grad_bias = ?     # Compute dL/db as in Eq. (27)
        return ?               # Compute dL/dX as in Eq. (35)

    # During the gradient descent step, update the weights and biases
    # based on the stored gradients from the backward pass
    def step(self, step_size):
        self.weights -= step_size * self.grad_weights
        self.bias -= step_size * self.grad_bias
```

Once you've completed the above task, running the skeleton code should load the digit data and train a 2-layer neural network with a hidden dimension of 16 and Sigmoid activations. This model is trained on the training set and evaluated once per epoch on the validation data. After training, it will produce a plot of your results that should look like the one below. This curve plots training and validation loss (cross-entropy in this case) over training iterations (in red, measured on the left vertical axis). It also plots training and validation accuracy (in blue, measured on the right vertical axis). As you can see, this model achieves between 80% and 90% accuracy on the validation set.
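To tie Section 2.2 together, the skeleton's per-epoch train/evaluate cycle looks roughly like the hypothetical outline below. The names X_train, y_train, X_val, y_val, and softmax are illustrative stand-ins rather than the skeleton's actual identifiers, and the expression "probs minus one-hot labels" for the cross-entropy gradient with respect to the logits is a standard result assumed here without derivation:

```python
# Hypothetical outline of the training loop; names are illustrative.
net = FeedForwardNeuralNetwork(784, 10, hidden_dim=16, num_layers=2,
                               activation="Sigmoid")

for epoch in range(max_epochs):
    order = np.random.permutation(len(X_train))        # random batch order
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        logits = net.forward(X_train[idx])             # (batch, 10) scores
        probs = softmax(logits)                        # Eq. (17)

        # dL/dz_3 for the summed cross-entropy: probs minus the one-hot labels
        grad = probs.copy()
        grad[np.arange(len(idx)), y_train[idx]] -= 1

        net.backward(grad)                             # backpropagate
        net.step(step_size)                            # gradient descent update

    # Evaluate on the validation split once per epoch
    val_acc = np.mean(np.argmax(net.forward(X_val), axis=1) == y_val)
```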
2.3 Analyzing Hyperparameter Choices

Neural networks have many hyperparameters. These range from architectural choices (How many layers? How wide should each layer be? What activation function should be used?) to optimization parameters (What batch size for stochastic gradient descent? What step size (aka learning rate)? How many epochs should I train for?). This section has you modify many of these to examine their effect. The default parameters are below for easy reference.

```python
# GLOBAL PARAMETERS FOR STOCHASTIC GRADIENT DESCENT
np.random.seed(102)
step_size = 0.01
batch_size = 200
max_epochs = 200

# GLOBAL PARAMETERS FOR NETWORK ARCHITECTURE
number_of_layers = 2
width_of_layers = 16  # only matters if number_of_layers > 1
activation = "ReLU" if False else "Sigmoid"
```

Optimization Parameters. Optimization parameters in stochastic gradient descent are strongly inter-related. Large batch sizes give less noisy estimates of the gradient, so larger step sizes can be used. But larger batch sizes also mean fewer gradient updates per epoch, so we might need to increase the max epochs. Getting a set of parameters that works well can be tricky and requires checking validation set performance. Further, these "good parameters" will vary from model to model.

Q4 Learning Rate [2pts]. The learning rate (or step size) in stochastic gradient descent controls how large a step we take in the direction of the loss gradient at each iteration. The batch size determines how many data points we use to estimate the gradient. Modify the hyperparameters to run the following experiments:

1. Step size of 0.0001 (leave default values for other hyperparameters)
2. Step size of 5 (leave default values for other hyperparameters)
3. Step size of 10 (leave default values for other hyperparameters)

Include these plots in your report and answer the following questions:

a) Compare and contrast the learning curves with the curve from the default parameters. What do you observe in terms of smoothness, shape, and the performance they reach?
b) For (a), what would you expect to happen if the max epochs were increased?

Activation Function and Depth. As networks get deeper (i.e., have more layers), they tend to become able to fit more complex functions (though this may also lead to overfitting). However, depth also means the backpropagated gradient accumulates many product terms before reaching the lower layers – often resulting in gradients of relatively small magnitude, which makes learning slower. Certain activation functions make this better or worse depending on the shape of their derivative. One popular choice is the Rectified Linear Unit, or ReLU, activation:

$$\text{ReLU}(x) = \max(0, x) \tag{36}$$

This is especially common in very deep networks. In the next question, we'll see why experimentally.

Q5 ReLUs and Vanishing Gradients [3pts]. Modify the hyperparameters to run the following experiments:

1. 5 layers with Sigmoid activation (leave default values for other hyperparameters)
2. 5 layers with Sigmoid activation and a step size of 0.1 (leave default values for other hyperparameters)
3. 5 layers with ReLU activation (leave default values for other hyperparameters)

Include these plots in your report and answer the following questions:

a) Compare and contrast the learning curves you observe and the curve for the default parameters in terms of smoothness, shape, and the performance they reach. Do you notice any differences in the relationship between the train and validation curves in each plot?
b) If you observed that increasing the learning rate in (2) improves over (1), why might that be?
c) If (3) outperformed (1), why might that be? Consider the derivatives of the sigmoid and ReLU functions.

2.4 Randomness in Training

There is also a good deal of randomness in training a neural network with stochastic gradient descent – network weights are randomly initialized and batches are randomly ordered. This can make a non-trivial difference to outcomes.
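A simple way to quantify this run-to-run variation is to repeat training under several seeds and look at the spread of final validation accuracies. In the hypothetical sketch below, train_and_evaluate is an assumed stand-in for your training script that returns the final validation accuracy:

```python
# Hypothetical sketch: measure run-to-run variance across random seeds.
import numpy as np

accs = []
for seed in [1, 7, 42, 102, 2024]:
    np.random.seed(seed)                 # controls both init and batch order
    accs.append(train_and_evaluate())    # assumed: returns final val accuracy

print("mean: %.3f  std: %.3f" % (np.mean(accs), np.std(accs)))
```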
Q6 Measuring Randomness [1pt]. Using the default hyperparameters, set the random seed to 5 different values and report the validation accuracies you observe after training. What impact does this randomness have on the certainty of your conclusions in the previous questions?

2.5 Make Your Kaggle Submission

Great work getting here. In this section, you'll submit the predictions of your best model to the class-wide Kaggle competition. You are free to make any modification to your neural network or the optimization procedure to improve performance; however, it must remain a feed-forward neural network! For example, you can change any of the optimization hyperparameters, add momentum / weight decay, vary the number of layers or width, add dropout or residual connections, etc.

Q7 Kaggle Submission [8pts]. Submit a set of predictions to Kaggle that outperforms the baseline on the public leaderboard. To make a valid submission, use the training set to train your neural network classifier and then apply it to the test instances in mnist_small_test.csv, available from Kaggle's Data tab. Format your output as a two-column CSV as below:

id,digit
0,3
1,9
2,4
3,1
...

where the id is just the row index in mnist_small_test.csv. You may submit up to 10 times a day. In your report, tell us what modifications you made for your final submission.

Extra Credit and Bragging Rights [1.25pts Extra Credit]. The TA has made a submission to the leaderboard. Any submission outperforming the TA on the private leaderboard at the end of the homework period will receive 1.25 extra credit points on this assignment. Further, the top 5 ranked submissions will "win HW3" and receive bragging rights.

3 Debriefing (required in your report)

1. Approximately how many hours did you spend on this assignment?
2. Would you rate it as easy, moderate, or difficult?
3. Did you work on it mostly alone or did you discuss the problems with others?
4. How deeply do you feel you understand the material it covers (0%–100%)?
5. Any other comments?