CS434 Homework 1: k-Nearest Neighbor Classification and Statistical Estimation

Overview and Objectives. In this homework, we are going to implement a kNN classifier and get some practice with the concepts in statistical estimation we covered in lecture. The assignment will help you understand how to apply these concepts to real applications.

How to Do This Assignment.
• Each question that you need to respond to is clearly marked in a blue "Task Box" with its corresponding point value listed.
• We prefer typeset solutions (LaTeX / Word) but will accept scanned written work if it is legible. If a TA can't read your work, they can't give you credit.
• Submit your solutions to Canvas as a zip file including your report and any code the assignment asks you to develop.
• Programming should be done in Python and numpy. If you don't have Python installed on your machine, you can install it from https://www.python.org/downloads/. Instructions for installing numpy are at https://numpy.org/install/. You can also search the internet for numpy tutorials if you haven't used it before. Google is your friend!

You are NOT allowed to...
• Use machine learning packages such as sklearn.
• Use data analysis packages such as pandas or seaborn.
• Discuss low-level details or share code / solutions with other students.

Advice. Start early. Start early. Start early. There are two sections to this assignment – one involving working with math (20% of the grade) and another focused more on programming (80% of the grade). Read the whole document before deciding where to start.

How to submit. Submit a zip file to Canvas. Inside, you will need to have all your working code and hw1-report.pdf. You will also submit test set predictions to a class Kaggle. This is required to receive credit for Q10.

1 Statistical Estimation [10pts]

The Poisson distribution is a discrete probability distribution over the natural numbers starting from zero (i.e. 0, 1, 2, 3, ...). Specifically, it models how many occurrences of an event will happen within a fixed interval. For example – how many people will call into a call center within a given hour. Importantly, it assumes that each individual event (a person calling the center) is independent of the others and occurs with a fixed probability. Say there is a 2% chance of someone calling every minute; then on average we would expect 1.2 people per hour – but sometimes there would be more and sometimes fewer. The Poisson distribution models that uncertainty over the total number of events in an interval. The probability mass function for the Poisson with parameter λ is given as:

    Pois(X = x; λ) = λ^x e^{−λ} / x!,   ∀x ∈ {0, 1, 2, ...},   λ ≥ 0    (1)

where the rate λ can be interpreted as the average number of occurrences in an interval. In this section, we'll derive maximum likelihood estimates for λ and show that the Gamma distribution is a conjugate prior to the Poisson.

▶ Q1 Maximum Likelihood Estimation of λ [4pts]. Assume we observe a dataset of occurrence counts D = {x_1, x_2, ..., x_N} coming from N i.i.d. random variables distributed according to Pois(X = x; λ). Derive the maximum likelihood estimate of the rate parameter λ. To help guide you, consider the following steps:
1. Write out the log-likelihood function log P(D|λ).
2. Take the derivative of the log-likelihood with respect to the parameter λ.
3. Set the derivative equal to zero and solve for λ – call this maximizing value λ̂_MLE.
Your answer should make intuitive sense given our interpretation of λ as the average number of occurrences.
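Although it is not part of the required submission, you can numerically sanity-check your Q1 derivation. The sketch below is a rough, optional illustration: the toy counts, grid range, and function name are made-up assumptions, not anything provided by the assignment. It evaluates the Poisson log-likelihood over a grid of candidate λ values, and its argmax should agree with your closed-form λ̂_MLE.

import numpy as np
from math import lgamma

# Toy occurrence counts standing in for D = {x_1, ..., x_N} (made up for illustration).
D = np.array([2, 0, 3, 1, 4, 2, 1])

def poisson_log_likelihood(lam, data):
    """log P(D|lam) = sum_i [ x_i log(lam) - lam - log(x_i!) ], using lgamma for log(x!)."""
    log_factorials = np.array([lgamma(x + 1) for x in data])
    return np.sum(data * np.log(lam) - lam - log_factorials)

# Evaluate the log-likelihood on a grid of candidate rates and take the best one.
grid = np.linspace(0.01, 10.0, 2000)
log_liks = np.array([poisson_log_likelihood(lam, D) for lam in grid])
print("numeric argmax of log P(D|lam):", grid[np.argmax(log_liks)])
# Compare this number against your closed-form estimate from Q1.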
In lecture, we discussed how to use priors to encode beliefs about parameters before any data is seen. We showed that the Beta distribution is a conjugate prior to the Bernoulli – that is to say, the posterior of a Bernoulli likelihood and a Beta prior is itself a Beta distribution (i.e. Beta ∝ Bernoulli × Beta). The Poisson distribution also has a conjugate prior – the Gamma distribution. Gamma is a continuous distribution over the range [0, ∞) with probability density function:

    Gamma(Λ = λ; α, β) = β^α λ^{α−1} e^{−βλ} / Γ(α),   ∀λ ≥ 0,   α, β > 0    (2)

▶ Q2 Maximum A Posteriori Estimate of λ with a Gamma Prior [4pts]. As before, assume we observe a dataset of occurrence counts D = {x_1, x_2, ..., x_N} coming from N i.i.d. random variables distributed according to Pois(X = x; λ). Further, assume that λ is distributed according to Gamma(Λ = λ; α, β). Derive the MAP estimate of λ. To help guide you, consider the following steps:
1. Write out the log-posterior log P(λ|D), which equals log P(D|λ) + log P(λ) up to an additive constant.
2. Take the derivative of log P(D|λ) + log P(λ) with respect to the parameter λ.
3. Set the derivative equal to zero and solve for λ – call this maximizing value λ̂_MAP.

In the previous question we found the MAP estimate for λ under a Poisson-Gamma model; however, we didn't demonstrate that the Gamma is a conjugate prior to the Poisson – i.e. we did not show that the product of a Poisson likelihood and a Gamma prior results in another Gamma distribution (or at least is proportional to one).

▶ Q3 Deriving the Posterior of a Poisson-Gamma Model [2pts]. Show that the Gamma distribution is a conjugate prior to the Poisson by deriving expressions for the parameters α_P, β_P of a Gamma distribution such that P(λ|D) ∝ Gamma(λ; α_P, β_P). [Hint: Consider P(D|λ)P(λ) and group like terms/exponents. Try to massage the expression until it looks like the numerator of a Gamma distribution. The denominator can be mostly ignored if it is constant with respect to λ, as we are only trying to show a proportionality (∝).]

2 k-Nearest Neighbor (kNN) [50pts]

In this section, we'll implement our first machine learning algorithm of the course – k-Nearest Neighbors. We are considering a binary classification problem where the goal is to classify whether a person has an annual income of more or less than $50,000 given census information. Test results will be submitted to a class-wide Kaggle competition. As no validation split is provided, you'll need to perform cross-validation to fit good hyperparameters.

Data Analysis. We've done the data processing for you for this assignment; however, getting familiar with the data before applying any algorithm is a very important part of applying machine learning in practice. Quirks of the dataset could significantly impact your model's performance or how you interpret the outcome of your experiments. For instance, if I have an algorithm that achieved 70% accuracy on this task – how good or bad is that? We'll see!

We've split the data into two subsets – a training set with 8,000 labelled rows, and a test set with 2,000 unlabelled rows. These can be downloaded from the Data tab of the HW1 Kaggle as "train.csv" and "test_pub.csv" respectively. Both files come with a header row, so you can see which column belongs to which feature. Below you will find a table listing the attributes available in the dataset. We note that categorizing some of these attributes into only a few categories is reductive (e.g. only 14 occupations) or might reinforce a particular set of social norms (e.g. categorizing sex or race in particular ways).
For this homework, we reproduced this dataset from its source without modifying these attributes; however, it is useful to consider these issues as machine learning practitioners.

attribute name: type. possible values
• id: numerical. Unique for each data point. Don't use this as a feature (it will hurt, badly).
• age: numerical.
• workclass: categorical. Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay
• education-num: ordinal. 1: Preschool, 2: 1st-4th, 3: 5th-6th, 4: 7th-8th, 5: 9th, 6: 10th, 7: 11th, 8: 12th, 9: HS-grad, 10: Some-college, 11: Assoc-voc, 12: Assoc-acdm, 13: Bachelors, 14: Masters, 15: Prof-school, 16: Doctorate
• marital-status: categorical. Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
• occupation: categorical. Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
• relationship: categorical. Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
• race: categorical. White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
• sex: categorical. 0: Male, 1: Female
• capital-gain: numerical.
• capital-loss: numerical.
• hours-per-week: numerical.
• native-country: categorical. United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
• income: ordinal. 0: <=50K, 1: >50K. This is the class label. Don't use this as a feature.

Our dataset has three types of attributes – numerical, ordinal, and nominal. Numerical attributes represent continuous numbers (e.g. hours-per-week worked). Ordinal attributes are a discrete set with a natural ordering, for instance different levels of education. Nominal attributes are also discrete sets of possible values; however, there is no clear ordering between them (e.g. native-country). These different attribute types require different preprocessing. As discussed in class, numerical fields have been normalized. For nominal variables like workclass, marital-status, occupation, relationship, race, and native-country, we've transformed these into one column for each possible value, containing either a 0 or a 1. For example, the first instance in the training set reads:

[0, 0.136, 0.533, 0.0, 0.659, 0.397, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]

where all the zeros and ones correspond to these binarized variables. The following questions guide you through exploring the dataset and help you understand some of the steps we took when preprocessing the data.

▶ Q4 Encodings and Distance [2pts]. To represent nominal attributes, we apply a one-hot encoding technique – transforming each possible value into its own binary attribute (the short numpy sketch below illustrates this mapping in code).
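For concreteness, here is a minimal numpy sketch of one-hot encoding. It is purely illustrative – the provided CSV files are already encoded this way, and the column values and variable names below are made up.

import numpy as np

# A hypothetical raw nominal column (made up for illustration; train.csv is already encoded).
workclass = np.array(["Private", "State-gov", "Never-worked", "State-gov"])

# One binary column per possible value, with a 1 where the row takes that value.
values = np.unique(workclass)                                   # sorted unique values
one_hot = (workclass[:, None] == values[None, :]).astype(int)   # shape (4 rows, 3 values)

print(values)    # ['Never-worked' 'Private' 'State-gov']
print(one_hot)   # each row has exactly one 1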
For example, if we have an attribute workclass with three possible values Private, State-gov, and Never-worked, we would binarize the workclass attribute as shown in the table below (each row is a single example data point):

workclass (original field) | workclass=Private  workclass=State-gov  workclass=Never-worked
---------------------------+----------------------------------------------------------------
Private                    | 1                  0                    0
State-gov                  | 0                  1                    0
Never-worked               | 0                  0                    1
State-gov                  | 0                  1                    0
                             (fields after binarization)

A common naive preprocessing is to treat all categorical variables as ordinal – assigning increasing integers to each possible value. For example, such an encoding would say 1=Private, 2=State-gov, and 3=Never-worked. Contrast these two encodings. Focus on how each choice affects Euclidean distance in kNN.

▶ Q5 Looking at Data [3pts]. What percent of the training data has an income >50K? Explain how this might affect your model and how you interpret the results. For instance, would you say a model that achieved 70% accuracy is a good or poor model? How many dimensions does each data point have (ignoring the id attribute and class label)? [Hint: check the data – one-hot encodings increased dimensionality.]

2.1 k-Nearest Neighbors

In this part, you will need to implement the kNN classifier algorithm with the Euclidean distance. (If you are not familiar with Euclidean distance, you can go back to slide 18 of L1.2 - kNN.) The kNN algorithm is simple since you already have the preprocessed data. To make a prediction for a new data point, there are three main steps:
1. Calculate the distance of that data point to every training example.
2. Get the k nearest points (i.e. the k with the lowest distance).
3. Return the majority class of these neighbors as the prediction.

Implementing this using native Python operations will be quite slow, but lucky for us there is the numpy package. numpy makes matrix operations quite efficient and easy to code. How will matrix operations help us? The next question walks you through figuring out the answer. Some useful things to check out in numpy: broadcasting, np.linalg.norm, np.argsort() or np.argpartition(), and array slicing.

▶ Q6 Norms and Distances [2pts]. Distances and vector norms are closely related concepts. For instance, the L2 norm of a vector x (defined below) can be interpreted as the Euclidean distance between x and the zero vector:

    ||x||_2 = ( Σ_{i=1}^{d} x_i^2 )^{1/2}    (3)

Given a new vector z, show that the Euclidean distance between x and z can be written as an L2 norm. [kNN implementation note for later: you can compute norms efficiently in numpy using np.linalg.norm.]

In kNN, we need to compute the distance between every training example x_i and a new point z in order to make a prediction. As you showed in the previous question, computing this for one x_i can be done by applying an arithmetic operation between x_i and z, then taking a norm. In numpy, arithmetic operations between matrices and vectors are sometimes defined by "broadcasting", even if standard linear algebra doesn't allow for them. For example, given a matrix X of size n × d and a vector z of size d, numpy will happily compute Y = X − z such that Y is the result of the vector z being subtracted from each row of X. Combining this with your answer from the previous question can make computing distances to every training point quite efficient and easy to code.

▶ Q7 Implement kNN [10pts]. Okay, enough math and talk about numpy. It is time to get our hands dirty. Let's write some code! Implement k-Nearest Neighbors using Euclidean distance for this dataset in Python in a file named knn.py (a rough sketch of the distance-and-vote step is shown below).
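To make the broadcasting idea concrete, here is a minimal sketch of the distance-and-vote step. It is only an illustration under assumed names (knn_predict, X_train, y_train, z are not a required interface), and you will still need to handle data loading, tie-breaking, and the overall prediction loop in your own knn.py.

import numpy as np

def knn_predict(X_train, y_train, z, k):
    """Predict the 0/1 label of a single query point z using its k nearest neighbors.

    X_train: (n, d) training features, y_train: (n,) labels in {0, 1}, z: (d,) query.
    Names and interface are illustrative assumptions, not a required design.
    """
    # Broadcasting: z is subtracted from every row of X_train at once, and
    # np.linalg.norm with axis=1 returns all n Euclidean distances in one call.
    dists = np.linalg.norm(X_train - z, axis=1)

    # Indices of the k smallest distances (argpartition avoids a full sort).
    nearest = np.argpartition(dists, k - 1)[:k]

    # Majority vote over the neighbors' labels (ties go to class 0 here).
    return int(np.sum(y_train[nearest]) > k / 2)

# Example usage (assuming X_train, y_train were loaded from train.csv and z is one test row):
# y_hat = knn_predict(X_train, y_train, z, k=5)

Predicting the whole test set is then just a loop over its rows; you can vectorize that loop as well if you want more speed.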
Conceptually, you will need to load the training and test data and have some function that finds the nearest neighbors of each test point in order to make a prediction.

2.2 K-Fold Cross Validation

We do not provide class labels for the test set or a validation set, so you'll need to implement K-fold Cross Validation to check whether your implementation is correct and to select hyperparameters. (Note that K here has nothing to do with the k in kNN – it is just overloaded notation.) As discussed in class, K-fold Cross Validation divides the training set into K segments and then trains K models – leaving out one of the segments each time and training on the others. Each trained model is then evaluated on the left-out fold to estimate performance. Overall performance is estimated as the mean of these K per-fold performances. (A rough sketch of the fold-splitting logic is included at the end of this document for reference.)

▶ Q8 Implement 4-fold Cross Validation [8pts]. Inside your knn.py file, implement 4-fold cross validation for your kNN algorithm. Your implementation of cross validation should return the observed accuracies for each fold. We'll use this in the next part.

▶ Q9 Hyperparameter Search [15pts]. To search for the best hyperparameters, we will use cross-validation to estimate our accuracy on new data. For each k in 1, 3, 5, 7, 9, 99, 999, and 8000, report:
• the accuracy on the training set when using the entire training set for kNN (call this training accuracy), and
• the mean and variance of the 4-fold cross validation accuracies (call this validation accuracy).
What is the best number of neighbors (k) you observe? When k = 1, is the training error 0%? Why or why not? What trends in training and cross-validation accuracy do you observe with increasing k? Do they relate to underfitting and overfitting?

2.3 Kaggle Submission

Great work getting here. In this section, you'll submit the predictions of your best model to the class-wide Kaggle competition. You are free to make any modification to your kNN algorithm to improve performance; however, it must remain a kNN algorithm! You can change distances, weightings, select subsets of features, etc.

▶ Q10 Kaggle Submission [10pts]. Submit a set of predictions to Kaggle that outperforms the baseline on the public leaderboard. To make a valid submission, use the training set to build your kNN classifier and then apply it to the test instances in test_pub.csv, available from Kaggle's Data tab. Format your output as a two-column CSV as below:

id,income
0,0
1,0
2,0
3,0
4,0
5,1
6,0
...

You may submit up to 5 times a day. In your report, tell us what modifications you made to kNN for your final submission.

Extra Credit and Bragging Rights [2.5pt Extra Credit]. The TA has made a submission to the leaderboard. Any submission outperforming the TA on the private leaderboard at the end of the homework period will receive 2.5 extra credit points on this assignment. Further, the top 5 ranked submissions will "win HW1" and receive bragging rights.

3 Debriefing (required in your report)

1. Approximately how many hours did you spend on this assignment?
2. Would you rate it as easy, moderate, or difficult?
3. Did you work on it mostly alone or did you discuss the problems with others?
4. How deeply do you feel you understand the material it covers (0%–100%)?
5. Any other comments?
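For reference (as mentioned in Section 2.2), here is a minimal sketch of how the 4-fold splitting and per-fold accuracies could be organized. It is only an illustration under assumed names – cross_validation_accuracies is not a required interface, and predict_fn stands in for whatever prediction function your knn.py provides.

import numpy as np

def cross_validation_accuracies(X, y, k, predict_fn, num_folds=4):
    """Return a list with the held-out accuracy of each fold for kNN with k neighbors.

    predict_fn(X_train, y_train, z, k) is assumed to return a 0/1 prediction for a
    single query point z (e.g. the knn_predict sketch earlier). Names are illustrative.
    """
    n = X.shape[0]
    fold_ids = np.arange(n) % num_folds          # assign every row to a fold (shuffle first if desired)
    accuracies = []
    for fold in range(num_folds):
        held_out = (fold_ids == fold)
        X_tr, y_tr = X[~held_out], y[~held_out]  # train on the other folds
        X_va, y_va = X[held_out], y[held_out]    # evaluate on the held-out fold
        preds = np.array([predict_fn(X_tr, y_tr, z, k) for z in X_va])
        accuracies.append(float(np.mean(preds == y_va)))
    return accuracies

# Example usage (assuming X, y were loaded from train.csv and knn_predict is defined):
# accs = cross_validation_accuracies(X, y, k=5, predict_fn=knn_predict)
# print(np.mean(accs), np.var(accs))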