P1. Use the MNIST dataset and build a binary classifier that detects 3 versus 5. Assume 3 is the positive class and 5 is the negative class. Note that for training you only need the subset of the original training set corresponding to 3s and 5s; similarly, for testing you will use the subset of the original test set corresponding to 3s and 5s. For this part, use the SGDClassifier with the default hyperparameters unless mentioned otherwise.

(a) Use cross_val_score() to show the prediction accuracy under cross-validation.

(b) Use cross_val_predict() to generate predictions on the training data. Then generate the following:
• The confusion matrix
• The precision score
• The recall score
• The F1 score

(c) Use cross_val_predict() to generate the prediction scores on the training set. Then plot the precision and recall curves as functions of the threshold value.

(d) Based on the curves, what would be a sensible threshold value to choose? Generate predictions under the chosen threshold, and evaluate the precision and recall scores of those predictions.

(e) Plot the ROC curve and evaluate the ROC AUC score.

(f) Try the RandomForestClassifier. Plot the ROC curve and evaluate the ROC AUC score.

(g) Repeat part (f) with feature scaling using StandardScaler().

P2. Build a multiclass classifier that distinguishes three classes: 3, 5, and others (i.e., neither 3 nor 5). Do this by training three binary classifiers: one that distinguishes between 3 and 5, one that distinguishes between 3 and others, and one that distinguishes between 5 and others. Use the SGDClassifier for each of the binary classifiers. For prediction, given the image of a digit, count the number of duels won as follows:
• Assume the digit is 3. Pass it to the 3-vs-5 classifier and the 3-vs-others classifier, and count the number of wins by 3.
• Assume the digit is 5. Pass it to the 3-vs-5 classifier and the 5-vs-others classifier, and count the number of wins by 5.
• Assume the digit is ‘others’. Pass it to the 3-vs-others classifier and the 5-vs-others classifier, and count the number of wins by ‘others’.
Whichever assumption gives the most wins becomes the predicted class for the input. If there is a tie, break it randomly.

P3. Use the KNeighborsClassifier, which has built-in support for multiclass classification, to classify all 10 digits. Try to build a classifier that achieves over 97% accuracy on the test set. The KNeighborsClassifier works quite well for this task if you find the right hyperparameters; use a grid search on the weights and n_neighbors hyperparameters. Once you find a good set of hyperparameters, conduct error analysis on the training dataset. In particular, compute the confusion matrix, display it as an image using matshow(), and discuss the kinds of errors your model makes.

Please submit a PDF file that contains your code and results, along with the Jupyter notebook if you use one.

Hint: You can build on the code discussed during the lectures, which can be downloaded from the GitHub page: https://github.com/ageron/handson-ml2. Most of the code for P3 is related to Exercise 1 there. Illustrative starter sketches for P1–P3 are also included at the end of this handout.

P4. See the separate PDF file ‘A4-P4.pdf’.
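
Starter sketches (illustrative, not required). The Python fragments below show one plausible way to begin each problem; they are not the required solution. The variable names and specific values in them (e.g., X_train_35, the threshold 2000, the grid ranges) are placeholder choices, not part of the assignment, and the numbers you obtain will differ.

P1 setup and (a). This assumes the fetch_openml copy of MNIST, where the labels come back as strings and the first 60,000 rows form the usual training split:

    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score

    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    X, y = mnist['data'], mnist['target']

    # The first 60,000 images are the usual training split, the rest the test split.
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]

    # Keep only the 3s and 5s; 3 is the positive class.
    train_mask = (y_train == '3') | (y_train == '5')
    test_mask = (y_test == '3') | (y_test == '5')
    X_train_35, y_train_35 = X_train[train_mask], (y_train[train_mask] == '3')
    X_test_35, y_test_35 = X_test[test_mask], (y_test[test_mask] == '3')

    # (a) Cross-validated accuracy with an otherwise-default SGDClassifier.
    sgd_clf = SGDClassifier(random_state=42)
    print(cross_val_score(sgd_clf, X_train_35, y_train_35, cv=3, scoring='accuracy'))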
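
P1(b). Out-of-fold predictions on the training subset and the four requested metrics, reusing sgd_clf, X_train_35, and y_train_35 from the setup sketch:

    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # Predictions are made on held-out folds, so the metrics are honest estimates.
    y_train_pred = cross_val_predict(sgd_clf, X_train_35, y_train_35, cv=3)
    print(confusion_matrix(y_train_35, y_train_pred))
    print('precision:', precision_score(y_train_35, y_train_pred))
    print('recall:', recall_score(y_train_35, y_train_pred))
    print('F1:', f1_score(y_train_35, y_train_pred))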
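
P1(c)–(d). Decision scores via method='decision_function', the precision/recall-versus-threshold plot, and predictions at a chosen threshold. The value 2000 below is a stand-in; read a sensible value off your own plot:

    import matplotlib.pyplot as plt
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import precision_recall_curve, precision_score, recall_score

    # (c) Raw decision scores instead of hard class predictions.
    y_scores = cross_val_predict(sgd_clf, X_train_35, y_train_35, cv=3,
                                 method='decision_function')
    precisions, recalls, thresholds = precision_recall_curve(y_train_35, y_scores)

    # precisions/recalls have one extra entry, hence the [:-1].
    plt.plot(thresholds, precisions[:-1], label='precision')
    plt.plot(thresholds, recalls[:-1], label='recall')
    plt.xlabel('threshold')
    plt.legend()
    plt.show()

    # (d) Predictions at a threshold chosen from the curves (placeholder value).
    chosen_threshold = 2000
    y_pred_thr = (y_scores >= chosen_threshold)
    print('precision:', precision_score(y_train_35, y_pred_thr))
    print('recall:', recall_score(y_train_35, y_pred_thr))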
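
P1(e)–(g). The ROC curve and AUC for the SGD scores, then a RandomForestClassifier scored by its positive-class probability, with and without StandardScaler in a pipeline (y_scores comes from the P1(c) sketch):

    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import cross_val_predict
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # (e) ROC and AUC for the SGD decision scores.
    fpr, tpr, _ = roc_curve(y_train_35, y_scores)
    print('SGD AUC:', roc_auc_score(y_train_35, y_scores))

    # (f) Random forest: use the predicted probability of the positive class
    # as the score (column 1, since classes_ sorts False before True).
    forest_clf = RandomForestClassifier(random_state=42)
    y_probas = cross_val_predict(forest_clf, X_train_35, y_train_35, cv=3,
                                 method='predict_proba')
    y_scores_forest = y_probas[:, 1]
    fpr_f, tpr_f, _ = roc_curve(y_train_35, y_scores_forest)
    print('forest AUC:', roc_auc_score(y_train_35, y_scores_forest))

    plt.plot(fpr, tpr, label='SGD')
    plt.plot(fpr_f, tpr_f, label='random forest')
    plt.xlabel('false positive rate')
    plt.ylabel('true positive rate')
    plt.legend()
    plt.show()

    # (g) Same forest, but with feature scaling in front.
    scaled_forest = make_pipeline(StandardScaler(),
                                  RandomForestClassifier(random_state=42))
    y_probas_s = cross_val_predict(scaled_forest, X_train_35, y_train_35, cv=3,
                                   method='predict_proba')
    print('scaled forest AUC:', roc_auc_score(y_train_35, y_probas_s[:, 1]))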
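
P2. One way to wire up the three binary classifiers and the duel count; the helper predict_duel() and the 'others' label string are illustrative choices, and X_train, y_train, X_test, y_test come from the P1 setup sketch:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.default_rng(42)

    is3, is5 = (y_train == '3'), (y_train == '5')
    others = ~(is3 | is5)

    # Each classifier is trained only on the two classes it must separate;
    # the boolean target is True when the first-named class is the input.
    clf_3v5 = SGDClassifier(random_state=42).fit(X_train[is3 | is5], is3[is3 | is5])
    clf_3vO = SGDClassifier(random_state=42).fit(X_train[is3 | others], is3[is3 | others])
    clf_5vO = SGDClassifier(random_state=42).fit(X_train[is5 | others], is5[is5 | others])

    def predict_duel(X):
        p35 = clf_3v5.predict(X)  # True -> 3 beats 5
        p3o = clf_3vO.predict(X)  # True -> 3 beats others
        p5o = clf_5vO.predict(X)  # True -> 5 beats others
        wins = np.stack([
            p35.astype(int) + p3o.astype(int),        # wins counted for 3
            (~p35).astype(int) + p5o.astype(int),     # wins counted for 5
            (~p3o).astype(int) + (~p5o).astype(int),  # wins counted for 'others'
        ], axis=1)
        labels = np.array(['3', '5', 'others'])
        # Break ties randomly among the classes with the most wins.
        return np.array([labels[rng.choice(np.flatnonzero(w == w.max()))]
                         for w in wins])

    # Map the true test digits onto the three classes and check accuracy.
    y_test_3cls = np.where(y_test == '3', '3',
                           np.where(y_test == '5', '5', 'others'))
    print('accuracy:', (predict_duel(X_test) == y_test_3cls).mean())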
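
P3. A grid search over weights and n_neighbors, followed by the matshow() error analysis. The grid values shown are a reasonable starting range, not the required ones, and a full k-NN search over 60,000 images is computationally heavy, so consider debugging on a subset first:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import GridSearchCV, cross_val_predict
    from sklearn.neighbors import KNeighborsClassifier

    param_grid = {'weights': ['uniform', 'distance'], 'n_neighbors': [3, 4, 5]}
    grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3, verbose=1)
    grid_search.fit(X_train, y_train)  # full 10-class split from the setup sketch
    print(grid_search.best_params_, grid_search.best_score_)
    print('test accuracy:', grid_search.score(X_test, y_test))

    # Error analysis on the training set: out-of-fold predictions, then matshow().
    y_train_pred = cross_val_predict(grid_search.best_estimator_, X_train, y_train, cv=3)
    conf_mx = confusion_matrix(y_train, y_train_pred)
    plt.matshow(conf_mx, cmap=plt.cm.gray)
    plt.show()

    # Normalize each row and zero the diagonal so the errors stand out.
    norm_conf_mx = conf_mx / conf_mx.sum(axis=1, keepdims=True)
    np.fill_diagonal(norm_conf_mx, 0)
    plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
    plt.show()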