1. Visualizing SGD Trajectories of Fully-Connected Neural Networks (FCNNs) [25 pts]: First, read all the “Introduction to PyTorch” tutorials (see https://pytorch.org/tutorials/beginner/basics/intro.html), including “Training a classifier”, “Learn the Basics”, “Quickstart”, “Tensors”, “Datasets & Dataloaders”, “Transforms”, “Build Model”, “Autograd”, “Optimization”, and “Save & Load Model”. Then complete the following tasks:

(a) [10 pts]: Use PyTorch to build and train a simple FCNN with at least 2 hidden layers to classify the Fashion MNIST dataset (similar to Homework 3). You can use label-preserving transformations such as rotation (make sure not to rotate the images by more than, say, ±10°) to improve generalization. Your network must be fully-connected – it cannot use convolution. Report the test accuracy you get in the PDF. (A minimal sketch is given in the first code block after this problem.)

(b) [10 pts]: For any fixed FCNN architecture with at least 2 hidden layers, visualize the gradient descent trajectory in 3-D from two different random parameter initializations. In your plot, you should use the first two axes to represent different values for the NN’s parameters (which we will denote here collectively simply as p) and the third (vertical) axis to represent the cross-entropy f_CE(p). Of course, there are far (!) more than just 2 parameters in the NN, and thus it will be necessary to perform dimensionality reduction. You should use principal component analysis (PCA) to reduce the parameter space down to just 2 dimensions. The two PCs will represent different “directions” along which the NN parameters can vary, where these directions are chosen to minimize the reconstruction error of the data. In general, it will not be the case that a PC corresponds to just a single weight/bias; rather, moving along each PC axis will correspond to changing all of the NN’s parameters at once. Concrete steps (a sketch covering all four steps appears in the second code block after this problem):

i. Run SGD at least two times to collect multiple trajectories of p. (Ask ChatGPT for help on how to extract all the parameters from the entire NN as a single vector.) For each value, save the corresponding training cost f_CE(p) – you will need these later. To keep things tractable, train the network on just 1000 examples of Fashion MNIST.

ii. Use the collected p vectors to estimate the first 2 principal components that map from the full parameter space down to just 2 dimensions (see sklearn.decomposition.PCA).

iii. For each p that was encountered during training, project it into the 2-D space, and then plot it as part of a 3-D scatter plot with its associated f_CE(p) value that was computed during SGD.

iv. To give a sense of the “landscape” through which SGD is “hiking”, compute a dense grid of at least 25x25 points in the 2-D PC space. For each point p̃ in this 2-D space, project it back into the full parameter space (see PCA.inverse_transform) to obtain a value for p. Load these parameters p into the NN (ask ChatGPT for help on how to do this). Finally, make a surface plot (use the plot_surface function in matplotlib) showing the corresponding cost values over all points in this grid. Render this surface plot in the same figure as the 3-D scatter plot. Make sure to compute the f_CE values (for both the 3-D scatter plot and the surface plot) on the set of 1000 training data, since that is what SGD directly optimizes. Include your figure in the PDF you submit. An example figure is shown below:

[Example figure: 3-D surface of f_CE over the 2-D PC space, with two SGD trajectories rendered as scatter points.]

As you can see, the two SGD trajectories started on different sides of a ridge and ended up descending into different valleys.
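A minimal sketch for part (a). The layer widths (256 and 128), learning rate, batch size, and epoch count are illustrative choices, not required values; the make_fcnn helper is introduced here so the part (b) sketch below can rebuild the same architecture from scratch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Label-preserving augmentation: random rotations of at most ±10 degrees.
train_tf = transforms.Compose([transforms.RandomRotation(10), transforms.ToTensor()])
test_tf = transforms.ToTensor()

train_set = datasets.FashionMNIST("data", train=True, download=True, transform=train_tf)
test_set = datasets.FashionMNIST("data", train=False, download=True, transform=test_tf)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256)

def make_fcnn():
    # Fully-connected network with 2 hidden layers; no convolutions anywhere.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 10),
    )

model = make_fcnn()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

# Report test accuracy in the PDF.
correct = 0
with torch.no_grad():
    for x, y in test_loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
print(f"Test accuracy: {correct / len(test_set):.4f}")
```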
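A sketch of steps i–iv for part (b), reusing make_fcnn, loss_fn, and train_set from the previous block. The step counts, learning rate, and recording interval are illustrative; torch.nn.utils.parameters_to_vector and vector_to_parameters are one way to move between the NN and a flat parameter vector (the thing the problem suggests asking ChatGPT about).

```python
import numpy as np
import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA
from torch.nn.utils import parameters_to_vector, vector_to_parameters

# The 1000 training examples that SGD directly optimizes; every f_CE value
# below (scatter and surface) is evaluated on this same set.
x1k = train_set.data[:1000].float().view(1000, -1) / 255.0
y1k = train_set.targets[:1000]

def run_sgd(seed, steps=300, lr=0.1, batch=100, record_every=3):
    """Step i: run SGD from one random init; record visited p and f_CE(p).
    Recording every few steps keeps the matrix handed to PCA small."""
    torch.manual_seed(seed)
    net = make_fcnn()
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    ps, costs = [], []
    for t in range(steps):
        idx = torch.randint(0, 1000, (batch,))
        loss = loss_fn(net(x1k[idx]), y1k[idx])
        opt.zero_grad(); loss.backward(); opt.step()
        if t % record_every == 0:
            with torch.no_grad():
                ps.append(parameters_to_vector(net.parameters()).numpy().copy())
                costs.append(loss_fn(net(x1k), y1k).item())
    return np.array(ps), np.array(costs)

p1, c1 = run_sgd(seed=0)   # two trajectories from different random inits
p2, c2 = run_sgd(seed=1)

# Step ii: estimate the first 2 PCs from every visited parameter vector.
pca = PCA(n_components=2)
z = pca.fit_transform(np.vstack([p1, p2]))   # step iii: 2-D projections

# Step iv: dense 25x25 grid in PC space; reconstruct a full p at each grid
# point, load it into the NN, and evaluate f_CE on the 1000 training examples.
g1 = np.linspace(z[:, 0].min(), z[:, 0].max(), 25)
g2 = np.linspace(z[:, 1].min(), z[:, 1].max(), 25)
G1, G2 = np.meshgrid(g1, g2)
net = make_fcnn()
Z = np.zeros_like(G1)
for i in range(25):
    for j in range(25):
        p_hat = pca.inverse_transform(np.array([[G1[i, j], G2[i, j]]]))[0]
        vector_to_parameters(torch.from_numpy(p_hat).float(), net.parameters())
        with torch.no_grad():
            Z[i, j] = loss_fn(net(x1k), y1k).item()

# Render the surface and the two scatter trajectories in the same figure.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(G1, G2, Z, alpha=0.5, cmap="viridis")
ax.scatter(z[:len(c1), 0], z[:len(c1), 1], c1, color="red", label="run 1")
ax.scatter(z[len(c1):, 0], z[len(c1):, 1], c2, color="blue", label="run 2")
ax.set_xlabel("PC 1"); ax.set_ylabel("PC 2"); ax.set_zlabel("f_CE")
ax.legend()
plt.show()
```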
(c) [3 pts]: In the figure you created in part (b), the f_CE values of the points in the 3-D scatter plot (computed during SGD) do not always exactly equal the corresponding values from the surface plot generated on the dense grid of points. Why is that, and why would it be impractical to create a surface plot over the grid that exactly matches the “real” f_CE values obtained using SGD? (Think about how PCA works and how it is used here.) Answer in a few sentences in the PDF.

(d) [2 pts]: Assume that the set of all the p vectors you collected has zero mean (i.e., the sum of all the p vectors equals the zero vector). Let p, p′ represent two different configurations of the NN’s parameters, and let p̃, p̃′ ∈ R^2 represent their respective projections in the 2-D PC space. Furthermore, let p̂, p̂′ represent the reconstructions (using PCA.inverse_transform) of the NN’s parameters from p̃, p̃′. Which of the following statements are always true? Report the correct statements in your PDF.

i. If p = 2p′ (i.e., the NN parameters in the first configuration are all twice the magnitude of the corresponding parameters in the second configuration), then p̃ = 2p̃′.
ii. If p̃ = 2p̃′, then p = 2p′.
iii. If p̃ = 2p̃′, then p̂ = 2p̂′.
iv. f_CE(p̂) ≤ f_CE(p).

2. Simple CNN for Fashion MNIST [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html. Then, apply the same methodology to the Fashion MNIST dataset (see the torchvision.datasets.FashionMNIST class). Note that, since Fashion MNIST images are grayscale, they have just a single color channel rather than 3. Hence, you will need to adapt the CNN slightly so that the number of input channels in the first convolutional layer is 1 instead of 3. Also, the normalization step transforms.Normalize will use just 1 channel instead of 3. Finally, as the image size is different compared to CIFAR10, the size of the feature maps will also be different. You will thus need to update the number of incoming neurons to the first fully-connected (nn.Linear) layer accordingly. After making these modifications, train and evaluate a classifier on this dataset. Report the test accuracy in the PDF you submit. (A sketch of the adapted network is given after problem 3.)

3. Supervised Pre-Training and Fine-Tuning of CNNs [15 pts]: Read the following PyTorch tutorial: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html. Then, apply the same methodology to the Fashion MNIST dataset. Report the test accuracies in the PDF you submit, using either (a) fine-tuning of the whole model or (b) training just the final (classification) layer. (See the second sketch below.)
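For problem 2, a sketch of the tutorial CNN with the three adaptations the problem names: 1 input channel, 1-channel normalization, and a recomputed input size for the first nn.Linear layer. With two 5x5 convolutions and two 2x2 max-pools, a 28x28 image shrinks 28 → 24 → 12 → 8 → 4, hence 16·4·4 incoming neurons. Training and evaluation then proceed exactly as in the tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# Normalize with a single channel instead of 3.
tf = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_set = datasets.FashionMNIST("data", train=True, download=True, transform=tf)
test_set = datasets.FashionMNIST("data", train=False, download=True, transform=tf)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)         # 1 input channel (grayscale), not 3
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)   # 4x4 feature maps, not CIFAR10's 5x5
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))    # (N, 6, 12, 12)
        x = self.pool(F.relu(self.conv2(x)))    # (N, 16, 4, 4)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```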
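For problem 3, a sketch of the setup in the tutorial's style, assuming ResNet-18 as the pretrained backbone and a torchvision version recent enough to use the weights= argument. Since the pretrained weights expect 3-channel inputs of roughly ImageNet size, the grayscale Fashion MNIST images are replicated to 3 channels and resized; the FEATURE_EXTRACT flag switches between variants (a) and (b). The training loop itself is the same as in problem 2.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # replicate the single channel
    transforms.Resize(224),                       # match the pretrained input size
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.FashionMNIST("data", train=True, download=True, transform=tf)
loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

FEATURE_EXTRACT = True   # (b) train just the final layer; False => (a) fine-tune all
if FEATURE_EXTRACT:
    for param in model.parameters():
        param.requires_grad = False

# Replace the classification head; its new parameters require grad either way.
model.fc = nn.Linear(model.fc.in_features, 10)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.001, momentum=0.9)
```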
4. CNNs for Behavioral Cloning in Pong [20 pts]: Train an AI agent to play Pong (see https://www.gymlibrary.dev/environments/atari/pong/). In this game, each player can execute one of 6 possible actions at each timestep (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, and LEFTFIRE). The goal is to execute the best action at each timestep based on the current state of the game. To get started, first download the following files:

• https://s3.amazonaws.com/jrwprojects/pong_actions.pt
• https://s3.amazonaws.com/jrwprojects/pong_observations.pt

Together, these files define (image, action) pairs generated by an expert player from the Atari Pong video game. Using PyTorch, implement and train any NN architecture you choose (I recommend a simple CNN) to map the images to their corresponding expert actions. Your NN will implement the control policy that dictates how the agent behaves in different situations, and the approach of training this NN in a supervised manner from expert trajectories is called behavioral cloning. After training your NN, save it to a file (use torch.save). Then, load the model in play_atari.py and see how well your AI “player” does against the computer. To receive full credit, your trained agent should be able to beat the computer (i.e., reach 21 points first). (Note: for this part of the homework, you need to use a standard Python interpreter – Google Colab cannot render the frames of the video game.) A starting sketch is given below.

In addition to your Python code (homework4_WPIUSERNAME1.py or homework4_WPIUSERNAME1_WPIUSERNAME2.py for teams), create a PDF file (homework4_WPIUSERNAME1.pdf or homework4_WPIUSERNAME1_WPIUSERNAME2.pdf for teams) containing the screenshots described above.
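A hedged starting sketch for behavioral cloning. The shapes and dtypes of the two downloaded tensors are assumptions: inspect them with .shape and .dtype before committing to a layer layout. The convolution sizes, optimizer, epoch count, and the file name pong_policy.pt are illustrative choices.

```python
import torch
import torch.nn as nn

obs = torch.load("pong_observations.pt")   # assumed: (N, H, W) or (N, C, H, W) frames
acts = torch.load("pong_actions.pt")       # assumed: (N,) integer actions in {0..5}
if obs.dim() == 3:
    obs = obs.unsqueeze(1)                 # add a channel axis if frames are 2-D
obs = obs.float() / 255.0                  # assumed: pixel values stored as 0..255

policy = nn.Sequential(                    # any architecture is allowed; this is one choice
    nn.Conv2d(obs.shape[1], 16, 8, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(256), nn.ReLU(),         # LazyLinear infers the flattened size on first call
    nn.Linear(256, 6),                     # 6 possible Pong actions
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(5):
    perm = torch.randperm(len(obs))
    for i in range(0, len(obs), 128):
        idx = perm[i:i + 128]
        opt.zero_grad()
        loss_fn(policy(obs[idx]), acts[idx].long()).backward()
        opt.step()

torch.save(policy, "pong_policy.pt")       # then torch.load this from play_atari.py
```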