Homework #3, CSE 446

Classification
1. [5 points] Suppose we run the Perceptron algorithm with $w$ initialized to be an arbitrary unit vector. Suppose that the algorithm is then given the same vector $x$ (whose label is 1) over and over again. How many mistakes can it make in the worst case? Express your answer in terms of $\|x\|_2$. (Hint: derive the result from first principles; do not attempt to use the general result we proved in class.)
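To build intuition (this is not a substitute for the derivation), here is a minimal simulation sketch. It assumes the standard Perceptron update $w \leftarrow w + yx$ on each mistake and the mistake condition $w \cdot x \leq 0$; it simply counts mistakes when the same point $(x, +1)$ is shown repeatedly.

```python
# Illustrative sketch only: count Perceptron mistakes when the same point (x, +1)
# is presented repeatedly, assuming the standard update w <- w + y*x on a mistake.
import numpy as np

def count_mistakes(x, w0, max_iters=100_000):
    w = w0.copy()
    mistakes = 0
    for _ in range(max_iters):
        if np.dot(w, x) <= 0:      # mistake: the point (label +1) is misclassified
            w = w + x              # Perceptron update with y = +1
            mistakes += 1
        else:
            break                  # x is classified correctly; no further mistakes occur
    return mistakes

rng = np.random.default_rng(0)
x = rng.normal(size=5)
w0 = rng.normal(size=5)
w0 /= np.linalg.norm(w0)           # arbitrary unit vector, as in the problem statement
print(count_mistakes(x, w0))
```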
2. Consider using a linear decision boundary for classification (labels in $\{-1, 1\}$) of the form $w \cdot x = 0$ (i.e., no offset). Now consider the following loss function, evaluated at a data point $(x, y)$, which is a variant of the hinge loss:
$$\ell((x, y), w) = \max(0, -y(w \cdot x)).$$
a. [2 points] Given a dataset of $(x_i, y_i)$ pairs, write down a single step of subgradient descent with a step size of $\eta$ if we are trying to minimize
$$\frac{1}{n}\sum_{i=1}^n \ell((x_i, y_i), w)$$
for $\ell(\cdot)$ defined as above. That is, given a current iterate $w$, what is an expression for the next iterate?
b. [2 points] Use what you derived to argue that the Perceptron (as formulated in class) can be viewed as implementing SGD applied to the loss function just described (for what value of $\eta$)?
c. [1 point] Suppose your data was drawn i.i.d. and that there exists a $w^*$ that separates the two classes perfectly. Provide an explanation for why the hinge loss is generally preferred over the loss given above.
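As a companion to parts a and b, the following is a minimal numpy sketch of one subgradient step on the average of $\ell$ over the dataset. It assumes the subgradient is taken to be $0$ at the kink $y_i(w \cdot x_i) = 0$, which is one valid choice; the names are illustrative.

```python
# Sketch of a single subgradient step on (1/n) * sum_i max(0, -y_i * (w . x_i)),
# taking the subgradient to be 0 at the kink (an allowed choice). X is (n, d),
# y has entries in {-1, +1}, and eta is the step size.
import numpy as np

def subgradient_step(w, X, y, eta):
    margins = y * (X @ w)                                    # y_i * (w . x_i) for every point
    active = margins < 0                                     # points with nonzero loss
    g = -(y[active, None] * X[active]).sum(axis=0) / len(y)  # subgradient of the average loss
    return w - eta * g
```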
3. We’ve talked a lot about binary classification, but what if we have $k > 2$ classes, like the 10 digits of MNIST? Concretely, suppose you have a dataset $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \dots, k\}$. Like in our least squares classifier of Homework 1 for MNIST, we will assign a separate weight vector $w^{(\ell)}$ for each class $\ell = 1, \dots, k$; let $W = [w^{(1)}, \dots, w^{(k)}] \in \mathbb{R}^{d \times k}$. We can generalize the binary classification probabilistic model to multiple classes as follows: let
$$P_W(y_i = \ell \mid W, x_i) = \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)}.$$
The negative log-likelihood function is equal to
$$L(W) = -\sum_{i=1}^n \sum_{\ell=1}^k \mathbf{1}\{y_i = \ell\} \log\left( \frac{\exp(w^{(\ell)} \cdot x_i)}{\sum_{j=1}^k \exp(w^{(j)} \cdot x_i)} \right).$$
Define the $\mathrm{softmax}(\cdot)$ operator to be the function that takes in a vector $\theta \in \mathbb{R}^d$ and outputs a vector in $\mathbb{R}^d$ whose $i$th component is equal to $\frac{\exp(\theta_i)}{\sum_{j=1}^d \exp(\theta_j)}$. Clearly, this vector is nonnegative and sums to one. If for any $i$ we have $\theta_i \gg \max_{j \neq i} \theta_j$, then $\mathrm{softmax}(\theta)$ approximates $e_i$, a vector of all zeros with a one in the $i$th component. For each $y_i$ let $\mathbf{y}_i$ be the one-hot encoding of $y_i$ (i.e., $\mathbf{y}_i \in \{0, 1\}^k$ is a vector of all zeros aside from a 1 in the $y_i$th index).
a. [5 points] If $\widehat{y}_i^{(W)} = \mathrm{softmax}(W^\top x_i)$, show that $\nabla_W L(W) = -\sum_{i=1}^n x_i \big(\mathbf{y}_i - \widehat{y}_i^{(W)}\big)^\top$.
b. [5 points] Recall problem 6 of Homework 1 (Ridge Regression on MNIST) and define $J(W) = \frac{1}{2}\sum_{i=1}^n \|\mathbf{y}_i - W^\top x_i\|_2^2$. If $\widetilde{y}_i^{(W)} = W^\top x_i$, show that $\nabla_W J(W) = -\sum_{i=1}^n x_i \big(\mathbf{y}_i - \widetilde{y}_i^{(W)}\big)^\top$. Comparing the least squares linear regression gradient step of this part to the gradient step of minimizing the negative log-likelihood of the logistic model of part a may shed light on why we call this classification problem logistic regression.
c. [15 points] Using the original representations of the MNIST flattened images $x_i \in \mathbb{R}^d$ ($d = 28 \times 28 = 784$) and all $k = 10$ classes, implement gradient descent (or stochastic gradient descent) for both $J(W)$ and $L(W)$ and run until convergence on the training set of MNIST. For each of the two solutions, report the classification accuracy on the training and test sets using the most natural $\arg\max_j \, e_j^\top W^\top x_i$ classification rule.
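For part c, a rough gradient-descent sketch for $L(W)$ is given below. It uses the gradient from part a; the step size, iteration count, and zero initialization are illustrative choices, not values prescribed by the assignment (the analogous loop for $J(W)$ just swaps in the least squares gradient from part b).

```python
# Rough sketch of gradient descent on the multiclass negative log-likelihood L(W).
# X is (n, d), Y is the (n, k) one-hot label matrix. Step size, iteration count,
# and zero initialization are illustrative choices only.
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)       # subtract row-wise max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def fit_logistic(X, Y, eta=1e-5, iters=500):
    d, k = X.shape[1], Y.shape[1]
    W = np.zeros((d, k))
    for _ in range(iters):
        P = softmax_rows(X @ W)                # predicted probabilities, n x k
        grad = -X.T @ (Y - P)                  # gradient from part a, summed over the data
        W -= eta * grad
    return W

def accuracy(W, X, labels):
    return np.mean(np.argmax(X @ W, axis=1) == labels)   # the arg max classification rule
```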
Kernels
4. [5 points] Suppose that our inputs are one dimensional and that our feature map is infinite dimensional: $\phi(x)$ is a vector whose $i$th component is
$$\frac{1}{\sqrt{i!}}\, e^{-x^2/2}\, x^i$$
for all nonnegative integers $i$. (Thus, $\phi$ is an infinite dimensional vector.) Show that $K(x, x') = e^{-\frac{(x - x')^2}{2}}$ is a kernel function for this feature map, i.e.,
$$\phi(x) \cdot \phi(x') = e^{-\frac{(x - x')^2}{2}}.$$
Hint: Use the Taylor expansion of $e^z$. (This is the one dimensional version of the Gaussian (RBF) kernel.)
5. This problem will get you familiar with kernel ridge regression using the polynomial and RBF kernels. First let's generate some data. Let $n = 30$ and $f_*(x) = 4\sin(\pi x)\cos(6\pi x^2)$. For $i = 1, \dots, n$ let each $x_i$ be drawn uniformly at random on $[0, 1]$ and $y_i = f_*(x_i) + \epsilon_i$ where $\epsilon_i \sim \mathcal{N}(0, 1)$. For any function $f$, the true error is defined as
$$\mathcal{E}_{\mathrm{true}}(f) = \mathbb{E}_{XY}\big[(f(X) - Y)^2\big]$$
whereas the training error is defined as
$$\widehat{\mathcal{E}}_{\mathrm{train}}(f) = \frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2.$$
Using kernel ridge regression, construct a predictor
$$\widehat{\alpha} = \arg\min_{\alpha} \|K\alpha - y\|^2 + \lambda \alpha^\top K \alpha, \qquad \widehat{f}(x) = \sum_{i=1}^n \widehat{\alpha}_i \, k(x_i, x)$$
where $K_{i,j} = k(x_i, x_j)$ is a kernel evaluation and $\lambda$ is the regularization constant.
a. [10 points] Using leave-one-out cross validation, find a good $\lambda$ and hyperparameter settings for the following kernels (a sketch of such a search appears after part e below):
• $k_{\mathrm{poly}}(x, z) = (1 + x^\top z)^d$ where $d \in \mathbb{N}$ is a hyperparameter,
• $k_{\mathrm{rbf}}(x, z) = \exp(-\gamma \|x - z\|^2)$ where $\gamma > 0$ is a hyperparameter.$^1$
Report the values of $d$, $\gamma$, and the $\lambda$ values for both kernels.
b. [10 points] Let $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ be the functions learned using the hyperparameters you found in part a. For each function $\widehat{f} \in \{\widehat{f}_{\mathrm{poly}}, \widehat{f}_{\mathrm{rbf}}\}$, produce a single plot showing the original data $\{(x_i, y_i)\}_{i=1}^n$, the true $f(x)$, and $\widehat{f}(x)$ (i.e., define a fine grid on $[0, 1]$ to plot the functions).
c. [5 points] We wish to build bootstrap percentile confidence intervals for $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ for all $x \in [0, 1]$ from part b. Use the non-parametric bootstrap with $B = 300$ datasets to find 5% and 95% percentiles at each point $x$ on a fine grid over $[0, 1]$ (see Hastie, Tibshirani, Friedman Ch. 8.2 for a review). Specifically, for each dataset $b \in \{1, \dots, B\}$, draw uniformly at random with replacement $n$ samples from $\{(x_i, y_i)\}_{i=1}^n$, train an $\widehat{f}_b$ using the $b$th resampled dataset, and compute $\widehat{f}_b(x)$ for each $x$ in your fine grid; let the 5th percentile at point $x$ be the largest value $\nu$ such that $\frac{1}{B}\sum_{b=1}^B \mathbf{1}\{\widehat{f}_b(x) \leq \nu\} \leq .05$, and define the 95th percentile analogously. Plot the 5th and 95th percentile curves on the plots from part b.
d. [5 points] Repeat all parts of this problem with $n = 300$ (you may just use 10-fold CV instead of leave-one-out).
e. [5 points] For this problem, use the $\widehat{f}_{\mathrm{poly}}(x)$ and $\widehat{f}_{\mathrm{rbf}}(x)$ learned in part d. Suppose $m = 1000$ additional samples $(x'_1, y'_1), \dots, (x'_m, y'_m)$ are drawn i.i.d. the same way the first $n$ samples were drawn. Use the non-parametric bootstrap with $B = 300$ to construct a confidence interval on $\mathbb{E}\big[(Y - \widehat{f}_{\mathrm{poly}}(X))^2 - (Y - \widehat{f}_{\mathrm{rbf}}(X))^2\big]$ (i.e., randomly draw with replacement $m$ samples denoted $\{(\widetilde{x}'_i, \widetilde{y}'_i)\}_{i=1}^m$ from $\{(x'_i, y'_i)\}_{i=1}^m$ and compute $\frac{1}{m}\sum_{i=1}^m \big[(\widetilde{y}'_i - \widehat{f}_{\mathrm{poly}}(\widetilde{x}'_i))^2 - (\widetilde{y}'_i - \widehat{f}_{\mathrm{rbf}}(\widetilde{x}'_i))^2\big]$; repeat this $B$ times) and find 5% and 95% percentiles. Using this confidence interval, is there statistically significant evidence to suggest that one of $\widehat{f}_{\mathrm{rbf}}$ and $\widehat{f}_{\mathrm{poly}}$ is better than the other at predicting $Y$ from $X$? (Hint: does the confidence interval contain 0?)
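The leave-one-out search referenced in part a could look like the sketch below; the candidate grids for $\gamma$ and $\lambda$ are assumptions, and `fit_alpha`/`predict` are the helpers sketched earlier.

```python
# Sketch of the leave-one-out cross validation search from part a. The candidate
# grids are illustrative guesses; fit_alpha and predict are the helpers above.
import numpy as np

def loo_error(X, y, kernel, lam):
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                        # hold out the i-th point
        a = fit_alpha(X[mask], y[mask], kernel, lam)
        pred = predict(X[mask], a, kernel, X[i:i + 1])[0]
        errs.append((pred - y[i]) ** 2)
    return np.mean(errs)

# Example search over an assumed grid for the RBF kernel:
# best_gamma, best_lam = min(
#     ((g, lam) for g in np.logspace(0, 3, 10) for lam in np.logspace(-5, 1, 10)),
#     key=lambda p: loo_error(X, y, lambda x, z: np.exp(-p[0] * (x - z) ** 2), p[1]))
```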
Context: With binary labels like in the extra credit below, we can easily compute means and variances and appeal to the CLT to compute confidence intervals. However, in almost all other settings, computing nearly exact confidence intervals is impossible because there is information we simply do not have. The Bootstrap gives us the ability to construct confidence intervals on nearly any quantity of interest with minimal side knowledge.
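To make the bootstrap recipe of parts c and e concrete, here is one possible sketch of the pointwise percentile computation from part c; it assumes the hyperparameters are held fixed at the cross-validated values and reuses the earlier helpers.

```python
# Sketch of the nonparametric bootstrap from part c: resample with replacement,
# refit, evaluate on a fine grid, and take the 5th/95th empirical percentiles.
import numpy as np

def bootstrap_curves(X, y, kernel, lam, grid, B=300, seed=0):
    rng = np.random.default_rng(seed)
    curves = np.empty((B, len(grid)))
    for b in range(B):
        idx = rng.integers(0, len(X), size=len(X))          # n draws with replacement
        a = fit_alpha(X[idx], y[idx], kernel, lam)
        curves[b] = predict(X[idx], a, kernel, grid)        # f-hat_b on the fine grid
    return np.percentile(curves, 5, axis=0), np.percentile(curves, 95, axis=0)
```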
k-means clustering
6. Given a dataset $x_1, \dots, x_n \in \mathbb{R}^d$ and an integer $1 \leq k \leq n$, recall the following k-means objective function:
$$\min_{\pi_1, \dots, \pi_k} \sum_{i=1}^{k} \sum_{j \in \pi_i} \|x_j - \mu_i\|_2^2, \qquad \mu_i = \frac{1}{|\pi_i|} \sum_{j \in \pi_i} x_j. \tag{1}$$
Above, $\{\pi_i\}_{i=1}^k$ is a partition of $\{1, 2, \dots, n\}$. Finding a global minimizer of the objective (1) is NP-hard. Nevertheless, the commonly used heuristic which we discussed in lecture, known as Lloyd's algorithm, typically works well in practice. Implement Lloyd's algorithm for solving the k-means objective (1). Do not use any off-the-shelf implementations, such as those found in scikit-learn.
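A minimal sketch of the algorithm's alternating structure is shown below; the random initialization from the data points and the stopping test are illustrative choices, and the squared distances are computed via the expansion $\|x - \mu\|^2 = \|x\|^2 - 2x \cdot \mu + \|\mu\|^2$ so that only an $n \times k$ matrix is formed at MNIST scale.

```python
# Minimal sketch of Lloyd's algorithm: alternate nearest-center assignment with
# mean updates. Initialization and stopping rule are illustrative choices.
import numpy as np

def lloyds(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # start from k random data points
    objective = []
    for _ in range(iters):
        # squared distances via ||x||^2 - 2 x.mu + ||mu||^2 (keeps memory at n x k)
        d2 = (X**2).sum(1)[:, None] - 2 * X @ centers.T + (centers**2).sum(1)[None, :]
        assign = d2.argmin(axis=1)                          # nearest-center assignment
        objective.append(d2[np.arange(len(X)), assign].sum())  # objective (1) at this iteration
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign, objective
```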
a. [5 points] Run the algorithm on the training dataset of MNIST with $k = 10$, plotting the objective function (1) as a function of iteration. Visualize (and include in your report) the cluster centers as $28 \times 28$ images.
b. [5 points] For $k \in \{2, 5, 10, 20, 40, 80, 160, 320, 640, 1280\}$, run the algorithm on the training dataset to obtain centers $\{\mu_i\}_{i=1}^k$. If $\{(x_i, y_i)\}_{i=1}^n$ and $\{(x'_i, y'_i)\}_{i=1}^m$ denote the training and test sets, respectively, plot the training error $\frac{1}{n}\sum_{i=1}^n \min_{j=1,\dots,k} \|\mu_j - x_i\|_2^2$ and the test error $\frac{1}{m}\sum_{i=1}^m \min_{j=1,\dots,k} \|\mu_j - x'_i\|_2^2$ as a function of $k$ on the same plot. (If the large values of $k$ are taking unreasonably long to compute, just go up to the largest value of $k$ you can.)
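For part b, the plotted quantities could be computed with a small helper like the following, assuming `centers` comes from the sketch above.

```python
# Sketch of the part b error computation: mean squared distance from each point
# to its nearest learned center.
import numpy as np

def kmeans_error(X, centers):
    d2 = (X**2).sum(1)[:, None] - 2 * X @ centers.T + (centers**2).sum(1)[None, :]
    return d2.min(axis=1).mean()
```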
$^1$ Given a dataset $x_1, \dots, x_n \in \mathbb{R}^d$, a heuristic for choosing a range of $\gamma$ in the right ballpark is the inverse of the median of all squared pairwise distances $\|x_i - x_j\|_2^2$.