Automatic image processing is a key component of many AI systems, including facial recognition and video compression. One basic processing method is segmentation, in which we divide an image into a fixed number of components in order to simplify its representation. For example, we can train a mixture of Gaussians to represent an image and segment it according to the simplified representation, as shown in the images below.
In this assignment, you will learn to perform image segmentation. To this end, you will implement Gaussian mixture models and iteratively improve their performance. You will perform this segmentation on the "Bird" and "Party Spock" images included with the assignment.
The tests for the assignment are provided in the notebook, so Bonnie is only for submission purposes. The tests on Bonnie will be similar to the ones provided here, but the images and the values used in the calculations will be different.
Thus, you will be allowed only 5 submissions on Bonnie. Make sure you test everything before submitting. The score of your last submission counts. The code will be allowed to run for no more than 2 hours per submission. To keep the code running quickly, make sure to vectorize it (more on this below).
Your assignment is to implement several methods of image segmentation, with increasing complexity.
The grade you receive for the assignment will be distributed as follows:
The assignment is due April 2nd, 2017 at 11:59PM UTC-12 (Anywhere on Earth time). The deliverable for this assignment is a completed mixture_models.py file.
The em.pdf chapter in the assignment folder gives a good explanation of implementing and training mixture models, particularly pages 424-427 (k-means) and 435-439 (mixture models and EM). See also the book Elements of Statistical Learning, pages 291-295.
A Gaussian mixture model is a generative model for representing the underlying probability distribution of a complex collection of data, such as the collection of pixels in a grayscale photograph.
In the context of this problem, a Gaussian mixture model defines the joint probability f(x) as
$$f(x) = \sum_{i=1}^{k} m_i N_i(x \mid \mu_i, \sigma_i^2)$$
where $x$ is a grayscale value in $[0, 1]$, $f(x)$ is the joint probability of that grayscale value, $m_i$ is the mixing coefficient of component $i$, and $N_i$ is the $i$-th Gaussian distribution underlying the value $x$, with mean $\mu_i$ and variance $\sigma_i^2$.
We will be using this model to segment photographs into different grayscale regions. The idea of segmentation is to assign a component i to each pixel x using the maximum posterior probability
$$\text{component}_x = \arg\max_i \left( m_i N_i(x \mid \mu_i, \sigma_i^2) \right)$$
Then we will replace each pixel in the image with its corresponding μi to produce a result as below (original above, segmented with three components below).
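For concreteness, here is a minimal numpy sketch of these two formulas for a single grayscale value. The mixing coefficients, means, and variances below are purely illustrative, and scipy.stats.norm is used only for brevity; the assignment itself works with log probabilities.

import numpy as np
from scipy.stats import norm

# Hypothetical parameters for a 3-component mixture (illustration only).
mixing = np.array([0.3, 0.5, 0.2])         # m_i
means = np.array([0.1, 0.5, 0.9])          # mu_i
variances = np.array([0.01, 0.02, 0.01])   # sigma_i^2

x = 0.47  # a single grayscale value in [0, 1]

# f(x) = sum_i m_i * N(x | mu_i, sigma_i^2)
weighted_densities = mixing * norm.pdf(x, loc=means, scale=np.sqrt(variances))
f_x = weighted_densities.sum()

# component_x = argmax_i m_i * N(x | mu_i, sigma_i^2)
component_x = np.argmax(weighted_densities)
segmented_value = means[component_x]  # the pixel is replaced by its component mean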
from __future__ import division
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)
import numpy as np
import scipy as sp
from matplotlib import image
import matplotlib.pyplot as plt
import matplotlib.cm as cm
"""Helper image-processing code. These have been added in a separate python file and added in to the repo. The functions below have all been imported in to your submission file""" def image_to_matrix(image_file, grays=False): """ Convert .png image to matrix of values. params: image_file = str grays = Boolean returns: img = (color) np.ndarray[np.ndarray[np.ndarray[float]]] or (grayscale) np.ndarray[np.ndarray[float]] """ img = image.imread(image_file) # in case of transparency values if(len(img.shape) == 3 and img.shape[2] > 3): height, width, depth = img.shape new_img = np.zeros([height, width, 3]) for r in range(height): for c in range(width): new_img[r,c,:] = img[r,c,0:3] img = np.copy(new_img) if(grays and len(img.shape) == 3): height, width = img.shape[0:2] new_img = np.zeros([height, width]) for r in range(height): for c in range(width): new_img[r,c] = img[r,c,0] img = new_img return img def matrix_to_image(image_matrix, image_file): """ Convert matrix of color/grayscale values to .png image and save to file. params: image_matrix = (color) numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] or (grayscale) numpy.ndarray[numpy.ndarray[float]] image_file = str """ # provide cmap to grayscale images cMap = None if(len(image_matrix.shape) < 3): cMap = cm.Greys_r image.imsave(image_file, image_matrix, cmap=cMap) def flatten_image_matrix(image_matrix): """ Flatten image matrix from Height by Width by Depth to (Height*Width) by Depth matrix. params: image_matrix = (color) numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] or (grayscale) numpy.ndarray[numpy.ndarray[float]] returns: flattened_values = (color) numpy.ndarray[numpy.ndarray[float]] or (grayscale) numpy.ndarray[float] """ if(len(image_matrix.shape) == 3): height, width, depth = image_matrix.shape else: height, width = image_matrix.shape depth = 1 return image_matrix.reshape(height*width, depth) def unflatten_image_matrix(image_matrix, width): """ Unflatten image matrix from (Height*Width) by Depth to Height by Width by Depth matrix. params: image_matrix = (color) numpy.ndarray[numpy.ndarray[float]] or (grayscale) numpy.ndarray[float] width = int returns: unflattened_values = (color) numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] or (grayscale) numpy.ndarray[numpy.ndarray[float]] """ heightWidth = image_matrix.shape[0] height = int(heightWidth / width) if(len(image_matrix.shape) > 1 and image_matrix.shape[-1] != 1): depth = image_matrix.shape[-1] return image_matrix.reshape(height, width, depth) else: return image_matrix.reshape(height, width) def image_difference(image_values_1, image_values_2): """ Calculate the total difference in values between two images. Assumes that both images have same shape. params: image_values_1 = (color) numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] or (grayscale) numpy.ndarray[numpy.ndarray[float]] image_values_2 = (color) numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]] or (grayscale) numpy.ndarray[numpy.ndarray[float]] returns: dist = int """ flat_vals_1 = flatten_image_matrix(image_values_1) flat_vals_2 = flatten_image_matrix(image_values_2) N, depth = flat_vals_1.shape dist = 0. point_thresh = 0.005 for i in range(N): if(depth > 1): new_dist = sum(abs(flat_vals_1[i] - flat_vals_2[i])) if(new_dist > depth * point_thresh): dist += new_dist else: new_dist = abs(flat_vals_1[i] - flat_vals_2[i]) if(new_dist > point_thresh): dist += new_dist return dist
image_dir = 'images/'
image_file = 'party_spock.png'
values = image_to_matrix(image_dir + image_file)
print(values)
The concept of Vectorization was introduced in the last section of Assignment 4. For this assignment, please vectorize your code wherever possible using numpy arrays, instead of running for-loops over the images being processed.
For an example of how this might be useful, consider the following array:
A = [12, 34, 1234, 764, ... (a million values) ..., 91, 78]
Now you need to calculate another array B, which has the same dimensions as A above. Say each value in B is calculated as follows:
(each value in B) = square_root_of(pi * log(k) * (corresponding value in A)) / 7, where pi and log(k) are just constants
You might wish to use a for-loop to compute this. However, a loop will take a very long time to run on an array of this magnitude.
Alternatively, you can use numpy and perform the calculation in a single line: pass A as a numpy array, and the entire formula is applied to every element at once, producing B directly.
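As a rough sketch of the difference (treating the constants loosely, since the formula above is only illustrative):

import numpy as np

A = np.random.rand(10**6)  # stand-in for the million-value array
k = 42                     # an arbitrary constant for this illustration

# Loop version: touches each element in Python, so it is slow
B_loop = np.zeros_like(A)
for i in range(len(A)):
    B_loop[i] = np.sqrt(np.pi * np.log(k) * A[i]) / 7

# Vectorized version: the whole array in one line
B_vec = np.sqrt(np.pi * np.log(k) * A) / 7

assert np.allclose(B_loop, B_vec)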
20 pts
One easy method for image segmentation is to simply cluster all similar data points together and then replace their values with the mean value. Thus, we'll warm up using k-means clustering. This will also provide a baseline to compare with your segmentation. Please note that clustering will come in handy later.
Fill out k_means_cluster()
to convert the original image values matrix to its clustered counterpart. Your convergence test should be whether the assigned clusters stop changing. Note that this convergence test is rather slow. When no initial cluster means are provided, k_means_cluster()
should choose k random points from the data (without replacement) to use as initial cluster means.
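For the initialization requirement, one possible sketch (not the required implementation, and assuming the pixels have already been flattened with the provided flatten_image_matrix() helper) is to sample k distinct pixel indices:

import numpy as np

def random_initial_means(flat_values, k):
    # flat_values: (num_pixels, depth) array of pixel values
    # Pick k distinct pixels (without replacement) and use them as the initial means.
    indices = np.random.choice(flat_values.shape[0], size=k, replace=False)
    return flat_values[indices]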
For this part of the assignment, since clustering is best used on multidimensional data, we will be using the color image bird_color_24.png
.
You can test your implementation of k-means using our reference images in k_means_test()
.
from random import randint
from functools import reduce

def k_means_cluster(image_values, k=3, initial_means=None):
    """
    Separate the provided RGB values into k separate clusters using the
    k-means algorithm, then return an updated version of the image with
    the original values replaced with the corresponding cluster values.

    params:
    image_values = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]]
    k = int
    initial_means = numpy.ndarray[numpy.ndarray[float]] or None

    returns:
    updated_image_values = numpy.ndarray[numpy.ndarray[numpy.ndarray[float]]]
    """
    # TODO: finish this function
    raise NotImplementedError()
    return updated_image_values
def k_means_test(): """ Testing your implementation of k-means on the segmented bird_color_24 reference images. """ k_min = 2 k_max = 6 image_dir = 'images/' image_name = 'bird_color_24.png' image_values = image_to_matrix(image_dir + image_name) # initial mean for each k value initial_means = [ np.array([[0.90980393,0.8392157,0.65098041],[0.83137256,0.80784315,0.69411767]]), np.array([[0.90980393,0.8392157,0.65098041],[0.83137256,0.80784315,0.69411767],[0.67450982,0.52941179,0.25490198]]), np.array([[0.90980393,0.8392157,0.65098041],[0.83137256,0.80784315,0.69411767],[0.67450982,0.52941179,0.25490198],[0.86666667,0.8392157,0.70588237]]), np.array([[0.90980393,0.8392157,0.65098041],[0.83137256,0.80784315,0.69411767],[0.67450982,0.52941179,0.25490198],[0.86666667,0.8392157,0.70588237],[0,0,0]]), np.array([[0.90980393,0.8392157,0.65098041],[0.83137256,0.80784315,0.69411767],[0.67450982,0.52941179,0.25490198],[0.86666667,0.8392157,0.70588237],[0,0,0],[0.8392157,0.80392158,0.63921571]]), ] # test different k values to find best for k in range(k_min, k_max+1): updated_values = k_means_cluster(image_values, k, initial_means[k-k_min]) ref_image = image_dir + 'k%d_%s'%(k, image_name) ref_values = image_to_matrix(ref_image) dist = image_difference(updated_values, ref_values) print('Image distance = %.2f'%(dist)) if(int(dist) == 0): print('Clustering for %d clusters produced a realistic image segmentation.'%(k)) else: print('Clustering for %d clusters didn\'t produce a realistic image segmentation.'%(k))
k_means_test()
40 pts
Next, we will step beyond clustering and implement a complete Gaussian mixture model.
Complete the below implementation of GaussianMixtureModel
so that it can perform the following:
GaussianMixtureModel.default_convergence()
: if the new likelihood is within 10% of the previous likelihood for 10 consecutive iterations, the model has converged. We have provided the necessary tests for this part.
The log form of the Gaussian probability of scalar value x is:
$$\ln\left(N(x \mid \mu, \sigma)\right) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}$$
where $\mu$ is the mean and $\sigma$ is the standard deviation.
You can calculate the sum of log probabilities by using scipy.misc.logsumexp()
. For example, logsumexp([-2,-3])
will return the same result as numpy.log(numpy.exp(-2)+numpy.exp(-3))
.
In other words, logsumexp([a, b]) = log(e^a + e^b)
.
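A quick sketch of both ideas (the input values here are arbitrary examples; note that newer scipy versions expose logsumexp from scipy.special instead of scipy.misc):

import numpy as np
from scipy.misc import logsumexp  # scipy.special.logsumexp in newer scipy releases

def log_gaussian(x, mu, sigma_sq):
    # ln N(x | mu, sigma^2) = -0.5 * ln(2*pi*sigma^2) - (x - mu)^2 / (2*sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma_sq) - (x - mu) ** 2 / (2 * sigma_sq)

# logsumexp computes log(sum(exp(...))) in a numerically stable way
print(logsumexp([-2, -3]))              # approximately -1.6867
print(np.log(np.exp(-2) + np.exp(-3)))  # same value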
numpy.array
arrays. You can instantiate them using the command
matrix = numpy.zeros([rows, columns])
where rows
is the number of rows and columns
is the number of columns in your matrix. numpy.zeros()
generates a matrix of the specified size containing 0s at each row/column cell. You can access cells with the syntax matrix[2,3]
which returns the value at row index 2 and column index 3 (indices are 0-based).
For the assignment, focus on vectorizing the following:
Remember, these are fundamental operations that will be called many times in the remainder of the assignment, so it is crucial that you optimize them.
For the synthetic-data test we provide to check whether your training works, the dataset is too small for vectorization to make a noticeable difference. For the actual image used later, however, for-loops won't cut it. Vectorized code should converge in under 30 seconds, typically taking about 15-20 iterations with the convergence function provided here. Inefficient code that loops over each pixel value sequentially will take hours to run. You don't want to do that because:
def default_convergence(prev_likelihood, new_likelihood, conv_ctr, conv_ctr_cap=10):
    """
    Default condition for increasing convergence counter:
    new likelihood deviates less than 10% from previous likelihood.

    params:
    prev_likelihood = float
    new_likelihood = float
    conv_ctr = int
    conv_ctr_cap = int

    returns:
    conv_ctr = int
    converged = boolean
    """
    increase_convergence_ctr = (abs(prev_likelihood) * 0.9 <
                                abs(new_likelihood) <
                                abs(prev_likelihood) * 1.1)

    if increase_convergence_ctr:
        conv_ctr += 1
    else:
        conv_ctr = 0

    return conv_ctr, conv_ctr > conv_ctr_cap
from random import randint
import math
from scipy.misc import logsumexp

class GaussianMixtureModel:
    """
    A Gaussian mixture model to represent a provided grayscale image.
    """

    def __init__(self, image_matrix, num_components, means=None):
        """
        Initialize a Gaussian mixture model.

        params:
        image_matrix = (grayscale) numpy.ndarray[numpy.ndarray[float]]
        num_components = int
        """
        self.image_matrix = image_matrix
        self.num_components = num_components
        if(means is None):
            self.means = [0]*num_components
        else:
            self.means = means
        self.variances = [0]*num_components
        self.mixing_coefficients = [0]*num_components

    def joint_prob(self, val):
        """Calculate the joint log probability of a grayscale value
        within the image.

        params:
        val = float

        returns:
        joint_prob = float
        """
        # TODO: finish this
        raise NotImplementedError()
        return joint_prob

    def initialize_training(self):
        """
        Initialize the training process by setting each component mean
        to a random pixel's value (without replacement), each component
        variance to 1, and each component mixing coefficient to a
        uniform value (e.g. 4 components -> [0.25, 0.25, 0.25, 0.25]).

        NOTE: this should be called before train_model() in order for
        tests to execute correctly.
        """
        # TODO: finish this
        raise NotImplementedError()

    def train_model(self, convergence_function=default_convergence):
        """
        Train the mixture model using the expectation-maximization
        algorithm. Since each Gaussian is a combination of mean and
        variance, this will fill self.means and self.variances, plus
        self.mixing_coefficients, with the values that maximize the
        overall model likelihood.

        params:
        convergence_function = function that returns True if convergence is reached
        """
        # TODO: finish this
        raise NotImplementedError()

    def segment(self):
        """
        Using the trained model, segment the image matrix into the
        pre-specified number of components. Returns the original image
        matrix with each pixel's intensity replaced with its
        max-likelihood component mean.

        returns:
        segment = numpy.ndarray[numpy.ndarray[float]]
        """
        # TODO: finish this
        raise NotImplementedError()
        return segment

    def likelihood(self):
        """Assign a log likelihood to the trained model based on the
        following formula for posterior probability:

        ln(Pr(X | mixing, mean, stdev)) =
            sum((n=1 to N), ln(sum((k=1 to K), mixing_k * N(x_n | mean_k, stdev_k))))

        returns:
        log_likelihood = float
        """
        # TODO: finish this
        raise NotImplementedError()
        return log_likelihood

    def best_segment(self, iters):
        """Determine the best segmentation of the image by repeatedly
        training the model and calculating its likelihood. Return the
        segment with the highest likelihood.

        params:
        iters = int

        returns:
        segment = numpy.ndarray[numpy.ndarray[float]]
        """
        # finish this
        raise NotImplementedError()
        return segment
def gmm_likelihood_test():
    """Testing the GMM method for calculating the overall model
    probability. Should return -364370.

    returns:
    likelihood = float
    """
    image_file = 'images/party_spock.png'
    image_matrix = image_to_matrix(image_file)
    num_components = 5
    gmm = GaussianMixtureModel(image_matrix, num_components)
    gmm.initialize_training()
    gmm.means = [0.4627451, 0.10196079, 0.027450981, 0.011764706, 0.1254902]
    likelihood = gmm.likelihood()
    return likelihood
gmm_likelihood_test()
def gmm_joint_prob_test():
    """Testing the GMM method for calculating the joint log probability
    of a given point. Should return -0.98196.

    returns:
    joint_prob = float
    """
    image_file = 'images/party_spock.png'
    image_matrix = image_to_matrix(image_file)
    num_components = 5
    gmm = GaussianMixtureModel(image_matrix, num_components)
    gmm.initialize_training()
    gmm.means = [0.4627451, 0.10196079, 0.027450981, 0.011764706, 0.1254902]
    test_val = 0.4627451
    joint_prob = gmm.joint_prob(0.4627451)
    return joint_prob
gmm_joint_prob_test()
def generate_test_mixture(data_size, means, variances, mixing_coefficients): """ Generate synthetic test data for a GMM based on fixed means, variances and mixing coefficients. params: data_size = (int) means = [float] variances = [float] mixing_coefficients = [float] returns: data = np.array[float] """ data = np.zeros(data_size).flatten() indices = np.random.choice( len(means), len(data), p=mixing_coefficients) for i in range(len(indices)): data[i] = np.random.normal(means[indices[i]], variances[indices[i]]) return np.array([data])
def gmm_train_test(): """Test the training procedure for GMM using synthetic data. returns: gmm = GaussianMixtureModel """ print( 'Synthetic example with 2 means') num_components = 2 data_range = (1,1000) actual_means = [2, 4] actual_variances = [1]*num_components actual_mixing = [.5]*num_components dataset_1 = generate_test_mixture(data_range, actual_means, actual_variances, actual_mixing) gmm = GaussianMixtureModel(dataset_1, num_components) gmm.initialize_training() # start off with faulty means gmm.means = [1,3] initial_likelihood = gmm.likelihood() gmm.train_model() final_likelihood = gmm.likelihood() likelihood_difference = final_likelihood - initial_likelihood likelihood_thresh = 250 if(likelihood_difference >= likelihood_thresh): print('Congrats! Your model\'s log likelihood improved by at least %d.'%(likelihood_thresh)) print( 'Synthetic example with 4 means:') num_components = 4 actual_means = [2,4,6,8] actual_variances = [1]*num_components actual_mixing = [.25]*num_components dataset_1 = generate_test_mixture(data_range, actual_means, actual_variances, actual_mixing) gmm = GaussianMixtureModel(dataset_1, num_components) gmm.initialize_training() # start off with faulty means gmm.means = [1,3,5,9] initial_likelihood = gmm.likelihood() gmm.train_model() final_likelihood = gmm.likelihood() # compare likelihoods likelihood_difference = final_likelihood - initial_likelihood likelihood_thresh = 200 if(likelihood_difference >= likelihood_thresh): print('Congrats! Your model\'s log likelihood improved by at least %d.'%(likelihood_thresh)) return gmm
gmm_train_test()
def gmm_segment_test(): """ Apply the trained GMM to unsegmented image and generate a segmented image. returns: segmented_matrix = numpy.ndarray[numpy.ndarray[float]] """ image_file = 'images/party_spock.png' image_matrix = image_to_matrix(image_file) num_components = 3 gmm = GaussianMixtureModel(image_matrix, num_components) gmm.initialize_training() gmm.train_model() segment = gmm.segment() segment_num_components = len(np.unique(segment)) if(segment_num_components == num_components): print('Congrats! Your segmentation produced an image '+ 'with the correct number of components.') return segment
def gmm_best_segment_test(): """ Calculate the best segment generated by the GMM and compare the subsequent likelihood of a reference segmentation. Note: this test will take a while to run. returns: best_seg = np.ndarray[np.ndarray[float]] """ image_file = 'images/party_spock.png' image_matrix = image_to_matrix(image_file) image_matrix_flat = flatten_image_matrix(image_matrix) num_components = 3 gmm = GaussianMixtureModel(image_matrix, num_components) gmm.initialize_training() iters = 10 # generate best segment from 10 iterations # and extract its likelihood best_seg = gmm.best_segment(iters) matrix_to_image(best_seg, 'images/best_segment_spock.png') best_likelihood = gmm.likelihood() # extract likelihood from reference image ref_image_file = 'images/party_spock%d_baseline.png'%(num_components) ref_image = image_to_matrix(ref_image_file, grays=True) gmm_ref = GaussianMixtureModel(ref_image, num_components) ref_vals = ref_image.flatten() ref_means = list(set(ref_vals)) ref_variances = [0]*num_components ref_mixing = [0]*num_components for i in range(num_components): relevant_vals = ref_vals[ref_vals==ref_means[i]] ref_mixing[i] = float(len(relevant_vals)) / float(len(ref_vals)) ref_variances[i] = np.mean((image_matrix_flat[ref_vals==ref_means[i]] - ref_means[i])**2) gmm_ref.means = ref_means gmm_ref.variances = ref_variances gmm_ref.mixing_coefficients = ref_mixing ref_likelihood = gmm_ref.likelihood() # compare best likelihood and reference likelihood likelihood_diff = best_likelihood - ref_likelihood likelihood_thresh = 1e4 if(likelihood_diff >= likelihood_thresh): print('Congrats! Your image segmentation is an improvement over ' + 'the baseline by at least %.2f.'%(likelihood_thresh)) return best_seg
best_segment = gmm_best_segment_test()
matrix_to_image(best_segment, 'best_segment.png')
20 points
We'll now experiment with a few methods for improving GMM performance.
12.5 points
To run EM in our baseline Gaussian mixture model, we use random initialization to determine the initial values for our component means. We can do better than this!
Fill in the below GaussianMixtureModelImproved.initialize_training()
with an improvement in component initialization. Please don't use any external packages for anything other than basic calculations (e.g. scipy.misc.logsumexp
). Note that your improvement might significantly slow down runtime, although we don't expect you to spend more than 10 minutes on initialization.
Hint: you'll probably want an unsupervised learning method to initialize your component means. Clustering is one useful example of unsupervised learning, and you may want to look at 1-dimensional methods such as Jenks natural breaks optimization.
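For intuition only, one possible shape for such an initialization is a simple 1-D k-means pass over the grayscale pixel values. This is a hypothetical sketch of one approach, not the required solution; Jenks natural breaks or another clustering method would work just as well.

import numpy as np

def kmeans_1d_centers(values, k, iters=20):
    # values: 1-D array of flattened grayscale pixel values; returns k cluster centers.
    centers = np.random.choice(values, size=k, replace=False)
    for _ in range(iters):
        # Assign each value to its nearest center, then move each center to its cluster mean.
        assignments = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = values[assignments == j]
            if members.size > 0:
                centers[j] = members.mean()
    return centers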
class GaussianMixtureModelImproved(GaussianMixtureModel):
    """A Gaussian mixture model for a provided grayscale image,
    with improved training performance."""

    def initialize_training(self):
        """
        Initialize the training process by setting each component mean
        using some algorithm that you think might give better means to
        start with, each component variance to 1, and each component
        mixing coefficient to a uniform value
        (e.g. 4 components -> [0.25, 0.25, 0.25, 0.25]).
        [You can feel free to modify the variance and mixing coefficient
        initializations too if that works well.]
        """
        # TODO: finish this
        raise NotImplementedError()
def gmm_improvement_test(): """ Tests whether the new mixture model is actually an improvement over the previous one: if the new model has a higher likelihood than the previous model for the provided initial means. returns: original_segment = numpy.ndarray[numpy.ndarray[float]] improved_segment = numpy.ndarray[numpy.ndarray[float]] """ image_file = 'images/party_spock.png' image_matrix = image_to_matrix(image_file) num_components = 3 initial_means = [0.4627451, 0.20392157, 0.36078432] # first train original model with fixed means gmm = GaussianMixtureModel(image_matrix, num_components) gmm.initialize_training() gmm.means = np.copy(initial_means) gmm.train_model() original_segment = gmm.segment() original_likelihood = gmm.likelihood() # then train improved model gmm_improved = GaussianMixtureModelImproved(image_matrix, num_components) gmm_improved.initialize_training() gmm_improved.train_model() improved_segment = gmm_improved.segment() improved_likelihood = gmm_improved.likelihood() # then calculate likelihood difference diff_thresh = 1e3 likelihood_diff = improved_likelihood - original_likelihood if(likelihood_diff >= diff_thresh): print('Congrats! Improved model scores a likelihood that was at ' + 'least %d higher than the original model.'%(diff_thresh)) return original_segment, improved_segment
best_segment, best_segment_improved = gmm_improvement_test()
matrix_to_image(best_segment, 'best_segment_original.png')
matrix_to_image(best_segment_improved, 'best_segment_improved.png')
7.5 points
You might be skeptical of the convergence criterion we've provided in default_convergence()
. To test out another convergence condition, implement new_convergence_condition()
to return true if all the new model parameters (means, variances, and mixing coefficients) are within 10% of the previous variables for 10 consecutive iterations. This will mean re-implementing train_model()
, which you will also do below in GaussianMixtureModelConvergence
.
You can compare the two convergence functions in convergence_condition_test()
.
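The elementwise "within 10%" check can mirror the provided default_convergence() logic. One vectorized way to express it, assuming each variable group is packed as a numpy array, is sketched below:

import numpy as np

def within_ten_percent(prev, new):
    # True when every entry of `new` lies within 10% of the corresponding entry of `prev`.
    prev, new = np.abs(np.asarray(prev)), np.abs(np.asarray(new))
    return np.all((prev * 0.9 < new) & (new < prev * 1.1))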
def new_convergence_function(previous_variables, new_variables, conv_ctr, conv_ctr_cap=10):
    """
    Convergence function based on parameters: when all variables vary by
    less than 10% from the previous iteration's variables, increase the
    convergence counter.

    params:
    previous_variables = [numpy.ndarray[float]] containing [means, variances, mixing_coefficients]
    new_variables = [numpy.ndarray[float]] containing [means, variances, mixing_coefficients]
    conv_ctr = int
    conv_ctr_cap = int

    return:
    conv_ctr = int
    converged = boolean
    """
    # TODO: finish this function
    raise NotImplementedError()
    return conv_ctr, converged
class GaussianMixtureModelConvergence(GaussianMixtureModel):
    """
    Class to test the new convergence function in the same GMM model as before.
    """

    def train_model(self, convergence_function=new_convergence_function):
        # TODO: finish this function
        raise NotImplementedError()
def convergence_condition_test(): """ Compare the performance of the default convergence function with the new convergence function. return: default_convergence_likelihood = float new_convergence_likelihood = float """ image_file = 'images/party_spock.png' image_matrix = image_to_matrix(image_file) num_components = 3 initial_means = [0.4627451, 0.10196079, 0.027450981] # first test original model gmm = GaussianMixtureModel(image_matrix, num_components) gmm.initialize_training() gmm.means = np.copy(initial_means) gmm.train_model() default_convergence_likelihood = gmm.likelihood() # now test new convergence model gmm_new = GaussianMixtureModelConvergence(image_matrix, num_components) gmm_new.initialize_training() gmm_new.means = np.copy(initial_means) gmm_new.train_model() new_convergence_likelihood = gmm_new.likelihood() # test convergence difference convergence_diff = new_convergence_likelihood - default_convergence_likelihood convergence_thresh = 200 if(convergence_diff >= convergence_thresh): print('Congrats! The likelihood difference between the original ' + 'and the new convergence models should be at least %.2f'%(convergence_thresh)) return default_convergence_likelihood, new_convergence_likelihood
20 points
In our previous solutions, our only criterion for choosing a model was whether it maximizes the posterior likelihood regardless of how many parameters this requires. As a result, the "best" model may simply be the model with the most parameters, which would be overfit to the training data.
To avoid overfitting, we can use the Bayesian information criterion (BIC), which penalizes models based on the number of parameters they use. In the case of the Gaussian mixture model, the parameter count is the number of components times the number of variables per component (mean, variance, and mixing coefficient), i.e. 3 * (number of components).
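For reference, the standard form of the criterion is BIC = ln(n) * k - 2 * ln(L), where n is the number of data points (pixels), k is the number of free parameters, and ln(L) is the model's log likelihood. A minimal sketch under that convention is shown below; your graded bayes_info_criterion() should of course match whatever the provided test expects.

import numpy as np

def bic_sketch(log_likelihood, num_components, num_points):
    # BIC = ln(n) * k - 2 * ln(L), with k = 3 parameters per component
    num_params = 3 * num_components
    return np.log(num_points) * num_params - 2 * log_likelihood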
5 points
Implement bayes_info_criterion()
to calculate the BIC of a trained GaussianMixtureModel
.
def bayes_info_criterion(gmm):
    # TODO: finish this function
    raise NotImplementedError()
    return BIC
def bayes_info_test(): """ Test for your implementation of BIC on fixed GMM values. Should be about 727045. returns: BIC = float """ image_file = 'images/party_spock.png' image_matrix = image_to_matrix(image_file) num_components = 3 initial_means = [0.4627451, 0.10196079, 0.027450981] gmm = GaussianMixtureModel(image_matrix, num_components) gmm.initialize_training() gmm.means = np.copy(initial_means) BIC = bayes_info_criterion(gmm) return BIC
15 points
Now implement BIC_likelihood_model_test()
, in which you will use the BIC and likelihood to determine the optimal number of components in the Party Spock image. Use the original GaussianMixtureModel
for your models. Iterate from k=2 to k=7 and use the provided means to train a model that minimizes its BIC and a model that maximizes its likelihood.
Then, fill out BIC_likelihood_question()
to return the number of components in both the min-BIC and the max-likelihood model.
def BIC_likelihood_model_test():
    """Test to compare the models with the lowest BIC
    and the highest likelihood.

    returns:
    min_BIC_model = GaussianMixtureModel
    max_likelihood_model = GaussianMixtureModel
    """
    # TODO: finish this method
    raise NotImplementedError()
    comp_means = [
        [0.023529412, 0.1254902],
        [0.023529412, 0.1254902, 0.20392157],
        [0.023529412, 0.1254902, 0.20392157, 0.36078432],
        [0.023529412, 0.1254902, 0.20392157, 0.36078432, 0.59215689],
        [0.023529412, 0.1254902, 0.20392157, 0.36078432, 0.59215689, 0.71372563],
        [0.023529412, 0.1254902, 0.20392157, 0.36078432, 0.59215689, 0.71372563, 0.964706]
    ]
    return min_BIC_model, max_likelihood_model
def BIC_likelihood_question():
    """
    Choose the best number of components for each metric
    (min BIC and maximum likelihood).

    returns:
    pairs = dict
    """
    # TODO: fill in bic and likelihood
    raise NotImplementedError()
    bic = 0
    likelihood = 0
    pairs = {
        'BIC': bic,
        'likelihood': likelihood
    }
    return pairs
2 points
A crucial part of machine learning is working with very large datasets. As stated before, using for-loops over these datasets can make the code take many hours, or even several days, to run. Even vectorized code can be slow if written carelessly, so there are certain tricks you can use to make your code run as fast as possible.
For this part of the assignment, you will need to implement part of a k-Means algorithm. You are given two arrays - points_array
with X n-dimensional points, and means_array
with Y n-dimensional points. You will need to return an X x Y array containing the distances from each point in points_array
to each point in means_array
.
If your implementation returns the correct answer in time comparable to Murtaza's implementation, you will receive 2 bonus points.
For reference, the data used is on the order of thousands of points and hundreds of means, and Bonnie automatically kills a grading script that uses more than 500MB of memory. Please test locally before submitting, as you may lose a submission to an inefficient solution; a memory-hungry implementation is very likely to run out of memory on the grader.
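As a generic illustration of broadcasting (not necessarily the fastest or most memory-efficient approach; the (X, Y, n) intermediate array below can itself be large, and the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b avoids materializing it):

import numpy as np

def pairwise_distances_sketch(points_array, means_array):
    # points_array: (X, n), means_array: (Y, n) -> (X, Y) Euclidean distances
    diffs = points_array[:, None, :] - means_array[None, :, :]  # (X, Y, n) via broadcasting
    return np.sqrt((diffs ** 2).sum(axis=2))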
def bonus(points_array, means_array):
    """
    Return the distance from every point in points_array
    to every point in means_array.

    returns:
    dists = numpy array of float
    """
    # TODO: fill in the bonus function
    # REMOVE THE LINE BELOW IF ATTEMPTING BONUS
    raise NotImplementedError()
    return dists
def bonus_test():
    points = np.array([[ 0.9059608,0.67550357,0.13525533],[ 0.23656114,0.63624466,0.3606615 ],[ 0.91163215,0.24431103,0.33318504],[ 0.25209736,0.24600123,0.42392935],[ 0.62799146,0.04520208,0.55232494],[ 0.5588561, 0.06397713,0.53465371],[ 0.82530045,0.62811624,0.79672349],[ 0.50048147,0.13215356,0.54517893],[ 0.84725662,0.71085917,0.61111105],[ 0.25236734,0.25951904,0.70239158]])
    means = np.array([[ 0.39874413,0.47440682,0.86140829],[ 0.05671347,0.26599323,0.33577454],[ 0.7969679, 0.44920099,0.37978416],[ 0.45428452,0.51414022,0.21209852],[ 0.7112214, 0.94906158,0.25496493]])
    expected_answer = np.array([[ 0.90829883,0.9639127, 0.35055193,0.48575144,0.35649377],[ 0.55067427,0.41237201,0.59110637,0.29048911,0.57821151],[ 0.77137409,0.8551975, 0.23937264,0.54464354,0.73685561],[ 0.51484192,0.21528078,0.58320052,0.39705222,0.85652654],[ 0.57645778,0.64961631,0.47067874,0.60483973,0.95515036],[ 0.54850426,0.57663736,0.47862222,0.56358129,0.94064631],[ 0.45799673,0.966609,0.45458971,0.70173336,0.63993928],[ 0.47695785,0.50861901,0.46451987,0.50891112,0.89217387],[ 0.56543953,0.94798437,0.35285421,0.59357932,0.4495398 ],[ 0.30477736,0.41560848,0.66079087,0.58820896,0.94138546]])
    if np.allclose(expected_answer, bonus(points, means), 1e-7):
        print('You returned the correct distances.')
    else:
        print('Your distance calculation is incorrect.')

bonus_test()
You're done with the requirements! Hope you have completed the functions in the mixture_models.py file and tested everything!