Download the data file, hard2d.txt. This file contains one matrix, D=[ZETA',X'], ZETA=[zeta_1,...,zeta_n], X=[x_1,...,x_n], where zeta_i in (-1,1) is a binary class label, and x_i in [[0,1],[-1,1]] is a 2D real vector drawn randomly from the right half of the signum square.

Load the dataset into matlab, for example, use
`D=load('hard2d.txt'); X=D(:,2:3)'; ZETA=D(:,1)';`.

The first thing you should do, with any dataset, is seek to
gain a little intuitive understanding of the data. With a
low-dimensional dataset like this one, you can get excellent
intuition by drawing a scatter plot. Do that. In matlab, this
is `plot(X(1,ZETA==1),X(2,ZETA==1),'r.'); hold on;
plot(X(1,ZETA==-1),X(2,ZETA==-1),'b.'); title('Scatter plot of the
hard2d dataset');` Alternatively you can get a plot by downloading this function: nnplot.m. Download it to your working directory, and type `help nnplot` to learn how it is used, or just open it up by typing `edit nnplot.m`.

Print your scatter plot as an image (`print -dpng
fig1.png`).

In this part of the problem you will create a two-layer neural net with 24 hidden nodes, with tanh activation functions on the outputs of the hidden nodes, and a tanh activation on the output node. You will try several different methods for training the network, in order to see how initial conditions can dramatically change the quality of the resulting classifier.

Write a function `[Z,Y]=nnclassify(X,U,V)` that takes as
input: (1) a matrix of observations, `X`, whose dimension
is p by n, (2) the first-level matrix of weights and
biases, `U`, whose dimension is q by (p+1) (where `q` is the number of hidden nodes, I recommend `q=24` but Amit used `q=12`, and `p=2` is the number of input dimensions), and (2) the
second level vector of weights and biases, `V`, whose
dimension is 1 by (q+1). It should compute, as output, the
axon values of the hidden and output layers (`Y`
and `Z`). For example, your function might include lines
of code that look
like `A=U*[ones(1,n);X];`, `Y=tanh(A)`, `B=V*[ones(1,n);Y];`,
and `Z=tanh(B);`

Look at the scatter plot (print it out, if you can).
Use `ginput` to find some points on the boundary between
the two classes---the slides have a few
pages of sample code that should help you do this. As shown in
the slides, you can check your lines by using code something
like `nnplot(X,ZETA,ZETA,'',1);for m=1:24,plot([0
1],-U(m,1)/U(m,3)+[0,-U(m,2)/U(m,3)]);end` --- this will
create the scatterplot, then overlay the line corresponding to
each row of the `U` matrix.

Set `V=[THRESHOLD,ones(1,q)]`, where `THRESHOLD`
is a number you choose so that the classifier gives the right
answer. If you have 6 lines bounding each region, then probably
you'll have `THRESHOLD` equal to either 5.5 or -5.5,
depending on which direction your `U` matrix came
out.

Run `nnclassify` to generate `Z`, and plot a
scatter plot showing the `Z` labeling of each token. Make
sure it looks the way you expected; if it doesn't, find out why.
Compute the error rate of your classifier
(`ER=sum(ZETA.*Z<0)/n;`), and list the error rate in
the title of your figure. Print a copy of the figure as an image.

In case you have trouble finishing this part of the lab, you can copy the following `U` and `V` matrices.
`U=[-20.0000, 1.8872, -20.0000; -20.0000, 6.2628,-17.2728; -20.0000, 9.3492, -14.2008; 10.0000, 1.8872, 20.0000; 5.9092, 6.2628,17.2728; 1.3012,9.3492, 14.2008; -10.0000, 1.8872,-20.0000; -11.3636, 6.2628, -17.2728; -12.8996, 9.3492, -14.2008; 0, 1.8872, 20.0000; -2.7272, 6.2628, 17.2728; -5.7992, 9.3492, 14.2008; 0, 1.8872, -20.0000; -2.7272, 6.2628, -17.2728; -5.7992, 9.3492,-14.2008; -10.0000, 1.8872, 20.0000; -11.3636, 6.2628, 17.2728; -12.8996, 9.3492, 14.2008; 10.0000, 1.8872,-20.0000; 5.9092, 6.2628, -17.2728; 1.3012, 9.3492, -14.2008; -20.0000, 1.8872, 20.0000; -20.0000, 6.2628, 17.2728; -20.0000, 9.3492, 14.2008];`, and `V=[-5.5,-ones(1,24)];`

Create a function `[EPSILON,DELTA]=nnbackprop(X,Y,Z,ZETA,V)`
calculating the labeling mean-squared error (MSE) back-propagated
to the second and first layers of the network, respectively; thus
EPSILON is a 1xn vector, and DELTA is a qxn matrix. The
function `backprop` should probably contain lines
of code that look something
like `EPSILON=2*(1-Z.^2).*(Z-ZETA);`
and `DELTA=(1-Y.^2).*(V(:,2:(q+1))'*EPSILON);`

Create a
function `[dU,dV]=nngradient(X,Y,DELTA,EPSILON);` that
finds the gradient of MSE with respect to U and V. Your function
will probably include lines like `dU=(1/n)*DELTA*[ones(1,n);X]';`
and `dV=(1/n)*EPSILON*[ones(1,n);Y]`.

Create a function `[U,V]=nndescent(X,ZETA,U0,V0,ETA,T)`
that performs `T` iterations of gradient descent, with a
learning rate of `ETA`, from a starting point
of `U0,V0`.

Start with the knowledge-based neural net of the previous subsection. Perform 1000 iterations of gradient descent, with a learning rate of 0.1. You should wind up with an error rate similar to last time, but slightly better; what is it? Create a scatter plot of the new labeling, and put the error rate in the title.

Discard your previous weight matrices. Instead, randomly initialize the weight matrices, then perform 1000 iterations of gradient descent with a learning rate of 0.002. What is the error rate now? What is the scatter plot? Try running gradient descent for another 1000 iterations, and another. Do the results get better?

Use simulated annealing to find the global optimum.

I used Gaussianized annealing with 5000 iterations. Gaussian simulated annealing works like this: compute a step in the direction of the gradient descent, using your back-propagation code from II.C. Compute the training error rate at this initial location. This is your initial error. The initial location is the beginning state of a Markov chain. Now start SA with some initial "hot" temperature. This notion of temperature could simply be the iteration number. Then add a Gaussian random component to the weight vector keeping all the other components of the weight vector fixed. Compute the training error rate at the new location. If the random location is better than the initial location, count this is an acceptance of a Markov chain state transition and go to that state (location). If it's worse, then you can still count this as an acceptance with probability P=exp(-(error_difference)*log(t+1)/ridge), where ridge is the difference between the maximum possible versus the minimum possible error rate that could be achieved on your training dataset, and t is the iteration number. Repeat this process of adding a Gaussian random component to a different component each time until you have swept through all the components of the weight vector (i.e (P+1)*Q + (Q+1)). Sweeping through all the components once constitutes one trial. To establish stationarity, you will need to perform multiple trials at the same temperature. The number of trials required to establish stationarity is heuristic in nature and could be based on some pre-determined values for maximum number of acceptances or rejections or simply the maximum number of trials. Once you have established stationarity, you can decrease the temperature by a small amount. Now, start all over again as you did for the previous temperature - i.e., adding Gaussian random components and then establishing stationarity. It is expected that if you continue doing this for each decreasing temperature, then with high probability you can find the global minimum after establishing stationarity at some minimum temperature. You can set the minimum temperature heuristically. For example, it could be the last iteration number. The final weight vector is the optimal weight vector determined by simulated annealing.

Every time you encounter a weight matrix that is better than its predecessor and successor, you should append that weight matrix to the back of auxiliary rank-3 tensor, cell array, or file, and store its associated cost in an auxiliary cost vector; you should also (automatically) output the very last weight matrix as an option.

Discretized simulated annealing is random in a discrete state
space, rather than a continuous state space. The basic idea is to
discretize the space of all possible weight vectors (e.g., by
quantizing every weight into bins of width 0.1). Compute a
gradient update using back-propagation. Then move to a different
grid in the weight space, uniformly distributed over all possible
grid locations. Compute the training error in the old location
and the new location. If the new location has lower error, go
there; if not, go there anyway, with probability
`P=exp(-(error_difference)*log(t+1)/ridge)`. Every time
you find a local optimum (better than the iteration that preceded
it, and better than the iteration that follows it), save the
weight vector.

Create a scatter plot of the labeling function achieved in this way, and compute its error rate. If you run simulated annealing long enough, it should actually beat the knowledge-based initialization; but that might take a very, very long time.

In this part of the problem you will create a Gaussian mixture model with 4-5 Gaussians/class. You will try several different methods for training the GMM, in order to see how initial conditions can dramatically change the quality of the resulting classifier.

Use `edit gmmclassify.m` to open a new matlab function
file, and define its content with a function declaration on the
first line of the form `function
[Z,G]=gmmclassify(X,C,MU,SIG2)`. Add some comments (lines
starting with `%`) specifying that this file computes the
label corresponding to each observation. The observation
matrix, `X`, is NxP, where N is the number of observations,
and P is the dimension of each observation. The matrix of mixture
weights, `C`, is Kx2, where 2 is the number of classes
(zeta=+1 vs. zeta=-1), and K is the number of Gaussians. The
matrix of means, `MU`, is KxP. The matrix of
variances, `SIG2`, is also KxP: each row stores the
variances corresponding to one Gaussian. Your code might include
a line something
like `G(n,k)=exp(-sum((X(n,:)-MU(k,:)).^2./SIG2(k,:))-0.5*sum(log(SIG2(k,:))));`,
and another line something like `classprob=G*C;`, and
another line something
like `Z(classprob(:,2)>classprob(:,1))=-1;`.

Estimate the means and variances of four Gaussians in class
one, and five Gaussians in class two, such that the boundary
between the classes matches the boundary you see in the scatter
plot (remember that variance is the square of standard
deviation!). Test your hypothesis by classifying the data.
Create a scatter plot showing how the GMM labels the data, and
fix any large mistakes: this scatter plot should roughly
comparable to the correct scatter plot, but might not be exactly
the same. Compute the error rate of your classifier
(`errorrate=sum(ZETA~=Z)/length(Z);`), and list the error rate
in the title of your figure (using `title(sprintf('GMM w/KB
Params, Error=%g',errorrate));`). Print a copy of the figure
as an image, and hand it in with your report.

Create a matlab
function `[C,MU,SIG2]=gmmtrain(X,ZETA,C0,MU0,SIG20);` that
accepts labeled data and initial parameter estimates, and performs
one iteration of EM (expectation maximization) in order to ``improve''
the parameter estimates.

The first thing your code should do is to compute the posterior
probability of the k'th Gaussian given the n'th observation and the
zeta'th class, for each possible combination of n, k, and zeta. This
might be done by calling `gmmclassify` to compute the Gaussian
probability densities, and then computing the posterior probability
as `GAMMA(n,k,zeta)=G(n,k)*C0(k,zeta)/(G(n,:)*C0(:,zeta));`.
You will also need to know the ``expected number of times'' that the
k'th Gaussian was chosen: this
is `NU(k,zeta)=sum(GAMMA(:,k,zeta));`. You will also need to
know the sum of the gamma-probabilities summed across class labels,
e.g., `GAMMASUM(n,k)=sum(GAMMA(n,k,:));`. This part of the
code is called the ``E-step,'' because it calculates the expected
number of times that the k'th Gaussian was chosen.

The next thing your code should do is find new values of the
mixture weights `C`, the means `MU`, and the
variances `SIG2`. This is called the ``M-step,'' because
the new parameters are chosen in order to maximize the
likelihood (likelihood = prob(data|parameters)). The maximum
likelihood mixture weight is just the number of times a
component was chosen, divided by the number of times it could
have been chosen,
thus `C(k,zeta)=NU(k,zeta)/sum(NU(:,zeta));`. Notice
that this formula has a special property that is characteristic
of the EM algorithm: if `C0(k,zeta)==0` for any
particular class zeta and Gaussian k, then it will also be true
that `NU(k,zeta)==0`, and
therefore `C(k,zeta)==0`. This property is extremely
useful: it means that you can divide the Gaussians between two
classes (for example, setting `C0(1:4,2)=0`
and `C0(5:8,1)=0`), and they will stay divided, even
after re-training.

The
new mean vector is a weighted average of the observation vectors, thus
`MU(k,:)=(GAMMASUM(:,k)'*X)/sum(GAMMASUM(:,k));`. The new
sigma vector is computed exactly the same way, but first you need to
compute the squared deviation of each observation dimension from the
mean of each Gaussian,
thus `SQDEV(n,:,k)=(X(n,:)-MU(k,:)).^2;`. The variance is then
the weighted average
deviation:
`SIG2(k,:)=(GAMMASUM(:,k)'*SQDEV(:,:,k))/sum(GAMMASUM(:,k));`

It's important to make sure that the variance never gets too
close to zero. The EM algorithm won't take care of this
automatically for you; you need to do it yourself. For example,
after you've computed `SIG2`, you can lowerbound it with
something like `SIG2=max(1e-4,SIG2);`

Run the `gmmtrain` function once, using your
knowledge-based parameters from II.A as an initial condition.
Oops. What happened? Create a scatter plot, showing the
labeling function computed by the trained GMM parameters.
Take a look at the new mean vectors. Can you tell what
happened?

If your code is working correctly, you should find that all of the mean vectors have moved out toward the center of the dataset, and that in doing so, they have really screwed up the classification boundaries. The EM algorithm estimates GMM parameters in order to maximize the likelihood of the observations given the labels. But maximum likelihood is a really terrible training criterion for this dataset. The best classification accuracy is achieved by mean vectors that are out toward the edges of the dataset (as you should have guessed when you initialized the parameters in part III.A), but the maximum likelihood mean vectors are closer to the center of the dataset (in order to better summarize all of the data).

Devise a training algorithm for the GMM that explicitly minimizes classification error (MCE training). For example, you could estimate the error rate as the average of E=average(y*tanh(a*LLR)), where "y" is the true class label, "a" is a slope hyperparameter, and LLR=log(p(x|class 1)/p(x|class -1)). Can you find the gradient of E with respect to the variances of the Gaussians? Then try writing a gradient descent algorithm that explicitly minimizes the error rate in the training set.

This time, if your code is working, you should find that the error rate on the training set goes down.