Training Autoencoders of Different Configurations and Activation Functions

Autoencoders are unsupervised learning models that can be used for data compression and for pre-training neural networks. The problem uses a positional (one-hot) encoding scheme, so the dataset [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] is represented by:

[[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]]

The input layer has 16 nodes, and so does the output layer. Training is done by feeding these 16 data points to the model and minimizing the loss. The loss is computed with cross-entropy rather than mean squared error. I did some research on which one is better for training neural networks. It turns out that cross-entropy is the better cost function when the output layer has a sigmoid or softmax non-linearity, which is the case here; on the other hand, if the target is continuous and normally distributed, MSE is the better choice. In this case, we want to drive the output node values to either 1.0 or 0.0 depending on the target values. With MSE, the weight adjustment factor (the gradient) contains a term y * (1 - y), where y is the computed output. As the output gets closer and closer to 0.0 or 1.0, the value of y * (1 - y) shrinks rapidly. With cross-entropy there is no y * (1 - y) factor, so the per-epoch training changes do not shrink as rapidly as with MSE. So I used cross-entropy to compute the loss.
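To make the difference concrete, here is a small illustrative sketch (my own, not taken from the project code) comparing the output-layer gradients of the two losses for a single sigmoid node:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the loss w.r.t. the pre-activation z of one sigmoid output node,
# with target t = 1.0. The MSE gradient carries the extra y*(1-y) factor.
t = 1.0
for z in [0.0, 2.0, 4.0, 6.0]:
    y = sigmoid(z)                       # output creeps toward 1.0 as z grows
    grad_mse = (y - t) * y * (1.0 - y)   # MSE: shrinks rapidly near 0 or 1
    grad_ce = (y - t)                    # cross-entropy: no y*(1-y) factor
    print("y=%.4f  MSE grad=%+.6f  CE grad=%+.6f" % (y, grad_mse, grad_ce))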

I used Keras to do the job. Keras is a high-level deep learning library built on top of TensorFlow. The optimizer used is RMSProp. The Root Mean Square Propagation optimizer adapts the learning rate for each parameter: it divides the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight. The running average is calculated by:

v_t = ρ · v_{t-1} + (1 - ρ) · g_t²    (with ρ = 0.9 and g_t the current gradient for that weight)

The updating rule is:

w_{t+1} = w_t - η · g_t / √(v_t + ε)    (with learning rate η and a small constant ε for numerical stability)
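As a sanity check of the two formulas, a toy single-step RMSProp update might look like the sketch below (the weight and gradient values are made up; this is not Keras's internal implementation):

import numpy as np

lr, rho, eps = 0.001, 0.9, 1e-8      # same hyperparameters used in the runs below
w = np.array([0.5, -0.3])            # hypothetical weights
avg_sq_grad = np.zeros_like(w)       # running average v_t of squared gradients

def rmsprop_step(w, g, avg_sq_grad):
    # v_t = rho * v_{t-1} + (1 - rho) * g_t^2
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g ** 2
    # w_{t+1} = w_t - lr * g_t / sqrt(v_t + eps)
    w = w - lr * g / np.sqrt(avg_sq_grad + eps)
    return w, avg_sq_grad

g = np.array([0.2, -0.1])            # hypothetical gradient for this step
w, avg_sq_grad = rmsprop_step(w, g, avg_sq_grad)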

RMSProp is a variant of the SGD method and can work with mini-batches as well as full batches. Aside from the optimizer, I set the convergence criterion as "the loss changes by less than 1e-7 in one epoch" and verified it with several test runs. When the criterion is met, the loss is normally in the range 1e-7 to 1e-4, which is good enough to see a fine reconstruction of the input. The strict convergence criterion provides a better reconstruction in my case.
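In Keras, this stopping rule can be approximated with the EarlyStopping callback; the snippet below is a hedged sketch of that idea (the exact parameter choices are my assumption, not necessarily the code I ran):

from keras.callbacks import EarlyStopping

# Stop when the training loss improves by less than 1e-7 from one epoch to the next.
stop_when_converged = EarlyStopping(monitor='loss', min_delta=1e-7, patience=1, verbose=1)
# Used later as: model.fit(data, data, epochs=200000, batch_size=16, callbacks=[stop_when_converged])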

Configuration 1: 3-Layer (16-5-16) Autoencoder with the Sigmoid Activation Function

(Code for all 3-layer autoencoders can be found on my GitHub; a minimal sketch is included below.)
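The sketch builds the 16-5-16 sigmoid autoencoder with all weights initialised to a constant, as in the runs that follow; it is an approximation under these assumptions, not a copy of the GitHub code (the epoch count in the commented fit call is only an upper bound):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers, initializers

data = np.eye(16)                                 # the 16 one-hot input vectors

init = initializers.Constant(value=1.0)           # e.g. run 1: all weights start at 1.0
model = Sequential([
    Dense(5, input_dim=16, activation='sigmoid', kernel_initializer=init),   # encoder: 16 -> 5
    Dense(16, activation='sigmoid', kernel_initializer=init),                # decoder: 5 -> 16
])

rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
model.compile(optimizer=rms, loss='binary_crossentropy')
# model.fit(data, data, epochs=40000, batch_size=16, verbose=0, callbacks=[stop_when_converged])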

Run 1: Set all initial weights to the constant 1.0 and train the autoencoder until it converges at loss = 1.4400e-06 after 29875 epochs. The optimizer is rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0), i.e. the learning rate is 0.001.

Run 2: Set all initial weights to the constant 0.8 and train the autoencoder until it converges at loss = 4.6263e-06 after 28180 epochs. The optimizer and learning rate are the same as in run 1.

Run 3: Set all initial weights to the constant 0.6 and train the autoencoder until it converges at loss = 2.6801e-06 after 31546 epochs. The optimizer and learning rate are the same as in run 1.

Run 4: Set all initial weights to the constant 0.4 and train the autoencoder until it converges at loss = 1.2335e-05 after 30254 epochs. The optimizer and learning rate are the same as in run 1.

Run 5: Set all initial weights to the constant 0.2 and train the autoencoder until it converges at loss = 1.1106e-04 after 34902 epochs. The optimizer and learning rate are the same as in run 1.

Discussion:

Comparing the stable hidden-layer states from the five runs, we see binary-like activations on the hidden nodes regardless of the initial weights, but the number of epochs needed to reach a converged result (stable states on the hidden layer) varies with the initialization. This can be pictured as a search on a 'map': the target is somewhere on the map, we start searching from a starting point, and each epoch moves us roughly toward the target, so the number of steps needed to reach the target, or to get close enough to it, is largely determined by the distance between the starting point and the target point. We cannot expect exact 0/1 states, because the autoencoder knows nothing about the definition of a binary code; the hidden values are shaped by the sigmoid activation function, so a 1 in the stable states does not map directly to a 1 in a binary code. Also, the ordering of the columns is arbitrary, since there is nothing special about which hidden perceptron we choose to call 'Perceptron 1'. After rearranging the columns, the results of the five runs can be brought into the common form below:

Input   Perceptron 1   Perceptron 2   Perceptron 3   Perceptron 4   Perceptron 5
0       1.0            1.0            1.0            0.0            0.0
1       1.0            1.0            0.5            0.0            1.0
2       0.5            0.5            0.0            0.0            0.5
3       1.0            1.0            1.0            1.0            1.0
4       0.0            0.0            0.0            0.0            1.0
5       0.0            0.0            0.0            1.0            1.0
6       0.5            0.5            0.0            1.0            0.0
7       0.0            0.0            0.5            0.5            0.0
8       0.0            0.0            1.0            1.0            1.0
9       0.0            0.0            1.0            0.0            0.0
10      0.5            0.5            1.0            0.0            1.0
11      1.0            1.0            0.0            0.5            0.5
12      1.0            1.0            1.0            1.0            0.0
13      0.5            0.5            0.5            0.0            0.0
14      0.5            0.5            0.0            1.0            0.5
15      0.0            0.0            1.0            1.0            0.0

It would not be right to force the 0.5 values to 1 by brute force, because the hidden code is not a binary code. Since we are compressing the values 0-15 into 5 nodes, we should not expect a decimal-to-binary mapping: that would require one node to stay at 0 for every input, i.e. to be effectively ignored, which the training has no reason to produce. When the final loss is in the range 1e-7 to 1e-4, the reconstructions are very close to the original input data, and the training can be regarded as successful. Although the results from different initial weights tend to converge toward each other, the remaining differences should not be ignored. They arise because different initial weights approach the convergence point from different 'directions' on the search map, and we do not stop at the exact target point but as soon as we are close enough, so the runs end up near the same region without landing on the same point.

Comparing the final losses obtained with different initial weights:

Initial weights   1.0        0.8        0.6        0.4        0.2
Final loss        1.44e-06   4.63e-06   2.68e-06   1.23e-05   1.11e-04

Configuration 2: 3-Layer (16-4-16) Autoencoder with the Sigmoid Activation Function

Run 1: Set all initial weights to the constant 1.0 and train the autoencoder until it converges at loss = 1.1562e-04 after 27430 epochs. The optimizer is rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0), i.e. the learning rate is 0.001.

Run 2: Set all initial weights to the constant 0.8 and train the autoencoder until it converges at loss = 2.4305e-04 after 24000 epochs. The optimizer and learning rate are the same as in run 1.

Run 3: Set all initial weights to the constant 0.6 and train the autoencoder until it converges at loss = 1.9732e-04 after 20542 epochs. The optimizer and learning rate are the same as in run 1.

Run 4: Set all initial weights to the constant 0.4 and train the autoencoder until it converges at loss = 2.7663e-06 after 47222 epochs. The optimizer and learning rate are the same as in run 1.

Run 5: Set all initial weights to the constant 0.2 and train the autoencoder until it converges at loss = 2.1054e-04 after 35420 epochs. The optimizer and learning rate are the same as in run 1.

Discussion:

This time we should expect a more binary-like outcome, because the decimal numbers 0-15 can be encoded as 4-digit binary numbers. The outcomes of the five runs can be rearranged and converted into:

Input   Perceptron 1   Perceptron 2   Perceptron 3   Perceptron 4
0       1              0.5            0              0
1       1              0              1              1
2       1              0              1              0
3       0              1              1              0
4       0              0              0              0
5       0              1              0              0
6       0              1              1              1
7       1              1              1              1
8       0              0              0.5            1
9       0              0              0              0.5
10      1              0              0              1
11      1              0              1              0
12      0              0              1              0
13      1              1              1              0
14      0              0.5            0              1
15      0.5            1              0              0.5
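For comparison, the ideal 4-bit binary codes that a purely "binary" 16-4-16 bottleneck would have to match can be listed with a short snippet (illustrative only, not part of the project code):

# Ideal 4-bit binary codes for 0-15, for comparison with the learned hidden states above.
for n in range(16):
    print(n, format(n, '04b'))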

I still could not get the binary encoding. However, after the same rearranging and converting process, the codes of the 4-perceptron hidden layer are more binary-like than those of the 5-perceptron architecture. This is because a 4-perceptron bottleneck is closer to the binary-encoding expectation than the 5-perceptron model. Comparing the final loss of the five runs:

Initial weights   1.0        0.8        0.6        0.4        0.2
Final loss        1.16e-04   2.43e-04   1.97e-04   2.77e-06   2.11e-04

Although the final losses this time are all larger than for the 5-perceptron model, we still get a satisfying reconstruction of the input data, because the loss is still small enough, which means the convergence points we stopped at were close enough to the exact target.

Configuration 3: 3-Layer (16-3-16) Autoencoder with the Sigmoid Activation Function

Run 1: Set all initial weights to the constant 1.0 and train the autoencoder until it converges at loss = 5.3932e-05. The optimizer is rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0), i.e. the learning rate is 0.001.

Run 2: Set all initial weights to the constant 0.8 and train the autoencoder until it converges at loss = 2.7949e-05 after 79374 epochs. The optimizer and learning rate are the same as in run 1.

Run 3: Set all initial weights to the constant 0.6 and train the autoencoder until it converges at loss = 5.0423e-06 after 98412 epochs. The optimizer and learning rate are the same as in run 1.

Run 4: Set all initial weights to the constant 0.4 and train the autoencoder until it converges at loss = 6.2787e-04 after 50558 epochs. The optimizer and learning rate are the same as in run 1.

Run 5: Set all initial weights to the constant 0.2 and train the autoencoder until it converges at loss = 3.2740e-04 after 81542 epochs. The optimizer and learning rate are the same as in run 1.

Discussion:

Although we can still see some similar patterns, I could not find a pattern shared by all five results with the same rearranging and converting. Compressing 16 numbers into 3 hidden nodes demands more compression than converting decimal numbers to binary codes (which needs 4 bits), which makes finding a binary-like pattern impossible. The number of training epochs needed to reach convergence is much higher than in configurations 1 and 2: even though we pick the same sets of initial weights, the distances between these starting points and the target points differ between configurations. The initial weights used for configuration 1 are closer to its convergence point than the same initial weights are to the convergence point of configuration 3. Imagine moving on a map: if we start farther from the target, we need more time and effort on the way. Comparing the final loss:

Initial weights   1.0        0.8        0.6        0.4        0.2
Final loss        5.39e-05   2.79e-05   5.04e-06   6.28e-04   3.27e-04

We can see that the reconstructions of all five runs are good enough because the final loss is small. We get different losses after convergence because different initial weights approach the convergence point from different 'directions' if we imagine the search space as a map.

Configurations 4 & 5: 5-Layer (16-8-4-8-16) Autoencoder with the Sigmoid/ReLU Activation Function

Code for the 5-layer autoencoders with the sigmoid and ReLU activation functions can be found on my GitHub; a minimal sketch follows.
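The sketch parameterises the 16-8-4-8-16 architecture by activation so the same builder covers both configurations; it is an approximation of the setup described here, not the exact GitHub code:

from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers, initializers

def build_deep_autoencoder(activation='sigmoid', init_value=1.0):
    """16-8-4-8-16 autoencoder; activation is 'sigmoid' or 'relu'."""
    init = initializers.Constant(value=init_value)
    model = Sequential([
        Dense(8, input_dim=16, activation=activation, kernel_initializer=init),  # encoder 16 -> 8
        Dense(4, activation=activation, kernel_initializer=init),                # bottleneck 8 -> 4
        Dense(8, activation=activation, kernel_initializer=init),                # decoder 4 -> 8
        Dense(16, activation=activation, kernel_initializer=init),               # reconstruction 8 -> 16
    ])
    rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)
    model.compile(optimizer=rms, loss='binary_crossentropy')
    return model

# sigmoid_model = build_deep_autoencoder('sigmoid', 1.0)
# relu_model = build_deep_autoencoder('relu', 1.0)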

Run 1: Set all initial weights to the constant 1.0 and train the autoencoder. The optimizer is rms = optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0), i.e. the learning rate is 0.001.

Run 2: Set all initial weights to the constant 0.8 and train the autoencoder. The optimizer and learning rate are the same as in run 1.

Run 3: Set all initial weights to the constant 0.6 and train the autoencoder. The optimizer and learning rate are the same as in run 1.

Run 4: Set all initial weights to the constant 0.4 and train the autoencoder. The optimizer and learning rate are the same as in run 1.

Run 5: Set all initial weights to the constant 0.2 and train the autoencoder. The optimizer and learning rate are the same as in run 1.

Discussion:

First I compared the run time, the number of epochs, and the final loss for the sigmoid and ReLU versions:

        Sigmoid                                     ReLU
Run     Run time (min)   Epochs      Final loss     Run time (min)   Epochs      Final loss
1       122              732556      4.5e-6         50               310650      0.521
2       168              1008624     3.2e-6         564              3565044     0.0531
3       201              1206874     1.2e-6         75               489225      0.589
4       189              1136268     1.3e-6         23               151547      0.602
5       153              950130      1.2e-7         34               210834      0.541

Using the sigmoid function, we get a successful reconstruction, and the time and number of epochs depend on the 'distance' from the 'starting point' (initial weights) to the 'target point' (optimal weights). The stable states reported above are compressed representations of the numbers 0-15. They are encoded by two layers, the first with 8 perceptrons and the second with 4 perceptrons, so after encoding the numbers 0-15 are compressed into 4 values. The data is then decoded and reconstructed through two more layers. We can tell that the reconstruction is successful by looking at the reconstructed data, or at the final loss.

Using the ReLU function, we cannot reconstruct the input data successfully, and the run time is either much longer or much shorter than with the sigmoid function. This implies that using ReLU for every perceptron on every layer makes the training process unstable. Since we are using a gradient-descent method, the all-sigmoid network faces the vanishing gradient problem: each weight receives an update proportional to the gradient of the error function with respect to that weight, and backpropagation computes gradients by the chain rule, which multiplies n small factors to obtain the gradients of the 'front' layers of an n-layer network. The gradient (error signal) therefore decreases exponentially with n, and the front layers train very slowly. ReLU ranges from 0 to positive infinity and is non-saturating; its derivative is 1 for positive inputs, so the error signal is not attenuated as it propagates backward. ReLU can therefore be used for perceptrons on hidden layers, but it should not be used on the output (reconstruction) layer. Any negative pre-activation at an output node is wiped to 0 by ReLU, which loses information and distorts the error. The distorted error is then backpropagated, distorting the learning process: when the error is skewed toward zero, the weights are not changed enough in each epoch and cannot approach the target weights. For example, if the current (wrong) weights produce a pre-activation of -0.1 where the target is 0, ReLU maps -0.1 to 0, and it looks as if there is no error at all; with no error, no learning rate, however large, will change those wrong weights. As a result, we cannot get successful reconstructions using ReLU everywhere, and the learning process is unstable.
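If the goal were simply a working ReLU-based autoencoder, one remedy consistent with this analysis would be ReLU on the hidden layers and a sigmoid on the reconstruction layer; this is my own sketch, not one of the configurations tested above:

from keras.models import Sequential
from keras.layers import Dense

# ReLU hidden layers, sigmoid output layer: negative pre-activations at the output
# are squashed smoothly toward 0 instead of being clipped to exactly 0.
mixed = Sequential([
    Dense(8, input_dim=16, activation='relu'),
    Dense(4, activation='relu'),
    Dense(8, activation='relu'),
    Dense(16, activation='sigmoid'),   # reconstruction layer keeps the sigmoid
])
mixed.compile(optimizer='rmsprop', loss='binary_crossentropy')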

(Featured Image Source: https://www.researchgate.net/figure/269272188_fig1_Fig-1-The-architecture-of-basic-Sparse-Autoencoder-SAE-for-nuclei-classification)
