Autoencoder for numerical data

New learner

Mar 1, 2017, 12:32:12 AM

to Keras-users

I am still new to deep learning and autoencoders. To learn how autoencoders work, I am trying to apply one to a numerical data set instead of a pixel data set. I adapted the code from https://blog.keras.io/building-autoencoders-in-keras.html for numerical data.

My data set has 507 rows and 14 columns.

My code is:


from keras.layers import Input, Dense, Reshape
from keras.models import Model
import matplotlib.pyplot as plt

# this is the size of our encoded representations
encoding_dim = 14  # same size as the 14 input features, so there is no compression here

# this is our input placeholder
input_img = Input(shape=(14,))
# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation='relu')(input_img)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(14, activation='sigmoid')(encoded)

# this model maps an input to its reconstruction
autoencoder = Model(input=input_img, output=decoded)

# this model maps an input to its encoded representation
encoder = Model(input=input_img, output=encoded)

encoded_input = Input(shape=(encoding_dim,))
# retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# create the decoder model
decoder = Model(input=encoded_input, output=decoder_layer(encoded_input))

autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

#from keras.datasets import mnist
import numpy as np
#(x_train, _), (x_test, _) = mnist.load_data()

import pandas as pd
#split into train and test sets
data = np.genfromtxt('<file directory>/boston_house_prices.csv',delimiter = ',')

train_size = int(len(data)*0.60)
test_size = len(data)-train_size
x_train, x_test = data[0:train_size,:], data[train_size:len(data),:]

#normalize the value:
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

## we ignore the flatten value needed only when pixel data
###x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
###x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))

autoencoder.fit(x_train, x_train,
                nb_epoch=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

# encode and decode some rows
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)

# display reconstruction
plt.plot(decoded_imgs[:,:])
plt.show() 

# display original
plt.plot(x_test[:,:])
plt.show()

1) Why are my original data (first graph) and my reconstructed data (second graph) completely different? Could anyone please help me find my mistake?

As I understand it, the decoded result should be the same as the input. I have also attached the data that I used for this case.

boston_house_prices.csv

Auto Generated Inline Image 1

Auto Generated Inline Image 2

amw5...@gmail.com

Mar 7, 2017, 1:53:44 PM

to Keras-users

An autoencoder, as in the example you cite, takes a set of input signals, learns a compressed version of them (encoding), and then tries to reconstruct the original signal from that compressed representation (decoding). It's like reducing a massive building to a set of blueprints, then using those blueprints to construct the building somewhere else. That's not a perfect analogy, but I'm tired. I typically see it used for images, sound, or other multivariate data. I'm not sure why you'd apply it to the Boston housing data, which is usually used as an example data set for learning to predict a single continuous outcome variable.
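To make the encode/decode idea concrete, here is a minimal sketch in plain NumPy rather than Keras. It is not the code from this thread: the data is synthetic (14 features generated from 3 hidden factors), and the names `w_enc`, `w_dec`, `reconstruct`, and `mse` are made up for illustration. It trains a purely linear autoencoder with gradient descent and compares the reconstruction error before and after training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 rows, 14 features driven by 3 hidden factors.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 14))
x = latent @ mixing
x = (x - x.mean(axis=0)) / x.std(axis=0)  # standardize each feature

n_features, encoding_dim = x.shape[1], 3
w_enc = rng.normal(scale=0.1, size=(n_features, encoding_dim))  # encoder weights
w_dec = rng.normal(scale=0.1, size=(encoding_dim, n_features))  # decoder weights


def reconstruct(x, w_enc, w_dec):
    """Encode to `encoding_dim` values per row, then decode back."""
    code = x @ w_enc
    return code @ w_dec


def mse(a, b):
    """Mean squared error over all entries."""
    return float(np.mean((a - b) ** 2))


initial_error = mse(x, reconstruct(x, w_enc, w_dec))

learning_rate = 0.05
for _ in range(3000):
    code = x @ w_enc
    x_hat = code @ w_dec
    grad_out = 2.0 * (x_hat - x) / x.size      # d(mse) / d(x_hat)
    grad_dec = code.T @ grad_out               # d(mse) / d(w_dec)
    grad_enc = x.T @ (grad_out @ w_dec.T)      # d(mse) / d(w_enc)
    w_dec -= learning_rate * grad_dec
    w_enc -= learning_rate * grad_enc

final_error = mse(x, reconstruct(x, w_enc, w_dec))
print("reconstruction MSE before training:", initial_error)
print("reconstruction MSE after training: ", final_error)
```

The compression here comes from `encoding_dim` being smaller than the number of features. With `encoding_dim` equal to the input width, as in the code above in this thread, the network is not forced to compress anything.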

The reason your graphs are so different is that the autoencoder you built is doing a terrible job. Your input is 14 numeric variables, each divided by 255, and that's what your autoencoder is trying to recreate. Note that your code plots the reconstruction first and the original second, so the first graph is the reconstruction and the second is the original signal after you divided all the values by 255. I bet the graphs would look a little better if you changed your activation functions to "linear" instead of relu and sigmoid.
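One way to see why the divide-by-255 step and the sigmoid output do not fit well together: a sigmoid only produces values strictly between 0 and 1, so any target above 1 can never be matched. The sketch below uses a few made-up rows with columns on very different scales (these are not the real housing values) and compares pixel-style scaling with per-column min-max scaling. The min-max step is my assumption about one reasonable alternative, not something proposed in this thread.

```python
import numpy as np

# Hypothetical rows with columns on very different scales (not the real data).
data = np.array([
    [0.02, 18.0, 296.0, 24.0],
    [0.03,  0.0, 242.0, 21.6],
    [8.98, 12.5, 666.0, 50.0],
    [0.61, 80.0, 187.0,  5.0],
], dtype="float32")

# Pixel-style scaling from the image tutorial: only sensible for values in 0..255.
pixel_scaled = data / 255.0

# Per-column min-max scaling: every column ends up in [0, 1].
col_min = data.min(axis=0)
col_range = data.max(axis=0) - col_min
minmax_scaled = (data - col_min) / col_range


def sigmoid(z):
    """Logistic function; its output always lies strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


print("max after /255:       ", pixel_scaled.max())
print("max after min-max:    ", minmax_scaled.max())
print("sigmoid of a large input:", sigmoid(10.0))
```

With a linear output layer the targets do not need to lie in [0, 1] at all, which is one reason the suggestion above may help. In practice the column minimum and range would be computed on the training rows only and then reused for the test rows.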

What is it you want to actually do with the data? If it's not "take the input data set and learn a compressed version of it so that I can reconstruct the original variables with minimal error", then I think you've got the wrong tool. And if that IS what you're trying to do, you've got some adjustments to make in order to improve your results.
