Implementation of the 8 puzzle with deep reinforcement learning


qq406...@gmail.com

Apr 12, 2016, 4:00:35 AM
to Deep Q-Learning
Recently I have been trying to train a neural network to play the 8 puzzle, a very simple sliding-tile game described at http://gamescrafters.berkeley.edu/games.php?puzzle=8puzzle
However, the results do not seem very good.

The neural network I used has 3 layers, 9×1000×4, fully connected.
I don't use a CNN because the 3×3 board matrix can simply be flattened into a vector.
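Roughly, the architecture looks like this (a minimal sketch, assuming PyTorch purely for illustration; the original code may use a different framework):

import torch.nn as nn

# 9 inputs (one per board cell), one hidden layer of 1000 units,
# 4 outputs (one Q-value per move direction)
q_network = nn.Sequential(
    nn.Linear(9, 1000),
    nn.ReLU(),
    nn.Linear(1000, 4),
)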

I don't know whether 3 layers is enough for this network to learn the task.

In addition, the reward from the simulator is based on the change in Manhattan distance from the target configuration.

Could anyone tell me whether this kind of reward is effective?

Besides, epsilon is decreased by 0.01 every 1000 games, gamma is set to 0.99, and the batch size is 10.
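Concretely, the reward and exploration schedule described above might look roughly like the following sketch (the manhattan_distance helper, the starting epsilon of 1.0, and the floor of 0.1 are assumptions made here for illustration, not taken from the original code):

GAMMA = 0.99      # discount factor, as above
BATCH_SIZE = 10   # minibatch size, as above

def manhattan_distance(state, goal):
    # sum, over tiles 1-8, of each tile's Manhattan distance from its goal cell (3x3 board)
    dist = 0
    for tile in range(1, 9):          # skip the blank (0)
        r, c = divmod(state.index(tile), 3)
        gr, gc = divmod(goal.index(tile), 3)
        dist += abs(r - gr) + abs(c - gc)
    return dist

def reward(prev_state, next_state, goal):
    # positive when a move brings the board closer to the goal, negative otherwise
    return manhattan_distance(prev_state, goal) - manhattan_distance(next_state, goal)

def epsilon(game_index):
    # decreased by 0.01 every 1000 games, as described above
    return max(0.1, 1.0 - 0.01 * (game_index // 1000))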

Could anyone help me?

Ashley Edwards

Apr 13, 2016, 12:15:57 PM
to Deep Q-Learning
Maybe you should just give the agent a reward when it reaches the goal. Using the Manhattan distance might introduce suboptimal policies. 
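In other words, something like the following goal-only reward (a rough sketch; goal_state is just a placeholder for the solved configuration):

def sparse_reward(state, goal_state):
    # reward only when the solved configuration is reached, nothing in between
    return 1.0 if state == goal_state else 0.0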

qq406...@gmail.com

Apr 13, 2016, 12:27:00 PM
to Deep Q-Learning
Thank you for the response!
I give the agent a large reward when it reaches the goal.
However, all of the network's outputs become zero after training the network about 7-10 times.
This strange phenomenon seriously confuses me.

Besides, the reward for an action is based on the difference between the previous state's and the next state's Manhattan distance to the goal state.

Why might the Manhattan distance introduce suboptimal policies?
And do you have another criterion for the reward?

On Thursday, April 14, 2016 at 12:15:57 AM UTC+8, Ashley Edwards wrote:

Ashley Edwards

Apr 13, 2016, 2:09:25 PM
to Deep Q-Learning
Have you checked to see if your network works for simpler domains? Also, are you sure the configurations you are giving the agent are solvable? 

If you were giving a reward based only on the distance to the goal then the agent would be encouraged to never terminate the game, since it would always get some positive reward. Since you're subtracting, I don't think any "infinite" loops will be introduced. However, it might be necessary for the agent to take an action that moves it away from the goal (for example, by sliding a tile out of the way). That action would get a negative reward, so the agent might end up avoiding actions like that and oscillating between safe actions to avoid ever getting a negative reward. This will especially happen if the goal is never even reached.
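(As an aside, the difference-based reward being discussed here is close to potential-based reward shaping; if the discount factor is included, i.e. the shaping term is gamma * phi(next) - phi(prev) with phi(s) = -Manhattan distance to the goal, it is known not to change the optimal policy (Ng et al., 1999). A sketch, reusing a manhattan_distance helper like the one above:)

def shaped_reward(prev_state, next_state, goal, base_reward, gamma=0.99):
    # potential-based shaping: add gamma * phi(s') - phi(s) to the base reward
    phi_prev = -manhattan_distance(prev_state, goal)
    phi_next = -manhattan_distance(next_state, goal)
    return base_reward + gamma * phi_next - phi_prev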

qq406...@gmail.com

Apr 14, 2016, 3:22:01 AM
to Deep Q-Learning
Thank you for the response!
I think the epsilon-greedy strategy is meant to keep the agent from getting trapped in the situation where "the agent might end up oscillating between safe actions",
because the agent has some probability of taking a random action.
Thus, an appropriate epsilon decay schedule should avoid the case you are worried about.
Do you agree?
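For concreteness, the epsilon-greedy action selection meant here is roughly the following (a sketch; q_values stands for the network's output for the current state):

import random

def select_action(q_values, eps, n_actions=4):
    # with probability eps take a random move, otherwise the greedy one
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_values[a])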


On Thursday, April 14, 2016 at 2:09:25 AM UTC+8, Ashley Edwards wrote:

Ashley Edwards

Apr 14, 2016, 12:42:35 PM
to Deep Q-Learning
I'm saying it would oscillate between safe actions since it can't stand still. If, for example, the agent had a no-op action, then it might just stand still because the expected value would be 0. The expected value for moving tiles around might be negative. Epsilon-greedy would help, but if the goal takes a long time to reach, then the agent's policy might still be to take the no-op action to avoid getting negative values. Have you checked that the agent ever actually reaches the goal? If so, I'd take out these intermediate rewards, only give the agent a reward for reaching the goal, and let reinforcement learning do its thing. Intermediate rewards often yield unexpected policies.

qq406...@gmail.com

Apr 15, 2016, 1:11:19 AM
to Deep Q-Learning
Thanks for the response!

Do you mean the agent will avoid the continuous negative rewards and choose to oscillate if the goal takes too long to reach?

In my experiment, the agent never reaches the goal.

How should I modify the reward scheme in order to train the network successfully?

On Friday, April 15, 2016 at 12:42:35 AM UTC+8, Ashley Edwards wrote:

Dorje Haoxi

Apr 17, 2016, 11:19:09 AM
to Deep Q-Learning
7-8 times is far from enough, I guess; usually I think about 1 million exploration steps would be just OK...

qq406...@gmail.com

Apr 19, 2016, 3:12:13 AM
to Deep Q-Learning
Hi, Ashley!
The recent results show that the agent can reach the goal while exploring.

However, the network does not seem to be learning anything, because the reward does not improve after exploration.

I wonder whether we should modify the reward settings or other parameters?

On Friday, April 15, 2016 at 12:42:35 AM UTC+8, Ashley Edwards wrote:

qq406...@gmail.com

Apr 19, 2016, 7:21:22 AM
to Deep Q-Learning
Thanks for the response!
I have tried 1 million exploration steps.
But it doesn't seem to work.
The agent still can't learn anything after 1 million exploration steps.

On Sunday, April 17, 2016 at 11:19:09 PM UTC+8, Dorje Haoxi wrote:

Sanjana Roy

Jul 16, 2019, 12:01:10 PM
to Deep Q-Learning


Can you please provide the reinforcement learning code for the 8 puzzle to me? I urgently need it.