Let's say there is an AI designed to learn to play Super Mario, and we define an action called "jump", which in practice presses button A and then releases it almost immediately. The result is a hop, a short jump.
But as anyone who has played the game knows, there is not only the short jump but also the long jump: to perform it, the AI has to keep holding button A and not release it until it is time to let Mario fall.
Since jump distance is continuous, I don't think it makes sense to define several jump actions with different distances. Instead, I think it makes sense to define one action called "keep jump" (keep holding button A) and another called "release jump" (release button A), so that the AI can choose at each time step whether to fall, like a human player.
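To make the idea concrete, here is a minimal sketch of such an action space. All names and the action table are illustrative, not from any particular emulator API; the only state tracked is whether button A is currently held:

```python
# Illustrative action set: press/release jump plus a few other actions.
ACTIONS = {
    0: "noop",          # do nothing this frame
    1: "press_jump",    # start holding button A
    2: "release_jump",  # stop holding button A
    3: "left",
    4: "right",
}

def apply_action(action_id, a_held):
    """Return the new 'is button A held' state after taking an action."""
    name = ACTIONS[action_id]
    if name == "press_jump":
        return True
    if name == "release_jump":
        return False
    return a_held  # other actions leave the button state unchanged

held = apply_action(1, False)  # press jump: now held
held = apply_action(0, held)   # noop keeps holding, so Mario keeps rising
held = apply_action(2, held)   # release: Mario starts to fall
```

The point is that the button state persists across frames until the agent explicitly releases it, which is exactly what creates the problem described below.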
This raises a question: since there are other actions besides keep/release jump, how do we get the AI, during training, to choose only "release jump" or do nothing after it has pressed jump?
With this action design, it is easy to see it would be terrible if the AI forgot to release the button: it would keep jumping and form a distorted picture of the environment, thinking "Hey, I am not acting, the environment just keeps me jumping. I am always jumping in this game."
I came up with two ways to solve this.
Suppose we are using policy gradient to find an optimal policy as a distribution over actions, and we train a neural network on the game's raw pixels.
1. Use an "if/then" filter on invalid actions: after pressing jump, record it with a flag; then at the next time step, after obtaining a distribution over all actions, whichever action is chosen, do:
    if chosen action == "release jump":
        release jump
        clear the flag
    else:
        do nothing
But the problem is that what the AI actually did does not influence the training: it chose some action, say C, not "release jump". Would that teach the AI that C is good during a jump? I'm at a loss here.
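One way to avoid that mismatch (a variant of the filter idea, not something from the original design) is to apply the filter to the distribution itself before sampling: zero out the probabilities of actions that are invalid mid-jump and renormalize, so the action the policy is trained on is always the action actually executed. A minimal sketch in plain Python, with illustrative action indices:

```python
import random

def masked_sample(probs, jump_held, legal_when_held, rng):
    """Sample an action index; while the jump button is held, restrict
    sampling to the legal subset and renormalize the probabilities."""
    if jump_held:
        probs = [p if i in legal_when_held else 0.0
                 for i, p in enumerate(probs)]
    total = sum(probs)
    probs = [p / total for p in probs]
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return action, probs

# Indices: 0 = noop, 1 = press jump, 2 = release jump, 3 = move right.
rng = random.Random(0)
action, masked = masked_sample([0.2, 0.3, 0.1, 0.4], True, {0, 2}, rng)
# With the jump held, only "noop" (0) or "release jump" (2) can be drawn,
# and the gradient would be computed from the masked probabilities.
```

The masked probabilities, not the raw network output, are then what you plug into the policy-gradient loss, so the filter and the training signal agree.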
2. Leave the choice of actions unchanged, but add a numeric flag to the network: after pressing jump, feed an extra unit with a fixed value into a certain hidden layer of the neural network.
The problems with this solution are:
1) it wastes time, because the AI has to figure out, over many training iterations, that all actions other than "release jump" are meaningless after pressing jump (for simplicity, ignoring the effect of the direction buttons during a jump);
2) we have to decide what numeric values represent "jump pressed" and "jump not pressed". Is it OK to just choose 0 and 1, with no negative effect on the gradient calculation?
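For what it's worth, a common variant of this idea (an assumption on my part, not the original proposal) concatenates the flag to the network's input rather than injecting it into a hidden layer; a 0/1 indicator is a standard feature encoding and does not by itself cause gradient problems. A trivial sketch, with hypothetical names:

```python
def observation_with_flag(pixels, jump_held):
    """Append a binary 'jump button held' indicator to a flattened
    observation vector before feeding it to the network."""
    return list(pixels) + [1.0 if jump_held else 0.0]

obs = observation_with_flag([0.5, 0.25], True)   # ends in 1.0
obs = observation_with_flag([0.5, 0.25], False)  # ends in 0.0
```

Whether 0/1, -1/+1, or some rescaled value works best is a tuning detail; the network only needs the two states to be distinguishable.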
Any ideas on this? Or do you think my "keep/release" action design makes no sense?