Hi everyone,
I am having trouble understanding how to implement policy gradient using the cross-entropy trick.
In the slides, it is mentioned that

but it is still not clear to me what to pass as input to the train function.
For example, suppose we use the policy gradient method to train the simple walker example (the first RL example, where the reward is the distance from the origin at time T). If the policy is a neural network with a single input (the position) and two outputs (the up and down actions), what should the "observered_inputs" variable passed to the train function be?
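In case it helps frame the question, here is a minimal NumPy sketch of how I understand the cross-entropy trick: collect (state, sampled action, reward) tuples from rollouts, then take a gradient step on a reward-weighted cross-entropy loss where each sampled action is treated as the "label". All names here (the tiny policy, the batch variables) are my own assumptions, not the course code, and I'm not sure this matches what the slides intend:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Tiny linear policy: single input (position) -> 2 action logits (up/down).
# This is a hypothetical stand-in for the course's neural network.
W = rng.normal(size=(1, 2)) * 0.1
b = np.zeros(2)

def policy(x):  # x: (N, 1) batch of positions
    return softmax(x @ W + b)

# One rollout batch: positions we visited, actions we sampled from the
# policy at those positions, and the rewards we received (e.g. the
# distance from the origin at time T, copied back to each step).
observed_inputs = rng.normal(size=(8, 1))
probs = policy(observed_inputs)
sampled_actions = np.array([rng.choice(2, p=p) for p in probs])
rewards = rng.normal(size=8)

# Reward-weighted cross-entropy gradient w.r.t. the logits:
#   d/dtheta [ -R * log pi(a|s) ]  ->  R * (pi - onehot(a))
onehot = np.eye(2)[sampled_actions]
dlogits = rewards[:, None] * (probs - onehot) / len(rewards)
dW = observed_inputs.T @ dlogits   # backprop through the linear layer
db = dlogits.sum(axis=0)

# One gradient step on the policy parameters.
lr = 0.1
W -= lr * dW
b -= lr * db
```

Under this reading, "observered_inputs" would be the batch of states (positions) visited during the rollouts, the labels would be the actions actually sampled, and the per-sample weight would be the reward. Is that right?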