Hello Mrs Feng,
Though I am a novice when it comes to RL, I will try to answer your questions based on my understanding. 1) The initial target policy (or policy matrix) could be a set of random values chosen intuitively, which you would then improve (optimize) over several training episodes based on your domain (environment). 2) You could use a random function in MATLAB to place your agent at a different start position for each training episode, which helps improve your policy or action matrix. 3) I have seen several links on implementing the mountain car problem with SARSA, e.g.
http://www.dia.fi.upm.es/~jamartin/download.htm
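To make points 1) and 2) concrete, here is a minimal sketch of what I mean. The original suggestion was MATLAB, but the same idea in Python looks like this; the grid size, moves, and reward are my own illustrative assumptions, not from the question:

```python
import random

random.seed(0)  # for reproducibility

# Hypothetical 16-state grid-world; sizes, moves, and reward values
# are illustrative assumptions, not details from the question.
N_STATES, N_ACTIONS = 16, 4
GOAL = N_STATES - 1
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# 1) Initialise the action-value (Q) table with small random numbers;
#    training will overwrite these intuitively chosen values.
Q = [[random.uniform(-0.01, 0.01) for _ in range(N_ACTIONS)]
     for _ in range(N_STATES)]

def step(state, action):
    """Toy transition: move by -1, +1, -4, or +4, clipped to the grid."""
    moves = [-1, 1, -4, 4]
    nxt = min(max(state + moves[action], 0), N_STATES - 1)
    return nxt, (1.0 if nxt == GOAL else 0.0)

for episode in range(500):
    # 2) Start each episode at a random position so the policy
    #    improves across the whole state space.
    s = random.randrange(N_STATES)
    for _ in range(200):  # step cap so one episode cannot run forever
        if s == GOAL:
            break
        if random.random() < EPSILON:    # explore occasionally
            a = random.randrange(N_ACTIONS)
        else:                            # otherwise exploit current policy
            a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
        s2, r = step(s, a)
        # Temporal-difference (Q-learning) update of the policy values.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2
```

After training, picking the highest-valued action in each state gives you the improved policy.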
I hope this helps until someone else explains it better.
One other thing: don't forget that the agent needs a reward signal that guides it to the goal. So if you are using Q-learning, you would have an initial reward matrix that helps you optimize your action matrix.
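For the reward matrix idea, here is a small sketch in the style of the classic Q-learning "room navigation" tutorials; the layout, rewards, and discount factor are my own assumptions, not from the question:

```python
import random

random.seed(1)  # for reproducibility

# Hypothetical 3-room layout (an assumption for illustration).
# R[s][a] is the immediate reward for moving from room s to room a;
# -1 marks an impossible move and 100 marks entering the goal room.
R = [
    [-1,   0,  -1],   # room 0 connects only to room 1
    [ 0,  -1, 100],   # room 1 connects to room 0 and to the goal
    [-1,  -1, 100],   # goal room 2 (terminal)
]
GOAL, GAMMA = 2, 0.8

Q = [[0.0] * 3 for _ in range(3)]  # action matrix to be optimized

for _ in range(2000):
    s = random.randrange(3)
    valid = [a for a in range(3) if R[s][a] >= 0]  # allowed moves only
    a = random.choice(valid)
    s2 = a                         # the action index is the destination room
    future = 0.0 if s2 == GOAL else max(Q[s2])     # goal is terminal
    Q[s][a] = R[s][a] + GAMMA * future             # full-backup Q update
```

Here the fixed reward matrix R guides learning: the Q (action) matrix converges so that Q[1][2] = 100 and Q[0][1] = 80, i.e. the agent learns to head toward the goal room.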
Best wishes :-)
--
It is better to Light a Candle than to be the cause of Darkness