A natural policy parameterization for shooting is a policy network that outputs the parameter of a Bernoulli distribution. Let
The main drawback of this approach is that the policy is likely to spend much of its exploration in meaningless regions, e.g., when the target is well beyond its reachable distance. To reduce this ineffective exploration, we impose a human prior on the policy. In particular, we choose the Beta distribution as the prior and train the network outputs to serve as the "likelihood". The prior is given by a predefined rule, such as
Suppose that the network outputs are
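To illustrate how a Beta prior and Bernoulli-style network outputs could interact, the following minimal sketch uses the standard Beta-Bernoulli conjugate update. It is an illustration under assumptions that do not come from the text: the distance rule in `rule_based_prior`, the `strength` parameter, the softplus mapping of two raw network outputs to non-negative pseudo-counts, and the use of the posterior mean as the shooting probability are all hypothetical choices, not necessarily the formulation used here.

```python
import math
import random


def rule_based_prior(distance, max_range, strength=10.0):
    """Hypothetical rule returning Beta parameters (alpha0, beta0).

    In range: uniform Beta(1, 1), i.e. no preference.
    Out of range: extra mass on "do not shoot".
    """
    if distance <= max_range:
        return 1.0, 1.0
    return 1.0, 1.0 + strength


def softplus(x):
    """Numerically stable softplus, used to keep pseudo-counts non-negative."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def shoot_probability(net_out, prior):
    """Combine prior and network outputs via a conjugate-style update.

    Assumption: net_out holds two raw scores that act as pseudo-counts
    for "shoot" and "do not shoot" after a softplus.
    """
    alpha = prior[0] + softplus(net_out[0])
    beta = prior[1] + softplus(net_out[1])
    return alpha / (alpha + beta)  # mean of Beta(alpha, beta)


def sample_action(p, rng=random):
    """Draw a Bernoulli action: 1 = shoot, 0 = hold."""
    return 1 if rng.random() < p else 0


# Same network outputs, different priors.
net_out = (0.0, 0.0)
p_in = shoot_probability(net_out, rule_based_prior(5.0, max_range=10.0))
p_out = shoot_probability(net_out, rule_based_prior(50.0, max_range=10.0))
action = sample_action(p_out)
```

With identical network outputs, the out-of-range prior pulls the shooting probability well below the in-range value, so exploration in the meaningless region is suppressed without being ruled out entirely.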