Solutions for the classic control environments from the Gymnasium package.
Since all environments have a continuous state space, it needs to be converted into a discrete one. This is done by dividing the state space into equally sized intervals for each feature. Tabular methods were used only for environments with a discrete action space, so no discretisation of actions was needed there.
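A minimal sketch of such a discretiser is shown below, assuming NumPy; the bin counts and feature bounds are illustrative placeholders, not the values actually used:

```python
import numpy as np

def make_discretizer(low, high, bins_per_feature):
    """Split each feature's range [low, high] into equally sized bins."""
    edges = [np.linspace(l, h, b + 1)[1:-1]  # inner bin edges for one feature
             for l, h, b in zip(low, high, bins_per_feature)]

    def discretize(observation):
        # Map each continuous feature to the index of the bin it falls into.
        return tuple(int(np.digitize(x, e)) for x, e in zip(observation, edges))

    return discretize

# Example with CartPole-like bounds (illustrative values only).
discretize = make_discretizer(
    low=[-2.4, -3.0, -0.21, -3.0],
    high=[2.4, 3.0, 0.21, 3.0],
    bins_per_feature=[10, 10, 10, 10],
)
state = discretize([0.1, -0.5, 0.02, 1.3])  # -> (5, 4, 5, 7)
```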
Q-learning where the state is converted into discrete values. The behavioral policy is ε-greedy.
Q-learning with an optimistic value function, where the high initial estimates encourage exploration of unvisited state-action pairs.
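A minimal sketch of the tabular Q-learning update behind both agents; the ε-greedy behaviour, the optimistic initial value and all hyperparameters are illustrative assumptions:

```python
import numpy as np
from collections import defaultdict

n_actions = 2            # e.g. CartPole
alpha, gamma, eps = 0.1, 0.99, 0.1
optimistic_init = 10.0   # high initial value drives visits to unseen pairs

Q = defaultdict(lambda: np.full(n_actions, optimistic_init))

def act(state):
    # ε-greedy behavioral policy over the current Q estimates.
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state, done):
    # Q-learning: bootstrap from the greedy action in the next state.
    target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
    Q[state][action] += alpha * (target - Q[state][action])
```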
N-step Sarsa with multiple values of N to compare performance. The worst performing was N=1, averaging about 280 steps. The other agents averaged roughly 200 steps, with slightly better results as N increased. None of them is anywhere near the optimal policy - better exploration of states near the goal would be needed, as they are not visited often enough. Also, the observation space has 6 features, which means the discrete representation either has very coarse granularity or becomes very large and needs a massive amount of time to explore fully.
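For reference, a sketch of the n-step Sarsa update used in such a comparison; the episode bookkeeping is simplified and the function name is hypothetical:

```python
def n_step_sarsa_update(Q, trajectory, tau, n, alpha, gamma, T):
    """Update Q for the state-action pair at time tau using an n-step return.

    `trajectory[i]` holds (state, action, reward), where reward is the one
    received after taking the action at step i; T is the terminal time step.
    """
    # Accumulate discounted rewards R_{tau+1} .. R_{min(tau+n, T)}.
    G = sum(gamma ** (i - tau) * trajectory[i][2]
            for i in range(tau, min(tau + n, T)))
    # Bootstrap with Q(S_{tau+n}, A_{tau+n}) if the episode has not ended yet.
    if tau + n < T:
        s_n, a_n, _ = trajectory[tau + n]
        G += gamma ** n * Q[s_n][a_n]
    s, a, _ = trajectory[tau]
    Q[s][a] += alpha * (G - Q[s][a])
```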
The simplest approximation approach would be a polynomial feature function, but I was unable to find reliable hyperparameters. Therefore, a more complex solution is needed.
RBF (radial basis function) approximation is a sort of intermediate step towards neural networks. The agent needs ~100 episodes to learn how to get to the top, which it then reaches in ~140 steps.
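A rough sketch of the RBF approach, assuming scikit-learn's RBFSampler for the random features and one linear SGD regressor per action; the gammas, component counts and learning rate are illustrative, not the exact configuration used here:

```python
import numpy as np
import gymnasium as gym
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import SGDRegressor

env = gym.make("MountainCar-v0")

# Project observations onto random RBF features at a few different scales.
featurizer = FeatureUnion([
    (f"rbf{i}", RBFSampler(gamma=g, n_components=100))
    for i, g in enumerate([5.0, 2.0, 1.0, 0.5])
])
samples = np.array([env.observation_space.sample() for _ in range(10_000)])
featurizer.fit(samples)

def features(obs):
    return featurizer.transform(np.atleast_2d(obs))

# One linear model per action, trained by semi-gradient updates.
models = []
for _ in range(env.action_space.n):
    m = SGDRegressor(learning_rate="constant", eta0=0.01)
    m.partial_fit(features(env.reset()[0]), [0.0])  # initialise weights
    models.append(m)

def q_values(obs):
    x = features(obs)
    return np.array([m.predict(x)[0] for m in models])
```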
Deep Q-Network (DQN) is a method based on Q-learning where the Q-table is replaced by a neural network. This means the agent doesn't need to explore as much as in the tabular methods, because a tabular agent must visit every state multiple times, while a DQN agent generalises from similar states. There are many variations; I used the one with two identical networks, policy and target, called double DQN, which reduces oscillation, and Dueling DDQN, which further improves performance.
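A condensed sketch of the two ideas, assuming PyTorch: the dueling value/advantage split, and the double-DQN target where the policy net selects the next action and the target net evaluates it (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, n_obs, n_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_obs, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantages A(s, a)

    def forward(self, x):
        h = self.body(x)
        a = self.advantage(h)
        # Combine streams: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return self.value(h) + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(policy_net, target_net, reward, next_obs, done, gamma=0.99):
    with torch.no_grad():
        # Action selected by the policy net, evaluated by the target net.
        best_action = policy_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, best_action).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```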
Special note on the MountainCar environment: it is highly unlikely to be solved (although possible with luck) with simple ε-greedy exploration, because the reward is sparse and random actions rarely reach the goal.
TBD
In this case, the use of tabular methods is still possible by discretising the action space as well, but it would generally yield poor results. Thus, only approximation via neural networks is considered here.
DDPG is a policy gradient algorithm, like the Actor-Critic method. Policy gradient methods are a family of algorithms that approximate the policy directly, rather than through a value function as Q-learning methods do. This makes them applicable to environments with both discrete and continuous action spaces.
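A minimal sketch of the DDPG actor-critic pair and the soft target update, assuming PyTorch; layer sizes and τ are illustrative placeholders:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to a continuous action."""
    def __init__(self, n_obs, n_actions, max_action=1.0, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Tanh(),
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)

class Critic(nn.Module):
    """Q(s, a): evaluates a state-action pair."""
    def __init__(self, n_obs, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=1))

def soft_update(target, source, tau=0.005):
    # Slowly track the online networks to keep the bootstrap targets stable.
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.copy_((1.0 - tau) * t.data + tau * s.data)
```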