3.2 RL Environment
The environment E, labelled as ctc-executioner-v0 and inherited from gym.Env, is a simulator for order execution.
This section first provides an overview of the environment and then describes each component and its functionality.
To make use of this environment:

```python
import gym
import gym_ctc_executioner

env = gym.make("ctc-executioner-v0")
env.setOrderbook(orderbook)  # orderbook: a previously loaded Orderbook instance
```
The environment covers the entire order execution process, so an agent that makes use of it does not have to be aware of the inner workings and can regard the execution process as a black box.
Upon initialization, an order book and a match engine are provided. The order book is the essential core that implicitly defines the state space and the outcome of each step. All other components, including the match engine, are therefore abstractions and mechanisms used to construct an environment that allows one to investigate and learn how to place orders.
During the execution process, which is initiated by an agent via the reset method, a memory serves as the storage for an internal state that contains the ongoing execution and whose values are updated as the agent proceeds through its epochs. (The current implementation supports only one execution in the memory, so running multiple agents at a time would cause race conditions.)
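As a rough illustration, this internal state can be thought of as a small container tracking the execution progress. The sketch below is hypothetical and does not mirror the actual ActionState implementation, which additionally carries the order book derived features the agent observes.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ExecutionState:
    """Illustrative container for an ongoing execution (hypothetical fields)."""
    remaining_inventory: float  # shares still to be executed
    remaining_time: float       # seconds left of the time horizon
    features: np.ndarray = field(default_factory=lambda: np.zeros(0))  # order book derived observation
```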
With every step taken by the agent, a chain of tasks is processed:
- The agent selects an action a and passes it to the environment.
- An internal state s (defined as ActionState) is constructed, derived either from the previous state or, in case a new epoch has started, from the order book.
- An Order is then created according to the remaining inventory and time horizon the agent has left, and the specified action to be taken.
- The order is sent to the match engine, which attempts to execute it in the current order book state (the one from which the agent's state was derived). The matching continues over the following order book states as long as the runtime of this step has not been consumed.
- The matching results in either no, a partial, or a full execution of the submitted order. Whatever the outcome, a reward can be derived, alongside the next state (again derived from the order book) and a flag indicating whether the epoch is done.
- These values are then stored in the memory and returned to the agent, which can take another step (see the interaction sketch below).
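This interaction follows the standard gym protocol. The following sketch runs a random agent for a few epochs; it assumes an Orderbook instance called orderbook has already been loaded and that the environment exposes the usual action_space, reset and step members.

```python
import gym
import gym_ctc_executioner

env = gym.make("ctc-executioner-v0")
env.setOrderbook(orderbook)  # orderbook: a previously loaded Orderbook instance (assumed)

for epoch in range(10):
    state = env.reset()                               # start a new execution epoch
    done = False
    total_reward = 0.0
    while not done:
        action = env.action_space.sample()            # random limit level
        state, reward, done, info = env.step(action)
        total_reward += reward
    print(f"epoch {epoch}: reward={total_reward:.4f}")
```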
Unlike in most traditional reinforcement learning environments, each step taken by the agent leads to a complete change of the state space. Consider a chess environment, where the state space is the board equipped with pieces. After every move taken by the agent, the state space looks exactly the same, except for the piece moved in that step. This process goes on until the agent either wins or loses the game, and the state space is then reset to the very same configuration as at the beginning of the previous epoch.
In the execution environment, however, the state space will likely never be the same, since a random sequence of order book states throughout an epoch defines the state space. As these order book states are likely to differ at every step, the state the agent is in changes accordingly. It is as if not only one or two pieces on the chess board changed their position, but almost all of them.
What the agent effectively observes depends on the configuration (see Feature Engineering), resulting in some state s ∈ R^d.
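The dimensionality d is thus a consequence of the chosen features. Assuming the environment populates the standard gym observation_space attribute, it can be inspected as follows:

```python
import gym
import gym_ctc_executioner

env = gym.make("ctc-executioner-v0")
env.setOrderbook(orderbook)   # orderbook assumed to be loaded beforehand
print(env.observation_space)  # shape depends on the configured features (d)
```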
A discrete action space is configured, represented by a vector whose size equals the number of limit levels (L). The actions a ∈ Z represent limit levels segmented into $0.10 steps. The action space is configurable; the default implementation is of size 101, derived from limit levels ranging from -50 up to +50. Negative limit levels indicate a listing deep in the book, and positive limit levels relate to levels on the opposing side of the book.
Thus, at each time-step t the agent selects an action a_t from the set of legal actions A = {l_min, ..., l_max}, where l_min is the most negative and l_max the most positive limit level.
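As a minimal sketch of how such an action index could be translated into a limit price, assuming the default 101 levels (-50 to +50), a $0.10 step and a hypothetical best_price reference; the actual mapping inside the environment may differ:

```python
def action_to_limit_price(action: int, best_price: float,
                          num_levels: int = 101, step: float = 0.10) -> float:
    """Map a discrete action index (0..100) to a limit price (illustrative only)."""
    limit_level = action - (num_levels - 1) // 2  # 0..100 -> -50..+50
    return best_price + limit_level * step

# Example: action 75 -> limit level +25 -> $2.50 above the reference price
print(action_to_limit_price(75, best_price=100.00))  # 102.5
```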
The reward is defined as the difference between the market price before execution and the volume weighted average price (VWAP) paid. That is,

vwap = (Σ_p p · v^p) / V,

where p is the price paid for v^p shares and V represents the total volume of shares.
Hence, for buying assets the reward is defined as

r = p_T - vwap

and for selling assets as

r = vwap - p_T,

where p_T is the best market price at execution time step t = T.
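As a hedged illustration, the sketch below computes the VWAP of a set of fills and the corresponding reward for a buy and a sell execution; the fill data and the reference price p_T are made up for the example.

```python
def vwap(fills):
    """Volume weighted average price of a list of (price, volume) fills."""
    total_volume = sum(v for _, v in fills)
    return sum(p * v for p, v in fills) / total_volume

# Hypothetical fills of one execution and an assumed best market price p_T
fills = [(100.10, 0.4), (100.20, 0.6)]  # (price paid, shares)
p_T = 100.30

avg = vwap(fills)        # 100.16
reward_buy = p_T - avg   # positive: bought below the market price
reward_sell = avg - p_T  # negative: sold below the market price
print(avg, reward_buy, reward_sell)
```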
Given the definition of the discounted return in Section 1.6 we calculate

R_t = Σ_{t'=t}^{t_0} γ^{t'-t} r_{t'},

where t_0 is the time-step at which the execution has its time horizon fully consumed.
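For completeness, a small sketch of this calculation with an assumed discount factor γ:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return R_t for the rewards collected from step t until t_0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 0.14]))  # e.g. a single non-zero reward at the last step
```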
An agent that is compatible with the OpenAI gym.Env interface will be able to make use of this environment.
As defined in Section 1.6, the optimal action-value function is

Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π],

where π is a policy mapping states to either actions or distributions over actions. See the Agents sections to learn more about the variety of agents in use.