(Work in progress. The focus is on terms that won't be part of most introductory courses since those definitions are easy to find and are usually in the WildML glossary.)
See also the basics glossary.
- Backpropagation
- Synthetic gradients
- Result: Faster updating of parameter weights
- Method: Using predicted 'synthetic gradients' (estimates of the gradient made from local information) instead of true backpropagated error gradients
- [Paper (Jaderberg et al., Jul 2017)]
- Dilated convolutions
- Convolutions whose filter elements are regularly spaced apart (gaps between kernel taps).
- Purpose: the receptive field grows more quickly with depth, so the network can merge more spatial information across the input while keeping the filter size constant.
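A minimal sketch (assuming PyTorch; the layer choices are illustrative) comparing a standard and a dilated 3x3 convolution: both have 9 weights, but the dilated one covers a 5x5 input patch.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # (batch, channels, height, width)

# Standard 3x3 convolution: each output pixel sees a 3x3 input patch.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Dilated 3x3 convolution (dilation=2): same 9 weights, but the taps are spaced
# 2 cells apart, so each output pixel sees a 5x5 input patch.
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape, dilated(x).shape)  # both preserve the 32x32 spatial size
```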
- Skip connections
- Mappings (connections) that skip one or more layers.
- Component of a 'deep residual layer'
- Goal: help network to learn approximate identity layers (if that is what is locally optimal)
- in which case the learned residual (the output of the skipped layers) should be approximately 0.
- Introduced by He et al., Dec 2015 as part of deep residual networks, winner of ILSVRC 2015.
- Also called residual connections or shortcut connections.
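A minimal residual-block sketch (assuming PyTorch; the exact layers are illustrative): if the convolutions learn weights near zero, the block reduces to an approximate identity mapping.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the skip connection adds the input back to the block output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)  # if residual ≈ 0, the block ≈ identity

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 8, 8])
```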
- Decimation layer
- Down-sampling, usually either through max-pooling or average pooling
- Dilated LSTMs
- Nested LSTM
- Use nesting as an approach to constructing temporal hierarchies in memory
- selective access to inner memories -> frees inner memories to remember and process events on longer time scales
- [Paper (Moniz et al., Jan 2018)]
- R-CNN (Region-based CNN)
- Object detection model
- [Fast PyTorch implementation] [Paper (Faster R-CNN, Ren et al., Jan 2016)] [Explanatory blog post]
- ResNets
- DenseNets
- Dense connection structure where each layer is directly connected to all its predecessors
- -> better information flow and feature reuse
- -> BUT dense skip connections also bring a potential risk of overfitting, parameter redundancy, and large memory consumption
- 'densely gathers features from previous layers in the network and combines them using concatenation'
- SparseNet
- Synthesised from ResNets and DenseNets
- [Paper: Sparsely Connected Convolutional Networks (Zhu, Deng et al., Jan 2018)]
- Capsule Networks
- NADE
- MADE
- PixelCNN
- PixelCNN++
- PixelRNN
- More architectures
- VGG16
- ResNeXt
- Feature Pyramid Networks
- Autoencoders
- Variational Inference
- Variational Autoencoder
- Variational Lower Bound
- MCMC (Markov Chain Monte Carlo)
- Gibbs Sampling
- Monte Carlo EM
- EM
- Bayesian Neural Networks
- Laplace's approximation
- Metropolis-Hastings
- Hamiltonian Monte Carlo
- Evolution Strategies
- Guided ES
- uses surrogate gradients
- Neural Turing Machine
- Neural network controller with read-write access to an external memory matrix
- Differentiable Neural Computers
- Neural network controller with read-write-erase access to an external memory matrix
- Kanerva Machine
- Differentiable Neural Dictionary (DND)
- from Neural Episodic Control (Pritzel et al., 2017)
- $M_a = (K_a, V_a)$, where $K_a, V_a$ are dynamically sized arrays of vectors, each containing the same number of vectors (1-1 correspondence, like a dictionary)
- Operations (sketch below):
- Lookup: map key $h$ to output $o$:
- weighted sum of the values in memory, with weights given by normalised closeness (kernels) between the lookup key and the corresponding key in memory. Closer match = higher weight.
- Write (after query/lookup)
- key = lookup key
- value = 'application-specific', e.g. the Q-value in RL
- $(k, v)$ appended to $K_a, V_a$. If the key already exists, the entry is updated instead of being duplicated.
- Use approximations in practice: kNN-like lookup
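A minimal sketch of a DND-style lookup and write (NumPy; the inverse-distance kernel and exact-append write are illustrative simplifications of the paper's kernel-weighted, kNN-approximated memory).

```python
import numpy as np

class DND:
    """Tiny differentiable-neural-dictionary sketch: parallel key/value arrays."""
    def __init__(self, key_dim):
        self.keys = np.empty((0, key_dim))
        self.values = np.empty((0,))

    def lookup(self, h, delta=1e-3):
        # Inverse-distance kernel between the lookup key and every stored key.
        k = 1.0 / (np.sum((self.keys - h) ** 2, axis=1) + delta)
        w = k / k.sum()                # normalised closeness weights
        return np.dot(w, self.values)  # weighted sum of stored values

    def write(self, h, v):
        # Append (key, value); a fuller implementation would update an existing
        # entry if the key is already present, and use approximate kNN lookup.
        self.keys = np.vstack([self.keys, h])
        self.values = np.append(self.values, v)

dnd = DND(key_dim=4)
dnd.write(np.ones(4), 1.0)
dnd.write(np.zeros(4), 0.0)
print(dnd.lookup(np.full(4, 0.9)))  # close to 1.0
```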
- Boltzmann Machines
- Hopfield Networks
- Linear Factor Models
- Independent Component Analysis (ICA)
- Sparse Coding
- Wake-sleep
- Finite-state machine (Abstract model)
- can be in one of a finite number of states $s_t \in S$
- can change from one state to another in response to an input: $s_{t+1} = f(x_t, s_t)$, with $s_t \in S$ for all $t$ and $|S|$ finite
- memory is limited by the number of states the FSM has, so it cannot do some tasks that a Turing machine can.
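A minimal finite-state-machine sketch as a `(state, input) -> state` transition table; the turnstile-style states and inputs are made up for illustration.

```python
# s_{t+1} = f(x_t, s_t): the FSM is just a (state, input) -> state table.
transitions = {
    ("locked", "coin"): "unlocked",
    ("locked", "push"): "locked",
    ("unlocked", "push"): "locked",
    ("unlocked", "coin"): "unlocked",
}

state = "locked"
for x in ["push", "coin", "push"]:
    state = transitions[(state, x)]
print(state)  # "locked"
```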
- Turing machine (Abstract model)
- Infinite memory tape divided into discrete cells
- Finite table of user-specified instructions
- HEAD positioned over a cell.
- READS symbol from cell,
- LOOKS UP symbol read in finite table of user-specified instructions
- WRITES in cell
- MOVES 1 left or right
- either CARRIES OUT instruction or HALTS computation (indicated in table of user-specified instructions)
- For any algorithm, a Turing machine capable of simulating that algorithm's logic can be constructed.
- (Turing-completeness: the ability of a system of instructions to simulate a Turing machine; such a system can in theory express any task accomplishable by a computer. Nearly all programming languages are Turing complete if the limitations of finite memory are ignored.)
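A minimal Turing-machine simulator sketch following the READ / LOOK UP / WRITE / MOVE loop above; the instruction table (flip bits, halt on a blank) is made up for illustration.

```python
from collections import defaultdict

# Instruction table: (state, symbol_read) -> (symbol_to_write, move, next_state)
table = {
    ("flip", "0"): ("1", +1, "flip"),
    ("flip", "1"): ("0", +1, "flip"),
    ("flip", " "): (" ", 0, "HALT"),
}

tape = defaultdict(lambda: " ", enumerate("1011"))  # "infinite" tape of cells
head, state = 0, "flip"

while state != "HALT":
    symbol = tape[head]                          # READ symbol under the head
    write, move, state = table[(state, symbol)]  # LOOK UP instruction
    tape[head] = write                           # WRITE
    head += move                                 # MOVE left/right (or stay)

print("".join(tape[i] for i in range(4)))  # "0100"
```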
- Alpha-divergence
- Special cases:
- Alpha=0: Variational Bayes
- Alpha=1: Expectation Propagation
- TODO: what is a 'mode' of a posterior p? What is meant by a solution that aims to cover multiple modes? (A mode is a local maximum of the posterior density; a solution that 'covers multiple modes' spreads probability mass over several such peaks rather than concentrating on one.)
- Intuition of RL:
- Loop through two steps:
- Agent performs action.
- State may change, agent may get reward.
- Agent explores the environment by taking actions.
- Actions involve time
- Procedures are not pre-programmed into the agent, but the agent knows the list of available actions
- Bellman Equation
- $V(s) = \max_{a}(R(s,a)+\gamma E[V(s')])$, where $\gamma$ is the discount factor.
- Deterministic version: $V(s) = \max_{a}(R(s,a)+\gamma V(s'))$
- Expanded for MDPs: $V(s) = \max_{a}(R(s,a)+\gamma \sum_{s'} P(s,a,s')V(s'))$
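A minimal value-iteration sketch applying the MDP form of the Bellman equation; the toy two-state transition model and rewards are made up for illustration.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# Toy model: P[s, a, s'] transition probabilities and R[s, a] rewards.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])

V = np.zeros(n_states)
for _ in range(100):
    # V(s) = max_a ( R(s,a) + gamma * sum_s' P(s,a,s') V(s') )
    V = np.max(R + gamma * P @ V, axis=1)
print(V)
```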
Plans vs Policies:
- Plans comprise the optimal action for each state, with no stochasticity. Policies incorporate stochasticity.
- Deterministic vs non-deterministic search:
- Deterministic search: Agent's intention maps 100% to agent's action.
- Non-deterministic search: Small chance of agent acting differently to how it intends to act
- Markov Decision Processes (MDP)
- Mathematical framework for modelling decision-making where outcomes are partly random and partly under the control of a decision-maker
- Markov Property:
- Memorylessness: the conditional probability distribution of future states depends only on the present state
- Associated Bellman eqn: $V(s) = \max_{a}(R(s,a)+\gamma E[V(s')])$
- aka $V(s) = \max_{a}(R(s,a)+\gamma \sum_{s'} P(s,a,s')V(s'))$
- Q-learning
- Give values to actions $Q(s_0,a_i)$ instead of states
- $Q(s,a) = R(s,a)+\gamma \sum_{s'} P(s,a,s')V(s')$
- i.e. $Q(s,a) = R(s,a)+\gamma \sum_{s'} P(s,a,s')\max_{a'}Q(s',a')$
- Temporal Difference
- TODO: refine
- (Consider Q-learning under deterministic search for convenience)
- $TD_t(a,s) = Q_t(s,a) - Q_{t-1}(s,a) = R(s,a)+\gamma \max_{a'}Q(s',a') - Q_{t-1}(s,a)$
- $TD(a,s)$ may be nonzero because of randomness (though we've written the deterministic-search version here).
- Update eqn: $Q_t(s,a) = Q_{t-1}(s,a) + \alpha TD_t(a,s)$, where $\alpha$ is the learning rate.
- Hope: the algorithm will converge to the 'correct' Q-values, unless the environment is constantly changing. (Sketch below.)
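A minimal tabular Q-learning sketch using the TD update above. The environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`) is an assumption for illustration, not a specific library's API.

```python
import numpy as np

# Assumed interface: env.reset() -> state, env.step(a) -> (next_state, reward, done).
def q_learning(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # TD error: r + gamma * max_a' Q(s', a') - Q(s, a)
            td = r + gamma * (0 if done else np.max(Q[s_next])) - Q[s, a]
            Q[s, a] += alpha * td  # Q <- Q + alpha * TD
            s = s_next
    return Q
```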
- Living penalty
- e.g. small negative reward when entering each non-terminal state to motivate agent to finish the game quickly
- Successor Representation
- Options framework
- involves abstractions over the space of actions
- at each step, the agent chooses either a one-step 'primitive' action or a 'multi-step' action policy (option). Each option defines a policy over actions (either primitive or other options) and can be terminated according to a stochastic termination function $\beta$. (Sketch below.)
- Paper: Sutton et al. Definition from Kulkarni and Narasimhan et al. (2016)
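A minimal sketch of an option as a data structure: an initiation set, an intra-option policy, and a termination function $\beta$. The class and the assumed `env.step` interface are illustrative.

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option: where it can start, how it acts, and when it terminates."""
    initiation_set: Set[int]      # states in which the option may be chosen
    policy: Callable[[int], int]  # intra-option policy: state -> action
    beta: Callable[[int], float]  # termination probability in each state

def run_option(option, env, state):
    """Follow the option's policy until beta(state) says stop (assumed env.step API)."""
    while True:
        state, reward, done = env.step(option.policy(state))
        if done or random.random() < option.beta(state):
            return state
```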
- Deep Q-learning
- **Learning**: Feed the state into a NN; the final layer gives Q-values for each action
- Compares the predicted value to the previously observed value (target): loss $L = \sum(Q_{\text{observed}} - Q_{\text{pred}})^2$ (sketch below)
- Learning happens for each state
- **Acting**: Put the final layer through a softmax (or some other action selection policy, see below) and select the corresponding action.
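A minimal sketch of the deep Q-learning loss (assuming PyTorch; the network, the random transition batch, and the absence of a separate target network are illustrative simplifications).

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One illustrative transition batch: state, action, reward, next_state, done.
s = torch.randn(8, state_dim)
a = torch.randint(n_actions, (8,))
r = torch.randn(8)
s_next = torch.randn(8, state_dim)
done = torch.zeros(8)

q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for taken actions
with torch.no_grad():
    q_target = r + gamma * (1 - done) * q_net(s_next).max(1).values  # observed/bootstrap target
loss = nn.functional.mse_loss(q_pred, q_target)          # squared-error Q loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```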
- Experience replay
- Problem: updating after every action means consecutive, similar states may bias the neural network.
- Solution: save transitions to memory. Start updating after some initial time period, and update with samples drawn uniformly from memory in the interval $(t-k_1, t-k_2)$. (Sketch below.)
- Schaul et al. (2016), Prioritized Experience Replay
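A minimal uniform replay-buffer sketch (illustrative; prioritized experience replay would sample in proportion to TD error instead of uniformly).

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample uniformly to break up correlated updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # uniform, without replacement

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer()
for t in range(100):
    buffer.push(t, 0, 0.0, t + 1, False)  # dummy transitions for illustration
batch = buffer.sample(32)
```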
- Action selection policies (sketch below)
- Most commonly used:
- $\epsilon$-greedy
- Select the highest-Q-value action $(1-\epsilon)$ of the time, randomly otherwise.
- Tokic (2010): can adapt $\epsilon$ depending on the state (smaller $\epsilon$ if the agent is certain about its state)
- $\epsilon$-soft $(1-\epsilon)$
- Opposite of $\epsilon$-greedy: select the highest-Q-value action $\epsilon$ of the time, randomly otherwise.
- Softmax
- $\sigma(\textbf{z})_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$ for $j=1,...,K$
- Outputs across all actions sum to one
- Key is exploration vs exploitation
- Agent may find itself stuck in a local maximum (it thinks e.g. a positive-reward action $Q_2$ is the best action because it hasn't found the better one, $Q_4$).
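A minimal sketch of the $\epsilon$-greedy and softmax selection rules (NumPy; the Q-values are made up).

```python
import numpy as np

q = np.array([1.0, 0.5, 2.0, -0.3])  # illustrative Q-values for 4 actions

def epsilon_greedy(q, eps=0.1):
    # Best action (1 - eps) of the time, uniformly random action otherwise.
    if np.random.rand() < eps:
        return np.random.randint(len(q))
    return int(np.argmax(q))

def softmax_action(q, temperature=1.0):
    # sigma(z)_j = exp(z_j) / sum_k exp(z_k); sample an action from this distribution.
    z = q / temperature
    probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    return int(np.random.choice(len(q), p=probs))

print(epsilon_greedy(q), softmax_action(q))
```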
- On-policy vs off-policy
- On-policy: update value with action actually taken
- Off-policy: update value with $\max_{a'} Q(s',a')$, i.e. no constraint on the action actually taken next.
- Policy Gradient Methods
- General Challenges
- Sensitive to the choice of step size
- Often have poor sample efficiency, taking millions or billions of steps to learn simple tasks
- Approaches:
- constraining or optimising size of policy update
- Trust Region Policy Optimisation (TRPO)
- [Implementation in PyTorch]
- Pros
- Good for continuous control tasks
- Cons
- 'isn’t easily compatible with algorithms that share parameters between a policy and value function or auxiliary losses'
- Proximal Policy Optimisation (PPO)
- Tries to minimise cost while ensuring the deviation from the previous policy is relatively small
- Implementation (sketch after this entry):
- $L^{CLIP}(\theta) = \hat{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t)]$
- $r_t(\theta)$: ratio of the probability of the taken action under the new and old policies respectively
- $\hat{A}_t$: estimated advantage at time t
- $\epsilon$: hyperparameter, usually 0.1 or 0.2
- Much simpler to implement than ACER
- Trust region update compatible with SGD
- OpenAI blog post
- PPO2
- GPU-enabled implementation of PPO by OpenAI.
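A minimal sketch of the clipped surrogate objective $L^{CLIP}$ above (assuming PyTorch; the log-probabilities and advantages are placeholders, and a full PPO implementation would add value-function and entropy terms).

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """L^CLIP: clipped surrogate objective (returned as a loss to minimise)."""
    ratio = torch.exp(log_probs_new - log_probs_old)          # r_t(theta) = pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # maximise objective = minimise negative

# Illustrative tensors standing in for a sampled batch.
log_new = torch.randn(64, requires_grad=True)
log_old = torch.randn(64)
adv = torch.randn(64)
print(ppo_clip_loss(log_new, log_old, adv))
```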
- Actor Critic with Experience Replay (ACER)
- Sample-efficient policy gradient algorithm
- Uses a replay buffer, so it can perform more than one gradient update using each piece of sampled experience, as well as a Q-function approximator trained with the Retrace algorithm.
- References:
- A3C (Asynchronous Advantage Actor-Critic)
- Actor-critic:
- Two outputs:
- Actor: outputs the Policy, i.e. a probability for each possible action $a_i$ (via softmax)
- Critic: outputs the Value of the state we're in, $V(s)$
- Asynchronous
- Multiple agents tackling the same environment, each initialised differently (different random seed)
- More experience to learn from
- Reduces chance of all agents being stuck in a local max
- Can combine N nets into one single net,
- where N = number of agents.
- So weights are shared.
- Agents share experience by contributing to a common critic
- Advantage
- Have two losses, one for each output (Value loss, policy loss)
- Value loss: typically the squared error between the predicted $V(s)$ and the observed discounted return (sketch below)
- Policy loss:
- Let Advantage A = Q(s,a) - V(s)
- How much better is the Q-value you're selecting compared to the 'known' V value across agents?
- Goal is to maximise advantage: encourages actions that have Q(s,a) > V.
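A minimal actor-critic sketch with the two heads and an advantage-weighted policy loss (assuming PyTorch; the placeholder returns, the loss weighting, and the absence of asynchronous workers and an entropy bonus are illustrative simplifications).

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with two heads: action probabilities (actor) and V(s) (critic)."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, s):
        h = self.body(s)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h).squeeze(-1)

model = ActorCritic()
states = torch.randn(16, 4)
actions = torch.randint(2, (16,))
returns = torch.randn(16)             # placeholder observed returns

probs, values = model(states)
advantage = returns - values          # A ≈ (observed return) - V(s)
policy_loss = -(torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1))
                * advantage.detach()).mean()  # encourage actions with positive advantage
value_loss = advantage.pow(2).mean()  # squared error between return and V(s)
loss = policy_loss + 0.5 * value_loss
loss.backward()
```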
- A2C (Synchronous A3C: Advantage Actor-Critic)
- A2C tends to be unstable due to occasional entropy collapse. (AI Safety Gridworlds, Nov 2017)
- Particularly sensitive to hyperparameter(s) relating to policy entropy
- Rainbow
- Combination of improvements in deep RL
- DQN
- Policy gradient methods
- Batch reinforcement learning
- Do not interact with the system during learning (Used e.g. in real-world industrial settings since unrestricted exploration can damage the system)
- RL: AI A to Z course