Glossary

(Work in progress. The focus is on terms that won't be part of most introductory courses since those definitions are easy to find and are usually in the WildML glossary.)

See also the basics glossary.

Contents:

Training methods

Models

Convolution-related layers

  • Dilated convolutions
    • Convolutions whose filter cells are regularly spaced out.
    • Purpose: the receptive field grows more quickly with depth, so the network can merge spatial information from a wider area of the input while keeping the filter size constant (see the sketch after this list).
  • Skip connections
    • Mappings (connections) that skip one or more layers.
      • E.g. Adds (a 1x1 convolution of) an earlier layer to the most recent network layer
        • Image from He et al., 2015.
    • Component of a 'deep residual layer'
      • Goal: help the network learn approximately identity layers (if that is what is locally optimal),
        • in which case the output of the most recent network layer should be approximately 0.
    • Introduced by He et al., Dec 2015, as part of deep residual networks, winner of ILSVRC 2015.
    • Also called residual connections or shortcut connections.
  • Decimation layer
    • Down-sampling, usually either through max-pooling or average pooling
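
A minimal PyTorch sketch combining the two ideas above: a residual block whose main path is a dilated 3x3 convolution and whose skip connection is a 1x1 convolution (as in He et al., 2015) so the shapes match at the addition. All names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=2):
        super().__init__()
        # Dilated 3x3 conv: padding=dilation keeps the spatial size constant,
        # while the receptive field grows faster than with dilation=1.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        # 1x1 conv on the skip connection so the shapes match for the addition.
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # If the residual path learns ~0, the block approximates the identity
        # (up to the 1x1 projection), which is the motivation for skip connections.
        return self.relu(self.conv(x) + self.shortcut(x))

x = torch.randn(1, 8, 32, 32)          # (batch, channels, height, width)
y = DilatedResidualBlock(8, 16)(x)
print(y.shape)                          # torch.Size([1, 16, 32, 32])
```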

Sequence-related layers

  • Dilated LSTMs
  • Nested LSTM
    • Use nesting as an approach to constructing temporal hierarchies in memory
    • selective access to inner memories -> frees inner memories to remember and process events on longer time scales
    • [Paper (Moniz et al., Jan 2018)]

Architectures

Bayesian Inference and Approximate Inference

  • Autoencoders
  • Variational Inference
  • Variational Autoencoder
    • Variational Lower Bound
  • MCMC (Markov Chain Monte Carlo)
  • Gibbs Sampling
  • Monte Carlo EM
  • EM
  • Bayesian Neural Networks
    • Laplace's approximation
    • Metropolis-Hastings
    • Hamiltonian Monte Carlo

Black-box optimisation

External memory

  • Neural Turing Machine
    • Neural network controller with read-write access to an external memory matrix
  • Differentiable Neural Computers
    • Neural network controller with read-write-erase access to an external memory matrix
  • Kanerva Machine
  • Differentiable Neural Dictionary (DND)
    • from Neural Episodic Control (Pritzel et al., 2017)
    • $M_a = (K_a, V_a)$, $K_a, V_a$ dynamically sized arrays of vectors, each containing the same number of vectors (1-1 correspondence, like a dictionary)
    • Operations
      1. Lookup: map key h to output o:
        • weighted sum of the values in memory, with weights given by the normalised closeness (kernel values) between the lookup key and each key in memory. A closer match gives a higher weight.
      2. Write (after query/lookup)
        • key = lookup key
        • value = 'application-specific', e.g. Q-value for RL
        • (k, v) appended to $K_a, V_a$. If key already exists, entry is updated instead of being duplicated.
    • In practice, lookups use a kNN-like approximation over the keys.
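
A minimal numpy sketch of the DND lookup and write operations described above. The inverse-distance kernel and the exact-match check are illustrative assumptions; the paper uses kernel-weighted lookups over an approximate kNN of the keys.

```python
import numpy as np

keys, values = [], []                 # K_a, V_a: same length, 1-1 correspondence

def kernel(h, k, delta=1e-3):
    # Inverse squared-distance kernel: closer key => larger (unnormalised) weight.
    return 1.0 / (np.sum((h - k) ** 2) + delta)

def lookup(h):
    # Output = weighted sum of the stored values; weights are the normalised
    # kernel values between the lookup key h and each stored key.
    w = np.array([kernel(h, k) for k in keys])
    w = w / w.sum()
    return np.sum(w[:, None] * np.array(values), axis=0)

def write(h, v):
    # Append (key, value); if the key already exists, update its value instead.
    for i, k in enumerate(keys):
        if np.allclose(k, h):
            values[i] = v
            return
    keys.append(h)
    values.append(v)

write(np.array([0.0, 1.0]), np.array([2.0]))
write(np.array([1.0, 0.0]), np.array([4.0]))
print(lookup(np.array([0.1, 0.9])))   # dominated by the first value, ~2.0
```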

Other models

  • Boltzmann Machines
  • Hopfield Networks
  • Linear Factor Models
    • Independent Component Analysis (ICA)
    • Sparse Coding
  • Wake-sleep
  • Finite-state machine (Abstract model)
    • can be in one of a finite number of states $s_t \in S$
    • can change from one state to another in response to an input
    • $s_{t+1} = f(x_t, s_t)$, where $s_t \in S$ for all $t$ and $|S|$ is finite
    • memory is limited by the number of states the FSM has, so it cannot perform some tasks that a Turing machine can (see the sketch after this list)
  • Turing machine (Abstract model)
    • Infinite memory tape divided into discrete cells
    • Finite table of user-specified instructions
    • HEAD positioned over a cell.
      • READS symbol from cell,
      • LOOKS UP symbol read in finite table of user-specified instructions
      • WRITES in cell
      • MOVES 1 left or right
      • either CARRIES OUT instruction or HALTS computation (indicated in table of user-specified instructions)
    • For any algorithm, a Turing machine capable of simulating that algorithm's logic can be constructed.
      • (Turing completeness: the ability of a system of instructions to simulate a Turing machine, and hence theoretically to express any task accomplishable by a computer; nearly all programming languages are Turing-complete if the limitations of finite memory are ignored.)
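
A minimal Python sketch of the finite-state machine definition above: a transition function $f(x_t, s_t)$ over a finite state set. The turnstile-style states and inputs are made up for illustration.

```python
TRANSITIONS = {            # f: (state, input) -> next state
    ("locked", "coin"): "unlocked",
    ("locked", "push"): "locked",
    ("unlocked", "push"): "locked",
    ("unlocked", "coin"): "unlocked",
}

def run_fsm(inputs, state="locked"):
    for x in inputs:
        state = TRANSITIONS[(state, x)]   # s_{t+1} = f(x_t, s_t)
    return state

print(run_fsm(["coin", "push", "push"]))  # -> "locked"
```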

Metrics and measures

  • Alpha-divergence
    • Special cases:
      • Alpha=0: Variational Bayes
      • Alpha=1: Expectation Propagation
      • TODO: what is a 'mode' of a posterior p? What does it mean by a solution that aims to cover multiple modes?

Reinforcement Learning

  • Intuition of RL:

    • Loop through two steps:
      • Agent performs action.
      • State may change, agent may get reward.
    • Agent explores the environment by taking actions.
    • Actions take place over time (the agent acts sequentially over time steps)
    • Procedures are not pre-programmed into the agent, but the agent knows its list of available actions
  • Bellman Equation

    • $V(s) = \max_{a}(R(s,a)+\gamma E[V(s')])$
      • where $\gamma$ is the discount factor.
      • Deterministic version: $V(s) = \max_{a}(R(s,a)+\gamma V(s'))$
      • Expanded for MDPs: $V(s) = \max_{a}(R(s,a)+\gamma \sum_{s'} P(s,a,s')V(s'))$
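
A minimal numpy sketch of value iteration using the expanded Bellman equation above, on a made-up 2-state, 2-action MDP ($P$ and $R$ are assumptions for illustration).

```python
import numpy as np

gamma = 0.9
# P[s, a, s'] = transition probability, R[s, a] = immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(200):
    # V(s) = max_a ( R(s,a) + gamma * sum_{s'} P(s,a,s') V(s') )
    V = np.max(R + gamma * P @ V, axis=1)
print(V)   # converged state values
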
  • Plans vs Policies:

    • Plans comprise the optimal action for each state, with no stochasticity. Policies incorporate stochasticity.
  • Deterministic vs non-deterministic search:

    • Deterministic search: Agent's intention maps 100% to agent's action.
    • Non-deterministic search: there is a small chance of the agent acting differently from how it intends to act
  • Markov Decision Processes (MDP)

    • Mathematical framework for modelling decision-making where outcomes are partly random and partly under the control of a decision-maker
    • Markov Property:
      • Memorylessness: the conditional probability distribution of future states depends only on the present state
    • Associated Bellman eqn: $V(s) = \max_{a}(R(s,a)+\gamma E[V(s')])$
      • i.e. $V(s) = \max_{a}(R(s,a)+\gamma \sum_{s'} P(s,a,s')V(s'))$
  • Q-learning

    • Assign values to actions, $Q(s,a_i)$, instead of to states
      • $Q(s,a) = R(s,a)+\gamma \sum_{s'} P(s,a,s')V(s')$
        • i.e. $Q(s,a) = R(s,a)+\gamma \sum_{s'} P(s,a,s')\max_{a'}Q(s',a')$
  • Temporal Difference

    • TODO: refine
    • (Consider Q-learning under deterministic search for convenience.)
    • $TD_t(a,s) = Q_t(s,a) - Q_{t-1}(s,a) = R(s,a)+\gamma\max_{a'}Q(s',a') - Q_{t-1}(s,a)$
    • $TD(a,s)$ may be nonzero because of randomness in the environment (even though we've written the deterministic-search version of the update here).
  • Update eqn: $Q_t(s,a) = Q_{t-1}(s,a) + \alpha TD_t(a,s)$

    • $\alpha$ is the learning rate.
    • Hope: algorithm will converge to the 'correct' Q-value, unless the environment is constantly changing.
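
A minimal numpy sketch of the tabular update above: $Q_t(s,a) = Q_{t-1}(s,a) + \alpha\, TD_t(a,s)$. The table size and the single hard-coded transition are assumptions; interaction with an environment is not shown.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def q_update(s, a, r, s_next):
    # TD = R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td
    return td

print(q_update(s=0, a=1, r=1.0, s_next=3))   # first TD error = 1.0
```
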
  • Living penalty

    • e.g. small negative reward when entering each non-terminal state to motivate agent to finish the game quickly
  • Successor Representation

  • Options framework

    • involves abstractions over the space of actions
    • at each step, the agent chooses either a one-step 'primitive' action or a 'multi-step' action policy (option). Each option defines a policy over actions (either primitive or other options) and can be terminated according to a stochastic termination condition $\beta$.
    • Paper: Sutton et al. Definition from Kulkarni, Narasimhan et al. (2016).

Deep Reinforcement Learning

  • Deep Q-learning
    • Learning: feed the state into the NN; the final layer gives q-values for each action.
      • Compare the predicted q-value to the observed target value: loss $L = \sum (Q_{\text{observed}} - Q_{\text{pred}})^2$
      • Learning happens for each state visited.
    • Acting: put the final layer through a softmax (or some other action-selection policy, see below) and select the corresponding action.
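
A minimal PyTorch sketch (network, shapes, and hyperparameters are assumptions) of the two uses of the Q-network described above: a squared TD-target loss for learning, and a softmax over q-values for acting.

```python
import torch
import torch.nn as nn

n_state_dims, n_actions, gamma = 4, 3, 0.99
q_net = nn.Sequential(nn.Linear(n_state_dims, 32), nn.ReLU(),
                      nn.Linear(32, n_actions))

state = torch.randn(1, n_state_dims)
q_pred = q_net(state)                              # q-values for each action

# Learning: squared error between the predicted q-value of the taken action
# and the observed target r + gamma * max_a' Q(s', a').
action, reward = 1, torch.tensor(1.0)
next_state = torch.randn(1, n_state_dims)
q_target = reward + gamma * q_net(next_state).max()
loss = (q_target.detach() - q_pred[0, action]) ** 2

# Acting: softmax over the q-values, then sample the corresponding action.
probs = torch.softmax(q_pred, dim=1)
action = torch.multinomial(probs, num_samples=1).item()
```
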
  • Experience replay

    • Problem: Update after every action, so consecutive states that are similar may bias the neural network.
    • Solution: Save state information. Start updating after some initial time period, and update with states drawn uniformly from memory in the interval $(t-k_1, t-k_2)$.
    • Schaul et al. (2016), Prioritized Experience Replay
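
A minimal sketch of a uniform replay buffer as described above: transitions are stored, and minibatches are later sampled uniformly at random so that consecutive (correlated) states are not used back-to-back for updates. Capacity and batch size are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling; prioritised replay (Schaul et al., 2016) would
        # instead weight transitions, e.g. by their TD error.
        return random.sample(self.buffer, batch_size)
```
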
  • Action selection policies

    • Most commonly used:
      • $\epsilon$-greedy
        • Select highest q-value action $(1-\epsilon)$ of the time, randomly otherwise.
          • Tokic (2010): can adapt $\epsilon$ depending on the state (smaller $\epsilon$ if the agent is certain about its state)
      • $\epsilon$-soft $(1-\epsilon)$
        • Opposite of $\epsilon$-greedy: select highest q-value action $\epsilon$ of the time, randomly otherwise.
      • Softmax
        • $\sigma(\textbf{z})_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$ for $j=1,...,K$.
        • Outputs across all actions sum to one
    • Key is exploration vs exploitation
      • Agent may find itself stuck in a local maximum (thinks e.g. a positive-reward action $Q_2$ is the best action because it hasn't found the better one $Q_4$.)
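
A minimal numpy sketch of two of the action-selection policies above: $\epsilon$-greedy and a softmax over q-values. The example q-values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # Best action (1 - epsilon) of the time, a uniformly random action otherwise.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax_action(q_values):
    # sigma(z)_j = exp(z_j) / sum_k exp(z_k); subtract the max for numerical stability.
    z = q_values - np.max(q_values)
    probs = np.exp(z) / np.sum(np.exp(z))
    return int(rng.choice(len(q_values), p=probs))

q = np.array([0.1, 0.5, 0.2])
print(epsilon_greedy(q), softmax_action(q))
```
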
  • On-policy vs off-policy

    • On-policy: update value with action actually taken
    • Off-policy: update the value with $\max_{a'} Q(s',a')$, i.e. no constraint from the action actually taken next.
  • Policy Gradient Methods

    • General Challenges
      • Sensitive to the choice of step size
      • Often have poor sample efficiency, taking millions or billions of steps to learn simple tasks
      • Approaches:
        • constraining or optimising size of policy update
    • Trust Region Policy Optimisation (TRPO)
      • [Implementation in PyTorch]
      • Pros
        • Good for continuous control tasks
      • Cons
        • 'isn’t easily compatible with algorithms that share parameters between a policy and value function or auxiliary losses'
    • Proximal Policy Optimisation (PPO)
      • Tries to minimise cost while ensuring the deviation from the previous policy is relatively small
      • Implementation:
        • $L^{CLIP}(\theta) = \hat{E}_t[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta),1-\epsilon,1+\epsilon)\hat{A}_t)]$
          • $r_t(\theta)$: ratio of the action's probability under the new policy to its probability under the old policy (see the sketch at the end of this entry)
          • $\hat{A_t}$: estimated advantage at time t
          • $\epsilon$: hyperparameter, usually 0.1 or 0.2
        • Much simpler to implement than ACER
        • Trust region update compatible with SGD
      • OpenAI blog post
    • PPO2
      • GPU-enabled implementation of PPO by OpenAI.
    • Actor Critic with Experience Replay (ACER)
      • Sample-efficient policy gradient algorithm
      • Uses a replay buffer, so it can perform more than one gradient update using each piece of sampled experience, as well as a Q-function approximator trained with the Retrace algorithm.
    • References:
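
A minimal PyTorch sketch of the PPO clipped objective $L^{CLIP}$ above, assuming the per-timestep probability ratios $r_t$ and advantage estimates $\hat{A}_t$ have already been computed; the example values are made up.

```python
import torch

def ppo_clip_loss(ratio, advantage, epsilon=0.2):
    # ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Negate because optimisers minimise; L^CLIP itself is maximised.
    return -torch.mean(torch.min(unclipped, clipped))

ratio = torch.tensor([1.3, 0.7, 1.05])
advantage = torch.tensor([1.0, -0.5, 2.0])
print(ppo_clip_loss(ratio, advantage))
```
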
  • A3C (Asynchronous Advantage Actor-Critic)

    • Actor-critic:
      • Two outputs:
        1. Actor: outputs the policy, i.e. probabilities over the possible actions $a_i$, via a softmax over the final layer
        2. Critic: outputs Value of state we're in $V(s)$
    • Asynchronous
      • Multiple agents tackling the same environment, each initialised differently (different random seed)
        • More experience to learn from
        • Reduces chance of all agents being stuck in a local max
        • Can combine the N nets into one single net,
          • where N = number of agents,
          • so that weights are shared.
      • Agents share experience by contributing to a common critic
    • Advantage
      • Have two losses, one for each output (Value loss, policy loss)
      • Value loss: TODO (fill in)
      • Policy loss:
        • Let advantage $A = Q(s,a) - V(s)$
          • How much better is the Q-value you're selecting compared to the 'known' V value across agents?
        • Goal is to maximise the advantage: this encourages actions with $Q(s,a) > V(s)$.
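
A minimal PyTorch sketch (networks, shapes, and the one-step TD target are assumptions) of the two losses described above for a single transition: a value loss for the critic and an advantage-weighted policy loss for the actor.

```python
import torch
import torch.nn as nn

n_state_dims, n_actions, gamma = 4, 3, 0.99
actor = nn.Sequential(nn.Linear(n_state_dims, 32), nn.ReLU(),
                      nn.Linear(32, n_actions))      # policy logits
critic = nn.Sequential(nn.Linear(n_state_dims, 32), nn.ReLU(),
                       nn.Linear(32, 1))             # V(s)

state, next_state = torch.randn(1, n_state_dims), torch.randn(1, n_state_dims)
action, reward = 1, torch.tensor(1.0)

value = critic(state).squeeze()
target = reward + gamma * critic(next_state).squeeze().detach()
advantage = target - value                            # estimate of Q(s,a) - V(s)

value_loss = advantage.pow(2)                         # critic regression loss
log_prob = torch.log_softmax(actor(state), dim=1)[0, action]
policy_loss = -log_prob * advantage.detach()          # maximise the advantage
```
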
  • A2C (Synchronous A3C: Advantage Actor-Critic)

    • A2C tends to be unstable due to occasional entropy collapse. (AI Safety Gridworlds, Nov 2017)
    • Particularly sensitive to hyperparameter(s) relating to policy entropy
  • Rainbow

    • Combination of several improvements to DQN in deep RL (e.g. double Q-learning, prioritised replay, duelling networks, multi-step targets, distributional RL, noisy nets)
  • DQN

  • Policy gradient methods

Other RL

  • Batch reinforcement learning
    • The agent does not interact with the system during learning; it learns from a fixed batch of previously collected data (used e.g. in real-world industrial settings, since unrestricted exploration can damage the system)

References:

  • RL: AI A to Z course