diff --git a/lectures/html/L6_off_policy.html b/lectures/html/L6_off_policy.html
index b91faf93..89843f28 100755
--- a/lectures/html/L6_off_policy.html
+++ b/lectures/html/L6_off_policy.html
@@ -166,7 +166,7 @@

TD-based Q Function Learning

  • In continuous deterministic Q-learning, the update target becomes $r+\gamma Q(s', \pi_{\phi}(s'))$ (see the sketch below).
  • In the literature, the policy network is also known as the "actor" and the value network as the "critic".
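    A minimal sketch of computing this actor-critic TD target for a batch of transitions, assuming PyTorch and toy network sizes (all names and dimensions here are illustrative, not the lecture's code):

import torch
import torch.nn as nn

state_dim, action_dim, gamma = 3, 1, 0.99

# "critic" Q(s, a) and "actor" pi_phi(s), following the slide's terminology
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())

def td_target(reward, next_state, done):
    # a' = pi_phi(s');  target = r + gamma * Q(s', a') for non-terminal transitions
    with torch.no_grad():
        next_action = actor(next_state)
        next_q = critic(torch.cat([next_state, next_action], dim=-1))
        return reward + gamma * (1.0 - done) * next_q

# toy batch of transitions
reward = torch.zeros(8, 1)
next_state = torch.randn(8, state_dim)
done = torch.zeros(8, 1)
print(td_target(reward, next_state, done).shape)  # torch.Size([8, 1])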
@@ -341,10 +341,10 @@

    Issue: Rare Beneficial Samples
    in the Replay Buffer

  • Examples: Montezuma's Revenge, Blind Cliffwalk, or any long-horizon sparse-reward problem.
    @@ -358,7 +358,7 @@

    Blind Cliffwalk

  • The episode is terminated whenever the agent takes the $\color{red}{\text{wrong}}$ action.
  • The agent receives a reward of $1$ after taking $n$ $\color{black}{\text{right}}$ actions (see the sketch below).
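    A minimal sketch of the environment as described above (the class name, action encoding, and interface are assumptions for illustration, not Schaul et al.'s code):

import random

class BlindCliffwalk:
    def __init__(self, n):
        self.n = n          # number of "right" actions needed for the reward of 1
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 = "right", action 0 = "wrong"; returns (state, reward, done)
        if action != 1:
            return self.state, 0.0, True       # wrong action: episode terminates
        self.state += 1
        if self.state == self.n:
            return self.state, 1.0, True       # n right actions taken: reward 1
        return self.state, 0.0, False

# Under uniform random actions, a rewarding episode occurs with probability (1/2)^n,
# which is why such transitions are rare in the replay buffer.
env = BlindCliffwalk(n=10)
state, done, total = env.reset(), False, 0.0
while not done:
    state, r, done = env.step(random.randint(0, 1))
    total += r
print(total)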
    Schaul et al., Prioritized experience replay (ICLR 2016)
    @@ -373,7 +373,7 @@

    Analysis with Q-Learning

    current state.
    Schaul et al., Prioritized experience replay (ICLR 2016)
    @@ -485,7 +485,7 @@

    Value Network with Discrete Distribution

  • $Q_\theta(s, a) = \sum_i p_{\theta, i}(s, a) z_i$ (see the sketch below).
  • Update rule of $Q$, where $x_t$ is the state at time-step $t$.
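    A minimal sketch of the expectation above, $Q_\theta(s, a) = \sum_i p_{\theta, i}(s, a) z_i$, with a categorical value network over a fixed support (the atom count, value range, and network sizes are assumptions for illustration):

import torch
import torch.nn as nn

state_dim, n_actions, n_atoms = 8, 4, 51
v_min, v_max = -10.0, 10.0
support = torch.linspace(v_min, v_max, n_atoms)        # the atoms z_i

# the network outputs one categorical distribution p_theta(. | s, a) per action
net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                    nn.Linear(128, n_actions * n_atoms))

def q_values(state):
    logits = net(state).view(-1, n_actions, n_atoms)
    probs = torch.softmax(logits, dim=-1)              # p_{theta, i}(s, a)
    return (probs * support).sum(dim=-1)               # Q_theta(s, a) = sum_i p_i * z_i

state = torch.randn(2, state_dim)
print(q_values(state).shape)  # torch.Size([2, 4])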
@@ -573,7 +573,7 @@

    Ablation study of tricks in Rainbow

    Hessel et al., Rainbow: Combining Improvements in Deep Reinforcement Learning (AAAI 2018)
diff --git a/lectures/html/L7_exploration.html b/lectures/html/L7_exploration.html
index e6016f5e..5f3150ce 100755
--- a/lectures/html/L7_exploration.html
+++ b/lectures/html/L7_exploration.html
@@ -67,7 +67,7 @@

    Motivation to Explore vs Exploit

  • A new restaurant is always a risk (unknown food quality, service, etc.).
  • But without going to the new restaurant you never know! How do we balance this?

    Why Exploration is Difficult

    @@ -80,7 +80,7 @@

    Why Exploration is Difficult

  • Even exploration in a low-dimensional space may be tricky when there are "alleys": the probability of passing through a small gap and then exploring the states beyond it is low.
    @@ -88,7 +88,7 @@

    Exploration to Escape Local Minima in Reward

    @@ -100,7 +100,7 @@

    Knowing what to explore is critical

    @@ -145,8 +145,8 @@

    Multi-Armed Bandits

  • In this game you face a few slot machines ("arms") and choose one to pull; you then receive a reward sampled from that arm's unknown distribution.
    @@ -184,7 +184,7 @@

    Multi-Armed Bandits

  • The goal is to maximize the cumulative reward $\sum_{t=1}^T r_t$ (see the sketch below).
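    A minimal sketch of this setting with epsilon-greedy action selection (the algorithm choice, reward model, and constants below are assumptions for illustration):

import random

n_arms, T, eps = 5, 1000, 0.1
true_means = [random.random() for _ in range(n_arms)]       # unknown to the agent
counts = [0] * n_arms
estimates = [0.0] * n_arms                                  # running mean reward per arm
cumulative_reward = 0.0

for t in range(T):
    if random.random() < eps:
        arm = random.randrange(n_arms)                      # explore
    else:
        arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
    r = random.gauss(true_means[arm], 1.0)                  # reward from the arm's unknown distribution
    counts[arm] += 1
    estimates[arm] += (r - estimates[arm]) / counts[arm]    # incremental mean update
    cumulative_reward += r

print(f"cumulative reward over T={T}: {cumulative_reward:.1f}")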
@@ -275,7 +275,7 @@

    Total Regret Decomposition
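  • As a point of reference (this restates the standard result from the bandit literature; that the slide uses the same notation is an assumption here), the total regret decomposes over arms as $L_T = \mathbb{E}\left[\sum_{t=1}^{T} \left(V^* - Q(a_t)\right)\right] = \sum_{a} \mathbb{E}[N_T(a)]\, \Delta_a$, where $\Delta_a = V^* - Q(a)$ is the gap of arm $a$ and $N_T(a)$ is the number of times arm $a$ has been selected up to time $T$.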

    Desirable Total Regret Behavior

    Gif from https://brilliant.org/wiki/gaussian-mixture-model/, which is also an easy resource for learning how EM works.
    @@ -884,7 +884,7 @@

    Random Network Distillation Performance


    Burda et al., Random Network Distillation
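    A minimal sketch of the RND intrinsic reward (an illustration under assumed shapes and names, not Burda et al.'s implementation): a fixed, randomly initialized target network and a trained predictor network, with the prediction error on an observation serving as the exploration bonus.

import torch
import torch.nn as nn

obs_dim, feat_dim = 16, 32
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)          # the target network stays fixed

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs):
    # prediction error is larger on rarely visited observations -> larger exploration bonus
    return (predictor(obs) - target(obs)).pow(2).mean(dim=-1)

obs = torch.randn(8, obs_dim)
bonus = intrinsic_reward(obs)
loss = bonus.mean()                  # training the predictor shrinks the bonus on seen observations
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(bonus.detach())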