
Commit

fixes
StoneT2000 committed Apr 25, 2024
1 parent 4f6dc75 commit 9b06ff0
Showing 2 changed files with 19 additions and 19 deletions.
14 changes: 7 additions & 7 deletions lectures/html/L6_off_policy.html
@@ -166,7 +166,7 @@ <h1 class="nt">TD-based Q Function Learning</h1>
<li>In continuous deterministic Q-learning, the update target becomes $r+\gamma Q(s', \pi_{\phi}(s'))$.</li>
<li>In the literature, the policy network is also known as the "actor" and the value network as the "critic".</li>
</ul>
<img src="./L16/ddpg_networks.png" width="80%"/>
<img src="./L6/ddpg_networks.png" width="80%"/>
</div>
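A minimal sketch of how the continuous deterministic update target $r+\gamma Q(s', \pi_{\phi}(s'))$ above could be computed in PyTorch; the network sizes and the names actor/critic/obs_dim/act_dim are illustrative assumptions, and the target copies of both networks used in practice are omitted for brevity.

    # Sketch: DDPG-style critic target r + gamma * Q(s', pi_phi(s')).
    # Architectures and dimensions are illustrative, not the course's code.
    import torch
    import torch.nn as nn

    obs_dim, act_dim, gamma = 8, 2, 0.99
    actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
    critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def critic_target(r, s_next, done):
        # No gradient flows through the target.
        with torch.no_grad():
            a_next = actor(s_next)                                # pi_phi(s')
            q_next = critic(torch.cat([s_next, a_next], dim=-1))  # Q(s', pi_phi(s'))
            return r + gamma * (1.0 - done) * q_next.squeeze(-1)

    s_next, r, done = torch.randn(32, obs_dim), torch.randn(32), torch.zeros(32)
    y = critic_target(r, s_next, done)  # regression target for the critic ("actor" = policy net)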

<!-- ################################################################### -->
@@ -341,10 +341,10 @@ <h1 class="nt"> Issue: Rare Beneficial Samples <br/>in the Replay Buffer </h1>
<li> Examples: Montezuma's Revenge, Blind Cliffwalk, any long-horizon sparse-reward problem </li>
<div class="row">
<div class="column" style="flex: 10%">
<img src="./L16/MR.png" width="75%"></img>
<img src="./L6/MR.png" width="75%"></img>
</div>
<div class="column" style="flex: 10%">
<img src="./L16/Blind_Cliffwalk.png" width="100%"></img>
<img src="./L6/Blind_Cliffwalk.png" width="100%"></img>
</div>
</div>
</ul>
@@ -358,7 +358,7 @@ <h1 class="nt"> Blind Cliffwalk </h1>
<li> Episode is terminated whenever the agent takes the $\color{red}{\text{wrong}}$ action. </li>
<li> Agent will get reward $1$ after taking $n$ $\color{black}{\text{right}}$ actions. </li>
<div style=margin-top:10px>
<img src="./L16/Blind_Cliffwalk.png" width="75%"/>
<img src="./L6/Blind_Cliffwalk.png" width="75%"/>
</div>
</ul>
<div class="credit"><a href="https://arxiv.org/pdf/1511.05952.pdf"> Schaul et. al, Prioritized experience replay (ICLR 2016)</a> </div>
@@ -373,7 +373,7 @@ <h1 class="nt"> Analysis with Q-Learning</h1>
current state. </li>

<div style=margin-top:10px>
<img src="./L16/Blind_Cliffwalk_QL.png" width="35%"/>
<img src="./L6/Blind_Cliffwalk_QL.png" width="35%"/>
</div>
</ul>
<div class="credit"><a href="https://arxiv.org/pdf/1511.05952.pdf"> Schaul et. al, Prioritized experience replay (ICLR 2016)</a> </div>
@@ -485,7 +485,7 @@ <h1 class="nt"> Value Network with Discrete Distribution </h1>
<li> $Q_\th(s, a) = \sum_i p_{\th, i}(s, a) z_i$. </li>
<li> Update rules for $Q$, where $x_t$ is the state at time step $t$.
<div style=margin-top:10px>
<img src="./L16/DRL.png" width="42%"/>
<img src="./L6/DRL.png" width="42%"/>
</div>
</li>
</ul>
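A short sketch of reading the Q-value off a categorical value distribution over fixed atoms (C51-style), i.e. the expectation $\sum_i p_i z_i$ from the bullet above; the atom count, value range, and the placeholder network output are illustrative assumptions.

    # Sketch: Q(s, a) as the expectation of a categorical distribution over fixed atoms z_i.
    import numpy as np

    n_atoms, v_min, v_max = 51, -10.0, 10.0
    z = np.linspace(v_min, v_max, n_atoms)             # support z_1, ..., z_N

    def q_from_distribution(logits):
        # logits: (n_actions, n_atoms) unnormalized scores for one state
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)              # p_{theta,i}(s, a)
        return p @ z                                    # expected return per action

    logits = np.random.randn(4, n_atoms)                # placeholder network output, 4 actions
    q_values = q_from_distribution(logits)
    greedy_action = int(np.argmax(q_values))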
@@ -573,7 +573,7 @@ <h1 class="nt"> Ablation study of tricks in Rainbow </h1>
<ul>
<li> Prioritized replay, multi-step learning, and distributional RL are the most important tricks in Rainbow. </li>
<div style=margin-top:10px>
<img src="./L16/Rainbow.png" width="42%"/>
<img src="./L6/Rainbow.png" width="42%"/>
</div>
</ul>
<div class="credit"><a href="https://arxiv.org/pdf/1710.02298.pdf">Hessel et. al, Rainbow: Combining Improvements in Deep Reinforcement Learning (AAAI 2018)</a> </div>
24 changes: 12 additions & 12 deletions lectures/html/L7_exploration.html
@@ -67,7 +67,7 @@ <h1 class="nt">Motivation to Explore vs Exploit</h1>
<li>A new restaurant is always a risk (unknown food quality, service, etc.)</li>
<li>But without going to the new restaurant you never know! How do we balance this?</li>
</ul>
<img src="./SP24_L7/exploration_vs_exploitation.png" alt="" width="50%">
<img src="./L7/exploration_vs_exploitation.png" alt="" width="50%">
</div>
<div class="step slide">
<h1 class="nt">Why Exploration is Difficult</h1>
@@ -80,15 +80,15 @@ <h1 class="nt">Why Exploration is Difficult</h1>
</ul>
</li>
<li>Even exploration in a low-dimensional space can be tricky when there are "alleys": there is only a low probability of passing through the small gaps needed to explore other states:</li>
<img src="./SP24_L7/simple_2d_map.png" alt="" width="30%" />
<img src="./L7/simple_2d_map.png" alt="" width="30%" />
</ul>
</div>
<div class="step slide">
<h1 class="nt">Exploration to Escape Local Minima in Reward</h1>
<ul>
<li>Suppose your dense reward for the environment below is the negative Euclidean distance to the flag. The return-maximizing sequence of actions is to go through the small gap and reach the flag (the global optimum).</li>
<li>But you will never know to do that unless you explore, and with this dense reward function your trained agent will likely headbutt into the blue wall (a local optimum).</li>
<img src="./SP24_L7/simple_2d_map_headbutt.png" alt="" width="30%" />
<img src="./L7/simple_2d_map_headbutt.png" alt="" width="30%" />
</ul>
</div>
<div class="step slide">
@@ -100,7 +100,7 @@ <h1 class="nt">Knowing what to explore is critical</h1>
</ul>
<ul class="substep">
<li >The agent will become a couch potato and stare at the TV all day.</li>
<img src="./SP24_L7/the-noisy-TV-problem.gif" alt="" width="100%">
<img src="./L7/the-noisy-TV-problem.gif" alt="" width="100%">
</ul>
</div>

@@ -145,8 +145,8 @@ <h1 class="nt">Multi-Armed Bandits</h1>
<li>In this game you have a few slot machines (arms) and can choose one to pull. You then receive a reward sampled from that arm's unknown distribution.</li>
</ul>
<div style="display: flex">
<img src="./SP24_L7/slots.png" alt="" width="30%" style="display:inline-block">
<img src="./SP24_L7/slot-dist.png" alt="" width="30%" style="display:inline-block">
<img src="./L7/slots.png" alt="" width="30%" style="display:inline-block">
<img src="./L7/slot-dist.png" alt="" width="30%" style="display:inline-block">
</div>
</div>
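A small simulation of the slot-machine game just described: each arm has an unknown reward distribution, and an $\epsilon$-greedy player trades off exploring arms against exploiting its current best estimate. The arm means, $\epsilon$, and horizon are illustrative assumptions.

    # Sketch: multi-armed bandit with epsilon-greedy action selection.
    import numpy as np

    rng = np.random.default_rng(0)
    true_means = np.array([0.2, 0.5, 0.7])    # unknown to the player
    K, T, epsilon = len(true_means), 5000, 0.1

    counts, estimates, total_reward = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        if rng.random() < epsilon:
            a = rng.integers(K)                # explore: pull a random arm
        else:
            a = int(np.argmax(estimates))      # exploit: pull the best-looking arm
        r = rng.normal(true_means[a], 1.0)     # reward sampled from the unknown distribution
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]   # incremental mean update
        total_reward += r

    regret_estimate = T * true_means.max() - total_reward  # shortfall vs. always pulling the best arm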
<!-- ################################################################### -->
@@ -184,7 +184,7 @@ <h1 class="nt">Multi-Armed Bandits</h1>
<li>
Goal is to maximize cumulative reward $\sum_{t=1}^T r_t$
</li>
<img src="./SP24_L7/slot-dist.png" alt="" width="30%">
<img src="./L7/slot-dist.png" alt="" width="30%">
</li>
</ul>
</div>
@@ -275,7 +275,7 @@ <h1 class="nt">Total Regret Decomposition</h1>
<!-- ################################################################### -->
<div class="step slide">
<h1 class="et">Desirable Total Regret Behavior</h1>
<img src="./SP24_L7/regret_as_function_of_time.png" width="1000" height="650"></img>
<img src="./L7/regret_as_function_of_time.png" width="1000" height="650"></img>

<ul>
<li>What can you infer from this figure?</li>
@@ -331,7 +331,7 @@ <h1 class="vt">Decaying $\epsilon$-Greedy Algorithm</h1>
<div class="step slide">
<h1 class="et">The Principle of Optimism in the Face of Uncertainty</h1>
<div class="row" style="flex: 0%">
<img src="./SP24_L7/optimism_before.png" width="900" height="600" />
<img src="./L7/optimism_before.png" width="900" height="600" />
</div>
<ul>

@@ -568,7 +568,7 @@ <h1 class="nt">Counting via Hashing: Autoencoders</h1>
<li>$\mathcal{L}(\{s_n\}_{n=1}^N) = \underbrace{-\frac{1}{N} \sum_{n=1}^N \log p(s_n)}_\text{reconstruction loss} + \underbrace{\frac{1}{N} \frac{\lambda}{K} \sum_{n=1}^N\sum_{i=1}^K \min \big \{ (1-b_i(s_n))^2, b_i(s_n)^2 \big\}}_\text{sigmoid activation being closer to binary}$</li>
<!-- <li>Intuition for autoencoder: Effectively mapping high dimensional state to lower dimensions (a code), and then decoding that into a binary </li> -->

<img src="./SP24_L7/autoencoder.png" width="30%">
<img src="./L7/autoencoder.png" width="30%">
</ul>
<div class="credit"><a href="https://arxiv.org/abs/1611.04717">Tang et. al, # Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning</a></div>
<div class="credit"><a href="https://lilianweng.github.io/posts/2020-06-07-exploration-drl/">Lilliang Weng, Exploration Strategies in Deep Reinforcement Learning</a></div>
@@ -643,7 +643,7 @@ <h1 class="nt">Quick Refresher on GMM</h1>
<li>Training a Gaussian Mixture Model initializes $k$ different Gaussians and fits them to the data. Suppose, for example, that our state space has 2 dimensions, as below.</li>
<li>The GMM is our density model and generates the probability $p_t(s)$ of seeing some input state.</li>
<li>It is typically optimized via Expectation Maximization (EM).</li>
<img src="./SP24_L7/gaussian_mixture_model.gif"/>
<img src="./L7/gaussian_mixture_model.gif"/>

</ul>
<div class="ack">Gif from <a href="https://brilliant.org/wiki/gaussian-mixture-model/">https://brilliant.org/wiki/gaussian-mixture-model/</a>, which is also a easy resource to learn how EM works.</div>
@@ -884,7 +884,7 @@ <h1 class="nt">Random Network Distillation Performance</h1>
</div>
<div class="step slide">
<h1 class="nt">Random Network Distillation Performance</h1>
<img src="./SP24_L7/rnd_fig.png" width="65%"/>
<img src="./L7/rnd_fig.png" width="65%"/>
<div class="credit"><a href="https://arxiv.org/pdf/1606.01868.pdf">Burda et. al, Random Network Distillation</a></div>
</div>
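The slides here only report performance, so as a reminder of the mechanism behind the figure: Random Network Distillation trains a predictor network to match a fixed, randomly initialized target network, and uses the prediction error on a state as the intrinsic (novelty) reward. The sketch below follows that idea; architectures, dimensions, and the learning rate are illustrative assumptions.

    # Sketch: Random Network Distillation intrinsic reward.
    import torch
    import torch.nn as nn

    obs_dim, feat_dim = 8, 32
    target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
    predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
    for p in target.parameters():
        p.requires_grad_(False)                 # the random target network is never trained

    opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

    def intrinsic_reward_and_update(states):
        with torch.no_grad():
            phi = target(states)
        pred = predictor(states)
        err = ((pred - phi) ** 2).mean(dim=-1)  # per-state prediction error
        opt.zero_grad()
        err.mean().backward()                   # distill the random target into the predictor
        opt.step()
        return err.detach()                     # novel states -> large error -> large bonus

    bonus = intrinsic_reward_and_update(torch.randn(16, obs_dim))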

