diff --git a/lectures/html/L6_off_policy.html b/lectures/html/L6_off_policy.html
index b91faf93..89843f28 100755
--- a/lectures/html/L6_off_policy.html
+++ b/lectures/html/L6_off_policy.html
@@ -166,7 +166,7 @@
TD-based Q Function Learning
In continuous deterministic Q-learning, the update target becomes $r+\gamma Q(s', \pi_{\phi}(s'))$.
In the literature, the policy network is also known as the "actor" and the value network as the "critic".
-
+
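As a concrete illustration, here is a minimal PyTorch-style sketch of one critic update toward the target $r+\gamma Q(s', \pi_{\phi}(s'))$; the `actor`, `critic`, and target-network arguments are assumed placeholders rather than code from the lecture:

```python
import torch
import torch.nn.functional as F

def critic_update(batch, critic, target_actor, target_critic,
                  critic_opt, gamma=0.99):
    """One TD update of the critic toward r + gamma * Q(s', pi_phi(s'))."""
    s, a, r, s_next, done = batch  # tensors sampled from a replay buffer

    with torch.no_grad():
        a_next = target_actor(s_next)               # pi_phi(s')
        q_next = target_critic(s_next, a_next)      # Q(s', pi_phi(s'))
        target = r + gamma * (1.0 - done) * q_next  # TD target

    q = critic(s, a)
    loss = F.mse_loss(q, target)

    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```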
@@ -341,10 +341,10 @@ Issue: Rare Beneficial Samples
in the Replay Buffer
Examples: Montezuma's Revenge, Blind Cliffwalk, any long-horizon sparse-reward problem
@@ -358,7 +358,7 @@ Blind Cliffwalk
Episode is terminated whenever the agent takes the $\color{red}{\text{wrong}}$ action.
Agent will get reward $1$ after taking $n$ $\color{black}{\text{right}}$ actions.
-

+
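For concreteness, a toy implementation of this environment (the state encoding and action convention below are my own assumptions):

```python
class BlindCliffwalk:
    """Chain of n states; the 'right' action advances, the 'wrong' action terminates."""
    def __init__(self, n):
        self.n = n
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action 1 = right, 0 = wrong
        if action == 0:
            return self.state, 0.0, True   # wrong action ends the episode
        self.state += 1
        if self.state == self.n:
            return self.state, 1.0, True   # reward 1 after n right actions
        return self.state, 0.0, False
```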
@@ -373,7 +373,7 @@ Analysis with Q-Learning
current state.
-

+
@@ -485,7 +485,7 @@ Value Network with Discrete Distribution
$Q_\th(s, a) = \sum_i p_{\th, i}(s, a) z_i$.
Update rule for $Q$, where $x_t$ is the state at time-step $t$.
-

+
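A short sketch (assuming a C51-style categorical head; the tensor shapes and value range are illustrative) of recovering $Q_\th(s, a) = \sum_i p_{\th, i}(s, a) z_i$ from the predicted distribution:

```python
import torch

def q_from_distribution(logits, v_min=-10.0, v_max=10.0):
    """logits: (batch, num_actions, num_atoms) categorical value head."""
    num_atoms = logits.shape[-1]
    z = torch.linspace(v_min, v_max, num_atoms)  # support {z_i}
    p = torch.softmax(logits, dim=-1)            # p_{theta,i}(s, a)
    return (p * z).sum(dim=-1)                   # Q(s, a) = sum_i p_i z_i
```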
@@ -573,7 +573,7 @@ Ablation study of tricks in Rainbow
- Prioritized replay, multi-step learning, and distributional RL are the most important tricks in Rainbow.
-

+
diff --git a/lectures/html/L7_exploration.html b/lectures/html/L7_exploration.html
index e6016f5e..5f3150ce 100755
--- a/lectures/html/L7_exploration.html
+++ b/lectures/html/L7_exploration.html
@@ -67,7 +67,7 @@ Motivation to Explore vs Exploit
A new restaurant is always a risk (unknown food quality, service, etc.)
But without going to the new restaurant you never know! How do we balance this?
-
+
Why Exploration is Difficult
@@ -80,7 +80,7 @@
Why Exploration is Difficult
Even exploration in a low-dimensional space may be tricky when there are "alleys": there is a low probability of passing through small gaps to then explore the states beyond:
-

+
@@ -88,7 +88,7 @@
Exploration to Escape Local Minima in Reward
- Suppose your dense reward for the environment below is the negative Euclidean distance to the flag. The return-maximizing sequence of actions is to go through the small gap and reach the flag (global optimum)
- But you will never know to do that unless you explore, and with this dense reward function your trained agent will likely headbutt into the blue wall (local optimum)
-
+
@@ -100,7 +100,7 @@
Knowing what to explore is critical
- The agent will become a couch potato and stare at the TV all day.
-
+
@@ -145,8 +145,8 @@ Multi-Armed Bandits
In this game you have a few slot machines (arms) and choose one arm to pull. You then receive a reward sampled from that arm's unknown distribution
@@ -184,7 +184,7 @@ Multi-Armed Bandits
Goal is to maximize cumulative reward $\sum_{t=1}^T r_t$
-
+
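To make the setup concrete, a minimal $\epsilon$-greedy bandit simulation (the Gaussian arms, horizon, and $\epsilon=0.1$ are arbitrary choices for illustration):

```python
import numpy as np

def run_bandit(true_means, T=10000, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts, values = np.zeros(k), np.zeros(k)     # per-arm pull counts and mean estimates
    total = 0.0
    for t in range(T):
        a = rng.integers(k) if rng.random() < eps else int(np.argmax(values))
        r = rng.normal(true_means[a], 1.0)        # reward from the arm's unknown distribution
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # incremental mean update
        total += r
    return total                                  # cumulative reward sum_t r_t

print(run_bandit([0.1, 0.5, 0.9]))
```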
@@ -275,7 +275,7 @@ Total Regret Decomposition
Desirable Total Regret Behavior
-

+
- What can you infer from this figure?
@@ -331,7 +331,7 @@ Decaying $\epsilon$-Greedy Algorithm
The Principle of Optimism in the Face of Uncertainty
-

+
@@ -568,7 +568,7 @@ Counting via Hashing: Autoencoders
- $\mathcal{L}(\{s_n\}_{n=1}^N) = \underbrace{-\frac{1}{N} \sum_{n=1}^N \log p(s_n)}_\text{reconstruction loss} + \underbrace{\frac{1}{N} \frac{\lambda}{K} \sum_{n=1}^N\sum_{i=1}^K \min \big \{ (1-b_i(s_n))^2, b_i(s_n)^2 \big\}}_\text{sigmoid activation being closer to binary}$
-
+
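A rough sketch of this loss in PyTorch, assuming binary-valued states (so $-\log p(s_n)$ becomes a binary cross-entropy) and placeholder `encoder`/`decoder` networks with a $K$-dimensional sigmoid code:

```python
import torch
import torch.nn.functional as F

def hashing_ae_loss(states, encoder, decoder, lam=10.0):
    """Reconstruction NLL plus a term pushing the sigmoid code b(s) toward {0, 1}.
    `encoder`, `decoder`, and `lam` are illustrative placeholders."""
    b = torch.sigmoid(encoder(states))           # b_i(s_n) in (0, 1), shape (N, K)
    recon_logits = decoder(b)
    nll = F.binary_cross_entropy_with_logits(    # -1/N sum_n log p(s_n)
        recon_logits, states, reduction="none").flatten(1).sum(dim=1).mean()
    K = b.shape[1]
    binarize = (lam / K) * torch.minimum((1 - b) ** 2, b ** 2).sum(dim=1).mean()
    return nll + binarize
```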
@@ -643,7 +643,7 @@
Quick Refresher on GMM
- The Gaussian Mixture Model training process initializes $k$ different Gaussians and fits them to the data. Suppose, for example, our state space has 2 dimensions as below.
- The GMM is our density model and gives the probability of seeing some input, $p_t(s)$
- Typically optimized via Expectation Maximization (EM)
-

+
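A quick sketch of fitting such a density model with scikit-learn (the random 2-D data and $k=3$ components are placeholder choices):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder 2-D "states"; in practice these come from the agent's experience.
states = np.random.randn(1000, 2)

gmm = GaussianMixture(n_components=3).fit(states)  # fit via EM
log_p = gmm.score_samples(states)                  # log p_t(s) for each state
p = np.exp(log_p)                                  # density used downstream
```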
@@ -884,7 +884,7 @@
Random Network Distillation Performance
Random Network Distillation Performance
-

+