Notes: RL Foundational Models

What the suite looks like:
- 10^40 states
- Performs about the same as MuZero, but without planning.
- However, because it doesn't have a learned internal model, it is not the most sample-efficient.

Distillation

Note: distillation loss is used to train the large model faster.
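
A minimal sketch of what such a distillation loss could look like, assuming a PyTorch setup where an already-trained (typically smaller) teacher's policy logits supervise the large student; the function name and temperature parameter are illustrative, not from the notes.

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the action distribution, batch-averaged.

    Added on top of the usual RL loss so the large student can reuse the
    teacher's behaviour instead of learning everything from scratch.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In kickstarting-style setups this term is usually weighted against the RL loss and annealed toward zero once the student matches the teacher.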
Auto-curriculum learning
- No-op filtering
- Prioritized level replay (see the sketch after this list)
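
A rough sketch of rank-based prioritized level replay, assuming each seen level carries a learning-potential score (e.g. a recent TD-error or regret estimate); the function and parameter names are illustrative. No-op filtering would be a separate, simpler check that discards generated tasks a do-nothing policy can already solve.

```python
import numpy as np

def sample_level(scores, beta: float = 0.1, rng=None):
    """Pick a level index by rank-based prioritization: levels with higher
    learning-potential scores are replayed more often.

    `scores`: one score per previously seen level.
    `beta`: temperature; smaller values make sampling more peaked.
    """
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    ranks = np.argsort(np.argsort(-scores)) + 1      # rank 1 = highest score
    weights = (1.0 / ranks) ** (1.0 / beta)
    probs = weights / weights.sum()
    return int(rng.choice(len(scores), p=probs))
```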

Multi-agent:

Scaling:

Context window for 300 trials * 5 steps = 1500. However, a context size of 1800 is still useful.
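
Back-of-the-envelope check on those numbers (variable names are just for illustration):

```python
# Numbers taken from the note above.
trials = 300
steps_per_trial = 5
needed_context = trials * steps_per_trial   # 1500 timesteps
context_size = 1800                         # leaves ~300 timesteps of headroom
print(needed_context, context_size - needed_context)  # 1500 300
```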
