Notes: RL Foundational Models

What the suite looks like:
- 10^40 states
- Performs about the same as MuZero, but without planning.
- However, because it doesn't have a learned internal model, it is not the most sample-efficient.

Distillation

Note: distillation loss is used to train the large model faster.
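
A minimal sketch of what such a distillation loss could look like, assuming a PyTorch setup where an already-trained (typically smaller) teacher's policy logits supervise the large student; the function name and temperature parameter are illustrative, not from the notes.

```python
import torch
import torch.nn.functional as F

def policy_distillation_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the action distribution, batch-averaged.

    Added on top of the usual RL loss so the large student can reuse the
    teacher's behaviour instead of learning everything from scratch.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In kickstarting-style setups this term is usually weighted against the RL loss and annealed toward zero once the student matches the teacher.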
Auto-curriculum learning
- No-op filtering
- Prioritized level replay (see the sketch after this list)
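
A rough sketch of rank-based prioritized level replay, assuming each seen level carries a learning-potential score (e.g. a recent TD-error or regret estimate); the function and parameter names are illustrative. No-op filtering would be a separate, simpler check that discards generated tasks a do-nothing policy can already solve.

```python
import numpy as np

def sample_level(scores, beta: float = 0.1, rng=None):
    """Pick a level index by rank-based prioritization: levels with higher
    learning-potential scores are replayed more often.

    `scores`: one score per previously seen level.
    `beta`: temperature; smaller values make sampling more peaked.
    """
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    ranks = np.argsort(np.argsort(-scores)) + 1      # rank 1 = highest score
    weights = (1.0 / ranks) ** (1.0 / beta)
    probs = weights / weights.sum()
    return int(rng.choice(len(scores), p=probs))
```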

Multi-agent:

Scaling:

Context window for 300 trials * 5 steps = 1500. However, a context size of 1800 is still useful.
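
Back-of-the-envelope check on those numbers (variable names are just for illustration):

```python
# Numbers taken from the note above.
trials = 300
steps_per_trial = 5
needed_context = trials * steps_per_trial   # 1500 timesteps
context_size = 1800                         # leaves ~300 timesteps of headroom
print(needed_context, context_size - needed_context)  # 1500 300
```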
