Notes: Code Generation Models
Evaluating LLMs trained on code
Why code
- Good-sized corpus (see the napkin math under Data Size)
- hierarchical structure
- automatically tested
- Errors/stacktraces are just language
- Edge in evaluation tooling (interpreters, compilers, test runners)
- A lot of "context information": documentation, commits, diffs, PRs, etc.
- Complementary skills for most other downstream tasks.
Downstream evaluation
Language modeling vs. code generation
Fine-tuning: "Effective data transfer"
Data Size:
- GPT-3 trained on 300B tokens, ~200B words.
- Dataset grows slowly with model size.
- 1T words enough for a 10x larger model?
- Common Crawl = 10^14 words
- Library of Congress = 10^7 * 10^5 = 10^12 words (overestimate)
- Python on GitHub = 50B tokens.
- Just scaling up model size will run into data limitations soon (rough sanity check below). However, what about transfer?
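A quick sanity check on the arithmetic above. All figures are the rough order-of-magnitude guesses from these notes, not measured values.

```python
# Order-of-magnitude sanity check for the data-size estimates above.
gpt3_tokens = 300e9                       # GPT-3 training set, ~300B tokens
gpt3_words = 200e9                        # ~300B tokens ~= 200B words
common_crawl_words = 1e14                 # Common Crawl
library_of_congress_words = 1e7 * 1e5     # ~10^7 volumes * ~10^5 words each (overestimate)
github_python_tokens = 50e9               # Python source on GitHub

print(common_crawl_words / gpt3_words)          # ~500x GPT-3's word count
print(library_of_congress_words / gpt3_words)   # ~5x GPT-3's word count
print(github_python_tokens / gpt3_tokens)       # Python alone ~= 1/6 of GPT-3's token count
```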
Generate more samples
Longer programs
Discriminators
- Binary discriminators for "is this code correct?" (checking whether code is merely valid is easy).
- These don't do much better than the log-probs of a code-trained LM (see the sketch after this list).
- Naive RL -> the value function won't learn much. This plausibly suggests that naive RL for function synthesis is a weak problem formulation.
- One would hope to do much better by using more information (e.g. stack traces, human feedback, etc.).
- "Generate many samples" is the most naive form of search, and search is the most naive form of RL... in the presence of a good automatic evaluation, this may work.
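A minimal sketch of the log-prob ranking baseline mentioned above: score each sampled program by its mean per-token log-probability under the code-trained LM and keep the highest-scoring candidates. The checkpoint name is a placeholder, not the model used in these notes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal (code) LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_logprob(prompt: str, completion: str) -> float:
    """Mean per-token log-prob of `completion` given `prompt` under the LM."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token conditioned on everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.size(0)), targets]
    gen_start = prompt_ids.size(1) - 1  # index of the first completion token's prediction
    return token_lp[gen_start:].mean().item()

def rank_samples(prompt: str, samples: list[str], k: int = 10) -> list[str]:
    """Keep the k samples the LM itself finds most likely."""
    return sorted(samples, key=lambda s: mean_logprob(prompt, s), reverse=True)[:k]
```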
Ranking
Approach:
- Pre-train an LLM on GitHub with the standard language modeling objective.
- Fine-tune on code generation task (competitive programming).
- Generate a very large number of samples for each problem.
- Filter the samples to obtain a small set of high-quality solutions (~10); see the sketch after this list.
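A hedged sketch of the sample-then-filter step: draw many candidate programs from the fine-tuned model and keep only those that pass the problem's example tests. `generate_candidate` stands in for whatever sampling call the model exposes (an assumption, not an API from these notes), and running candidates via subprocess is just one way to execute untrusted code.

```python
import subprocess

def passes_example_tests(program: str, tests: list[tuple[str, str]],
                         timeout_s: float = 2.0) -> bool:
    """Run `program` on each example input and compare stdout to the expected output."""
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                ["python3", "-c", program],
                input=stdin_text, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

def sample_and_filter(problem: str, generate_candidate, tests,
                      num_samples: int = 1000, keep: int = 10) -> list[str]:
    """Generate many samples; keep up to `keep` that pass the example tests."""
    survivors = []
    for _ in range(num_samples):
        candidate = generate_candidate(problem)  # hypothetical sampling call
        if passes_example_tests(candidate, tests):
            survivors.append(candidate)
            if len(survivors) >= keep:
                break
    return survivors
```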
Progress by method:
Filtering and clustering
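The notes don't spell this step out, but the usual clustering trick (e.g. AlphaCode-style) groups filtered candidates by behavior: run each survivor on a set of extra inputs, treat programs with identical outputs as one cluster, and submit one representative from each of the largest clusters. A minimal sketch, assuming the subprocess-based runner above and some externally generated `extra_inputs`:

```python
import subprocess
from collections import defaultdict

def output_fingerprint(program: str, inputs: list[str]) -> tuple:
    """Fingerprint a program by its outputs on a list of extra inputs."""
    outputs = []
    for stdin_text in inputs:
        try:
            result = subprocess.run(
                ["python3", "-c", program],
                input=stdin_text, capture_output=True, text=True, timeout=2.0,
            )
            outputs.append(result.stdout)
        except subprocess.TimeoutExpired:
            outputs.append("<timeout>")
    return tuple(outputs)

def cluster_and_pick(candidates: list[str], extra_inputs: list[str],
                     num_submissions: int = 10) -> list[str]:
    """Group behaviorally identical programs; submit one per largest cluster."""
    clusters = defaultdict(list)
    for program in candidates:
        clusters[output_fingerprint(program, extra_inputs)].append(program)
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:num_submissions]]
```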