Evaluating LLMs Trained on Code

Why code

  • Good-sized corpus (see the napkin math under Data Size)
  • Hierarchical structure
  • Can be tested automatically
  • Errors/stacktraces are just language
  • Eval tool edge
  • A lot of ‘context information’: documentation, commits, diffs, PRs, etc.

  • Complementary skills for most other downstream tasks.

codex scaling loss

Downstream evaluation

codex pass rate vs. model size

Language modeling vs. code generation

codex train from scratch vs. pretrain

Fine-tuning: “Effective data transfer”

codex effective data transfer

Data Size:

  • GPT-3 trained on 300B tokens, ~200B words.
  • Dataset grows slowly with model size.
  • 1T words enough for a 10x larger model?
    • Common Crawl = 10^14 words
    • Library of Congress = 10^7 * 10^5 = 10^12 words (overestimate)
    • Python on GitHub = 50B tokens.
    • Just scaling up model size will run into data limitations soon. However, what about transfer?
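The napkin math above can be checked in a few lines. This is only a sketch that reuses the figures from the bullets; reading 10^7 * 10^5 as "volumes times words per volume" is an assumption, and the helper name multiple_of_gpt3 is made up for illustration.

```python
# Figures copied from the notes; all are rough orders of magnitude.
GPT3_TOKENS = 300e9
GPT3_WORDS = 200e9
COMMON_CRAWL_WORDS = 1e14
# Assumption: ~10^7 volumes times ~10^5 words per volume (an overestimate).
LIBRARY_OF_CONGRESS_WORDS = 1e7 * 1e5
GITHUB_PYTHON_TOKENS = 50e9

# Implied words-per-token ratio for GPT-3's training set.
WORDS_PER_TOKEN = GPT3_WORDS / GPT3_TOKENS


def multiple_of_gpt3(words: float) -> float:
    """How many GPT-3-sized training sets (in words) a corpus would cover."""
    return words / GPT3_WORDS


if __name__ == "__main__":
    print("words per token:", round(WORDS_PER_TOKEN, 2))
    print("1T words vs GPT-3:", multiple_of_gpt3(1e12))
    print("Library of Congress vs GPT-3:", multiple_of_gpt3(LIBRARY_OF_CONGRESS_WORDS))
    print("Common Crawl vs GPT-3:", multiple_of_gpt3(COMMON_CRAWL_WORDS))
    print("GitHub Python vs GPT-3 (tokens):", GITHUB_PYTHON_TOKENS / GPT3_TOKENS)
```

On these numbers, Python on GitHub is a fraction of GPT-3's token budget, which is why the transfer question matters.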

Generate more samples

codex generate more samples
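To put a number on "generate more samples", the Codex paper reports pass@k: the probability that at least one of k samples passes the unit tests. These notes do not spell the metric out, so the following is a sketch of the unbiased estimator as I recall it from the paper, pass@k = 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them passed. The function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which passed the tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With a fixed fraction of correct samples, pass@k rises as k grows.
    for k in (1, 10, 100):
        print(k, pass_at_k(n=200, c=10, k=k))
```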

Longer programs

codex longer programs

Discriminators

anthropic ml discriminators

  • Binary discriminators for “is this code correct?” (“is this code valid?” is easy)
    • These don’t do much better than the log-probs of a code-trained LM
  • Naive RL -> value function won’t learn much. This plausibly suggests that naive RL for function synthesis is a weak problem formulation.
  • One would hope to do much better by using more information (e.g. stack traces, human feedback, etc.).
  • “Generate many samples” is the most naive form of search, and search is the most naive form of RL. In the presence of a good automatic evaluation, this may work.
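A minimal sketch of the two ideas in the list above: ranking samples by the language model's own log-probs, and "generate many samples" as search against an automatic evaluator. The sample format (source string plus per-token log-probs), the function names, and the exec-based test runner are all assumptions made for illustration. The runner is not sandboxed and should never see untrusted code.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical sample format: (source code, per-token log-probs from the LM).
Sample = Tuple[str, Sequence[float]]


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised log-probability of one sample."""
    return sum(token_logprobs) / max(len(token_logprobs), 1)


def rank_by_logprob(samples: List[Sample]) -> List[Sample]:
    """Order samples from most to least likely under the LM."""
    return sorted(samples, key=lambda s: mean_logprob(s[1]), reverse=True)


def passes(code: str, test: str) -> bool:
    """Automatic evaluation: run the candidate, then the test, in a scratch namespace."""
    env: dict = {}
    try:
        exec(code, env)
        exec(test, env)
        return True
    except Exception:
        return False


def search(samples: List[Sample], test: str) -> Optional[str]:
    """Naive search: return the most likely sample that passes the test, if any."""
    for code, _ in rank_by_logprob(samples):
        if passes(code, test):
            return code
    return None
```

For example, given two candidates for add(a, b), search returns the one that satisfies "assert add(2, 3) == 5" even if the LM assigned it a lower log-prob, which is the point of having an automatic evaluator in the loop.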

AlphaCode paper

alphacode overview

Ranking

alphacode ranking

Approach:

  1. Pre-train an LLM on GitHub with a standard language modeling objective.
  2. Fine-tune on code generation task (competitive programming).
  3. Generate a very large number of samples for each problem.
  4. Filter the samples to obtain a small set of high-quality solutions (~10).
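Steps 3 and 4 can be sketched as a sample-then-filter loop. Everything here is a toy stand-in: sample_programs fakes the model by drawing from three hand-written candidates for a "sum the integers" problem, and programs are modelled as Python callables from an input string to an output string rather than generated source code.

```python
import random
from typing import Callable, List, Tuple

Program = Callable[[str], str]   # maps problem input text to output text
Example = Tuple[str, str]        # (input, expected output) from the problem statement


def sample_programs(n: int, seed: int = 0) -> List[Program]:
    """Stand-in for step 3; a real system would sample code from the fine-tuned model."""
    rng = random.Random(seed)
    pool: List[Program] = [
        lambda s: str(sum(map(int, s.split()))),  # correct: sum of the integers
        lambda s: str(max(map(int, s.split()))),  # wrong: maximum instead of sum
        lambda s: s,                              # wrong: echoes the input
    ]
    return [rng.choice(pool) for _ in range(n)]


def passes_examples(program: Program, examples: List[Example]) -> bool:
    """True if the program reproduces every example output without crashing."""
    try:
        return all(program(inp).strip() == out.strip() for inp, out in examples)
    except Exception:
        return False


def filter_samples(programs: List[Program], examples: List[Example],
                   keep: int = 10) -> List[Program]:
    """Step 4: drop samples that fail the examples, then cap the survivors."""
    survivors = [p for p in programs if passes_examples(p, examples)]
    return survivors[:keep]
```

Truncating the survivors is the crudest way to get down to about 10; it ignores which survivors are duplicates of each other.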

alphacode architecture overview

alphacode dataset sizes

alphacode codeforces ranking

alphacode solve rates

AlphaCode blog, Viz tool

Progress by method:

alphacode progress by method

alphacode progress table

codex vs. alphacode

alphacode eval

Filtering and clustering

alphacode filtering
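The notes only name the clustering step. As I recall the AlphaCode paper, samples that survive filtering are grouped by their behaviour on extra, model-generated test inputs, and one sample is submitted per cluster, largest cluster first. The sketch below assumes that reading; the function names and the program-as-callable representation are illustrative, and the probe inputs are passed in rather than generated.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Program = Callable[[str], str]  # maps problem input text to output text


def cluster_by_behaviour(programs: List[Program],
                         probe_inputs: List[str]) -> List[List[Program]]:
    """Group programs whose outputs agree on every probe input; largest group first."""
    clusters: Dict[Tuple[str, ...], List[Program]] = defaultdict(list)
    for program in programs:
        signature = []
        for inp in probe_inputs:
            try:
                signature.append(program(inp))
            except Exception:
                signature.append("<error>")  # crashes count as a behaviour too
        clusters[tuple(signature)].append(program)
    return sorted(clusters.values(), key=len, reverse=True)


def pick_submissions(programs: List[Program], probe_inputs: List[str],
                     budget: int = 10) -> List[Program]:
    """Take one representative from each of the `budget` largest clusters."""
    clusters = cluster_by_behaviour(programs, probe_inputs)
    return [cluster[0] for cluster in clusters[:budget]]
```

The design intuition is that semantically equivalent programs land in the same cluster, so spending one submission per cluster avoids wasting the budget on near-duplicates.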