Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Ulyana Piterbarg, Lerrel Pinto, Rob Fergus
New York University
A visualization of LintSeq.

Training autoregressive LMs to synthesize code edit-by-edit improves the slope of test-time scaling laws on benchmarks like HumanEval, MBPP, and CodeContests. Our approach introduces an algorithm, LintSeq, that uses code linters to generate synthetic edit data for code synthesis at scale. Edits sampled with this algorithm reflect the semantics & syntax of their programming language, modeling human-like abstractions. We show that pre-processing SFT data with LintSeq results in models that produce higher-quality & more diverse programs when sampled at test time, suggesting that LMs benefit from training data that adds human-like abstractions for code synthesis.



Abstract

Software engineers mainly write code by editing existing programs. In contrast, language models (LMs) autoregressively synthesize programs in a single pass. One explanation for this gap is the scarcity of sequential edit data: while high-quality instruction data for code synthesis is already scarce, edit data for synthesis is scarcer still.

To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors programs into sequences of synthetic edits by using a linter to procedurally sample across interdependent lines of source code. Synthetic edits sampled with LintSeq reflect the syntax and semantics of their programming language.

To test the algorithm, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we fine-tune a series of smaller LMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset. We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs. Finally, we also pretrain our own tiny LMs for code understanding. We show that fine-tuning these models to synthesize code edit-by-edit results in strong performance on HumanEval and MBPP(+) compared to existing code language models of similar scale, such as AlphaCode and Codex.


1. LintSeq: Code Generation as a Sequential Edit Problem

At each iteration, the LintSeq algorithm samples an edit chunk from a program by: randomly selecting a line of code to delete; using a code linter to identify the minimal set of lines that depend on this line; and finally, removing the line together with its dependents. These steps are repeated until every line of code has been removed. LintSeq then processes the reversed sequence of program states with Unix diff to express it as a sequence of edits.
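
The sampling loop can be sketched in a few dozen lines of Python. The sketch below assumes pyflakes as the linter and uses our own helper names (lint_error_lines, lintseq); it illustrates the procedure rather than reproducing the paper's reference implementation.

import difflib
import random

from pyflakes.api import check


class _LineCollector:
    """Minimal pyflakes reporter that records the line numbers of reported problems."""

    def __init__(self):
        self.lines = set()

    def flake(self, message):  # e.g. undefined names introduced by a deletion
        self.lines.add(message.lineno)

    def syntaxError(self, filename, msg, lineno, offset, text):
        if lineno is not None:
            self.lines.add(lineno)

    def unexpectedError(self, filename, msg):
        pass


def lint_error_lines(source):
    """Return the (1-indexed) line numbers that the linter flags in `source`."""
    reporter = _LineCollector()
    check(source, "<lintseq>", reporter)
    return reporter.lines


def lintseq(program, rng=random):
    """Sample one synthetic edit sequence for `program`.

    Simplification: every flagged line is removed; a more faithful version
    would only remove lines whose errors are newly introduced relative to
    the original (linter-clean) program.
    """
    lines = program.splitlines()
    alive = list(range(len(lines)))  # indices of lines still present
    states = [alive[:]]              # forward states: full program -> empty

    while alive:
        # 1. Randomly select a surviving line and delete it.
        alive.remove(rng.choice(alive))
        # 2. Repeatedly delete the lines the linter now flags (the deleted
        #    line's dependents) until the remaining program is clean again.
        while alive:
            source = "\n".join(lines[i] for i in alive)
            flagged = {
                alive[j - 1]
                for j in lint_error_lines(source)
                if 1 <= j <= len(alive)
            }
            if not flagged:
                break
            alive = [i for i in alive if i not in flagged]
        states.append(alive[:])

    # 3. Reverse the states (empty -> full) and express each consecutive
    #    pair of program states as a Unix-style unified diff.
    states.reverse()
    edits = []
    for before, after in zip(states, states[1:]):
        diff = difflib.unified_diff(
            [lines[i] for i in before],
            [lines[i] for i in after],
            lineterm="",
        )
        edits.append("\n".join(diff))
    return edits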

Intuitively, LintSeq decomposes source code into semantically and syntactically interdependent edit chunks. A synthetic Python edit output by LintSeq might contain the full body of a "for" loop, a declaration of a variable (e.g. "a = 10") together with all subsequent lines of code referencing this variable, or an entire function definition. By design, each LintSeq edit also satisfies a "linter-correctness" invariant: applying it to previously written code does not introduce new linter errors.
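
For instance, running the sketch above on a short function yields a handful of unified-diff edits of exactly this flavor; the example program below and its printed output are illustrative and will vary with the random seed and linter.

example_program = """\
def mean(xs):
    total = 0
    for x in xs:
        total += x
    return total / len(xs)
"""

for step, edit in enumerate(lintseq(example_program)):
    print(f"--- edit {step} ---")
    print(edit)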

A visualization of LintSeq.

2. Sequential Code Generation Improves the Slope of Test-Time Scaling Laws for Repeated Sampling

By adding edits that reflect programming-language abstractions to instruction data with LintSeq, we train language models to synthesize programs one ("linted") edit at a time. This substantially improves the overall diversity of model-generated code compared to training on an equivalent dataset of "single-shot" instruction data for code synthesis. Because sampled programs are more diverse, pass@k performance increases faster as a function of test-time FLOPs, yielding a better trade-off between benchmark performance and sampling compute. We demonstrate this effect on four different language models.
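
Concretely, each point on such a curve pairs a pass@k estimate with the approximate cost of drawing k samples. The sketch below uses the standard unbiased pass@k estimator (Chen et al., 2021) and a common 2 x parameters x generated-tokens approximation for the cost of one sample; the constants and helper names are illustrative assumptions, not the paper's exact accounting.

from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n attempts (c of which pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def sampling_flops(num_params, tokens_per_sample, k):
    """Rough test-time cost of k samples: ~2 * params * generated tokens per
    sample (an illustrative approximation, not the paper's FLOPs accounting)."""
    return 2.0 * num_params * tokens_per_sample * k


# Example: sweep k and pair each pass@k estimate with its approximate cost.
n, c = 50, 12  # 50 attempts per problem, 12 of them pass (made-up numbers)
for k in (1, 5, 10, 25, 50):
    score = pass_at_k(n, c, k)
    flops = sampling_flops(num_params=3.8e9, tokens_per_sample=256, k=k)
    print(f"k={k:>2}  pass@k={score:.3f}  ~{flops:.2e} FLOPs")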

In the figure below, we report the best performance of models SFT-ed on edit-sequence pre-processed vs. standard instruction data after sweeping over sampling temperature, top-p, and min-p. We then plot benchmark score (pass@k) as a function of the total cost of repeated sampling from each model in FLOPs. Shading shows the standard error of the linear fit.

LMs from the Gemma 2, Phi-3, and Llama 3 families exhibit better inference-time scaling laws during repeated sampling when SFT-ed on program diff sequences rather than on standard program data.

BibTeX

@misc{piterbarg2024editseq,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis}, 
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}