Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

New York University

There are infinitely many ways to write a program. Training autoregressive LMs to natively synthesize programs as sequences of diffs improves the trade-off between generation quality and inference-time compute. We show that repeatedly sampling solutions to coding problems from small language models tuned on program-diff sequences yields benchmark coverage competitive with GPT-4 and GPT-4-Omni, at a total cost similar to generating a single completion from the best open-source LLMs like Llama 3.1 405B.



Abstract

Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.


1. LintSeq: Code Generation as a Sequential Edit Problem

We introduce LintSeq, a simple algorithm for refactoring static code data into multiple equivalent views, i.e., into sequences of code edits or diffs. The algorithm is loosely inspired by recent work on diffusion models for text generation. LintSeq uses a linter (a static verifier for code) to sample across all of the error-free insertion edits that can be used to write a program chunk-by-chunk (see the sketch below).

A visualization of LintSeq.
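
The core procedure is small enough to sketch. The snippet below is a minimal, illustrative LintSeq-style sampler in Python, not the released implementation: it repeatedly deletes small chunks of lines from a finished program, keeps intermediate states that pass a lint check, then reverses the trajectory and serializes each insertion step as a unified diff. The lint_ok check uses ast.parse as a stand-in for a real linter, and parameters like max_chunk are illustrative choices.

    # Illustrative LintSeq-style backward edit sampler (not the authors' code).
    # Deletes random chunks of lines from a program, keeps lint-clean states,
    # then reverses the trajectory and emits insertion edits as unified diffs.
    import ast
    import difflib
    import random

    def lint_ok(lines: list[str]) -> bool:
        """Stand-in linter: the program must at least parse as valid Python."""
        try:
            ast.parse("\n".join(lines))
            return True
        except SyntaxError:
            return False

    def sample_edit_sequence(program: str, max_chunk: int = 3, seed: int = 0) -> list[str]:
        """Return a list of unified diffs that rebuild `program` from scratch."""
        rng = random.Random(seed)
        lines = program.splitlines()
        states = [lines]  # backward trajectory: full program -> empty program
        current = lines
        while current:
            # Propose deletions until one leaves a lint-clean (or empty) program;
            # fall back to the last proposal after 100 tries.
            for _ in range(100):
                start = rng.randrange(len(current))
                size = rng.randint(1, max_chunk)
                candidate = current[:start] + current[start + size:]
                if not candidate or lint_ok(candidate):
                    break
            current = candidate
            states.append(current)
        # Reverse: empty -> ... -> full program, expressing each step as a diff.
        states.reverse()
        diffs = []
        for before, after in zip(states, states[1:]):
            diffs.append("\n".join(difflib.unified_diff(before, after, lineterm="")))
        return diffs

    if __name__ == "__main__":
        src = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
        for i, d in enumerate(sample_edit_sequence(src)):
            print(f"--- edit {i} ---\n{d}\n")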

2. Inference-Time Scaling Laws for Code Generation

Reparameterizing programs in training data has a dramatic impact on downstream inference-time scaling laws. Repeatedly sampling from LMs trained with supervised finetuning (SFT) on program-diff sequences yields higher-quality, more diverse programs. We demonstrate this effect on six different small language models, ranging in scale from ~100M to ~10B parameters. Using SFT, we tune each model either on paired natural language instruction + program-diff-sequence data or on the equivalent instruction + program data. In the plot below, we show code synthesis benchmark coverage as a function of samples across SFT-ed models (temperature 1, top-p 0.95). Coverage refers to the fraction of benchmark programming problems ("pass@k") solved by any attempt given k tries.

Small LMs, including models from the Gemma, Phi-3, and Llama 3 families, exhibit better inference-time scaling laws during repeated sampling when finetuned on program-diff sequences rather than on standard program data.
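
Coverage curves like these are typically computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021). The helper below is an illustrative implementation of that estimator, not the paper's evaluation code.

    # Unbiased pass@k estimator: given that c of n generated samples pass the
    # unit tests, estimate the probability that at least one of k samples passes.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def coverage(results: list[tuple[int, int]], k: int) -> float:
        """Average pass@k over a benchmark.
        `results` holds one (num_samples, num_correct) pair per problem."""
        return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

    # Example: three problems, 50 samples each, with 0, 5, and 40 correct samples.
    print(coverage([(50, 0), (50, 5), (50, 40)], k=10))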

3. Tiny Edit Sequence LMs are SOTA for their Size

To test the effect of reparameterizing code generation as a sequential edit problem on code LMs of the smallest scales, we tune two tiny decoder-only transformers on instruction + program-diff-sequence data with SFT. These models, TinyCodeLM-150M and TinyCodeLM-400M, are pretrained for Python code understanding on only 72 billion tokens of text. The LintSeq instruction finetuned variants of both models are state-of-the-art on HumanEval and MBPP(+) for their size.

Tiny language models instruction finetuned on edit sequences are state-of-the-art for their size.
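
Because edit sequence LMs emit programs as consecutive diffs rather than as raw source, their outputs must be resolved back into a single program before unit tests can be run. The sketch below shows one minimal way to apply difflib-style unified diffs in order, starting from an empty file; it is illustrative rather than the resolver used in the paper, and it assumes well-formed diffs with context lines.

    def apply_unified_diff(source: str, diff: str) -> str:
        """Apply one unified diff (difflib.unified_diff format) to `source`."""
        src = source.splitlines()
        out: list[str] = []
        idx = 0           # cursor into the original lines
        in_hunk = False
        for line in diff.splitlines():
            if line.startswith("@@"):
                in_hunk = True
                # Hunk header "@@ -a,b +c,d @@": copy untouched lines up to `a`.
                old_start = int(line.split()[1].lstrip("-").split(",")[0])
                target = max(old_start - 1, 0)
                out.extend(src[idx:target])
                idx = target
            elif not in_hunk:
                continue                      # file header lines ("--- ", "+++ ")
            elif line.startswith("+"):
                out.append(line[1:])          # inserted line
            elif line.startswith("-"):
                idx += 1                      # deleted line: skip it in the original
            else:
                out.append(line[1:])          # context line carried over unchanged
                idx += 1
        out.extend(src[idx:])                 # copy any trailing untouched lines
        return "\n".join(out)

    def resolve_edit_sequence(diffs: list[str]) -> str:
        """Resolve an edit sequence into a program by applying its diffs in order,
        starting from an empty file."""
        program = ""
        for d in diffs:
            program = apply_unified_diff(program, d)
        return program

Applied to the diffs produced by the sampler sketch in Section 1, resolve_edit_sequence should reconstruct the original program (up to trailing newlines).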

BibTeX

@misc{piterbarg2024editseq,
      title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis}, 
      author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
      year={2024},
      eprint={2410.02749},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
      }
}