We introduce LintSeq, a simple algorithm for refactoring static code data into multiple equivalent views -- i.e., into equivalent sequences of code edits, or diffs. The algorithm is loosely inspired by recent work on diffusion models for text generation. LintSeq uses a linter (a static verifier for code) to sample from the set of error-free insertion edits that can be used to write a program chunk by chunk.
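The sketch below illustrates the core idea only; it is not the paper's implementation. A backward pass repeatedly deletes small chunks of lines, keeping only deletions that the linter accepts, and a forward pass replays the trajectory in reverse as a sequence of insertion diffs. The linter wrapper (`lint_ok`, built on the pyflakes CLI), the chunk sizes, and the retry budget are illustrative assumptions.

```python
import difflib
import os
import random
import subprocess
import tempfile


def lint_ok(source: str) -> bool:
    """Return True if the linter reports no errors (here: the pyflakes CLI)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(["pyflakes", path], capture_output=True, text=True)
    finally:
        os.unlink(path)
    return result.returncode == 0


def sample_edit_sequence(program: str, rng: random.Random) -> list[str]:
    """Backward pass: delete lint-clean chunks of lines until the program is empty
    (or no clean deletion is found); forward pass: return insertion diffs."""
    states = [program.splitlines()]
    current = states[0]
    while current:
        accepted = None
        for _ in range(10):  # try a few random deletions, keep the first lint-clean one
            i = rng.randrange(len(current))
            j = min(len(current), i + rng.randint(1, 3))  # delete a small chunk
            candidate = current[:i] + current[j:]
            if lint_ok("\n".join(candidate)):
                accepted = candidate
                break
        if accepted is None:
            break  # no lint-clean deletion found; stop the backward pass early
        current = accepted
        states.append(current)

    states.reverse()  # order states from the smallest program to the full program
    edits = []
    for prev, nxt in zip(states, states[1:]):
        edits.append("\n".join(difflib.unified_diff(prev, nxt, lineterm="")))
    return edits
```

Because the deletions are sampled randomly, rerunning the sampler on the same program yields different but equivalent edit sequences, which is what produces the "multiple views" of each source file.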
Reparameterizing the programs in training data as edit sequences has a dramatic impact on downstream inference-time scaling laws. Repeatedly sampling from LMs trained with supervised finetuning (SFT) on program diff sequences yields higher-quality, more diverse programs. We demonstrate this effect on six different small language models, ranging in scale from ~100M to ~10B parameters. Using SFT, we tune each model on paired natural language instruction + program-diff-sequence data vs. on equivalent instruction + program data. In the plot below, we show code synthesis benchmark coverage as a function of samples across the SFT-ed models (temperature 1, top-p 0.95). Coverage, or “pass@k”, is the fraction of benchmark programming problems solved by any attempt within “k” tries.
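For concreteness, coverage at k samples corresponds to the standard unbiased pass@k estimator (Chen et al., 2021). A minimal sketch, where n is the number of samples drawn per problem and c the number that pass the tests (names are illustrative):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement) from n samples,
    of which c are correct, passes the problem's tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per problem, 30 of them correct, estimate pass@10.
print(pass_at_k(n=200, c=30, k=10))
```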
To test the effect of reparameterizing code generation as a sequential edit problem at the smallest LM scales, we tune two tiny decoder-only transformers on instruction + program-diff-sequence data with SFT. These models, TinyCodeLM-150M and TinyCodeLM-400M, are pretrained for Python code understanding on only 72 billion tokens of text. The LintSeq-instruction-finetuned variants of both models are state-of-the-art on HumanEval and MBPP(+) for their size.
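A hedged sketch of how an instruction + program-diff-sequence SFT example might be serialized is shown below; the prompt template and the diff delimiter are assumptions for illustration, not the paper's verbatim format.

```python
# Hypothetical serialization of one SFT training example: a natural language
# instruction paired with a response consisting of a sequence of insertion diffs.
EXAMPLE = {
    "instruction": "Write a function that returns the squares of a list of integers.",
    "response": (
        "<|diff|>\n"
        "@@ -0,0 +1,2 @@\n"
        "+def squares(xs):\n"
        "+    return [x * x for x in xs]\n"
        "<|diff|>\n"
    ),
}


def to_training_text(example: dict) -> str:
    """Concatenate instruction and edit-sequence response into one SFT string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )


print(to_training_text(EXAMPLE))
```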
@misc{piterbarg2024editseq,
  title={Training Language Models on Synthetic Edit Sequences Improves Code Synthesis},
  author={Ulyana Piterbarg and Lerrel Pinto and Rob Fergus},
  year={2024},
  eprint={2410.02749},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}