Models fine-tuned using synthetic data include:
- Apple’s on-device and server foundation models
- Llama-3.1
  - They used “rejection sampling based on code execution feedback” and “automatic annotation of very large docs”1
  - They generated unit tests for the model’s code samples, then rejected any sample that failed those tests (see the sketch after this list).
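That execution-feedback filter is simple to sketch. Below is a minimal, hypothetical version of the rejection loop: the LLM generation step is elided (the candidate samples and unit tests are hard-coded toy strings), and `passes_tests` / `rejection_sample` are illustrative names, not anything from the Llama-3.1 pipeline.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

def passes_tests(candidate_code: str, unit_tests: str, timeout: float = 5.0) -> bool:
    """Run a candidate code sample against its unit tests in a subprocess;
    a zero exit code means every assertion passed."""
    source = candidate_code + "\n\n" + unit_tests
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # hung samples count as failures
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def rejection_sample(candidates: list[str], unit_tests: str) -> list[str]:
    """Keep only the generated samples whose execution feedback is clean."""
    return [code for code in candidates if passes_tests(code, unit_tests)]

# Toy demonstration: two model "samples" for the same task, one buggy.
candidates = [
    "def add(a, b):\n    return a + b",  # correct sample, kept
    "def add(a, b):\n    return a - b",  # buggy sample, rejected
]
unit_tests = textwrap.dedent("""
    assert add(1, 2) == 3
    assert add(-1, 1) == 0
""")

kept = rejection_sample(candidates, unit_tests)
print(f"kept {len(kept)} of {len(candidates)} samples")
```

Running each sample in its own subprocess keeps crashes and infinite loops in the generated code from taking down the filtering job; the timeout simply converts a hang into a rejection.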
Oxen.ai has a neat tutorial on how to generate synthetic data for LLM training.2
Anthropic has been relying on synthetic data:3
> Amodei also doesn’t think a shortage of data will present a challenge to AI development, unlike some experts. Either by generating synthetic data or extrapolating out from existing data, AI developers will “get around” data limitations, he says.