Models fine-tuned using synthetic data include:

  • Apple’s on-device and server foundation models
  • Llama-3.1
    • Meta used “rejection sampling based on code execution feedback” and “automatic annotation of very large docs”1
    • They generated unit tests for each sampled code completion, executed them, and rejected any candidate that failed its own tests (see the sketch after this list).
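
A minimal sketch of that execution-feedback loop, assuming hypothetical `generate_code` and `generate_unit_tests` LLM calls; the toy stubs below only exist to keep the script self-contained and runnable:

```python
import subprocess
import sys
import tempfile

def generate_code(prompt: str) -> str:
    # Placeholder for an LLM completion call; returns a toy candidate here.
    return "def add(a, b):\n    return a + b\n"

def generate_unit_tests(prompt: str) -> str:
    # Placeholder for an LLM call that writes tests for the same prompt.
    return "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

def passes_tests(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Run a candidate against its generated tests in a fresh subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def rejection_sample(prompt: str, n_samples: int = 8) -> list[str]:
    """Keep only the samples whose generated unit tests actually pass."""
    tests = generate_unit_tests(prompt)
    candidates = [generate_code(prompt) for _ in range(n_samples)]
    return [c for c in candidates if passes_tests(c, tests)]

if __name__ == "__main__":
    kept = rejection_sample("Write add(a, b) returning the sum.", n_samples=4)
    print(f"kept {len(kept)}/4 candidates for the fine-tuning set")
```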

Oxen.ai has a neat tutorial on how to generate synthetic data for LLM training.2
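
In the spirit of that tutorial (which bootstraps a dataset from five seed spam texts), here is a minimal sketch of few-shot synthetic data generation; the prompt template and the `complete` helper are assumptions, not the tutorial’s actual code:

```python
SEEDS = [
    "URGENT: Your vote is needed. Reply YES now.",
    "Final notice: confirm your ballot status at the link below.",
]

PROMPT = """Here are {n} example political spam texts:
{examples}

Write {k} new, distinct texts in the same style, one per line."""

def complete(prompt: str) -> str:
    # Hypothetical stand-in for any LLM API call; canned output keeps the
    # sketch runnable offline.
    return ("- Act now: your district is counting on YOUR vote today!\n"
            "- Polls close soon. Tap the link to confirm your registration.")

def generate_synthetic(seeds: list[str], k: int = 10) -> list[str]:
    prompt = PROMPT.format(
        n=len(seeds),
        examples="\n".join(f"- {s}" for s in seeds),
        k=k,
    )
    lines = [l.lstrip("- ").strip() for l in complete(prompt).splitlines()]
    # Drop blanks and verbatim echoes of the seeds before keeping k samples.
    return [l for l in lines if l and l not in seeds][:k]

print(generate_synthetic(SEEDS, k=5))
```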

Anthropic has also been relying on synthetic data:3

> Amodei also doesn’t think a shortage of data will present a challenge to AI development, unlike some experts. Either by generating synthetic data or extrapolating out from existing data, AI developers will “get around” data limitations, he says.

Footnotes

  1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24.

  2. Create Your Own Synthetic Data With Only 5 Political Spam Texts [1/4] | Oxen.ai

  3. This Week in AI: Anthropic’s CEO talks scaling up AI and Google predicts floods | TechCrunch