Models fine-tuned using synthetic data include:

  • Apple’s on-device and server foundation models
  • Llama-3.1
    • They used “rejection sampling based on code execution feedback” and “automatic annotation of very large docs”1
    • They generated unit tests for the model’s code samples, then rejected any samples that couldn’t actually pass those tests.
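
The rejection-sampling loop above can be sketched in a few lines: run each candidate code sample together with its unit tests in a subprocess, and keep only the samples whose tests pass. The `candidates` and `tests` strings below are hypothetical stand-ins for model outputs, not anything from Meta’s actual pipeline.

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_code: str, test_code: str, timeout: int = 5) -> bool:
    """Execute a candidate sample plus its unit tests in a subprocess.

    Returns True only if the process exits cleanly (all asserts pass)."""
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# Hypothetical model samples: two candidate implementations of `add`.
candidates = [
    "def add(a, b):\n    return a - b",  # buggy sample -> rejected
    "def add(a, b):\n    return a + b",  # correct sample -> kept
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

# Rejection sampling: keep only samples whose tests pass.
accepted = [c for c in candidates if passes_tests(c, tests)]
```

In a real pipeline the unit tests themselves would also be model-generated, and the accepted samples would be fed back into the fine-tuning set.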

Oxen.ai has a neat tutorial on how to generate synthetic data for LLM training.2

Footnotes

  1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24.

  2. Create Your Own Synthetic Data With Only 5 Political Spam Texts [1/4] | Oxen.ai