Models fine-tuned using synthetic data include:

  • Apple’s on-device and server foundation models
  • Llama-3.1
    • Meta used “rejection sampling based on code execution feedback” and “automatic annotation of very large docs”1
    • They generated unit tests for each sampled code completion, executed them, and rejected any candidate that failed its own tests (see the sketch after this list).
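
A minimal sketch of that execution-feedback loop, assuming hypothetical `generate_code` and `generate_unit_tests` LLM calls; the toy stubs below only exist to keep the script self-contained and runnable:

```python
import subprocess
import sys
import tempfile

def generate_code(prompt: str) -> str:
    # Placeholder for an LLM completion call; returns a toy candidate here.
    return "def add(a, b):\n    return a + b\n"

def generate_unit_tests(prompt: str) -> str:
    # Placeholder for an LLM call that writes tests for the same prompt.
    return "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

def passes_tests(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Run a candidate against its generated tests in a fresh subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def rejection_sample(prompt: str, n_samples: int = 8) -> list[str]:
    """Keep only the samples whose generated unit tests actually pass."""
    tests = generate_unit_tests(prompt)
    candidates = [generate_code(prompt) for _ in range(n_samples)]
    return [c for c in candidates if passes_tests(c, tests)]

if __name__ == "__main__":
    kept = rejection_sample("Write add(a, b) returning the sum.", n_samples=4)
    print(f"kept {len(kept)}/4 candidates for the fine-tuning set")
```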

Oxen.ai has a neat tutorial on how to generate synthetic data for LLM training.2
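
In the spirit of that tutorial (which bootstraps a dataset from five seed spam texts), here is a minimal sketch of few-shot synthetic data generation; the prompt template and the `complete` helper are assumptions, not the tutorial’s actual code:

```python
SEEDS = [
    "URGENT: Your vote is needed. Reply YES now.",
    "Final notice: confirm your ballot status at the link below.",
]

PROMPT = """Here are {n} example political spam texts:
{examples}

Write {k} new, distinct texts in the same style, one per line."""

def complete(prompt: str) -> str:
    # Hypothetical stand-in for any LLM API call; canned output keeps the
    # sketch runnable offline.
    return ("- Act now: your district is counting on YOUR vote today!\n"
            "- Polls close soon. Tap the link to confirm your registration.")

def generate_synthetic(seeds: list[str], k: int = 10) -> list[str]:
    prompt = PROMPT.format(
        n=len(seeds),
        examples="\n".join(f"- {s}" for s in seeds),
        k=k,
    )
    lines = [l.lstrip("- ").strip() for l in complete(prompt).splitlines()]
    # Drop blanks and verbatim echoes of the seeds before keeping k samples.
    return [l for l in lines if l and l not in seeds][:k]

print(generate_synthetic(SEEDS, k=5))
```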

Anthropic has also been relying on synthetic data:3

> Amodei also doesn’t think a shortage of data will present a challenge to AI development, unlike some experts. Either by generating synthetic data or extrapolating out from existing data, AI developers will “get around” data limitations, he says.

Footnotes

  1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24.

  2. Create Your Own Synthetic Data With Only 5 Political Spam Texts [1/4] | Oxen.ai

  3. This Week in AI: Anthropic’s CEO talks scaling up AI and Google predicts floods | TechCrunch