Grok-1 is a 314B-parameter mixture-of-experts (MoE) LLM. It has:

  • 8192-token sequence length
  • 8 experts, with 2 active per token
  • 64 layers
  • 48 attention heads for queries, 8 for keys/values
  • 6144 embedding size (48 × 128)
  • SentencePiece tokenizer with a 131,072-token (128Ki) vocabulary
  • Rotary position embeddings (RoPE)
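The hyperparameters above can be collected into a small config sketch. This is illustrative only: the field names are assumptions, not Grok-1's actual configuration keys, and the head-dimension arithmetic simply checks the 48 × 128 = 6144 relation stated in the list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GrokConfig:
    # Field names are hypothetical; values are from the spec list above.
    seq_len: int = 8192            # sequence length
    num_experts: int = 8           # MoE experts per layer
    experts_per_token: int = 2     # active experts per token
    num_layers: int = 64
    num_query_heads: int = 48
    num_kv_heads: int = 8          # fewer KV heads than query heads
    embed_dim: int = 6144          # 48 heads × 128 dims per head
    vocab_size: int = 128 * 1024   # 131,072 (128Ki)

cfg = GrokConfig()
head_dim = cfg.embed_dim // cfg.num_query_heads
print(head_dim)  # 128: consistent with "6144 embedding size (48 × 128)"
```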

It was trained using JAX, and the released checkpoint ships as 770 shards totaling 318 GB, with parameters quantized to 8 bits.
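A quick back-of-envelope check, assuming roughly 1 byte per 8-bit parameter, shows why the checkpoint lands near the parameter count in gigabytes:

```python
# 314B parameters at 8 bits each is about 1 byte per parameter.
params = 314e9
checkpoint_gb = params * 1 / 1e9   # ≈ 314 GB, close to the 318 GB shipped
avg_shard_gb = 318 / 770           # ≈ 0.41 GB per shard on average
print(checkpoint_gb, avg_shard_gb)
```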