Grok-1 is a 314-billion-parameter mixture-of-experts (MoE) LLM. Its architecture:
- Maximum sequence length of 8,192 tokens
- 8 experts, with 2 active per token
- 64 layers
- 48 attention heads for queries, 8 for keys/values
- Embedding size of 6,144 (48 heads × 128 head dimension)
- SentencePiece tokenizer with a 131,072-token (128Ki) vocabulary
- Rotary positional embeddings (RoPE)
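The "8 experts, 2 active per token" line describes top-2 gating: a router scores all experts per token, keeps the two highest scores, and renormalizes them into mixing weights. A minimal numpy sketch of that routing step (function names and gating details are illustrative, not Grok-1's actual implementation):

```python
import numpy as np

def top2_route(logits: np.ndarray):
    """Pick the two highest-scoring experts per token and renormalize
    their scores into gate weights with a softmax over just those two.

    logits: (num_tokens, num_experts) router scores.
    Returns (expert_ids, gates), both shaped (num_tokens, 2).
    """
    # argsort ascending, take the last two columns, reverse so the
    # best expert comes first
    top2 = np.argsort(logits, axis=-1)[:, -2:][:, ::-1]
    top2_scores = np.take_along_axis(logits, top2, axis=-1)
    # softmax over only the two selected scores
    exp = np.exp(top2_scores - top2_scores.max(axis=-1, keepdims=True))
    gates = exp / exp.sum(axis=-1, keepdims=True)
    return top2, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))       # 4 tokens, 8 experts (as in Grok-1)
experts, gates = top2_route(logits)
print(experts.shape, gates.shape)      # (4, 2) (4, 2)
```

Each token's output is then the gate-weighted sum of its two selected experts' outputs, so only 2/8 of the expert parameters are exercised per token.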
The model was trained with JAX, and the released checkpoint ships as 770 shards totaling 318 GB, with parameters quantized to 8 bits.
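The 318 GB figure is roughly what 8-bit quantization predicts. A quick back-of-the-envelope check (assuming ~1 byte per parameter and ignoring any tensors kept at higher precision):

```python
# Rough consistency check: 314B parameters at 8 bits/parameter.
params = 314e9
bytes_per_param = 1                     # 8-bit quantization
size_gb = params * bytes_per_param / 1e9
print(round(size_gb))                   # 314
```

That lands within a few GB of the shipped 318 GB, the remainder plausibly coming from metadata or unquantized tensors.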