LayerNorm is an operation that shifts and rescales a vector so that its mean is zero and its variance is 1.0, then rescales and shifts that normalized result using two learned parameter vectors.

This brings the magnitude of one vector (the activations) in line with the magnitude of another vector before the two are mixed.

When applied to a vector of length $d$, LayerNorm contributes $2d$ learned parameters to a model everywhere it is used: a length-$d$ scale $\gamma$ and a length-$d$ bias $\beta$.
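As a quick sanity check on that count, here is a sketch using PyTorch's nn.LayerNorm; the hidden size 768 is just an illustrative choice (roughly GPT-2's), not anything fixed by LayerNorm itself:

```python
import torch.nn as nn

d = 768  # illustrative hidden size; any d works
ln = nn.LayerNorm(d)

# gamma (ln.weight) and beta (ln.bias) each hold d parameters
n_params = sum(p.numel() for p in ln.parameters())
print(n_params)  # 1536 == 2 * d
```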

Mathematically

Given an input vector $x \in \mathbb{R}^d$, LayerNorm is:

$$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where

$\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ is the mean of the input vector

$\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2}$ is the standard deviation of the input vector

$\gamma, \beta \in \mathbb{R}^d$ are the learned per-dimension scale and bias

$\epsilon$ is a small numerical stability constant
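A minimal NumPy sketch of this formula, assuming a single 1-D input (framework implementations normalize over the last axis of a batched tensor, but the arithmetic is the same):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to zero mean and unit variance, then apply
    the learned per-dimension scale (gamma) and bias (beta)."""
    mu = x.mean()
    var = x.var()  # population variance, matching the formula above
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

d = 8
x = 5.0 * np.random.randn(d) + 3.0       # arbitrary scale and offset
y = layer_norm(x, np.ones(d), np.zeros(d))
print(y.mean(), y.std())                 # ~0 and ~1
```

With $\gamma = 1$ and $\beta = 0$ the output is just the standardized input; training moves $\gamma$ and $\beta$ away from these defaults.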

Intuitively: the normalization step erases whatever scale and offset the input happened to have, and the learned $\gamma$ and $\beta$ let the model pick the scale and offset that the next layer actually wants.