Visualizing the flow of data in a Continuous Bag of Words model.
Two context words (previous & next) predict the center target word.
The hidden layer state $h$ is the average of the embedding vectors (rows of the input matrix $W$) for the input context words. It compresses the context into a single vector of size $N$, the embedding dimension:

$$h = \frac{1}{C} \sum_{w \in \text{context}} \text{vec}(w)$$

where $C$ is the number of context words. In our case, with a window size of 1 (one word before, one word after), $C = 2$:

$$h = \frac{\text{vec}(w_{t-1}) + \text{vec}(w_{t+1})}{2}$$
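The averaging step can be sketched in a few lines of plain Python. The vocabulary, the values in `W`, and the helper name `hidden_state` are all made up for illustration; only the shapes ($V = 5$, $N = 3$) follow the toy model.

```python
# Hypothetical toy vocabulary and input embedding matrix W (V=5 rows, N=3 columns).
vocab = ["the", "cat", "sat", "on", "mat"]
W = [
    [0.1, 0.2, 0.3],  # vec("the")
    [0.4, 0.5, 0.6],  # vec("cat")
    [0.7, 0.8, 0.9],  # vec("sat")
    [1.0, 1.1, 1.2],  # vec("on")
    [1.3, 1.4, 1.5],  # vec("mat")
]

def hidden_state(context_words):
    """Return h: the element-wise average of the context words' embedding rows."""
    rows = [W[vocab.index(word)] for word in context_words]
    C = len(rows)  # number of context words
    return [sum(column) / C for column in zip(*rows)]

# Context around the target "cat": previous word "the", next word "sat".
h = hidden_state(["the", "sat"])
print(h)  # a single vector of length N = 3
```

Note that the order of the context words does not matter, since averaging is symmetric. This is the "bag of words" part of the name.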
To predict the target, the hidden vector $h$ is multiplied by the Output Matrix $W'$ (dimensions $N \times V$). This produces a raw score $z_j$ for every word $j$ in the vocabulary. $$z = h \cdot W'$$ A high score for a word means its column in $W'$ aligns closely with the context vector $h$ (their dot product is large). Finally, Softmax converts these scores into probabilities that sum to 1.
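As a rough sketch of this output step, the snippet below computes $z = h \cdot W'$ and applies Softmax. The values of `h` and `W_out` are invented for illustration, and the function names are hypothetical; subtracting the maximum score before exponentiating is a common numerical-stability trick, not something the visualizer necessarily does.

```python
import math

def scores(h, W_out):
    """Compute z = h · W', where W_out has N rows and V columns."""
    n, v = len(W_out), len(W_out[0])
    return [sum(h[i] * W_out[i][j] for i in range(n)) for j in range(v)]

def softmax(z):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(z)  # shift by the max score for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values only: h has N=3 entries, W_out is N x V = 3 x 5.
h = [0.4, 0.5, 0.6]
W_out = [
    [0.2, -0.1, 0.5, 0.0, 0.3],
    [0.1, 0.4, -0.2, 0.3, 0.0],
    [-0.3, 0.2, 0.1, 0.6, -0.1],
]

z = scores(h, W_out)
p = softmax(z)
predicted = max(range(len(p)), key=p.__getitem__)  # index of the most probable word
print(z, p, predicted)
```

The predicted word is simply the index with the highest probability, which is also the index with the highest raw score, because Softmax preserves ordering.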
While this visualizer helps explain the mechanism, real-world models operate at a much larger scale. Below is a comparison between our toy model and the well-known pre-trained Google News model (Mikolov et al., 2013).
| Hyperparameter | This Toy Model | Google News Model |
|---|---|---|
| Vocabulary Size ($V$) | 5 words | 3,000,000 words and phrases |
| Embedding Size ($N$) | 3 dimensions | 300 dimensions |
| Window Size | 1 (Total 2 context words) | 5 (Total 10 context words) |
| Parameters in $W$ ($V \times N$) | 15 | 900,000,000 |
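The parameter counts in the table follow directly from $V \times N$. The short check below reproduces them; the function name is hypothetical, and the memory estimate assumes 4-byte (float32) weights, which is an assumption rather than something stated above.

```python
def embedding_params(vocab_size, embedding_size):
    """Number of weights in the input matrix W (V rows, N columns)."""
    return vocab_size * embedding_size

toy = embedding_params(5, 3)
google_news = embedding_params(3_000_000, 300)

# Assumption: each weight is stored as a 4-byte float32.
google_news_gb = google_news * 4 / 1e9

print(toy)             # 15
print(google_news)     # 900000000
print(google_news_gb)  # 3.6
```

These figures count only $W$. The output matrix $W'$ has dimensions $N \times V$, so it holds the same number of weights again.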
Note on Input Complexity: In a naive neural network approach where one-hot input vectors are concatenated, a context of 10 words with a vocabulary of 3 million would result in a massive input layer of 30 million nodes ($10 \times 3M$). Word2Vec avoids this by mapping each context word to a row of a shared embedding matrix (a lookup table) and averaging those rows, so the hidden layer stays at $N$ dimensions no matter how many context words there are.
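To make the lookup-table point concrete, the sketch below shows that multiplying a one-hot vector by an embedding matrix returns the same values as reading the corresponding row directly. The matrix values and helper names are hypothetical.

```python
def one_hot(index, size):
    """Length-`size` vector with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def row_times_matrix(x, W):
    """Multiply row vector x (length V) by matrix W (V rows, N columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

# Illustrative embedding matrix: V=5 words, N=3 dimensions.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]

word_index = 2
via_matmul = row_times_matrix(one_hot(word_index, len(W)), W)  # V*N multiplications
via_lookup = W[word_index]                                     # a single row read

# Size of the naive concatenated one-hot input from the note above.
naive_input_nodes = 10 * 3_000_000

print(via_matmul, via_lookup, naive_input_nodes)
```

Because the two results match, an implementation never needs to build the one-hot vectors at all: it can index into the matrix and average the rows it reads.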