Large pre-trained word embeddings are a cornerstone of modern NLP, but their memory footprint is a real deployment bottleneck. A standard 300-dimensional GloVe vocabulary of 2 million tokens consumes ~2.4 GB in float32. On mobile devices, edge hardware, or in multi-model serving environments, this is often untenable. This article surveys practical techniques for compressing word embeddings — from quantization and pruning to knowledge distillation and hash-based methods — with a focus on what actually works in production.
Why Embedding Compression Matters
Embedding layers frequently dominate the parameter count of NLP models. In a typical sequence labeling or text classification pipeline, the embedding matrix can account for 80–95% of total parameters. Even in transformer-based models, the embedding and output projection layers together represent a significant fraction of memory.
The practical consequences are concrete:
- Mobile/edge deployment: Shipping a 2 GB embedding table inside a mobile app is not viable.
- Multi-tenant serving: When serving hundreds of fine-tuned models, each carrying its own embedding copy, memory costs multiply.
- Cold start latency: Loading large embeddings from disk adds seconds to model initialization.
- Training efficiency: Smaller embedding layers mean faster gradient updates and reduced communication overhead in distributed training.
Technique 1: Scalar and Product Quantization
The most straightforward approach is reducing numerical precision. Instead of storing each embedding dimension as a 32-bit float, we can use fewer bits.
Uniform scalar quantization
Map each dimension to an 8-bit or 4-bit integer using a simple affine transform:
q = round((x - min) / (max - min) * (2^b - 1))
where b is the target bit width. This achieves a 4× compression at 8 bits and 8× at 4 bits, with surprisingly small quality degradation for most downstream tasks.
Results from practice: 8-bit quantization of GloVe-300d typically loses less than 0.5% accuracy on standard benchmarks (SST-2, CoNLL NER). At 4 bits, losses increase to 1–3% depending on the task.
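As a minimal sketch (NumPy, a single per-matrix min/max rather than per-dimension scales, helper names of my choosing), 8-bit uniform quantization and its inverse look like this:

```python
import numpy as np

def quantize_uniform(E, bits=8):
    """Affine-quantize an embedding matrix E (float32) to `bits`-bit integer codes."""
    lo, hi = E.min(), E.max()
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((E - lo) / scale).astype(np.uint8)   # integer codes
    return q, lo, scale

def dequantize_uniform(q, lo, scale):
    """Recover an approximate float32 matrix from the integer codes."""
    return q.astype(np.float32) * scale + lo

# Example: a 10k-word, 300-d matrix drops from ~12 MB to ~3 MB (plus two scalars).
E = np.random.randn(10_000, 300).astype(np.float32)
q, lo, scale = quantize_uniform(E, bits=8)
E_hat = dequantize_uniform(q, lo, scale)
print(np.abs(E - E_hat).max())   # worst-case per-element error is about scale / 2
```

In practice, per-dimension (or per-block) scales typically recover a little more accuracy at the same bit width, especially at 4 bits.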
Product quantization (PQ)
Product quantization, introduced by Jégou et al. (2011), splits each embedding vector into M sub-vectors and quantizes each sub-vector independently using a small codebook learned via k-means.
For example, a 300-dimensional embedding split into M=30 sub-vectors of 10 dimensions each, with 256 centroids per sub-space, requires only 30 bytes per word (30 × 8 bits) instead of 1200 bytes — a 40× compression ratio.
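A minimal sketch of post-hoc PQ using scikit-learn's KMeans follows; the function names are mine, and a production encoder would batch the work and use an optimized library such as faiss rather than fitting 30 separate k-means models this naively.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_compress(E, M=30, K=256, seed=0):
    """Split the columns of E (V x D) into M sub-spaces, run k-means in each,
    and return the codebooks plus the per-word codes."""
    V, D = E.shape
    d = D // M
    codebooks = np.empty((M, K, d), dtype=np.float32)
    codes = np.empty((V, M), dtype=np.uint8)          # e.g. 30 bytes/word for M=30, K=256
    for m in range(M):
        sub = E[:, m * d:(m + 1) * d]
        km = KMeans(n_clusters=K, n_init=4, random_state=seed).fit(sub)
        codebooks[m] = km.cluster_centers_
        codes[:, m] = km.labels_                      # nearest-centroid index per word
    return codebooks, codes

def pq_decode(codebooks, codes):
    """Reconstruct approximate embeddings by concatenating the selected centroids."""
    M = codebooks.shape[0]
    return np.concatenate([codebooks[m][codes[:, m]] for m in range(M)], axis=1)
```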
Shu and Nakayama (2018) extended this idea with neural network-based codebook learning, where the quantization assignment and codebook entries are jointly optimized end-to-end with the downstream task. This approach, implemented in the neuralcompressor library, consistently outperforms post-hoc PQ because the codebooks adapt to the loss landscape rather than just minimizing reconstruction error.
```python
import torch
import torch.nn.functional as F

# Conceptual example of differentiable product quantization:
# each embedding is encoded as M codebook indices.
def compress_embedding(x, codebooks, M, D):
    sub_vectors = x.reshape(M, D // M)
    indices = []
    for m in range(M):
        # Soft assignment during training (Gumbel-softmax),
        # hard assignment during inference.
        distances = torch.cdist(sub_vectors[m].unsqueeze(0),
                                codebooks[m]).squeeze(0)
        idx = F.gumbel_softmax(-distances, hard=True)   # one-hot over centroids
        indices.append(idx)
    return indices  # M codes per word instead of D * 4 bytes
```
Technique 2: Low-Rank Factorization
Instead of storing a full V × D embedding matrix E, decompose it into two smaller matrices:
E ≈ A × B, where A ∈ R^(V×k), B ∈ R^(k×D), k << D
This reduces parameters from V×D to (V+D)×k. With V=400k, D=300, and k=64, we go from 120M to ~26M parameters — a 4.6× reduction.
Truncated SVD is the simplest approach: compute the SVD of the pre-trained embedding matrix and keep only the top-k singular values. This is a one-shot compression that works well when the embedding matrix has rapidly decaying singular values (which it typically does).
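A minimal NumPy sketch of the truncated-SVD variant (absorbing the singular values into the left factor is one common convention; the helper name is mine):

```python
import numpy as np

def svd_factorize(E, k=64):
    """One-shot low-rank factorization: E (V x D) ≈ A (V x k) @ B (k x D)."""
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :k] * S[:k]          # absorb singular values into the left factor
    B = Vt[:k, :]
    return A.astype(np.float32), B.astype(np.float32)

# Toy example; with V=400k, D=300, k=64 this is the 120M -> ~26M parameter case.
E = np.random.randn(5_000, 300).astype(np.float32)
A, B = svd_factorize(E, k=64)
E_hat = A @ B
rel_err = np.linalg.norm(E - E_hat) / np.linalg.norm(E)
print(A.shape, B.shape, f"relative reconstruction error: {rel_err:.3f}")
```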
Learned factorization during training is even better. Acharya et al. (2019) showed that training with a factored embedding layer from scratch, with an appropriate rank schedule, achieves comparable quality to the full-rank version at significant compression.
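When retraining is possible, the factored layer itself is only a few lines. Here is a PyTorch sketch (the rank schedule from Acharya et al. is omitted, and the class name is mine):

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Embedding lookup through a rank-k bottleneck: a (V x k) table followed
    by a (k x D) projection, trained jointly with the downstream task."""
    def __init__(self, vocab_size, dim, rank=64):
        super().__init__()
        self.codes = nn.Embedding(vocab_size, rank)    # A: V x k
        self.proj = nn.Linear(rank, dim, bias=False)   # B: k x D
    def forward(self, token_ids):
        return self.proj(self.codes(token_ids))

# Usage: drop-in replacement for nn.Embedding(400_000, 300) during training.
emb = FactorizedEmbedding(vocab_size=400_000, dim=300, rank=64)
out = emb(torch.tensor([[1, 42, 7]]))   # shape (1, 3, 300)
```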
A related idea is block-sparse embeddings, where the matrix is constrained to have sparse structure rather than low rank, allowing hardware-efficient sparse operations.
Technique 3: Vocabulary Pruning and Subword Sharing
Often the simplest compression is reducing vocabulary size:
- Frequency pruning: Remove rare words (below a frequency threshold) and fall back to a subword tokenizer (BPE, WordPiece). A vocabulary of 2M words pruned to 50k subwords reduces the embedding table by 40× (a minimal sketch of the subword fallback follows this list).
- Subword sharing: Models like ALBERT explicitly share the embedding matrix across the input and output layers, halving the embedding parameter count.
- Adaptive embeddings (Baevski and Auli, 2019): Assign larger embedding dimensions to frequent tokens and smaller dimensions to rare ones. This exploits the Zipfian distribution of natural language — the top 10k words cover ~95% of text, so they deserve higher-capacity representations.
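A minimal sketch of the frequency-pruning fallback from the first item; `subword_encode` and `subword_vectors` are placeholders for whatever BPE/WordPiece tokenizer and subword embedding table you actually use.

```python
from collections import Counter

def prune_vocab(corpus_tokens, embeddings, keep=50_000):
    """Keep embeddings only for the `keep` most frequent tokens seen in the corpus."""
    freq = Counter(corpus_tokens)
    kept = [w for w, _ in freq.most_common(keep) if w in embeddings]
    return {w: embeddings[w] for w in kept}

def lookup(word, pruned, subword_encode, subword_vectors):
    """Exact hit if the word survived pruning; otherwise average its subword vectors."""
    if word in pruned:
        return pruned[word]
    pieces = subword_encode(word)          # placeholder for a BPE/WordPiece encoder
    return sum(subword_vectors[p] for p in pieces) / len(pieces)
```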
Technique 4: Hash Embeddings
Hash-based approaches eliminate the explicit lookup table entirely. Instead of storing a unique vector for each word, multiple hash functions map words to a smaller set of shared embedding vectors, which are then combined:
```python
# `embedding_tables` is a list of num_hashes small lookup tables (each table_size x dim),
# and `hash_fn(word, i)` is a seeded string hash.
def hash_embedding(word, num_hashes=2, table_size=50_000, dim=300):
    vecs = [embedding_tables[i][hash_fn(word, i) % table_size]
            for i in range(num_hashes)]
    return sum(vecs)  # or concatenate, or a weighted sum
```
Svenstrup et al. (2017) showed that hash embeddings with 2–3 hash functions and importance weighting approach the quality of full lookup tables while using 10–50× less memory. The key insight is that hash collisions are tolerable because context disambiguates most words.
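A simplified PyTorch sketch of the importance-weighted variant; the multiplicative hash and the table sizes here are placeholder choices, not the exact construction from the paper.

```python
import torch
import torch.nn as nn

class HashEmbedding(nn.Module):
    """Each token id selects k shared component vectors via different hash seeds
    and combines them with learned importance weights."""
    def __init__(self, num_buckets=50_000, dim=300, k=2, weight_table_size=1_000_000):
        super().__init__()
        self.components = nn.Embedding(num_buckets, dim)       # shared vector pool
        self.importance = nn.Embedding(weight_table_size, k)   # per-id mixing weights
        self.seeds = [torch.randint(1, 2**31 - 1, ()).item() for _ in range(k)]
        self.num_buckets = num_buckets
        self.weight_table_size = weight_table_size

    def forward(self, ids):                                    # ids: LongTensor of token ids
        w = self.importance(ids % self.weight_table_size)      # (..., k)
        vecs = torch.stack(
            [self.components((ids * s) % self.num_buckets) for s in self.seeds],
            dim=-2)                                            # (..., k, dim)
        return (w.unsqueeze(-1) * vecs).sum(dim=-2)            # (..., dim)
```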
Technique 5: Knowledge Distillation for Embeddings
Train a small "student" embedding model to mimic the representations of a large "teacher":
- Start with a pre-trained teacher embedding (e.g., GloVe-300d).
- Define a student embedding with lower dimensionality (e.g., 64d).
- Train the student to minimize a combination of the following (a sketch of the combined loss follows this list):
- Reconstruction loss: MSE between projected student and teacher embeddings.
- Task loss: Cross-entropy on the downstream task.
- Relational loss: Preserve pairwise distances/similarities from the teacher space.
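A sketch of the combined objective; the loss weights and the `proj` layer that maps the 64-d student space up to the 300-d teacher space are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, logits, labels,
                      proj, alpha=1.0, beta=1.0, gamma=0.1):
    """Weighted sum of the three losses above; the weights are illustrative."""
    # Reconstruction: projected student should match the frozen teacher vectors.
    recon = F.mse_loss(proj(student_emb), teacher_emb)
    # Task: ordinary cross-entropy on the downstream labels.
    task = F.cross_entropy(logits, labels)
    # Relational: preserve pairwise cosine similarities within the batch.
    s_norm = F.normalize(student_emb, dim=-1)
    t_norm = F.normalize(teacher_emb, dim=-1)
    rel = F.mse_loss(s_norm @ s_norm.T, t_norm @ t_norm.T)
    return alpha * task + beta * recon + gamma * rel
```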
This approach is particularly effective when combined with quantization — distill to 64 dimensions, then quantize to 4 bits, for a total compression of ~37×.
What I Have Learned From Building neuralcompressor
From building and maintaining neuralcompressor, a few practical lessons:
- End-to-end training beats post-hoc compression. When you can retrain, always prefer methods that optimize compression jointly with the task objective. The gap widens as compression ratios increase.
- Composition of techniques works. The best results come from combining complementary approaches — for example, vocabulary pruning + product quantization + low-rank factorization can achieve 100×+ compression with minimal quality loss.
- Evaluation must be task-specific. Reconstruction error (cosine similarity to the original) is a poor proxy for downstream performance. Always evaluate on your actual task. Some tasks (sentiment analysis) are very robust to compression; others (fine-grained NER) are more sensitive.
- The Pareto frontier depends on your hardware. Product quantization gives better quality-per-bit than scalar quantization, but scalar quantization is faster to decode on CPUs without SIMD codebook lookup support. Profile on your target hardware.
- Vocabulary overlap matters more than you think. In domain-specific applications, a small, well-curated domain vocabulary often outperforms a compressed general-purpose vocabulary. Consider this before reaching for compression.
Comparison Summary
| Technique | Typical Compression | Quality Impact | Requires Retraining? |
|---|---|---|---|
| 8-bit scalar quantization | 4× | Very low (<0.5%) | No |
| Product quantization | 10–40× | Low–moderate | Optional (better with retraining) |
| Low-rank factorization | 3–6× | Low | Optional |
| Vocabulary pruning | 5–40× | Task-dependent | Yes |
| Hash embeddings | 10–50× | Moderate | Yes |
| Knowledge distillation | 5–20× | Low–moderate | Yes |
| PQ + distillation + pruning | 50–200× | Moderate | Yes |
Looking Ahead
The trend toward subword tokenization (BPE, SentencePiece) and contextual embeddings (BERT, etc.) has partially reduced the urgency of static embedding compression. But the core problem resurfaces in new forms:
- LLM embedding layers still contain millions of parameters that benefit from quantization.
- Multilingual models with 250k+ token vocabularies face the same V×D scaling problem.
- Retrieval-augmented generation (RAG) systems store millions of document embeddings that need compression for practical vector databases.
- On-device LLMs (phones, browsers) are driving renewed interest in aggressive quantization of all model components, embeddings included.
The fundamental trade-off — information density vs. computational cost — remains, and the techniques described here continue to evolve alongside the models they compress.
References
- Shu, R. and Nakayama, H. (2018). Compressing Word Embeddings via Deep Compositional Code Learning. ICLR 2018.
- Jégou, H. et al. (2011). Product Quantization for Nearest Neighbor Search. IEEE TPAMI.
- Acharya, A. et al. (2019). Online Embedding Compression for Text Classification using Low Rank Matrix Factorization. AAAI 2019.
- Baevski, A. and Auli, M. (2019). Adaptive Input Representations for Neural Language Modeling. ICLR 2019.
- Svenstrup, D. et al. (2017). Hash Embeddings for Efficient Word Representations. NeurIPS 2017.