Embedding-based search in production almost never uses naive nearest-neighbor search. Instead, a two-stage architecture dominates.

Stage one is fast retrieval: embed the user query, then use Approximate Nearest Neighbor (ANN) algorithms like HNSW (Hierarchical Navigable Small World) to find the top 50–200 most similar documents from a vector database containing millions of vectors. This stage trades perfect accuracy for speed — modern ANN libraries can retrieve from 100 million vectors in under 10 milliseconds.

Stage two is reranking: a more expensive cross-encoder model jointly processes the query and each candidate document, producing a precise relevance score for each pair. The top 5–20 results after reranking go to the user or into an LLM for answer generation.

Why two stages? Bi-encoder embeddings (used in stage one) are fast because query and document are encoded independently — but they compress meaning into a single vector, losing fine-grained relevance signals. Cross-encoders (used in stage two) are more accurate because they can attend jointly across query and document tokens, but they're too slow to run against millions of documents. The two-stage architecture gets the best of both.

Production systems add further refinements: query expansion (generate multiple paraphrases of the user query and retrieve for each), hybrid retrieval (combine embedding-based and keyword-based search), and learned sparse embeddings (SPLADE) that combine the precision of keyword search with the semantic understanding of dense embeddings. These production patterns underpin every modern RAG system and search application running at scale.
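The retrieve-then-rerank flow can be sketched in a few lines of plain Python. This is a toy illustration, not a production implementation: `embed` here is a bag-of-words stand-in for a real bi-encoder, the exhaustive scan in `retrieve` stands in for an ANN index like HNSW, and the `scorer` passed to `rerank` stands in for a cross-encoder. Only the shape of the pipeline carries over.

```python
import math

def embed(text):
    # Toy stand-in for a bi-encoder: a bag-of-words count vector.
    # In production this would be a neural sentence embedding.
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=3):
    # Stage one: score every document against the query embedding and
    # keep the top-k candidates. A real system would query an ANN index
    # (e.g. HNSW) instead of scanning the whole corpus.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def rerank(query, candidates, scorer):
    # Stage two: apply an expensive pairwise scorer (a cross-encoder in
    # production) to the small candidate set only.
    return sorted(candidates, key=lambda d: scorer(query, d), reverse=True)
```

The key design point is visible in the signatures: `retrieve` touches the whole corpus but only ever compares precomputed vectors, while `rerank` sees the raw query-document pair and can score it as precisely as it likes, because it runs on 50–200 candidates rather than millions.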
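For the hybrid-retrieval refinement, one common way to combine an embedding-based ranking with a keyword-based ranking is Reciprocal Rank Fusion (RRF); the sketch below assumes each retriever has already produced an ordered list of document IDs.

```python
def rrf_fuse(rankings, k=60):
    # Reciprocal Rank Fusion: merge several ranked lists into one.
    # Each document's fused score is the sum of 1 / (k + rank) over
    # every list it appears in. The constant k (60 in the original
    # RRF paper) damps the influence of any single top-ranked hit.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that cosine similarities and keyword scores (e.g. BM25) live on incomparable scales.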
NLP Embeddings at Scale: Retrieval, Reranking, and Production Patterns
Production NLP systems rarely use embeddings alone. The real pattern is a two-stage pipeline: fast approximate retrieval over millions of embeddings, then precise reranking of the top candidates. Understanding this architecture is what separates prototype demos from systems serving real user traffic.