Machine Learning Pills

Issue #85 - Advanced Retrieval Strategies: HyDE

David Andrés · Jan 04, 2025
💊 Pill of the Week

Large Language Models (LLMs) have revolutionized how we process and retrieve information, but they face challenges when dealing with mismatches between query formats and document content. Hypothetical Document Embeddings (HyDE) offers an innovative solution to this problem by transforming user queries into document-like formats before performing retrieval.

In the past, we've explored other advanced retrieval strategies. You can review them here:

Issue #66 - Advanced Retrieval Strategies: Query Translation I (July 20, 2024)

Issue #70 - Advanced Retrieval Strategies: Query Translation II (August 25, 2024)

Understanding HyDE

HyDE addresses a fundamental challenge in Retrieval Augmented Generation (RAG) systems: the disparity between short, potentially informal user queries and longer, well-structured documents. The technique works by using an LLM to generate a hypothetical document that answers the user's query, then using this generated document for retrieval instead of the original query.

The power of HyDE lies in its ability to bridge the gap between query and document spaces. When a user submits a query, HyDE first transforms it into a well-structured document that might contain the answer. While this generated document may contain inaccurate information, its structure and format more closely match the actual documents in the knowledge base, making retrieval more effective.
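To make this concrete, here is a minimal sketch of the transformation step, assuming an OpenAI-style chat client. The model name, prompt wording, and default temperature are illustrative choices for this newsletter, not a canonical recipe from the HyDE paper:

```python
# Minimal sketch of the HyDE transformation step. The client, model name,
# and prompt wording are illustrative assumptions, not a canonical recipe.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def hypothetical_document(query: str, temperature: float = 0.7) -> str:
    """Ask the LLM to write a passage that would plausibly answer the query."""
    prompt = (
        "Write a short, well-structured passage that answers the question "
        "below, in the style of a knowledge-base article. Minor factual "
        f"imperfections are acceptable.\n\nQuestion: {query}\n\nPassage:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content
```

The generated passage is then used for retrieval in place of the raw query, as detailed in the implementation steps below.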

🎉 15-Day Free Subscription Giveaway! 🎉
We love giving back to our readers! In every issue of this newsletter, one lucky person who ❤️ likes the article will win a free 15-day subscription to MLPills.

Don’t miss your chance—like this article and you could be our next winner!

When to Use HyDE

HyDE is particularly valuable in several scenarios:

The technique shines when working with embedding models trained through contrastive learning on document-document pairs. These models, which learn to distinguish between semantically similar and dissimilar texts, benefit from HyDE's document-like query representations.

It's especially useful when dealing with specialized domains that differ significantly from typical datasets that retrievers are trained on. In these cases, HyDE can help bridge the domain gap between queries and documents.

HyDE can also be beneficial when your retrieval performance metrics, such as recall, aren't meeting expectations. By generating multiple hypothetical documents for each query, HyDE can capture different aspects of the information need, potentially improving retrieval accuracy.

However, not all scenarios require HyDE. If your embedding model has been specifically trained for asymmetric semantic search (matching short queries to longer documents) using supervised learning, HyDE may not provide significant benefits. This is often the case for models trained on question-answering datasets like MS MARCO.

Implementation Process

Implementing HyDE involves a sequence of structured steps, each designed to enhance retrieval performance by leveraging the power of large language models (LLMs) and embedding-based similarity search. Here's a detailed breakdown of the process:

  1. Query Generation: The process begins with the user submitting a query, which the system processes using an LLM capable of generating hypothetical documents. Along with the query, the system provides an instruction for the LLM to produce documents that comprehensively address the query's intent. To ensure diversity and robustness in the generated responses, the system typically creates multiple hypothetical documents, often around five, by varying the temperature settings during generation. This approach captures a broader semantic spectrum of the query, making the system more resilient to nuances or ambiguities.

  2. Embedding Creation: Once the hypothetical documents are generated, each is processed through an embedding model to convert the text into high-dimensional vectors that represent the semantic structure of the content. The embedding model acts as a lossy compressor, filtering out potentially irrelevant or inaccurate details from the generated documents while preserving their overarching semantic meaning. The embeddings from the multiple documents are then averaged into a single, consolidated vector. This aggregated embedding serves as a robust and comprehensive representation of the query’s intent, combining the strengths of the individual documents while minimizing noise.

  3. Similarity Matching: The final step involves matching the averaged embedding of the hypothetical documents against precomputed embeddings of the document corpus. Using a similarity metric such as cosine similarity, the system evaluates the relevance of each document in the corpus by measuring the angular distance between the embeddings. By operating entirely in the document-document embedding space, this approach leverages the enriched context and semantic depth of the hypothetical document embeddings, yielding better alignment with the corpus compared to direct query-to-document matching.

  4. Retrieval Results: Based on the similarity scores, the system retrieves the most relevant documents from the corpus. The use of hypothetical documents as intermediaries bridges potential semantic gaps between the user’s query and the corpus content, significantly enhancing retrieval performance, particularly for complex or ambiguous queries. Each step builds on the previous one to ensure that the final results are both accurate and contextually rich.

By combining LLM-generated documents, embedding aggregation, and precise similarity matching, HyDE establishes a robust framework for improving the quality and relevance of document retrieval.
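Putting the four steps together, here is a hedged end-to-end sketch. It reuses the hypothetical_document() helper from the earlier snippet and assumes a sentence-transformers embedding model and a small in-memory corpus; the model choice, the number of hypothetical documents, and k are illustrative:

```python
# End-to-end HyDE retrieval sketch: generate, embed, average, match.
# Assumes hypothetical_document() from the earlier snippet.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_retrieve(query: str, corpus: list[str], n_docs: int = 5, k: int = 3):
    # Step 1: generate several hypothetical documents, varying the
    # temperature slightly to encourage diverse phrasings.
    hypo_docs = [
        hypothetical_document(query, temperature=0.5 + 0.1 * i)
        for i in range(n_docs)
    ]

    # Step 2: embed each document and average into one consolidated vector.
    hypo_vecs = embedder.encode(hypo_docs)   # shape: (n_docs, dim)
    query_vec = hypo_vecs.mean(axis=0)       # shape: (dim,)

    # Step 3: cosine similarity against the corpus embeddings.
    # In production these corpus vectors would be precomputed and indexed.
    corpus_vecs = embedder.encode(corpus)
    sims = corpus_vecs @ query_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec)
    )

    # Step 4: return the k most similar documents with their scores.
    top = np.argsort(sims)[::-1][:k]
    return [(corpus[i], float(sims[i])) for i in top]
```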

Advantages and Limitations

HyDE offers several compelling benefits:

It helps normalize the format difference between queries and documents, potentially improving retrieval accuracy. The technique is also relatively simple to implement, requiring only a few additional LLM calls per query. By generating multiple hypothetical documents per query, it can capture different aspects of the information need, leading to more robust retrieval.

The technique is particularly effective at generalizing to new, unseen domains, making it valuable for specialized applications where traditional retrievers might struggle. The averaging of multiple hypothetical document embeddings helps reduce the impact of any individual generation errors or hallucinations.

However, HyDE does come with trade-offs:

The most significant is increased latency and computational cost due to the additional LLM generation steps and the need to generate multiple hypothetical documents per query. This overhead might be significant in high-throughput systems or when working with limited computational resources.

Recent research has shown that HyDE can be particularly effective when combined with other retrieval techniques. For instance, hybrid search approaches that combine HyDE with traditional methods like BM25 often produce superior results. Some studies have even found that concatenating the original query with the hypothetical document can further improve performance.
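As a rough illustration of such a hybrid, the sketch below blends BM25 scores on the raw query with dense HyDE similarities. The rank_bm25 package, the min-max normalization, and the 50/50 weighting are all assumptions made for the example, and dense_sims is assumed to hold one HyDE cosine score per corpus document:

```python
# Hedged hybrid-scoring sketch: lexical BM25 plus dense HyDE similarity.
# dense_sims is assumed to be one HyDE cosine score per corpus document.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_scores(query: str, corpus: list[str], dense_sims, alpha: float = 0.5):
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    lexical = np.array(bm25.get_scores(query.lower().split()))

    def minmax(x):
        # Rescale to [0, 1] so the two score types are comparable.
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    return alpha * minmax(lexical) + (1 - alpha) * minmax(np.array(dense_sims))
```

The query-concatenation variant mentioned above can be sketched the same way: embed the original query joined with the hypothetical document, rather than the hypothetical document alone.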

Performance Considerations

While HyDE can significantly improve retrieval quality, it comes with computational overhead. Each query requires an additional LLM inference step to generate the hypothetical document. Organizations implementing HyDE should carefully balance this tradeoff between retrieval quality and response time.

For real-time applications, consider strategies such as:

  • Caching frequently requested queries and their generated documents.

  • Using smaller, more efficient models for document generation.

  • Implementing parallel processing for the generation and embedding steps.

  • Setting appropriate timeouts and fallback mechanisms to ensure responsiveness.

By adopting these optimizations, you can reduce latency while maintaining the advantages of HyDE for retrieval augmentation.
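As a hedged example of the first and third strategies, the snippet below caches generated documents per query and runs the LLM calls in parallel. It reuses the hypothetical_document() helper from earlier; the cache size and worker count are illustrative:

```python
# Caching plus parallel generation: repeated queries skip the LLM
# entirely via the LRU cache, and first-time calls run concurrently.
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_hyde_docs(query: str, n_docs: int = 5) -> tuple[str, ...]:
    # Reuses hypothetical_document() from the earlier sketch; a tuple is
    # returned so the cached value is immutable.
    with ThreadPoolExecutor(max_workers=n_docs) as pool:
        return tuple(pool.map(hypothetical_document, [query] * n_docs))
```

Note that caching a fixed set of generations trades away some of the temperature-driven diversity described earlier, which is part of the quality/latency balance to weigh.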

Best Practices

When implementing HyDE, consider these recommendations:

  • The prompt to your LLM should be clear and consistent, focusing on generating document-like responses that maintain the semantic intent of the original query. While the generated document may contain factual inaccuracies, its structure and format should match your corpus documents.

  • Generate multiple hypothetical documents per query using different temperature settings to ensure diversity in the generated content. This helps capture different aspects of the query intent and makes the final embedding more robust.

  • Consider implementing a hybrid approach that combines HyDE with other retrieval methods. This can help balance the benefits of different approaches while mitigating their individual weaknesses.

  • Monitor the performance impact of adding HyDE to your pipeline. In some cases, the additional latency might not justify the improvement in retrieval quality, particularly if your existing embedding model is already well-suited to your use case.
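For the last recommendation, a small timing wrapper is often enough to quantify what HyDE adds per query. This is a generic sketch, not tied to any particular retrieval library:

```python
# Minimal latency monitoring: wrap any retrieval function and log how long
# each call takes, so the quality/latency trade-off can be measured.
import time
from functools import wraps

def timed(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - start:.2f}s")
        return result
    return wrapper

# Usage: compare timed(hyde_retrieve) against a timed plain dense retriever.
```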

Conclusion

HyDE represents an innovative approach to improving RAG retrieval by addressing the format mismatch between queries and documents. While it may not be necessary for all systems, particularly those using specially trained asymmetric search models, it offers a valuable tool for improving retrieval performance in many scenarios.

The technique's ability to transform queries into document-like formats, combined with its relatively straightforward implementation, makes it an attractive option for enhancing RAG systems. Its particular strength in handling specialized domains and ability to generate multiple hypothetical documents for robustness make it especially valuable for applications where traditional retrieval methods might struggle.

Consider experimenting with HyDE as part of a broader strategy for improving retrieval performance, potentially combining it with other techniques like hybrid search or reranking for optimal results. While the additional computational overhead should be carefully considered, the potential improvements in retrieval quality can make it a worthwhile addition to your RAG pipeline.


🎓 Further Learning*

Let us present: “From Beginner to Advanced LLM Developer”. This comprehensive course takes you from foundational skills to mastering scalable LLM products through hands-on projects, fine-tuning, RAG, and agent development. Whether you're building a standout portfolio, launching a startup idea, or enhancing enterprise solutions, this program equips you to lead the LLM revolution and thrive in a fast-growing, in-demand field.

Who Is This Course For?

This certification is for software developers, machine learning engineers, data scientists, and computer science or AI students who want to move rapidly into an LLM Developer role and start building.

*Sponsored: by purchasing any of their courses you would also be supporting MLPills.


⚡Power-Up Corner

The synergy between HyDE and Contriever represents a significant advancement in retrieval systems, particularly for organizations dealing with specialized domains and large-scale document collections. While HyDE excels at bridging the query-document gap, Contriever's sophisticated approach to document encoding creates a powerful foundation for more accurate and efficient retrieval.

Technical Foundation of Contriever
