💊 Pill of the Week
In the world of Natural Language Processing (NLP) and particularly in Retrieval-Augmented Generation (RAG) systems, the ability to effectively handle large documents is crucial. This is where text chunking comes into play. Today we will explore the concept of text chunking, its importance in RAG systems, various methods of implementation, and considerations for optimal use.
What is Text Chunking?
Text chunking, also known as text splitting, is the process of breaking down large documents or texts into smaller, more manageable pieces called "chunks." These chunks are typically designed to be self-contained units of information that can be processed and retrieved independently.
The Importance of Chunking in RAG
In RAG systems, chunking plays a vital role in several stages of the process:
Document Ingestion and Preprocessing: Chunking is a key step in preparing documents for use in a RAG system.
Indexing: Each chunk is indexed, usually by converting it into a vector representation.
Retrieval: The system searches for relevant chunks based on a query, allowing for more precise retrieval.
Context Formation: Relevant chunks are used to form the context for the language model.
Generation: The language model uses the retrieved chunks as additional context for generating a response.
Read more about RAG in this previous issue:
How Do Text Splitters Work?
Text splitters in RAG systems typically follow this process:
Split the text into small, semantically meaningful pieces (often sentences).
Combine these small pieces into larger chunks until reaching a certain size.
Once the size limit is reached, create a new chunk, often with some overlap to maintain context.
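To make these steps concrete, here is a minimal, illustrative sketch of the split-then-merge logic. The function name and the naive sentence-splitting rule are simplified assumptions, not LangChain's actual implementation:
def simple_chunk(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    # 1. Split the text into small pieces (here, naively, on full stops).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}." if current else f"{sentence}."
        # 2. Keep merging pieces while the chunk stays under the size limit.
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # 3. Size limit reached: close this chunk and start a new one,
            #    carrying over the tail of the previous chunk as overlap.
            chunks.append(current)
            current = f"{current[-overlap:].strip()} {sentence}."
    if current:
        chunks.append(current)
    return chunks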
Text splitters can be customized along two axes:
How the text is split
How the chunk size is measured
Types of Text Splitters in LangChain
LangChain offers various text splitters, each with its own strengths and use cases:
Recursive Text Splitter: Recursively splits text into chunks, aiming to keep related pieces together without adding metadata.
HTML Text Splitter: Splits text based on HTML structure, adding metadata about the origins of each chunk.
Markdown Text Splitter: Divides text based on Markdown headers, preserving structure and adding header-level metadata.
Code Text Splitter: Splits text based on the syntax of various programming languages, useful for processing code.
Token Text Splitter: Splits text based on token count, offering flexibility with various methods to measure tokens.
Character Text Splitter: Splits text based on a specific user-defined character, such as a newline or space.
Semantic Chunker: Splits text into sentences and then combines them based on semantic similarity using embeddings.
AI21 Semantic Text Splitter: Identifies and splits text into coherent pieces based on distinct topics, adding relevant metadata.
🚨 In the free version of this issue we will share only five of them; for the remaining ones, please consider becoming a paid subscriber:
For the full issue you can check here:
Recursive Text Splitter
Classes: RecursiveCharacterTextSplitter, RecursiveJsonSplitter
Splits on: User-defined characters
Adds metadata: No
Description: Recursively splits text, trying to keep related pieces together. Recommended for general use.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# docs is assumed to be a list of strings (your raw documents)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # maximum characters per chunk
    chunk_overlap=100,     # characters shared between consecutive chunks
    length_function=len,   # measure chunk size in characters
    is_separator_regex=False,
)
chunks = text_splitter.create_documents(docs)
HTML Text Splitter
Classes: HTMLHeaderTextSplitter, HTMLSectionSplitter
Splits on: HTML-specific characters
Adds metadata: Yes
Description: Splits text based on HTML structure, adding relevant information about chunk origins.
from langchain_text_splitters import HTMLHeaderTextSplitter

# Header tags to split on, with the metadata key to store each one under
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

# html_string is assumed to contain the raw HTML of your document
html_header_splits = html_splitter.split_text(html_string)
This can then be pipelined to another splitter (like the recursive one).
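A minimal sketch of that pipeline, reusing the html_header_splits produced above (the chunk sizes are arbitrary, illustrative values):
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Further split the header-based sections into size-bounded chunks;
# the header metadata from the HTML splitter is preserved on each chunk.
chunk_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
final_chunks = chunk_splitter.split_documents(html_header_splits)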
Markdown Text Splitter
Class: MarkdownHeaderTextSplitter
Splits on: Markdown-specific characters
Adds metadata: Yes
Description: Splits text based on Markdown structure, adding information about chunk origins.
Its implementation is similar to the previous one:
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Markdown header levels to split on, with the metadata key for each
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

# markdown_document is assumed to contain the raw Markdown text
md_header_splits = markdown_splitter.split_text(markdown_document)
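Each resulting chunk is a Document that carries the headers it falls under as metadata, which you can inspect directly (the printed values are illustrative and depend on your markdown_document):
# Inspect the header metadata attached to each chunk
for doc in md_header_splits:
    print(doc.metadata)   # e.g. {'Header 1': 'Introduction', 'Header 2': 'Background'}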
Code Text Splitter
Only for paying subscribers
For the full issue you can check here:
Token Text Splitter
Multiple classes available
Splits on: Tokens
Adds metadata: No
Description: Splits text based on token count, with various methods to measure tokens.
Here we will use the tiktoken-based splitter; however, there are many other tokenization methods you can use.
from langchain_text_splitters import TokenTextSplitter

# chunk_size and chunk_overlap are measured in tokens rather than characters
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

# split_text expects a single string; use create_documents(docs) for a list of texts
chunks = text_splitter.split_text(text)
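One such alternative, sketched below, is to keep splitting on characters while measuring chunk size in tiktoken tokens via the from_tiktoken_encoder constructor (the encoding name and sizes here are illustrative):
from langchain_text_splitters import CharacterTextSplitter

# Split on a character separator but measure chunk size in tiktoken tokens
# (requires the tiktoken package; encoding name and sizes are illustrative)
token_aware_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=100,
    chunk_overlap=0,
)
chunks = token_aware_splitter.split_text(text)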
Character Text Splitter
Class: CharacterTextSplitter
Splits on: User-defined character
Adds metadata: No
Description: A simple method that splits text based on a specific character.
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",        # split on blank lines (paragraph breaks)
    chunk_size=1000,         # maximum characters per chunk
    chunk_overlap=200,       # characters shared between consecutive chunks
    length_function=len,
    is_separator_regex=False,
)
chunks = text_splitter.create_documents(docs)
Semantic Chunker
Only for paying subscribers
For the full issue you can check here:
AI21 Semantic Text Splitter
Only for paying subscribers
Choosing the Right Text Splitter
When selecting a text splitter for your RAG system, consider the following factors:
Document structure: Choose a splitter that aligns with your document's format (e.g., HTML, Markdown, code).
Semantic coherence: Opt for splitters that maintain the semantic relationship between chunks.
Metadata requirements: If you need additional context about chunk origins, choose splitters that add metadata.
Language specificity: For code documents, use language-specific splitters.
Token limits: Consider your model's token limits when setting chunk sizes.
Overlap: When maintaining context between chunks is critical, configure the overlap settings accordingly.
What is Overlapping?
Overlapping in text chunking refers to the practice of including a portion of the previous chunk's content at the beginning of the next chunk. This creates an "overlap" of information between adjacent chunks; a short sketch after the list below shows what it looks like in practice.
Why is this interesting?
Maintains Context
Preserves semantic continuity between chunks
Helps in understanding content that spans chunk boundaries
Improves Retrieval Accuracy
Increases the chances of retrieving relevant information that might be split across chunks
Enhances Language Model Performance
Provides more context for language models when generating responses
Handles Cross-References
Helps in situations where information in one part of the text refers to another
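A minimal sketch of overlap in practice, using the Character Text Splitter from above (the sample text and sizes are arbitrary):
from langchain_text_splitters import CharacterTextSplitter

# Small, illustrative example: with chunk_overlap > 0, the end of one chunk
# is repeated at the beginning of the next one.
sample_text = "RAG systems retrieve relevant chunks and pass them to the model as extra context."
splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=40,
    chunk_overlap=20,
    length_function=len,
)
for chunk in splitter.split_text(sample_text):
    print(repr(chunk))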
Why Does Metadata Matter in RAG?
Metadata is additional information that describes or gives context to the primary data (in this case, the text chunks). It provides supplementary details about the content, source, structure, or other attributes of the data. A short sketch after the list below shows one way this can be put to use at retrieval time.
Source Tracking
Helps identify the original document or source of each chunk
Crucial for attribution and fact-checking
Context Preservation
Provides additional context that might be lost in chunking
Can include information about document structure, headers, or section titles
Relevance Assessment
Aids in determining the relevance of chunks to specific queries
Can include keywords, topics, or categories
Version Control
Tracks different versions or updates to documents
Ensures the most up-to-date information is used
Filtering and Sorting
Allows for more precise filtering of chunks based on various criteria
Enables sorting of results based on metadata attributes
Data Governance and Compliance
Tracks data ownership, access permissions, and usage rights
Crucial for maintaining compliance with data protection regulations
Enhanced Retrieval
Enables more sophisticated retrieval strategies beyond simple text matching
Can leverage metadata for semantic or category-based searches
User Experience
Provides additional information to users about the retrieved content
Can be used to display snippets, summaries, or other contextual information
Model Fine-tuning
Can be used as additional features for fine-tuning retrieval or language models
Helps models understand the structure and context of the data
Data Quality Management
Helps in assessing and maintaining the quality of the document collection
Can include information about data quality, completeness, or reliability
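As a small illustration of several of these points, metadata can be used to filter chunks before they are passed to the model as context. Here is a minimal sketch reusing the md_header_splits from the Markdown splitter above (the section title is hypothetical):
# Keep only chunks that belong to a given section, based on header metadata
relevant = [
    doc for doc in md_header_splits
    if doc.metadata.get("Header 1") == "Introduction"  # hypothetical section title
]

# Join the selected chunks to form the context passed to the language model
context = "\n\n".join(doc.page_content for doc in relevant)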
Conclusion
Effective chunking and text splitting are fundamental to building high-performance RAG systems. By choosing the right text splitter and fine-tuning its parameters, you can significantly improve the retrieval precision and generation quality of your RAG pipeline. As you develop your system, experiment with different splitting strategies to find the optimal approach for your specific use case.
🚀CleverControl: AI-Powered Insights for Enhanced Productivity & Security
This MLPills issue has been sponsored by CleverControl. If you prefer not to see these ads, which support my work, please consider becoming a paid subscriber:
As a blogger who values innovation, I'm impressed by how CleverControl uses AI to power employee monitoring software.
Key features:
AI-powered insights: Real-time data analysis for improved performance and productivity.
Personalized recommendations: Tailored guidance for each employee to optimize workflow and reach full potential.
Comprehensive reports: Detailed reports on employee activity, including website usage, keyboard activity, and idle time.
Enhanced security: Robust features like data encryption, access controls, and real-time alerts to protect sensitive information.
Live viewing and screen recording: Monitor employee activity in real-time or review recordings later for dispute resolution and feedback.
Face recognition: Ensures only authorized employees access company devices and systems.
Workflow tracking: Helps employees track their own workflow, identify areas for improvement, and stay on top of tasks.
CleverControl empowers businesses to improve productivity, enhance security, and empower employees to reach their full potential.
👉 Get more information here.
🤖 Tech Round-Up
No time to check the news this week?
This week's TechRoundUp comes full of AI news. From Andrew Ng's new AI fund to Gmail's new Gemini feature.
Let's dive into the latest tech highlights you probably shouldn't miss this week 💥
1️⃣ Andrew Ng's New AI Fund
Andrew Ng is raising $120M for his next AI fund! This initiative aims to support startups tackling significant AI challenges. It's a big move for the AI community.
2️⃣ Meta's AI Chatbots on Instagram
Meta is testing user-created AI chatbots on Instagram. Users can now customize chatbots to interact with followers, enhancing engagement and user experience.
3️⃣ ChatGPT for Mac
Mac users, rejoice! ChatGPT is now available on Mac, bringing seamless AI interaction to your desktop. Enhance your productivity with this powerful tool.
4️⃣ Reddit's Protections Against AI Crawlers
Reddit is implementing changes to protect against AI crawlers. These updates are crucial for preserving content integrity and safeguarding user data.
5️⃣ Google's Gemini AI in Gmail
Google introduces Gemini AI to Gmail with a sidebar feature that helps write and summarize emails. Boost your productivity with AI-assisted email management.
For the full issue you can check here:
You could have received this in full last Saturday! Subscribing also helps me carry on with this project and allows me to bring you the best possible content.
🧑🎓Are you a student?
Contact me to receive a special offer: david@mlpills.dev