Issue #68 - Split different text examples for RAG

Aug 04, 2024

💊 Pill of the week

Splitting large documents into chunks is essential when working with LLMs and RAG. Today we will present some splitting techniques and show what their output looks like.

We will share how to split text for four different types of text:

Plain text
HTML text
Markdown text
Code text

For each type we will show several techniques and the first five chunks each one produces from an example text. In this issue we cover only the first and most common type: plain text.

We introduced the different techniques in this issue:

Issue #64 - Text chunking for RAG systems

David Andrés and Josep Ferrer

June 29, 2024

Plain text

The most typical type of text you’ll need to deal with is plain text. In this section we will work with the following text about the history of AI:

Document Title: The History of Artificial Intelligence

Introduction
Artificial Intelligence (AI) has a rich and fascinating history spanning several decades. From its early conceptual stages to the advanced systems we see today, AI has continuously evolved and reshaped our understanding of machine capabilities.

Early Beginnings
The term "Artificial Intelligence" was first coined in 1956 at the Dartmouth Conference. However, the concept of intelligent machines dates back much further. In the 1940s, scientists like Alan Turing began exploring the possibility of creating machines that could think.

Key Milestones
1. 1950: Alan Turing proposes the Turing Test
2. 1956: Dartmouth Conference marks the birth of AI as a field
3. 1997: IBM's Deep Blue defeats world chess champion Garry Kasparov
4. 2011: IBM Watson wins Jeopardy! against human champions
5. 2016: Google's AlphaGo defeats world champion Go player Lee Sedol

Modern Developments
Today, AI is ubiquitous, powering everything from smartphone assistants to autonomous vehicles. Machine learning and deep learning have opened new frontiers, enabling systems to learn from vast amounts of data and make complex decisions.

Conclusion
As we look to the future, AI continues to promise revolutionary advancements across various fields, from healthcare to space exploration. The journey of AI is far from over, and its potential remains as exciting as ever.

When working with plain text documents, it's often necessary to break them down into smaller, manageable chunks for processing or analysis. This section demonstrates various techniques for splitting plain text using different methods from the LangChain library.

Here is a summary of each technique:

Recursive Text Splitter

The Recursive Text Splitter is designed to split text into chunks while respecting the hierarchical structure of the document. It's particularly useful for maintaining context and coherence in the resulting chunks.

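This splitter works through an ordered list of separators: by default paragraph breaks ("\n\n"), then line breaks ("\n"), then spaces, and finally individual characters. It splits at the most significant boundary first and moves to the next finer separator only for pieces that are still larger than the chunk size, so each chunk keeps as much meaningful context as possible.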

# Recent LangChain releases also expose this class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "Document Title: The History of Artificial Intelligence\n\nIntroduction\nArtificial Intelligence (AI) has a rich..."  # (full text here)

splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(text)

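This code uses the RecursiveCharacterTextSplitter to break the text into chunks of at most 100 characters (chunk_size is an upper bound, not a target), with up to 20 characters of overlap between consecutive chunks. Note that some chunks in the sample output below run to roughly 190 characters, so that listing appears to have been produced with larger settings than the ones shown here.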
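The output listings in this issue all follow the same layout: a "CHUNK # n:" label, the quoted chunk, and a dashed rule. A small helper along these lines would produce it. This is only a sketch; format_chunks and its parameters are hypothetical names of ours, not part of LangChain.

```python
def format_chunks(chunks, limit=5, rule_width=40):
    """Return the first `limit` chunks laid out like the listings in this issue.

    `format_chunks`, `limit` and `rule_width` are illustrative names,
    not LangChain APIs.
    """
    rule = "-" * rule_width
    # One labelled, quoted block per chunk.
    blocks = [f"CHUNK # {i}:\n'{chunk}'" for i, chunk in enumerate(chunks[:limit])]
    # Separate consecutive blocks with a dashed rule and a blank line.
    return f"\n{rule}\n\n".join(blocks)


sample_chunks = ["Document Title", "Introduction", "Early Beginnings"]
print(format_chunks(sample_chunks, limit=2))
```

In practice you would pass it the list returned by splitter.split_text(text).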

Output:

CHUNK # 0:
'Document Title: The History of Artificial Intelligence

Introduction'
----------------------------------------

CHUNK # 1:
'Artificial Intelligence (AI) has a rich and fascinating history spanning several decades. From its early conceptual stages to the advanced systems we see today, AI has continuously evolved and'
----------------------------------------

CHUNK # 2:
'we see today, AI has continuously evolved and reshaped our understanding of machine capabilities.'
----------------------------------------

CHUNK # 3:
'
Early Beginnings'
----------------------------------------

CHUNK # 4:
'The term "Artificial Intelligence" was first coined in 1956 at the Dartmouth Conference. However, the concept of intelligent machines dates back much further. In the 1940s, scientists like Alan'

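The output shows that the splitter respects paragraph boundaries and tries to keep related content together. For example, the first chunk contains the document title and the "Introduction" header. When a paragraph is too long to fit in one chunk, the splitter breaks it at a word boundary, and the overlap repeats the tail of the previous chunk: chunk 2 starts with "we see today, AI has continuously evolved and", the words that end chunk 1.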
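To make the recursion concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: it ignores chunk_overlap, it drops the separator at each split point, and recursive_split is a hypothetical name we use only for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into chunks of at most `chunk_size` characters.

    Simplified illustration (no overlap): try the coarsest separator first
    and fall back to finer ones only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


sample = "Title\n\nIntro\nFirst sentence here. Second sentence here.\n\nNext"
for chunk in recursive_split(sample, chunk_size=25):
    print(repr(chunk))
```

In this sample the middle paragraph is too long for 25 characters, so it is split at the line break and then at spaces, while the short paragraphs around it stay intact.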

Character Text Splitter

The Character Text Splitter is the simplest method: it splits on a single separator and measures chunk size in characters. It's useful when one separator, such as a line break, already marks the units you care about and you don't need to preserve deeper document structure.

This splitter divides the text at the specified separator (e.g., newline characters) and then merges adjacent segments until adding another one would exceed the chunk size. It never cuts inside a segment, so a segment longer than the chunk size is kept whole.

from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(separator="\n", chunk_size=150, chunk_overlap=30)
chunks = splitter.split_text(text)

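This code uses the CharacterTextSplitter, which splits the text on a specified separator (in this case, newline characters) and aims for chunks of up to 150 characters with up to 30 characters of overlap. Because a single long line is never cut, a paragraph longer than 150 characters ends up as one oversized chunk, as the long paragraphs in the output below do.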

Output:

CHUNK # 0:
'Document Title: The History of Artificial Intelligence

Introduction'
----------------------------------------

CHUNK # 1:
'Artificial Intelligence (AI) has a rich and fascinating history spanning several decades. From its early conceptual stages to the advanced systems we see today, AI has continuously evolved and reshaped our understanding of machine capabilities.'
----------------------------------------

CHUNK # 2:
'
Early Beginnings'
----------------------------------------

CHUNK # 3:
'The term "Artificial Intelligence" was first coined in 1956 at the Dartmouth Conference. However, the concept of intelligent machines dates back much further. In the 1940s, scientists like Alan Turing began exploring the possibility of creating machines that could think.'
----------------------------------------

CHUNK # 4:
'
Key Milestones
1. 1950: Alan Turing proposes the Turing Test
2. 1956: Dartmouth Conference marks the birth of AI as a field
3. 1997: IBM's Deep Blue defeats world chess champion Garry Kasparov'

The output demonstrates that this splitter is more sensitive to the newline separators, often creating chunks that align with paragraph boundaries. This can be useful when working with structured text where paragraphs convey distinct ideas.
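The split-then-merge behaviour can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not LangChain's code: it ignores chunk_overlap, and split_on_separator is a hypothetical name.

```python
def split_on_separator(text, separator="\n", chunk_size=150):
    """Split on `separator`, then greedily merge neighbours up to `chunk_size`.

    Simplified illustration (no overlap). A single segment longer than
    `chunk_size` is kept whole, so a chunk can exceed the limit.
    """
    segments = [s for s in text.split(separator) if s != ""]
    chunks, current = [], ""
    for segment in segments:
        candidate = segment if not current else current + separator + segment
        if current and len(candidate) > chunk_size:
            # Adding this segment would overflow: close the current chunk.
            chunks.append(current)
            current = segment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Heading\nshort line\n" + "x" * 30 + "\nend"
for chunk in split_on_separator(sample, chunk_size=20):
    print(len(chunk), repr(chunk))
```

The 30-character line comes back as a single chunk even though chunk_size is 20, which is the same effect that produces the long paragraph chunks above.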

Token Text Splitter

The Token Text Splitter is designed to split text based on token count rather than character count. This is particularly useful when working with models that have token limits, ensuring that each chunk fits within those limits.

This splitter tokenizes the text with the tiktoken library (the gpt2 encoding unless you specify another), slices the token sequence into windows of at most the specified number of tokens, and decodes each window back into text.

from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=10)
chunks = splitter.split_text(text)

Here chunk_size=50 and chunk_overlap=10 are counted in tokens rather than characters: each chunk holds at most 50 tokens and shares up to 10 tokens with the previous one.

Output:

CHUNK # 0:
'Document Title: The History of Artificial Intelligence

Introduction
Artificial Intelligence (AI) has a rich and fascinating history spanning several decades. From its early conceptual stages to the advanced systems we see today, AI has continuously evolved and reshaped our'
----------------------------------------

CHUNK # 1:
' today, AI has continuously evolved and reshaped our understanding of machine capabilities.

Early Beginnings
The term "Artificial Intelligence" was first coined in 1956 at the Dartmouth Conference. However, the concept of intelligent machines dates back much further'
----------------------------------------

CHUNK # 2:
', the concept of intelligent machines dates back much further. In the 1940s, scientists like Alan Turing began exploring the possibility of creating machines that could think.

Key Milestones
1. 1950: Alan Turing proposes the Turing Test
2'
----------------------------------------

CHUNK # 3:
' 1950: Alan Turing proposes the Turing Test
2. 1956: Dartmouth Conference marks the birth of AI as a field
3. 1997: IBM's Deep Blue defeats world chess champion Garry Kasparov
4. 2011: IBM Watson wins Jeopard'
----------------------------------------

CHUNK # 4:
'
4. 2011: IBM Watson wins Jeopardy! against human champions
5. 2016: Google's AlphaGo defeats world champion Go player Lee Sedol

Modern Developments
Today, AI is ubiquitous, powering everything from smartphone assistants'

The output shows that chunks are cut at token boundaries, so splits often fall mid-sentence or even mid-word (chunk 3 ends at "Jeopard"). The trade-off is worthwhile when working with models that have strict token limits, because each chunk is guaranteed to stay within the limit, provided the splitter's tokenizer matches the one your model uses.
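The windowing logic itself is short. The sketch below uses a toy regex tokenizer as a stand-in for a real model tokenizer, so its token counts will not match tiktoken's; toy_tokenize and split_by_tokens are hypothetical names used only for illustration.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer: words, punctuation marks and whitespace runs.

    A real splitter would use the model's tokenizer instead.
    """
    return re.findall(r"\w+|[^\w\s]|\s+", text)


def split_by_tokens(text, chunk_size=50, chunk_overlap=10, tokenize=toy_tokenize):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        # Each window repeats the last `chunk_overlap` tokens of the previous one.
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


for chunk in split_by_tokens("one two three four five six", chunk_size=4, chunk_overlap=1):
    print(repr(chunk))
```

Because the window only counts tokens, it cuts wherever the count runs out, which is why the chunks above start and end in the middle of sentences.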

Semantic Chunker
