Retrieval Augmented Generation (RAG)

Recursive Chunking

Recursive chunking segments text hierarchically. It tries to split at the coarsest boundaries first (chapters, sections) and moves to progressively finer separators (paragraphs, sentences) only where a piece is still too large.

Core Mechanism: The algorithm works through a prioritized list of separators (e.g., section breaks, then paragraphs, then sentences). It splits on the first separator in the list; any resulting piece that still exceeds the maximum chunk size is split again, recursively, using the next separator in the hierarchy.
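To make the mechanism concrete, here is a minimal pure-Python sketch of the idea. It is an illustration under simplifying assumptions, not the implementation used by any particular library: the function name `recursive_split` and its parameters are hypothetical, sizes are measured in characters, there is no overlap between chunks, and a separator that falls on a chunk boundary is dropped.

```python
from typing import List


def recursive_split(text: str, separators: List[str], max_size: int) -> List[str]:
    """Return pieces of `text`, each at most `max_size` characters long.

    `separators` is ordered from coarsest to finest. Hypothetical helper,
    written only to illustrate the recursion.
    """
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to fixed-width slices.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        # This level does not occur in the text; try the next finer one.
        return recursive_split(text, finer, max_size)

    chunks: List[str] = []
    buffer = ""
    for part in text.split(sep):
        # Greedily merge neighbouring parts while they still fit.
        candidate = part if not buffer else buffer + sep + part
        if len(candidate) <= max_size:
            buffer = candidate
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(part) <= max_size:
            buffer = part
        else:
            # The part alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(part, finer, max_size))
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph is longer. It has two sentences.\n\nEnd."
for chunk in recursive_split(sample, ["\n\n", ". ", " "], max_size=30):
    print(repr(chunk))
```

With a 30-character limit, the first and last paragraphs fit as they are, so they are only ever split at the paragraph level. The middle paragraph is too long and is the only one that gets broken down further, at the sentence level.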

Advantages:

Structure Awareness: Respects document hierarchy and logical organization
Adaptive Granularity: Uses the most appropriate level of splitting for each section
Balance: Maintains chunk size constraints while preserving as much context as possible

Disadvantages:

Implementation Complexity: More sophisticated logic than simpler approaches
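Separator Dependency: Effectiveness depends on the document actually containing the chosen separators; poorly structured text falls through to the finest separators and loses the benefit
Processing Overhead: Oversized pieces are re-scanned at each separator level, so parts of the text are processed more than once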

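Implementation Example:

# Newer LangChain releases also expose this class from the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sample input for illustration; replace with the text you want to chunk
document = "## Title\n\nFirst paragraph.\n\nSecond paragraph."

# Recursive splitter with hierarchical separators
text_splitter = RecursiveCharacterTextSplitter(
    # Ordered from coarsest to finest: Markdown headings, paragraphs, lines,
    # sentence endings, commas, spaces, and finally individual characters
    separators=["## ", "\n\n", "\n", ". ", "! ", "? ", ",", " ", ""],
    chunk_size=1000,   # maximum chunk length (characters by default)
    chunk_overlap=200  # characters shared between consecutive chunks
)

# Returns a list of strings, each within the configured chunk size
chunks = text_splitter.split_text(document)

Best Use Cases: Recursive chunking is ideal for complex, structured documents such as technical documentation, research papers, and books with clear hierarchical organization. It is particularly effective when document structure varies throughout the content, so that different sections call for different levels of segmentation.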