Paragraph-Based Segmentation

Paragraph-based segmentation uses natural paragraph breaks as chunk boundaries, respecting the author's original organization of ideas. This approach aligns with how humans structure information, typically grouping related thoughts within a single paragraph.

Core Mechanism: The text is split at paragraph boundaries (usually identified by double line breaks or other formatting indicators). Paragraphs can be kept as individual chunks or combined until they approach a maximum size threshold.
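The mechanism described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `split_paragraphs` and `pack_paragraphs` and the `max_chars` parameter are hypothetical, and it assumes paragraphs are separated by blank lines and that size is measured in characters.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text at blank lines (empty or whitespace-only lines)."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


def pack_paragraphs(paragraphs: list[str], max_chars: int = 1500) -> list[str]:
    """Greedily combine consecutive paragraphs while staying within max_chars.

    A single paragraph longer than max_chars is kept whole as its own
    (oversized) chunk; this sketch never splits inside a paragraph.
    """
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First idea.\n\nSecond idea, closely related.\n\nA new topic starts here."
    for chunk in pack_paragraphs(split_paragraphs(sample), max_chars=45):
        print(repr(chunk))
```

Passing each paragraph through unchanged (skipping `pack_paragraphs`) gives the "individual chunks" variant; packing trades some boundary precision for more uniform chunk sizes.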

Advantages:

  • Content Coherence: Preserves logically related content as intended by the author
  • Natural Boundaries: Uses existing document structure rather than imposing arbitrary divisions
  • Implementation Simplicity: Relatively straightforward to detect paragraph breaks in most formatted text

Disadvantages:

  • Variable Chunk Sizes: Can produce very short or very long chunks depending on document formatting
  • Format Dependency: Requires reliable paragraph markers in the source document
  • Inconsistent Length: Extremely short paragraphs (a single sentence or a heading) carry little context, which can yield low-quality embeddings
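Two of these drawbacks can be softened with light preprocessing. The sketch below is one possible approach under stated assumptions: `normalize_breaks` and `merge_short` are hypothetical helper names, the `min_chars` threshold is an arbitrary illustrative value, and the input is assumed to use blank lines (possibly with Windows line endings or stray whitespace) as paragraph markers.

```python
import re


def normalize_breaks(text: str) -> str:
    """Make paragraph markers consistent: exactly one blank line between paragraphs."""
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing whitespace so whitespace-only lines become empty lines.
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse runs of three or more newlines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", text)


def merge_short(paragraphs: list[str], min_chars: int = 200) -> list[str]:
    """Attach paragraphs shorter than min_chars to the paragraph that follows.

    A short trailing paragraph with nothing after it is appended to the
    previous chunk instead.
    """
    merged: list[str] = []
    carry = ""
    for para in paragraphs:
        current = f"{carry}\n\n{para}" if carry else para
        if len(current) < min_chars:
            carry = current  # still too short, keep accumulating
        else:
            merged.append(current)
            carry = ""
    if carry:
        if merged:
            merged[-1] = f"{merged[-1]}\n\n{carry}"
        else:
            merged.append(carry)
    return merged
```

Merging forward keeps a heading or one-line lead-in together with the paragraph it introduces, which is usually the context it depends on.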

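Implementation Example:

# Import path used by older LangChain releases; newer releases expose the
# same class from the langchain_text_splitters package.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Placeholder text; substitute the document you want to chunk.
document = "First paragraph.\n\nSecond paragraph."

# Paragraph-respecting splitter
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"],  # Try paragraph breaks first, then single line breaks
    chunk_size=1500,            # Maximum chunk size, in characters by default
    chunk_overlap=150           # Characters shared between consecutive chunks
)

chunks = text_splitter.split_text(document)  # Returns a list of strings

Best Use Cases: Paragraph-based segmentation suits well-structured documents such as articles, blog posts, and reports, where each paragraph contains a discrete idea. It is particularly effective for content whose paragraph boundaries meaningfully separate different concepts or topics.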