Embedding-Based Clustering

Embedding-based clustering segments text by analyzing semantic similarity patterns within the content. This data-driven approach groups related content based on meaning rather than structural features:

Core Mechanism: The document is first split into small units (sentences or paragraphs), which are then embedded into vector space. Clustering algorithms identify groups of semantically similar segments, which are combined to form coherent chunks up to a maximum size.

Advantages:

  • Semantic Coherence: Groups content based on meaning rather than structural boundaries
  • Adaptive to Content: Naturally identifies topic clusters regardless of formatting
  • Conceptual Organization: Creates chunks that align with actual information relationships

Disadvantages:

  • Computational Intensity: Requires embedding generation and clustering algorithms
  • Parameter Sensitivity: Results depend heavily on clustering parameters and embedding quality
  • Unpredictable Chunk Sizes: May create imbalanced chunks based on topic distribution

Example Implementation:

from langchain_community.embeddings import HuggingFaceEmbeddings
from sklearn.cluster import AgglomerativeClustering
import numpy as np

def embedding_based_chunking(document, max_chunk_size=1000):
    # Split into initial small units (naive sentence split; a proper
    # sentence tokenizer is preferable in practice)
    sentences = [s if s.endswith('.') else s + '.'
                 for s in document.split('. ') if s]
    if len(sentences) < 2:
        return [document]  # clustering needs at least two samples

    # Generate one embedding vector per sentence
    embeddings = HuggingFaceEmbeddings()
    vectors = embeddings.embed_documents(sentences)

    # Cluster semantically similar sentences; cosine distance with
    # average linkage suits normalized text embeddings
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.3,
        metric='cosine', linkage='average')
    clusters = clustering.fit_predict(np.array(vectors))

    # Group each cluster's sentences, then pack clusters into chunks
    # that respect the maximum size
    chunks = []
    current_chunk = []
    current_size = 0

    for cluster_id in range(max(clusters) + 1):
        cluster_sentences = [s for s, c in zip(sentences, clusters) if c == cluster_id]
        cluster_text = ' '.join(cluster_sentences)

        if current_size + len(cluster_text) <= max_chunk_size:
            current_chunk.append(cluster_text)
            current_size += len(cluster_text)
        else:
            if current_chunk:  # avoid emitting an empty leading chunk
                chunks.append(' '.join(current_chunk))
            current_chunk = [cluster_text]
            current_size = len(cluster_text)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks
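
The size-capping loop at the end of the function can be exercised in isolation. `pack_clusters` below is a hypothetical helper written for illustration, not part of any library; it applies the same greedy packing rule to pre-joined cluster texts:

```python
def pack_clusters(cluster_texts, max_chunk_size=1000):
    # Greedily pack cluster texts into chunks no larger than max_chunk_size,
    # starting a new chunk whenever the next text would overflow the current one
    chunks, current, size = [], [], 0
    for text in cluster_texts:
        if size + len(text) <= max_chunk_size:
            current.append(text)
            size += len(text)
        else:
            if current:
                chunks.append(' '.join(current))
            current, size = [text], len(text)
    if current:
        chunks.append(' '.join(current))
    return chunks

chunks = pack_clusters(['a' * 40, 'b' * 40, 'c' * 40], max_chunk_size=90)
print(len(chunks))  # 2: the first two texts fit together, the third overflows
```

Note that an individual cluster larger than `max_chunk_size` still becomes a single oversized chunk; a production version would need to split such clusters further.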

Best Use Cases: Embedding-based clustering excels for documents with diverse topics, research papers covering multiple concepts, and content where semantic relationships aren't clearly indicated by structure. It's particularly valuable for knowledge bases, encyclopedic content, and technical documentation where information relationships are complex.