Embedding-Based Clustering
Embedding-based clustering segments text by analyzing semantic similarity patterns within the content. This data-driven approach groups related content based on meaning rather than structural features:
Core Mechanism: The document is first split into small units (sentences or paragraphs), which are then embedded into a vector space. A clustering algorithm identifies groups of semantically similar units, which are merged into coherent chunks up to a maximum size, as the implementation sketch below demonstrates.
Advantages:
- Semantic Coherence: Groups content based on meaning rather than structural boundaries
- Adaptive to Content: Naturally identifies topic clusters regardless of formatting
- Conceptual Organization: Creates chunks that align with actual information relationships
Disadvantages:
- Computational Intensity: Requires embedding generation and clustering algorithms
- Parameter Sensitivity: Results depend heavily on clustering parameters and embedding quality (see the sketch after this list)
- Unpredictable Chunk Sizes: May create imbalanced chunks based on topic distribution
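The parameter sensitivity is easy to see in isolation. The following sketch is illustrative only: the sentences and thresholds are invented, and it assumes scikit-learn plus a sentence-transformers backend for HuggingFaceEmbeddings are installed. It embeds a handful of sentences and shows how the number of clusters found by AgglomerativeClustering swings with the distance threshold:

from langchain_community.embeddings import HuggingFaceEmbeddings
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# Invented sentences covering two loose topics (weather, databases)
sentences = [
    'It rained heavily all morning.',
    'The forecast predicts more storms tomorrow.',
    'PostgreSQL supports JSONB columns.',
    'Indexes speed up database queries.',
]
vectors = np.array(HuggingFaceEmbeddings().embed_documents(sentences))

# The same data segments very differently as the threshold moves
for threshold in (0.2, 0.4, 0.8):
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=threshold,
        metric='cosine',
        linkage='average',
    ).fit_predict(vectors)
    print(f'threshold={threshold}: {labels.max() + 1} clusters -> {labels.tolist()}')

The full chunking function below builds on the same embed-then-cluster step and merges the resulting clusters into size-bounded chunks: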
from langchain_community.embeddings import HuggingFaceEmbeddings
from sklearn.cluster import AgglomerativeClustering
import numpy as np

def embedding_based_chunking(document, max_chunk_size=1000):
    # Split into initial small units (e.g., sentences); the naive period split
    # keeps the example short, but a real sentence tokenizer is preferable
    sentences = [s.strip() for s in document.split('. ') if s.strip()]
    sentences = [s if s.endswith('.') else s + '.' for s in sentences]
    if len(sentences) < 2:
        return sentences  # nothing to cluster

    # Generate one embedding vector per sentence
    embeddings = HuggingFaceEmbeddings()
    vectors = np.array(embeddings.embed_documents(sentences))

    # Cluster similar sentences; cosine distance with average linkage keeps the
    # 0.3 threshold on a 0-1 scale (the default ward/Euclidean linkage would
    # require a very different threshold)
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.3,
        metric='cosine',
        linkage='average',
    )
    clusters = clustering.fit_predict(vectors)

    # Group sentences by cluster while respecting the maximum chunk size
    chunks = []
    current_chunk = []
    current_size = 0
    for cluster_id in range(clusters.max() + 1):
        cluster_sentences = [s for s, c in zip(sentences, clusters) if c == cluster_id]
        cluster_text = ' '.join(cluster_sentences)
        if current_size + len(cluster_text) <= max_chunk_size:
            current_chunk.append(cluster_text)
            current_size += len(cluster_text)
        else:
            if current_chunk:  # avoid emitting an empty chunk
                chunks.append(' '.join(current_chunk))
            current_chunk = [cluster_text]
            current_size = len(cluster_text)
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    return chunks
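A quick usage sketch, assuming the function above is in scope (the sample document is invented, and the exact chunk boundaries depend on the embedding model and threshold):

# Invented multi-topic document; real inputs would be much longer
document = (
    'The Amazon rainforest hosts millions of species. '
    'Deforestation threatens its biodiversity. '
    'Transformer models rely on self-attention. '
    'Attention weights relate each token to every other token. '
    'Rainfall patterns in the Amazon affect the global climate.'
)

chunks = embedding_based_chunking(document, max_chunk_size=300)
for i, chunk in enumerate(chunks):
    print(f'--- chunk {i} ({len(chunk)} chars) ---')
    print(chunk)

Because sentences are grouped per cluster, the Amazon-related sentences can land in the same chunk even though the transformer sentences sit between them in the source, which is precisely the meaning-over-position behavior described above.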
Best Use Cases: Embedding-based clustering excels for documents with diverse topics, research papers covering multiple concepts, and content where semantic relationships aren't clearly indicated by structure. It's particularly valuable for knowledge bases, encyclopedic content, and technical documentation where information relationships are complex.