Tokenization transforms raw text into discrete units (tokens) that machines can process. It's the first step in teaching computers to read language.

For the sentence "She couldn't believe it was only $9.99!", a word-level tokenizer might produce ["She", "couldn't", "believe", "it", "was", "only", "$9.99", "!"]. Modern systems more often use subword tokenization, breaking words into smaller meaningful fragments, so "couldn't" might be split into pieces such as ["could", "n't"].
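As a concrete illustration, here is a minimal word-level tokenizer in Python; the regular expression is a hand-tuned illustrative choice that reproduces the example above, not a general-purpose rule.

```python
import re

def word_tokenize(text):
    """Naive word-level tokenizer: keeps prices and contractions intact,
    splits off other punctuation as separate tokens."""
    pattern = r"\$?\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]"
    return re.findall(pattern, text)

print(word_tokenize("She couldn't believe it was only $9.99!"))
# ['She', "couldn't", 'believe', 'it', 'was', 'only', '$9.99', '!']
```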

These choices have significant downstream consequences. Effective tokenization helps models handle new words, morphologically rich languages, and rare terms while keeping vocabulary sizes manageable. The evolution from simple word splitting to sophisticated subword algorithms such as BPE has been crucial for modern language models.

Subword tokenization algorithms have become the standard approach in modern NLP systems because they strike an effective balance between vocabulary size, semantic coherence, and the ability to handle unseen words. These methods adaptively decompose words into meaningful units while preserving important morphological relationships.

Byte-Pair Encoding (BPE), originally a data compression algorithm, was adapted for NLP by Sennrich et al. (2016). BPE starts with a character-level vocabulary and iteratively merges the most frequent adjacent pairs until reaching a target vocabulary size. This creates a vocabulary of common words and subword units, with frequent words preserved as single tokens while rare words decompose into meaningful subunits.
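The core training loop is short enough to sketch. The toy corpus below (word frequencies plus an end-of-word marker `</w>`) follows the style of the example in Sennrich et al.; the fixed number of merges stands in for a target vocabulary size, and this is a didactic sketch rather than a production implementation.

```python
import re
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Merge every occurrence of the pair into a single symbol in all words."""
    pattern = re.compile(r"(?<!\S)" + re.escape(pair[0]) + r" " + re.escape(pair[1]) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in corpus.items()}

# Toy corpus: each word is stored as space-separated characters plus an end-of-word marker.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):                        # number of merges stands in for the target vocabulary size
    pairs = get_pair_counts(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)       # most frequent adjacent pair
    corpus = merge_pair(best, corpus)
    merges.append(best)

print(merges)
# first merges: ('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ...
```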

WordPiece, developed at Google and later popularized by BERT, follows a similar iterative merging strategy but uses a likelihood-based criterion rather than raw frequency: it prioritizes merges that most increase the likelihood of the training data, producing slightly different segmentations than BPE. WordPiece adds a '##' prefix to subword units that don't start a word, helping the model distinguish word-initial pieces from word-internal ones.
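Training aside, applying a trained WordPiece vocabulary is a greedy longest-match-first search over each word, as in BERT's tokenizer. The sketch below uses a small hypothetical vocabulary to show how continuation pieces pick up the '##' prefix; it does not implement the likelihood-based training itself.

```python
def wordpiece_segment(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation with a trained WordPiece vocabulary.
    Pieces that do not start the word carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches: the whole word maps to [UNK]
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only.
vocab = {"could", "##n", "##'", "##t", "un", "##believ", "##able"}
print(wordpiece_segment("couldn't", vocab))      # ['could', '##n', "##'", '##t']
print(wordpiece_segment("unbelievable", vocab))  # ['un', '##believ', '##able']
```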

Unigram Language Model, used in SentencePiece, takes a probabilistic approach. It starts with a large vocabulary and iteratively removes tokens that contribute least to the corpus likelihood until reaching the target size. This creates segmentations that optimize the probability of the training corpus under a unigram language model.
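The sketch below shows only the segmentation side of the unigram approach: given hypothetical per-token log-probabilities, a Viterbi dynamic program recovers the highest-scoring segmentation. The iterative vocabulary pruning performed during training is not shown.

```python
import math

def unigram_segment(text, logprobs):
    """Viterbi search for the segmentation with the highest total log-probability
    under a unigram model (each token is scored independently)."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i] = best score for text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)           # back[i] = start index of the token ending at i
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    tokens, i = [], n              # walk back to recover the tokens
    while i > 0:
        j = back[i]
        tokens.append(text[j:i])
        i = j
    return tokens[::-1]

# Hypothetical token log-probabilities, for illustration only.
logprobs = {"un": -4.0, "believ": -6.0, "able": -5.0, "unbeliev": -12.0,
            "u": -8.0, "n": -7.0, "a": -7.5, "b": -8.0, "l": -8.0,
            "e": -7.0, "i": -8.0, "v": -9.0}
print(unigram_segment("unbelievable", logprobs))  # ['un', 'believ', 'able']
```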

SentencePiece implements both the BPE and Unigram methods with a critical design decision: it treats the input as a raw Unicode stream, encoding whitespace with a meta symbol (▁) instead of relying on language-specific preprocessing such as word segmentation. As a result, it works consistently across languages, including those without explicit word boundaries like Japanese and Chinese.
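A typical usage sketch with the `sentencepiece` Python package follows; the corpus file, model prefix, and parameter values are placeholders, and option details can vary between library versions.

```python
import sentencepiece as spm

# Train directly on raw text; no language-specific pre-tokenization is required.
# 'corpus.txt' and the parameter values are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",         # one sentence per line, raw Unicode
    model_prefix="tok",         # writes tok.model and tok.vocab
    vocab_size=8000,
    model_type="unigram",       # or "bpe"
    character_coverage=0.9995,  # useful for large character sets (e.g. Japanese, Chinese)
)

sp = spm.SentencePieceProcessor(model_file="tok.model")
pieces = sp.encode("吾輩は猫である。", out_type=str)
print(pieces)                   # whitespace is carried by the ▁ meta symbol, so decoding is lossless
print(sp.decode(pieces))
```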

These subword approaches have enabled breakthroughs in multilingual modeling and handling of morphologically rich languages. They dramatically reduce out-of-vocabulary issues while maintaining reasonable vocabulary sizes (typically 30,000-50,000 tokens), supporting efficient training and inference in modern language models.

Tokenization presents numerous challenges that significantly impact NLP system performance, particularly across languages and domains with different structural characteristics.

Language-Specific Issues: Languages without clear word boundaries (Chinese, Japanese, Thai) require specialized segmentation methods before or during tokenization. Morphologically rich languages (Finnish, Turkish, Hungarian) with extensive compounding and agglutination can generate extremely long words with complex internal structure that challenge typical tokenization approaches.

Domain-Specific Challenges: Technical vocabularies in fields like medicine, law, or computer science contain specialized terminology that may be inefficiently tokenized if not represented in training data. Social media text presents unique challenges with abbreviations, emojis, hashtags, and creative spelling that standard tokenizers struggle to handle appropriately.

Efficiency Tradeoffs: Larger vocabularies reduce token counts per text (improving efficiency) but increase model size and may worsen generalization to rare words. Smaller vocabularies produce longer token sequences but handle novel words better through compositional subword recombination.
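A toy comparison makes the tradeoff concrete: character-level tokens need only a tiny vocabulary but produce long sequences, while word-level tokens are compact but cannot compose unseen words from known pieces.

```python
sentence = "Tokenization determines how much text fits in a model's context window."

# Character-level scheme: tiny vocabulary, long sequences.
char_tokens = list(sentence)

# Word-level scheme: large vocabulary, short sequences, but any unseen word
# (e.g. a new product name) would fall outside it.
word_tokens = sentence.split()

print(len(char_tokens), "character tokens")  # 71 character tokens
print(len(word_tokens), "word tokens")       # 11 word tokens
```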

Consistency Issues: Inconsistent tokenization between pre-training and fine-tuning or between model components can degrade performance. This becomes particularly challenging in multilingual settings where different languages may require different tokenization strategies.

Information Loss: Tokenization inevitably loses some information about the original text, such as whitespace patterns, capitalization details, or special characters that get normalized. These details can be significant for tasks like code generation or formatting-sensitive applications.
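A deliberately lossy toy tokenizer makes the point concrete: lowercasing and whitespace splitting discard exactly the details that matter for code or formatting-sensitive output.

```python
def simple_tokenize(text):
    """Lowercase and split on whitespace: a deliberately lossy toy tokenizer."""
    return text.lower().split()

def detokenize(tokens):
    return " ".join(tokens)

original = "def add(x, y):\n    return  x + y   # NOTE: keep spacing"
roundtrip = detokenize(simple_tokenize(original))

print(repr(original))
print(repr(roundtrip))
# The newline, indentation, repeated spaces, and the capitalization of 'NOTE'
# are gone after the round trip.
```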

Modern NLP systems address these challenges through various strategies, including language-specific pre-tokenization, BPE dropout for robustness, vocabulary augmentation for domain adaptation, and byte-level approaches that can represent any Unicode string from a small fixed byte vocabulary, eliminating out-of-vocabulary tokens entirely.
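Byte-level coverage is easy to illustrate: every Unicode string maps onto a fixed 256-symbol byte alphabet, so no input is ever out of vocabulary. Production byte-level tokenizers (for example GPT-2's byte-level BPE) learn merges on top of this base alphabet; the sketch below shows only the base mapping.

```python
def byte_tokenize(text):
    """Byte-level tokenization: every string maps to IDs in a fixed 0-255 base
    vocabulary, so there is no out-of-vocabulary token for any Unicode input."""
    return list(text.encode("utf-8"))

def byte_detokenize(ids):
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("naïve 日本語 🙂")
print(ids[:8])                 # [110, 97, 195, 175, 118, 101, 32, 230]
print(byte_detokenize(ids))    # lossless round trip: 'naïve 日本語 🙂'
```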