Subword tokenization algorithms have become the standard approach in modern NLP systems by striking a practical balance between vocabulary size, semantic coherence, and the ability to handle unseen words. These methods adaptively decompose words into meaningful units while preserving important morphological relationships.

Byte-Pair Encoding (BPE), originally a data compression algorithm, was adapted for NLP by Sennrich et al. (2016). BPE starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent symbols until reaching a target vocabulary size. This creates a vocabulary of common words and subword units, with frequent words preserved as single tokens and rare words decomposed into meaningful subunits.
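As a rough illustration, the following sketch implements the core BPE training loop on a toy corpus. The word frequencies, the '</w>' end-of-word marker, and the number of merges are illustrative choices rather than part of any particular implementation.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, corpus):
    """Rewrite every word so that adjacent occurrences of `pair` become one symbol."""
    merged = {}
    for word, freq in corpus.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_word = " ".join(out)
        merged[new_word] = merged.get(new_word, 0) + freq
    return merged

# Toy corpus: each word is pre-split into characters plus an end-of-word marker,
# mapped to its frequency in a hypothetical training set.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for step in range(10):                      # target number of merges
    pair_counts = get_pair_counts(corpus)
    if not pair_counts:
        break
    best = max(pair_counts, key=pair_counts.get)
    corpus = merge_pair(best, corpus)
    print(f"merge {step + 1}: {best}")
```

Each learned merge becomes a vocabulary entry, and at inference time the same merges are replayed in order to segment new text.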

WordPiece, developed at Google and used in BERT, follows a similar iterative merging strategy but uses a likelihood-based criterion rather than simple frequency. It prioritizes merges that maximize the likelihood of the training data, creating slightly different segmentation patterns from BPE. WordPiece adds '##' prefixes to subword units that don't start words, helping the model distinguish between beginning and middle/end subwords.
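The likelihood-based criterion is often described as scoring each candidate merge by the pair's count divided by the product of its parts' counts, so a pair is favored only if it co-occurs more than its parts' individual frequencies would suggest. The sketch below computes that score on a toy corpus; the corpus, frequencies, and function names are illustrative, not taken from any WordPiece implementation.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score candidate merges as count(a, b) / (count(a) * count(b)),
    a common description of WordPiece's likelihood-based criterion."""
    pair_counts, symbol_counts = Counter(), Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for s in symbols:
            symbol_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return {(a, b): c / (symbol_counts[a] * symbol_counts[b])
            for (a, b), c in pair_counts.items()}

# Toy corpus: continuation symbols already carry the '##' prefix.
corpus = {"h ##u ##g": 10, "p ##u ##g": 5, "p ##u ##n": 12,
          "b ##u ##n": 4, "h ##u ##g ##s": 5}

scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
print("best merge:", best)   # ('##g', '##s') wins despite its lower raw count
```

On this toy data, pure frequency (as in BPE) would merge ('##u', '##g') first, whereas the likelihood-style score prefers ('##g', '##s'), whose parts rarely occur apart.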

Unigram Language Model, used in SentencePiece, takes a probabilistic approach. It starts with a large vocabulary and iteratively removes tokens that contribute least to the corpus likelihood until reaching the target size. This creates segmentations that optimize the probability of the training corpus under a unigram language model.
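Once token probabilities are estimated (in practice via an EM-style procedure alternating with vocabulary pruning), segmenting a string reduces to finding the split with the highest total probability, which can be done with Viterbi-style dynamic programming. The sketch below shows only that segmentation step, using a hypothetical vocabulary of log-probabilities.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the tokenization of `text` that maximizes the sum of unigram
    log-probabilities, via dynamic programming over split points."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best score for a prefix of length i
    back = [0] * (n + 1)           # where that prefix's last token starts
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    tokens, pos = [], n
    while pos > 0:                 # backtrack through the chosen split points
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]

# Hypothetical vocabulary with log-probabilities; single characters guarantee
# that every string stays segmentable.
vocab = {c: -6.0 for c in "unhappy"}
vocab.update({"un": -3.0, "happy": -5.0, "unhappy": -9.0})
print(viterbi_segment("unhappy", vocab))   # ['un', 'happy']: score -8.0 beats -9.0
```

During training, a token is a candidate for removal when dropping it barely changes the best segmentation scores across the corpus.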

SentencePiece implements both the BPE and Unigram methods with a critical design choice: it treats the input as a raw Unicode string, encoding whitespace as an ordinary symbol (the '▁' meta character) rather than requiring the text to be pre-tokenized into words. By avoiding language-specific preprocessing such as word segmentation, it works consistently across languages, including those without explicit word boundaries like Japanese and Chinese.
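As a usage sketch with the sentencepiece Python package, training and encoding might look roughly like the following; "corpus.txt", the model prefix, and the vocabulary size are placeholders, and exact keyword arguments can vary across package versions.

```python
import sentencepiece as spm

# Train directly on raw text, one sentence per line; no pre-tokenization needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # placeholder path to raw training text
    model_prefix="tokenizer",     # writes tokenizer.model and tokenizer.vocab
    vocab_size=8000,
    model_type="unigram",         # "bpe" is also supported
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
pieces = sp.encode("Subword tokenization works across languages.", out_type=str)
print(pieces)                     # whitespace appears as the '▁' meta symbol
print(sp.decode(sp.encode("Subword tokenization works across languages.")))
```

Because whitespace is kept as an ordinary symbol, decoding the token IDs reproduces the original string exactly, which is what makes the approach lossless and language-agnostic.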

These subword approaches have enabled breakthroughs in multilingual modeling and handling of morphologically rich languages. They dramatically reduce out-of-vocabulary issues while maintaining reasonable vocabulary sizes (typically 30,000-50,000 tokens), supporting efficient training and inference in modern language models.