HARSH PRATAP SINGH

I should have taken a closer look at the tokenizer

Well, I gave an interview some time ago and wasn't able to explain tokenizers properly. I missed important points and got the sweet rejection I deserved (along with some great advice). But I got interested in tokenizers because of that, and damn, I found insights that I didn't have the maturity to see when I used tiktoken for the first time. So I'm going back to the basics on the way to training interesting models.

Tokenization

Tokenization is the process by which a tokenizer breaks a piece of text into smaller pieces called tokens. Each token is then assigned an integer (its token ID) that uniquely identifies it within the tokenizer vocabulary, the set of all tokens the tokenizer learned during its training. Token IDs are essentially indexes into that vocabulary. As a side note, tokenizer training is different from neural network training. You can train your own tokenizer and restrict its token space with various parameters, including the size of the vocabulary. What do you think happens if a token in your text does not exist in the vocabulary of the tokenizer your LLM uses? Disaster. So most LLM vocabularies are pretty huge. Tokenizers fall into three main groups: word-level, subword-level, and character-level.

BPE, Unigram, WordPiece: all sorts of algorithms exist. You can try building a tokenizer yourself and playing with a small model like MiniLM locally. If you play with emojis, you will see that when a token doesn't exist in the tokenizer vocabulary, it gets mapped to a special unknown token.
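To make the unknown-token behaviour concrete, here is a minimal sketch of a word-level tokenizer. The five-entry vocabulary, the `[UNK]` name, and the `encode`/`decode` helpers are made up for illustration; they are not taken from MiniLM or any real library.

```python
# Toy word-level tokenizer with a hypothetical 5-entry vocabulary.
VOCAB = {"[UNK]": 0, "i": 1, "love": 2, "tokenizers": 3, "pizza": 4}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def encode(text):
    """Lowercase, split on whitespace, and map each piece to its vocabulary ID."""
    return [VOCAB.get(piece, VOCAB["[UNK]"]) for piece in text.lower().split()]


def decode(ids):
    """Map IDs back to tokens; anything that became [UNK] is lost for good."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


print(encode("I love tokenizers"))   # [1, 2, 3]
print(encode("I love 🍕"))           # [1, 2, 0]
print(decode(encode("I love 🍕")))   # i love [UNK]
```

The point of the last line is that the emoji cannot be recovered from the IDs: once a piece falls back to the unknown token, the model never sees what it was.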

Embeddings

Tokenizers were developed to do complicated numerical analysis of text, mostly based on the frequencies of individual tokens in a given text. What we need is context: some way to capture the relationships between the tokens so that the meaning of the text is preserved. Embeddings (vectors representing tokens) are better for that. Embeddings are a byproduct of transformer training and are learned from heaps of tokenized text. They are what is actually fed into an LLM when we ask it to generate text. In an encoder-decoder transformer, both the encoder and the decoder accept embeddings as input. The output of the encoder is also a set of embeddings, which is passed into the decoder’s cross-attention layers and plays a fundamental role in generating (predicting) tokens in the decoder’s output. The token IDs are used to fetch rows from the embedding matrix; those rows are assembled into a tensor, which is then fed to the input of the transformer.
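The lookup step can be sketched in a few lines. The matrix values and the `embed` helper below are invented for illustration; real models learn these rows during training and use hundreds or thousands of dimensions.

```python
# Hypothetical embedding matrix: one row per vocabulary entry, 3 dimensions each.
EMBEDDINGS = [
    [0.0, 0.0, 0.0],    # token ID 0
    [0.1, 0.3, -0.2],   # token ID 1
    [0.7, -0.1, 0.4],   # token ID 2
    [-0.5, 0.2, 0.9],   # token ID 3
]


def embed(token_ids):
    """Fetch one row per token ID; result shape is (sequence_length, embedding_dim)."""
    return [EMBEDDINGS[i] for i in token_ids]


x = embed([1, 3, 3, 2])
print(len(x), len(x[0]))   # 4 3
print(x[1] == x[2])        # True: the same ID always fetches the same row
```

At this stage a repeated token ID gets an identical vector wherever it appears. Any sensitivity to context comes later, inside the attention layers.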

Your RAG depends on it actually

In a RAG pipeline, the text is first tokenized, then an embedding is fetched for each token by its ID, then the embeddings are assembled into a tensor, and that tensor is fed into the transformer, where the attention magic happens. Earlier, I used to think about RAG pipelines in terms of embeddings and chunking, but I never thought about tokenizers. Now you should easily be able to see that words missing from the tokenizer vocabulary can produce undesirable tokens, which has implications for RAG.

In old word-level systems, out-of-vocabulary words were a direct problem. In modern subword and byte-level tokenizers, the usual problem is different: the tokenizer can still represent the string, but it may represent it badly. A rare word, weird Unicode sequence, emoji, typo, long number, or domain-specific identifier may explode into many tokens, get normalized strangely, or produce a token sequence that the model rarely saw during training.

The easiest example is emojis, which many tokenizers don't handle well. Put contrasting emojis into the same sentence, embed it, and then compare the embeddings against a version of the text where the emojis are replaced with textual descriptions. You will see that the embeddings for the emojis are very close to each other, even though the emojis may mean very different things. Another case could be whether misspelled words get picked up correctly, or how dates and times are handled, as in "It was delivered some-time ago". Who knows when "some-time ago" was? Models generally handle cases like these with the help of additional context and properly chunked metadata, but if your agent doesn't confirm the specific date, i.e. any sort of time context is missing, all the best. You can literally try introducing typos into the dates, or even stray space characters, and wreak more havoc.

Sooo, a little bit of cleaning of the input text actually goes a long way. Standardise the format of your dates so they’re consistent throughout your embeddings, and remove trailing spaces wherever you can. The same goes for any other numerical data, like prices in different currencies. Bloody hell, there can even be adversarial attacks based on word perplexities!
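As a rough sketch of that kind of cleaning, the snippet below collapses whitespace and rewrites a few date formats to ISO 8601 before the text is embedded. The list of accepted formats, the day-first reading of `03/11/2024`, and the `clean` helper are all assumptions for this example; a real corpus needs its own rules.

```python
import re
from datetime import datetime

# Assumed input formats (day-first for numeric dates); extend for your own data.
DATE_FORMATS = ["%d/%m/%Y", "%d-%m-%Y", "%B %d, %Y"]
DATE_PATTERN = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{4}|[A-Z][a-z]+ \d{1,2}, \d{4})\b"
)


def to_iso(match):
    """Rewrite one matched date as YYYY-MM-DD, or leave it alone if unparseable."""
    raw = match.group(0)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return raw


def clean(text):
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace runs, trim ends
    return DATE_PATTERN.sub(to_iso, text)


print(clean("  Delivered on 03/11/2024   and billed March 5, 2024.  "))
# Delivered on 2024-11-03 and billed 2024-03-05.
```

Whatever rules you pick, apply the same `clean` step to documents at indexing time and to queries at retrieval time, otherwise the two sides tokenize differently.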

BPE was introduced for neural machine translation partly to handle rare and unknown words by representing them as subword sequences rather than requiring a huge word vocabulary. Byte-level BPE goes further: GPT-2-style tokenizers can represent arbitrary bytes, so the failure is usually not “unknown token” but fragmentation, inefficient representation, strange byte pieces, or distribution shift.
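A quick way to get a feel for fragmentation is to look at the worst case of a byte-level tokenizer, where no merges apply and every UTF-8 byte becomes its own token. The helper name `byte_fallback_length` is made up here, and a trained tokenizer will usually do better than this bound for common text.

```python
def byte_fallback_length(text):
    """Worst-case token count for a byte-level tokenizer: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


for sample in ["hello", "na\u00efve", "🙂", "日本語"]:
    # columns: text, number of characters, number of bytes
    print(repr(sample), len(sample), byte_fallback_length(sample))
```

Plain ASCII costs one byte per character, while a single emoji costs four bytes and each of the three CJK characters costs three. Nothing is "unknown", but the same visible length can eat several times more context.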

The tokenizer catches you off guard

A tokenizer looks like preprocessing, but in a real large-scale model training pipeline it is closer to infrastructure. It sits before the model, before the loss, before the dataloader emits tensors, before evaluation, and before inference. If it is wrong, the rest of the stack can be perfectly engineered and still behave strangely. It is part of the model architecture, the data pipeline, the compute budget, and sometimes the failure mode.

A training run can fail at the same step again and again, and the instinct is to look at the optimizer, checkpoint, distributed setup, GPU memory, mixed precision, or dataloader. But if skipping one exact data step avoids the crash, the search space collapses. Now it is not “the training run is broken.” It is “this example, or this batch, or this transformation of the data is broken.” That transformation includes the tokenizer.

The raw document is not what the transformer sees. The transformer sees token IDs. Before that happens, the system performs normalization, pre-tokenization, subword segmentation, special-token insertion, truncation, packing, batching, and tensorization:

raw text → normalization → pre-tokenization → subword segmentation → token IDs → packing/truncation → batch tensors → model
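The stages above can be sketched end to end with a toy tokenizer. Everything here is an assumption for illustration: the seven-entry vocabulary, the regex, and the greedy longest-match segmentation (a simplified WordPiece-like rule, not BPE and not any specific library's implementation).

```python
import re
import unicodedata

# Hypothetical toy vocabulary; real ones hold tens of thousands of entries.
VOCAB = {"[UNK]": 0, "token": 1, "izer": 2, "s": 3, "are": 4, "fun": 5, "!": 6}


def normalize(text):
    """Fold compatibility characters (e.g. fullwidth letters) and lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    """Split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment(word):
    """Greedy longest-match-first split into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in VOCAB:
            end -= 1
        if end == start:           # nothing matches: whole word becomes unknown
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


def encode(text, max_len=8):
    pieces = [p for w in pre_tokenize(normalize(text)) for p in segment(w)]
    return [VOCAB[p] for p in pieces][:max_len]   # truncation is the last step


# The first character is a fullwidth "T" (U+FF34); NFKC folds it to a plain "t".
print(encode("\uff34okenizers are fun!"))   # [1, 2, 3, 4, 5, 6]
```

Each function is a place where a weird input can go wrong independently, which is why it helps to be able to run them one at a time on a suspicious document.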

A pathological raw string can become a pathological tokenization problem. Extremely long numeric sequences, repeated characters, weird Unicode, broken markup, logs, base64 blobs, corrupted JSON, or domain-specific identifiers can produce unusually long token sequences or trigger slow paths in tokenizer implementations. At small scale this looks like a bad row. At pretraining scale it looks like a deterministic crash that costs real money to reproduce.

When a training run fails deterministically at the same data step, do not debug the whole universe. Isolate the pipeline.

  1. Can the same checkpoint train on the next batch?
  2. Can the same batch tokenize offline?
  3. Does the crash disappear if the tokenizer is swapped?
  4. Does the crash disappear if the raw document is removed?
  5. Does the crash disappear if multiprocessing tokenization is disabled?
  6. Does the tokenized sequence have extreme length, weird special tokens, replacement characters, or abnormal numeric runs?

This is boring engineering, but it is the difference between a week of blind debugging and one reproducible unit test.
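One possible shape for that unit test is a small offline check that takes a raw sample and reports why it looks suspicious. This is only a sketch: the thresholds, the reason names, and the `flag_sample` helper are assumptions, and `byte_tokenize` is a stand-in for whatever tokenizer the pipeline really uses.

```python
import re

REPLACEMENT_CHAR = "\ufffd"               # left behind by lossy byte decoding
LONG_DIGIT_RUN = re.compile(r"\d{50,}")   # threshold is an arbitrary assumption


def flag_sample(text, tokenize, max_tokens=8192, max_tokens_per_char=2.0):
    """Return a list of reasons this raw sample deserves a closer look."""
    reasons = []
    n_tokens = len(tokenize(text))
    if n_tokens > max_tokens:
        reasons.append("too_many_tokens")
    if text and n_tokens / len(text) > max_tokens_per_char:
        reasons.append("high_tokens_per_char")
    if REPLACEMENT_CHAR in text:
        reasons.append("replacement_char")
    if LONG_DIGIT_RUN.search(text):
        reasons.append("long_digit_run")
    return reasons


def byte_tokenize(text):
    """Stand-in tokenizer for the demo: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


print(flag_sample("plain english text", byte_tokenize))   # []
print(flag_sample("id=" + "7" * 60, byte_tokenize))       # ['long_digit_run']
print(flag_sample("caf\ufffd " + "🙂" * 10, byte_tokenize))
# ['high_tokens_per_char', 'replacement_char']
```

Once the offending document is found, pinning it in a test with the exact expected flags keeps the same failure from coming back silently after a tokenizer or data change.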

This is also why tokenizer choice affects model quality, not only runtime. The tokenizer decides the atomic symbols the model learns over. If the tokenizer represents numbers, code, dates, identifiers, or non-English text poorly, the model has to spend capacity learning around that representation. The vocabulary size, pre-tokenization regex, and tokenizer training data can materially affect generation speed, effective context size, memory usage, and downstream performance.

At small scale, tokenizer bugs are annoying. At large scale, they are expensive. A failed experiment is not just a stack trace; it is GPU time, queue time, researcher time, and uncertainty. The expensive part is often not the bug itself, but the size of the search space. Was it the data? Tokenizer? Packing? Distributed loader? Checkpoint? Optimizer? Precision? Kernel? Scheduler? The right move is to reduce the search space as aggressively as possible. If changing the data fixes it, debug the data path. If swapping the tokenizer fixes it, debug the tokenizer. If disabling packing fixes it, debug sequence construction. If the same raw sample reproduces the failure offline, turn that sample into a regression test.

Tokenizer health checks I would run before serious training

Before launching an expensive run, I would want a tokenizer report over a representative data sample:

A tokenizer with a good average can still have awful tails. The tails are what crash training, waste context, inflate cost, or poison batches. There have been real tokenizer implementation issues involving hangs in subprocess/dataloader settings, so tokenizer latency and multiprocessing behavior are not purely theoretical concerns.
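A minimal sketch of a report that looks at tails rather than only the mean could look like this. The field names and the `tokenizer_report` helper are hypothetical, the percentile uses the simple nearest-rank definition, and the demo again uses a one-token-per-byte stand-in instead of a real tokenizer.

```python
import math
import statistics


def tokenizer_report(samples, tokenize):
    """Summarize token counts over samples, with emphasis on the tail."""
    counts = sorted(len(tokenize(s)) for s in samples)
    total_chars = sum(len(s) for s in samples)

    def percentile(p):
        # nearest-rank percentile on the sorted counts
        rank = max(1, math.ceil(p * len(counts) / 100))
        return counts[rank - 1]

    return {
        "samples": len(counts),
        "mean_tokens": statistics.mean(counts),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": counts[-1],
        "tokens_per_char": sum(counts) / total_chars if total_chars else 0.0,
    }


def byte_tokenize(text):
    """Stand-in tokenizer: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


# 98 ordinary rows plus 2 pathological ones (5000 digits each).
samples = ["hello world"] * 98 + ["9" * 5000] * 2
report = tokenizer_report(samples, byte_tokenize)
print(report["p50"], report["p99"], report["max"])   # 11 5000 5000
```

In this made-up sample the median row is 11 tokens while the 99th percentile is 5000, which is exactly the kind of gap an average hides.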

Numbers are not just text

Numbers look simple to humans, but tokenizers often represent them in unnatural ways. One model may split a number digit by digit. Another may group it into two- or three-digit chunks. Another may tokenize from left to right, which is awkward for arithmetic because carries usually propagate from the right.

This matters because the model does not receive the number as a number. It receives a sequence of token IDs. The representation can change the difficulty of the task. A date, price, timestamp, account ID, version string, latitude-longitude pair, or long integer can become many tokens with weak numerical structure.
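The effect of chunking direction is easy to see on a single number. The two helpers below are illustrative only and do not reproduce any particular model's tokenizer; they just split a digit string into groups of three from either end.

```python
def chunk_left_to_right(digits, size=3):
    """Group digits starting from the most significant end."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits, size=3):
    """Group digits starting from the least significant end."""
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    return chunks + [digits[i:i + size] for i in range(head, len(digits), size)]


print(chunk_left_to_right("1234567"))   # ['123', '456', '7']
print(chunk_right_to_left("1234567"))   # ['1', '234', '567']
```

The right-to-left grouping lines up with thousands, so the same chunk always sits at the same place value. With left-to-right grouping, appending one digit to a number shifts what every chunk means.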

So for numeric-heavy domains, tokenizer evaluation should include numeric stress tests:
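One hedged sketch of such a stress test: run a fixed set of numeric-looking strings through the tokenizer, record how many tokens each one costs, and check that decoding gives the original string back. The case list follows the kinds of values mentioned above, and `numeric_stress` plus the byte-level stand-in are assumptions for illustration.

```python
NUMERIC_CASES = [
    "0", "007", "3.14159", "1,000,000", "1e-9", "-42",
    "2024-03-05T14:30:00Z",   # timestamp
    "v1.2.10",                # version string
    "12.9716,77.5946",        # latitude-longitude pair
    "9" * 40,                 # long integer
]


def numeric_stress(encode, decode, cases=NUMERIC_CASES):
    """For each case, record the token count and whether decode(encode(x)) == x."""
    rows = []
    for text in cases:
        ids = encode(text)
        rows.append(
            {"text": text, "tokens": len(ids), "round_trip": decode(ids) == text}
        )
    return rows


# Stand-in tokenizer: raw UTF-8 bytes as token IDs.
rows = numeric_stress(
    lambda s: list(s.encode("utf-8")),
    lambda ids: bytes(ids).decode("utf-8"),
)
for row in rows:
    print(row["tokens"], row["round_trip"], row["text"])
```

With a real tokenizer plugged in, the interesting rows are the ones where the token count jumps for a small change in the input, or where the round trip fails because of normalization.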

Debugging checklist: when to suspect the tokenizer

Suspect the tokenizer when:

The fastest path is usually not to inspect the whole training stack. It is to freeze everything except one variable: data shard, tokenizer, packing logic, or dataloader mode.

The tokenizer is an architecture decision

A tokenizer feels external because it is trained before the model. But after training starts, it becomes part of the model contract. The embedding table is indexed by token IDs. The model learns patterns over those IDs. The context window is measured in those IDs. The training budget is consumed by those IDs. The inference bill is charged by those IDs.
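The "contract" part can be checked mechanically: every ID the tokenizer emits has to be a valid row index in the embedding table. The helper below is a hypothetical guard, not part of any framework, and the vocabulary sizes in the demo are made up.

```python
def check_ids_fit(token_ids, embedding_rows):
    """Raise if any ID would index outside a table with `embedding_rows` rows."""
    bad = [i for i in token_ids if not 0 <= i < embedding_rows]
    if bad:
        raise ValueError(
            f"token IDs out of range for {embedding_rows} rows: {bad[:5]}"
        )
    return True


print(check_ids_fit([0, 5, 49_999], embedding_rows=50_000))   # True

try:
    # Same model, but IDs from a tokenizer with a larger (hypothetical) vocabulary.
    check_ids_fit([0, 5, 50_257], embedding_rows=50_000)
except ValueError as err:
    print("mismatch:", err)
```

An out-of-range ID is the loud failure. The quiet one is a swapped tokenizer whose IDs are all in range but mean different tokens, which no shape check will catch.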

So tokenizer choice affects:

Changing the tokenizer is therefore not a harmless preprocessing change. It changes the model’s input language.