AI Definitions: Tokenization

Tokenization – The process of converting raw data (text, images, or audio) into small units called tokens. This happens at two points in an LLM's life cycle: first during pretraining, when the raw training data is converted into tokens, and again at inference, when a user's prompt (whether text, images, or audio) is converted into tokens before the model processes it.
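A minimal sketch of the idea in Python, using simple whitespace-separated words as tokens. (Real LLM tokenizers such as byte-pair encoding split text into subword units, and the function names here are illustrative, not from any actual library.)

```python
# Toy illustration of tokenization: map each word to an integer token ID.
# Actual LLM tokenizers use learned subword vocabularies, not whole words.

def build_vocab(corpus):
    """Assign a unique integer ID to each distinct word in the corpus."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text into a list of token IDs; -1 marks unknown words."""
    return [vocab.get(word, -1) for word in text.split()]

corpus = "the cat sat on the mat"
vocab = build_vocab(corpus)            # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(tokenize("the cat sat", vocab))  # [0, 1, 2]
```

The same vocabulary built at training time is reused at inference, which is why a prompt must be tokenized the same way the training data was.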

More AI definitions