olm.data.tokenization.hf_train_custom

Classes

HFTokenizerTrainCustom(files, vocab_size, ...)

class olm.data.tokenization.hf_train_custom.HFTokenizerTrainCustom(files: List[str], vocab_size: int, special_tokens: List[str], save_location: str, unk_token: str = '[UNK]')

Bases: TokenizerBase

decode(tokens: torch.Tensor) → str

Decodes a single 1D tensor of token IDs back into a string.

encode(text: str) → torch.Tensor

Encodes a single string into a 1D PyTorch tensor of input IDs. No padding is applied, since a single input has no other sequences to be aligned with.
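The constructor arguments suggest that this class trains a tokenizer on local text files and writes it to disk. The source does not show the implementation, so the following is only a sketch of how those arguments could map onto the Hugging Face `tokenizers` library. The helper name `train_tokenizer`, the choice of a BPE model, the whitespace pre-tokenizer, and the sample corpus are all assumptions made for illustration.

```python
import os
import tempfile

import torch
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer


def train_tokenizer(files, vocab_size, special_tokens, save_location, unk_token="[UNK]"):
    """Hypothetical helper mirroring the documented constructor arguments."""
    # Assumption: a BPE model. The documented class may train a different model.
    tokenizer = Tokenizer(BPE(unk_token=unk_token))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=special_tokens,
        show_progress=False,
    )
    tokenizer.train(files, trainer)
    # Persist the trained tokenizer as a single JSON file.
    tokenizer.save(save_location)
    return tokenizer


with tempfile.TemporaryDirectory() as tmp:
    # Tiny made-up corpus, written to disk because training reads from files.
    corpus = os.path.join(tmp, "corpus.txt")
    with open(corpus, "w", encoding="utf-8") as f:
        f.write("the quick brown fox\nthe lazy dog\n")

    save_path = os.path.join(tmp, "tokenizer.json")
    tok = train_tokenizer([corpus], 100, ["[UNK]", "[PAD]"], save_path)
    saved = os.path.exists(save_path)

    # Single input: wrap the ID list in a 1D tensor, with no padding involved.
    ids = torch.tensor(tok.encode("the quick dog").ids, dtype=torch.long)
```

Listing the `unk_token` among `special_tokens` keeps it in the vocabulary, so text that cannot be tokenized still maps to a valid ID.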

class olm.data.tokenization.hf_train_custom.TokenizerBase

Bases: ABC

Abstract base class for all tokenizers in OLM.

Defines the interface for converting between text strings and integer token IDs. Subclasses must implement both encode and decode.

abstractmethod decode(tokens: torch.Tensor) → str

Converts a sequence of token IDs back into a text string.

  • Parameters: tokens (torch.Tensor) – A 1D tensor or list of token IDs.
  • Returns: The decoded text string.
  • Return type: str

abstractmethod encode(text: str) → torch.Tensor

Converts a text string into a sequence of token IDs.

  • Parameters: text (str) – The input text to tokenize.
  • Returns: A 1D tensor containing the token IDs.
  • Return type: torch.Tensor
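The encode and decode contract described above can be illustrated with a toy subclass. This is a minimal sketch, not OLM code: `TokenizerBase` is redefined locally as a stand-in that mirrors the documented signatures, and `CharTokenizer` is a hypothetical class that maps each character to its Unicode code point.

```python
from abc import ABC, abstractmethod

import torch


class TokenizerBase(ABC):
    """Local stand-in mirroring the documented interface (not the OLM source)."""

    @abstractmethod
    def encode(self, text: str) -> torch.Tensor:
        ...

    @abstractmethod
    def decode(self, tokens: torch.Tensor) -> str:
        ...


class CharTokenizer(TokenizerBase):
    """Hypothetical toy tokenizer: one token ID per Unicode code point."""

    def encode(self, text: str) -> torch.Tensor:
        # Return a 1D integer tensor, as the interface requires.
        return torch.tensor([ord(ch) for ch in text], dtype=torch.long)

    def decode(self, tokens: torch.Tensor) -> str:
        # as_tensor accepts both tensors and plain lists of IDs.
        return "".join(chr(i) for i in torch.as_tensor(tokens).tolist())


tok = CharTokenizer()
ids = tok.encode("hello")
text = tok.decode(ids)
```

Because both methods are declared abstract, instantiating the base class directly, or a subclass that omits one of them, raises `TypeError`.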