olm.data.tokenization.hf_train_custom

`HFTokenizerTrainCustom`(files, vocab_size, ...)

Decodes a single 1D tensor of token IDs back into a string.

Parameters: tokens (torch.Tensor) – A 1D tensor or list of token IDs.
Returns: The decoded text string.
Return type: str

Encodes a single string into a 1D PyTorch tensor of input IDs. Padding is implicitly disabled for single inputs.

Bases: ABC

Abstract base class for all tokenizers in OLM.

Defines the interface for converting between text strings and integer token IDs. Subclasses must implement the `encode` and `decode` methods.

Converts a sequence of token IDs back into a text string.

Converts a text string into a sequence of token IDs.
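The encode/decode contract described above can be illustrated with a minimal sketch. The names `BaseTokenizer` and `CharTokenizer` are hypothetical and are not taken from OLM. The sketch uses plain Python lists rather than `torch.Tensor` so that it runs without extra dependencies, and the real base class may differ in its name, signatures, and return types.

```python
from abc import ABC, abstractmethod
from typing import List


class BaseTokenizer(ABC):
    """Hypothetical stand-in for an abstract tokenizer interface."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        """Convert a text string into a sequence of token IDs."""

    @abstractmethod
    def decode(self, tokens: List[int]) -> str:
        """Convert a sequence of token IDs back into a text string."""


class CharTokenizer(BaseTokenizer):
    """Toy character-level tokenizer, for illustration only."""

    def __init__(self, alphabet: str) -> None:
        # Sort the unique characters so that ID assignment is deterministic.
        self.id_to_char = sorted(set(alphabet))
        self.char_to_id = {ch: i for i, ch in enumerate(self.id_to_char)}

    def encode(self, text: str) -> List[int]:
        # Raises KeyError for characters outside the alphabet.
        return [self.char_to_id[ch] for ch in text]

    def decode(self, tokens: List[int]) -> str:
        return "".join(self.id_to_char[i] for i in tokens)


tokenizer = CharTokenizer("abc ")
ids = tokenizer.encode("a cab")
text = tokenizer.decode(ids)  # round-trips back to the original string
```

Because both methods are declared with `abstractmethod`, instantiating `BaseTokenizer` directly, or a subclass that omits either method, raises `TypeError`. This is how an ABC enforces the requirement that subclasses implement both methods.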