Source: src/olm/data/datasets/base_dataset.py:1
Classes
BaseTextDataset(tokenizer: Any, context_length: int, skip_batches: int = 0, shuffle: bool = False, seed: int = 42)
Bases: IterableDataset, ABC
Source: src/olm/data/datasets/base_dataset.py:8
Abstract base class for text-based streaming datasets.
BaseTextDataset handles tokenization, buffering, next-token target
construction, worker sharding, and distributed-rank sharding. Subclasses
only need to implement _get_text_iterator and yield raw text strings.
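The sharding mentioned above can be pictured as a round-robin split of the raw text stream across every (rank, worker) pair. The following is a minimal sketch of that idea, not the class's actual implementation: the helper name shard_stream and its parameters are hypothetical, and the round-robin strategy itself is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(
    items: Iterable[T],
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
) -> Iterator[T]:
    """Hypothetical helper: keep only the items that belong to one (rank, worker) pair.

    Assumes round-robin assignment; the real class may shard differently.
    """
    # Flatten (rank, worker_id) into a single shard index.
    shard_index = rank * num_workers + worker_id
    num_shards = world_size * num_workers
    # Take every num_shards-th item, starting at this shard's offset.
    return islice(items, shard_index, None, num_shards)


# Two ranks with two workers each give four disjoint shards.
shard = list(shard_stream(range(8), rank=1, world_size=2, worker_id=0, num_workers=2))
```

With this scheme every item is seen by exactly one worker on exactly one rank, so no sample is duplicated across processes.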
Iteration
Yields (input_ids, labels) tuples. Both tensors have shape
[context_length] and dtype torch.long. labels is the
one-token-shifted target sequence for causal language modeling.
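To illustrate the one-token shift, here is a small sketch that packs tokenized text into a buffer and cuts (input_ids, labels) pairs from it. It uses plain Python lists instead of torch.long tensors so that it runs without dependencies. ToyTokenizer and pack_samples are hypothetical names, and the windowing policy (consecutive windows share one boundary token) is an assumption rather than the documented behavior.

```python
from typing import Iterable, Iterator, List, Tuple


class ToyTokenizer:
    """Hypothetical stand-in tokenizer: maps each character to its code point."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


def pack_samples(
    texts: Iterable[str], tokenizer: ToyTokenizer, context_length: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (input_ids, labels) pairs, each of length context_length."""
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenizer.encode(text))
        # One sample needs context_length + 1 tokens: the inputs plus one extra target.
        while len(buffer) >= context_length + 1:
            window = buffer[: context_length + 1]
            # labels[i] is the token that follows input_ids[i].
            yield window[:-1], window[1:]
            # Assumption: keep the last token so it starts the next window.
            buffer = buffer[context_length:]


samples = list(pack_samples(["abcd", "efg"], ToyTokenizer(), context_length=3))
```

In each pair, labels is input_ids shifted left by one position, which is the layout a causal language-modeling loss expects.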
Parameters
tokenizer: Tokenizer with an encode method.
context_length (int): Number of input tokens per sample.
skip_batches (int): Number of yielded samples to skip, useful for coarse resume behavior.
shuffle (bool): Whether the concrete dataset should shuffle its source stream when supported.
seed (int): Shuffle seed.
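The effect of skip_batches can be sketched as dropping the first N yielded samples from the stream. The helper below is hypothetical and only illustrates the idea; how the class implements skipping internally is not stated here.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def skip_samples(samples: Iterable[T], skip_batches: int = 0) -> Iterator[T]:
    """Hypothetical helper: discard the first skip_batches samples, then yield the rest."""
    # islice still consumes the skipped items from the underlying iterator,
    # so in this sketch skipped samples are produced and then thrown away.
    return islice(samples, skip_batches, None)


remaining = list(skip_samples(iter(range(5)), skip_batches=2))
```

Because the skipped samples are still generated in this sketch, resuming is coarse: it restores the position in the stream but does not avoid the cost of reading up to it.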