olm.data.datasets¶
class olm.data.datasets.BaseTextDataset(*args: Any, **kwargs: Any)¶
Bases: IterableDataset, ABC
Abstract base class for text-based streaming datasets.
Handles tokenization, token buffering, and fixed-length sequence generation generically. Subclasses must implement _get_text_iterator to yield text chunks.
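A minimal sketch of the buffering pattern described above: a subclass supplies a text iterator, and the base class tokenizes each chunk into a buffer and slices off fixed-length sequences. The class name `ToyDataset` and the whitespace "tokenizer" are stand-ins for illustration, not olm's actual implementation.

```python
class ToyDataset:
    def __init__(self, texts, context_length):
        self.texts = texts
        self.context_length = context_length

    def _get_text_iterator(self):
        # Subclass responsibility: yield raw text chunks.
        yield from self.texts

    def __iter__(self):
        buffer = []
        for text in self._get_text_iterator():
            # Stand-in "tokenizer": one token per whitespace-separated word.
            buffer.extend(text.split())
            # Emit fixed-length sequences as soon as enough tokens accumulate.
            while len(buffer) >= self.context_length:
                yield buffer[: self.context_length]
                buffer = buffer[self.context_length :]


ds = ToyDataset(["a b c d", "e f g"], context_length=3)
print(list(ds))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```

Note that tokens left in the buffer when the text iterator is exhausted (here, `'g'`) are dropped rather than padded.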
class olm.data.datasets.DataLoader(*args: Any, **kwargs: Any)¶
Bases: DataLoader
Wrapper around PyTorch’s DataLoader with sensible defaults for LLM training.
This class extends torch.utils.data.DataLoader with:
- Better defaults for language model training
- Automatic worker configuration
- Pin memory optimization for GPU training
- Persistent workers for efficiency
- Parameters:
- dataset – Dataset to load from (can be map-style or iterable).
- batch_size – Number of samples per batch (default: 8).
- shuffle – Whether to shuffle data at every epoch (default: False for iterable datasets).
- num_workers – Number of worker processes for data loading (default: 0).
- pin_memory – If True, tensors are copied to CUDA pinned memory (default: True).
- drop_last – Drop the last incomplete batch if dataset size is not divisible by batch_size.
- persistent_workers – Keep workers alive between epochs for faster startup (default: True if num_workers > 0).
- prefetch_factor – Number of batches to prefetch per worker (default: 2).
- collate_fn – Function to merge samples into batches.
- **kwargs – Additional arguments passed to torch.utils.data.DataLoader.
Example¶
>>> from olm.data.datasets import DataLoader
>>> loader = DataLoader(
... dataset=my_dataset,
... batch_size=16,
... num_workers=4,
... pin_memory=True,
... )
>>> for batch in loader:
... # Training loop
... pass
class olm.data.datasets.FineWebEduDataset(*args: Any, **kwargs: Any)¶
Bases: HuggingFaceTextDataset
FineWeb Edu dataset configuration.
- Parameters:
- split – Dataset split (‘train’ or ‘validation’)
- context_length – Sequence length for training (default: 1024)
- subset – Dataset subset to use (default: ‘sample-10BT’)
- tokenizer – Tokenizer object (e.g. from AutoTokenizer)
- streaming – Whether to use streaming mode (default: True)
- shuffle – Whether to shuffle the dataset (default: False)
- seed – Random seed for shuffling (default: 42)
- cache_dir – Directory to cache downloaded data (default: None)
- skip_batches – Number of batches to skip
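To illustrate what seed-driven shuffling and batch skipping mean for a streaming dataset, here is a generic sketch: a fixed-size shuffle buffer (the usual way to approximate shuffling when the data cannot be held in memory) and an islice-based skip for resuming mid-epoch. This shows the semantics of the `shuffle`, `seed`, and `skip_batches` parameters, not olm's actual code.

```python
import itertools
import random


def shuffled_stream(items, buffer_size, seed):
    # Buffer-based shuffle: fill a fixed-size buffer, then for each new
    # item swap out a randomly chosen buffered item. The seed makes the
    # order reproducible across runs.
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    rng.shuffle(buffer)
    yield from buffer


def skip_batches(batches, n):
    # Skip the first n batches, e.g. when resuming training mid-epoch.
    return itertools.islice(batches, n, None)


stream = shuffled_stream(range(10), buffer_size=4, seed=42)
batches = [list(pair) for pair in zip(*[iter(stream)] * 2)]  # batch size 2
remaining = list(skip_batches(iter(batches), 3))
```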
class olm.data.datasets.HuggingFaceTextDataset(*args: Any, **kwargs: Any)¶
Bases: BaseTextDataset
Generic dataset loader for Hugging Face text datasets.
Inherits from BaseTextDataset to share token buffering logic and multi-worker safety.
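"Multi-worker safety" for an iterable dataset typically means each DataLoader worker reads a disjoint shard of the stream, so no example is seen twice per epoch. A torch-free sketch of the idea (in a real IterableDataset the worker id and count would come from torch.utils.data.get_worker_info(); this is an illustration, not olm's code):

```python
def shard(stream, worker_id, num_workers):
    # Each worker keeps every num_workers-th example, offset by its id,
    # so the workers' shards are disjoint and cover the whole stream.
    for i, item in enumerate(stream):
        if i % num_workers == worker_id:
            yield item


shards = [list(shard(range(8), w, 2)) for w in range(2)]
print(shards)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```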
class olm.data.datasets.LocalTextDataset(*args: Any, **kwargs: Any)¶
Bases: BaseTextDataset
Dataset that streams text from local .txt files in a directory.
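A minimal sketch of the behavior LocalTextDataset describes, streaming the contents of .txt files from a directory (the helper name `iter_txt_files` and the sorted-order traversal are illustrative assumptions, not olm's actual code):

```python
import pathlib
import tempfile


def iter_txt_files(directory):
    # Yield each .txt file's contents, in sorted order for determinism.
    for path in sorted(pathlib.Path(directory).glob("*.txt")):
        yield path.read_text(encoding="utf-8")


# Demo against a throwaway directory with two small files.
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.txt", "hello"), ("b.txt", "world")]:
        (pathlib.Path(tmp) / name).write_text(text, encoding="utf-8")
    chunks = list(iter_txt_files(tmp))
    print(chunks)  # ['hello', 'world']
```

The yielded chunks would then feed BaseTextDataset's shared tokenization and buffering logic.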
Modules¶
| Module | Description |
|---|---|
| base_dataset | |
| data_loader | DataLoader wrapper for OLM library. |
| hf_dataset | |
| local_dataset | |