olm.data.datasets.data_loader¶
DataLoader wrapper for OLM library.
This module provides a clean wrapper around PyTorch’s DataLoader with sensible defaults for language model training and convenient helpers.
Classes¶
| Class | Description |
|---|---|
| `DataLoader(*args, **kwargs)` | Wrapper around PyTorch's DataLoader with sensible defaults for LLM training. |
class olm.data.datasets.data_loader.DataLoader(*args: Any, **kwargs: Any)¶
Bases: torch.utils.data.DataLoader
Wrapper around PyTorch’s DataLoader with sensible defaults for LLM training.
This class extends torch.utils.data.DataLoader with:
- Better defaults for language model training
- Automatic worker configuration
- Pin-memory optimization for GPU training
- Persistent workers for efficiency
- Parameters:
- dataset – Dataset to load from (can be map-style or iterable).
- batch_size – Number of samples per batch (default: 8).
- shuffle – Whether to reshuffle the data at every epoch (default: False; must remain False for iterable datasets).
- num_workers – Number of worker processes for data loading (default: 0).
- pin_memory – If True, tensors are copied to CUDA pinned memory (default: True).
- drop_last – If True, drop the last incomplete batch when the dataset size is not divisible by batch_size.
- persistent_workers – Keep workers alive between epochs for faster startup (default: True if num_workers > 0).
- prefetch_factor – Number of batches to prefetch per worker (default: 2).
- collate_fn – Function to merge samples into batches.
- **kwargs – Additional arguments passed to torch.utils.data.DataLoader.
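Several of these defaults interact: persistent_workers is only meaningful when num_workers > 0, and PyTorch rejects prefetch_factor for single-process loading. A minimal sketch of how such defaults could be resolved (hypothetical helper, not the library's actual implementation):

```python
def resolve_loader_defaults(batch_size=8, num_workers=0, pin_memory=True,
                            persistent_workers=None, prefetch_factor=2,
                            **kwargs):
    """Hypothetical sketch of default resolution for an LLM-oriented loader.

    Mirrors the documented behavior: persistent_workers defaults to True
    only when worker processes exist, and worker-only options are omitted
    entirely for num_workers == 0 (PyTorch raises if they are passed).
    """
    if persistent_workers is None:
        # Documented default: True if num_workers > 0
        persistent_workers = num_workers > 0

    config = dict(batch_size=batch_size, num_workers=num_workers,
                  pin_memory=pin_memory, **kwargs)
    if num_workers > 0:
        # These options are only valid with multiprocess loading
        config["persistent_workers"] = persistent_workers
        config["prefetch_factor"] = prefetch_factor
    return config

# Single-process loading: worker-only options stay out of the config
print(resolve_loader_defaults())
```

The resolved dictionary could then be forwarded to torch.utils.data.DataLoader unchanged; keeping worker-only keys out of the single-process path avoids the ValueError PyTorch raises when prefetch_factor is given with num_workers=0.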