olm.data.datasets.hf_dataset
Classes
| Class | Description |
|---|---|
| FineWebEduDataset(*args, **kwargs) | FineWeb Edu dataset configuration. |
| HuggingFaceTextDataset(*args, **kwargs) | Generic dataset loader for Hugging Face text datasets. |
class olm.data.datasets.hf_dataset.Any(*args, **kwargs)
Bases: object
Special type indicating an unconstrained type.
- Any is compatible with every type.
- Any is assumed to have all methods.
- All values are assumed to be instances of Any.
Note that all the above statements are true from the point of view of static type checkers. At runtime, Any should not be used with instance checks.
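The difference between static and runtime behavior can be shown with a short sketch. It assumes this entry is the standard library's typing.Any as picked up by the documentation generator; the helper names `describe` and `any_instance_check_fails` are hypothetical and exist only for illustration.

```python
from typing import Any


def describe(value: Any) -> str:
    """Accept any value; a static checker will not flag the call site."""
    # A checker also permits arbitrary attribute access on an Any-typed
    # value, so mistakes with such values only surface at runtime.
    return type(value).__name__


def any_instance_check_fails() -> bool:
    """Return True if isinstance(..., Any) is rejected at runtime."""
    try:
        isinstance(1, Any)
    except TypeError:
        # typing.Any refuses runtime instance checks.
        return True
    return False
```

Calling `describe` with an integer or a string passes static checking in both cases, while `any_instance_check_fails()` shows why Any belongs in annotations rather than in `isinstance` calls.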
class olm.data.datasets.hf_dataset.BaseTextDataset(*args: Any, **kwargs: Any)
Bases: IterableDataset, ABC
Abstract base class for text-based streaming datasets.
Handles tokenization buffering and sequence generation generically. Subclasses must implement _get_text_iterator to yield text chunks.
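The buffering idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `pack_tokens` and `toy_tokenize` are hypothetical names, the text iterable stands in for whatever `_get_text_iterator` yields, and dropping the trailing partial sequence is a choice made for this sketch only.

```python
from typing import Callable, Iterable, Iterator, List


def pack_tokens(
    texts: Iterable[str],
    tokenize: Callable[[str], List[int]],
    context_length: int,
) -> Iterator[List[int]]:
    """Yield fixed-length token sequences from a stream of text chunks."""
    buffer: List[int] = []
    for text in texts:
        buffer.extend(tokenize(text))
        # Emit as many full sequences as the buffer currently holds.
        while len(buffer) >= context_length:
            yield buffer[:context_length]
            buffer = buffer[context_length:]
    # Leftover tokens shorter than context_length are dropped in this sketch.


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token id per character."""
    return [ord(ch) for ch in text]


sequences = list(pack_tokens(["abc", "de", "fgh"], toy_tokenize, 4))
```

Because the buffer carries tokens across chunk boundaries, sequence length is independent of how the source text happens to be split.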
class olm.data.datasets.hf_dataset.FineWebEduDataset(*args: Any, **kwargs: Any)
Bases: HuggingFaceTextDataset
FineWeb Edu dataset configuration.
- Parameters:
  - split – Dataset split (‘train’ or ‘validation’)
  - context_length – Sequence length for training (default: 1024)
  - subset – Dataset subset to use (default: ‘sample-10BT’)
  - tokenizer – Tokenizer object (e.g. from AutoTokenizer)
  - streaming – Whether to use streaming mode (default: True)
  - shuffle – Whether to shuffle the dataset (default: False)
  - seed – Random seed for shuffling (default: 42)
  - cache_dir – Directory to cache downloaded data (default: None)
  - skip_batches – Number of batches to skip
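A possible way to construct the dataset from the parameters listed above is sketched here. Several points are assumptions: that the constructor accepts these names as keyword arguments (the signature only shows `*args, **kwargs`), that 'train' is a sensible split to start from, and that `skip_batches` can be omitted since no default is documented. The helpers `fineweb_edu_kwargs` and `build_dataset` are hypothetical.

```python
from typing import Any, Dict


def fineweb_edu_kwargs(tokenizer: Any, **overrides: Any) -> Dict[str, Any]:
    """Collect the documented parameters, using the documented defaults."""
    kwargs: Dict[str, Any] = {
        "split": "train",          # assumption: no default is documented
        "context_length": 1024,
        "subset": "sample-10BT",
        "tokenizer": tokenizer,
        "streaming": True,
        "shuffle": False,
        "seed": 42,
        "cache_dir": None,
    }
    kwargs.update(overrides)
    return kwargs


def build_dataset(tokenizer: Any, **overrides: Any) -> Any:
    """Instantiate FineWebEduDataset; assumes keyword arguments are accepted."""
    # Imported lazily so the helper above can be used without the package.
    from olm.data.datasets.hf_dataset import FineWebEduDataset

    return FineWebEduDataset(**fineweb_edu_kwargs(tokenizer, **overrides))
```

For example, `build_dataset(tokenizer, split="validation", shuffle=True)` would request a shuffled validation stream while leaving the remaining values at their documented defaults.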
class olm.data.datasets.hf_dataset.HuggingFaceTextDataset(*args: Any, **kwargs: Any)
Bases: BaseTextDataset
Generic dataset loader for Hugging Face text datasets.
Inherits from BaseTextDataset to share token buffering logic and multi-worker safety.
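One common way to make a streaming dataset safe for multiple workers is to give each worker a disjoint slice of the stream, so that no example is produced twice. The sketch below shows that general technique only; whether BaseTextDataset uses this exact scheme is not stated here, and `shard_for_worker` is a hypothetical helper.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_worker(
    items: Iterable[T], worker_id: int, num_workers: int
) -> Iterator[T]:
    """Return an iterator over every num_workers-th item, offset by worker_id."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    # Worker 0 sees items 0, n, 2n, ...; worker 1 sees 1, n + 1, ...; etc.
    return islice(items, worker_id, None, num_workers)


shards = [list(shard_for_worker(range(10), w, 3)) for w in range(3)]
```

In a PyTorch DataLoader, `worker_id` and `num_workers` would typically be read from `torch.utils.data.get_worker_info()` inside the dataset's `__iter__` method.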