Open
Description
Describe the bug
When loading only a subset of a dataset's files while the dataset's README.md contains split metadata, the library raises a NonMatchingSplitsSizesError. This prevents users from loading a partial dataset for quick validation when the network is slow or the dataset is very large.
Steps to reproduce the bug
- Use the Hugging Face datasets library to load a dataset with only specific files specified
- Ensure the dataset repository has split metadata defined in README.md
- Observe the error when attempting to load a subset of files
# Example code that triggers the error
from datasets import load_dataset
book_corpus_ds = load_dataset(
"SaylorTwift/the_pile_books3_minus_gutenberg",
name="default",
data_files="data/train-00000-of-00213-312fd8d7a3c58a63.parquet",
split="train",
cache_dir="./data"
)
Error Message
Traceback (most recent call last):
File "/Users/QingGo/code/llm_learn/src/data/clean_cc_bc.py", line 13, in <module>
book_corpus_ds = load_dataset(
"SaylorTwift/the_pile_books3_minus_gutenberg",
...
File "/Users/QingGo/code/llm_learn/.venv/lib/python3.13/site-packages/datasets/utils/info_utils.py", line 77, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.exceptions.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=106199627990.47722, num_examples=192661, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=454897326, num_examples=905, shard_lengths=None, dataset_name='the_pile_books3_minus_gutenberg')}]
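The traceback shows the error being raised from verify_splits, which compares the split sizes recorded in README.md (192661 examples) with what was actually loaded (905 examples from one shard). A possible workaround, sketched below and not verified against this repository, is to pass verification_mode="no_checks" to load_dataset so that the split-size verification is skipped. The helper name load_single_shard is hypothetical and only wraps the call from the reproduction above.

```python
def load_single_shard(
    repo_id="SaylorTwift/the_pile_books3_minus_gutenberg",
    shard="data/train-00000-of-00213-312fd8d7a3c58a63.parquet",
    cache_dir="./data",
):
    """Hypothetical helper: load one parquet shard without split verification."""
    # Imported inside the function so that defining the helper does not
    # require the library or trigger any download by itself.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        name="default",
        data_files=shard,
        split="train",
        cache_dir=cache_dir,
        # Assumption: skipping verification avoids the verify_splits call
        # that raises NonMatchingSplitsSizesError in the traceback above.
        verification_mode="no_checks",
    )
```

Calling load_single_shard() would still download the selected shard, so it needs network access. This only sidesteps the check; it does not change the default behavior requested in this issue.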
Expected behavior
When loading partial dataset files, the system should:
- Skip the NonMatchingSplitsSizesError validation, OR
- Only log a warning message instead of raising an error
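The second option could look roughly like the sketch below. It is a standalone illustration, not the library's code: SplitSize, check_split_sizes, and the partial_load flag are hypothetical names, and ValueError stands in for NonMatchingSplitsSizesError.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class SplitSize:
    """Hypothetical stand-in for the split metadata compared during verification."""
    name: str
    num_examples: int


def check_split_sizes(expected, recorded, partial_load):
    """Compare expected and recorded split sizes.

    Returns the list of mismatches. When partial_load is True the mismatches
    are only logged as a warning; otherwise a ValueError is raised.
    """
    bad_splits = [
        {"expected": expected[name], "recorded": recorded[name]}
        for name in expected
        if name in recorded
        and expected[name].num_examples != recorded[name].num_examples
    ]
    if bad_splits and partial_load:
        # The user asked for a subset of files, so a mismatch is expected.
        logger.warning("Split sizes differ from README.md metadata: %s", bad_splits)
    elif bad_splits:
        raise ValueError(str(bad_splits))
    return bad_splits


# Numbers taken from the error message in this report.
expected = {"train": SplitSize("train", 192661)}
recorded = {"train": SplitSize("train", 905)}
mismatches = check_split_sizes(expected, recorded, partial_load=True)
```

Here partial_load would be set whenever the caller passes data_files explicitly; how the library would detect that is left open.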
Environment info
- datasets version: 4.3.0
- Platform: macOS-15.7.1-arm64-arm-64bit-Mach-O
- Python version: 3.13.2
- huggingface_hub version: 0.36.0
- PyArrow version: 22.0.0
- Pandas version: 2.3.3
- fsspec version: 2025.9.0