NonMatchingSplitsSizesError when loading partial dataset files #7867

### Describe the bug

When `load_dataset` is asked to load only a subset of a dataset's files and the dataset's README.md contains split metadata, it raises a `NonMatchingSplitsSizesError`. The recorded split sizes describe the full dataset, so they never match the sizes of a partial load. This prevents users from loading part of a dataset for quick validation, which is useful on poor network connections or with very large datasets.

### Steps to reproduce the bug

1. Use the Hugging Face `datasets` library to load a dataset, passing `data_files` so that only specific files are loaded
2. Ensure the dataset repository has split metadata defined in README.md
3. Observe the error when attempting to load a subset of files

```python
# Example code that triggers the error
from datasets import load_dataset

book_corpus_ds = load_dataset(
    "SaylorTwift/the_pile_books3_minus_gutenberg",
    name="default",
    data_files="data/train-00000-of-00213-312fd8d7a3c58a63.parquet",
    split="train",
    cache_dir="./data"
)
```

### Error Message

```
Traceback (most recent call last):
  File "/Users/QingGo/code/llm_learn/src/data/clean_cc_bc.py", line 13, in <module>
    book_corpus_ds = load_dataset(
        "SaylorTwift/the_pile_books3_minus_gutenberg",
    ...
  File "/Users/QingGo/code/llm_learn/.venv/lib/python3.13/site-packages/datasets/utils/info_utils.py", line 77, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.exceptions.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=106199627990.47722, num_examples=192661, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=454897326, num_examples=905, shard_lengths=None, dataset_name='the_pile_books3_minus_gutenberg')}]
```
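
The numbers in the traceback are consistent with one shard being compared against metadata for the whole split. The short calculation below is only a sanity check on those figures. It assumes that the `of-00213` part of the file name means the split is stored as 213 shards, which is a naming convention and not something confirmed by the error output.

```python
# Figures copied from the traceback above.
EXPECTED_EXAMPLES = 192_661  # num_examples in the README.md split metadata
RECORDED_EXAMPLES = 905      # num_examples found in the single loaded file

# Assumption: "train-00000-of-00213-..." means the split has 213 shards.
TOTAL_SHARDS = 213

avg_per_shard = EXPECTED_EXAMPLES / TOTAL_SHARDS
fraction_loaded = RECORDED_EXAMPLES / EXPECTED_EXAMPLES

print(f"average examples per shard: {avg_per_shard:.1f}")
print(f"fraction of the split that was loaded: {fraction_loaded:.2%}")
```

The average shard size is close to the 905 examples that were recorded, so the file itself loaded correctly; only the comparison against the full-split metadata fails.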

### Expected behavior

When loading partial dataset files, the system should either:
1. Skip the split size verification that raises `NonMatchingSplitsSizesError`, or
2. Log a warning instead of raising an error
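
The two options above could look roughly like the sketch below. It is a simplified, standalone model and not the code in `datasets/utils/info_utils.py`: the function `check_split_sizes`, the `on_mismatch` parameter, and the `SplitSizeMismatch` exception are hypothetical names introduced only for illustration, and the sketch assumes the check compares example counts per split.

```python
import warnings


class SplitSizeMismatch(ValueError):
    """Hypothetical stand-in for NonMatchingSplitsSizesError."""


def check_split_sizes(expected, recorded, on_mismatch="raise"):
    """Compare example counts per split name.

    expected / recorded: dicts mapping split name -> number of examples.
    on_mismatch: "raise" (current behavior), "warn" (option 2),
                 or "skip" (option 1).
    Returns the names of the splits whose sizes differ.
    """
    if on_mismatch not in ("raise", "warn", "skip"):
        raise ValueError(f"unknown on_mismatch value: {on_mismatch!r}")
    if on_mismatch == "skip":
        # Option 1: do not verify sizes at all.
        return []

    bad = [
        name
        for name, count in expected.items()
        if name in recorded and recorded[name] != count
    ]
    if bad:
        message = f"split sizes differ from metadata for: {bad}"
        if on_mismatch == "raise":
            raise SplitSizeMismatch(message)
        # Option 2: report the mismatch but let loading continue.
        warnings.warn(message, stacklevel=2)
    return bad


# Example counts taken from the error message in this issue.
expected = {"train": 192_661}
recorded = {"train": 905}

print(check_split_sizes(expected, recorded, on_mismatch="skip"))
print(check_split_sizes(expected, recorded, on_mismatch="warn"))
```

As a possible interim workaround, `load_dataset` accepts a `verification_mode` argument, and passing `verification_mode="no_checks"` is expected to skip split verification. That has not been confirmed against this dataset.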

### Environment info

- `datasets` version: 4.3.0
- Platform: macOS-15.7.1-arm64-arm-64bit-Mach-O
- Python version: 3.13.2
- `huggingface_hub` version: 0.36.0
- PyArrow version: 22.0.0
- Pandas version: 2.3.3
- `fsspec` version: 2025.9.0
