Open
Description
Describe the bug
Some files in the epfml/FineWeb-HQ dataset fail to load via the Hugging Face datasets library.
- xet-hosted files load fine
- LFS-hosted files sometimes fail with a 403 Forbidden error
Example:
- Fails: https://huggingface.co/datasets/epfml/FineWeb-HQ/blob/main/data/CC-MAIN-2024-26/000_00003.parquet
- Works: https://huggingface.co/datasets/epfml/FineWeb-HQ/blob/main/data/CC-MAIN-2024-42/000_00027.parquet
Discussion: https://huggingface.co/datasets/epfml/FineWeb-HQ/discussions/2
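To help narrow this down, here is a minimal sketch that turns the two example URLs above into repo id, revision, and file path, and then tries to fetch one file with huggingface_hub alone, bypassing datasets. The helper names split_blob_url and try_download are hypothetical, and the URL parsing assumes the revision is a single path segment such as main.

```python
from urllib.parse import urlparse

FAILING_URL = "https://huggingface.co/datasets/epfml/FineWeb-HQ/blob/main/data/CC-MAIN-2024-26/000_00003.parquet"
WORKING_URL = "https://huggingface.co/datasets/epfml/FineWeb-HQ/blob/main/data/CC-MAIN-2024-42/000_00027.parquet"


def split_blob_url(url):
    """Split a Hub dataset blob URL into (repo_id, revision, path_in_repo).

    Hypothetical helper; assumes the revision has no slash in it.
    """
    parts = urlparse(url).path.strip("/").split("/")
    # Expected layout: datasets/<owner>/<name>/blob/<revision>/<path...>
    if len(parts) < 6 or parts[0] != "datasets" or parts[3] != "blob":
        raise ValueError(f"not a dataset blob URL: {url}")
    return "/".join(parts[1:3]), parts[4], "/".join(parts[5:])


def try_download(url):
    """Fetch one file via huggingface_hub; return None on success, else the error text."""
    # Imported lazily so split_blob_url works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    from huggingface_hub.utils import HfHubHTTPError

    repo_id, revision, filename = split_blob_url(url)
    try:
        hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            revision=revision,
            repo_type="dataset",
        )
    except HfHubHTTPError as err:
        return str(err)
    return None
```

Calling try_download on each URL should show whether the 403 already appears at the huggingface_hub download level or only when going through load_dataset.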
Steps to reproduce the bug
from datasets import load_dataset
ds = load_dataset(
    "epfml/FineWeb-HQ",
    data_files="data/CC-MAIN-2024-26/000_00003.parquet",
)
Error message:
HfHubHTTPError: 403 Forbidden: None.
Cannot access content at: https://cdn-lfs-us-1.hf.co/repos/...
Make sure your token has the correct permissions.
...
<Error><Code>AccessDenied</Code><Message>Access Denied</Message></Error>
Expected behavior
load_dataset should succeed for every file in the dataset, whether it is hosted on xet or LFS.
Environment info
- python 3.10
- datasets 4.4.1