[data] Improve Ray Data reading from Hugging Face datasets documentation #58832

Merged
richardliaw merged 3 commits into ray-project:master from robertnishihara:docs
Nov 20, 2025

@robertnishihara robertnishihara commented Nov 19, 2025

Part 2 of #58492

NOTE: Still making edits

Signed-off-by: Robert Nishihara <[email protected]>
@robertnishihara robertnishihara requested a review from a team as a code owner November 19, 2025 22:42

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request improves the documentation for reading Hugging Face datasets by adding instructions on authentication. The added information about installing huggingface_hub and setting the HF_TOKEN is helpful. However, the explanatory text could be clearer to avoid confusion. I've suggested a rephrasing to improve clarity.

Comment on lines +670 to +672
Set your Hugging Face token to authenticate. While public datasets can be read without
a token, Hugging Face rate limits are more aggressive without a token. To read Hugging
Face datasets without a token, simply set the filesystem argument to ``HfFileSystem()``.

medium

The current phrasing might be confusing. It implies that setting filesystem=HfFileSystem() is something you do specifically when you don't have a token. However, HfFileSystem is used for reading from the Hugging Face Hub regardless of whether a token is present. The huggingface_hub library, used by HfFileSystem, automatically handles authentication if the HF_TOKEN is set.

I suggest rephrasing this to clarify the benefits of using a token and what happens without one, without conflating it with the use of HfFileSystem.

Suggested change
- Set your Hugging Face token to authenticate. While public datasets can be read without
- a token, Hugging Face rate limits are more aggressive without a token. To read Hugging
- Face datasets without a token, simply set the filesystem argument to ``HfFileSystem()``.
+ Set your Hugging Face token to authenticate. This allows you to access private datasets and provides higher rate limits. Public datasets can be read without a token, but you may be subject to stricter rate limits.
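
To illustrate the point of this review comment, here is a minimal sketch, not the code from the PR, showing that the same `HfFileSystem` is passed to `ray.data.read_parquet` whether or not a token is set. The helper names `resolve_hf_token` and `read_hf_parquet` are hypothetical, introduced only for this example. The sketch assumes `ray` and `huggingface_hub` are installed and that the token, if any, is provided through the `HF_TOKEN` environment variable.

```python
import os


def resolve_hf_token(env=None):
    """Return the token from HF_TOKEN, or None if it is unset or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def read_hf_parquet(path):
    """Read Parquet files from the Hugging Face Hub into a Ray Dataset.

    Hypothetical helper for illustration.
    """
    # Imported lazily so resolve_hf_token works without these packages installed.
    import ray
    from huggingface_hub import HfFileSystem

    # The filesystem is the same with or without a token. Passing token=None
    # leaves credential lookup to huggingface_hub's own defaults.
    fs = HfFileSystem(token=resolve_hf_token())
    return ray.data.read_parquet(path, filesystem=fs)
```

A call would look like `read_hf_parquet("hf://datasets/<org>/<dataset>/<path>")`, where the path is a placeholder rather than a real dataset. Because `huggingface_hub` reads `HF_TOKEN` on its own, passing the token explicitly is optional; it is done here only to make the token's role visible.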

@robertnishihara robertnishihara added the "go" label (add ONLY when ready to merge, run all tests) Nov 20, 2025
@ray-gardener ray-gardener bot added the "data" label (Ray Data-related issues) Nov 20, 2025
Signed-off-by: Robert Nishihara <[email protected]>
Signed-off-by: Robert Nishihara <[email protected]>
@richardliaw richardliaw changed the title from "[DRAFT] Improve Ray Data reading from Hugging Face datasets documentation" to "[data] Improve Ray Data reading from Hugging Face datasets documentation" Nov 20, 2025
@richardliaw richardliaw merged commit d711f05 into ray-project:master Nov 20, 2025
6 checks passed
@robertnishihara robertnishihara deleted the docs branch November 20, 2025 19:10
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026