fix: private dataset splits/metadata not loading in Studio UI #4965
Shivamjohri247 wants to merge 30 commits into unslothai:main from
Conversation
…upport

The frontend was calling the HF datasets-server API directly from the browser to fetch dataset split/subset metadata. This does not work for private datasets because the external datasets-server API does not reliably serve metadata for private repos, even with Bearer tokens. The dataset preview (check-format) already worked for private datasets because it routes through the backend, which uses the `datasets` library with proper token support.

Fix: Add a new POST /api/datasets/splits backend endpoint that uses `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library with the HF token. Update the frontend hook to call this backend endpoint via authFetch instead of the external datasets-server.

Fixes unslothai#4962
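The flow the commit describes can be sketched as a plain function. This is illustrative only: the helper name `fetch_splits` is hypothetical, and the two lookup functions are injected stand-ins for `get_dataset_config_names` / `get_dataset_split_names` from the `datasets` library, so the token-passthrough shape is visible without network access.

```python
def fetch_splits(dataset_name, hf_token, get_config_names, get_split_names):
    """Return {config: [split, ...]} for a dataset, passing the HF token
    through so private/gated repos resolve (hypothetical helper)."""
    configs = get_config_names(dataset_name, token=hf_token)
    result = {}
    for config in configs:
        # Each per-config lookup also carries the token, mirroring how the
        # backend endpoint calls the `datasets` library.
        result[config] = get_split_names(dataset_name, config, token=hf_token)
    return result
```

The real endpoint does the same thing server-side, which is what keeps the token out of browser requests.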
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 686ac94478
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Code Review
This pull request introduces a new backend endpoint, /splits, to fetch HuggingFace dataset configurations and splits. This change moves the logic from the frontend to the backend to better handle CORS and secure HuggingFace tokens. The review feedback suggests improving the error handling in this new endpoint by catching specific HuggingFace exceptions and returning appropriate HTTP status codes, such as 404 or 403, rather than a generic 500 Internal Server Error.
- Catch HfHubHTTPError and propagate the original HTTP status code (404, 403, etc.) instead of always returning 500
- Raise HTTPException when all per-config split lookups fail, so the UI receives an actionable error instead of a misleading empty-success
- Re-raise HTTPException without wrapping to avoid double-wrapping
Addressed bot review feedback in commit. Codex feedback: Now raises … Gemini feedback: Added a dedicated …
💡 Codex Review
Reviewed commit: ead10eae59
```python
except Exception as config_err:
    logger.warning(
        f"Could not fetch splits for config '{config}': {config_err}"
    )
    last_config_error = str(config_err)
```
Preserve status code for per-config split failures
When get_dataset_split_names fails for every config with an HfHubHTTPError (for example, 401/403 on private or gated repos), this broad except Exception path strips the typed error down to a string and the function later raises a generic 500. That converts actionable auth/not-found responses into server errors, so clients lose reliable status semantics for these failure cases. Keep the original exception/status for at least one failed config (or re-raise it) when no config succeeds.
When get_dataset_split_names fails with HfHubHTTPError for every config, extract and propagate the original HTTP status code (401, 403, etc.) instead of converting it to a generic 500. This ensures the frontend receives actionable status semantics for auth errors.
Addressed latest Codex feedback in commit. The per-config failure path now catches …
danielhanchen left a comment
Looks good. Clean approach -- routing splits through the backend avoids the CORS and token-exposure problems with the external datasets-server API for private/gated datasets.
Reviewed:

- Backend endpoint (`POST /splits`): Correct use of `get_dataset_config_names`/`get_dataset_split_names` with token passthrough. Per-config error handling is solid -- partial failures surface what succeeded while preserving the HF status code (401/403) if every config fails. Using a sync `def` is correct since the `datasets` library calls are blocking; FastAPI will run this in its threadpool.
- Models: Clean, follows existing patterns.
- Frontend hook: `authFetch` swap, `body?.detail` (matching FastAPI), removal of the `pending`/`failed` fields the backend no longer returns -- all correct.

No issues found. LGTM.
- Move HfHubHTTPError import outside try block so the outer except clause cannot raise NameError if the import fails
- Catch DatasetNotFoundError explicitly and return 404 instead of 500 for nonexistent/inaccessible datasets
- Remap upstream HF 401 to 403 to prevent authFetch from misinterpreting HF auth failures as studio session expiries (which triggers logout)
- Reset last_config_status to 500 in the generic exception handler to avoid mismatched status/message when mixing exception types
- Validate dataset_name with min_length=1 to reject empty strings early
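The 401-to-403 remap above can be reduced to a one-line mapping. The helper name `remap_upstream_status` is hypothetical; the real code does this inline in the route, for the reason stated in the commit message: `authFetch` treats a 401 from the studio backend as an expired session and logs the user out, so an upstream HF 401 must not pass through verbatim.

```python
def remap_upstream_status(hf_status: int) -> int:
    """Map an upstream HF error status to the status the studio backend
    should return (illustrative sketch of the remap described above)."""
    if hf_status == 401:
        # HF auth failure, not a studio session expiry -- return 403 so the
        # frontend shows auth guidance instead of logging the user out.
        return 403
    return hf_status
```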
get_dataset_split_names can raise DatasetNotFoundError (e.g. stale config names). Without a specific handler in the inner loop, it fell through to the generic except Exception with status 500 instead of 404.
The datasets library raises DatasetNotFoundError for both truly missing and gated/private datasets with the same message. The previous detail text contained "not found" which the frontend normalizer matched first, showing "Dataset not found" even for gated datasets. The new text includes "private" so the normalizer's auth check (which runs first) matches and shows the correct auth-related guidance instead.
Auto-review verdict: Approved

Routes HuggingFace dataset split/config discovery through a new backend endpoint using the `datasets` Python library with token support, fixing broken subset/split dropdowns for private and gated datasets. Review fixes improved error handling: moved imports for robustness, remapped HF 401 to 403 to prevent authFetch logout, added DatasetNotFoundError handling with proper frontend normalizer keyword alignment, and validated empty dataset_name at the model layer.

Reason: All real issues fixed across 4 iterations; bug confirmed real with 37/37 simulations passing
@Shivamjohri247 I didn’t find a docs line that says the /splits API specifically has this restriction, but /splits is documented as part of the Dataset Viewer API.
Do you happen to know if there’s some nuance here I’m missing, or have a private dataset I could test against? Thank you a lot!
Auto-review verdict: Changes requested

Reason: no verdict parsed; defaulting based on commit history
💡 Codex Review
Reviewed commit: 5b1db1fec4
💡 Codex Review
Reviewed commit: f693874a49
Addressed the P2 Codex flag about the repo ID regex being too strict. Opened #5096, which replaces the custom regex with HF's own repo ID validator.
The previous regex rejected valid HF repo IDs with leading underscores (e.g. org/_dataset). Delegating to HF's own validator ensures we match their exact rules: alphanumeric, -, _, . with no -- or .. sequences.
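For illustration only, here is a stdlib re-implementation of the rules quoted above (alphanumeric plus `-`, `_`, `.`, no `--` or `..` sequences). The PR itself delegates to huggingface_hub's own validator instead of maintaining a regex like this; the helper name is hypothetical.

```python
import re

# One or two segments of [A-Za-z0-9_.-] separated by a single "/".
_REPO_ID_RE = re.compile(r"^[A-Za-z0-9_.\-]+(/[A-Za-z0-9_.\-]+)?$")

def looks_like_repo_id(repo_id: str) -> bool:
    """Approximate HF repo ID check (sketch; the PR uses HF's validator)."""
    if not _REPO_ID_RE.match(repo_id):
        return False
    # HF additionally forbids consecutive dashes and dots.
    return "--" not in repo_id and ".." not in repo_id
```

Note that `org/_dataset`, the case the old regex rejected, passes under these rules.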
Addressed the P2 Codex flag about the repo ID regex being too strict. Replaced the custom regex with HF's own repo ID validator.
Move structlog and datasets fake module setup from module-level sys.modules mutation into autouse fixtures in conftest.py. Ensures each test gets a fresh stub with automatic teardown, preventing order-dependent test failures in full-suite runs.
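The stubbing mechanics behind that fixture can be sketched without pytest: install a fresh fake module in `sys.modules` and restore the previous state on teardown. In the PR this lives in a `conftest.py` fixture (where `monkeypatch.setitem` handles the restore); the `fake_module` context manager below is a hypothetical stand-in showing the same fresh-stub-with-teardown behavior.

```python
import sys
import types
from contextlib import contextmanager

@contextmanager
def fake_module(name, **attrs):
    """Temporarily install a fresh stub module under `name` (sketch)."""
    fake = types.ModuleType(name)
    for key, value in attrs.items():
        setattr(fake, key, value)
    saved = sys.modules.get(name)
    sys.modules[name] = fake
    try:
        yield fake
    finally:
        # Teardown: restore whatever was there before, so stubs cannot
        # leak into later tests (the order-dependence bug described above).
        if saved is None:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = saved
```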
💡 Codex Review
Reviewed commit: f35706b820
Remove autouse from conftest.py fixtures and declare fake_datasets and fake_structlog explicitly in each test function that needs them. This prevents the stubs from affecting unrelated tests in the repo.
Addressed the P1 Codex flag about global autouse fixtures. Removed `autouse` and declared the stub fixtures explicitly in each test function that needs them.
The previous commit accidentally duplicated fake_datasets, fake_structlog in test function parameters due to sequential sed commands matching already-modified lines. This corrects all 7 affected test files.
💡 Codex Review
Reviewed commit: 81bd79a220
Tests that import fastapi or studio backend modules now call pytest.importorskip("fastapi") at module level so non-studio CI pipelines skip them gracefully instead of failing at collection.
💡 Codex Review
Reviewed commit: 2415e794f4
The previous import from huggingface_hub.utils._validators is a private module that could break on dependency upgrades. Use the public path instead.
Summary
Fixes #4962
Problem: When using private HuggingFace datasets in the Studio UI, the subset/split dropdowns fail to load. The dataset preview and training itself work fine — only the metadata (splits info) fails. Switching the dataset to public resolves the issue.
Root cause: The frontend hook `useHfDatasetSplits` was calling the external HF datasets-server API (datasets-server.huggingface.co/splits) directly from the browser. This external API does not reliably serve metadata for private datasets, even with a Bearer token. Meanwhile, the dataset preview (check-format) already worked for private datasets because it routes through the backend, which uses the `datasets` Python library with proper token support.

Changes
Backend
- `studio/backend/models/datasets.py` -- Added `DatasetSplitsRequest`, `SplitEntry`, and `DatasetSplitsResponse` Pydantic models for the new endpoint.
- `studio/backend/routes/datasets.py` -- Added `POST /api/datasets/splits` endpoint that uses `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library with the HF token for proper private dataset support.

Frontend
- `studio/frontend/src/hooks/use-hf-dataset-splits.ts` -- Updated the hook to call the new backend endpoint (`/api/datasets/splits`) via `authFetch` instead of directly calling the external datasets-server API. The HF token is now sent in the POST body to the backend, which proxies it server-side to the HF Hub API.

Testing
- With `accessToken` → hook calls `authFetch("/api/datasets/splits", { body: { dataset_name, hf_token } })` → backend uses the `datasets` library with `token=hf_token` → returns configs and splits
- `authFetch` → `fetch`