
fix: private dataset splits/metadata not loading in Studio UI #4965

Open
Shivamjohri247 wants to merge 30 commits into unslothai:main from Shivamjohri247:fix/private-dataset-splits-metadata

Conversation

@Shivamjohri247

Summary

Fixes #4962

Problem: When using private HuggingFace datasets in the Studio UI, the subset/split dropdowns fail to load. The dataset preview and training itself work fine — only the metadata (splits info) fails. Switching the dataset to public resolves the issue.

Root cause: The frontend hook useHfDatasetSplits was calling the external HF datasets-server API (datasets-server.huggingface.co/splits) directly from the browser. This external API does not reliably serve metadata for private datasets, even with a Bearer token. Meanwhile, the dataset preview (check-format) already worked for private datasets because it routes through the backend which uses the datasets Python library with proper token support.

Changes

Backend

  • studio/backend/models/datasets.py — Added DatasetSplitsRequest, SplitEntry, and DatasetSplitsResponse Pydantic models for the new endpoint.
  • studio/backend/routes/datasets.py — Added POST /api/datasets/splits endpoint that uses get_dataset_config_names and get_dataset_split_names from the datasets library with the HF token for proper private dataset support.
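The endpoint's core loop can be sketched as follows. This is a minimal, illustrative sketch of the behavior described above, not the PR's actual code: in the real route the two callables are `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library; here they are injected as parameters so the sketch runs without that dependency.

```python
# Hedged sketch of the per-config loop behind POST /api/datasets/splits.
# `get_configs` / `get_splits` stand in for the `datasets` library functions
# get_dataset_config_names / get_dataset_split_names (both token-aware).

def fetch_splits(dataset_name, hf_token, get_configs, get_splits):
    """Collect {"config", "split"} entries for every reachable config."""
    configs = get_configs(dataset_name, token=hf_token)
    entries, last_error = [], None
    for config in configs:
        try:
            for split in get_splits(dataset_name, config, token=hf_token):
                entries.append({"config": config, "split": split})
        except Exception as err:
            # per-config failures are tolerated as long as one config works
            last_error = err
    if configs and not entries:
        # every config failed: raise instead of returning an empty success
        raise RuntimeError(f"all configs failed: {last_error}")
    return {"splits": entries}
```

With fake fetchers, a single-config dataset with `train`/`test` splits yields two entries; if every config raises, the caller gets an error instead of a misleading empty response.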

Frontend

  • studio/frontend/src/hooks/use-hf-dataset-splits.ts — Updated the hook to call the new backend endpoint (/api/datasets/splits) via authFetch instead of directly calling the external datasets-server API. The HF token is now sent in the POST body to the backend, which proxies it server-side to the HF Hub API.

Testing

  • No existing tests for this code path. Verified by tracing the data flow:
    • Frontend passes accessToken → hook calls authFetch("/api/datasets/splits", { body: { dataset_name, hf_token } }) → backend uses datasets library with token=hf_token → returns configs and splits
  • Error handling preserved: auth errors (401/403), not-found (404), and other failures are normalized and shown to the user
  • AbortController/signal support preserved through authFetch → fetch

…upport

The frontend was calling the HF datasets-server API directly from the
browser to fetch dataset split/subset metadata. This does not work for
private datasets because the external datasets-server API does not
reliably serve metadata for private repos even with Bearer tokens.

The dataset preview (check-format) already worked for private datasets
because it routes through the backend which uses the `datasets` library
with proper token support.

Fix: Add a new POST /api/datasets/splits backend endpoint that uses
`get_dataset_config_names` and `get_dataset_split_names` from the
`datasets` library with the HF token. Update the frontend hook to call
this backend endpoint via authFetch instead of the external
datasets-server.

Fixes unslothai#4962
@Shivamjohri247 Shivamjohri247 marked this pull request as ready for review April 10, 2026 19:47
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 686ac94478

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/routes/datasets.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new backend endpoint, /splits, to fetch HuggingFace dataset configurations and splits. This change moves the logic from the frontend to the backend to better handle CORS and secure HuggingFace tokens. The review feedback suggests improving the error handling in this new endpoint by catching specific HuggingFace exceptions and returning appropriate HTTP status codes, such as 404 or 403, rather than a generic 500 Internal Server Error.

Comment thread studio/backend/routes/datasets.py Outdated
- Catch HfHubHTTPError and propagate the original HTTP status code
  (404, 403, etc.) instead of always returning 500
- Raise HTTPException when all per-config split lookups fail, so the
  UI receives an actionable error instead of a misleading empty-success
- Re-raise HTTPException without wrapping to avoid double-wrapping
@Shivamjohri247
Author

Addressed bot review feedback in commit 44ac484:

Codex feedback: Now raises HTTPException with the last config error when get_dataset_config_names succeeds but every get_dataset_split_names call fails. The UI receives an actionable error instead of a misleading empty-success response.

Gemini feedback: Added a dedicated except HfHubHTTPError block that propagates the original HTTP status code (404, 403, etc.) from the HF Hub API. Also added except HTTPException: raise before the generic handler to avoid double-wrapping.
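The handler ordering this comment describes can be shown in isolation. The classes below are simplified stand-ins, not the real `fastapi.HTTPException` or `huggingface_hub.utils.HfHubHTTPError` types, so the sketch is self-contained and only illustrates the ordering of the except clauses.

```python
# Stand-in types: minimal shapes mimicking the attributes the handlers use.
class HTTPException(Exception):
    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code, self.detail = status_code, detail

class HfHubHTTPError(Exception):
    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.response = type("Resp", (), {"status_code": status_code})()

def run_guarded(fn):
    try:
        return fn()
    except HTTPException:
        raise                      # already shaped: avoid double-wrapping
    except HfHubHTTPError as err:  # propagate the upstream HF status code
        raise HTTPException(err.response.status_code, "HF Hub error")
    except Exception as err:       # everything else becomes a generic 500
        raise HTTPException(500, str(err))
```

An `HfHubHTTPError(404)` surfaces as a 404, a pre-shaped `HTTPException` passes through untouched, and anything else falls back to 500.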

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: ead10eae59


Comment thread studio/backend/routes/datasets.py Outdated
Comment on lines +354 to +358
except Exception as config_err:
    logger.warning(
        f"Could not fetch splits for config '{config}': {config_err}"
    )
    last_config_error = str(config_err)

P2: Preserve status code for per-config split failures

When get_dataset_split_names fails for every config with an HfHubHTTPError (for example, 401/403 on private or gated repos), this broad except Exception path strips the typed error down to a string and the function later raises a generic 500. That converts actionable auth/not-found responses into server errors, so clients lose reliable status semantics for these failure cases. Keep the original exception/status for at least one failed config (or re-raise it) when no config succeeds.


When get_dataset_split_names fails with HfHubHTTPError for every
config, extract and propagate the original HTTP status code (401, 403,
etc.) instead of converting it to a generic 500. This ensures the
frontend receives actionable status semantics for auth errors.
@Shivamjohri247
Author

Addressed latest Codex feedback in commit 3294805:

The per-config failure path now catches HfHubHTTPError separately, extracts response.status_code, and preserves it when raising the final HTTPException. This ensures that 401/403 auth errors from the HF Hub are propagated to the frontend instead of being converted to a generic 500.

Contributor

@danielhanchen danielhanchen left a comment


Looks good. Clean approach -- routing splits through the backend avoids the CORS and token-exposure problems with the external datasets-server API for private/gated datasets.

Reviewed:

  • Backend endpoint (POST /splits): Correct use of get_dataset_config_names / get_dataset_split_names with token passthrough. Per-config error handling is solid -- partial failures surface what succeeded while preserving the HF status code (401/403) if every config fails. Using a sync def is correct since the datasets library calls are blocking; FastAPI will run this in its threadpool.

  • Models: Clean, follows existing patterns.

  • Frontend hook: authFetch swap, body?.detail (matching FastAPI), removal of the pending/failed fields the backend no longer returns -- all correct.

No issues found. LGTM.

@danielhanchen danielhanchen added auto-reviewed and removed auto-reviewing PR is being auto-reviewed labels Apr 12, 2026
- Move HfHubHTTPError import outside try block so the outer except
  clause cannot raise NameError if the import fails
- Catch DatasetNotFoundError explicitly and return 404 instead of 500
  for nonexistent/inaccessible datasets
- Remap upstream HF 401 to 403 to prevent authFetch from misinterpreting
  HF auth failures as studio session expiries (which triggers logout)
- Reset last_config_status to 500 in the generic exception handler to
  avoid mismatched status/message when mixing exception types
- Validate dataset_name with min_length=1 to reject empty strings early
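The 401-to-403 remap in the list above is a one-liner worth spelling out: `authFetch` treats a 401 from the studio backend as an expired studio session and logs the user out, so an upstream HF auth failure must not reuse that status. A hedged sketch (the function name is illustrative, not from the PR):

```python
def remap_hf_status(upstream_status: int) -> int:
    """Map an HF Hub HTTP status onto what the studio API should return."""
    if upstream_status == 401:
        return 403  # HF auth failure, not a studio session expiry
    return upstream_status
```

All other statuses (403, 404, 500, ...) pass through unchanged.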
get_dataset_split_names can raise DatasetNotFoundError (e.g. stale
config names). Without a specific handler in the inner loop, it fell
through to the generic except Exception with status 500 instead of 404.
The datasets library raises DatasetNotFoundError for both truly missing
and gated/private datasets with the same message. The previous detail
text contained "not found" which the frontend normalizer matched first,
showing "Dataset not found" even for gated datasets. The new text
includes "private" so the normalizer's auth check (which runs first)
matches and shows the correct auth-related guidance instead.
@danielhanchen danielhanchen added the auto-approved Auto-review passed, ready to merge label Apr 12, 2026
@danielhanchen
Contributor

Auto-review verdict: Approved

Routes HuggingFace dataset split/config discovery through a new backend endpoint using the datasets Python library with token support, fixing broken subset/split dropdowns for private and gated datasets. Review fixes improved error handling: moved imports for robustness, remapped HF 401 to 403 to prevent authFetch logout, added DatasetNotFoundError handling with proper frontend normalizer keyword alignment, and validated empty dataset_name at the model layer.

Reason: All real issues fixed across 4 iterations; bug confirmed real with 37/37 simulations passing

@wasimysaid
Collaborator

wasimysaid commented Apr 13, 2026

@Shivamjohri247
I tested the new backend route and confirmed we’re hitting POST /api/datasets/splits with the token. The upstream response for the private dataset I tried was:
Not supported: dataset repository <repo> is private. Private datasets are only supported for PRO users and Enterprise Hub organizations.

I didn’t find a docs line that says the /splits API specifically has this restriction, but /splits is documented as part of the Dataset Viewer API:

And the Dataset Viewer docs say private dataset support there is for PRO users / Team / Enterprise orgs:

Do you happen to know if there’s some nuance here I’m missing, or have a private dataset I could test against? Thank you a lot!

@danielhanchen danielhanchen added auto-review-failed Auto-review found issues and removed auto-reviewing PR is being auto-reviewed labels Apr 15, 2026
@danielhanchen
Contributor

Auto-review verdict: Changes requested

Reason: no verdict parsed; defaulting based on commit history

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 5b1db1fec4


Comment thread test_pr4965_empty_configs_fast_fail.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f693874a49


Comment thread studio/backend/routes/datasets.py Outdated
@Shivamjohri247
Author

Addressed the P2 Codex flag about the repo ID regex being too strict. Opened #5096 which replaces the custom regex with huggingface_hub.utils.validate_repo_id to match HF's exact validation rules (allows leading underscores, disallows -- and .. sequences).

The previous regex rejected valid HF repo IDs with leading underscores
(e.g. org/_dataset). Delegating to HF's own validator ensures we match
their exact rules: alphanumeric, -, _, . with no -- or .. sequences.
@Shivamjohri247
Author

Addressed the P2 Codex flag about the repo ID regex being too strict. Replaced the custom regex with huggingface_hub.utils.validate_repo_id so validation matches HF's exact rules (allows leading underscores, disallows -- and .. sequences).
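For illustration, the rules attributed to HF's validator above can be approximated in a few lines. This is a rough, hypothetical approximation only: at most one namespace separator, characters limited to alphanumerics plus `-` `_` `.` (leading underscores allowed), and no `--` or `..` sequences. The actual fix delegates to `huggingface_hub.utils.validate_repo_id` rather than hand-rolling a pattern like this.

```python
import re

def looks_like_valid_repo_id(repo_id: str) -> bool:
    """Loose approximation of HF repo-id rules; illustrative only."""
    if repo_id.count("/") > 1:          # at most namespace/name
        return False
    if "--" in repo_id or ".." in repo_id:
        return False
    return bool(re.fullmatch(r"[\w.\-]+(?:/[\w.\-]+)?", repo_id))
```

Under these rules `org/_dataset` is accepted (the case the old regex rejected), while `org/ds--x` and `a/b/c` are not.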

pre-commit-ci Bot and others added 2 commits April 17, 2026 18:44
Move structlog and datasets fake module setup from module-level
sys.modules mutation into autouse fixtures in conftest.py.
Ensures each test gets a fresh stub with automatic teardown,
preventing order-dependent test failures in full-suite runs.
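The idea behind those fixtures can be sketched as a plain context manager so it runs standalone (in the PR the same mechanism lives in pytest fixtures named fake_datasets / fake_structlog; this version is an illustrative stand-in, not the conftest code): install a fake module in sys.modules for the duration of a test and restore the previous state on exit.

```python
import contextlib
import sys
import types

@contextlib.contextmanager
def stub_module(name):
    """Temporarily install a fake module under `name` in sys.modules."""
    fake = types.ModuleType(name)
    previous = sys.modules.get(name)
    sys.modules[name] = fake
    try:
        yield fake           # the test configures attributes on the stub
    finally:
        if previous is None:
            sys.modules.pop(name, None)   # fresh stub, full teardown
        else:
            sys.modules[name] = previous  # restore the real module
```

Automatic teardown is what prevents the order-dependent failures: no stub outlives the test that created it.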
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f35706b820


Comment thread conftest.py Outdated
Remove autouse from conftest.py fixtures and declare fake_datasets
and fake_structlog explicitly in each test function that needs them.
This prevents the stubs from affecting unrelated tests in the repo.
@Shivamjohri247
Author

Addressed the P1 Codex flag about global autouse fixtures. Removed autouse=True from conftest.py and declared fake_datasets and fake_structlog explicitly in each test function that needs them. The stubs no longer affect unrelated tests in the repo.

Shivamjohri247 and others added 2 commits April 18, 2026 00:38
The previous commit accidentally duplicated fake_datasets, fake_structlog
in test function parameters due to sequential sed commands matching
already-modified lines. This corrects all 7 affected test files.
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 81bd79a220


Comment thread test_pr4965_empty_configs_fast_fail.py
Tests that import fastapi or studio backend modules now call
pytest.importorskip("fastapi") at module level so non-studio CI
pipelines skip them gracefully instead of failing at collection.
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2415e794f4


Comment thread studio/backend/routes/datasets.py Outdated
The previous import from huggingface_hub.utils._validators is a private
module that could break on dependency upgrades. Use the public path instead.

Labels

auto-review-failed Auto-review found issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Private Dataset Metadata Request Does not pass the hf token

4 participants