
fix: private dataset splits/metadata not loading in Studio UI #4965

Open
Shivamjohri247 wants to merge 30 commits into unslothai:main from Shivamjohri247:fix/private-dataset-splits-metadata

Conversation

@Shivamjohri247

Summary

Fixes #4962

Problem: When using private HuggingFace datasets in the Studio UI, the subset/split dropdowns fail to load. The dataset preview and training itself work fine — only the metadata (splits info) fails. Switching the dataset to public resolves the issue.

Root cause: The frontend hook useHfDatasetSplits was calling the external HF datasets-server API (datasets-server.huggingface.co/splits) directly from the browser. This external API does not reliably serve metadata for private datasets, even with a Bearer token. Meanwhile, the dataset preview (check-format) already worked for private datasets because it routes through the backend which uses the datasets Python library with proper token support.

Changes

Backend

  • studio/backend/models/datasets.py — Added DatasetSplitsRequest, SplitEntry, and DatasetSplitsResponse Pydantic models for the new endpoint.
  • studio/backend/routes/datasets.py — Added POST /api/datasets/splits endpoint that uses get_dataset_config_names and get_dataset_split_names from the datasets library with the HF token for proper private dataset support.
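The endpoint's core loop can be sketched as follows. This is a minimal, illustrative sketch of the behavior described above, not the PR's actual code: in the real route the two callables are `get_dataset_config_names` and `get_dataset_split_names` from the `datasets` library; here they are injected as parameters so the sketch runs without that dependency.

```python
# Hedged sketch of the per-config loop behind POST /api/datasets/splits.
# `get_configs` / `get_splits` stand in for the `datasets` library functions
# get_dataset_config_names / get_dataset_split_names (both token-aware).

def fetch_splits(dataset_name, hf_token, get_configs, get_splits):
    """Collect {"config", "split"} entries for every reachable config."""
    configs = get_configs(dataset_name, token=hf_token)
    entries, last_error = [], None
    for config in configs:
        try:
            for split in get_splits(dataset_name, config, token=hf_token):
                entries.append({"config": config, "split": split})
        except Exception as err:
            # per-config failures are tolerated as long as one config works
            last_error = err
    if configs and not entries:
        # every config failed: raise instead of returning an empty success
        raise RuntimeError(f"all configs failed: {last_error}")
    return {"splits": entries}
```

With fake fetchers, a single-config dataset with `train`/`test` splits yields two entries; if every config raises, the caller gets an error instead of a misleading empty response.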

Frontend

  • studio/frontend/src/hooks/use-hf-dataset-splits.ts — Updated the hook to call the new backend endpoint (/api/datasets/splits) via authFetch instead of directly calling the external datasets-server API. The HF token is now sent in the POST body to the backend, which proxies it server-side to the HF Hub API.

Testing

  • No existing tests for this code path. Verified by tracing the data flow:
    • Frontend passes accessToken → hook calls authFetch("/api/datasets/splits", { body: { dataset_name, hf_token } }) → backend uses datasets library with token=hf_token → returns configs and splits
  • Error handling preserved: auth errors (401/403), not-found (404), and other failures are normalized and shown to the user
  • AbortController/signal support preserved through authFetch → fetch

…upport

The frontend was calling the HF datasets-server API directly from the
browser to fetch dataset split/subset metadata. This does not work for
private datasets because the external datasets-server API does not
reliably serve metadata for private repos even with Bearer tokens.

The dataset preview (check-format) already worked for private datasets
because it routes through the backend which uses the `datasets` library
with proper token support.

Fix: Add a new POST /api/datasets/splits backend endpoint that uses
`get_dataset_config_names` and `get_dataset_split_names` from the
`datasets` library with the HF token. Update the frontend hook to call
this backend endpoint via authFetch instead of the external
datasets-server.

Fixes unslothai#4962
@Shivamjohri247 Shivamjohri247 marked this pull request as ready for review April 10, 2026 19:47
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 686ac94478

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/routes/datasets.py Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new backend endpoint, /splits, to fetch HuggingFace dataset configurations and splits. This change moves the logic from the frontend to the backend to better handle CORS and secure HuggingFace tokens. The review feedback suggests improving the error handling in this new endpoint by catching specific HuggingFace exceptions and returning appropriate HTTP status codes, such as 404 or 403, rather than a generic 500 Internal Server Error.

Comment thread studio/backend/routes/datasets.py Outdated
- Catch HfHubHTTPError and propagate the original HTTP status code
  (404, 403, etc.) instead of always returning 500
- Raise HTTPException when all per-config split lookups fail, so the
  UI receives an actionable error instead of a misleading empty-success
- Re-raise HTTPException without wrapping to avoid double-wrapping
@Shivamjohri247
Author

Addressed bot review feedback in commit 44ac484:

Codex feedback: Now raises HTTPException with the last config error when get_dataset_config_names succeeds but every get_dataset_split_names call fails. The UI receives an actionable error instead of a misleading empty-success response.

Gemini feedback: Added a dedicated except HfHubHTTPError block that propagates the original HTTP status code (404, 403, etc.) from the HF Hub API. Also added except HTTPException: raise before the generic handler to avoid double-wrapping.
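The handler ordering this comment describes can be shown in isolation. The classes below are simplified stand-ins, not the real `fastapi.HTTPException` or `huggingface_hub.utils.HfHubHTTPError` types, so the sketch is self-contained and only illustrates the ordering of the except clauses.

```python
# Stand-in types: minimal shapes mimicking the attributes the handlers use.
class HTTPException(Exception):
    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code, self.detail = status_code, detail

class HfHubHTTPError(Exception):
    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.response = type("Resp", (), {"status_code": status_code})()

def run_guarded(fn):
    try:
        return fn()
    except HTTPException:
        raise                      # already shaped: avoid double-wrapping
    except HfHubHTTPError as err:  # propagate the upstream HF status code
        raise HTTPException(err.response.status_code, "HF Hub error")
    except Exception as err:       # everything else becomes a generic 500
        raise HTTPException(500, str(err))
```

An `HfHubHTTPError(404)` surfaces as a 404, a pre-shaped `HTTPException` passes through untouched, and anything else falls back to 500.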

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: ead10eae59


Comment thread studio/backend/routes/datasets.py Outdated
Comment on lines +354 to +358
except Exception as config_err:
    logger.warning(
        f"Could not fetch splits for config '{config}': {config_err}"
    )
    last_config_error = str(config_err)

P2: Preserve status code for per-config split failures

When get_dataset_split_names fails for every config with an HfHubHTTPError (for example, 401/403 on private or gated repos), this broad except Exception path strips the typed error down to a string and the function later raises a generic 500. That converts actionable auth/not-found responses into server errors, so clients lose reliable status semantics for these failure cases. Keep the original exception/status for at least one failed config (or re-raise it) when no config succeeds.


When get_dataset_split_names fails with HfHubHTTPError for every
config, extract and propagate the original HTTP status code (401, 403,
etc.) instead of converting it to a generic 500. This ensures the
frontend receives actionable status semantics for auth errors.
@Shivamjohri247
Author

Addressed latest Codex feedback in commit 3294805:

The per-config failure path now catches HfHubHTTPError separately, extracts response.status_code, and preserves it when raising the final HTTPException. This ensures that 401/403 auth errors from the HF Hub are propagated to the frontend instead of being converted to a generic 500.

Contributor

@danielhanchen danielhanchen left a comment


Looks good. Clean approach -- routing splits through the backend avoids the CORS and token-exposure problems with the external datasets-server API for private/gated datasets.

Reviewed:

  • Backend endpoint (POST /splits): Correct use of get_dataset_config_names / get_dataset_split_names with token passthrough. Per-config error handling is solid -- partial failures surface what succeeded while preserving the HF status code (401/403) if every config fails. Using a sync def is correct since the datasets library calls are blocking; FastAPI will run this in its threadpool.

  • Models: Clean, follows existing patterns.

  • Frontend hook: authFetch swap, body?.detail (matching FastAPI), removal of the pending/failed fields the backend no longer returns -- all correct.

No issues found. LGTM.

@danielhanchen danielhanchen added auto-reviewed and removed auto-reviewing PR is being auto-reviewed labels Apr 12, 2026
- Move HfHubHTTPError import outside try block so the outer except
  clause cannot raise NameError if the import fails
- Catch DatasetNotFoundError explicitly and return 404 instead of 500
  for nonexistent/inaccessible datasets
- Remap upstream HF 401 to 403 to prevent authFetch from misinterpreting
  HF auth failures as studio session expiries (which triggers logout)
- Reset last_config_status to 500 in the generic exception handler to
  avoid mismatched status/message when mixing exception types
- Validate dataset_name with min_length=1 to reject empty strings early
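The 401-to-403 remap in the list above is a one-liner worth spelling out: `authFetch` treats a 401 from the studio backend as an expired studio session and logs the user out, so an upstream HF auth failure must not reuse that status. A hedged sketch (the function name is illustrative, not from the PR):

```python
def remap_hf_status(upstream_status: int) -> int:
    """Map an HF Hub HTTP status onto what the studio API should return."""
    if upstream_status == 401:
        return 403  # HF auth failure, not a studio session expiry
    return upstream_status
```

All other statuses (403, 404, 500, ...) pass through unchanged.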
get_dataset_split_names can raise DatasetNotFoundError (e.g. stale
config names). Without a specific handler in the inner loop, it fell
through to the generic except Exception with status 500 instead of 404.
The datasets library raises DatasetNotFoundError for both truly missing
and gated/private datasets with the same message. The previous detail
text contained "not found" which the frontend normalizer matched first,
showing "Dataset not found" even for gated datasets. The new text
includes "private" so the normalizer's auth check (which runs first)
matches and shows the correct auth-related guidance instead.
@danielhanchen danielhanchen added the auto-approved Auto-review passed, ready to merge label Apr 12, 2026
@danielhanchen
Contributor

Auto-review verdict: Approved

Routes HuggingFace dataset split/config discovery through a new backend endpoint using the datasets Python library with token support, fixing broken subset/split dropdowns for private and gated datasets. Review fixes improved error handling: moved imports for robustness, remapped HF 401 to 403 to prevent authFetch logout, added DatasetNotFoundError handling with proper frontend normalizer keyword alignment, and validated empty dataset_name at the model layer.

Reason: All real issues fixed across 4 iterations; bug confirmed real with 37/37 simulations passing

@wasimysaid
Collaborator

wasimysaid commented Apr 13, 2026

@Shivamjohri247
I tested the new backend route and confirmed we’re hitting POST /api/datasets/splits with the token. The upstream response for the private dataset I tried was:
Not supported: dataset repository <repo> is private. Private datasets are only supported for PRO users and Enterprise Hub organizations.

I didn’t find a docs line that says the /splits API specifically has this restriction, but /splits is documented as part of the Dataset Viewer API:

And the Dataset Viewer docs say private dataset support there is for PRO users / Team / Enterprise orgs:

Do you happen to know if there’s some nuance here I’m missing, or have a private dataset I could test against? Thank you a lot!

@danielhanchen danielhanchen added auto-review-failed Auto-review found issues and removed auto-reviewing PR is being auto-reviewed labels Apr 15, 2026
@danielhanchen
Contributor

Auto-review verdict: Changes requested

Reason: no verdict parsed; defaulting based on commit history

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 5b1db1fec4


Comment thread test_pr4965_empty_configs_fast_fail.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f693874a49


Comment thread studio/backend/routes/datasets.py Outdated
@Shivamjohri247
Author

Addressed the P2 Codex flag about the repo ID regex being too strict. Opened #5096 which replaces the custom regex with huggingface_hub.utils.validate_repo_id to match HF's exact validation rules (allows leading underscores, disallows -- and .. sequences).

The previous regex rejected valid HF repo IDs with leading underscores
(e.g. org/_dataset). Delegating to HF's own validator ensures we match
their exact rules: alphanumeric, -, _, . with no -- or .. sequences.
@Shivamjohri247
Author

Addressed the P2 Codex flag about the repo ID regex being too strict. Replaced the custom regex with huggingface_hub.utils.validate_repo_id so validation matches HF's exact rules (allows leading underscores, disallows -- and .. sequences).
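For illustration, the rules attributed to HF's validator above can be approximated in a few lines. This is a rough, hypothetical approximation only: at most one namespace separator, characters limited to alphanumerics plus `-` `_` `.` (leading underscores allowed), and no `--` or `..` sequences. The actual fix delegates to `huggingface_hub.utils.validate_repo_id` rather than hand-rolling a pattern like this.

```python
import re

def looks_like_valid_repo_id(repo_id: str) -> bool:
    """Loose approximation of HF repo-id rules; illustrative only."""
    if repo_id.count("/") > 1:          # at most namespace/name
        return False
    if "--" in repo_id or ".." in repo_id:
        return False
    return bool(re.fullmatch(r"[\w.\-]+(?:/[\w.\-]+)?", repo_id))
```

Under these rules `org/_dataset` is accepted (the case the old regex rejected), while `org/ds--x` and `a/b/c` are not.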

pre-commit-ci Bot and others added 2 commits April 17, 2026 18:44
Move structlog and datasets fake module setup from module-level
sys.modules mutation into autouse fixtures in conftest.py.
Ensures each test gets a fresh stub with automatic teardown,
preventing order-dependent test failures in full-suite runs.
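The idea behind those fixtures can be sketched as a plain context manager so it runs standalone (in the PR the same mechanism lives in pytest fixtures named fake_datasets / fake_structlog; this version is an illustrative stand-in, not the conftest code): install a fake module in sys.modules for the duration of a test and restore the previous state on exit.

```python
import contextlib
import sys
import types

@contextlib.contextmanager
def stub_module(name):
    """Temporarily install a fake module under `name` in sys.modules."""
    fake = types.ModuleType(name)
    previous = sys.modules.get(name)
    sys.modules[name] = fake
    try:
        yield fake           # the test configures attributes on the stub
    finally:
        if previous is None:
            sys.modules.pop(name, None)   # fresh stub, full teardown
        else:
            sys.modules[name] = previous  # restore the real module
```

Automatic teardown is what prevents the order-dependent failures: no stub outlives the test that created it.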
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: f35706b820


Comment thread conftest.py Outdated
Remove autouse from conftest.py fixtures and declare fake_datasets
and fake_structlog explicitly in each test function that needs them.
This prevents the stubs from affecting unrelated tests in the repo.
@Shivamjohri247
Author

Addressed the P1 Codex flag about global autouse fixtures. Removed autouse=True from conftest.py and declared fake_datasets and fake_structlog explicitly in each test function that needs them. The stubs no longer affect unrelated tests in the repo.

Shivamjohri247 and others added 2 commits April 18, 2026 00:38
The previous commit accidentally duplicated fake_datasets, fake_structlog
in test function parameters due to sequential sed commands matching
already-modified lines. This corrects all 7 affected test files.
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 81bd79a220


Comment thread test_pr4965_empty_configs_fast_fail.py
Tests that import fastapi or studio backend modules now call
pytest.importorskip("fastapi") at module level so non-studio CI
pipelines skip them gracefully instead of failing at collection.
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2415e794f4


Comment thread studio/backend/routes/datasets.py Outdated
The previous import from huggingface_hub.utils._validators is a private
module that could break on dependency upgrades. Use the public path instead.

Labels

auto-review-failed Auto-review found issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Private Dataset Metadata Request Does not pass the hf token

4 participants