Add endpoint to fetch dataset metadata with token #4964

Open
HadiSDev wants to merge 14 commits into unslothai:main from HadiSDev:main

Conversation

@HadiSDev

@HadiSDev HadiSDev commented Apr 10, 2026

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a backend proxy for fetching HuggingFace dataset splits, enabling the use of server-side authentication tokens to access private or gated datasets. The changes include new Pydantic models for split metadata, a /splits API endpoint in the backend, and an update to the frontend useHfDatasetSplits hook to route requests through the backend. Feedback focuses on optimizing the backend logic to avoid performance issues caused by sequential network requests for datasets with many configurations and improving error logging within the split-fetching loop to avoid masking potential issues.

Comment thread studio/backend/routes/datasets.py Outdated
Comment thread studio/backend/routes/datasets.py Outdated
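
The performance feedback above concerns fetching split metadata for each configuration sequentially. A common remedy, sketched here with a stand-in fetch function rather than the PR's actual code, is to issue the per-config requests concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_config_splits(config: str) -> list[str]:
    # Stand-in for one network round-trip to the HF datasets-server;
    # in the real endpoint this would be an authenticated HTTP request.
    return [f"{config}/train", f"{config}/test"]

def fetch_all_splits(configs: list[str]) -> list[str]:
    # Issue the per-config requests concurrently instead of sequentially,
    # so a dataset with many configurations is not bottlenecked on the
    # sum of individual request latencies. pool.map preserves input order.
    with ThreadPoolExecutor(max_workers = 8) as pool:
        results = pool.map(fetch_config_splits, configs)
    return [split for splits in results for split in splits]
```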

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 926ba30b26

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/routes/datasets.py Outdated
danielhanchen and others added 4 commits April 12, 2026 18:34
- Remap upstream HF 401/403 to 422 so authFetch does not mistake
  HF auth failures for expired Studio sessions (avoids forced logout)
- Guard against malformed HF split entries (null splits key, missing
  fields) to prevent KeyError on non-standard responses
- Return generic error message in catch-all handler instead of raw
  str(e) which can leak internal network details
- Validate SplitsRequest.dataset is non-empty at the model level
- Strip whitespace from HF_TOKEN env var to match how user-supplied
  tokens are handled (trailing spaces in .env files break HF auth)
- Avoid leaking upstream URL in error detail when HF returns non-JSON
  responses (e.g. HTML 429 rate-limit pages); use a generic message
  instead of falling back to the raw requests library error string
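
The hardening steps in the commit list above can be sketched as small pure helpers (illustrative names and shapes, not the PR's actual code):

```python
def remap_upstream_status(code: int) -> int:
    # Remap HF auth failures (401/403) to 422 so the frontend's authFetch
    # does not mistake them for an expired Studio session and force a logout.
    return 422 if code in (401, 403) else code

def parse_split_entries(data):
    # Guard against malformed HF responses: a non-dict payload, a null
    # "splits" key, or entries missing required fields are skipped rather
    # than raising KeyError/AttributeError.
    if not isinstance(data, dict):
        return []
    splits = data.get("splits") or []
    return [
        {"dataset": s["dataset"], "config": s["config"], "split": s["split"]}
        for s in splits
        if isinstance(s, dict) and all(k in s for k in ("dataset", "config", "split"))
    ]
```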
@danielhanchen danielhanchen added the auto-approved (Auto-review passed, ready to merge) label and removed the auto-reviewing (PR is being auto-reviewed) label on Apr 12, 2026
@danielhanchen
Contributor

Auto-review verdict: Approved

Adds a backend proxy endpoint for HuggingFace dataset splits metadata, enabling server-side HF_TOKEN fallback so private/gated datasets load correctly when the user has not entered a personal HF token in the Studio UI. The implementation is correct, minimal, and introduces no regressions.

Reason: All real issues (auth 401 remap, env token stripping, error detail sanitization, malformed response handling) were fixed and verified across 25 simulation tests

@HadiSDev
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a backend proxy for fetching HuggingFace dataset splits, enabling the use of server-side authentication tokens for private or gated datasets. The frontend has been updated to route requests through the new /api/datasets/splits endpoint and handles error responses accordingly. Feedback was provided regarding the backend's error handling, specifically suggesting that silent exception blocks should be replaced with debug logging to facilitate troubleshooting when upstream responses fail to parse.

Comment thread studio/backend/routes/datasets.py Outdated
- Check isinstance(error_data, dict) before calling .get("error") to
  handle non-dict JSON responses from HuggingFace API
- Replace silent except/pass with logger.debug for error response
  parsing failures

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@HadiSDev
Author

/gemini review

@HadiSDev
Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a backend proxy for the HuggingFace dataset splits API, allowing the application to use server-side authentication for private datasets. It adds the necessary Pydantic models, a new /splits endpoint in the backend routes, and updates the frontend useHfDatasetSplits hook to communicate with the backend instead of the external API directly. The review feedback suggests improving the robustness of the response parsing by verifying the data type before iteration and broadening the exception handling to include all request-related errors.

Comment thread studio/backend/routes/datasets.py
Comment thread studio/backend/routes/datasets.py
HadiSDev and others added 2 commits April 13, 2026 15:51
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Author

@HadiSDev HadiSDev left a comment


/gemini review

Catch requests.RequestException instead of bare Exception for the
splits endpoint fallback handler, so only network-level errors are
caught while unexpected programming errors propagate normally.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
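
The narrowing described above works because requests.RequestException is the common base class of the library's network-level errors (ConnectionError, Timeout, HTTPError, and so on), so catching it still lets programming errors like KeyError propagate. A sketch, assuming the requests library and an illustrative function name:

```python
import requests

def fetch_splits_or_none(url: str):
    # Catch only network-level failures; anything else (a typo, a bad key
    # access) is a bug and should surface as a 500, not be swallowed here.
    try:
        resp = requests.get(url, timeout = 10)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None
```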
@HadiSDev
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a backend proxy for fetching HuggingFace dataset splits, which allows the application to use server-side environment variables for authentication when accessing private or gated datasets. The changes include new Pydantic models for request and response validation, a new /splits endpoint in the backend, and an update to the frontend hook to route requests through this new endpoint. Review feedback focused on improving the robustness of the backend by adding explicit type checks for the external API's JSON response and its fields to prevent potential runtime errors or validation failures.

Comment on lines +352 to +354

```python
data = resp.json()

splits_data = data.get("splits")
```

Contributor

medium

The code assumes the response from Hugging Face is a dictionary. If the API returns a different JSON type (such as a list), data.get("splits") will raise an AttributeError. It is safer to validate the response type before accessing its keys.

Suggested change:

```python
data = resp.json()
if not isinstance(data, dict):
    logger.error(f"Unexpected response format from HuggingFace for {request.dataset!r}: {data}")
    raise HTTPException(status_code = 502, detail = "Invalid response format from HuggingFace")
splits_data = data.get("splits")
```

Comment on lines +358 to +363
entries = [
SplitEntry(dataset = s["dataset"], config = s["config"], split = s["split"])
for s in splits_data
if isinstance(s, dict)
and all(k in s for k in ("dataset", "config", "split"))
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential ValidationError from Pydantic when the upstream API returns unexpected data types (e.g., null for a required field), it is recommended to verify that the required fields are present and have the expected types before instantiating SplitEntry.

Suggested change
entries = [
SplitEntry(dataset = s["dataset"], config = s["config"], split = s["split"])
for s in splits_data
if isinstance(s, dict)
and all(k in s for k in ("dataset", "config", "split"))
]
entries = [
SplitEntry(dataset = s["dataset"], config = s["config"], split = s["split"])
for s in splits_data
if isinstance(s, dict)
and all(isinstance(s.get(k), str) for k in ("dataset", "config", "split"))
]

@HadiSDev
Author

@danielhanchen I am ready if you are


Labels

auto-approved (Auto-review passed, ready to merge)
