Add endpoint to fetch dataset metadata with token #4964

HadiSDev wants to merge 14 commits into unslothai:main
Conversation
Code Review
This pull request introduces a backend proxy for fetching HuggingFace dataset splits, enabling the use of server-side authentication tokens to access private or gated datasets. The changes include new Pydantic models for split metadata, a /splits API endpoint in the backend, and an update to the frontend useHfDatasetSplits hook to route requests through the backend. Feedback focuses on optimizing the backend logic to avoid performance issues caused by sequential network requests for datasets with many configurations and improving error logging within the split-fetching loop to avoid masking potential issues.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 926ba30b26
- Remap upstream HF 401/403 to 422 so authFetch does not mistake HF auth failures for expired Studio sessions (avoids forced logout)
- Guard against malformed HF split entries (null `splits` key, missing fields) to prevent KeyError on non-standard responses
- Return a generic error message in the catch-all handler instead of raw str(e), which can leak internal network details
- Validate SplitsRequest.dataset is non-empty at the model level
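The status-code remapping in the first item above can be sketched as a small pure helper. This is an illustrative sketch, not the PR's actual code: the function name is hypothetical, and the 404 and 502 mappings are assumptions about how non-auth upstream failures might be surfaced.

```python
def remap_upstream_status(hf_status):
    """Map an upstream HuggingFace status code to the code our API returns.

    401/403 from HF mean an auth problem with the dataset itself, not an
    expired Studio session, so they must not be passed through as-is
    (the frontend's authFetch treats 401 as a session expiry and logs out).
    """
    if hf_status in (401, 403):
        return 422  # dataset auth failure, surfaced as an unprocessable request
    if hf_status == 404:
        return 404  # dataset not found (assumed pass-through)
    return 502      # any other upstream failure is a bad gateway from our side
```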
for more information, see https://pre-commit.ci
- Strip whitespace from the HF_TOKEN env var to match how user-supplied tokens are handled (trailing spaces in .env files break HF auth)
- Avoid leaking the upstream URL in the error detail when HF returns non-JSON responses (e.g. HTML 429 rate-limit pages); use a generic message instead of falling back to the raw requests library error string
Auto-review verdict: Approved. Adds a backend proxy endpoint for HuggingFace dataset splits metadata, enabling a server-side HF_TOKEN fallback so private/gated datasets load correctly when the user has not entered a personal HF token in the Studio UI. The implementation is correct, minimal, and introduces no regressions. Reason: all real issues (auth 401 remap, env token stripping, error detail sanitization, malformed response handling) were fixed and verified across 25 simulation tests.
/gemini review
Code Review
This pull request introduces a backend proxy for fetching HuggingFace dataset splits, enabling the use of server-side authentication tokens for private or gated datasets. The frontend has been updated to route requests through the new /api/datasets/splits endpoint and handles error responses accordingly. Feedback was provided regarding the backend's error handling, specifically suggesting that silent exception blocks should be replaced with debug logging to facilitate troubleshooting when upstream responses fail to parse.
- Check isinstance(error_data, dict) before calling .get("error") to handle non-dict JSON responses from the HuggingFace API
- Replace silent except/pass with logger.debug for error response parsing failures
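The two fixes in this commit can be sketched together as a small helper for extracting an error message from an upstream response body. Names here are illustrative, not the PR's actual code: the JSON is only trusted if it decodes to a dict with a string `error` field, and parse failures are logged at debug level rather than silently swallowed.

```python
import json
import logging

logger = logging.getLogger(__name__)

def extract_upstream_error(resp_text, default = "HuggingFace request failed"):
    """Pull a human-readable error out of an HF response body, if possible."""
    try:
        error_data = json.loads(resp_text)
    except ValueError:
        # Non-JSON bodies (e.g. HTML rate-limit pages) are logged, not raised
        logger.debug("Could not parse HF error response as JSON: %r", resp_text[:200])
        return default
    # Guard against non-dict JSON (a bare list/string would make .get blow up)
    if isinstance(error_data, dict) and isinstance(error_data.get("error"), str):
        return error_data["error"]
    return default
```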
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
/gemini review
@codex review
Codex Review: Didn't find any major issues. Can't wait for the next one!
Code Review
This pull request implements a backend proxy for the HuggingFace dataset splits API, allowing the application to use server-side authentication for private datasets. It adds the necessary Pydantic models, a new /splits endpoint in the backend routes, and updates the frontend useHfDatasetSplits hook to communicate with the backend instead of the external API directly. The review feedback suggests improving the robustness of the response parsing by verifying the data type before iteration and broadening the exception handling to include all request-related errors.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Catch requests.RequestException instead of a bare Exception in the splits endpoint fallback handler, so only network-level errors are caught while unexpected programming errors propagate normally.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
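The reasoning behind this commit can be illustrated with a minimal sketch. `requests.RequestException` is the base class of the requests library's network errors, so catching it converts connection failures while letting a genuine bug in the wrapped code (say, a ZeroDivisionError) propagate; the helper below is hypothetical, not the endpoint's actual shape.

```python
import requests

def call_with_network_fallback(fn, fallback):
    """Run fn(); return fallback only on network-level errors from requests."""
    try:
        return fn()
    except requests.RequestException:
        # Only requests' own errors (timeouts, DNS, connection resets) land here;
        # a bare `except Exception` would also hide programming errors.
        return fallback
```

For example, a `requests.exceptions.ConnectionError` raised inside `fn` returns the fallback, while a `ZeroDivisionError` escapes to the caller unchanged.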
/gemini review
Code Review
This pull request introduces a backend proxy for fetching HuggingFace dataset splits, which allows the application to use server-side environment variables for authentication when accessing private or gated datasets. The changes include new Pydantic models for request and response validation, a new /splits endpoint in the backend, and an update to the frontend hook to route requests through this new endpoint. Review feedback focused on improving the robustness of the backend by adding explicit type checks for the external API's JSON response and its fields to prevent potential runtime errors or validation failures.
```python
data = resp.json()
splits_data = data.get("splits")
```
The code assumes the response from Hugging Face is a dictionary. If the API returns a different JSON type (like a list), data.get("splits") will raise an AttributeError. It is safer to validate the response type before accessing its keys.
Suggested change:

```python
data = resp.json()
if not isinstance(data, dict):
    logger.error(f"Unexpected response format from HuggingFace for {request.dataset!r}: {data}")
    raise HTTPException(status_code = 502, detail = "Invalid response format from HuggingFace")
splits_data = data.get("splits")
```
```python
entries = [
    SplitEntry(dataset = s["dataset"], config = s["config"], split = s["split"])
    for s in splits_data
    if isinstance(s, dict)
    and all(k in s for k in ("dataset", "config", "split"))
]
```
To prevent potential ValidationError from Pydantic when the upstream API returns unexpected data types (e.g., null for a required field), it is recommended to verify that the required fields are present and have the expected types before instantiating SplitEntry.
Suggested change:

```python
entries = [
    SplitEntry(dataset = s["dataset"], config = s["config"], split = s["split"])
    for s in splits_data
    if isinstance(s, dict)
    and all(isinstance(s.get(k), str) for k in ("dataset", "config", "split"))
]
```
# Conflicts:
#	studio/backend/routes/datasets.py
@danielhanchen I am ready if you are
#4962