Conversation
…on-sdk into tim/sdk-configures-edge
…on-sdk into tim/sdk-configures-edge
brandon-wada
left a comment
There was a problem hiding this comment.
Couple of fixes to take a look at, mostly to ensure some stylistic rules
There was a problem hiding this comment.
Maybe worth a small mocked test on the put method and detector readiness method
There was a problem hiding this comment.
Added test_edge_set_config_sends_payload_and_polls and test_edge_get_detector_readiness.
src/groundlight/experimental_api.py
Outdated
| self.detector_reset_api = DetectorResetApi(self.api_client) | ||
|
|
||
| self.edge_api = EdgeApi(self.api_client) | ||
| self.edge = EdgeAPI(self) |
There was a problem hiding this comment.
Something has gone very wrong here. I think what we're looking for is for edge_base_url to be defined in the EdgeApi, and for there to only be one EdgeAPI defined on the experimental client
There was a problem hiding this comment.
These are two separate APIs that serve different purposes. EdgeApi (OpenAPI-generated) calls the cloud API for presigned model download URLs -- used internally by download_mlbinary(). EdgeEndpointApi (new) talks directly to the edge endpoint for config management (gl.edge.get_config(), etc.).
I have adjusted the naming and added comments to disambiguate.
| pass | ||
|
|
||
|
|
||
| class EdgeNotAvailableError(GroundlightClientError): |
There was a problem hiding this comment.
Is there a reason to define this in client rather than experimental_api? This error should only be possible in while using the experimental client, no?
There was a problem hiding this comment.
Correct that it is only be possible from the experimental client at this point.
We don't currently have any exceptions that are specific to the experimental client, so I decided not to put it there. I considered grouping it with the edge stuff, e.g. groundlight.edge.exceptions, but I decided not to introduce a new pattern for a single exception.
src/groundlight/edge/api.py
Outdated
| deadline = time.time() + timeout_sec | ||
| while time.time() < deadline: | ||
| readiness = self.get_detector_readiness() | ||
| if desired_ids and all(readiness.get(did, False) for did in desired_ids): |
There was a problem hiding this comment.
Via Claude: edge condition if the detector list is empty. It should either return immediately or return a different error rather than timeout
| headers = self._client.get_raw_headers() | ||
| try: | ||
| response = requests.request( | ||
| method, url, headers=headers, verify=self._client.configuration.verify_ssl, timeout=10, **kwargs |
There was a problem hiding this comment.
Claude has a nit: 10 second default is fine but should be overridable via kwargs
There was a problem hiding this comment.
I think this would add complexity without any value to the SDK user. All of these methods should return very quickly. 10 seconds is a reasonable and generous timeout. If the request is taking longer than this, there is something wrong with the network, which won't be fixed by adjusting the timeout.
There was a problem hiding this comment.
Internal method, I'll buy that
src/groundlight/edge/api.py
Outdated
| self._request("PUT", "/edge-config", json=config.to_payload()) | ||
|
|
||
| poll_interval_seconds = 1 | ||
| desired_ids = {d.detector_id for d in config.detectors if d.detector_id} |
There was a problem hiding this comment.
Claude: d.detector_id should never be falsey, it's misleading to have the check here
Me: However, in the current pydantic config I don't think we actually forbid the empty string. We probably should though
There was a problem hiding this comment.
Agreed. I added pydantic validation and removed the defensive code that assumes detector ID can be falsey.
…on-sdk into tim/sdk-configures-edge
| headers = self._client.get_raw_headers() | ||
| try: | ||
| response = requests.request( | ||
| method, url, headers=headers, verify=self._client.configuration.verify_ssl, timeout=10, **kwargs |
There was a problem hiding this comment.
Internal method, I'll buy that
## Summary Adds HTTP endpoints for reading and modifying the edge endpoint configuration at runtime, without requiring a Helm redeploy. A [Python SDK companion PR](groundlight/python-sdk#419) provides the client-side methods. **New endpoints:** - `GET /edge-config` -- returns the active `EdgeEndpointConfig` - `PUT /edge-config` -- replaces the active config (diffs against current state, adds/removes detector pods accordingly) - `GET /edge-detector-readiness` -- reports which configured detectors have inference pods ready to serve ## Key design decisions ### Config as a shared file on PVC Multiple uvicorn workers cannot share in-memory state. The active config is persisted to a YAML file on the existing PVC (`/opt/groundlight/edge/config/active-edge-config.yaml`). Workers read it via an mtime-based cache (`EdgeConfigManager.active()`), which calls `os.path.getmtime` on each access and only re-parses when the file has changed. This gives cross-worker consistency with negligible overhead. ### Config seeding via init container An `apply-edge-config` init container copies the Helm ConfigMap to the active config file on PVC. It uses a revision fingerprint (revision + deploy timestamp) stored on PVC to distinguish `helm install`/`upgrade` from restarts -- only copying when the fingerprint changes. This ensures Helm config is applied on deploy while SDK changes survive restarts. At startup, `main.py` checks: `EDGE_CONFIG` env var (for Balena) > active config on PVC > Pydantic defaults. ### Backward compatibility Older Helm charts lack the init container. In that case, `main.py` detects the missing active config file and falls back to copying the Helm ConfigMap directly. ### Detector reconciliation `PUT /edge-config` triggers `reconcile_config()`, which: 1. Diffs the desired detectors against DB records (`compute_detector_diff`) 2. Marks removed detectors as `pending_deletion` in the DB 3. Creates new deployment records for added detectors 4. Saves the new config to disk The `inference-model-updater` picks up these DB changes on its next loop iteration -- deleting pods for `pending_deletion` records, then creating pods for new ones. This could cause the config update process to take a bit longer than necessary (because we have to wait for the refresh loop to occur). I thought about adding logic to make the inference-model-updater poll for config updates while it waits, and terminate early if it sees one. In the end, I decided to not add that (yet). YAGNI. ### Deprecation of default-edge-config.yaml The static `deploy/helm/groundlight-edge-endpoint/files/default-edge-config.yaml` has been removed. Default config is now defined by Pydantic model defaults in the `python-sdk`. When no config file is provided to Helm, `_helpers.tpl` generates an empty YAML object, and the system uses Pydantic defaults. ## Notable refactorings - **`EdgeConfigManager`** (new, replaces `edge_config_loader.py`): Singleton-style class with class methods for config lifecycle -- `save()`, `active()`, `detector_configs()`, `detector_config()`. - **`EdgeInferenceManager` simplification**: Removed stored `detector_inference_configs`, `inference_client_urls`, `oodd_inference_client_urls`, `min_times_between_escalations`, and the 30-second scheduler poll. The manager now computes pod URLs on the fly from detector IDs and reads config from `EdgeConfigManager.active()`. Only `last_escalation_times` remains as runtime state. - **`naming.py`** (new): Extracted `get_edge_inference_service_name` and `get_edge_inference_model_name` out of `edge_inference.py` to break a circular import between `edge_inference.py` and `edge_config_manager.py`. - **`pending_deletion` column**: Added to the `InferenceDeployment` DB model. The column has a server default of `False`, so existing databases are compatible without migration. - **`apply-edge-config` init container** (new): Busybox-based init container added to the edge-endpoint Deployment. Uses a revision fingerprint to conditionally copy Helm config to PVC. Script lives in `files/apply-edge-config.sh`, embedded in the `edge-config` ConfigMap. ## Authentication These new endpoints are **unauthenticated**, consistent with all other non-inference endpoints on the edge endpoint (status page, metrics, health checks). The edge endpoint assumes a trusted local network. `GET /edge-config` and `GET /edge-detector-readiness` are comparable in sensitivity to the existing status page and `metrics.json`, which already expose detector IDs, model versions, and readiness. `PUT /edge-config` is the only unauthenticated write endpoint that modifies system state. An attacker with network access could add or remove detectors. This is an accepted tradeoff under the trusted-network assumption. If edge endpoints are ever exposed to untrusted networks, this endpoint should be the first to get auth. ## Release strategy This PR must be deployed **before** the companion SDK PR (`python-sdk` `0.26.0`). The SDK's `gl.edge.*` methods call endpoints that only exist after this PR is deployed. Releasing the SDK first would cause those methods to 404. The edge endpoint's `groundlight` dependency does not need to change for this PR -- it uses SDK features already available in `0.25.x`. However, once the SDK is released at `0.26.0`, a follow-up bump to `>=0.26.0, <0.27.0` is recommended. ## Tests - `test/api/test_edge_config.py`: GET returns 200, PUT validates body (422 on invalid, 200 on valid) - `test/core/test_edge_config_manager.py`: `compute_detector_diff` (6 cases), `apply_detector_changes` (3 cases), `EdgeConfigManager` (save/load/active roundtrip, mtime caching, startup priority, detector config lookups) ## Load Testing This PR touches some code in the hot path of inference (in what I hope is a trivial way). To ensure that I am not introducing any performance regressions, I performed the following benchmark. The results appear nearly identical, each test achieving a "Max Steady RPS" of 30 with the `countstep-yolox-tracking` pipeline. ```python uv run python multiple_client_throughput_test.py COUNT --edge-pipeline-config count-step-yolox-tracking --time-between-ramp 10 --max-clients 8 ``` With changes: <img width="1000" height="600" alt="image" src="proxy.php?url=https%3A%2F%2Fgithub.com%2F%3Ca+href%3D"https://github.com/user-attachments/assets/7414a6f4-cb97-443f-b4a9-3724a8edfc2d">https://github.com/user-attachments/assets/7414a6f4-cb97-443f-b4a9-3724a8edfc2d" /> Before changes: <img width="1000" height="600" alt="image" src="proxy.php?url=https%3A%2F%2Fgithub.com%2F%3Ca+href%3D"https://github.com/user-attachments/assets/ff5473e2-7f13-4da0-97a5-077a455179cf">https://github.com/user-attachments/assets/ff5473e2-7f13-4da0-97a5-077a455179cf" /> --------- Co-authored-by: Auto-format Bot <[email protected]>
Summary
Adds SDK methods for reading and modifying edge endpoint configuration at runtime, accessible via a namespaced
gl.edgepattern. Depends on companion Edge PR.New code
EdgeAPIclass (groundlight/edge/api.py)Provides three methods:
get_config()-- callsGET /edge-config, returnsEdgeEndpointConfigset_config(config, timeout_sec)-- callsPUT /edge-config, then pollsGET /edge-detector-readinessuntil all desired detectors are ready (or timeout)get_detector_readiness()-- callsGET /edge-detector-readiness, returnsdict[str, bool]All HTTP calls go through
_request(), which raisesEdgeNotAvailableErroron 404 or connection failure with an informative message guiding the user to check their endpoint configuration.EdgeNotAvailableErrorNew exception in
groundlight.client. Raised when an edge-only method is called against a non-edge endpoint (cloud API, unreachable host). The error message includes a hint about theGROUNDLIGHT_ENDPOINTenv var.gl.edgeaccess patternExperimentalApi.__init__creates anEdgeAPIinstance asself.edge. Edge methods are only accessible through this namespace.Version
Bumped to
0.26.0(minor version bump for new public API surface).Release order
The companion edge-endpoint PR must be deployed first. The
gl.edge.*methods call endpoints (/edge-config,/edge-detector-readiness) that only exist after the edge PR is deployed. Releasing this SDK version first would cause those methods to raiseEdgeNotAvailableError(404).Authentication note
These methods do not add authentication headers beyond what the SDK already sends (the standard
x-api-token). The edge endpoint does not currently enforce auth on the config or readiness endpoints. The SDK sends the token in case auth is added in the future, but it is not required today.