[RFC] Add device re-interview support #1789
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##              dev    #1789   +/-  ##
=======================================
  Coverage   99.53%   99.53%
=======================================
  Files          64       64
  Lines       13059    13124    +65
=======================================
+ Hits        12998    13063    +65
  Misses         61       61
```
This definitely needs a lot of cleanup and I'm still not sure about the zigpy side of the solution, but this seems to be working with these changes to ZHA and Core (they include the dynamic entity rediscovery):
```python
# Re-interview the device after successful OTA to pick up
# any changes in clusters/endpoints/model and re-apply quirks
try:
    await self.reinterview()
except Exception:  # noqa: BLE001
    LOGGER.warning(
        "Post-OTA re-interview failed for %r,"
        " device may need manual re-interview",
        self,
        exc_info=True,
    )
```
Side note: In the future, I'd like to add information to the DB containing which OTA/firmware version a device was last interviewed with. We could also show repairs to re-interview devices if this post-OTA re-interview fails for some reason. In a similar manner, we could also do this for quirks (and likely include explicit version bumps in quirks that actually need configuration).
Pull request overview
Adds experimental “device re-interview” support to zigpy to refresh a device’s discovered state (endpoints/clusters/model/manufacturer/OTA version) at runtime—without removing/rejoining—by discovering via a temporary shadow Device and swapping it in on success.
Changes:
- Adds `Device.reinterview()` with a shadow-device discovery flow and integrates re-interview triggering after successful OTA updates.
- Refactors initialization logic by extracting `_discover()` from `_initialize()` to reuse discovery without emitting initialization completion.
- Adds application-level swap/finalization logic (`_device_reinterviewed`, `_finalize_device`) and new tests covering success/failure paths and the public API.
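The shadow-device flow described above can be sketched roughly as follows. This is a simplified illustration only: the class layout and method bodies here are placeholders, not zigpy's actual API.

```python
# Minimal sketch of the shadow-device re-interview flow.
# All names and bodies are illustrative, not zigpy's real implementation.
import asyncio


class Device:
    def __init__(self, app, ieee, nwk):
        self.app = app
        self.ieee = ieee
        self.nwk = nwk
        self.endpoints = {}

    async def _discover(self):
        # Placeholder for ZDO-based discovery (node descriptor,
        # active endpoints, simple descriptors, model/manufacturer, ...)
        self.endpoints = {1: "on_off"}

    async def reinterview(self):
        shadow = Device(self.app, self.ieee, self.nwk)
        # Route incoming packets to the shadow while it is interviewed
        self.app.devices[self.ieee] = shadow
        try:
            await shadow._discover()
        except Exception:
            # Discovery failed: restore routing to the old device, keep it
            self.app.devices[self.ieee] = self
            raise
        # Success: the shadow permanently replaces the old device
        self.app.device_reinterviewed(self, shadow)


class App:
    def __init__(self):
        self.devices = {}

    def device_reinterviewed(self, old, shadow):
        # Swap in the (possibly quirked) shadow device
        self.devices[shadow.ieee] = shadow
```

The key property of the design is visible even in the sketch: the old `Device` keeps serving traffic until `_discover()` has fully succeeded, and a failure reverts routing without touching the existing state.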
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `zigpy/device.py` | Adds re-interview flow, extracts `_discover()`, and triggers re-interview after OTA update. |
| `zigpy/application.py` | Adds device finalization helper and implements shadow→real device swap logic plus a public `reinterview_device()` API. |
| `tests/test_device.py` | Updates OTA tests to mock reinterview and adds new re-interview unit tests. |
| `tests/test_application.py` | Adds tests for `_device_reinterviewed()` swap behavior, DB removal ordering, group migration, and API delegation. |
```python
# attribute cache, group members, and relays)
if self._dblistener is not None:
    old_device.remove_listener(self._dblistener)
    await self._dblistener._remove_device(old_device)
```
This is intentional. `_dblistener.remove_device` doesn't exist and `device_removed` only enqueues the deletion from the DB. We need to actually make sure it's deleted before we later re-add DB data the fire-and-forget way.
We could also introduce some sort of flush method, which we can wait on here, but IMO this is fine.
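For reference, the "flush method" idea could look roughly like this. This is a hypothetical sketch of a fire-and-forget write queue with a `flush()`, not zigpy's actual DB listener; all names are assumptions.

```python
import asyncio


class WriteQueueListener:
    """Hypothetical DB listener: writes are enqueued fire-and-forget,
    but callers can await flush() to ensure everything has been applied."""

    def __init__(self):
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    def start(self):
        self._worker = asyncio.ensure_future(self._run())

    def enqueue(self, op):
        # Fire-and-forget: callers do not wait for the DB write
        self._queue.put_nowait(op)

    async def _run(self):
        while True:
            op = await self._queue.get()
            await op()  # perform the actual DB operation
            self._queue.task_done()

    async def flush(self):
        # Wait until every previously enqueued operation has completed
        await self._queue.join()
```

With such a listener, the swap code could enqueue the removal and then `await listener.flush()` before re-adding data, instead of calling a private `_remove_device` directly.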
Pull request overview
Adds an experimental “device re-interview” mechanism to zigpy that can re-run discovery/quirk matching at runtime (notably after OTA), using a temporary shadow Device and swapping it in only after successful discovery.
Changes:
- Introduces `Device.reinterview()` and refactors initialization discovery into a reusable `_discover()` phase.
- Adds `ControllerApplication.reinterview_device()` plus swap/finalization helpers and new re-interview success/failure events.
- Extends test coverage with end-to-end, failure, guard, and group/DB/relay migration scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `zigpy/device.py` | Adds re-interview flow, extracts `_discover()`, triggers re-interview after successful OTA, and adjusts PollControl behavior during re-interview. |
| `zigpy/application.py` | Adds `_finalize_device()`, implements `_device_reinterviewed()` swap logic, and exposes `reinterview_device()` API. |
| `tests/test_device.py` | Updates OTA tests to account for re-interview call and adds comprehensive re-interview behavior tests. |
| `tests/test_application.py` | Adds tests for DB removal ordering, public API delegation, group migration, relay persistence, and failure propagation. |
```python
# reinterview() handles its own errors internally.
await self.reinterview()
```
I think it's more correct to still have requesting the node descriptor be part of the update. Basically everything that's part of the interview process happens before `self.reinterview()` returns, so the update would only fully complete after a re-interview. I think that's fine/good.
```python
self.listener_event("raw_device_initialized", device)
device = zigpy.quirks.get_device(device)
self.devices[device.ieee] = device
if self._dblistener is not None:
    device.add_context_listener(self._dblistener)
```
Oh, I think this is actually a pre-existing issue ever since the PollControl stuff was added to the Device's constructor (#1621). It would also happen when we create a new `Device` object if we find a quirk at startup.
It shouldn't be a big issue, as packets would only arrive at the new device anyway. So it's "just" a small memory leak.
Yeah, see:
We correctly clean up everything in the re-interview PR already. So, we don't strictly need that PR for this. It was just discovered here.
```python
# Temporarily register the shadow in app.devices so it receives
# ZDO responses routed by packet_received().
self._application.devices[self._ieee] = shadow
```
This is an approach that I really dislike but we don't have much of a choice with how packet routing in zigpy is set up.
We could potentially rework packet_received to emit an event that Device objects subscribe to, but this might have a performance penalty if every received packet relies on 99/100 devices rejecting it. Maybe a filtered subscribe, keyed by device EUI64 with zigpy's application controller maintaining the NWK <-> EUI64 mapping?
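The "filtered subscribe" idea could look roughly like this: a dispatch table keyed by EUI64, with the application maintaining the NWK ↔ EUI64 mapping, so each packet hits exactly one handler instead of every `Device` rejecting it. A hypothetical sketch only; none of these names are actual zigpy API.

```python
from typing import Callable, Dict


class PacketRouter:
    """Hypothetical filtered dispatch keyed by EUI64: O(1) handler
    lookup instead of N devices each rejecting a packet."""

    def __init__(self):
        self._by_eui64: Dict[str, Callable] = {}
        self._nwk_to_eui64: Dict[int, str] = {}

    def subscribe(self, eui64: str, nwk: int, handler: Callable) -> None:
        self._by_eui64[eui64] = handler
        self._nwk_to_eui64[nwk] = eui64

    def unsubscribe(self, eui64: str) -> None:
        self._by_eui64.pop(eui64, None)

    def packet_received(self, src_nwk: int, packet) -> None:
        # Resolve NWK -> EUI64, then dispatch to the single subscriber
        eui64 = self._nwk_to_eui64.get(src_nwk)
        handler = self._by_eui64.get(eui64)
        if handler is not None:
            handler(packet)
```

Swapping routing for a re-interview would then be a `subscribe()` for the shadow followed by an `unsubscribe()`/re-`subscribe()` on failure, which is essentially the same one-liner swap the current code does on `app.devices`.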
I get it's not nice – unfortunately, the whole approach isn't great, but I'm not sure we should try and abstract this further. We'd still need to register and unregister the new and old device for routing, right? Like, we don't want the packets to go to the old device. We want to completely swap it for routing temporarily.
And for that, I kind of feel like this is a one-liner that's rather explicit. And if we fail (never get a node descriptor response), we revert it right below: https://github.com/TheJulianJES/zigpy/blob/edfcf556a8fc3756ad8a26a3f21f630bd208a477/zigpy/device.py#L338
```python
LOGGER.warning(
    "Re-interview failed, keeping existing device",
    exc_info=True,
)
self._application.listener_event("device_reinterview_failure", self)
```
Could we turn this into a typed event?
You mean with the new event system using emit?
I feel like we have to change all other events as well then, as they all affect ZHA. I guess we could do that before or after this PR.
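As an illustration of what a typed event could look like with an `emit`-style system, here is a small frozen dataclass. This is only a sketch; the field names and the `event` literal are assumptions, not the final design.

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass(frozen=True)
class DeviceReinterviewFailure:
    """Hypothetical typed event carrying the still-functional old device."""

    device: Any
    event: Literal["device_reinterview_failure"] = "device_reinterview_failure"


def handle(evt: object) -> None:
    # Subscribers can match on the type instead of a magic string name
    if isinstance(evt, DeviceReinterviewFailure):
        print(f"re-interview failed for {evt.device}")
```

The upside over string-keyed `listener_event` calls is that the payload shape is checkable by type checkers and discoverable in one place.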
TODO: Are clusters cleaned up properly? (e.g. OTA cluster processes matches / callback)
Draft. Tested and working. Open for discussion regarding the approach.
Note
This is somewhat similar to the approach of quirk matching, where we also create a new `Device` object. With how both zigpy and zha-quirks currently work, I don't see any other way than basically how it's done in this PR.
IMO, re-interview support is something we really need, especially for OTA updates. I think this solution is "good enough". If quirks ever drastically change in the future, we can re-evaluate this.
Related changes
This is already working with these changes to ZHA and Core (they include the dynamic entity rediscovery):
Proposed change
This is an experimental PR to allow zigpy to re-interview a device at runtime, without needing to fully remove and re-join it.
This works by creating a temporary "shadow device" for which the initialization is started. Only if it's successful is the existing data wiped from the DB and the existing runtime `Device` object replaced with the "shadow device" and hooked up for DB events.

Possible advantages of this approach
I think the approach of this might actually be nicer than a lot of other ideas that came up:
- During the reinterview process, the device would still work as is
- `Device.reinterview()` …
- … `get_device` method, which essentially requires a new `Device` object if we want to redo quirk matching

TODO:
- … `Device`?
- … `Device` object but revert all runtime (+ DB?) changes if unsuccessful?
- … `Device`? Similar to `recompute_capabilities` (kind of...)?
- … `Device`: `_last_seen` + `_relays` + `lqi` + `rssi` + group membership + ...?

The ZHA part of this also somewhat depends on:
AI summary
Adds the ability to re-interview a device after an OTA firmware update (or on demand from upstream applications like ZHA). Re-interviewing re-discovers the device's node descriptor, endpoints, clusters, model/manufacturer info, and OTA firmware version, then re-applies quirks.
This is needed because a firmware update can change a device's exposed endpoints, clusters, or model string, which may require a different quirk to be applied.
Approach: Shadow Device
A fresh "shadow" `Device` object is created with the same IEEE/NWK and fully initialized via ZDO requests. The old device continues to handle messages normally throughout the entire process.

More details
Other (unnecessary) AI information (CLICK TO EXPAND)
Changes
`zigpy/device.py`
- Extracts `_discover()` from `_initialize()` — pure discovery logic without side effects, enabling reuse by the reinterview flow.
- `Device.reinterview()` — creates a shadow device, runs discovery, swaps on success.
- Adds a `reinterviewing` property and `_reinterview_in_progress` guard flag.
- Calls `reinterview()` automatically after a successful OTA update in `update_firmware()`.
- Adjusts `poll_control_checkin_callback` so sleepy devices are more likely to respond.

`zigpy/application.py`
- `_device_reinterviewed(old_device, shadow)` — handles the atomic swap: DB removal, cleanup, quirk re-application, and event firing.
- `reinterview_device(ieee)` — public API for upstream applications (e.g. ZHA) to trigger a re-interview.

Events
- `device_reinterviewed` — fired after a successful re-interview, with the new (possibly quirked) device.
- `device_reinterview_failure` — fired when re-interview fails, with the old device (still functional).

Test plan
- `_device_reinterviewed` …
- … (`_reinterview_in_progress` guard)
- `reinterview()` …
- `_device_reinterviewed` swaps device in `app.devices` and fires events
- `_device_reinterviewed` removes old device from DB before saving new one
- `reinterview_device()` public API delegates correctly
- `reinterview_device()` raises `KeyError` for unknown IEEE

Thought with `_save_device` not really doing anything for quirked devices

`_save_device` behavior for quirked devices (CLICK TO EXPAND)
This doesn't seem to be an issue here, but `_save_device` being called afterwards for quirked devices doesn't really save anything to the DB. It completely skips even saving the attribute cache (and unsupported attributes). This isn't ideal and we should look at this at some point, though we do not want to save endpoints or clusters added or removed by quirks to the DB.

zigpy/zigpy/appdb.py
Lines 411 to 420 in 7d1ea41
Why it's not an issue below (AI-generated):
How attributes are saved
The `_save_device` method in `appdb.py` has an early return for `BaseCustomDevice` instances (quirked devices), skipping endpoint, cluster, and attribute cache persistence. This is intentional — quirk-added clusters/endpoints should not be persisted to the DB, which serves as the source of truth for quirk matching.

This is not an issue for the reinterview flow. In `device_initialized()`:
1. `raw_device_initialized` fires with the raw shadow (a plain `Device`, not a `BaseCustomDevice`) before quirks are applied, so `_save_device` receives the raw device, the `isinstance` check is `False`, and the full save path runs.
2. `_save_device(shadow)` — shadow is a raw `Device` → saves original endpoints, clusters, and attribute cache (from `read_attributes` calls during `_discover()`).
3. `get_device(shadow)` — returns the quirked device with potentially modified clusters/endpoints.
4. `register_cluster_events` — the quirked device's clusters get event handlers hooked up, so future attribute reads/reports persist individually via `on_attribute_read`, `on_attribute_updated`, etc.

The DB ends up with the original (pre-quirk) endpoints and clusters as the source of truth, plus the attribute cache populated during discovery. This matches how initial device join already works.