Skip to content

[RFC] Add device re-interview support#1789

Draft
TheJulianJES wants to merge 28 commits intozigpy:devfrom
TheJulianJES:tjj/reinterviewing
Draft

[RFC] Add device re-interview support#1789
TheJulianJES wants to merge 28 commits intozigpy:devfrom
TheJulianJES:tjj/reinterviewing

Conversation

@TheJulianJES
Copy link
Copy Markdown
Contributor

@TheJulianJES TheJulianJES commented Mar 15, 2026

Draft. Tested and working. Open for discussion regarding the approach.

Note

This is somewhat similar to the approach of quirk matching, where we also create a new Device object.
With how both zigpy and zha-quirks currently work, I don't see any other way than basically how it's done in this PR.

IMO, re-interview support is something we really need, especially for OTA updates. I think this solution is "good enough". If quirks ever drastically change in the future, we can re-evaluate this.

Related changes

This is already working with these changes to ZHA and Core (they include the dynamic entity rediscovery):

Proposed change

This is an experimental PR to allow zigpy to re-interview a device at runtime, without needing to fully remove it, and re-join it.

This works by creating a temporary "shadow device" for which the initialization is started. Only if it's successful is the existing data wiped from the DB and the existing runtime Device object replaced with the "shadow device" and hooked up for DB events.

Possible advantages of this approach

I think the approach of this might actually be nicer than a lot of other ideas that came up:

  • We don't need to modify the DB until after we're done with a successful reinterview
  • We don't need to remove the (working) device until after we're done with a successful reinterview
    • During the reinterview process, the device would still work as is
    • EDIT: We (obviously) need to forward any messages to the new shadow device, only reverting them if we time out. This does work nicely.
  • We don't need to selectively modify existing attributes in the cache and just replace all of them
  • We don't need to change ZHA attribute reading logic to re-read existing attributes already in cache
    • We can likely just hook up the reconfiguration to Device.reinterview()
  • It's also a relatively small approach and works with existing logic
  • We have a "completely fresh device" at the end
  • This works with the quirks registry's get_device method, which essentially requires a new Device object if we want to redo quirk matching

TODO:

  • Test
  • See if we even want to do this
    • Would there be any advantages if we completely remove and re-add the Device?
    • Could we re-interview on an existing Device object but revert all runtime (+ DB?) changes if unsuccessful?
    • Or is this approach fine?
  • Make some parts slightly less hacky
  • Investigate if this even works nicely with ZHA and HA
    • ZHA holds references to the zigpy device, HA to the ZHA device
    • Can we have ZHA devices "re-initialize" from the zigpy Device? Similar to recompute_capabilities (kind of...)?
  • Copy over parts from old Device: _last_seen + _relays + lqi + rssi + group membership + ...?

The ZHA part of this also somewhat depends on:


AI summary

Adds the ability to re-interview a device after an OTA firmware update (or on demand from upstream applications like ZHA). Re-interviewing re-discovers the device's node descriptor, endpoints, clusters, model/manufacturer info, and OTA firmware version, then re-applies quirks.

This is needed because a firmware update can change a device's exposed endpoints, clusters, or model string, which may require a different quirk to be applied.

Approach: Shadow Device

A fresh "shadow" Device object is created with the same IEEE/NWK and fully initialized via ZDO requests. The old device continues to handle messages normally throughout the entire process.

  • On success: the old device is removed from the DB (cascading to endpoints, clusters, and attribute cache), cleaned up, and replaced by the shadow. Quirks are re-applied and the new device is persisted.
  • On failure (e.g. sleepy end-device doesn't respond): the shadow is discarded. The old device and its DB state are completely untouched and it keeps working.

More details

Other (unnecessary) AI information (CLICK TO EXPAND)

Changes

zigpy/device.py

  • Extract _discover() from _initialize() — pure discovery logic without side effects, enabling reuse by the reinterview flow.
  • Add Device.reinterview() — creates a shadow device, runs discovery, swaps on success.
  • Add reinterviewing property and _reinterview_in_progress guard flag.
  • Trigger reinterview() automatically after a successful OTA update in update_firmware().
  • Enable fast polling during reinterview in poll_control_checkin_callback so sleepy devices are more likely to respond.

zigpy/application.py

  • Add _device_reinterviewed(old_device, shadow) — handles the atomic swap: DB removal, cleanup, quirk re-application, and event firing.
  • Add reinterview_device(ieee) — public API for upstream applications (e.g. ZHA) to trigger a re-interview.

Events

  • device_reinterviewed — fired after a successful re-interview, with the new (possibly quirked) device.
  • device_reinterview_failure — fired when re-interview fails, with the old device (still functional).

Test plan

  • Successful re-interview creates shadow, calls _device_reinterviewed
  • Failed re-interview (sleepy device timeout) preserves old device and DB
  • Concurrent reinterview calls are prevented (_reinterview_in_progress guard)
  • Reinterview is skipped while initialization is in progress
  • Poll control checkin enables fast polling during reinterview
  • Successful OTA triggers reinterview()
  • _device_reinterviewed swaps device in app.devices and fires events
  • _device_reinterviewed removes old device from DB before saving new one
  • reinterview_device() public API delegates correctly
  • reinterview_device() raises KeyError for unknown IEEE
  • Existing OTA and initialization tests still pass (267 tests, 0 failures)

Thought with _save_device not really doing anything for quirked devices

_save_device behavior for quirked devices (CLICK TO EXPAND)

This doesn't seem to be an issue here, but _save_device being called after for quirked devices doesn't really save anything to the DB. It completely skips even saving attribute cache (and unsupported ones). This isn't ideal and we should look at this at some point, though we do not want to save endpoints or clusters added or removed by quirks to the DB.

zigpy/zigpy/appdb.py

Lines 411 to 420 in 7d1ea41

if isinstance(device, zigpy.quirks.BaseCustomDevice):
await self._db.commit()
return
await self._save_endpoints(device)
for ep in device.non_zdo_endpoints:
await self._save_clusters(ep)
await self._save_attribute_cache(ep)
await self._save_unsupported_attributes(ep)
await self._db.commit()

Why it's not an issue below (AI-generated):

How attributes are saved

The _save_device method in appdb.py has an early return for BaseCustomDevice instances (quirked devices), skipping endpoint, cluster, and attribute cache persistence. This is intentional — quirk-added clusters/endpoints should not be persisted to the DB, which serves as the source of truth for quirk matching.

This is not an issue for the reinterview flow. In device_initialized():

self.listener_event("raw_device_initialized", device)   # fires with raw shadow
device = zigpy.quirks.get_device(device)                 # quirk applied AFTER

raw_device_initialized fires with the raw shadow (a plain Device, not a BaseCustomDevice) before quirks are applied. So _save_device receives the raw device, the isinstance check is False, and the full save path runs:

  1. _save_device(shadow) — shadow is a raw Device → saves original endpoints, clusters, and attribute cache (from read_attributes calls during _discover())
  2. get_device(shadow) — returns quirked device with potentially modified clusters/endpoints
  3. register_cluster_events — quirked device's clusters get event handlers hooked up, so future attribute reads/reports persist individually via on_attribute_read, on_attribute_updated, etc.

The DB ends up with the original (pre-quirk) endpoints and clusters as source of truth, plus the attribute cache populated during discovery. This matches how initial device join already works.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.53%. Comparing base (c065ff9) to head (da56821).

Additional details and impacted files
@@           Coverage Diff           @@
##              dev    #1789   +/-   ##
=======================================
  Coverage   99.53%   99.53%           
=======================================
  Files          64       64           
  Lines       13059    13124   +65     
=======================================
+ Hits        12998    13063   +65     
  Misses         61       61           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@TheJulianJES
Copy link
Copy Markdown
Contributor Author

This definitely needs a lot of cleanup and I'm still not sure about the zigpy side of the solution, but this seems to be working with these changes to ZHA and Core (they include the dynamic entity rediscovery):

Comment thread zigpy/device.py Outdated
Comment on lines +1002 to +1012
# Re-interview the device after successful OTA to pick up
# any changes in clusters/endpoints/model and re-apply quirks
try:
await self.reinterview()
except Exception: # noqa: BLE001
LOGGER.warning(
"Post-OTA re-interview failed for %r,"
" device may need manual re-interview",
self,
exc_info=True,
)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: In the future, I'd like to add information to the DB containing which OTA/fw version a device was last interviewed with. We could also show repairs to re-interview devices if this post-OTA re-interview fails, for some reason. In a similar manner, we could also do this for quirks (and likely include explicit version bumps in quirks that actually need configuration).

@TheJulianJES TheJulianJES requested a review from Copilot March 19, 2026 02:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds experimental “device re-interview” support to zigpy to refresh a device’s discovered state (endpoints/clusters/model/manufacturer/OTA version) at runtime—without removing/rejoining—by discovering via a temporary shadow Device and swapping it in on success.

Changes:

  • Adds Device.reinterview() with shadow-device discovery flow and integrates re-interview triggering after successful OTA updates.
  • Refactors initialization logic by extracting _discover() from _initialize() to reuse discovery without emitting initialization completion.
  • Adds application-level swap/finalization logic (_device_reinterviewed, _finalize_device) and new tests covering success/failure paths and the public API.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
zigpy/device.py Adds re-interview flow, extracts _discover(), and triggers re-interview after OTA update.
zigpy/application.py Adds device finalization helper and implements shadow→real device swap logic plus a public reinterview_device() API.
tests/test_device.py Updates OTA tests to mock reinterview and adds new re-interview unit tests.
tests/test_application.py Adds tests for _device_reinterviewed() swap behavior, DB removal ordering, group migration, and API delegation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread zigpy/application.py
Comment thread zigpy/application.py
# attribute cache, group members, and relays)
if self._dblistener is not None:
old_device.remove_listener(self._dblistener)
await self._dblistener._remove_device(old_device)
Copy link
Copy Markdown
Contributor Author

@TheJulianJES TheJulianJES Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. _dblistener.remove_device doesn't exist and device_removed only enqueues the deletion from the DB. We need to actually make sure it's deleted before we later re-add DB data the fire-and-forget way.

We could also introduce some sort of flush method, which we can wait on here, but IMO this is fine.

Comment thread tests/test_device.py Outdated
Comment thread zigpy/device.py
Comment thread zigpy/device.py
Comment thread zigpy/device.py
Comment thread zigpy/device.py
@TheJulianJES TheJulianJES requested a review from Copilot March 19, 2026 03:24
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an experimental “device re-interview” mechanism to zigpy that can re-run discovery/quirk matching at runtime (notably after OTA), using a temporary shadow Device and swapping it in only after successful discovery.

Changes:

  • Introduces Device.reinterview() and refactors initialization discovery into a reusable _discover() phase.
  • Adds ControllerApplication.reinterview_device() plus swap/finalization helpers and new re-interview success/failure events.
  • Extends test coverage with end-to-end, failure, guard, and group/DB/relay migration scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
zigpy/device.py Adds re-interview flow, extracts _discover(), triggers re-interview after successful OTA, and adjusts PollControl behavior during re-interview.
zigpy/application.py Adds _finalize_device(), implements _device_reinterviewed() swap logic, and exposes reinterview_device() API.
tests/test_device.py Updates OTA tests to account for re-interview call and adds comprehensive re-interview behavior tests.
tests/test_application.py Adds tests for DB removal ordering, public API delegation, group migration, relay persistence, and failure propagation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread zigpy/device.py
Comment on lines +1004 to +1005
# reinterview() handles its own errors internally.
await self.reinterview()
Copy link
Copy Markdown
Contributor Author

@TheJulianJES TheJulianJES Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more correct to have requesting the node descriptor be part of the update still. Basically everything part of the interview process happens before self.reinterview() returns. So the update would only fully complete after a re-interview. I think that's fine/good.

Comment thread zigpy/application.py
Comment on lines 608 to 612
self.listener_event("raw_device_initialized", device)
device = zigpy.quirks.get_device(device)
self.devices[device.ieee] = device
if self._dblistener is not None:
device.add_context_listener(self._dblistener)
Copy link
Copy Markdown
Contributor Author

@TheJulianJES TheJulianJES Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think this is actually a pre-existing issue ever since the PollControl stuff was added to the Device's constructor (#1621). It would also happen when we create a new Device object if we found a quirk at startup.

It shouldn't be a big issue, as packets would only arrive to the new device though. So it's "just" a small memory leak.

Copy link
Copy Markdown
Contributor Author

@TheJulianJES TheJulianJES Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, see:

We correctly clean up everything in the re-interview PR already. So, we don't need strictly need that PR for this. It was just discovered here.

Comment thread zigpy/device.py

# Temporarily register the shadow in app.devices so it receives
# ZDO responses routed by packet_received().
self._application.devices[self._ieee] = shadow
Copy link
Copy Markdown
Collaborator

@puddly puddly Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an approach that I really dislike but we don't have much of a choice with how packet routing in zigpy is set up.

We could potentially rework packet_received to emit an event that Device objects subscribe to, but this might have a performance penalty if every received packet relies on 99/100 devices rejecting it. Maybe a filtered subscribe, keyed by device EUI64 with zigpy's application controller maintaining the NWK <-> EUI64 mapping?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it's not nice – unfortunately, the whole approach isn't great, but I'm not sure we should try and abstract this further. We'd still need to register and unregister the new and old device for routing, right? Like, we don't want the packets to go to the old device. We want to completely swap it for routing temporarily.

And for that, I kind of feel like this is a one-liner that's rather explicit. And if we fail (never get a node descriptor response), we revert it right below: https://github.com/TheJulianJES/zigpy/blob/edfcf556a8fc3756ad8a26a3f21f630bd208a477/zigpy/device.py#L338

Comment thread zigpy/device.py
"Re-interview failed, keeping existing device",
exc_info=True,
)
self._application.listener_event("device_reinterview_failure", self)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we turn this into a typed event?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean with the new event system using emit?
I feel like we have to change all other events as well then, as they all affect ZHA. I guess we could do that before or after this PR.

# Conflicts:
#	zigpy/device.py
@TheJulianJES
Copy link
Copy Markdown
Contributor Author

TODO: Are clusters cleaned up properly? (e.g. OTA cluster processes matches / callback)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants