[RFC] Add device re-interview support by TheJulianJES · Pull Request #1789 · zigpy/zigpy

TheJulianJES · 2026-03-15T23:19:44Z

Draft. Tested and working. Open for discussion regarding the approach.

Note

This is somewhat similar to the approach of quirk matching, where we also create a new Device object.
With how both zigpy and zha-quirks currently work, I don't see any other way than basically how it's done in this PR.

IMO, re-interview support is something we really need, especially for OTA updates. I think this solution is "good enough". If quirks ever drastically change in the future, we can re-evaluate this.

Related changes

This is already working with these changes to ZHA and Core (they include the dynamic entity rediscovery):

Proposed change

This is an experimental PR that allows zigpy to re-interview a device at runtime, without needing to fully remove and re-join it.

This works by creating a temporary "shadow device" for which the initialization is started. Only if it's successful is the existing data wiped from the DB and the existing runtime Device object replaced with the "shadow device" and hooked up for DB events.
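The swap-on-success flow described above can be sketched roughly as follows. This is a toy model, not the code from this PR: `ToyDevice`, `discover()` and the `responses` argument are made up for illustration, and the real interview issues ZDO requests instead.

```python
from __future__ import annotations

import asyncio


class ToyDevice:
    """Stand-in for a zigpy Device; only the fields needed for the sketch."""

    def __init__(self, ieee: str, nwk: int) -> None:
        self.ieee = ieee
        self.nwk = nwk
        self.endpoints: dict[int, list[int]] = {}

    async def discover(self, responses: dict[int, list[int]] | None) -> None:
        # Hypothetical discovery step: None simulates a device that never answers.
        if responses is None:
            raise asyncio.TimeoutError("device did not respond")
        self.endpoints = dict(responses)


async def reinterview(
    devices: dict[str, ToyDevice],
    ieee: str,
    responses: dict[int, list[int]] | None,
) -> bool:
    """Interview a shadow copy and keep it only if discovery succeeds."""
    old = devices[ieee]
    shadow = ToyDevice(old.ieee, old.nwk)
    devices[ieee] = shadow  # route incoming packets to the shadow meanwhile
    try:
        await shadow.discover(responses)
    except asyncio.TimeoutError:
        devices[ieee] = old  # revert: the old object and its state are untouched
        return False
    return True
```

In this sketch the old object is never mutated, so a failed attempt leaves nothing to roll back besides the registry entry.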

Possible advantages of this approach

I think this approach might actually be nicer than a lot of the other ideas that came up:

- We don't need to modify the DB until after a successful re-interview
- We don't need to remove the (working) device until after a successful re-interview
  - ~~During the reinterview process, the device would still work as is~~
  - EDIT: We (obviously) need to forward any messages to the new shadow device, reverting the routing only if we time out. This does work nicely.
- We don't need to selectively modify existing attributes in the cache and can just replace all of them
- We don't need to change the ZHA attribute reading logic to re-read attributes that are already in the cache
  - We can likely just hook up the reconfiguration to Device.reinterview()
- It's also a relatively small change and works with existing logic
- We have a "completely fresh device" at the end
- This works with the quirks registry's get_device method, which essentially requires a new Device object if we want to redo quirk matching

TODO:

- Test
- See if we even want to do this
  - Would there be any advantages if we completely remove and re-add the Device?
  - Could we re-interview on an existing Device object but revert all runtime (+ DB?) changes if unsuccessful?
  - Or is this approach fine?
- Make some parts slightly less hacky
- Investigate if this even works nicely with ZHA and HA
  - ZHA holds references to the zigpy device, HA to the ZHA device
  - Can we have ZHA devices "re-initialize" from the zigpy Device? Similar to recompute_capabilities (kind of...)?
- Copy over parts from old Device: _last_seen + _relays + lqi + rssi + group membership + ...?

The ZHA part of this also somewhat depends on:

Entity recomputation and add/remove at runtime zha#517

AI summary

Adds the ability to re-interview a device after an OTA firmware update (or on demand from upstream applications like ZHA). Re-interviewing re-discovers the device's node descriptor, endpoints, clusters, model/manufacturer info, and OTA firmware version, then re-applies quirks.

This is needed because a firmware update can change a device's exposed endpoints, clusters, or model string, which may require a different quirk to be applied.

Approach: Shadow Device

A fresh "shadow" Device object is created with the same IEEE/NWK and fully initialized via ZDO requests. The old device continues to handle messages normally throughout the entire process.

On success: the old device is removed from the DB (cascading to endpoints, clusters, and attribute cache), cleaned up, and replaced by the shadow. Quirks are re-applied and the new device is persisted.
On failure (e.g. sleepy end-device doesn't respond): the shadow is discarded. The old device and its DB state are completely untouched and it keeps working.

More details

Other (unnecessary) AI information (CLICK TO EXPAND)

Changes

`zigpy/device.py`

Extract _discover() from _initialize() — pure discovery logic without side effects, enabling reuse by the reinterview flow.
Add Device.reinterview() — creates a shadow device, runs discovery, swaps on success.
Add reinterviewing property and _reinterview_in_progress guard flag.
Trigger reinterview() automatically after a successful OTA update in update_firmware().
Enable fast polling during reinterview in poll_control_checkin_callback so sleepy devices are more likely to respond.
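As an illustration of the guard flag mentioned in this list, here is a minimal sketch. It assumes that a second call while a re-interview is running is simply ignored; whether the PR ignores, queues, or rejects such a call is not stated here. `ToyDevice`, `runs` and the `fail` parameter are hypothetical.

```python
import asyncio


class ToyDevice:
    def __init__(self) -> None:
        self._reinterview_in_progress = False
        self.runs = 0

    @property
    def reinterviewing(self) -> bool:
        return self._reinterview_in_progress

    async def reinterview(self, fail: bool = False) -> bool:
        if self._reinterview_in_progress:
            return False  # assumption: an overlapping call is a no-op
        self._reinterview_in_progress = True
        try:
            self.runs += 1
            await asyncio.sleep(0)  # placeholder for the discovery round-trips
            if fail:
                raise RuntimeError("discovery failed")
            return True
        finally:
            # Always clear the flag, otherwise one failed run blocks all later ones.
            self._reinterview_in_progress = False
```

Clearing the flag in `finally` is the point of the sketch: it covers both the success path and any exception raised during discovery.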

`zigpy/application.py`

Add _device_reinterviewed(old_device, shadow) — handles the atomic swap: DB removal, cleanup, quirk re-application, and event firing.
Add reinterview_device(ieee) — public API for upstream applications (e.g. ZHA) to trigger a re-interview.

Events

device_reinterviewed — fired after a successful re-interview, with the new (possibly quirked) device.
device_reinterview_failure — fired when re-interview fails, with the old device (still functional).
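A rough sketch of how an upstream application might consume these two events. `ToyApp` is a simplified name-based dispatcher written for this example, loosely modelled on `listener_event()`; it is not zigpy's implementation, and `ReinterviewListener` is hypothetical.

```python
class ToyApp:
    """Minimal name-based dispatcher; a stand-in, not zigpy's listener code."""

    def __init__(self) -> None:
        self._listeners = []

    def add_listener(self, listener: object) -> None:
        self._listeners.append(listener)

    def listener_event(self, name: str, *args) -> None:
        for listener in self._listeners:
            handler = getattr(listener, name, None)
            if handler is not None:
                handler(*args)


class ReinterviewListener:
    def __init__(self) -> None:
        self.log = []

    def device_reinterviewed(self, device: str) -> None:
        # Receives the new (possibly quirked) device: rebuild state from it.
        self.log.append(("ok", device))

    def device_reinterview_failure(self, device: str) -> None:
        # Receives the old, still functional device: nothing to rebuild.
        self.log.append(("failed", device))
```

A listener that only cares about one of the two events can simply omit the other method in this model.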

Test plan

Thought with `_save_device` not really doing anything for quirked devices

_save_device behavior for quirked devices (CLICK TO EXPAND)

This doesn't seem to be an issue here, but calling _save_device on an already-quirked device doesn't really save anything to the DB. It skips saving even the attribute cache (and the unsupported attributes). This isn't ideal and we should look at it at some point, though we do not want to save endpoints or clusters added or removed by quirks to the DB.

zigpy/zigpy/appdb.py

Lines 411 to 420 in 7d1ea41

    
        if isinstance(device, zigpy.quirks.BaseCustomDevice):
            await self._db.commit()
            return

        await self._save_endpoints(device)

        for ep in device.non_zdo_endpoints:
            await self._save_clusters(ep)
            await self._save_attribute_cache(ep)
            await self._save_unsupported_attributes(ep)

        await self._db.commit()

Why it's not an issue below (AI-generated):

How attributes are saved

The _save_device method in appdb.py has an early return for BaseCustomDevice instances (quirked devices), skipping endpoint, cluster, and attribute cache persistence. This is intentional — quirk-added clusters/endpoints should not be persisted to the DB, which serves as the source of truth for quirk matching.

This is not an issue for the reinterview flow. In device_initialized():

self.listener_event("raw_device_initialized", device)   # fires with raw shadow
device = zigpy.quirks.get_device(device)                 # quirk applied AFTER

raw_device_initialized fires with the raw shadow (a plain Device, not a BaseCustomDevice) before quirks are applied. So _save_device receives the raw device, the isinstance check is False, and the full save path runs:

_save_device(shadow) — shadow is a raw Device → saves original endpoints, clusters, and attribute cache (from read_attributes calls during _discover())
get_device(shadow) — returns quirked device with potentially modified clusters/endpoints
register_cluster_events — quirked device's clusters get event handlers hooked up, so future attribute reads/reports persist individually via on_attribute_read, on_attribute_updated, etc.

The DB ends up with the original (pre-quirk) endpoints and clusters as source of truth, plus the attribute cache populated during discovery. This matches how initial device join already works.
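The ordering argument above can be reproduced with a toy model. Everything here is a stand-in: the classes only mimic the `isinstance` relationship, `db` is a plain dict, and the quirk that adds cluster `0xFC00` is invented for the example.

```python
from __future__ import annotations


class Device:
    def __init__(self, ieee: str, clusters: set[int]) -> None:
        self.ieee = ieee
        self.clusters = set(clusters)


class BaseCustomDevice(Device):
    """Stand-in for a quirked device."""


def save_device(db: dict[str, list[int]], device: Device) -> bool:
    # Mirrors the early return: quirked devices are not written out in full.
    if isinstance(device, BaseCustomDevice):
        return False
    db[device.ieee] = sorted(device.clusters)
    return True


def get_device(device: Device) -> Device:
    # Hypothetical quirk that adds a manufacturer-specific cluster.
    return BaseCustomDevice(device.ieee, device.clusters | {0xFC00})


def device_initialized(db: dict[str, list[int]], device: Device) -> Device:
    save_device(db, device)    # stands in for the raw_device_initialized listener
    return get_device(device)  # quirk applied only after the raw save
```

Because the save runs on the raw object, the stored cluster list never contains the quirk-added cluster, which is the property the DB relies on for later quirk matching.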

codecov · 2026-03-15T23:22:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.53%. Comparing base (c065ff9) to head (da56821).

Additional details and impacted files

@@           Coverage Diff           @@
##              dev    #1789   +/-   ##
=======================================
  Coverage   99.53%   99.53%           
=======================================
  Files          64       64           
  Lines       13059    13124   +65     
=======================================
+ Hits        12998    13063   +65     
  Misses         61       61


TheJulianJES · 2026-03-16T05:12:24Z

This definitely needs a lot of cleanup and I'm still not sure about the zigpy side of the solution, but this seems to be working with these changes to ZHA and Core (they include the dynamic entity rediscovery):


TheJulianJES · 2026-03-19T02:52:32Z

+        # Re-interview the device after successful OTA to pick up
+        # any changes in clusters/endpoints/model and re-apply quirks
+        try:
+            await self.reinterview()
+        except Exception:  # noqa: BLE001
+            LOGGER.warning(
+                "Post-OTA re-interview failed for %r,"
+                " device may need manual re-interview",
+                self,
+                exc_info=True,
+            )


Side note: In the future, I'd like to add information to the DB containing which OTA/fw version a device was last interviewed with. We could also show repairs to re-interview devices if this post-OTA re-interview fails, for some reason. In a similar manner, we could also do this for quirks (and likely include explicit version bumps in quirks that actually need configuration).

Copilot

Pull request overview

Adds experimental “device re-interview” support to zigpy to refresh a device’s discovered state (endpoints/clusters/model/manufacturer/OTA version) at runtime—without removing/rejoining—by discovering via a temporary shadow Device and swapping it in on success.

Changes:

Adds Device.reinterview() with shadow-device discovery flow and integrates re-interview triggering after successful OTA updates.
Refactors initialization logic by extracting _discover() from _initialize() to reuse discovery without emitting initialization completion.
Adds application-level swap/finalization logic (_device_reinterviewed, _finalize_device) and new tests covering success/failure paths and the public API.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
`zigpy/device.py`	Adds re-interview flow, extracts `_discover()`, and triggers re-interview after OTA update.
`zigpy/application.py`	Adds device finalization helper and implements shadow→real device swap logic plus a public `reinterview_device()` API.
`tests/test_device.py`	Updates OTA tests to mock reinterview and adds new re-interview unit tests.
`tests/test_application.py`	Adds tests for `_device_reinterviewed()` swap behavior, DB removal ordering, group migration, and API delegation.


TheJulianJES · 2026-03-19T03:23:33Z

+        # attribute cache, group members, and relays)
+        if self._dblistener is not None:
+            old_device.remove_listener(self._dblistener)
+            await self._dblistener._remove_device(old_device)


This is intentional. _dblistener.remove_device doesn't exist and device_removed only enqueues the deletion from the DB. We need to actually make sure it's deleted before we later re-add DB data the fire-and-forget way.

We could also introduce some sort of flush method, which we can wait on here, but IMO this is fine.
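For comparison, a flush-style alternative could look roughly like the sketch below. It is a toy writer built on `asyncio.Queue`, with hypothetical names throughout (`ToyDbWriter`, `enqueue`, `flush`); it does not reflect how appdb is implemented.

```python
from __future__ import annotations

import asyncio


class ToyDbWriter:
    """Fire-and-forget writer with an awaitable flush(); names are hypothetical."""

    def __init__(self) -> None:
        self.rows: set[str] = set()
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker: asyncio.Task | None = None

    def start(self) -> None:
        self._worker = asyncio.create_task(self._run())

    async def stop(self) -> None:
        self._worker.cancel()
        try:
            await self._worker
        except asyncio.CancelledError:
            pass

    def enqueue(self, op: str, key: str) -> None:
        self._queue.put_nowait((op, key))  # returns before anything is written

    async def flush(self) -> None:
        await self._queue.join()  # resolves once every queued op was applied

    async def _run(self) -> None:
        while True:
            op, key = await self._queue.get()
            if op == "add":
                self.rows.add(key)
            else:
                self.rows.discard(key)
            self._queue.task_done()


async def demo() -> tuple[bool, bool]:
    db = ToyDbWriter()
    db.rows.add("device")
    db.start()
    db.enqueue("remove", "device")
    still_there = "device" in db.rows  # deletion is only queued at this point
    await db.flush()
    gone = "device" not in db.rows
    await db.stop()
    return still_there, gone
```

The sketch shows the ordering concern: right after `enqueue()` the row still exists, and only an awaited step guarantees it is gone before new data is added.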

Copilot

Pull request overview

Adds an experimental “device re-interview” mechanism to zigpy that can re-run discovery/quirk matching at runtime (notably after OTA), using a temporary shadow Device and swapping it in only after successful discovery.

Changes:

Introduces Device.reinterview() and refactors initialization discovery into a reusable _discover() phase.
Adds ControllerApplication.reinterview_device() plus swap/finalization helpers and new re-interview success/failure events.
Extends test coverage with end-to-end, failure, guard, and group/DB/relay migration scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
zigpy/device.py	Adds re-interview flow, extracts `_discover()`, triggers re-interview after successful OTA, and adjusts PollControl behavior during re-interview.
zigpy/application.py	Adds `_finalize_device()`, implements `_device_reinterviewed()` swap logic, and exposes `reinterview_device()` API.
tests/test_device.py	Updates OTA tests to account for re-interview call and adds comprehensive re-interview behavior tests.
tests/test_application.py	Adds tests for DB removal ordering, public API delegation, group migration, relay persistence, and failure propagation.


TheJulianJES · 2026-03-19T03:45:59Z

+        # reinterview() handles its own errors internally.
+        await self.reinterview()


I think it's more correct to keep the node descriptor request as part of the update. Basically every part of the interview process happens before self.reinterview() returns, so the update only fully completes after the re-interview. I think that's fine/good.

TheJulianJES · 2026-03-19T03:47:27Z

        self.listener_event("raw_device_initialized", device)
        device = zigpy.quirks.get_device(device)
        self.devices[device.ieee] = device
        if self._dblistener is not None:
            device.add_context_listener(self._dblistener)


Oh, I think this is actually a pre-existing issue ever since the PollControl stuff was added to the Device's constructor (#1621). It would also happen when we create a new Device object if we found a quirk at startup.

It shouldn't be a big issue though, as packets would only arrive at the new device. So it's "just" a small memory leak.

Yeah, see:

Clean up raw device callbacks after quirk wrapping #1795

We already clean up everything correctly in the re-interview PR. So, we don't strictly need that PR for this. It was just discovered here.

puddly · 2026-03-20T02:00:51Z

+
+            # Temporarily register the shadow in app.devices so it receives
+            # ZDO responses routed by packet_received().
+            self._application.devices[self._ieee] = shadow


This is an approach that I really dislike but we don't have much of a choice with how packet routing in zigpy is set up.

We could potentially rework packet_received to emit an event that Device objects subscribe to, but this might have a performance penalty if every received packet relies on 99/100 devices rejecting it. Maybe a filtered subscribe, keyed by device EUI64 with zigpy's application controller maintaining the NWK <-> EUI64 mapping?

I get it's not nice – unfortunately, the whole approach isn't great, but I'm not sure we should try and abstract this further. We'd still need to register and unregister the new and old device for routing, right? Like, we don't want the packets to go to the old device. We want to completely swap it for routing temporarily.

And for that, I kind of feel like this is a one-liner that's rather explicit. And if we fail (never get a node descriptor response), we revert it right below: https://github.com/TheJulianJES/zigpy/blob/edfcf556a8fc3756ad8a26a3f21f630bd208a477/zigpy/device.py#L338
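The filtered-subscribe idea raised in this thread could be sketched like this. It is a toy router, not a proposal for zigpy's actual `packet_received()`; `ToyRouter`, `learn()` and `subscribe()` are invented names.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Callable

Handler = Callable[[bytes], None]


class ToyRouter:
    """Dispatches packets by EUI64; the NWK -> EUI64 map lives in the router."""

    def __init__(self) -> None:
        self._nwk_to_eui64: dict[int, str] = {}
        self._subscribers: defaultdict[str, list[Handler]] = defaultdict(list)

    def learn(self, nwk: int, eui64: str) -> None:
        self._nwk_to_eui64[nwk] = eui64

    def subscribe(self, eui64: str, handler: Handler) -> Callable[[], None]:
        self._subscribers[eui64].append(handler)
        # The returned callable removes exactly this subscription again.
        return lambda: self._subscribers[eui64].remove(handler)

    def packet_received(self, src_nwk: int, payload: bytes) -> int:
        # One dict lookup instead of offering the packet to every device.
        eui64 = self._nwk_to_eui64.get(src_nwk)
        handlers = list(self._subscribers.get(eui64, ()))
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

Even in this model, a re-interview still has to unsubscribe the old device and subscribe the shadow, which is the same explicit swap the one-liner performs on `app.devices`.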

puddly · 2026-03-20T02:03:39Z

+                "Re-interview failed, keeping existing device",
+                exc_info=True,
+            )
+            self._application.listener_event("device_reinterview_failure", self)


Could we turn this into a typed event?

You mean with the new event system using emit?
I feel like we have to change all other events as well then, as they all affect ZHA. I guess we could do that before or after this PR.
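For context, a typed event could look roughly like the following. This is only a sketch with a hand-rolled emitter: `DeviceReinterviewFailedEvent`, its fields, and `ToyEmitter` are hypothetical and do not describe zigpy's event system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DeviceReinterviewFailedEvent:
    """Hypothetical typed event; field names are illustrative only."""

    ieee: str
    reason: str


class ToyEmitter:
    def __init__(self) -> None:
        self._handlers: dict = {}

    def on(self, event_type: type, handler: Callable) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def emit(self, event: object) -> None:
        # Dispatch on the event's class instead of a method-name string.
        for handler in self._handlers.get(type(event), []):
            handler(event)
```

Compared with a string-named `listener_event` call, the payload shape is checked by the dataclass, so a listener cannot silently receive the wrong arguments.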


TheJulianJES · 2026-04-13T22:49:13Z

TODO: Are clusters cleaned up properly? (e.g. OTA cluster processes matches / callback)

TheJulianJES added 3 commits March 15, 2026 23:58

Initial reinterviewing changes

2f03481

Initial tests

3fd895b

Move imports to the top...

c00b969

TheJulianJES added 3 commits March 16, 2026 04:54

WIP: Changes

67125f5

WIP: Register shadow device in app.devices[ieee]

14a8325

WIP: Do not fire device_initialized after reinterview. Let ZHA hand…

869c20e

…le that

TheJulianJES added 4 commits March 16, 2026 07:11

WIP: Simplify a bit

b546a02

WIP: Simplify a bit 2

ebb642d

Fix auto-init during reinterview

118ac1e

Reduce tries to 2 because 5 tries with long ZDO timeout is 2.5 mi…

18fccf5

…nutes if device is offline

TheJulianJES commented Mar 19, 2026


TheJulianJES requested a review from Copilot March 19, 2026 02:53

Copilot AI reviewed Mar 19, 2026


TheJulianJES added 6 commits March 19, 2026 04:08

Fix not clearing reinterview flag

8731c52

Shorten comment

af7a670

Clean up shadow device callbacks/tasks

187b583

Remove reinterview() exception handling after OTA

cdaae9c

Add more useful test

fb9b274

Combine tests

7830c6e

TheJulianJES requested a review from Copilot March 19, 2026 03:24

Copilot AI reviewed Mar 19, 2026


TheJulianJES added 4 commits March 19, 2026 04:41

Update test name

00650f5

Remove useless test

181fdac

Combine tests

c8bbcf1

Add comment

d8e25ad

TheJulianJES mentioned this pull request Mar 19, 2026

Clean up raw device callbacks after quirk wrapping #1795

Draft

TheJulianJES added 2 commits March 19, 2026 22:21

Explain sync call in comment

f9295a5

Use _finalize_device return value for new obj

17b5801

TheJulianJES added 5 commits March 19, 2026 22:23

Merge group loops

8a3e27e

Re-register DB listener if failed re-interview removed it

05ca08c

and remove listener before, in case it wasn't removed yet

cb15106

Only guard the actual re-discovery, not internal swapping

79bfb22

Fix stale docstring

edfcf55

puddly reviewed Mar 20, 2026


Merge branch 'dev' into tjj/reinterviewing

da56821

# Conflicts: # zigpy/device.py
