test: add stress tests for initialization#23289
Conversation
d4d2bed to
7e70350
Compare
7e70350 to
1888c8d
Compare
maflcko
left a comment
There was a problem hiding this comment.
Strong concept ACK. I tried doing this manually for a few releases in the past, but a functional test simplifies that a lot.
|
Strong Concept ACK Functional testing, unit testing, fuzz testing and now stress testing: great to see! Fault injection testing and mutation testing next in line? :) |
lsilva01
left a comment
There was a problem hiding this comment.
Tested ACK 1888c8d on Ubuntu 20.04
brunoerg
left a comment
There was a problem hiding this comment.
tACK 1888c8dc40d1dbbffe2f522749151ca8d0efee00
dd1f1b5 to
71ba24a
Compare
|
Code Review ACK 71ba24a724556af02901d2aa06a71d24e18f6f3c This is great, and is very helpful for testing initialization! Thank you James! Naive ideaSince GDB has a Python API, perhaps someone can use that for more granular stress testing (shouldn't be in this PR since it seems like quite a large undertaking) |
|
Concept ACK, very cool idea. I also encountered the intermittent |
1768b27 to
c870ee0
Compare
On the one hand, holding off on merging test coverage because there may be a bug in master seems wrong. On the other hand, spurious test failures in CI for a hard-to-track-down bug will be annoying. For context and neither here nor there: this bug seems pretty minor - under very particular circumstances (SIGTERM during just the right init point) it may cause a user to have to reindex. But that said, we should certainly have a better understanding of why it's happening. |
|
If txindex is causing issues here, maybe submit it in a follow-up pull request? |
c870ee0 to
ff2fede
Compare
To allow easier access to the node's datadir and debug logs.
ff2fede to
b32f5a5
Compare
b32f5a5 to
e1220b7
Compare
I see different things locally: For me, the failure happens in roughly 1-2% of the runs in the |
e1220b7 to
9c055d2
Compare
|
Concept ACK |
9c055d2 to
d5f0dd1
Compare
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
d5f0dd1 to
d9803f7
Compare
|
Okay, I've addressed a bunch of feedback and disabled txindex for the time being. I think this is a good first pass on this test provided all the non-deterministic failures have been removed. @mzumsande/@dongcarl if you'd like to PR the indexing change and fix you're welcome to, otherwise I can take care of it shortly. |
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
I opened #23365 to fix the intermittent issue. |
|
Concept ACK |
|
Anything left here? I think as things stand this is a stable test that adds some good coverage. AFAIK there's no outstanding feedback aside from a few nits and marginal suggestions. |
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
|
Code review ACK d9803f7 |
d9803f7 test: add stress tests for initialization (James O'Beirne) 23f8561 test: add node.chain_path and node.debug_log_path (James O'Beirne) Pull request description: In the course of coming up with a test plan for bitcoin#23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework... so here's a new test. This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by - interrupting init with SIGTERM after certain log statements, - interrupting init at random points, and - starting init with some essential data missing (block files, block indices, etc.) to test init error paths. As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors). --- Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after `loadblk` thread start: ``` 2021-10-15T21:14:51.295000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start' 36 │ 2021-10-15T21:14:51.296000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up 37 │ 2021-10-15T21:14:51.493000Z TestFramework (INFO): terminating node after 110 log lines seen 38 │ 2021-10-15T21:14:51.625000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start' 39 │ 2021-10-15T21:14:51.625000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up ------> [[ FAILURE HERE ]] 2021-10-15T21:15:21.626000Z TestFramework (WARNING): missed line {bail_line}; bailing now after {num_lines} lines ``` and then fails to start up afterwards. Combined logs showing `Error: txindex best block of the index goes beyond pruned data`, when the node under test is not pruned: ``` node0 2021-10-15T21:16:51.848439Z [shutoff] [validationinterface.cpp:244] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=1014bc4ff4917602ae53d10e9dfe230af4b7d52a6cdaa8a47798b9c288180907 node0 2021-10-15T21:16:51.848954Z [shutoff] [init.cpp:302] [Shutdown] Shutdown: done test 2021-10-15T21:16:51.882000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File "/home/james/src/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main self.run_test() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 87, in run_test check_clean_start() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 60, in check_clean_start node.wait_for_rpc_connection() File "/home/james/src/bitcoin/test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection raise FailedToStartError(self._node_msg( test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization test 2021-10-15T21:16:51.882000Z TestFramework (DEBUG): Closing down network thread test 2021-10-15T21:16:51.933000Z TestFramework (INFO): Stopping nodes test 2021-10-15T21:16:51.933000Z TestFramework.node0 (DEBUG): Stopping node node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) ``` ACKs for top commit: laanwj: Code review ACK d9803f7 Tree-SHA512: 4d80dc399daf199a6222e81e47d12d830dc7af07355eddbb7f52479a676a645b8d3d45093ff54a9295f01a163b2f4fe0e038e83fc269969e03d4cfda69eaf111
|
I'm surprised this was merged, given that there were no ACKs on the latest branch, and there were still unresolved approach/implementation issues and comments, eg: |
|
This is also the cause of the recent sporadic failures in our CI. See #23646. |
9600ea0 test: Add edge case of pruning up to index height (Martin Zumsande) 698c524 index: Fix backwards search for bestblock (Martin Zumsande) Pull request description: This PR attempts to fix an intermittent Init issue encountered during the stress testing of #23289, which relates to the pruning-compatible filter reconstruction logic introduced in #15946. The problem would occur when the node starts with `-txindex=1` but `ThreadSync` is interrupted after it sets `m_best_block_index` to Genesis, and before it gets do any further work. In that case, during the next restart of the node, an Init error would be thrown because `BaseIndex::Init()` tries to backtrack from the tip to the last block which has been successfully indexed (here: Genesis), but the backtracking logic didn't work properly in this case: The loop `while (block_to_test && block->pprev && (block->pprev->nStatus & BLOCK_HAVE_DATA))` checks if a predecessor exists **before** performing the check `block_to_test == block` and then possbily setting `prune_violation = false` If `block_to_test` and `block` are the Genesis block this check will not be reached because `block->pprev` does not exist. To reproduce this bug on regtest: 1) start a node with a fresh datadir using `-txindex=1` (or any other index) 2) stop and restart without any index 3) mine a block 3) stop and restart again with the index enabled ->InitError `Error: txindex best block of the index goes beyond pruned data. (...)` Fix this by requiring that we have the data for the block of the current iteration `block` (instead of requiring it for the predecessor `block->pprev`) That way, the check for `block_to_test == block` is also reached when `block_to_test` is the Genesis block. No longer requiring the data of `block->pprev` also means that we can now prune up to `m_best_block_index` height without requiring a reindex (one block more than before). I added this edge case to `feature_blockfilterindex_prune.py`, the new version should fail on master. ACKs for top commit: ryanofsky: Partial code review ACK 9600ea0 for the code change, not the test changes. (Test changes are indirect and little over my head.) It seems obvious that previous code `prune_violation = true, while (block->pprev)` would incorrectly detect a prune violation at the genesis block, and the fix here make sense and looks correct. Tree-SHA512: c717f372cee8fd49718b1b18bfe237aa6ba3ff4468588c10e1272d7a2ef3981d10af4e57de51dec295e2ca72d441bc6c2812f7990011a94d7f818775e3ff1a38
9600ea0 test: Add edge case of pruning up to index height (Martin Zumsande) 698c524 index: Fix backwards search for bestblock (Martin Zumsande) Pull request description: This PR attempts to fix an intermittent Init issue encountered during the stress testing of bitcoin#23289, which relates to the pruning-compatible filter reconstruction logic introduced in bitcoin#15946. The problem would occur when the node starts with `-txindex=1` but `ThreadSync` is interrupted after it sets `m_best_block_index` to Genesis, and before it gets do any further work. In that case, during the next restart of the node, an Init error would be thrown because `BaseIndex::Init()` tries to backtrack from the tip to the last block which has been successfully indexed (here: Genesis), but the backtracking logic didn't work properly in this case: The loop `while (block_to_test && block->pprev && (block->pprev->nStatus & BLOCK_HAVE_DATA))` checks if a predecessor exists **before** performing the check `block_to_test == block` and then possbily setting `prune_violation = false` If `block_to_test` and `block` are the Genesis block this check will not be reached because `block->pprev` does not exist. To reproduce this bug on regtest: 1) start a node with a fresh datadir using `-txindex=1` (or any other index) 2) stop and restart without any index 3) mine a block 3) stop and restart again with the index enabled ->InitError `Error: txindex best block of the index goes beyond pruned data. (...)` Fix this by requiring that we have the data for the block of the current iteration `block` (instead of requiring it for the predecessor `block->pprev`) That way, the check for `block_to_test == block` is also reached when `block_to_test` is the Genesis block. No longer requiring the data of `block->pprev` also means that we can now prune up to `m_best_block_index` height without requiring a reindex (one block more than before). I added this edge case to `feature_blockfilterindex_prune.py`, the new version should fail on master. ACKs for top commit: ryanofsky: Partial code review ACK 9600ea0 for the code change, not the test changes. (Test changes are indirect and little over my head.) It seems obvious that previous code `prune_violation = true, while (block->pprev)` would incorrectly detect a prune violation at the genesis block, and the fix here make sense and looks correct. Tree-SHA512: c717f372cee8fd49718b1b18bfe237aa6ba3ff4468588c10e1272d7a2ef3981d10af4e57de51dec295e2ca72d441bc6c2812f7990011a94d7f818775e3ff1a38
a14d537 test: add stress tests for initialization (James O'Beirne) 3f80ebc test: add node.chain_path and node.debug_log_path (James O'Beirne) Pull request description: In the course of coming up with a test plan for #23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework... so here's a new test. This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by - interrupting init with SIGTERM after certain log statements, - interrupting init at random points, and - starting init with some essential data missing (block files, block indices, etc.) to test init error paths. As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors). --- Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after `loadblk` thread start: ``` 2021-10-15T21:14:51.295000Z TestFramework (INFO): Starting node and will exit after line 'loadblk thread start' 36 │ 2021-10-15T21:14:51.296000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up 37 │ 2021-10-15T21:14:51.493000Z TestFramework (INFO): terminating node after 110 log lines seen 38 │ 2021-10-15T21:14:51.625000Z TestFramework (INFO): Starting node and will exit after line 'txindex thread start' 39 │ 2021-10-15T21:14:51.625000Z TestFramework.node0 (DEBUG): bitcoind started, waiting for RPC to come up ------> [[ FAILURE HERE ]] 2021-10-15T21:15:21.626000Z TestFramework (WARNING): missed line {bail_line}; bailing now after {num_lines} lines ``` and then fails to start up afterwards. Combined logs showing `Error: txindex best block of the index goes beyond pruned data`, when the node under test is not pruned: ``` node0 2021-10-15T21:16:51.848439Z [shutoff] [validationinterface.cpp:244] [ChainStateFlushed] Enqueuing ChainStateFlushed: block hash=1014bc4ff4917602ae53d10e9dfe230af4b7d52a6cdaa8a47798b9c288180907 node0 2021-10-15T21:16:51.848954Z [shutoff] [init.cpp:302] [Shutdown] Shutdown: done test 2021-10-15T21:16:51.882000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File "/home/james/src/bitcoin/test/functional/test_framework/test_framework.py", line 132, in main self.run_test() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 87, in run_test check_clean_start() File "/home/james/src/bitcoin/./test/functional/stress_init.py", line 60, in check_clean_start node.wait_for_rpc_connection() File "/home/james/src/bitcoin/test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection raise FailedToStartError(self._node_msg( test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization test 2021-10-15T21:16:51.882000Z TestFramework (DEBUG): Closing down network thread test 2021-10-15T21:16:51.933000Z TestFramework (INFO): Stopping nodes test 2021-10-15T21:16:51.933000Z TestFramework.node0 (DEBUG): Stopping node node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) node0 stderr Error: txindex best block of the index goes beyond pruned data. Please disable the index or reindex (which will download the whole blockchain again) ``` ACKs for top commit: laanwj: Code review ACK a14d537 Tree-SHA512: 4d80dc399daf199a6222e81e47d12d830dc7af07355eddbb7f52479a676a645b8d3d45093ff54a9295f01a163b2f4fe0e038e83fc269969e03d4cfda69eaf111
8904f17 test: enable txindex in feature_init (James O'Beirne) 93db6d8 test: feature_init: retain debug.log and improve detection (James O'Beirne) 24fcf6e test: feature_init: tweak all .ldb files of a certain type (James O'Beirne) a8ffbc0 test: add TestNode.wait_for_debug_log (James O'Beirne) a2fb62b test: introduce TestNode.debug_log_bytes (James O'Beirne) Pull request description: Hopefully fixes #23646. This makes a few changes to feature_init (along the lines of the discussion here: bitcoin/bitcoin#23289 (comment)). The changes are detailed in the specific commits, but a summary might be: - more robust log-line detection (at the expense of maybe being slower) - retain all debug.log content - perturb .ldb files in a more complete way Top commit has no ACKs. Tree-SHA512: 9c8900998fdb96d3cc92bd30c4eeda2a47e9787e6432546d458c17f24eb9e2b6b3eb573db225172888ca71d7a62260f497f83827597b982b9e35c5452cb6e514
This allows filters to be reconstructed when the best known block is the Genesis block without needing to reindex. It fixes Init errors seen in bitcoin#23289.
Summary: ``` This change bangs init around like the Fonz hitting a jukebox. ``` Backport of [[bitcoin/bitcoin#23289 | core#23289]]. Test Plan: ninja check-functional Reviewers: #bitcoin_abc, PiRK, sdulfari Reviewed By: #bitcoin_abc, PiRK, sdulfari Subscribers: PiRK Differential Revision: https://reviews.bitcoinabc.org/D12598
In the course of coming up with a test plan for #23280, I thought it would be neat to include a Python snippet showing how I tested the initialization process. I quickly realized I was reinventing the functional test framework... so here's a new test.
This change bangs init around like the Fonz hitting a jukebox. It adds some interesting (read: lazy and random) coverage for the initialization process by
As far as I can tell, some of these code paths are uncovered otherwise (namely the startup errors).
Incidentally, I think I may have uncovered some kind of a bug or race condition with indexing initialization based on an intermittent failure in this testcase. This test sometimes fails after shutting down immediately after
loadblkthread start:and then fails to start up afterwards. Combined logs showing
Error: txindex best block of the index goes beyond pruned data, when the node under test is not pruned: