Integrate formal SMT-based methods#322
Conversation
0316ae0 to
a02f38b
Compare
|
Hm, trying to run this on the KABI experiments results in a segfault... I guess more debugging time :/ |
|
Since LLVM is built for performance, it usually just segfaults when something goes wrong, one needs to extensively use GDB :D |
b9c8b3e to
40def05
Compare
6468230 to
faf9037
Compare
2a17f91 to
9171e14
Compare
I would say, let's not complicate things now and always build the extension in. If we wanted to make the z3 dependency optional, probably the easiest way is to have an alternative (ifdef-ed) version of FWIW, I would also love to make it on by default but IIRC, the overhead is still quite high, especially due to finding the snippets. |
viktormalik
left a comment
There was a problem hiding this comment.
Reviewed the first 2 commits so far. There's a couple of points for discussion so I'm posting them straight away.
Also, could you please expand the commit messages? I prefer a commit to explain what it does (even if it is obvious) so that its intentions are clear when we get back to it in future.
diffkemp/cli.py
Outdated
| compare_ap.add_argument("--use-smt", | ||
| help="Use SMT-based checking of short snippets", | ||
| action="store_true") |
There was a problem hiding this comment.
Point for discussion: should we make the contents of this PR a new "builtin pattern" instead of a new CLI option? I can imagine that we'll eventually have more patterns which require SMT and we'd want to allow selecting just some. Using any such pattern would automatically turn on the SMT support.
There was a problem hiding this comment.
Interesting idea. I'd suggest keeping it as an option for now -- all patterns are enabled by default, right? Making this a pattern that's disabled by default feels a bit inconsistent. Furthermore, I perceive the builtin patterns as something rather lightweight and quite fast which definitely is not the case here. We can obviously change it later if we have more SMT-based patterns.
There was a problem hiding this comment.
Interesting idea. I'd suggest keeping it as an option for now -- all patterns are enabled by default, right? Making this a pattern that's disabled by default feels a bit inconsistent.
No, there are patterns which are off by default, namely control-flow-only and type-casts.
Furthermore, I perceive the builtin patterns as something rather lightweight and quite fast which definitely is not the case here. We can obviously change it later if we have more SMT-based patterns.
Yes, but it feels like changing it later would remove the new --use-smt option (or at least change) which is not a nice breakage of the CLI. I'm a bit indecisive here, too, but I don't see any fundamental difference between this and built-in patterns. We'd just have to come up with a name for it.
| // Try to find a match by moving one of the instruction iterators | ||
| // forward (find a code relocation). | ||
| if (config.Patterns.Relocations | ||
| if (config.Patterns.Relocations && !suppressRelocationsAndSMT |
There was a problem hiding this comment.
The suppressRelocationsAndSMT parameter feels a bit clumsy, TBH. What if we find out that we need to turn off more builtin patterns in future?
Two alternatives come to mind:
- Turn off
config.Patterns.Relocationsandconfig.UseSMTfor the particular call ofcmpBasicBlocksFromInstructions. - Make
SMTComparatoruse a separate instance ofDifferentialFunctionComparatorhaving a different config.
There was a problem hiding this comment.
Yeah, I wasn't a big fan of this either but couldn't think of a better solution. The first alternative feels quite dirty as well, Config represents the user configuration and should therefore IMO be constant throughout the whole run. Furthermore, I am not sure if it is even possible to use it, cmpBasicBlocks is declared as const inside LLVM API, i.e., it cannot change the values of data fields. So even if we changed config to non-const, it probably wouldn't be possible to modify its fields inside SMTBlockComparator since it is called from cmpBasicBlocks.
I can try the second alternative but there may be other drawbacks. One thing that comes to mind -- DifferentialFunctionComparator has some state, e.g. some patterns keep track of various information. We would somehow have to transfer this information (the current constructor is rather basic and accepts only a few arguments). Maintaing the list of fields that need to be transferred for the new DifferentialFunctionComparator instance to be fully functional seems quite tedious. WDYT?
There was a problem hiding this comment.
Configrepresents the user configuration and should therefore IMO be constant throughout the whole run.
Fair point. OTOH, I think that it could be beneficial to have a way to run DifferentialFunctionComparator methods (preferably in the current context) with a different configuration, especially for tasks other than the analysis itself. For instance, when searching for the code snippets, I can imagine that we'd like to skip difference localisation as it just slows it down. Likewise, we may not want top count extended statistics for some runs.
viktormalik
left a comment
There was a problem hiding this comment.
Adding a couple of lower-level comments for the code itself.
Also, I'm wondering if the SMT-based inverse branching condition "pattern" could be somehow implemented closer to the original inverse branching condition pattern or it needs to be kept separate.
| # Try to find system-wide Z3 (e.g. local build) | ||
| find_package(Z3 CONFIG) | ||
| if (NOT ${Z3_FOUND}) | ||
| # Use our FindZ3.cmake (e.g. nix build) | ||
| find_package(Z3 REQUIRED MODULE) | ||
| endif() | ||
|
|
||
| target_include_directories(runTests PRIVATE ${Z3_CXX_INCLUDE_DIRS}) | ||
| target_link_libraries(runTests gtest simpll-lib ${llvm_libs} ${Z3_LIBRARIES}) | ||
| target_compile_options(runTests PRIVATE ${Z3_COMPONENT_CXX_FLAGS}) |
There was a problem hiding this comment.
IIUC, Z3 is already linked with simpll-lib, no need to link it here again. In addition, tests do not use any Z3 includes so we don't need to add include directories either.
There was a problem hiding this comment.
I would expect this as well but compilation of tests fails without this, which is weird... Maybe we are doing something wrong in our test setup?
17d2227 to
c068e05
Compare
|
I've rebased the PR and addressed the comments (hopefully all of them, other than the more high-level comment if this should be a pattern or an option -- which we probably need to discuss more closely). This is probably ready to be reviewed again. |
|
Option converted to a pattern (named |
viktormalik
left a comment
There was a problem hiding this comment.
Looks great! Just some naming stuff to resolve, otherwise it's pretty much ready.
The new component is intended to be used whenever a difference is found and no built-in (and custom) pattern can be used. It is then going to try to refute or confirm the found semantic difference using an SMT solver. Since SMT solving is time consuming (and the solution is more of a proof-of-concept for now), we hide this capability behind an option.
Before encoding the equivalence of differing blocks into an SMT formula, we need to find the code blocks to compare. We do so similarly to how relocations are detected, i.e., we search for a synchronization point after which the remainder of the basic blocks are semantically equivalent.
We encode the semantic equivalence of the detected blocks into an SMT formula as follows. The final formula has the form precondition && block1 && block2 && !postcondition, where precondition encodes the equivalence of input variables of the block (based on varmap), block1 and block2 encode the semantics of the compared blocks and postcondition encodes the equivalence of output variables of the blocks (i.e. those variables that are used outside the block). This commit adds implementation of the precondition and encoding of blocks. The interesting part is encoding of blocks, where we make use of LLVM IR's SSA property to build a new SMT variable for each LLVM IR register. The encoding makes use of pointer addresses of the LLVM instructions. Using this encoding, we can conveniently compare the operation performed by each instruction type.
There are two types of output variables that need to be considered when consturcting the postcondition: 1) variables used in the same basic block 2) variables used outside the current basic block For the first case, we have already analyzed these variables/instruction when searhcing for a synchronization point, i.e., we already know which variables should be semantically equivalent. Based on this, we can construct the postcondition formula. On the other hand, for the second case, we do not know the relationship of variables in the left and right block, therefore, for now, we only consider the simplest case where there is exactly 1 variable of such type on both the left and the right side.
This extension aims to facilitate a more advanced case of inverse-branch-condition pattern, where the condition is not refactored to its syntactic negation but it's still semantically inverted, for example changing x < 101 with x > 100 (where x is an integer) and then swapping branches. To facilitate analysis of these cases, whenever the initial SMT solving phase, we check if there are possibly invertible conditions. If there are, we try to invert them and check semantic equivalence again.
With the use of SMT solving, we potentially explore a lot more paths leading to a larger vertex graph (due to search for a synchronization point). This is a temporary workaround before we figure out how to reduce the call graph size in a more efficient way.
Based on experiments, SMT solving in the observed cases only takes a couple of milliseconds, or seconds at worst. We also try to reduce unnecessary SMT solving by checking whether we support all the instructions in the blocks to be analyzed.
978a904 to
b68f599
Compare
viktormalik
left a comment
There was a problem hiding this comment.
Looks good!
We have CI failure, though 😢
It was decided that this is basically a pattern and that in the future, more patterns may build on SMT solving which would make the option confusing. The smt-timeout option is kept instead of making it pattern specific -- for now it seems fine if all patterns based on SMT solving share the same timeout.
|
The CI failure is not caused by this PR, seems to be some Ubuntu repo issue. |
|
And we're done here 🥳 Many thanks! |
This PR introduces an SMT-based comparison of short sequential snippets using Z3 solver. The general approach can be described as follows:
icmp slt %1, 101being an inverse oficmp sgt %1, 100if%1is an integer.Something to discuss:
--use-smt. Do we always want to build this extension? The problem is that it introducesz3as a dependency which is quite significant. However, sinceSMTBlockComparatoris tightly coupled withDifferentialFunctionComparator, trying to exclude it from the build unless explicitly turned on would require quite a lot of preprocessor if-s. Furthermore, even if it was conditionally removed from simpll, it would still be part of the python frontend CLI which could be quite confusing.Depends on: #323 #325 (needed for correct functionality in some more complicated cases, e.g. the KABI experiment)