Skip to content

Integrate formal SMT-based methods#322

Merged
viktormalik merged 10 commits intodiffkemp:masterfrom
FrNecas:fnecas-smt
Nov 26, 2024
Merged

Integrate formal SMT-based methods#322
viktormalik merged 10 commits intodiffkemp:masterfrom
FrNecas:fnecas-smt

Conversation

@FrNecas
Copy link
Copy Markdown
Collaborator

@FrNecas FrNecas commented Mar 18, 2024

This PR introduces an SMT-based comparison of short sequential snippets using Z3 solver. The general approach can be described as follows:

  • when we find a difference and no pattern is applicable, try to check equivalence using an SMT solver
  • the differing code snippets to be analyzed are determined based on searching for a synchronization point, i.e. a pair of instructions after which the code can be synchronized until the end of the basic block
  • we construct a formula expressing the equivalence of the identified snippets -- if the inputs are the same and the blocks are executed, the outputs of the blocks must be the same. Since checking satisfiability is a hard problem, we apply a timeout.
  • furthermore, the extension facilitates inverse-branching-condition pattern, e.g. it can detect a more complex change such as icmp slt %1, 101 being an inverse of icmp sgt %1, 100 if %1 is an integer.

Something to discuss:

  • The new component is off by default and must be enabled using --use-smt. Do we always want to build this extension? The problem is that it introduces z3 as a dependency which is quite significant. However, since SMTBlockComparator is tightly coupled with DifferentialFunctionComparator, trying to exclude it from the build unless explicitly turned on would require quite a lot of preprocessor if-s. Furthermore, even if it was conditionally removed from simpll, it would still be part of the python frontend CLI which could be quite confusing.

Depends on: #323 #325 (needed for correct functionality in some more complicated cases, e.g. the KABI experiment)

@FrNecas FrNecas requested a review from viktormalik March 18, 2024 18:48
@FrNecas FrNecas force-pushed the fnecas-smt branch 2 times, most recently from 0316ae0 to a02f38b Compare March 19, 2024 13:20
@FrNecas
Copy link
Copy Markdown
Collaborator Author

FrNecas commented Mar 19, 2024

Hm, trying to run this on the KABI experiments results in a segfault... I guess more debugging time :/

@lenticularis39
Copy link
Copy Markdown
Collaborator

Since LLVM is built for performance, it usually just segfaults when something goes wrong, one needs to extensively use GDB :D

@FrNecas FrNecas force-pushed the fnecas-smt branch 3 times, most recently from b9c8b3e to 40def05 Compare March 26, 2024 16:11
@FrNecas FrNecas force-pushed the fnecas-smt branch 4 times, most recently from 6468230 to faf9037 Compare April 16, 2024 11:39
@FrNecas FrNecas force-pushed the fnecas-smt branch 2 times, most recently from 2a17f91 to 9171e14 Compare April 22, 2024 14:19
@viktormalik
Copy link
Copy Markdown
Collaborator

  • The new component is off by default and must be enabled using --use-smt. Do we always want to build this extension? The problem is that it introduces z3 as a dependency which is quite significant. However, since SMTBlockComparator is tightly coupled with DifferentialFunctionComparator, trying to exclude it from the build unless explicitly turned on would require quite a lot of preprocessor if-s. Furthermore, even if it was conditionally removed from simpll, it would still be part of the python frontend CLI which could be quite confusing.

I would say, let's not complicate things now and always build the extension in. If we wanted to make the z3 dependency optional, probably the easiest way is to have an alternative (ifdef-ed) version of SMTBlockComparator which has empty implementations of the exported functions.

FWIW, I would also love to make it on by default but IIRC, the overhead is still quite high, especially due to finding the snippets.

Copy link
Copy Markdown
Collaborator

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the first 2 commits so far. There's a couple of points for discussion so I'm posting them straight away.

Also, could you please expand the commit messages? I prefer a commit to explain what it does (even if it is obvious) so that its intentions are clear when we get back to it in future.

diffkemp/cli.py Outdated
Comment on lines +122 to +140
compare_ap.add_argument("--use-smt",
help="Use SMT-based checking of short snippets",
action="store_true")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point for discussion: should we make the contents of this PR a new "builtin pattern" instead of a new CLI option? I can imagine that we'll eventually have more patterns which require SMT and we'd want to allow selecting just some. Using any such pattern would automatically turn on the SMT support.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea. I'd suggest keeping it as an option for now -- all patterns are enabled by default, right? Making this a pattern that's disabled by default feels a bit inconsistent. Furthermore, I perceive the builtin patterns as something rather lightweight and quite fast which definitely is not the case here. We can obviously change it later if we have more SMT-based patterns.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting idea. I'd suggest keeping it as an option for now -- all patterns are enabled by default, right? Making this a pattern that's disabled by default feels a bit inconsistent.

No, there are patterns which are off by default, namely control-flow-only and type-casts.

Furthermore, I perceive the builtin patterns as something rather lightweight and quite fast which definitely is not the case here. We can obviously change it later if we have more SMT-based patterns.

Yes, but it feels like changing it later would remove the new --use-smt option (or at least change) which is not a nice breakage of the CLI. I'm a bit indecisive here, too, but I don't see any fundamental difference between this and built-in patterns. We'd just have to come up with a name for it.

// Try to find a match by moving one of the instruction iterators
// forward (find a code relocation).
if (config.Patterns.Relocations
if (config.Patterns.Relocations && !suppressRelocationsAndSMT
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suppressRelocationsAndSMT parameter feels a bit clumsy, TBH. What if we find out that we need to turn off more builtin patterns in future?

Two alternatives come to mind:

  1. Turn off config.Patterns.Relocations and config.UseSMT for the particular call of cmpBasicBlocksFromInstructions.
  2. Make SMTComparator use a separate instance of DifferentialFunctionComparator having a different config.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't a big fan of this either but couldn't think of a better solution. The first alternative feels quite dirty as well, Config represents the user configuration and should therefore IMO be constant throughout the whole run. Furthermore, I am not sure if it is even possible to use it, cmpBasicBlocks is declared as const inside LLVM API, i.e., it cannot change the values of data fields. So even if we changed config to non-const, it probably wouldn't be possible to modify its fields inside SMTBlockComparator since it is called from cmpBasicBlocks.

I can try the second alternative but there may be other drawbacks. One thing that comes to mind -- DifferentialFunctionComparator has some state, e.g. some patterns keep track of various information. We would somehow have to transfer this information (the current constructor is rather basic and accepts only a few arguments). Maintaing the list of fields that need to be transferred for the new DifferentialFunctionComparator instance to be fully functional seems quite tedious. WDYT?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config represents the user configuration and should therefore IMO be constant throughout the whole run.

Fair point. OTOH, I think that it could be beneficial to have a way to run DifferentialFunctionComparator methods (preferably in the current context) with a different configuration, especially for tasks other than the analysis itself. For instance, when searching for the code snippets, I can imagine that we'd like to skip difference localisation as it just slows it down. Likewise, we may not want top count extended statistics for some runs.

Copy link
Copy Markdown
Collaborator

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a couple of lower-level comments for the code itself.

Also, I'm wondering if the SMT-based inverse branching condition "pattern" could be somehow implemented closer to the original inverse branching condition pattern or it needs to be kept separate.

Comment on lines +37 to +46
# Try to find system-wide Z3 (e.g. local build)
find_package(Z3 CONFIG)
if (NOT ${Z3_FOUND})
# Use our FindZ3.cmake (e.g. nix build)
find_package(Z3 REQUIRED MODULE)
endif()

target_include_directories(runTests PRIVATE ${Z3_CXX_INCLUDE_DIRS})
target_link_libraries(runTests gtest simpll-lib ${llvm_libs} ${Z3_LIBRARIES})
target_compile_options(runTests PRIVATE ${Z3_COMPONENT_CXX_FLAGS})
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, Z3 is already linked with simpll-lib, no need to link it here again. In addition, tests do not use any Z3 includes so we don't need to add include directories either.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this as well but compilation of tests fails without this, which is weird... Maybe we are doing something wrong in our test setup?

@FrNecas FrNecas force-pushed the fnecas-smt branch 2 times, most recently from 17d2227 to c068e05 Compare October 27, 2024 19:01
@FrNecas
Copy link
Copy Markdown
Collaborator Author

FrNecas commented Oct 27, 2024

I've rebased the PR and addressed the comments (hopefully all of them, other than the more high-level comment if this should be a pattern or an option -- which we probably need to discuss more closely). This is probably ready to be reviewed again.

@FrNecas
Copy link
Copy Markdown
Collaborator Author

FrNecas commented Nov 3, 2024

Option converted to a pattern (named smt-sequential-blocks -- if you find a better name, happy to rename). I decided to keep the smt-timeout option not specific to the pattern for now, as I believe that even if we add new patterns based on SMT solving (like the more complex branching stuff), it doesn't hurt if they share the same timeout -- if a user increases the timeout, they are fine with diffkemp running longer and so it doesn't really matter if the longer runtime comes from our current sequential-blocks pattern or from some other more complex patterns.

Copy link
Copy Markdown
Collaborator

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just some naming stuff to resolve, otherwise it's pretty much ready.

The new component is intended to be used whenever a difference is found
and no built-in (and custom) pattern can be used. It is then going to
try to refute or confirm the found semantic difference using an SMT
solver. Since SMT solving is time consuming (and the solution is more of
a proof-of-concept for now), we hide this capability behind an option.
Before encoding the equivalence of differing blocks into an SMT formula,
we need to find the code blocks to compare. We do so similarly to how
relocations are detected, i.e., we search for a synchronization point
after which the remainder of the basic blocks are semantically
equivalent.
We encode the semantic equivalence of the detected blocks into an SMT
formula as follows. The final formula has the form precondition &&
block1 && block2 && !postcondition, where precondition encodes the
equivalence of input variables of the block (based on varmap), block1
and block2 encode the semantics of the compared blocks and postcondition
encodes the equivalence of output variables of the blocks (i.e. those
variables that are used outside the block).

This commit adds implementation of the precondition and encoding of
blocks. The interesting part is encoding of blocks, where we make use of
LLVM IR's SSA property to build a new SMT variable for each LLVM IR register.
The encoding makes use of pointer addresses of the LLVM instructions.
Using this encoding, we can conveniently compare the operation performed
by each instruction type.
There are two types of output variables that need to be considered when
consturcting the postcondition:
 1) variables used in the same basic block
 2) variables used outside the current basic block

For the first case, we have already analyzed these variables/instruction
when searhcing for a synchronization point, i.e., we already know which
variables should be semantically equivalent. Based on this, we can
construct the postcondition formula.

On the other hand, for the second case, we do not know the relationship
of variables in the left and right block, therefore, for now, we only
consider the simplest case where there is exactly 1 variable of such
type on both the left and the right side.
This extension aims to facilitate a more advanced case of
inverse-branch-condition pattern, where the condition is not refactored
to its syntactic negation but it's still semantically inverted, for
example changing x < 101 with x > 100 (where x is an integer) and then
swapping branches. To facilitate analysis of these cases, whenever the
initial SMT solving phase, we check if there are possibly invertible
conditions. If there are, we try to invert them and check semantic
equivalence again.
With the use of SMT solving, we potentially explore a lot more paths
leading to a larger vertex graph (due to search for a synchronization
point). This is a temporary workaround before we figure out how to
reduce the call graph size in a more efficient way.
Based on experiments, SMT solving in the observed cases only takes a
couple of milliseconds, or seconds at worst. We also try to reduce
unnecessary SMT solving by checking whether we support all the
instructions in the blocks to be analyzed.
@FrNecas FrNecas force-pushed the fnecas-smt branch 2 times, most recently from 978a904 to b68f599 Compare November 17, 2024 14:03
@FrNecas FrNecas requested a review from viktormalik November 17, 2024 14:10
Copy link
Copy Markdown
Collaborator

@viktormalik viktormalik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

We have CI failure, though 😢

It was decided that this is basically a pattern and that in the future,
more patterns may build on SMT solving which would make the option
confusing. The smt-timeout option is kept instead of making it pattern
specific -- for now it seems fine if all patterns based on SMT solving
share the same timeout.
@viktormalik
Copy link
Copy Markdown
Collaborator

The CI failure is not caused by this PR, seems to be some Ubuntu repo issue.

@viktormalik viktormalik merged commit 298da70 into diffkemp:master Nov 26, 2024
@viktormalik
Copy link
Copy Markdown
Collaborator

And we're done here 🥳

Many thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants