Conversation
|
You could probably use #pragma omp parallel for around that evaluation loop. Probably have to use a counter instead of an iterator, but that'd be fine. Also Eigen can use OpenMP as well, did your integration include passing that flag? Thanks! |
|
I have zero experience with openmp beyond the build system part, but that's a good point, I should make sure the eigen build system integration turns on its openmp usage when allowed. I did briefly think that those loops might be parallelizable but figured thread launch overhead would make it moot given how small those tend to be. I forgot about openmp there 😁 |
Oh, for loops it's super easy. See this example from boolean.cpp. Any time you've got a loop with an index, and the iterations' effects are independent, you can just stick that pragma above the loop. It goes faster with OMP, and is ignored and runs single-threaded on any compiler without OMP. We're stuck at OpenMP version 2 because MS hasn't updated their implementation; otherwise I think some C++ iterators would be supported in addition to a regular counting loop. A more complicated (yet simple) example is this one, where we previously just did triangulateInto the same list, but you can't be writing to the same list from multiple threads. In this case, instead of triangulating directly into sm, we triangulate into m, which is thread-local since it's inside the loop. We have to put a critical section around the call that combines sm with each m, since sm needs to be modified one thread at a time. Again, this is completely transparent to a compiler that doesn't support the pragmas! |
|
Oh that is pretty nice and easy. Yeah I can go back in and add some at some point. |
7f8c929 to
c26044a
Compare
|
Rebased on a slimmer #1159. Looks like Eigen auto-detects OpenMP availability, though it might prefer being told the number of real cores. (A safe bet is probably apparent cores divided by 2 on x86...) https://eigen.tuxfamily.org/dox/TopicMultiThreading.html I tried throwing some OpenMP magic on the eval loop. |
ef18db1 to
902b6c7
Compare
|
rebased |
6923db4 to
3f38c38
Compare
|
OK and fixed on msvc |
Should be the same right now, but this is clearer.
3f38c38 to
2301ba2
Compare
|
Thanks @rpavlik! |
|
@rpavlik @phkahler this degrades performance! On the model from here #386 (comment) it takes 20s, while master took 18s before these changes. Did you profile? |
|
I have not "bisected" to check which of the two commits slows it down. |
|
Definitely two seconds slower (18s -> 20s), and the two commits are inseparable. |
|
In my opinion we should reset master to 3d482f0 |
|
That's very strange. Did you build with or without OpenMP? (I could believe that OpenMP might slow it down here, if the threading and sync overhead was larger than just the evaluation cost.) I was seeing a little slowdown at some point in dev, but I assumed it was because I was building with sanitizers on; I hadn't gotten to profiling without sanitizers, though I did do profiling and saw that evaluation of the Jacobian was a main time sink. My initial goal was mostly just to make the code more regular and readable. I was pretty surprised Eigen accepted "Expr*" as a scalar type, and indeed it blows up pretty quickly if you try to use very much of the API. As the initial description said, this improved performance, at least on the test suite, on its own. Would be good to bisect and see which commit is the bad one: the first one should be approximately a no-op, but is required by the second; the second is the "main" one; the third was @phkahler's idea, which seemed like a good one to me. But if a release build is seeing consistent slowdown across multiple test cases, then certainly revert for now. |
|
RelWithDebInfo, OpenMP on, and I can actually see the CPU load hitting 100% for a few moments in the profiling. And before this it did not. So it does run multi-threaded, but it is slower. Will post screenshots later. |
|
OK, so that's almost certainly what it is: overhead from the threading, maybe, if it's hitting 100% but going slower. Interesting. I now have VS2022 installed on my home computer, so I might tinker with that this weekend. Alternatively, you can tell @Wallbraker that I should use some work time on it ;) |
|
OK, it was the OpenMP commit. PR with a revert of that is up now. |
|
Note that it does seem plausible here that the real problem is the one mentioned in the Eigen docs: because Eigen is so intensely optimized, SMT (hyper-threading) actually slows it down, but OpenMP defaults to a thread count equal to the number of logical cores, which, as a first approximation, is twice as many as it should be on a modern x86 machine. There is a way to adjust the number of threads used, but I didn't dig into it. But this might bite us again sooner rather than later, now that we have the Eigen rocket fuel onboard. Would be really good to hook up CI to do performance regression testing. The benchmark tool is a good start/part. I haven't actually done performance regression testing before, so I'd be happy if someone else can contribute. |
|
@rpavlik @phkahler As I said, we should reset master to 3d482f0. Should I do it?

3d482f0  Fast  17.7s
b53d164  Slow  20.8s
985e4fb  Slow  20.7s
|
|
@ruevs please fix it. Unfortunately I already merged a couple of others since. I thought one of those reverted it? |
|
OK @phkahler @rpavlik I rebased and force pushed master dropping:
The idea is still interesting and may be worth finding out why it is slower - but I think the reason may be rather subtle. |
|
Please don't force push master in the future; it messes up everyone's local clones. Just revert the commits instead. I'm rather confused and frustrated that your profiling and mine found different regression causes. It was pretty clear in my testing that only the OpenMP commit was the problem; the other commits were neutral to performance improvements. If we're going to revert commits without review, we really need a standard set of performance tests to have some objective, reproducible data. I appreciate the profiling data you posted, but aside from the top-level time from a pathological model, it doesn't really show me how the change was a problem. The models I tested with were smaller (less than a second to load, though I used the benchmark tool to process them repeatedly for sufficient accuracy) and are things I've actually 3D printed. Has anyone had experience setting up a performance regression test? |
|
Understood. In the future I will not force push master and will leave @phkahler to handle reverts. As for performance profiling: how does the "4096" model behave for you against this change? Might it depend on architecture and compiler/compiler settings? Strange. It is indeed pathological, but it is perhaps the cleanest "solver only" test I have. Why b53d164 slows it down for me I do not understand yet, but it does, by 10+%. It is true that we need automated benchmark tests. I spent a lot of time in the last few days going through the history and digging up all the performance-related changes and the models that were used to test them. For now I have created a new I'm currently putting together a list similar to the NURBS one, but it's quite a bit of work to test all those models, so it'll take me some time... and life happens... I'll also test this pull request, master as it is, and #432 against the various models. As for automated tests: @vespakoen tried it here in the CI, but @phkahler (correctly, in my opinion) pointed out that they do not belong in CI. Also, that pull request did not use the existing test project, which I think it should. I think "performance" should be an optional set of tests (just because they will be slow), similar to the normal ones, but with "heavy" models and timing results shown. Unfortunately I do not have time to write them (for now). |
|
By the way, I am not using the current solvespace-testsuite.exe as a performance benchmark at all, because all the test models in it are trivially simple. That was the idea: just look through the PNGs in the test directory; they are correctness tests, not performance tests. All 256 tests run in 3.3s (RelWithDebInfo) on my i5-7500! This is not enough "load" to test performance. In fact if I redirect
In fact, below are 10 runs of this pull request versus master without it, with the output redirected to a file.
The files |
|
great, thanks! |
In #1159 we used a sparse matrix to hold Expr pointers, when all we do with them is evaluate them into a sparse matrix of doubles. This stores them as a vector of triplets instead of a matrix, which makes it easy to evaluate them into a vector of triplets with double data, from which we can set the sparse matrix efficiently. This actually shaved about 0.1s off of the test suite execution time (in RelWithDebInfo) vs. just #1159 on its own, so I think we probably want this one if we want that one.