No sparse expr #1160 (Merged)

phkahler merged 3 commits into solvespace:master from rpavlik:no-sparse-expr on Jan 7, 2022

Conversation

@rpavlik (Contributor)

@rpavlik rpavlik commented Dec 23, 2021

In #1159 we used a sparse matrix to hold Expr pointers, when all we do with them is evaluate them into a sparse matrix of doubles. This PR stores them as a vector of triplets instead of a matrix, which makes it easy to evaluate them into a vector of triplets with double data, from which we can set the sparse matrix efficiently. This actually shaved about 0.1s off the test suite execution time (in RelWithDebInfo) vs. just #1159 on its own, so I think we probably want this one if we want that one.

@phkahler (Member)

You could probably use #pragma omp parallel for around that evaluation loop. Probably have to use a counter instead of an iterator, but that'd be fine.

Also, Eigen can use OpenMP as well; did your integration include passing that flag?

Thanks!

@rpavlik (Contributor, Author)

rpavlik commented Dec 24, 2021

I have zero experience with OpenMP beyond the build-system part, but that's a good point; I should make sure the Eigen build system integration turns on its OpenMP usage when allowed. I did briefly think those loops might be parallelizable, but figured thread-launch overhead would make it moot given how small they tend to be. I forgot about OpenMP there 😁

@phkahler (Member)

> I have zero experience with OpenMP beyond the build system part

Oh, for loops it's super easy. See this example from boolean.cpp:

void SShell::MakeClassifyingBsps(SShell *useCurvesFrom) {
#pragma omp parallel for
    for(int i = 0; i<surface.n; i++) {
        surface[i].MakeClassifyingBsp(this, useCurvesFrom);
    }
}

Any time you've got a loop with an index and the iterations' effects are independent, you can just stick that pragma above the loop. It goes faster with OpenMP, and is ignored (running single-threaded) on any compiler without OpenMP support. We're stuck at OpenMP version 2 because MS hasn't updated their implementation; otherwise I think some C++ iterators would be supported in addition to a regular counting loop.

A more complicated (yet still simple) example is this one, where we previously just called TriangulateInto on the same list; but you can't write to the same list from multiple threads:

void SShell::TriangulateInto(SMesh *sm) {
#pragma omp parallel for
    for(int i=0; i<surface.n; i++) {
        SSurface *s = &surface[i];
        SMesh m;
        s->TriangulateInto(this, &m);
        #pragma omp critical
        sm->MakeFromCopyOf(&m);
        m.Clear();
    }
}

In this case, instead of triangulating directly into sm, we triangulate into m, which is thread-local since it's declared inside the loop. We have to put a critical section around the call that combines sm with each m, since sm needs to be modified by one thread at a time. Again, this is completely transparent to a compiler that doesn't support the pragmas!

@rpavlik (Contributor, Author)

rpavlik commented Dec 25, 2021

Oh, that is pretty nice and easy. Yeah, I can go back in and add some at some point.

@rpavlik (Contributor, Author)

rpavlik commented Dec 27, 2021

Rebased on a slimmer #1159. It looks like Eigen auto-detects OpenMP availability, though it might prefer being told the number of real cores. (A safe bet is probably apparent cores divided by 2 on x86...) https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

I tried throwing some OpenMP magic on the eval loop.

@rpavlik rpavlik force-pushed the no-sparse-expr branch 3 times, most recently from ef18db1 to 902b6c7 on December 31, 2021 at 16:48
@rpavlik (Contributor, Author)

rpavlik commented Dec 31, 2021

Rebased.

@rpavlik rpavlik force-pushed the no-sparse-expr branch 2 times, most recently from 6923db4 to 3f38c38 on December 31, 2021 at 17:12
@rpavlik (Contributor, Author)

rpavlik commented Dec 31, 2021

OK, and fixed on MSVC.

@phkahler (Member)

phkahler commented Jan 7, 2022

Thanks @rpavlik!

@phkahler phkahler merged commit 985e4fb into solvespace:master Jan 7, 2022
@rpavlik rpavlik deleted the no-sparse-expr branch January 7, 2022 00:56
@ruevs (Member)

ruevs commented Jan 7, 2022

@rpavlik @phkahler this degrades performance! On the model from here #386 (comment), it takes 20s, while master took 18s before these changes. Did you profile?

@ruevs (Member)

ruevs commented Jan 7, 2022

I have not "bisected" to check which of the two commits slows it down.

@ruevs (Member)

ruevs commented Jan 7, 2022

Definitely two seconds slower (18s -> 20s), and the two commits are inseparable.

@ruevs (Member)

ruevs commented Jan 7, 2022

In my opinion we should reset master to 3d482f0

@rpavlik (Contributor, Author)

rpavlik commented Jan 7, 2022

That's very strange. Did you build with or without OpenMP? (I could believe that OpenMP might slow things down here if the threading and sync overhead was larger than just the evaluation cost.) I was seeing a little slowdown at some point during development, but I assumed it was because I was building with sanitizers on; I hadn't gotten to profiling without sanitizers, though I did profile and saw that evaluating the Jacobian was a main time sink. My initial goal was mostly just to make the code more regular and readable. I was pretty surprised Eigen accepted "Expr*" as a scalar type, and indeed it blows up pretty quickly if you try to use very much of the API. As the initial description said, this improved performance at least on the test suite on its own. It would be good to bisect and see which commit is the bad one: the first should be approximately a no-op, but is required by the second; the second is the "main" one; the third was @phkahler's idea, which seemed like a good one to me. But if a release build is seeing a consistent slowdown across multiple test cases, then certainly revert for now.

@ruevs (Member)

ruevs commented Jan 7, 2022

RelWithDebInfo, OpenMP on, and I can actually see the CPU load hitting 100% for a few moments while profiling; before this change it did not. So it does run multi-threaded, but it is slower. I will post screenshots later.

@rpavlik (Contributor, Author)

rpavlik commented Jan 7, 2022

OK, so that's almost certainly what it is: overhead from the threading, maybe, if it's hitting 100% but going slower. Interesting. I now have VS2022 installed on my home computer, so I might tinker with that this weekend. Alternately, you can tell @Wallbraker that I should use some work time on it ;)

@rpavlik (Contributor, Author)

rpavlik commented Jan 8, 2022

OK, it was the OpenMP commit. A PR with a revert of it is up now.

@rpavlik (Contributor, Author)

rpavlik commented Jan 8, 2022

Note that it does seem plausible that the real problem here is the one mentioned in the Eigen docs: because Eigen is so intensely optimized, SMT (hyper-threading) actually slows it down, but OpenMP defaults to a thread count equal to the number of logical cores, which, as a first approximation, is twice what it should be on a modern x86 machine. There is a way to adjust the number of threads used, but I didn't dig into it. This might bite us again sooner rather than later now that we have the Eigen rocket fuel onboard.

It would be really good to hook up CI to do performance regression testing. The benchmark tool is a good starting point. I haven't actually done performance regression testing before, so I'd be happy if someone else can contribute.

@ruevs (Member)

ruevs commented Jan 8, 2022

@rpavlik @phkahler
Below is my detailed profiling. The regression comes from b53d164 (Don't put Expr* in a sparse matrix, store triplet list instead.), and 985e4fb (adding OpenMP) simply did not help. Tested on this #386 (comment) after increasing MAX_UNKNOWNS to 4096.

As I said we should reset master to 3d482f0. Should I do it?

3d482f0 Fast - 17.7s

(Profiler screenshots: 202201082325_rev_3d482f0d5_Fast, 202201082325_rev_3d482f0d5_Fast_Detail)

Function Name Total CPU [unit, %] Self CPU [unit, %] Module
solvespace.exe (PID: 4788) 136647 (100.00%) 0 (0.00%) solvespace.exe
- Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > 75506 (55.26%) 75075 (54.94%) solvespace.exe
- Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering<int> >::factorize 130410 (95.44%) 52812 (38.65%) solvespace.exe
- [External Code] 136646 (100.00%) 5891 (4.31%) Multiple modules
- Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::InnerIterator::InnerIterator 382 (0.28%) 382 (0.28%) solvespace.exe
- [System Call] ntoskrnl.exe 236 (0.17%) 236 (0.17%) ntoskrnl.exe
- Eigen::SparseMatrix<double,0,int>::insertUncompressed 214 (0.16%) 177 (0.13%) solvespace.exe
- Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > 156 (0.11%) 150 (0.11%) solvespace.exe
- inflate_fast 158 (0.12%) 146 (0.11%) solvespace.exe
- Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > 408 (0.30%) 143 (0.10%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops 143 (0.10%) 143 (0.10%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Param,SolveSpace::hParam>::FindByIdNoOops 102 (0.07%) 102 (0.07%) solvespace.exe

b53d164 Slow 20.8s

(Profiler screenshots: 202201081756_rev_b53d16435_Slow, 202201081756_rev_b53d16435_Slow_Detail)

Function Name Total CPU [unit, %] Self CPU [unit, %] Module
solvespace.exe (PID: 16388) 162904 (100.00%) 0 (0.00%) solvespace.exe
- Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > 89652 (55.03%) 89109 (54.70%) solvespace.exe
- Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering<int> >::factorize 156611 (96.14%) 64615 (39.66%) solvespace.exe
- [External Code] 162902 (100.00%) 6094 (3.74%) Multiple modules
- Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int> const ,-1,1,1> >::InnerIterator::InnerIterator 470 (0.29%) 468 (0.29%) solvespace.exe
- [System Call] ntoskrnl.exe 268 (0.16%) 268 (0.16%) ntoskrnl.exe
- Eigen::SparseMatrix<double,0,int>::insertUncompressed 233 (0.14%) 180 (0.11%) solvespace.exe
- Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > 437 (0.27%) 154 (0.09%) solvespace.exe
- [External Call] ucrtbase.dll 149 (0.09%) 149 (0.09%) ucrtbase.dll
- Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > 149 (0.09%) 146 (0.09%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops 144 (0.09%) 144 (0.09%) solvespace.exe
- inflate_fast 151 (0.09%) 139 (0.09%) solvespace.exe

985e4fb Slow 20.7s

(Profiler screenshots: 202201081753_rev_985e4fb67a_SlowOpenMP, 202201081753_rev_985e4fb67a_SlowOpenMP_Detail)

Function Name Total CPU [unit, %] Self CPU [unit, %] Module
solvespace.exe (PID: 14924) 172417 (100.00%) 0 (0.00%) solvespace.exe
- Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > 88794 (51.50%) 88341 (51.24%) solvespace.exe
- Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering<int> >::factorize 155813 (90.37%) 64704 (37.53%) solvespace.exe
- [External Code] 172417 (100.00%) 16307 (9.46%) Multiple modules
- Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int> const ,-1,1,1> >::InnerIterator::InnerIterator 424 (0.25%) 423 (0.25%) solvespace.exe
- [System Call] ntoskrnl.exe 214 (0.12%) 214 (0.12%) ntoskrnl.exe
- [External Call] ucrtbase.dll 172 (0.10%) 172 (0.10%) ucrtbase.dll
- Eigen::SparseMatrix<double,0,int>::insertUncompressed 219 (0.13%) 161 (0.09%) solvespace.exe
- Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > 156 (0.09%) 153 (0.09%) solvespace.exe
- Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > 432 (0.25%) 149 (0.09%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops 143 (0.08%) 143 (0.08%) solvespace.exe
- inflate_fast 131 (0.08%) 128 (0.07%) solvespace.exe

@phkahler (Member)

phkahler commented Jan 8, 2022

@ruevs please fix it. Unfortunately I already merged a couple of others since. I thought one of those reverted it?

@ruevs (Member)

ruevs commented Jan 8, 2022

OK @phkahler @rpavlik I rebased and force pushed master dropping:

  • b53d164 Don't put Expr* in a sparse matrix, store triplet list instead.
  • 985e4fb OpenMP-enable the evaluation of a jacobian.
  • 81b127e Revert "OpenMP-enable the evaluation of a jacobian."

The idea is still interesting and may be worth finding out why it is slower - but I think the reason may be rather subtle.

@rpavlik (Contributor, Author)

rpavlik commented Jan 9, 2022

Please don't force-push master in the future. It messes up everyone's local clones; just revert the commits instead.

I'm rather confused and frustrated that your profiling and mine found different regression causes. It was pretty clear in my testing that only the OpenMP commit was the problem; the other commits were neutral to performance, or improvements. If we're going to revert commits without review, we really need a standard set of performance tests to have some objective, reproducible data.

I appreciate the profiling data you posted, but aside from the top-level time from a pathological model, it doesn't really show me how the change was a problem. The models I tested with were smaller (less than a second to load, though I used the benchmark tool to process them repeatedly for sufficient accuracy) and are things I've actually 3D printed.

Has anyone had experience setting up a performance regression test?

@ruevs (Member)

ruevs commented Jan 9, 2022

Understood. In the future I will not force-push master, and will leave @phkahler to handle reverts.

As for performance profiling: how does the "4096" model behave for you with this change? Could it depend on architecture or compiler/compiler settings? Strange.

It is indeed pathological, but it is perhaps the cleanest "solver only" test I have. Why b53d164 slows it down for me I do not understand yet, but it does, by 10+%.

It is true that we need automated benchmark tests.

I spent a lot of time in the last few days going through the history and digging up all the performance-related changes and the models that were used to test them. For now I have created a new performance label and tagged everything I dug up so far.
https://github.com/solvespace/solvespace/issues?q=label%3Aperformance+sort%3Acreated-desc

I'm currently putting together a list similar to the NURBS one, but it's quite a bit of work to test all those models, so it'll take me some time... and life happens...

I'll also test this pull request, master as it is and #432 against the various models.

As for automated tests: @vespakoen tried it here in the CI, but @phkahler (correctly, in my opinion) pointed out that they do not belong in CI. Also, that pull request did not use the existing test project, which I think it should. I think "performance" should be an optional set of tests (just because they will be slow), similar to the normal ones but with "heavy" models and timing results shown. Unfortunately I do not have time to write them (for now).

@ruevs (Member)

ruevs commented Jan 9, 2022

By the way, I am not using the current solvespace-testsuite.exe as a performance benchmark at all, because all the test models in it are trivially simple; that was the idea. Just look through the PNGs in the test directory: they are correctness tests, not performance tests.

All 256 tests run in 3.3s (RelWithDebInfo) on my i5-7500! That is not enough "load" to test performance. In fact, if I redirect stderr to a file, the tests run in 2.9s; that is 0.4s of console I/O!

Below are 10 runs of this pull request versus master without it, with the output redirected to a file. No sparse expr ends up 0.0183s slower on average, but as I said, this is a totally meaningless test.

Run      No sparse expr   master (rebased 5efc148)
  1          2.893            2.886
  2          2.917            2.933
  3          2.889            2.891
  4          2.897            2.861
  5          2.894            2.860
  6          2.924            2.880
  7          2.917            2.868
  8          2.915            2.863
  9          2.881            2.897
 10          2.901            2.906
Average      2.9028           2.8845

The files
NoSparse_vs_Master_testsuite.zip

@ruevs (Member)

ruevs commented Jan 9, 2022

Here is the summary issue with the collected performance-testing models: #1186. I have not had the time to run all of them against this pull request, master, and #432.

@rpavlik (Contributor, Author)

rpavlik commented Jan 10, 2022

Great, thanks!
