No sparse expr #1160 (Merged)

phkahler merged 3 commits into solvespace:master from rpavlik:no-sparse-expr on Jan 7, 2022

Conversation

@rpavlik (Contributor)

@rpavlik rpavlik commented Dec 23, 2021

In #1159 we used a sparse matrix to hold Expr pointers, when all we do with them is evaluate them into a sparse matrix of doubles. This PR stores them as a vector of triplets instead of a matrix, which makes it easy to evaluate them into a vector of triplets with double data, from which we can set the sparse matrix efficiently. This actually shaved about 0.1s off the test suite execution time (in RelWithDebInfo) vs. just #1159 on its own, so I think we probably want this one if we want that one.

@phkahler (Member)

You could probably use #pragma omp parallel for around that evaluation loop. Probably have to use a counter instead of an iterator, but that'd be fine.

Also, Eigen can use OpenMP as well; did your integration include passing that flag?

Thanks!

@rpavlik (Contributor, Author)

rpavlik commented Dec 24, 2021

I have zero experience with OpenMP beyond the build-system part, but that's a good point; I should make sure the Eigen build system integration turns on its OpenMP usage when allowed. I did briefly think those loops might be parallelizable, but figured thread-launch overhead would make it moot given how small they tend to be. I forgot about OpenMP there 😁

@phkahler (Member)

> I have zero experience with OpenMP beyond the build system part

Oh, for loops it's super easy. See this example from boolean.cpp:

void SShell::MakeClassifyingBsps(SShell *useCurvesFrom) {
#pragma omp parallel for
    for(int i = 0; i<surface.n; i++) {
        surface[i].MakeClassifyingBsp(this, useCurvesFrom);
    }
}

Any time you've got a loop with an index and the iterations' effects are independent, you can just stick that pragma above the loop. It goes faster with OpenMP, and is ignored (running single-threaded) on any compiler without OpenMP support. We're stuck at OpenMP version 2 because MS hasn't updated their implementation; otherwise I think some C++ iterators would be supported in addition to a regular counting loop.

A more complicated (yet still simple) example is this one, where we previously just called TriangulateInto on the same list; but you can't write to the same list from multiple threads:

void SShell::TriangulateInto(SMesh *sm) {
#pragma omp parallel for
    for(int i=0; i<surface.n; i++) {
        SSurface *s = &surface[i];
        SMesh m;
        s->TriangulateInto(this, &m);
        #pragma omp critical
        sm->MakeFromCopyOf(&m);
        m.Clear();
    }
}

In this case, instead of triangulating directly into sm, we triangulate into m, which is thread-local since it's declared inside the loop. We have to put a critical section around the call that combines sm with each m, since sm needs to be modified by one thread at a time. Again, this is completely transparent to a compiler that doesn't support the pragmas!

@rpavlik (Contributor, Author)

rpavlik commented Dec 25, 2021

Oh, that is pretty nice and easy. Yeah, I can go back in and add some at some point.

@rpavlik (Contributor, Author)

rpavlik commented Dec 27, 2021

Rebased on a slimmer #1159. It looks like Eigen auto-detects OpenMP availability, though it might prefer being told the number of real cores. (A safe bet is probably apparent cores divided by 2 on x86...) https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

I tried throwing some OpenMP magic on the eval loop.

@rpavlik rpavlik force-pushed the no-sparse-expr branch 3 times, most recently from ef18db1 to 902b6c7 on December 31, 2021 at 16:48
@rpavlik (Contributor, Author)

rpavlik commented Dec 31, 2021

Rebased.

@rpavlik rpavlik force-pushed the no-sparse-expr branch 2 times, most recently from 6923db4 to 3f38c38 on December 31, 2021 at 17:12
@rpavlik (Contributor, Author)

rpavlik commented Dec 31, 2021

OK, and fixed on MSVC.

@phkahler (Member)

phkahler commented Jan 7, 2022

Thanks @rpavlik!

@phkahler phkahler merged commit 985e4fb into solvespace:master Jan 7, 2022
@rpavlik rpavlik deleted the no-sparse-expr branch January 7, 2022 00:56
@ruevs (Member)

ruevs commented Jan 7, 2022

@rpavlik @phkahler this degrades performance! On the model from here #386 (comment), it takes 20s, while master took 18s before these changes. Did you profile?

@ruevs (Member)

ruevs commented Jan 7, 2022

I have not "bisected" to check which of the two commits slows it down.

@ruevs (Member)

ruevs commented Jan 7, 2022

Definitely two seconds slower (18s -> 20s), and the two commits are inseparable.

@ruevs (Member)

ruevs commented Jan 7, 2022

In my opinion we should reset master to 3d482f0

@rpavlik (Contributor, Author)

rpavlik commented Jan 7, 2022

That's very strange. Did you build with or without OpenMP? (I could believe that OpenMP might slow things down here if the threading and sync overhead was larger than just the evaluation cost.) I was seeing a little slowdown at some point during development, but I assumed it was because I was building with sanitizers on; I hadn't gotten to profiling without sanitizers, though I did profile and saw that evaluating the Jacobian was a main time sink. My initial goal was mostly just to make the code more regular and readable. I was pretty surprised Eigen accepted "Expr*" as a scalar type, and indeed it blows up pretty quickly if you try to use very much of the API. As the initial description said, this improved performance at least on the test suite on its own. It would be good to bisect and see which commit is the bad one: the first should be approximately a no-op, but is required by the second; the second is the "main" one; the third was @phkahler's idea, which seemed like a good one to me. But if a release build is seeing a consistent slowdown across multiple test cases, then certainly revert for now.

@ruevs (Member)

ruevs commented Jan 7, 2022

RelWithDebInfo, OpenMP on, and I can actually see the CPU load hitting 100% for a few moments while profiling; before this change it did not. So it does run multi-threaded, but it is slower. I will post screenshots later.

@rpavlik (Contributor, Author)

rpavlik commented Jan 7, 2022

OK, so that's almost certainly what it is: overhead from the threading, maybe, if it's hitting 100% but going slower. Interesting. I now have VS2022 installed on my home computer, so I might tinker with that this weekend. Alternately, you can tell @Wallbraker that I should use some work time on it ;)

@rpavlik (Contributor, Author)

rpavlik commented Jan 8, 2022

OK, it was the OpenMP commit. A PR with a revert of it is up now.

@rpavlik (Contributor, Author)

rpavlik commented Jan 8, 2022

Note that it does seem plausible that the real problem here is the one mentioned in the Eigen docs: because Eigen is so intensely optimized, SMT (hyper-threading) actually slows it down, but OpenMP defaults to a thread count equal to the number of logical cores, which, as a first approximation, is twice what it should be on a modern x86 machine. There is a way to adjust the number of threads used, but I didn't dig into it. This might bite us again sooner rather than later now that we have the Eigen rocket fuel onboard.

It would be really good to hook up CI to do performance regression testing. The benchmark tool is a good starting point. I haven't actually done performance regression testing before, so I'd be happy if someone else can contribute.

@ruevs (Member)

ruevs commented Jan 8, 2022

@rpavlik @phkahler
Below is my detailed profiling. The regression comes from b53d164 (Don't put Expr* in a sparse matrix, store triplet list instead.), and 985e4fb (adding OpenMP) simply did not help. Tested on this #386 (comment) after increasing MAX_UNKNOWNS to 4096.

As I said we should reset master to 3d482f0. Should I do it?

3d482f0 Fast - 17.7s

(Profiler screenshots: 202201082325_rev_3d482f0d5_Fast, 202201082325_rev_3d482f0d5_Fast_Detail)

Function Name Total CPU [unit, %] Self CPU [unit, %] Module
solvespace.exe (PID: 4788) 136647 (100.00%) 0 (0.00%) solvespace.exe
- Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > 75506 (55.26%) 75075 (54.94%) solvespace.exe
- Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering<int> >::factorize 130410 (95.44%) 52812 (38.65%) solvespace.exe
- [External Code] 136646 (100.00%) 5891 (4.31%) Multiple modules
- Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::InnerIterator::InnerIterator 382 (0.28%) 382 (0.28%) solvespace.exe
- [System Call] ntoskrnl.exe 236 (0.17%) 236 (0.17%) ntoskrnl.exe
- Eigen::SparseMatrix<double,0,int>::insertUncompressed 214 (0.16%) 177 (0.13%) solvespace.exe
- Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > 156 (0.11%) 150 (0.11%) solvespace.exe
- inflate_fast 158 (0.12%) 146 (0.11%) solvespace.exe
- Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > 408 (0.30%) 143 (0.10%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops 143 (0.10%) 143 (0.10%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Param,SolveSpace::hParam>::FindByIdNoOops 102 (0.07%) 102 (0.07%) solvespace.exe

b53d164 Slow 20.8s

(Profiler screenshots: 202201081756_rev_b53d16435_Slow, 202201081756_rev_b53d16435_Slow_Detail)

Function Name Total CPU [unit, %] Self CPU [unit, %] Module
solvespace.exe (PID: 16388) 162904 (100.00%) 0 (0.00%) solvespace.exe
- Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > 89652 (55.03%) 89109 (54.70%) solvespace.exe
- Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering<int> >::factorize 156611 (96.14%) 64615 (39.66%) solvespace.exe
- [External Code] 162902 (100.00%) 6094 (3.74%) Multiple modules
- Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int> const ,-1,1,1> >::InnerIterator::InnerIterator 470 (0.29%) 468 (0.29%) solvespace.exe
- [System Call] ntoskrnl.exe 268 (0.16%) 268 (0.16%) ntoskrnl.exe
- Eigen::SparseMatrix<double,0,int>::insertUncompressed 233 (0.14%) 180 (0.11%) solvespace.exe
- Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > 437 (0.27%) 154 (0.09%) solvespace.exe
- [External Call] ucrtbase.dll 149 (0.09%) 149 (0.09%) ucrtbase.dll
- Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > 149 (0.09%) 146 (0.09%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops 144 (0.09%) 144 (0.09%) solvespace.exe
- inflate_fast 151 (0.09%) 139 (0.09%) solvespace.exe

985e4fb Slow 20.7s

(Profiler screenshots: 202201081753_rev_985e4fb67a_SlowOpenMP, 202201081753_rev_985e4fb67a_SlowOpenMP_Detail)

Function Name Total CPU [unit, %] Self CPU [unit, %] Module
solvespace.exe (PID: 14924) 172417 (100.00%) 0 (0.00%) solvespace.exe
- Eigen::SparseMatrixBase<Eigen::Block<Eigen::SparseMatrix<double,0,int>,-1,1,1> >::dot<Eigen::Matrix<double,-1,1,0,-1,1> > 88794 (51.50%) 88341 (51.24%) solvespace.exe
- Eigen::SparseQR<Eigen::SparseMatrix<double,0,int>,Eigen::COLAMDOrdering<int> >::factorize 155813 (90.37%) 64704 (37.53%) solvespace.exe
- [External Code] 172417 (100.00%) 16307 (9.46%) Multiple modules
- Eigen::SparseCompressedBase<Eigen::Block<Eigen::SparseMatrix<double,0,int> const ,-1,1,1> >::InnerIterator::InnerIterator 424 (0.25%) 423 (0.25%) solvespace.exe
- [System Call] ntoskrnl.exe 214 (0.12%) 214 (0.12%) ntoskrnl.exe
- [External Call] ucrtbase.dll 172 (0.10%) 172 (0.10%) ucrtbase.dll
- Eigen::SparseMatrix<double,0,int>::insertUncompressed 219 (0.13%) 161 (0.09%) solvespace.exe
- Eigen::internal::coletree<Eigen::SparseMatrix<double,0,int>,Eigen::Matrix<int,-1,1,0,-1,1> > 156 (0.09%) 153 (0.09%) solvespace.exe
- Eigen::internal::assign_sparse_to_sparse<Eigen::SparseMatrix<double,0,int>,Eigen::SparseMatrix<double,0,int> > 432 (0.25%) 149 (0.09%) solvespace.exe
- SolveSpace::IdList<SolveSpace::Entity,SolveSpace::hEntity>::FindByIdNoOops 143 (0.08%) 143 (0.08%) solvespace.exe
- inflate_fast 131 (0.08%) 128 (0.07%) solvespace.exe

@phkahler (Member)

phkahler commented Jan 8, 2022

@ruevs please fix it. Unfortunately I already merged a couple of others since. I thought one of those reverted it?

@ruevs (Member)

ruevs commented Jan 8, 2022

OK @phkahler @rpavlik I rebased and force pushed master dropping:

  • b53d164 Don't put Expr* in a sparse matrix, store triplet list instead.
  • 985e4fb OpenMP-enable the evaluation of a jacobian.
  • 81b127e Revert "OpenMP-enable the evaluation of a jacobian."

The idea is still interesting and may be worth finding out why it is slower - but I think the reason may be rather subtle.

@rpavlik (Contributor, Author)

rpavlik commented Jan 9, 2022

Please don't force-push master in the future. It messes up everyone's local clones; just revert the commits instead.

I'm rather confused and frustrated that your profiling and mine found different regression causes. It was pretty clear in my testing that only the OpenMP commit was the problem; the other commits were neutral to performance, or improvements. If we're going to revert commits without review, we really need a standard set of performance tests to have some objective, reproducible data.

I appreciate the profiling data you posted, but aside from the top-level time from a pathological model, it doesn't really show me how the change was a problem. The models I tested with were smaller (less than a second to load, though I used the benchmark tool to process them repeatedly for sufficient accuracy) and are things I've actually 3D printed.

Has anyone had experience setting up a performance regression test?

@ruevs (Member)

ruevs commented Jan 9, 2022

Understood. In the future I will not force-push master, and will leave @phkahler to handle reverts.

As for performance profiling: how does the "4096" model behave for you with this change? Could it depend on architecture or compiler/compiler settings? Strange.

It is indeed pathological, but it is perhaps the cleanest "solver only" test I have. Why b53d164 slows it down for me I do not understand yet, but it does, by 10+%.

It is true that we need automated benchmark tests.

I spent a lot of time in the last few days going through the history and digging up all the performance-related changes and the models that were used to test them. For now I have created a new performance label and tagged everything I dug up so far.
https://github.com/solvespace/solvespace/issues?q=label%3Aperformance+sort%3Acreated-desc

I'm currently putting together a list similar to the NURBS one, but it's quite a bit of work to test all those models, so it'll take me some time... and life happens...

I'll also test this pull request, master as it is and #432 against the various models.

As for automated tests: @vespakoen tried it here in the CI, but @phkahler (correctly, in my opinion) pointed out that they do not belong in CI. Also, that pull request did not use the existing test project, which I think it should. I think "performance" should be an optional set of tests (just because they will be slow), similar to the normal ones but with "heavy" models and timing results shown. Unfortunately I do not have time to write them (for now).

@ruevs (Member)

ruevs commented Jan 9, 2022

By the way, I am not using the current solvespace-testsuite.exe as a performance benchmark at all, because all the test models in it are trivially simple; that was the idea. Just look through the PNGs in the test directory: they are correctness tests, not performance tests.

All 256 tests run in 3.3s (RelWithDebInfo) on my i5-7500! That is not enough "load" to test performance. In fact, if I redirect stderr to a file, the tests run in 2.9s; that is 0.4s of console I/O!

Below are 10 runs of this pull request versus master without it, with the output redirected to a file. No sparse expr ends up 0.0183s slower on average, but as I said, this is a totally meaningless test.

Run      No sparse expr   master (rebased 5efc148)
  1          2.893            2.886
  2          2.917            2.933
  3          2.889            2.891
  4          2.897            2.861
  5          2.894            2.860
  6          2.924            2.880
  7          2.917            2.868
  8          2.915            2.863
  9          2.881            2.897
 10          2.901            2.906
Average      2.9028           2.8845

The files
NoSparse_vs_Master_testsuite.zip

@ruevs (Member)

ruevs commented Jan 9, 2022

Here is the summary issue with the collected performance-testing models: #1186. I have not had the time to run all of them against this pull request, master, and #432.

@rpavlik (Contributor, Author)

rpavlik commented Jan 10, 2022

Great, thanks!
