Force all deserialized objects to the oldest GC generation #19681
JukkaL merged 2 commits into python:master
Conversation
I just realized I did my measurements with the fixed-format cache, but I guess the numbers will be similar for the JSON cache.
JukkaL left a comment
Together with the fixed-format cache, `import torch` with a warm cache was ~90% faster than before for me, based on a quick experiment!
```python
# a hack, but it gives huge performance wins for large third-party
# libraries, like torch.
gc.collect()
gc.disable()
```
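For context, here is a minimal sketch of how the two halves of this trick fit together; only the `gc` calls are taken from the diff, and the surrounding function is illustrative, not mypy's actual API:

```python
import gc

def process_fresh_scc_sketch(deserialize_scc):
    # Before deserializing: flush any existing garbage, then pause the
    # collector so the flood of new objects doesn't trigger collections.
    gc.collect()
    gc.disable()
    try:
        deserialize_scc()
    finally:
        # After deserializing: re-enable the collector, then use the
        # freeze()/unfreeze() pair to move everything created above into
        # the oldest generation, so young-generation scans skip it.
        gc.enable()
        gc.freeze()    # move all tracked objects into the permanent generation
        gc.unfreeze()  # move them back, into the oldest generation
```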
Could we get here multiple times, if there are multiple dirty sub-DAGs? If yes, do you think it'll be a problem?
A quick workaround would be to do this at most N times per run (possibly N=1).
Yeah, I was thinking about this. FWIW, I don't think it will be a problem, since freeze/unfreeze are quite fast. Also, we may accidentally move some objects from previously processed stale SCCs into the oldest generation, but that is probably not so bad. Still, I think it is fine to start with just one pass per run and raise the limit as we get more data.
(With `mypy -c 'import torch'` we enter here only once.)
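A sketch of the "at most N times per run" guard suggested above (the names and the module-level counter are assumptions for illustration, not code from the PR):

```python
import gc

GC_HACK_LIMIT = 1  # N=1 to start; could be raised as more data comes in
_gc_hack_uses = 0

def maybe_pause_gc() -> bool:
    """Apply the collect()/disable() half of the hack at most N times per run."""
    global _gc_hack_uses
    if _gc_hack_uses >= GC_HACK_LIMIT:
        return False
    _gc_hack_uses += 1
    gc.collect()
    gc.disable()
    return True
```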
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
I am not sure what is happening, but for some reason after the GC `freeze()`/`unfreeze()` hack #19681 was merged, compiled tests are running twice as slow (on the GH runner; I also see a much smaller but visible slowdown locally). I have two theories:

* The constant overhead we add outweighs the savings when running thousands of tiny builds.
* The 8% of extra memory we use goes over the limit on the runner, because we were already very close to it.

In any case, I propose to try disabling this hack in most tests and see if it helps.
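One hedged way to disable the hack in most tests would be to gate it on an environment variable set by the test suite; the variable name below is made up for illustration, and the real change may use a different mechanism (e.g. an option flag):

```python
import gc
import os

# Hypothetical opt-out knob for the test suite; not the PR's actual mechanism.
if os.environ.get("MYPY_TEST_DISABLE_GC_HACK") != "1":
    gc.collect()
    gc.disable()
```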
This is a hack, but it gives a ~30% perf win for `mypy -c 'import torch'` on a warm run. This should not increase memory consumption too much, since we shouldn't create any cyclic garbage during deserialization (we do create some cyclic references, like `TypeInfo -> SymbolTable -> Instance -> TypeInfo`, but those are genuine long-lived objects).
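To make the last point concrete, here is a small self-contained demo (not mypy code) of the difference between a reachable cycle, like the `TypeInfo` chain above, and true cyclic garbage:

```python
import gc

class Node:
    def __init__(self) -> None:
        self.ref: "Node | None" = None

# A genuine long-lived cycle, analogous to TypeInfo -> ... -> TypeInfo:
# still reachable through a and b, so the collector must trace it but
# cannot free it.
a, b = Node(), Node()
a.ref, b.ref = b, a
gc.collect()  # the cycle survives; it is cyclic but not garbage

# Cyclic *garbage*: the same shape, but with no external references left.
c, d = Node(), Node()
c.ref, d.ref = d, c
del c, d
freed = gc.collect()  # returns the number of unreachable objects found
print(freed > 0)      # True: the unreachable cycle was detected and freed
```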