
RocksDB instance/file storage#6203

Merged
aothms merged 130 commits into v0.8.0 from
tfk-rocksdb-storage
Sep 11, 2025

Conversation

@aothms
Member

@aothms aothms commented Feb 21, 2025

One of the primary concerns when dealing with IFC is memory usage. IfcOpenShell is not particularly careful in that regard, but this likely affects the majority of libraries due to the entity-relationship model of EXPRESS. We see this for example in the buildingSMART validation service (which uses IfcOpenShell), where the memory footprint (a) limits running multiple tasks in parallel, (b) imposes a quite restrictive maximum file size on end users, and (c) still results in a hefty cloud bill.

One of the earliest open source implementations of IFC, the open source BIMserver written in Java (https://github.com/opensourceBIM/BIMserver), has used a key-value store since its inception to offload memory to disk and essentially support infinitely large models.

Recently within IfcOpenShell we started a similar effort to use RocksDB as an alternative side-by-side storage model.


Conceptually this appears really simple. Where it gets harder is in the iterators and maps, and in realizing this in a backwards-compatible fashion while still benefiting from the lazy on-disk opportunities that the RocksDB key-value store offers.

Therefore I've come up with a couple of new template classes:

  • `rocksdb_map_adapter<K, V>(rocksdb::DB*, Str prefix)` (`Mapping[K -> V]`) Takes a RocksDB prefix and exposes a somewhat std::map-compatible interface over it that (de)serializes K and V into strings and writes/reads them from the key-value store.
  • `map_transformer<Fn>(Mapping basis)` (`Mapping[basis::KeyType -> invoke_result<Fn, V>]`) Primarily deals with persistence of pointers. Instances are stored by their identity in the KV store, but a cache exists so that instance pointer addresses are stable, i.e. they're created only once for the lifetime of the file, just like in the in-memory situation.
  • `map_variant<Mappings...>(basis*)` (`common_type<Mappings...>`) Unifies maps and iterators from the two storage backends into a consistent class that is std::variant-backed, both for the maps and for the iterators.
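A rough Python sketch of how the first two adapters compose; a plain dict stands in for the RocksDB handle, (de)serialization and the actual RocksDB API are elided, and the class names merely mirror the C++ templates:

```python
from collections.abc import Mapping

class RocksdbMapAdapter(Mapping):
    """Map-like view over all keys sharing a prefix in a key-value store."""
    def __init__(self, db, prefix):
        self.db, self.prefix = db, prefix
    def __getitem__(self, key):
        return self.db[self.prefix + key]      # (de)serialization elided
    def __iter__(self):
        n = len(self.prefix)
        return (k[n:] for k in self.db if k.startswith(self.prefix))
    def __len__(self):
        return sum(1 for _ in self)

class MapTransformer(Mapping):
    """Applies fn to values on access and caches the result, so the object
    returned for a key is stable for the lifetime of the file."""
    def __init__(self, basis, fn):
        self.basis, self.fn, self.cache = basis, fn, {}
    def __getitem__(self, key):
        if key not in self.cache:
            self.cache[key] = self.fn(self.basis[key])
        return self.cache[key]
    def __iter__(self):
        return iter(self.basis)
    def __len__(self):
        return len(self.basis)
```

For example, a guid index would then be `MapTransformer(RocksdbMapAdapter(db, "g|"), deserialize)`: lookups hit the store lazily, but repeated lookups of the same key return the identical object.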

This allows us to have a `std::unordered_map<std::string, Instance*>` in `in_memory_file_storage`,

a lazy RocksDB-backed mapping interface `map_transformer<std::string, Instance*>(rocksdb_map_adapter("g|"))` in `rocksdb_file_storage`,

and then in our File class a `map_variant(in_memory_file_storage::by_guid_t, rocksdb_file_storage::by_guid_t)`. This means that, practically speaking, even on the C++ side the code is mostly compatible.
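In Python terms the std::variant unification is effectively duck typing; a hypothetical sketch of the same shape, with dicts standing in for both backends:

```python
class InMemoryFileStorage:
    def __init__(self, instances):
        self.by_guid = dict(instances)          # everything eagerly in RAM

class RocksdbFileStorage:
    """Lazy stand-in: values are fetched from the kv store on access."""
    class _ByGuid:
        def __init__(self, kv):
            self.kv = kv
        def __getitem__(self, guid):
            return self.kv["g|" + guid]         # deserialization elided
    def __init__(self, kv):
        self.by_guid = RocksdbFileStorage._ByGuid(kv)

class File:
    """Code written against File.by_guid never sees which backend is used."""
    def __init__(self, storage):
        self.storage = storage
    def by_guid(self, guid):
        return self.storage.by_guid[guid]
```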

The way this is built means that the full IfcOpenShell stack will be able to use it, i.e. IfcConvert, Python, Bonsai, your own proprietary tools.

Roadmap:

I almost have the code base in a state where it compiles again after completely ripping it open.

  • Implement the scaffolding of the rocksdb based storage within ifcopenshell in a compatible manner
  • Make sure SPF is working again
  • Build the streaming SPF-to-rocksdb writer that can be used as a one-off conversion script
  • Implement most of the actual deserialization from the key-value as a read-only mode
  • Implement most of the serialization into the key-value for read-write file access
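The streaming writer from the roadmap could look roughly like this sketch: scan the STEP Physical File line by line and emit key-value records without ever holding the whole model in memory. The `i|`/`t|` key layout is purely illustrative (not IfcOpenShell's actual schema), and multi-line records and the header section are ignored:

```python
import re

# Matches simple single-line SPF records, e.g.  #12=IFCWALL('guid',$,...);
LINE = re.compile(r"#(\d+)\s*=\s*([A-Z0-9]+)\s*\((.*)\);")

def spf_to_kv(lines, put):
    """put(key, value) persists one record, e.g. a rocksdb write-batch put."""
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue                              # header, comments, etc.
        iid, ifc_type, args = m.groups()
        put(f"i|{iid}", f"{ifc_type}({args})")    # instance record by id
        put(f"t|{ifc_type}|{iid}", "")            # secondary index by type
```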

@aothms
Member Author

aothms commented Mar 14, 2025

We're approaching a state where it's possible to say something about the performance of this rocksdb-backed file storage. Both time and memory for very specific reads are greatly reduced. After parse/open, retrieval is only a little slower. There's probably a lot to optimise still.

One major caveat is that as soon as instances are retrieved from the rocksdb model they are not freed for the lifetime of the file object because we do not want dangling pointers. This means that memory usage will keep growing and means we haven't really reached infinite scalability yet.

Reading DC_Riverside_Bldg-LOD_300.ifc (275M) and reading the wall GUIDs:

Time (s)

| format | open model | read wall guids | combined |
|--------|-----------:|----------------:|---------:|
| spf    | 14.453     | 0.016           | 14.468   |
| rocks  | 0.026      | 0.022           | 0.048 (= 0.33%) |

Memory (bytes)

| format | open model    | read wall guids | combined        |
|--------|--------------:|----------------:|----------------:|
| spf    | 2 191 388 672 | 16 384          | 2 191 405 056   |
| rocks  | 5 074 944     | 6 696 960       | 11 771 904 (= 0.54%) |
script:

```python
import time, psutil, ifcopenshell

t0, m0 = time.perf_counter(), psutil.Process().memory_info().rss
f = ifcopenshell.open('DC_Riverside_Bldg-LOD_300.rdb')
t1, m1 = time.perf_counter(), psutil.Process().memory_info().rss
[i.GlobalId for i in f.by_type('IfcWall')]
t2, m2 = time.perf_counter(), psutil.Process().memory_info().rss
g = ifcopenshell.open('DC_Riverside_Bldg-LOD_300.ifc')
t3, m3 = time.perf_counter(), psutil.Process().memory_info().rss
[i.GlobalId for i in g.by_type('IfcWall')]
t4, m4 = time.perf_counter(), psutil.Process().memory_info().rss

print('spf', t3-t2, t4-t3, sep="|")
print('rocks', t1-t0, t2-t1, sep="|")

print('spf', m3-m2, m4-m3, sep="|")
print('rocks', m1-m0, m2-m1, sep="|")
```

@Moult
Contributor

Moult commented Mar 16, 2025

How would this look from the Python side? Would it be a drop-in replacement with merely a tweak in the open/write?

This is AMAZING!

@RickBrice
Contributor

> One major caveat is that as soon as instances are retrieved from the rocksdb model they are not freed for the lifetime of the file object because we do not want dangling pointers. This means that memory usage will keep growing and means we haven't really reached infinite scalability yet.

I don't really know what I'm talking about, so pardon me if this is a foolish idea, but could you use a shared_ptr-like object, instead of std::shared_ptr, that hides whether instance data is in memory or in the database? Instance data could be unloaded from memory after some time and then reloaded when needed. Objects holding instances of the shared_ptr-like object would never know.
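A toy sketch of that idea: a handle that owns only an id plus a loader, so the payload can be evicted and transparently re-fetched on the next access. The `load` callback is a hypothetical stand-in for a database read:

```python
class InstanceHandle:
    """Hides whether the instance data is resident or only in the database."""
    def __init__(self, iid, load):
        self.iid, self._load, self._data = iid, load, None
    def data(self):
        if self._data is None:            # first access, or after evict()
            self._data = self._load(self.iid)
        return self._data
    def evict(self):
        self._data = None                 # safe: data() reloads on demand
```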

@aothms
Member Author

aothms commented Mar 16, 2025

> How would this look from the Python side? Would it be a drop in replacement with merely a tweak in the open/write?

Yes, currently it just detects the file type when you open. Writing is only possible through the serializer at this point. Attribute modifications are probably possible, but instance additions require a bit more work because instances need to be created through the file context to get a reference to the database.

> could you use a shared_ptr like object instead of std::shared_ptr that hides if instance data is in memory or in the database?

I've been playing with similar ideas, among them a hybrid shared_ptr/weak_ptr: instances start as shared_ptr, and when you add them to a file you get weak references, so that the file can be deleted even when instances from the file are alive outside of it, which then can no longer be used. So maybe a custom pointer class is the solution. Also for equality: we currently use the pointer address in the C++ code base for testing equality, which essentially means we cannot recycle them (we actually already have identity() that we could use instead). For a custom pointer class we could also define our own equality operator. It just feels like a last resort to me to implement your own smart pointer.
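Equality via a stable identity instead of the pointer address might look like this sketch (a Python stand-in; `identity()` refers to the method mentioned above):

```python
class Instance:
    """Two handles are equal when they denote the same stored instance,
    regardless of whether they are the same object in memory."""
    def __init__(self, identity):
        self._identity = identity
    def identity(self):
        return self._identity
    def __eq__(self, other):
        return isinstance(other, Instance) and self._identity == other._identity
    def __hash__(self):
        return hash(self._identity)
```

With equality defined this way, the storage layer is free to drop and recreate instance objects without breaking code that compares or deduplicates them.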

@aothms
Member Author

aothms commented Sep 11, 2025

🙈

@aothms aothms merged commit 79672f7 into v0.8.0 Sep 11, 2025
2 of 6 checks passed
