
RocksDB instance/file storage#6203

Merged
aothms merged 130 commits into v0.8.0 from
tfk-rocksdb-storage
Sep 11, 2025

Conversation

@aothms
Member

@aothms aothms commented Feb 21, 2025

One of the primary concerns when dealing with IFC is memory usage. IfcOpenShell is not particularly careful in that regard, but this likely affects the majority of libraries due to the entity-relationship model of EXPRESS. We see this for example in the buildingSMART validation service (which uses IfcOpenShell), where the memory footprint (a) limits running multiple tasks in parallel, (b) imposes a quite restrictive maximum file size on end users, and (c) still results in a hefty cloud bill.

One of the earliest open source implementations of IFC, the open source BIMserver written in Java (https://github.com/opensourceBIM/BIMserver), has used a key-value store since its inception to offload memory to disk and essentially support infinitely large models.

Recently within IfcOpenShell we started a similar effort to use RocksDB as an alternative side-by-side storage model.


Conceptually this appears really simple. Where it gets harder is in the iterators and maps, and in realizing this in a backwards-compatible fashion while still benefiting from the lazy on-disk opportunities that the RocksDB key-value store offers.

Therefore I've come up with a couple of new template classes:

  • `rocksdb_map_adapter<K, V>(rocksdb::DB*, Str prefix)` (`Mapping[K -> V]`) Takes a RocksDB prefix and exposes a somewhat std::map-compatible interface over it that (de)serializes K and V into strings and writes/reads them from the key-value store.
  • `map_transformer<Fn>(Mapping basis)` (`Mapping[basis::KeyType -> invoke_result<Fn, V>]`) Primarily deals with persistence of pointers. Instances are stored by their identity in the KV store, but a cache exists so that instance pointer addresses are stable, i.e. they're created only once for the lifetime of the file, just like in the in-memory situation.
  • `map_variant<Mappings...>(basis*)` (`common_type<Mappings...>`) Unifies maps and iterators from the two storage backends into a consistent class that is std::variant-backed, both for the maps and for the iterators.
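A rough Python sketch of how the first two adapters compose; a plain dict stands in for the RocksDB handle, (de)serialization and the actual RocksDB API are elided, and the class names merely mirror the C++ templates:

```python
from collections.abc import Mapping

class RocksdbMapAdapter(Mapping):
    """Map-like view over all keys sharing a prefix in a key-value store."""
    def __init__(self, db, prefix):
        self.db, self.prefix = db, prefix
    def __getitem__(self, key):
        return self.db[self.prefix + key]      # (de)serialization elided
    def __iter__(self):
        n = len(self.prefix)
        return (k[n:] for k in self.db if k.startswith(self.prefix))
    def __len__(self):
        return sum(1 for _ in self)

class MapTransformer(Mapping):
    """Applies fn to values on access and caches the result, so the object
    returned for a key is stable for the lifetime of the file."""
    def __init__(self, basis, fn):
        self.basis, self.fn, self.cache = basis, fn, {}
    def __getitem__(self, key):
        if key not in self.cache:
            self.cache[key] = self.fn(self.basis[key])
        return self.cache[key]
    def __iter__(self):
        return iter(self.basis)
    def __len__(self):
        return len(self.basis)
```

For example, a guid index would then be `MapTransformer(RocksdbMapAdapter(db, "g|"), deserialize)`: lookups hit the store lazily, but repeated lookups of the same key return the identical object.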

This allows us to have a `std::unordered_map<std::string, Instance*>` in `in_memory_file_storage`,

a lazy RocksDB-backed mapping interface `map_transformer<std::string, Instance*>(rocksdb_map_adapter("g|"))` in `rocksdb_file_storage`,

and then in our File class a `map_variant(in_memory_file_storage::by_guid_t, rocksdb_file_storage::by_guid_t)`. This means that, practically speaking, even on the C++ side the code is mostly compatible.
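In Python terms the std::variant unification is effectively duck typing; a hypothetical sketch of the same shape, with dicts standing in for both backends:

```python
class InMemoryFileStorage:
    def __init__(self, instances):
        self.by_guid = dict(instances)          # everything eagerly in RAM

class RocksdbFileStorage:
    """Lazy stand-in: values are fetched from the kv store on access."""
    class _ByGuid:
        def __init__(self, kv):
            self.kv = kv
        def __getitem__(self, guid):
            return self.kv["g|" + guid]         # deserialization elided
    def __init__(self, kv):
        self.by_guid = RocksdbFileStorage._ByGuid(kv)

class File:
    """Code written against File.by_guid never sees which backend is used."""
    def __init__(self, storage):
        self.storage = storage
    def by_guid(self, guid):
        return self.storage.by_guid[guid]
```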

The way this is built means that the full IfcOpenShell stack will be able to use it, i.e. IfcConvert, Python, Bonsai, your own proprietary tools.

Roadmap:

I almost have the code base in a state where it compiles again after completely ripping it open.

  • Implement the scaffolding of the rocksdb based storage within ifcopenshell in a compatible manner
  • Make sure SPF is working again
  • Build the streaming SPF-to-rocksdb writer that can be used as a one-off conversion script
  • Implement most of the actual deserialization from the key-value as a read-only mode
  • Implement most of the serialization into the key-value for read-write file access
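The streaming writer from the roadmap could look roughly like this sketch: scan the STEP Physical File line by line and emit key-value records without ever holding the whole model in memory. The `i|`/`t|` key layout is purely illustrative (not IfcOpenShell's actual schema), and multi-line records and the header section are ignored:

```python
import re

# Matches simple single-line SPF records, e.g.  #12=IFCWALL('guid',$,...);
LINE = re.compile(r"#(\d+)\s*=\s*([A-Z0-9]+)\s*\((.*)\);")

def spf_to_kv(lines, put):
    """put(key, value) persists one record, e.g. a rocksdb write-batch put."""
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue                              # header, comments, etc.
        iid, ifc_type, args = m.groups()
        put(f"i|{iid}", f"{ifc_type}({args})")    # instance record by id
        put(f"t|{ifc_type}|{iid}", "")            # secondary index by type
```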

@aothms
Member Author

aothms commented Mar 14, 2025

We're approaching a state where it's possible to say something about the performance of this rocksdb-backed file storage. Both time and memory for very specific reads are greatly reduced. After parse/open, retrieval is only a little slower. There's probably a lot to optimise still.

One major caveat is that as soon as instances are retrieved from the rocksdb model they are not freed for the lifetime of the file object because we do not want dangling pointers. This means that memory usage will keep growing and means we haven't really reached infinite scalability yet.

Reading DC_Riverside_Bldg-LOD_300.ifc (275M) and reading the wall GUIDs:

Time (s)

| format | open model | read wall guids | combined |
|--------|-----------:|----------------:|---------:|
| spf    | 14.453     | 0.016           | 14.468   |
| rocks  | 0.026      | 0.022           | 0.048 (= 0.33%) |

Memory (bytes)

| format | open model    | read wall guids | combined        |
|--------|--------------:|----------------:|----------------:|
| spf    | 2 191 388 672 | 16 384          | 2 191 405 056   |
| rocks  | 5 074 944     | 6 696 960       | 11 771 904 (= 0.54%) |
script:

```python
import time, psutil, ifcopenshell

t0, m0 = time.perf_counter(), psutil.Process().memory_info().rss
f = ifcopenshell.open('DC_Riverside_Bldg-LOD_300.rdb')
t1, m1 = time.perf_counter(), psutil.Process().memory_info().rss
[i.GlobalId for i in f.by_type('IfcWall')]
t2, m2 = time.perf_counter(), psutil.Process().memory_info().rss
g = ifcopenshell.open('DC_Riverside_Bldg-LOD_300.ifc')
t3, m3 = time.perf_counter(), psutil.Process().memory_info().rss
[i.GlobalId for i in g.by_type('IfcWall')]
t4, m4 = time.perf_counter(), psutil.Process().memory_info().rss

print('spf', t3-t2, t4-t3, sep="|")
print('rocks', t1-t0, t2-t1, sep="|")

print('spf', m3-m2, m4-m3, sep="|")
print('rocks', m1-m0, m2-m1, sep="|")
```

@Moult
Contributor

Moult commented Mar 16, 2025

How would this look from the Python side? Would it be a drop-in replacement with merely a tweak in the open/write?

This is AMAZING!

@RickBrice
Contributor

> One major caveat is that as soon as instances are retrieved from the rocksdb model they are not freed for the lifetime of the file object because we do not want dangling pointers. This means that memory usage will keep growing and means we haven't really reached infinite scalability yet.

I don't really know what I'm talking about, so pardon me if this is a foolish idea, but could you use a shared_ptr-like object, instead of std::shared_ptr, that hides whether instance data is in memory or in the database? Instance data could be unloaded from memory after some time and then reloaded when needed. Objects holding instances of the shared_ptr-like object would never know.
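A toy sketch of that idea: a handle that owns only an id plus a loader, so the payload can be evicted and transparently re-fetched on the next access. The `load` callback is a hypothetical stand-in for a database read:

```python
class InstanceHandle:
    """Hides whether the instance data is resident or only in the database."""
    def __init__(self, iid, load):
        self.iid, self._load, self._data = iid, load, None
    def data(self):
        if self._data is None:            # first access, or after evict()
            self._data = self._load(self.iid)
        return self._data
    def evict(self):
        self._data = None                 # safe: data() reloads on demand
```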

@aothms
Member Author

aothms commented Mar 16, 2025

> How would this look from the Python side? Would it be a drop in replacement with merely a tweak in the open/write?

Yes, currently it just detects the file type when you open. Writing is only possible through the serializer at this point. Attribute modifications are probably possible, but instance additions require a bit more work because instances need to be created through the file context to get a reference to the database.

> could you use a shared_ptr like object instead of std::shared_ptr that hides if instance data is in memory or in the database?

I've been playing with similar ideas, among them a hybrid shared_ptr/weak_ptr: instances start as shared_ptr, and when you add them to a file you get weak references, so that the file can be deleted even when instances from the file are alive outside of it, which then can no longer be used. So maybe a custom pointer class is the solution. Also for equality: we currently use the pointer address in the C++ code base for testing equality, which essentially means we cannot recycle them (we actually already have identity() that we could use instead). For a custom pointer class we could also define our own equality operator. It just feels like a last resort to me to implement your own smart pointer.
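Equality via a stable identity instead of the pointer address might look like this sketch (a Python stand-in; `identity()` refers to the method mentioned above):

```python
class Instance:
    """Two handles are equal when they denote the same stored instance,
    regardless of whether they are the same object in memory."""
    def __init__(self, identity):
        self._identity = identity
    def identity(self):
        return self._identity
    def __eq__(self, other):
        return isinstance(other, Instance) and self._identity == other._identity
    def __hash__(self):
        return hash(self._identity)
```

With equality defined this way, the storage layer is free to drop and recreate instance objects without breaking code that compares or deduplicates them.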

@aothms
Member Author

aothms commented Sep 11, 2025

🙈

@aothms aothms merged commit 79672f7 into v0.8.0 Sep 11, 2025
2 of 6 checks passed
