Skip to content

Blog post about processing larger than memory queries #17089

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

As part of trying to help people understand how to tune memory related settings in #17069, @2010YOUY01 pointed out #17069 (comment)

Let's make it concise now, I think adding a few more sentences of explanation might actually confuse those without the background knowledge. A tutorial-style doc is still needed to describe the full picture.

I agree and I think a blog post explaining how DataFusion processes larger than memory data sets would be really helpful to give people the broader context

Describe the solution you'd like

I suggest a blog on https://datafusion.apache.org/blog/ (make a PR to https://github.com/alamb/datafusion-site)

Describe alternatives you've considered

Some ideas

  1. Background on memory usage in DataFusion (the memory manager model and what takes large amounts of memory) -- can probably copy/paste from https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-overview
  2. Memory Consumers Describe at a high level what consumes most memory in a plan (group by hash table, sorting, etc)
  3. Optimizations that DataFusion does to try and avoid needing memory (e.g. take advantage of pre-xisting sort orders, topk, etc)
  4. Spilling Sort: Provide an overview of the main spilling sort algorithm (sort in memory, spill to disk, merge pre-sorted runs, etc)
  5. Using Spilling Sort in Group By: Explain that the grouping operation uses the same underlying building block
  6. Call for help: 🎣 for people to figure out how to add spilling for joins and make the spilling hash faster

Additional context

I am very happy to help write this

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions