Skip to content

feat: aggregate nbrs api#792

Merged
SemyonSinchenko merged 17 commits intographframes:mainfrom
SemyonSinchenko:785-aggregate-nbrs
Mar 27, 2026
Merged

feat: aggregate nbrs api#792
SemyonSinchenko merged 17 commits intographframes:mainfrom
SemyonSinchenko:785-aggregate-nbrs

Conversation

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator

What changes were proposed in this pull request?

New API as described in #785

Why are the changes needed?

Close #785

@SemyonSinchenko SemyonSinchenko self-assigned this Feb 3, 2026
@SemyonSinchenko SemyonSinchenko added scala pyspark-classic GraphFrames on PySpark Classic pyspark-connect GraphFrames on PySpark Connect labels Feb 3, 2026
@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

Hi @james-willis ! I addressed most of your comments. For some I left my answers. It looks like we are close to a final version. Could be nice if you can take another look, so I can work on bindings and docs.

see scaladoc issue
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 8, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 70.30303% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.80%. Comparing base (f1db6f4) to head (88001b9).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...park/sql/graphframes/GraphFramesConnectUtils.scala 0.00% 36 Missing ⚠️
...scala/org/graphframes/lib/AggregateNeighbors.scala 89.84% 13 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #792      +/-   ##
==========================================
- Coverage   84.94%   80.80%   -4.15%     
==========================================
  Files          68       78      +10     
  Lines        3507     4428     +921     
  Branches      453      527      +74     
==========================================
+ Hits         2979     3578     +599     
- Misses        528      850     +322     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

I will work on tests, bindings and docs.

@SemyonSinchenko SemyonSinchenko marked this pull request as ready for review February 26, 2026 15:49
@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

@james-willis this one is ready for review

@james-willis
Copy link
Copy Markdown
Collaborator

Looking at this PR, here are my recommendations:

Code Quality & Performance

  1. Memory Management Concerns (AggregateNeighbors.scala:267-268):
    The hardcoded warning for maxHops > 10 should be configurable or include memory estimation logic rather than an arbitrary threshold.

  2. Potential Optimization (AggregateNeighbors.scala:303):
    The repartition by SRC could be expensive for smaller datasets. Consider making this optional or using coalesce when appropriate.

API Design

  1. Required Parameter Validation:
    The API requires either stoppingCondition or targetCondition but this isn't enforced until run(). Consider validation during builder setup for better UX.

  2. Default Accumulator:
    Most use cases will want basic path tracking. Consider adding a convenience method for common path accumulation patterns.

Testing & Documentation

  1. Edge Case Testing: Add tests for:

    • Very large graphs (performance characteristics)
    • Disconnected components
    • Graphs with cycles under different stopping conditions
  2. Python API Consistency:
    Ensure Python parameter names follow conventions consistently.

Implementation Robustness

  1. Convergence Logic (AggregateNeighbors.scala:400):
    Currently only checks if frontier is empty, but doesn't handle cases where states exist but no new edges can be traversed.

Overall, this is a solid implementation that follows GraphFrames patterns well. The main areas for improvement are around performance optimization configuration and edge case handling. The comprehensive test suite and documentation are excellent.

Comment by Claude (AI Assistant)

@SemyonSinchenko
Copy link
Copy Markdown
Collaborator Author

@james-willis

  1. It is a warning, I don't understand how the warning can be configurable.
  2. That is false. The join mean repartition by src anyway, so even on a single step it is better to keep the repartition
  3. It is impossible to achieve, because we do not know the order of setter-calls
  4. I do not understand, what does it mean? If it is about adding implementations, I'm going to do it in follow-up PRs. I'm going to add at least allPaths to find all paths between two nodes
  5. I'm not going to add huge graphs (performance testing) to unit tests. All tests are 10+ minutes already, do we want to have 30 minutes tests?
  6. All the Python API follows the snake_case way
  7. In this case the DataFrame will be empty, so I don't fully understand the comment...

@james-willis
Copy link
Copy Markdown
Collaborator

james-willis commented Mar 10, 2026

If it is about adding implementations, I'm going to do it in follow-up PRs. I'm going to add at least allPaths to find all paths between two nodes

yeah that was it. sgtm.

All other answer sgtm on that Claude review.

…andom walks, and PG changes

Resolved merge conflicts and integrated updates from graphframes/main into branch 785-aggregate-nbrs.

Key changes:
- Added sampling and convolution primitives (KMinSampling, SamplingConvolution) and related tests.
- Introduced embeddings and random-walk features: Hash2Vec, RandomWalkEmbeddings, RandomWalkBase, RandomWalkWithRestart, and example runners.
- Added new library components and algorithms: TwoPhase, Updated ConnectedComponents/KCore/Pregel/RandomizedContraction logic.
- Added benchmarks and benchmark infrastructure for various algorithms.
- Python API improvements: new property-graph package (python/graphframes/pg), internal utilities, updated client and protobuf bindings.
- Build, docs, and packaging updates: build.sbt, docs (including graph-ml page), NOTICE, AGENTS.md, and pre-commit config.
- Updated Spark shims and connect utilities to support the new features.
- Added and updated tests across Scala and Python to cover the new functionality.

All conflicts were fixed and changes staged for commit.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new multi-hop traversal API (“AggregateNeighbors”) to GraphFrames, implemented in Scala core with Spark Connect support and exposed via Python (classic + connect), along with documentation and tests to close #785.

Changes:

  • Introduces org.graphframes.lib.AggregateNeighbors (Scala) and wires it into GraphFrame.aggregateNeighbors.
  • Extends Spark Connect protocol + server/client plumbing to invoke AggregateNeighbors remotely.
  • Adds Python API wrappers (plus helper for attribute references), docs, and unit tests.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
core/src/main/scala/org/graphframes/lib/AggregateNeighbors.scala New Scala implementation + builder-style API and attribute helpers (srcAttr/dstAttr/edgeAttr).
core/src/main/scala/org/graphframes/GraphFrame.scala Exposes aggregateNeighbors entrypoint on GraphFrame.
core/src/test/scala/org/graphframes/lib/AggregateNeighborsSuite.scala Scala unit tests for traversal behavior (paths, filters, stopping, etc.).
connect/src/main/protobuf/graphframes.proto Adds AggregateNeighbors message and oneof method field.
connect/src/main/scala/.../GraphFramesConnectUtils.scala Server-side Connect planner support for AggregateNeighbors.
python/graphframes/connect/graphframes_client.py Python Spark Connect client method building the protobuf plan.
python/graphframes/connect/proto/graphframes_pb2.py / .pyi Regenerated protobuf bindings including AggregateNeighbors message.
python/graphframes/graphframe.py Public Python GraphFrame.aggregate_neighbors + AggregateNeighbors helper class; docstring for API.
python/graphframes/classic/graphframe.py Classic (Py4J) implementation of aggregate_neighbors calling JVM builder.
python/tests/test_graphframes.py Adds Python tests for basic AggregateNeighbors scenarios.
docs/src/04-user-guide/05-traversals.md User guide section documenting Aggregate Neighbors with examples.
project/LaikaCustoms.scala Doc site navigation tweak (page navigation depth).
build.sbt Adjusts protoc version selection for Spark 3 builds.
python/dev/build_jar.py Updates default Spark version used by the dev jar build script.
.gitignore Ignores additional AI-tool related files/directories.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@SemyonSinchenko SemyonSinchenko merged commit 82f600b into graphframes:main Mar 27, 2026
7 checks passed
@SemyonSinchenko SemyonSinchenko deleted the 785-aggregate-nbrs branch March 27, 2026 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pyspark-classic GraphFrames on PySpark Classic pyspark-connect GraphFrames on PySpark Connect scala

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: aggregate neighbors API

4 participants