Massive shuffle writes during triangleCount #322

@cronoik

Hi,
I have a graph with 1.6 million vertices and 30 million edges. Besides the 'id' attribute, each vertex has 58 attributes. Running triangleCount on this graph produces 1 terabyte of shuffle writes and aborts after running out of disk space.

When I generate this graph without the 58 vertex attributes, triangleCount finishes successfully and causes only 10 gigabytes of shuffle writes.

Therefore I suggest modifying triangleCount.run() [1] so that the local variable g2 gets only the vertex ids and not the vertex attributes.

That is to say, change:
val g2 = GraphFrame(graph.vertices, dedupedE)
to
val g2 = GraphFrame(graph.vertices.select("id"), dedupedE)

This change does not alter the behavior of triangleCount.run(), because the dropped attributes are not used in the subsequent steps.
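
Until such a change is made in the library, the same idea can probably be applied from user code. The following is only a sketch for spark-shell, assuming the vertex DataFrame has an "id" column and no existing "count" column; triangleCountSlim is a hypothetical helper name, not part of GraphFrames. It runs triangleCount on a copy of the graph that carries only the vertex ids and then joins the counts back onto the original attributes.

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

// Hypothetical helper (not part of GraphFrames): run triangleCount on a
// copy of the graph whose vertices carry only the "id" column, then
// re-attach the original vertex attributes afterwards.
def triangleCountSlim(graph: GraphFrame): DataFrame = {
  // Keep only the vertex ids so the wide attribute columns do not take
  // part in the joins performed inside triangleCount.
  val slim = GraphFrame(graph.vertices.select("id"), graph.edges)

  // The result contains the vertex "id" plus a "count" column.
  val counts = slim.triangleCount.run()

  // Join the counts back onto the full vertex DataFrame by "id".
  graph.vertices.join(counts, "id")
}
```

Usage would be along the lines of triangleCountSlim(graph).show(). The final join still moves the attribute columns once, but only as a single join against the small (id, count) result, so it should shuffle far less than carrying all 58 attributes through the whole computation.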

[1] https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/TriangleCount.scala#L48
