Massive shuffle writes during triangleCount #322
Description
Hi,
I have a graph with 1.6 million vertices and 30 million edges. Each vertex has 58 attributes in addition to its 'id' attribute. Running triangleCount on this graph produces 1 terabyte of shuffle writes and aborts after running out of disk space.
When I generate the same graph without the 58 vertex attributes, triangleCount finishes successfully and causes only 10 gigabytes of shuffle writes.
Therefore I suggest modifying triangleCount.run() [1] so that the local variable g2 receives only the vertex ids and not the vertex attributes.
That is to say, change:
val g2 = GraphFrame(graph.vertices, dedupedE)
to
val g2 = GraphFrame(graph.vertices.select("id"), dedupedE)
This change does not alter the behavior of triangleCount.run(), because the dropped attributes are not used in the subsequent steps.
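Until something like this is changed in the library, the same idea can probably be applied from user code. The following is only a sketch of such a workaround, not the library's implementation: `SlimTriangleCount` and `triangleCountIdsOnly` are hypothetical names, and it assumes the result of `triangleCount.run()` contains the columns "id" and "count" and that the vertex DataFrame has no existing column named "count".

```scala
import org.apache.spark.sql.DataFrame
import org.graphframes.GraphFrame

// Hypothetical helper, not part of GraphFrames.
object SlimTriangleCount {
  def triangleCountIdsOnly(graph: GraphFrame): DataFrame = {
    // Build a graph whose vertices carry only "id", so the extra
    // vertex attributes never take part in the shuffle.
    val slim = GraphFrame(graph.vertices.select("id"), graph.edges)

    // Assumption: the result has an "id" and a "count" column.
    val counts = slim.triangleCount.run().select("id", "count")

    // Re-attach the original attributes with a single join on "id".
    // A left join keeps every original vertex even if it has no count row.
    graph.vertices.join(counts, Seq("id"), "left")
  }
}
```

Called as `SlimTriangleCount.triangleCountIdsOnly(graph)`, this should return the original vertex columns plus "count", with the wide attribute columns only touched once in the final join.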