Training a TFNO with Navier-Stokes, with FLOPs count #583

Open
ML4SC wants to merge 3 commits into neuraloperator:main from ML4SC:tfno-flops-analysis

ML4SC commented Apr 19, 2025

I've been working with TFNO models and recently developed a script that demonstrates model performance along with FLOPs analysis for both forward and backward passes.

I'd like to contribute to the NeuralOperator project by developing a training example and accompanying documentation that:

  • Demonstrates TFNO performance on the 2D Navier-Stokes equations
  • Includes FLOPs profiling for model introspection and optimization
  • Trains TFNO on multiple GPUs, with ongoing work to optimize communication loops between GPUs
  • Discusses strategies for efficient CPU–GPU communication during training

Please let me know if this would be a valuable addition to the project — I've opened a PR and would greatly appreciate any feedback as I iterate.

Member

JeanKossaifi commented Apr 20, 2025 via email

Author

ML4SC commented May 2, 2025

Hi Jean,

Thank you for your quick reply. I’ve used torch.profiler to record memory usage and kernel activity on a per-epoch basis. For CPU–GPU transfers, I’ve enabled pin_memory=True and non_blocking=True and set up asynchronous data loading to handle larger batch volumes.
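For readers following along, the setup described above might look roughly like the sketch below. It is an assumption-laden illustration, not the code from this PR: the random tensors, the single `Conv2d` layer (a hypothetical stand-in for a TFNO), and the batch size are all placeholders. It falls back to CPU when no GPU is available, in which case pinning is skipped and `non_blocking=True` has no effect.

```python
import torch
from torch.profiler import ProfilerActivity, profile
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Placeholder data and model; a real run would use Navier-Stokes data and a TFNO.
dataset = TensorDataset(torch.randn(32, 1, 16, 16), torch.randn(32, 1, 16, 16))
loader = DataLoader(dataset, batch_size=8, pin_memory=use_cuda)
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

activities = [ProfilerActivity.CPU]
if use_cuda:
    activities.append(ProfilerActivity.CUDA)

steps = 0
# Profile one epoch, recording operator activity and memory usage.
with profile(activities=activities, profile_memory=True) as prof:
    for x, y in loader:
        # Pinned host memory lets these copies overlap with GPU compute.
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        steps += 1

summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(summary)
```

Creating a fresh `profile` context per epoch gives the per-epoch breakdown mentioned above; setting `num_workers` on the `DataLoader` is one common way to load batches asynchronously.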

When working with very high-resolution data, I’m exploring a distributed streaming approach, but I haven’t yet found any existing functionality for that in the NeuralOperator codebase. If I’ve overlooked something, could you point me to the relevant module or function? Otherwise, any guidance on where to start implementing distributed data streaming would be greatly appreciated.

Thanks again for your help!

Best,
Natalie

Author

ML4SC commented May 2, 2025

Meanwhile, I’d be grateful for any feedback or suggestions you have on my TFNO example using the Navier–Stokes dataset.

@JeanKossaifi
Member

Thank you @ML4SC, the example looks good. Did you get a chance to build the docs and check the result?
The training script should probably live in scripts, though I'm not sure it is needed given the existing training script. What do you think, @dhpitt?

@JeanKossaifi
Member

Just following up: are you still working on this, @ML4SC?

@vduruiss
Collaborator

@ML4SC Just wanted to check one last time whether you are still interested in finishing this up. Thank you!

3 participants