- Tony Kimathi
- Ayman Charafeddine
- Animesh Mishra
Barenet is a CUDA-based project developed in the "Big Data and Machine Learning Systems" course in Spring 2024. It was designed to introduce GPU programming and to show how a tensor abstraction simplifies the implementation of tensor operations. However, because backward propagation and gradient calculation were written by hand, it was inefficient, error-prone, and hard to scale. This project extends Barenet with CUDA kernels for auto-differentiation, improving its robustness and efficiency. We implement CUDA-optimized General Matrix Multiply (GEMM) operations and provide CUDA kernel implementations for ResNet-18, self-attention, and multi-head attention modules. These enhancements use CUDA's parallel processing capabilities to support advanced architectures such as Transformers, improving Barenet's ability to handle sophisticated machine learning models. CUDA streams and memory management techniques further improve performance and scalability. Our autodiff implementation offers a 2x improvement over manual backward propagation. Our ResNet-18 implementation is roughly on par with the ResNet-18 implementations in PyTorch and TensorFlow. We have also implemented an auto-differentiation training framework for training a multilayer perceptron (MLP).
Barenet is a barebones CUDA project developed in the course Big Data and Machine Learning Systems, CSCI-GA.3033(077), Spring 2024. The project's goal was to introduce GPU programming and demonstrate the abstraction of tensors to make it easy to implement tensor operations.
Challenges faced by Barenet:
- Inefficient: Manual backward propagation and gradient calculation make it inefficient and cumbersome.
- Error-prone: Manual methods increase the risk of errors in calculations due to human oversight or computational inaccuracies.
- Not scalable: Manual updates make it unsuitable for advanced modeling as models and datasets grow larger and more complex.
The motivation for this work is to enhance Barenet by introducing auto-differentiation to streamline and optimize these processes. Automating this process and extending Barenet to include sophisticated models like ResNet is crucial for modern machine learning applications.
- Understand the Dataflow Graph: Ensuring accurate analysis and optimization of model performance.
- Apply Reverse-Mode Auto-Differentiation Correctly: Implementing backpropagation accurately and efficiently.
- Implement Layers Before Complex Models: Updating basic layers and adding essential layers such as flatten and convolution before integrating complex models like ResNet.
- PyTorch
  - Autograd
  - torch.nn.Transformer
  - Torchvision
- TensorFlow
  - tf.GradientTape
  - tfm.nlp.layers.Transformer
  - tf.keras.applications.ResNet50
The ResNet-18 model addresses the vanishing gradient problem through residual blocks. The architecture includes:
- Initial convolutional layer
- Four sets of residual blocks with an increasing number of filters (output channels): 64, 128, 256, and 512
- Each block contains two convolutional layers with a skip connection
- Flattening the output and passing through a fully connected layer
- Softmax activation for classification
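The layer layout above can be sketched in a few lines of plain Python. This is only an illustration of the structure, not Barenet's CUDA code. The function names, the RGB input, the default of 10 classes, and the assumption that the flattened feature vector has 512 entries are all hypothetical. The choice of two residual blocks per stage follows the standard ResNet-18 layout from He et al. rather than anything stated here.

```python
import math

# Stage widths from the list above. Two blocks per stage is the standard
# ResNet-18 layout (assumed, not read from Barenet's source).
STAGE_CHANNELS = [64, 128, 256, 512]
BLOCKS_PER_STAGE = 2


def resnet18_plan(num_classes=10):
    """Return the ordered weight layers as (kind, in_channels, out_channels)."""
    layers = [("conv", 3, 64)]  # initial convolution on an RGB input (assumed)
    in_ch = 64
    for out_ch in STAGE_CHANNELS:
        for _ in range(BLOCKS_PER_STAGE):
            layers.append(("conv", in_ch, out_ch))   # first conv of the block
            layers.append(("conv", out_ch, out_ch))  # second conv of the block
            in_ch = out_ch
    # Fully connected layer applied to the flattened features
    # (assumes the spatial size has been reduced to 1x1 by then).
    layers.append(("fc", in_ch, num_classes))
    return layers


def residual_add(block_out, block_in):
    """Skip connection: element-wise sum of block output and input, then ReLU."""
    return [max(0.0, f + x) for f, x in zip(block_out, block_in)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


plan = resnet18_plan()
print(len(plan))  # 1 initial conv + 4 * 2 * 2 block convs + 1 fc
```

Counting only the convolutional and fully connected layers in this plan gives the 18 weight layers that the model's name refers to. Projection convolutions on the skip path, where a stage changes the channel count, are not included in that count.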
A computation graph records and breaks down operations into:
- Addition (+)
- Multiplication (*)
- Subtraction (-)
- Division (/)
- Exponential (e^x)
- Logarithm (log)
- Squared (x^2)
The gradient rule for each of these operations is predefined. Once the computation graph has been built, a gradient graph is constructed from it, and a topological sort determines the order in which gradients are computed. Each operation defines a forward and a backward method. In the training loop, a forward and a backward pass are run for each batch, and gradients are propagated through the computation graph.
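As a rough illustration of this scheme, the following scalar sketch records the operations listed above, sorts the graph topologically, and then applies the chain rule in reverse order. It is plain Python rather than Barenet's CUDA implementation, and the names `Value`, `topological_order`, and `backward` are hypothetical.

```python
import math


class Value:
    """A scalar node in the computation graph (hypothetical name)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # inputs of the op that produced this node
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def __truediv__(self, other):
        return Value(self.data / other.data, (self, other),
                     (1.0 / other.data, -self.data / other.data ** 2))

    def exp(self):
        out = math.exp(self.data)
        return Value(out, (self,), (out,))

    def log(self):
        return Value(math.log(self.data), (self,), (1.0 / self.data,))

    def square(self):
        return Value(self.data ** 2, (self,), (2.0 * self.data,))


def topological_order(root):
    """Return nodes so that every node appears after all of its parents."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def backward(root):
    """Reverse-mode pass from the output back to the inputs."""
    order = topological_order(root)
    for node in order:
        node.grad = 0.0
    root.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += local * node.grad  # chain rule, summed over all uses


x, y = Value(2.0), Value(3.0)
z = (x * y + x.square()).log()  # z = log(x*y + x^2)
backward(z)
print(x.grad, y.grad)  # analytically (y + 2x)/(xy + x^2) and x/(xy + x^2)
```

Visiting nodes in reverse topological order guarantees that a node's gradient is complete before it is pushed to its parents, which matters when a value such as `x` feeds more than one operation.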
The parameter update step uses a direct gradient descent approach. The detach() function is used to sever a tensor's connection to the computational graph, allowing weight updates without altering the graph structure.
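A minimal sketch of why the update is detached, assuming a hypothetical scalar `Tensor` class in plain Python (not Barenet's actual tensor type). The gradient is computed by hand here to keep the example short. The point is that the subtraction in the update would itself be recorded, so the result is detached to become a fresh leaf with no history.

```python
class Tensor:
    """Hypothetical scalar stand-in for a tensor that records its inputs."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents  # non-empty only for results of recorded ops

    def __sub__(self, other):
        return Tensor(self.data - other.data, parents=(self, other))

    def scale(self, factor):
        return Tensor(self.data * factor, parents=(self,))

    def detach(self):
        """Return a tensor with the same value but no link to the graph."""
        return Tensor(self.data)


def sgd_step(weight, lr):
    """Plain gradient descent: w <- w - lr * dL/dw, returned as a fresh leaf."""
    grad = Tensor(weight.grad)
    updated = weight - grad.scale(lr)  # this subtraction is itself recorded
    return updated.detach()            # drop the recorded history


w = Tensor(1.0)
for _ in range(50):
    w.grad = 2.0 * (w.data - 3.0)  # gradient of the loss (w - 3)^2, by hand
    w = sgd_step(w, lr=0.1)
print(w.data, w.parents)
```

Without the `detach()` call, each step would chain the new weight to the previous one, so the graph would grow with every iteration and later backward passes would walk through old update operations.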
Our implementation was tested against the TensorFlow and PyTorch implementations on an HPC cluster using a single RTX 8000 GPU. The results show that our implementation outperforms the corresponding PyTorch and TensorFlow implementations of ResNet-18.
- Tony Kimathi: Pooling layer, Problem Definition & Motivation, State of the Art Systems, Technical Challenges, Related Work, Slides
- Ayman Charafeddine: Flatten layer, Design of Solution, Future Work, Slides
- Animesh Mishra: Conv layer, Our Solution (Implementation), Evaluation, Slides
All three members worked together on tensor3D, auto-differentiation, ResNet, op_gemm.cuh, and op_conv.cuh, pair-programming over Zoom calls and in-person meetings.


