This repository presents a comparative study of two modern object detection architectures implemented in PyTorch: Faster R-CNN and YOLOv5. The goal of the project is to provide a clear, reproducible framework for students and researchers beginning to explore object detection, while also offering a realistic comparison between two-stage and single-stage detectors on small and medium-scale datasets.
The experiments focus on fine-tuning pre-trained models, evaluating them using COCO-style metrics, and analyzing convergence behavior, accuracy, speed, and practical trade-offs.
Object detection requires a model to simultaneously determine what objects are present in an image and where they are located. Modern deep learning detectors approach this problem in different ways, leading to important trade-offs in performance and efficiency.
In this project, we compare:
- A two-stage detector: Faster R-CNN
- A single-stage detector: YOLOv5
Two-stage detectors first generate region proposals and then classify them. Single-stage detectors perform localization and classification in a single forward pass. These structural differences influence accuracy, convergence speed, computational cost, and robustness, all of which are examined in this work.
The Penn-Fudan Pedestrian Dataset contains images of pedestrians in urban street scenes with bounding box annotations. In this project, it is treated as a single-class detection task.
Because the dataset is relatively small, it provides a useful testbed for studying transfer learning and fine-tuning pre-trained detection architectures.
The Oxford-IIIT Pet Dataset contains images of 37 cat and dog breeds. In these experiments, only 5 breeds were used. The dataset is adapted for object detection by using bounding box annotations and breed labels.
This dataset forms a multi-class detection problem, which is more challenging than Penn-Fudan due to increased class diversity and intra-class variation.
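Restricting the dataset to a subset of breeds implies remapping the selected breed names to contiguous class indices. The following is a minimal sketch of such a remapping, assuming index 0 is reserved for background (the TorchVision detection convention). The breed names and both helper functions are hypothetical, since the five breeds actually used are not listed here.

```python
from typing import Dict, Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def build_label_map(selected_breeds: Iterable[str]) -> Dict[str, int]:
    """Map each selected breed to a contiguous class id starting at 1 (0 = background)."""
    return {breed: idx for idx, breed in enumerate(sorted(set(selected_breeds)), start=1)}


def filter_annotations(
    annotations: List[Tuple[Box, str]], label_map: Dict[str, int]
) -> List[Tuple[Box, int]]:
    """Drop annotations outside the subset and replace breed names with class ids."""
    return [(box, label_map[breed]) for box, breed in annotations if breed in label_map]


# Placeholder subset for illustration only; not the breeds used in the experiments.
label_map = build_label_map(["beagle", "pug", "Bengal", "Persian", "Sphynx"])
annotations = [((10, 20, 110, 220), "pug"), ((5, 5, 50, 60), "Ragdoll")]
kept = filter_annotations(annotations, label_map)
```

Sorting the names before enumerating keeps the mapping stable across runs, which matters when checkpoints are evaluated later against the same class ids.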
The Faster R-CNN implementation uses a MobileNetV3 backbone with a Feature Pyramid Network (FPN). The model operates in two stages: first generating candidate object regions using a Region Proposal Network (RPN), and then refining and classifying those regions.
This architecture is known for strong localization accuracy and robustness, particularly on smaller datasets where high-quality region proposals can significantly improve detection precision.
Pre-trained weights are pulled from TorchVision and fine-tuned on the new datasets. This enables effective transfer learning even with limited training data.
The YOLOv5 implementation uses the lightweight nano variant (YOLOv5n). YOLO performs detection in a single forward pass, directly predicting bounding boxes and class probabilities.
In this project:
- Pre-trained weights are loaded.
- The detection head is adapted for the dataset's number of classes.
- The backbone is frozen.
- Only the detection head is fine-tuned.
This approach reduces overfitting risk on small datasets but limits full adaptation. To ensure consistency with Faster R-CNN training pipelines, YOLO weights are manually loaded and trained with PyTorch scripts, though some Ultralytics utilities are included via a git submodule.
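The freeze-then-fine-tune pattern described above can be sketched in plain PyTorch. `TinyDetector` is a hypothetical stand-in with a `backbone` and a `head`, not the YOLOv5n architecture, and the choice of Adam is an assumption (only the learning rate is given for the YOLO runs).

```python
import torch
from torch import nn


class TinyDetector(nn.Module):
    """Hypothetical stand-in for a detector with a separate backbone and head."""

    def __init__(self, num_outputs: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(8, num_outputs, 1)

    def forward(self, x):
        return self.head(self.backbone(x))


def freeze_backbone(model: TinyDetector) -> list:
    """Disable gradients for the backbone and return the parameters left trainable."""
    for param in model.backbone.parameters():
        param.requires_grad = False
    return [p for p in model.parameters() if p.requires_grad]


model = TinyDetector(num_outputs=6)
trainable = freeze_backbone(model)
# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(trainable, lr=5e-4)
```

Passing only the trainable parameters to the optimizer makes the intent explicit and avoids keeping optimizer state for weights that never change.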
Training was conducted using PyTorch with GPU acceleration. Models were evaluated with COCO-style metrics computed via torchmetrics, including:
- mAP@0.5
- mAP@0.5:0.95
- Precision
- Recall
- Inference speed (images per second)
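To make the IoU threshold behind these metrics concrete, here is a simplified single-image illustration of precision and recall at IoU 0.5. It is a sketch in plain Python, not the torchmetrics implementation: it uses greedy matching by confidence score and skips the averaging over classes, images, and thresholds that mAP performs. The function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: Sequence[Tuple[Box, float]], gts: Sequence[Box], iou_thr: float = 0.5
) -> Tuple[float, float]:
    """Greedy matching: higher-scoring predictions claim ground-truth boxes first."""
    matched = set()
    true_pos = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        # Find the best still-unmatched ground-truth box for this prediction.
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(preds) if preds else 0.0
    recall = true_pos / len(gts) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 9), 0.9), ((50, 50, 60, 60), 0.8)]
precision, recall = precision_recall(preds, gts)  # one hit, one miss, one false alarm
```

Here the first prediction overlaps a ground-truth box with IoU 0.9 and counts as a true positive, while the second overlaps nothing, so both precision and recall come out at 0.5.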
Faster R-CNN was trained for 50 epochs using Adam with a learning rate of 1e-4. YOLOv5 experiments were run for 100-300 epochs using a learning rate of 5e-4 and batch size of 8.
Faster R-CNN proved easy to train and robust to hyperparameter changes. YOLOv5 required more epochs and careful learning rate adjustment, as well as relatively large batch sizes to stabilize gradient updates.
Metrics were logged per epoch for visualization.
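One simple way to log metrics per epoch is to append a row to a CSV file after each epoch. The sketch below assumes a CSV log and a hypothetical `append_epoch_metrics` helper; the repository's actual logging format is not described here.

```python
import csv
from pathlib import Path
from typing import Dict


def append_epoch_metrics(log_path: Path, epoch: int, metrics: Dict[str, float]) -> None:
    """Append one row of metrics for an epoch, writing the header on first use."""
    write_header = not log_path.exists()
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", *metrics.keys()])
        if write_header:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **metrics})
```

A training loop would call this once per epoch with the same metric keys each time, for example `append_epoch_metrics(Path("metrics.csv"), epoch, {"loss": loss, "map_50": map_50})`, where the file name and keys are placeholders.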
All tests in this report were run on a Ryzen 9 7950X CPU with an RTX 2070 Super GPU.
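Reported images-per-second figures depend on how timing is done. The sketch below shows one common approach, assuming a few warm-up calls are excluded from the measurement; `measure_fps` and its `infer` callable are hypothetical and not taken from this repository.

```python
import time
from typing import Callable, Iterable


def measure_fps(
    infer: Callable[[object], object], images: Iterable[object], warmup: int = 2
) -> float:
    """Return images processed per second, excluding the first `warmup` calls."""
    images = list(images)
    for image in images[:warmup]:
        infer(image)  # warm-up calls are not timed
    timed = images[warmup:]
    start = time.perf_counter()
    for image in timed:
        infer(image)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed if elapsed > 0 else float("inf")
```

When `infer` runs a model on a GPU, CUDA kernels are launched asynchronously, so the callable should synchronize (for example with `torch.cuda.synchronize()`) before returning; otherwise the timer may stop before the work has finished.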
| Dataset | Model | mAP@0.5 | mAP@0.5:0.95 | Precision | Recall | FPS |
|---|---|---|---|---|---|---|
| Penn-Fudan Pedestrian | Faster R-CNN | 0.94 | 0.82 | 0.94 | 0.85 | 59 |
| Penn-Fudan Pedestrian | YOLO | 0.57 | 0.14 | 0.57 | 0.39 | 58 |
| Oxford Pets (5 Breeds) | Faster R-CNN | 0.96 | 0.75 | 0.97 | 0.79 | 69 |
| Oxford Pets (5 Breeds) | YOLO | 0.49 | 0.25 | 0.49 | 0.50 | 72 |
Faster R-CNN converged quickly, achieving high precision and mAP on both datasets. YOLO improved gradually but plateaued around mAP@0.5 ≈ 0.49-0.57. The improved YOLO results on the new machine highlight that hardware can significantly affect training dynamics and convergence, though YOLO still did not reach Faster R-CNN performance.
[Figure panels for each experiment: Loss Curve, Metrics Curve, Sample Predictions, Speed Curve]
These visualizations confirm the convergence trends observed in the metrics tables.
The experiments show a clear performance gap between Faster R-CNN and YOLOv5 on small datasets. Faster R-CNN achieved near-perfect precision and high mAP on both datasets, converging rapidly and producing accurate bounding boxes.
YOLOv5, despite its lighter architecture, plateaued in performance. On the pedestrian dataset, mAP@0.5 reached ~0.57, and on the Oxford Pets dataset it stabilized at ~0.49. Improvements beyond ~100 epochs were marginal, suggesting that freezing the backbone limited its adaptability. Training on the new machine improved YOLO performance slightly, but the gap to Faster R-CNN remains significant.
Checkpoint sizes also differed substantially. Faster R-CNN's model files were much larger due to the heavier backbone and region proposal network, whereas YOLO remained lightweight, offering a trade-off between storage and accuracy.
Inference speed differences were smaller than expected: YOLO was slightly faster on Oxford Pets (72 vs. 69 FPS) and on par on Penn-Fudan (58 vs. 59 FPS), while Faster R-CNN still ran in real time on both datasets (59-69 FPS). At this scale, both models are viable for interactive tasks.
Overall, for small academic datasets:
- Two-stage detectors like Faster R-CNN provide more reliable and higher-quality detection.
- Single-stage detectors like YOLOv5 may require full fine-tuning, larger datasets, or additional augmentation to reach comparable performance.
- Model size and training stability are practical considerations alongside accuracy.
Object detection is a complex problem with significant architectural trade-offs. This study shows:
- Faster R-CNN achieves high precision and mAP quickly, even on small datasets.
- YOLOv5 benefits from hardware and careful training but remains less accurate in this context.
- YOLOv5 is lightweight and fast, highlighting the trade-off between inference speed and detection quality.
These results provide a practical framework for students and researchers to understand how architecture, dataset size, and fine-tuning strategy affect object detection performance. The repository supports further experimentation with full YOLO backbone fine-tuning, data augmentation, and evaluation on larger datasets.
This section explains how to get the project running from scratch, including installing dependencies, preparing datasets, training models, and visualizing results.
Ensure you have Python 3.8+ and CUDA installed (if using GPU). Then install the required Python packages:

    pip install -r requirements.txt

This will install PyTorch, TorchVision, TorchMetrics, and other necessary libraries.

The repository includes scripts to download the datasets and generate JSON splits.

Download default datasets:

    python generate_splits.py --dl-only

This will download the Penn-Fudan Pedestrian dataset and the Oxford-IIIT Pet dataset (subset of breeds used in this project). When running training, you will use the pre-made splits included in this repository.

Optional: You can create custom splits or limit the number of pet breeds:

    python generate_splits.py --pet_breeds 10

Models can be trained using `train.py`. Two options are supported: Faster R-CNN and YOLOv5.

Example: Train Faster R-CNN on Penn-Fudan Pedestrians:

    python train.py --model rcnn --dataset penn --splits pennfudan_splits.json --checkpoints checkpoints/r-cnn/PennFudanPed

Example: Train YOLOv5 on Oxford Pets:

    python train.py --model yolo --dataset pet --splits pet_splits.json --checkpoints checkpoints/yolo/oxford-iiit-pet

After training, metrics and predictions can be visualized using `plot_metrics.py`. This generates loss curves, mAP curves, sample predictions, and speed plots.

Example: Visualize Faster R-CNN results on Penn-Fudan:

    python plot_metrics.py --dataset penn --splits pennfudan_splits.json --model rcnn --checkpoint ./checkpoints/r-cnn/PennFudanPed/epoch_50.pth --output ./vis/r-cnn/PennFudanPed/

Example: Visualize YOLOv5 results on Oxford Pets:

    python plot_metrics.py --dataset pet --splits pet_splits.json --model yolo --checkpoint ./checkpoints/yolo/oxford-iiit-pet/epoch_100.pth --output ./vis/yolo/oxford-iiit-pet/

The output folder will contain:

- `loss_curve.png`: Training loss over epochs
- `metrics_curve.png`: mAP, precision, and recall curves
- `sample_predictions.png`: Example detections on validation images
- `speed_curve.png`: Inference speed over the dataset











