 It would be a good idea to become familiar with dlib's DNN tooling before reading this
 example. So you should read dnn_introduction_ex.cpp and dnn_introduction2_ex.cpp
 before reading this example program. You should also read the introductory DNN+MMOD
-example as well before proceeding. So read dnn_mmod_ex.cpp first.
+example dnn_mmod_ex.cpp as well before proceeding.
 
 
 This example is essentially a more complex version of dnn_mmod_ex.cpp. In it we train
@@ -124,18 +124,19 @@ int main(int argc, char** argv) try
 //
 // To explain this non-max suppression idea further it's important to understand how
 // the detector works. Essentially, sliding window detectors scan all image locations
-// and ask "is there a care here?". If there really is a car in an image then usually
-// many sliding window locations will produce high detection scores, indicating that
-// there is a car at those locations. If we just stopped there then each car would
-// produce multiple detections. But that isn't what we want. We want each car to
-// produce just one detection. So it's common for detectors to include "non-maximum
-// suppression" logic which simply takes the strongest detection and then deletes all
-// detections "close to" the strongest. This is a simple post-processing step that can
-// eliminate duplicate detections. However, we have to define what "close to" means.
-// We can do this by looking at your training data and checking how close the closest
-// target boxes are to each other, and then picking a "close to" measure that doesn't
-// suppress those target boxes but is otherwise as tight as possible. This is exactly
-// what the mmod_options object does by default.
+// and ask "is there a car here?". If there really is a car in a specific location in
+// an image then usually many slightly different sliding window locations will produce
+// high detection scores, indicating that there is a car at those locations. If we
+// just stopped there then each car would produce multiple detections. But that isn't
+// what we want. We want each car to produce just one detection. So it's common for
+// detectors to include "non-maximum suppression" logic which simply takes the
+// strongest detection and then deletes all detections "close to" the strongest. This
+// is a simple post-processing step that can eliminate duplicate detections. However,
+// we have to define what "close to" means. We can do this by looking at your training
+// data and checking how close the closest target boxes are to each other, and then
+// picking a "close to" measure that doesn't suppress those target boxes but is
+// otherwise as tight as possible. This is exactly what the mmod_options object does
+// by default.
 //
 // Importantly, this means that if your training dataset contains an image with two
 // target boxes that really overlap a whole lot, then the non-maximum suppression
@@ -152,8 +153,8 @@ int main(int argc, char** argv) try
 // the image not suppressed. The smaller the non-max suppression region the more the
 // CNN has to learn and the more difficult the learning problem will become. This is
 // why we remove highly overlapped objects from the training dataset. That is, we do
-// it so that the non-max suppression logic will be able to be reasonably effective.
-// Here we are ensuring that any boxes that are entirely contained by another are
+// it so the non-max suppression logic will be able to be reasonably effective. Here
+// we are ensuring that any boxes that are entirely contained by another are
 // suppressed. We also ensure that boxes with an intersection over union of 0.5 or
 // greater are suppressed. This will improve the resulting detector since it will be
 // able to use more aggressive non-max suppression settings.
@@ -205,9 +206,9 @@ int main(int argc, char** argv) try
 }
 }
 
-// When modifying a dataset like this, it's a really good idea to print out a log of
-// how many boxes you ignored. It's easy to accidentally ignore a huge block of data,
-// so you should always look and see that things are doing what you expect.
+// When modifying a dataset like this, it's a really good idea to print a log of how
+// many boxes you ignored. It's easy to accidentally ignore a huge block of data, so
+// you should always look and see that things are doing what you expect.
 cout << "num_overlapped_ignored: " << num_overlapped_ignored << endl;
 cout << "num_additional_ignored: " << num_additional_ignored << endl;
 cout << "num_overlapped_ignored_test: " << num_overlapped_ignored_test << endl;
@@ -221,24 +222,36 @@ int main(int argc, char** argv) try
 // boxes, tall and skinny boxes (e.g. semi trucks), and short and wide boxes (e.g.
 // sedans). Here we are telling the MMOD algorithm that a vehicle is recognizable as
 // long as the longest box side is at least 70 pixels long and the shortest box side is
-// at least 30 pixels long. It will use these parameters to decide how large each of
-// the sliding windows needs to be so as to be able to detect all the vehicles. Since
-// our dataset has basically these 3 different aspect ratios, it will decide to use 3
-// different sliding windows. This means the final con layer in the network will have
-// 3 filters, one for each of these aspect ratios.
+// at least 30 pixels long. mmod_options will use these parameters to decide how large
+// each of the sliding windows needs to be so as to be able to detect all the vehicles.
+// Since our dataset has basically these 3 different aspect ratios, it will decide to
+// use 3 different sliding windows. This means the final con layer in the network will
+// have 3 filters, one for each of these aspect ratios.
+//
+// Another thing to consider when setting the sliding window size is the "stride" of
+// your network. The network we defined above downsamples the image by a factor of 8x
+// in the first few layers. So when the sliding windows are scanning the image, they
+// are stepping over it with a stride of 8 pixels. If you set the sliding window size
+// too small then the stride will become an issue. For instance, if you set the
+// sliding window size to 4 pixels, then it means a 4x4 window will be moved by 8
+// pixels at a time when scanning. This is obviously a problem since 75% of the image
+// won't even be visited by the sliding window. So you need to set the window size to
+// be big enough relative to the stride of your network. In our case, the windows are
+// at least 30 pixels in length, so being moved by 8 pixel steps is fine.
 mmod_options options(boxes_train, 70, 30);
 
+
 // This setting is very important and dataset specific. The vehicle detection dataset
 // contains boxes that are marked as "ignore", as we discussed above. Some of them are
-// ignored because we set ignore to true on them in the above code. However, the xml
-// files already contained a lot of ignore boxes. Some of them are large boxes that
-// encompass large parts of an image and the intention is to have everything inside
-// those boxes be ignored. Therefore, we need to tell the MMOD algorithm to do that,
-// which we do by setting options.overlaps_ignore appropriately.
+// ignored because we set ignore to true in the above code. However, the xml files
+// also contained a lot of ignore boxes. Some of them are large boxes that encompass
+// large parts of an image and the intention is to have everything inside those boxes
+// be ignored. Therefore, we need to tell the MMOD algorithm to do that, which we do
+// by setting options.overlaps_ignore appropriately.
 //
 // But first, we need to understand exactly what this option does. The MMOD loss
-// is essentially counting the number of false alarms + missed detections, produced by
-// the detector, for each image. During training, the code is running the detector on
+// is essentially counting the number of false alarms + missed detections produced by
+// the detector for each image. During training, the code is running the detector on
 // each image in a mini-batch and looking at its output and counting the number of
 // mistakes. The optimizer tries to find parameter settings that minimize the number
 // of detector mistakes.
@@ -261,7 +274,8 @@ int main(int argc, char** argv) try
 options.overlaps_ignore = test_box_overlap(0.5, 0.95);
 
 net_type net(options);
-// The final layer of the network must be a con_ layer that contains
+
+// The final layer of the network must be a con layer that contains
 // options.detector_windows.size() filters. This is because these final filters are
 // what perform the final "sliding window" detection in the network. For the dlib
 // vehicle dataset, there will be 3 sliding window detectors, so we will be setting
@@ -273,15 +287,16 @@ int main(int argc, char** argv) try
 trainer.set_learning_rate(0.1);
 trainer.be_verbose();
 
+
 // While training, we are going to use early stopping. That is, we will be checking
 // how good the detector is performing on our test data and when it stops getting
 // better on the test data we will drop the learning rate. We will keep doing that
-// until the learning rate is less than 1e-4. These two settings tell the training to
+// until the learning rate is less than 1e-4. These two settings tell the trainer to
 // do that. Essentially, we are setting the first argument to infinity, and only the
 // test iterations without progress threshold will matter. In particular, it says that
 // once we observe 1000 testing mini-batches where the test loss clearly isn't
 // decreasing we will lower the learning rate.
-trainer.set_iterations_without_progress_threshold(1000000);
+trainer.set_iterations_without_progress_threshold(50000);
 trainer.set_test_iterations_without_progress_threshold(1000);
 
 const string sync_filename = "mmod_cars_sync";
@@ -351,13 +366,19 @@ int main(int argc, char** argv) try
 
 // It's a really good idea to print the training parameters. This is because you will
 // invariably be running multiple rounds of training and should be logging the output
-// to a log file. This print statement will include many of the training parameters in
+// to a file. This print statement will include many of the training parameters in
 // your log.
 cout << trainer << cropper << endl;
 
 cout << "\nsync_filename: " << sync_filename << endl;
 cout << "num training images: " << images_train.size() << endl;
 cout << "training results: " << test_object_detection_function(net, images_train, boxes_train, test_box_overlap(), 0, options.overlaps_ignore);
+// Upsampling the data will allow the detector to find smaller cars. Recall that
+// we configured it to use a sliding window nominally 70 pixels in size. So upsampling
+// here will let it find things nominally 35 pixels in size. Although we include a
+// limit of 1800*1800 here which means "don't upsample an image if it's already larger
+// than 1800*1800". We do this so we don't run out of RAM, which is a concern because
+// some of the images in the dlib vehicle dataset are really high resolution.
 upsample_image_dataset<pyramid_down<2>>(images_train, boxes_train, 1800*1800);
 cout << "training upsampled results: " << test_object_detection_function(net, images_train, boxes_train, test_box_overlap(), 0, options.overlaps_ignore);
 
@@ -369,21 +390,24 @@ int main(int argc, char** argv) try
 
 /*
     This program takes many hours to execute on a high end GPU. It took about a day to
-    train on an NVIDIA 1080ti. The resulting model file is available at
-    http://dlib.net/files/mmod_rear_end_vehicle_detector.dat.bz2
+    train on a NVIDIA 1080ti. The resulting model file is available at
+    http://dlib.net/files/mmod_rear_end_vehicle_detector.dat.bz2
     It should be noted that this file on dlib.net has a dlib::shape_predictor appended
     onto the end of it (see dnn_mmod_find_cars_ex.cpp for an example of its use). This
     explains why the model file on dlib.net is larger than the
     mmod_rear_end_vehicle_detector.dat output by this program.
 
-    Also, the training and testing accuracies were:
+    You can see some videos of this vehicle detector running on YouTube:
+        https://www.youtube.com/watch?v=4B3bzmxMAZU
+        https://www.youtube.com/watch?v=bP2SUo5vSlc
 
-        num training images: 2217
-        training results: 0.990738 0.736431 0.736073
-        training upsampled results: 0.986837 0.937694 0.936912
-        num testing images: 135
-        testing results: 0.988827 0.471372 0.470806
-        testing upsampled results: 0.987879 0.651132 0.650399
+    Also, the training and testing accuracies were:
+        num training images: 2217
+        training results: 0.990738 0.736431 0.736073
+        training upsampled results: 0.986837 0.937694 0.936912
+        num testing images: 135
+        testing results: 0.988827 0.471372 0.470806
+        testing upsampled results: 0.987879 0.651132 0.650399
 */
 
 return 0;