Commit 3211da4: "Yet more comments"
1 parent: a362305

2 files changed, 71 additions & 43 deletions

examples/dnn_mmod_ex.cpp
Lines changed: 4 additions & 0 deletions
@@ -213,6 +213,10 @@ int main(int argc, char** argv) try
     }
     return 0;

+    // Now that you finished this example, you should read dnn_mmod_train_find_cars_ex.cpp,
+    // which is a more advanced example. It discusses many issues surrounding properly
+    // setting the MMOD parameters and creating a good training dataset.
+
 }
 catch(std::exception& e)
 {

examples/dnn_mmod_train_find_cars_ex.cpp
Lines changed: 67 additions & 43 deletions
@@ -12,7 +12,7 @@
     It would be a good idea to become familiar with dlib's DNN tooling before reading this
     example. So you should read dnn_introduction_ex.cpp and dnn_introduction2_ex.cpp
     before reading this example program. You should also read the introductory DNN+MMOD
-    example as well before proceeding. So read dnn_mmod_ex.cpp first.
+    example dnn_mmod_ex.cpp as well before proceeding.


     This example is essentially a more complex version of dnn_mmod_ex.cpp. In it we train
@@ -124,18 +124,19 @@ int main(int argc, char** argv) try
     //
     // To explain this non-max suppression idea further it's important to understand how
     // the detector works. Essentially, sliding window detectors scan all image locations
-    // and ask "is there a car here?". If there really is a car in an image then usually
-    // many sliding window locations will produce high detection scores, indicating that
-    // there is a car at those locations. If we just stopped there then each car would
-    // produce multiple detections. But that isn't what we want. We want each car to
-    // produce just one detection. So it's common for detectors to include "non-maximum
-    // suppression" logic which simply takes the strongest detection and then deletes all
-    // detections "close to" the strongest. This is a simple post-processing step that can
-    // eliminate duplicate detections. However, we have to define what "close to" means.
-    // We can do this by looking at your training data and checking how close the closest
-    // target boxes are to each other, and then picking a "close to" measure that doesn't
-    // suppress those target boxes but is otherwise as tight as possible. This is exactly
-    // what the mmod_options object does by default.
+    // and ask "is there a car here?". If there really is a car in a specific location in
+    // an image then usually many slightly different sliding window locations will produce
+    // high detection scores, indicating that there is a car at those locations. If we
+    // just stopped there then each car would produce multiple detections. But that isn't
+    // what we want. We want each car to produce just one detection. So it's common for
+    // detectors to include "non-maximum suppression" logic which simply takes the
+    // strongest detection and then deletes all detections "close to" the strongest. This
+    // is a simple post-processing step that can eliminate duplicate detections. However,
+    // we have to define what "close to" means. We can do this by looking at your training
+    // data and checking how close the closest target boxes are to each other, and then
+    // picking a "close to" measure that doesn't suppress those target boxes but is
+    // otherwise as tight as possible. This is exactly what the mmod_options object does
+    // by default.
     //
     // Importantly, this means that if your training dataset contains an image with two
     // target boxes that really overlap a whole lot, then the non-maximum suppression
@@ -152,8 +153,8 @@ int main(int argc, char** argv) try
     // the image not suppressed. The smaller the non-max suppression region the more the
     // CNN has to learn and the more difficult the learning problem will become. This is
     // why we remove highly overlapped objects from the training dataset. That is, we do
-    // it so that the non-max suppression logic will be able to be reasonably effective.
-    // Here we are ensuring that any boxes that are entirely contained by another are
+    // it so the non-max suppression logic will be able to be reasonably effective. Here
+    // we are ensuring that any boxes that are entirely contained by another are
     // suppressed. We also ensure that boxes with an intersection over union of 0.5 or
     // greater are suppressed. This will improve the resulting detector since it will be
     // able to use more aggressive non-max suppression settings.
@@ -205,9 +206,9 @@ int main(int argc, char** argv) try
         }
     }

-    // When modifying a dataset like this, it's a really good idea to print out a log of
-    // how many boxes you ignored. It's easy to accidentally ignore a huge block of data,
-    // so you should always look and see that things are doing what you expect.
+    // When modifying a dataset like this, it's a really good idea to print a log of how
+    // many boxes you ignored. It's easy to accidentally ignore a huge block of data, so
+    // you should always look and see that things are doing what you expect.
     cout << "num_overlapped_ignored: "<< num_overlapped_ignored << endl;
     cout << "num_additional_ignored: "<< num_additional_ignored << endl;
     cout << "num_overlapped_ignored_test: "<< num_overlapped_ignored_test << endl;
@@ -221,24 +222,36 @@ int main(int argc, char** argv) try
     // boxes, tall and skinny boxes (e.g. semi trucks), and short and wide boxes (e.g.
    // sedans). Here we are telling the MMOD algorithm that a vehicle is recognizable as
    // long as the longest box side is at least 70 pixels long and the shortest box side is
-    // at least 30 pixels long. It will use these parameters to decide how large each of
-    // the sliding windows needs to be so as to be able to detect all the vehicles. Since
-    // our dataset has basically these 3 different aspect ratios, it will decide to use 3
-    // different sliding windows. This means the final con layer in the network will have
-    // 3 filters, one for each of these aspect ratios.
+    // at least 30 pixels long. mmod_options will use these parameters to decide how large
+    // each of the sliding windows needs to be so as to be able to detect all the vehicles.
+    // Since our dataset has basically these 3 different aspect ratios, it will decide to
+    // use 3 different sliding windows. This means the final con layer in the network will
+    // have 3 filters, one for each of these aspect ratios.
+    //
+    // Another thing to consider when setting the sliding window size is the "stride" of
+    // your network. The network we defined above downsamples the image by a factor of 8x
+    // in the first few layers. So when the sliding windows are scanning the image, they
+    // are stepping over it with a stride of 8 pixels. If you set the sliding window size
+    // too small then the stride will become an issue. For instance, if you set the
+    // sliding window size to 4 pixels, then it means a 4x4 window will be moved by 8
+    // pixels at a time when scanning. This is obviously a problem since 75% of the image
+    // won't even be visited by the sliding window. So you need to set the window size to
+    // be big enough relative to the stride of your network. In our case, the windows are
+    // at least 30 pixels in length, so being moved by 8 pixel steps is fine.
     mmod_options options(boxes_train, 70, 30);

+
     // This setting is very important and dataset specific. The vehicle detection dataset
     // contains boxes that are marked as "ignore", as we discussed above. Some of them are
-    // ignored because we set ignore to true on them in the above code. However, the xml
-    // files already contained a lot of ignore boxes. Some of them are large boxes that
-    // encompass large parts of an image and the intention is to have everything inside
-    // those boxes be ignored. Therefore, we need to tell the MMOD algorithm to do that,
-    // which we do by setting options.overlaps_ignore appropriately.
+    // ignored because we set ignore to true in the above code. However, the xml files
+    // also contained a lot of ignore boxes. Some of them are large boxes that encompass
+    // large parts of an image and the intention is to have everything inside those boxes
+    // be ignored. Therefore, we need to tell the MMOD algorithm to do that, which we do
+    // by setting options.overlaps_ignore appropriately.
     //
     // But first, we need to understand exactly what this option does. The MMOD loss
-    // is essentially counting the number of false alarms + missed detections, produced by
-    // the detector, for each image. During training, the code is running the detector on
+    // is essentially counting the number of false alarms + missed detections produced by
+    // the detector for each image. During training, the code is running the detector on
     // each image in a mini-batch and looking at its output and counting the number of
     // mistakes. The optimizer tries to find parameter settings that minimize the number
     // of detector mistakes.
@@ -261,7 +274,8 @@ int main(int argc, char** argv) try
     options.overlaps_ignore = test_box_overlap(0.5, 0.95);

     net_type net(options);
-    // The final layer of the network must be a con_ layer that contains
+
+    // The final layer of the network must be a con layer that contains
     // options.detector_windows.size() filters. This is because these final filters are
     // what perform the final "sliding window" detection in the network. For the dlib
     // vehicle dataset, there will be 3 sliding window detectors, so we will be setting
@@ -273,15 +287,16 @@ int main(int argc, char** argv) try
     trainer.set_learning_rate(0.1);
     trainer.be_verbose();

+
     // While training, we are going to use early stopping. That is, we will be checking
     // how good the detector is performing on our test data and when it stops getting
     // better on the test data we will drop the learning rate. We will keep doing that
-    // until the learning rate is less than 1e-4. These two settings tell the training to
+    // until the learning rate is less than 1e-4. These two settings tell the trainer to
     // do that. Essentially, we are setting the first argument to infinity, and only the
     // test iterations without progress threshold will matter. In particular, it says that
     // once we observe 1000 testing mini-batches where the test loss clearly isn't
     // decreasing we will lower the learning rate.
-    trainer.set_iterations_without_progress_threshold(1000000);
+    trainer.set_iterations_without_progress_threshold(50000);
     trainer.set_test_iterations_without_progress_threshold(1000);

     const string sync_filename = "mmod_cars_sync";
@@ -351,13 +366,19 @@ int main(int argc, char** argv) try

     // It's a really good idea to print the training parameters. This is because you will
     // invariably be running multiple rounds of training and should be logging the output
-    // to a log file. This print statement will include many of the training parameters in
+    // to a file. This print statement will include many of the training parameters in
     // your log.
     cout << trainer << cropper << endl;

     cout << "\nsync_filename: " << sync_filename << endl;
     cout << "num training images: "<< images_train.size() << endl;
     cout << "training results: " << test_object_detection_function(net, images_train, boxes_train, test_box_overlap(), 0, options.overlaps_ignore);
+    // Upsampling the data will allow the detector to find smaller cars. Recall that
+    // we configured it to use a sliding window nominally 70 pixels in size. So upsampling
+    // here will let it find things nominally 35 pixels in size. Although we include a
+    // limit of 1800*1800 here which means "don't upsample an image if it's already larger
+    // than 1800*1800". We do this so we don't run out of RAM, which is a concern because
+    // some of the images in the dlib vehicle dataset are really high resolution.
     upsample_image_dataset<pyramid_down<2>>(images_train, boxes_train, 1800*1800);
     cout << "training upsampled results: " << test_object_detection_function(net, images_train, boxes_train, test_box_overlap(), 0, options.overlaps_ignore);

@@ -369,21 +390,24 @@ int main(int argc, char** argv) try

    /*
        This program takes many hours to execute on a high end GPU. It took about a day to
        train on an NVIDIA 1080ti. The resulting model file is available at
            http://dlib.net/files/mmod_rear_end_vehicle_detector.dat.bz2
        It should be noted that this file on dlib.net has a dlib::shape_predictor appended
        onto the end of it (see dnn_mmod_find_cars_ex.cpp for an example of its use). This
        explains why the model file on dlib.net is larger than the
        mmod_rear_end_vehicle_detector.dat output by this program.

-       Also, the training and testing accuracies were:
+       You can see some videos of this vehicle detector running on YouTube:
+       https://www.youtube.com/watch?v=4B3bzmxMAZU
+       https://www.youtube.com/watch?v=bP2SUo5vSlc

-       num training images: 2217
-       training results: 0.990738 0.736431 0.736073
-       training upsampled results: 0.986837 0.937694 0.936912
-       num testing images: 135
-       testing results: 0.988827 0.471372 0.470806
-       testing upsampled results: 0.987879 0.651132 0.650399
+       Also, the training and testing accuracies were:
+           num training images: 2217
+           training results: 0.990738 0.736431 0.736073
+           training upsampled results: 0.986837 0.937694 0.936912
+           num testing images: 135
+           testing results: 0.988827 0.471372 0.470806
+           testing upsampled results: 0.987879 0.651132 0.650399
    */

    return 0;
