Thursday, 14 May 2020

Darknet/YOLO optimized for Jetson


This post contains details of some of the memory adaptations of Darknet/YOLO for Jetson Nano, making it possible to train and run inference at higher resolutions.

As described in the source code link above, the memory optimizations are:
- Use of Jetson Nano Unified memory (inference and training)
- Mixed precision training (Paper, implementation, implementation, NVidia)

The implementation of mixed precision is for memory optimization only. We will not get faster execution, because Jetson Nano does not have Tensor Cores and the Maxwell architecture does not execute FP16 faster than FP32. See the report below.

Currently, this supports YOLOv3 only: no v4, no v2, and no tiny variants.
It probably works on Maxwell Jetsons and may work on Xavier. It will not work on dGPUs because of the adaptations for Unified Memory.


DataSet
Since Jetson Nano is extremely slow compared with the latest dGPUs, I have tested only with the Cat/Dog classes of the COCO dataset.
It takes approximately 11 days to train at 320x320 and approximately 14.5 days at 416x416.

More details on how to compile, etc. are here.

Unified memory
Darknet keeps the same data in both CPU and GPU memory. This is understandable if you use dGPUs.
However, Jetson shares the same DRAM between CPU and GPU, so we don't have to store the same data twice.
So, I adapted Darknet to allocate memory only once.
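
As a minimal sketch of the idea (not the actual patch; alloc_shared is a hypothetical helper name), the separate malloc/cudaMalloc pair can be replaced by a single managed allocation that both CPU code and CUDA kernels can use:

  #include <cuda_runtime.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* On dGPUs Darknet keeps x (CPU, malloc) and x_gpu (GPU, cudaMalloc) and
     copies between them. On Jetson the same DRAM backs both, so one managed
     allocation is enough and the mirrored copy can be dropped. */
  float *alloc_shared(size_t n)
  {
      float *ptr = NULL;
      cudaError_t err = cudaMallocManaged((void **)&ptr, n * sizeof(float), cudaMemAttachGlobal);
      if (err != cudaSuccess) {
          fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
          exit(1);
      }
      return ptr;  /* valid on the CPU and inside CUDA kernels */
  }

With a single allocation, the cudaMemcpy() calls that used to keep the two copies in sync become unnecessary, which is where the memory saving comes from, in addition to dropping one of the two copies itself.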

Mixed Precision and loss scale
For the mix between FP16 and FP32 in the forward/backward/update passes, I referenced the TensorFlow version from this link.
There are some proposals for a dynamic loss scale (here and here).
However, due to the characteristics of Darknet/YOLOv3's behaviour, I decided to implement my own.
The rule for halving the loss scale is the same as in the links above (reduce if there is a NaN or Inf in dW, i.e. weight_updates on the convolutional layers). However, the links above double the loss scale only if there has been no NaN or Inf for 1000 or 2000 iterations.
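
To make the structure concrete, here is a rough sketch (my own names, not the actual Darknet symbols) of the FP16/FP32 split: the FP32 master weights are the only copy that gets updated, an FP16 copy is regenerated for the convolution forward/backward, and the weight_updates are unscaled back in FP32 before the update:

  #include <cuda_fp16.h>

  /* Make an FP16 copy of the FP32 master weights for the conv forward/backward. */
  __global__ void float_to_half(const float *src, __half *dst, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) dst[i] = __float2half(src[i]);
  }

  /* Undo the loss scaling on the accumulated weight_updates (still FP32)
     before the usual axpy-style update of the master weights. */
  __global__ void unscale(float *weight_updates, int n, float inv_loss_scale)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) weight_updates[i] *= inv_loss_scale;
  }

  /* usage (per conv layer, per iteration):
       float_to_half<<<(n + 255) / 256, 256>>>(weights_fp32, weights_fp16, n);
       ... FP16 forward/backward with the loss multiplied by loss_scale ...
       unscale<<<(n + 255) / 256, 256>>>(weight_updates, n, 1.f / loss_scale);  */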

Some analysis of YOLOv3 behaviour:

The chart below shows the maximum magnitude of weight_updates across all conv layers, as well as the loss, in FP32 mode.
This is just one example. The values will vary if you change the training parameters, or even between runs with the same parameters, due to the many random choices (e.g. selection of images, random generation of augmentation parameters), but the overall shape is the same.
It starts at a high value; the peak is 165,055, which becomes infinite in FP16 (max: 65,504), so in this case we would like to set the loss scale to 0.25.
It then drops by a factor of approximately 2^10 over the first few hundred iterations.
So, it is better to double the loss scale quickly rather than only every 1000 or 2000 iterations, to avoid losing precision during the conversion, and I decided to update the scale based on the following ratio of max weight_updates:
    max magnitude of the current iteration / max magnitude of all iterations so far
This rule applies until burn_in, which is set to 1000 by default in the cfg file. After burn_in, the scale is doubled every 1000 iterations (configurable), as in the links above.
The rule could be based on the loss instead (lower computational cost), but in the example chart the loss stabilizes before the weight_updates do (see the avg_loss and weight_updates/256 values).
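
Put together, my reading of the scale-update rule looks roughly like the sketch below (hypothetical names and structure; the exact implementation is in the patched source): halve on NaN/Inf in the weight_updates, grow following the ratio above until burn_in, then double every N clean iterations:

  /* Hypothetical sketch of the dynamic loss-scale policy described above. */
  typedef struct {
      float loss_scale;       /* current scale applied to the loss               */
      float initial_scale;    /* scale at the start of training (e.g. 0.25)      */
      float max_dw_ever;      /* max |weight_updates| over all iterations so far */
      int   burn_in;          /* 1000 by default, from the cfg                   */
      int   double_interval;  /* e.g. 1000 iterations after burn_in              */
      int   last_doubled;     /* iteration at which the scale was last doubled   */
  } scale_state;

  void update_loss_scale(scale_state *s, float max_dw_iter, int iter, int dw_has_nan_or_inf)
  {
      if (dw_has_nan_or_inf) {          /* same halving rule as the referenced proposals */
          s->loss_scale *= 0.5f;
          return;
      }
      if (max_dw_iter > s->max_dw_ever) s->max_dw_ever = max_dw_iter;

      if (iter < s->burn_in) {
          /* grow quickly, following how far the current iteration's max has
             dropped below the all-time max (the inverse of the ratio above) */
          float target = s->initial_scale * (s->max_dw_ever / max_dw_iter);
          while (s->loss_scale * 2.f <= target) s->loss_scale *= 2.f;
          s->last_doubled = iter;
      } else if (iter - s->last_doubled >= s->double_interval) {
          s->loss_scale *= 2.f;         /* classic rule: double every N clean iterations */
          s->last_doubled = iter;
      }
  }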



Another interesting fact is where the max weight_updates comes from.
Below are the values of the max magnitude for every conv layer. As can be seen, the max magnitude varies a lot between layers, and the last conv layer has the highest value in the initial iterations. During this time, the max magnitude varies between layers by more than a factor of 2^10.



Resizing data augmentation
Darknet/YOLOv3 implements data augmentation such as adjustment of color as well as cropping, flipping and resizing of images.
The resize factor ranges from 1/1.4 to 1.4 of the configured width/height (rounded to a multiple of 32, with some additional maths).
For 320x320, the resize varies from 256x256 to 480x480.
The resize target is chosen randomly every 10 iterations, but for the first 10 iterations Darknet/YOLOv3 uses the maximum size (480x480 in this case) in order to initialize the memory allocation at its maximum.
The maximum resize requires the most memory and the longest processing time.
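
As a rough sketch of how the resize target can be computed (my approximation of the behaviour described above, not the exact Darknet code; the real rounding differs slightly and produces 256x256 to 480x480 for a 320x320 cfg):

  #include <math.h>
  #include <stdlib.h>

  /* Pick a factor in [1, max_scale] and randomly invert it (shrink vs grow). */
  static float rand_resize_scale(float max_scale)
  {
      float scale = 1.f + ((float)rand() / (float)RAND_MAX) * (max_scale - 1.f);
      return (rand() % 2) ? scale : 1.f / scale;
  }

  /* Apply the factor to the configured dimension and round to a multiple of 32. */
  static int random_dim(int base, float max_scale)
  {
      int dim = (int)roundf(base * rand_resize_scale(max_scale) / 32.f) * 32;
      return dim < 32 ? 32 : dim;
  }

  /* e.g. called every 10 iterations as random_dim(320, 1.4f) */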

Memory usage evaluation method

I have investigated the following ways to check Tegra (Jetson) system memory usage:
* tegrastats command (can output memory usage on specified interval to a log file)
* jtop command (same information as tegrastats, but no output to a logfile)
* cudaMemGetInfo (c code)
* /usr/bin/time -v  (Maximum resident set size)

I decided to use tegrastats because it can log values at a fixed interval (I set it to 5 seconds).
Note that cudaMemGetInfo, /usr/bin/time and tegrastats report different values.
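
For reference, a minimal cudaMemGetInfo() check looks like this (on Jetson the GPU has no dedicated VRAM, so the values refer to the shared system DRAM, which is one reason they differ from the tegrastats and /usr/bin/time numbers):

  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(void)
  {
      size_t free_b = 0, total_b = 0;
      cudaError_t err = cudaMemGetInfo(&free_b, &total_b);
      if (err != cudaSuccess) {
          fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
          return 1;
      }
      printf("free: %.1f MB / total: %.1f MB\n",
             free_b / (1024.0 * 1024.0), total_b / (1024.0 * 1024.0));
      return 0;
  }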

Results for Jetson Nano:

Below are some experimental results.
The memory usage figures are the RAM + SWAP reported by tegrastats, which means the memory used by other processes (the default processes running on Jetson Nano) is also included.
As noted above, Darknet/YOLOv3 uses the most memory at the maximum resize, and the first 10 iterations are set to the maximum resize.
So, during the first 10 iterations the memory usage is expected to be approximately at its maximum.
However, sometimes during training the memory usage increases further, possibly because of other processes.
Because of this, there are two ways Darknet can be killed due to running out of memory:
1) During the initial memory allocation
2) After some time, due to memory usage increase by other processes.

For example, the charts below show the memory usage during training at 320x320 for FP32 and Mixed Precision.
FP32 finishes first, as shown below.
The memory usage (RAM + SWAP) for the first 10 iterations is approximately 3800 MB (FP32) and 3200 MB (Mixed).
However, memory usage increases over the course of training, as shown in the charts below; the reason still needs further investigation.

Also, I have configured my Nano with the default SWAP (2 GB) and to run in headless mode.



Training results

configuration     Original YOLOv3                                   This adaptation
yolov3 320x320    OK                                                OK
yolov3 416x416    killed after approx. 100 iterations (1 attempt)   OK
yolov3 544x544    killed on start                                   killed after approx. 300 iterations (1 attempt)
yolov3 tiny       Depends on w/h                                    Not implemented
yolov2            Depends on w/h                                    Not implemented

Timing analysis
For 320x320 training, the time to finish one iteration at the 480x480 resize is:
  FP32 : 186 secs
  Mix : 272 secs 
The following is a summary of the profiling by the following command:
  nvprof --print-gpu-trace --log-file log.txt ./darknet detector train ...
I ran the training for 3 iterations, and the data below is from the 2nd iteration.
For FP32, the total GPU time is approximately the same as the 186 secs above.
However, for Mixed Precision, the total GPU time is 232 secs, less than the 272 secs above (a 40 secs difference).
On the other hand, if my analysis is correct, as below, the biggest difference between Mixed Precision and FP32 is the processing of the convolutions (FP16 takes 47 secs more).
So, the GPU time difference is in the convolutions, while I still need to find the remaining 40 secs of difference.

mAP@0.5 results for cat/dog

w x h      mode                          best mAP
320x320    2 classes (FP32)              66.4%
320x320    2 classes (Mixed Precision)   68.0%
416x416    2 classes (Mixed Precision)   73.3%


NOTE: trained for 5000 epochs and selected the weights with the best mAP (stored every 100 iterations).
It should be stated that the mAP varies between training runs, even if we don't change the settings.
So, the numbers above do not necessarily mean that the Mixed Precision algorithm is always better than FP32.
