Out-Of-Memory GPU Error on SSD Mobilenet V2 FPNlite during Validation #35

@robingadola

Hello, I'm trying to use the SSD Mobilenet V2 FPNlite model on images with ~30 ground-truth bounding boxes each. Whenever the validation dataset is larger than ~150 images, validation fails with a GPU OOM error during the IoU calculation in the calculate_box_wise_iou() function:

Epoch 7/50
357/357 [==============================] - ETA: 0s - loss: 22.4538
Error executing job with overrides: []
Traceback (most recent call last):

...


File "/home/externals/stm32ai_modelzoo_services/object_detection/src/utils/bounding_boxes_utils.py", line 283, in calculate_box_wise_iou
      box2_y2 = box2[:, 3]
Node: 'strided_slice_16'
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED:  OOM when allocating tensor with shape[115648000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node strided_slice_16}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

	 [[strided_slice_60/_298]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED:  OOM when allocating tensor with shape[115648000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node strided_slice_16}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored. [Op:__inference_test_function_220137]

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
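
For scale, the allocation that fails in the trace is a single flat tensor of 115,648,000 elements. The short calculation below converts that to bytes, assuming that "type float" in the message means 32-bit floats (4 bytes per element); that element size is an assumption, not something the log states.

```python
# Size of the one tensor the GPU allocator could not provide, per the trace.
num_elements = 115_648_000   # from "shape[115648000]" in the OOM message
bytes_per_element = 4        # assumption: "type float" means float32

size_bytes = num_elements * bytes_per_element
size_mib = size_bytes / 2**20

print(f"{size_bytes} bytes = {size_mib:.1f} MiB")
```

Under that assumption the request is roughly 441 MiB for this one slice, so the failure probably reflects several intermediates of similar size being alive at once rather than this tensor alone exceeding the card's memory.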

If I decrease the size of the validation set, the error does not occur (although the validation set is then much too small). Running on CPU only also works, but is slower. The error occurs for every GPU memory limit I tried in the configuration (from 0, meaning unlimited, up to 10 GB).

Model: ssd_mobilenet_v2_fpnlite
Input: 416x416
Batch Size: 4
GPU: RTX 3080
OS: Ubuntu
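
One possible direction, sketched under assumptions rather than taken from the repository: if the OOM comes from comparing all boxes of the validation set in one pass, computing the IoU in fixed-size chunks keeps the peak intermediate at `chunk_size x M` instead of `N x M`. The example uses NumPy instead of TensorFlow to stay self-contained. The names `pairwise_iou`, `best_iou_chunked` and `chunk_size` are hypothetical, and it is not the actual `calculate_box_wise_iou()` implementation. Boxes are assumed to be `(x1, y1, x2, y2)` with at least one ground-truth box.

```python
import numpy as np


def pairwise_iou(boxes_a, boxes_b):
    """IoU of every box in boxes_a (N, 4) against every box in boxes_b (M, 4).

    Boxes are assumed to be (x1, y1, x2, y2). Returns an (N, M) array.
    """
    a = boxes_a[:, None, :]  # (N, 1, 4), broadcast against b
    b = boxes_b[None, :, :]  # (1, M, 4)

    # Width and height of the overlap, clipped at 0 for disjoint boxes.
    inter_w = np.clip(
        np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None
    )
    inter_h = np.clip(
        np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None
    )
    inter = inter_w * inter_h

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    union = area_a + area_b - inter

    # Guard against division by zero for degenerate (zero-area) boxes.
    return inter / np.maximum(union, 1e-9)


def best_iou_chunked(pred_boxes, gt_boxes, chunk_size=1024):
    """Best IoU of each predicted box against any ground-truth box.

    Only chunk_size rows of the (N, M) IoU matrix exist at any time.
    """
    best = np.empty(len(pred_boxes), dtype=np.float32)
    for start in range(0, len(pred_boxes), chunk_size):
        stop = start + chunk_size
        chunk_iou = pairwise_iou(pred_boxes[start:stop], gt_boxes)
        best[start:stop] = chunk_iou.max(axis=1)
    return best


pred = np.array([[0, 0, 2, 2], [0, 0, 1, 1], [10, 10, 11, 11]], dtype=np.float32)
gt = np.array([[0, 0, 2, 2], [1, 1, 3, 3]], dtype=np.float32)
result = best_iou_chunked(pred, gt, chunk_size=2)
print(result)
```

The same loop structure should carry over to TensorFlow ops, and processing the validation set image by image is another way to bound the matrix size. Whether either fits the model zoo's evaluation code is untested here.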
