You are given a deep-learning model with P parameters. Implement a function that returns the total GPU memory, in bytes, required to store the optimizer states for two optimizers: SGD with momentum and Adam. Assume all optimizer states are kept in FP32 (4 bytes per scalar).

- SGD with momentum stores one momentum vector, the same size as the parameter tensor.
- Adam stores two vectors of that size: the first-moment estimate (m) and the second-moment estimate (v).

Your function should accept a list of integers giving the number of parameters in each layer and return two integers: the total optimizer-state memory for SGD with momentum and for Adam.

You must also handle the mixed-precision (FP16) training case: model weights are stored in FP16, but the optimizer states remain FP32. In this scenario you must additionally account for the FP32 "master" copy of the parameters that the optimizer updates (an extra 4 bytes per parameter).
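A minimal reference solution might look like the following sketch. The function and parameter names (`optimizer_state_memory`, `layer_params`, `mixed_precision`) are chosen here for illustration; any signature that accepts the per-layer parameter counts and returns the two totals would satisfy the statement.

```python
def optimizer_state_memory(layer_params, mixed_precision=False):
    """Return (sgd_momentum_bytes, adam_bytes) of FP32 optimizer-state memory.

    layer_params:    list of per-layer parameter counts.
    mixed_precision: if True, add the FP32 master copy of the weights
                     (4 bytes per parameter) to each optimizer's total.
    """
    FP32_BYTES = 4
    total_params = sum(layer_params)

    # SGD with momentum keeps one state vector (the momentum buffer).
    sgd_bytes = total_params * FP32_BYTES
    # Adam keeps two state vectors: first moment m and second moment v.
    adam_bytes = 2 * total_params * FP32_BYTES

    if mixed_precision:
        # FP16 weights are updated via an FP32 master copy, which the
        # optimizer also has to hold in memory.
        master_bytes = total_params * FP32_BYTES
        sgd_bytes += master_bytes
        adam_bytes += master_bytes

    return sgd_bytes, adam_bytes
```

For example, a model with layers of 1,000 and 2,000 parameters (3,000 total) needs 3,000 × 4 = 12,000 bytes for the SGD momentum buffer and 24,000 bytes for Adam's m and v; with mixed precision, each total grows by another 12,000 bytes for the FP32 master weights.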