beautyofdeduction OP t1_j7hkq74 wrote
Reply to comment by BellyDancerUrgot in Why does my Transformer blow GPU memory? by beautyofdeduction
That context about how much memory other models use is helpful. Thanks for taking the time to respond.
beautyofdeduction OP t1_j7hkb7q wrote
Reply to comment by neuralbeans in Why does my Transformer blow GPU memory? by beautyofdeduction
Yes, that's true. But even adding that in (6250 * 6250 ≈ 40 million floats), we are still nowhere near 40 GB.
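For what it's worth, here's a rough back-of-the-envelope sketch of how those score matrices can add up once autograd keeps one per head per layer for the backward pass. The head, layer, and batch counts below are placeholder guesses for illustration, not my actual config:

```python
# Rough estimate of attention-score memory (fp32). The head, layer, and
# batch counts are placeholder assumptions, not the real model's values.
seq_len = 6250
num_heads = 8
num_layers = 6
batch_size = 1
bytes_per_float = 4

scores_per_matrix = seq_len * seq_len          # ~39M floats for one matrix
total_floats = scores_per_matrix * num_heads * num_layers * batch_size

print(total_floats * bytes_per_float / 1e9, "GB")  # ~7.5 GB held for backprop
```

So a single matrix is small, but the count multiplies with heads and layers, and it grows quadratically with sequence length.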
beautyofdeduction OP t1_j7eqr8c wrote
Reply to comment by BellyDancerUrgot in Why does my Transformer blow GPU memory? by beautyofdeduction
8 bytes * 22M = 0.176 GB?
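Even being generous and counting gradients and optimizer state on top of the weights, the parameter side stays small. A quick sketch, assuming fp32 and an Adam-style optimizer (the optimizer choice is an assumption on my part):

```python
# Parameter-related memory for a 22M-parameter model in fp32.
# Assumes an Adam-style optimizer (two extra moment buffers per weight).
num_params = 22_000_000
bytes_per_float = 4

weights = num_params * bytes_per_float   # ~0.088 GB
grads   = num_params * bytes_per_float   # ~0.088 GB
adam_m  = num_params * bytes_per_float   # first-moment buffer
adam_v  = num_params * bytes_per_float   # second-moment buffer

print((weights + grads + adam_m + adam_v) / 1e9, "GB")  # ~0.35 GB total
```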
beautyofdeduction OP t1_j7epm3u wrote
Reply to comment by BellyDancerUrgot in Why does my Transformer blow GPU memory? by beautyofdeduction
Can you elaborate?
beautyofdeduction OP t1_j7jqohn wrote
Reply to comment by neuralbeans in Why does my Transformer blow GPU memory? by beautyofdeduction
I wish I could send you my GitHub. But the original Attention Is All You Need paper trained on sequences of length 25000 on multiple K80s (as stated by the authors), which have only 12 GB of VRAM each. Yes, they used multiple GPUs, but afaik each GPU needs to be able to handle its own batch. Or maybe not? Again, I wish I could show you my code.
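My understanding of data parallelism (and I could be wrong here) is that each replica only gets a slice of the batch but still runs the full sequence length, so the seq_len² attention cost doesn't shrink per GPU. A minimal PyTorch sketch of what I mean; the layer sizes and shapes are placeholders, not my actual model:

```python
# Data-parallel sketch (PyTorch). Layer sizes and batch shape are
# placeholder assumptions, not the real model.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.DataParallel(layer.cuda())    # scatters dim 0 (the batch) across GPUs

x = torch.randn(8, 1024, 512).cuda()     # (batch, seq_len, d_model)
out = model(x)                           # each GPU sees a batch slice, but still
                                         # pays the full seq_len**2 attention cost
```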