da_yu t1_is55aez wrote on October 13, 2022 at 11:39 AM

#92,666

We probably need to wait for driver and library updates with Ada-specific optimizations (CUDA 12) to get a fair picture. TensorFlow benchmarks without XLA should (in my opinion) be taken with a grain of salt too.

But if the results stay the same, the improvement (especially fp16) is a disappointment.

ThomasBudd93 t1_is5fwt4 wrote on October 13, 2022 at 1:12 PM

#93,288

We also have to wait for the improvements from fp8 to kick in. NVIDIA recently published a paper demonstrating that training with fp8 is feasible, and the new tensor cores are compatible with that format. The software just isn't there yet.
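For a rough sense of what fp8 trades away in dynamic range, here is a minimal sketch that computes the largest finite value of two 8-bit layouts next to fp16. It assumes the E4M3 and E5M2 encodings (the comment above does not say which fp8 formats are meant), and the helper name `max_finite` is made up for illustration.

```python
def max_finite(exp_bits, man_bits, bias, ieee_like):
    """Largest finite value of a small binary float format.

    ieee_like=True:  the top exponent code is reserved for inf/NaN
                     (assumed for E5M2 and fp16).
    ieee_like=False: only the all-ones bit pattern is NaN (assumed for E4M3),
                     so the top exponent code is usable with every mantissa
                     except all-ones.
    """
    top_exp = (1 << exp_bits) - 1
    full_man = (1 << man_bits) - 1
    if ieee_like:
        exp_code, man_code = top_exp - 1, full_man
    else:
        exp_code, man_code = top_exp, full_man - 1
    return 2.0 ** (exp_code - bias) * (1 + man_code / (1 << man_bits))


E4M3_MAX = max_finite(4, 3, bias=7, ieee_like=False)
E5M2_MAX = max_finite(5, 2, bias=15, ieee_like=True)
FP16_MAX = max_finite(5, 10, bias=15, ieee_like=True)

print(E4M3_MAX, E5M2_MAX, FP16_MAX)
```

With only 2 or 3 mantissa bits, training in these formats generally depends on scaling tensors into the representable range, which is part of the software support that still has to land.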

mlaprise t1_is5gl5b wrote on October 13, 2022 at 1:17 PM

#93,328

Replying to da_yu (#92,666)

yeah and benchmarking with TF 1.15 in 2022 is kinda strange

lostmsu t1_is5jeeb wrote on October 13, 2022 at 1:38 PM

#93,476

The transformer is awfully small (2 blocks, 200 embedding, 35 seq length). I would discard that result as useless. They should be testing on GPT2-117M or something similar.
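To put "awfully small" in numbers, here is a back-of-the-envelope sketch of the weight count in the transformer blocks alone. It assumes the common layout of four d×d attention projections plus a feed-forward layer 4× as wide as the embedding; the benchmark's actual feed-forward width and vocabulary size are not given in this thread, so embeddings and biases are left out. The 12-layer, 768-wide figures are the published GPT-2 small configuration.

```python
def approx_block_weights(d_model, ffn_mult=4):
    """Rough weight count of one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model               # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model  # up- and down-projection
    return attention + feed_forward


def approx_stack_weights(n_blocks, d_model, ffn_mult=4):
    return n_blocks * approx_block_weights(d_model, ffn_mult)


tiny = approx_stack_weights(n_blocks=2, d_model=200)         # benchmarked model (ffn_mult is an assumption)
gpt2_small = approx_stack_weights(n_blocks=12, d_model=768)  # GPT-2 117M-class model

print(tiny, gpt2_small, round(gpt2_small / tiny, 1))
```

Under these assumptions the benchmarked blocks hold under a million weights, nearly two orders of magnitude fewer than GPT-2 small, so the matmuls are likely too small to keep a large GPU busy.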

AlmightySnoo t1_is5twx8 wrote on October 13, 2022 at 2:53 PM

#94,068

You're memory-bound on neural network problems, as frameworks usually perform multiple loads/stores from/to the GPU's global memory at each activation/layer. Operator fusion, as done for example by PyTorch's JIT compiler, helps a bit, but it cannot fuse operators with a matrix multiplication, since the latter is usually done using cuBLAS. NN frameworks need to rethink this "okay, efficient matrix multiplication algos aren't trivial, so let's delegate this to black-box code like cuBLAS" mentality, as I think it's a shameful waste of chip power and caps the potential of GPUs.
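The memory-bound argument can be sketched with simple traffic counting. The model below is a deliberate simplification, not a measurement: it assumes each unfused elementwise op reads its input from global memory once and writes its output once, does one FLOP per element, and that a fused kernel keeps intermediates in registers. The function names are hypothetical.

```python
def elementwise_traffic_bytes(n_elems, n_ops, bytes_per_elem=2, fused=False):
    """Global-memory bytes moved by a chain of n_ops elementwise ops (fp16 by default)."""
    passes = 1 if fused else n_ops                   # fused: one read + one write in total
    return passes * 2 * n_elems * bytes_per_elem     # one read + one write per pass


def arithmetic_intensity(n_elems, n_ops, **kwargs):
    """FLOPs per byte of global-memory traffic, assuming 1 FLOP per element per op."""
    flops = n_elems * n_ops
    return flops / elementwise_traffic_bytes(n_elems, n_ops, **kwargs)


n = 1 << 20   # elements in the activation tensor
ops = 3       # e.g. bias add, activation, scale

unfused = arithmetic_intensity(n, ops)              # 0.25 FLOP/byte
fused = arithmetic_intensity(n, ops, fused=True)    # 0.75 FLOP/byte
print(unfused, fused)
```

Even the fused chain stays well under one FLOP per byte, which is why such ops are limited by bandwidth rather than compute, and why folding them into the neighbouring matmul is attractive.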

DanShawn t1_is5w6an wrote on October 13, 2022 at 3:08 PM

#94,166

Replying to lostmsu (#93,476)

Or even default BERT...

MohamedRashad t1_is68bbw wrote on October 13, 2022 at 4:29 PM

#94,809

The RTX 3090 is now being sold for as low as $1000 ... I think it will be the best option for a lot of researchers here.

ReginaldIII t1_is6e8rx wrote on October 13, 2022 at 5:07 PM

#95,137

Replying to mlaprise (#93,328)

Not really. There's still a lot of models being used in production written for the old TF graph API.

And if you've tested every prior GPU against that standard benchmark model for years you keep doing it so you can see what happens.

Edit: And, as is this sub's tradition of callous downvoting because your knee-jerk reaction wasn't correct... here's the relevant part of the article for you:

> TensorFlow 1.15.5 ResNet50
>
> This is the NVIDIA maintained version 1 of TensorFlow which typically offers somewhat better performance than version 2. The benchmark is training 100 steps of the ResNet 50 layer convolution neural network (CNN). The result is the highest images-per-second value from the run steps. FP32 and FP16 (tensorcore) jobs were run.

It's a standard benchmark model! And it performs better than those written for TF2. What more do you want?

https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5

pan_berbelek t1_is6pz3i wrote on October 13, 2022 at 6:23 PM

#95,674

Disappointing? Those results look great, the improvement is really good. What exactly were you expecting?

Sirisian t1_is70wwy wrote on October 13, 2022 at 7:34 PM

#96,255

I'm hoping Samsung gets their GDDR7 modules out fast into the Ti models. If so the memory bottleneck will be basically gone. It'll go from 1 TB/s to 1.728 TB/s.
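The two bandwidth figures follow from bus width times per-pin data rate. The sketch below reproduces them under assumptions that are not stated in the comment: a 384-bit memory bus, 21 Gbps per pin for the current GDDR6X, and 36 Gbps per pin for GDDR7.

```python
def bandwidth_gb_per_s(bus_width_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: pins * Gbit/s per pin / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


current = bandwidth_gb_per_s(384, 21)   # assumed GDDR6X configuration
gddr7 = bandwidth_gb_per_s(384, 36)     # assumed GDDR7 data rate on the same bus

print(current, gddr7, round(gddr7 / current, 2))
```

That is roughly a 1.7x increase in peak bandwidth, which would ease the bottleneck for memory-bound workloads rather than remove it.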

lostmsu t1_is7k680 wrote on October 13, 2022 at 9:38 PM

#97,192

Replying to pan_berbelek (#95,674)

2x

nomadiclizard t1_is7svro wrote on October 13, 2022 at 10:38 PM

#97,640

I'd rather get a pair of 3090's cheap!

[deleted] t1_is7v6iw wrote on October 13, 2022 at 10:55 PM

#97,767

[deleted]

programmerChilli t1_is7vgbp wrote on October 13, 2022 at 10:57 PM

#97,782

Replying to AlmightySnoo (#94,068)

I mean... it's hard to write efficient matmuls :)

But... recent developments (i.e. CUTLASS and Triton) do allow NN frameworks to write efficient matmuls, so I think you'll start seeing them being used to fuse other operators with them :)

You can already see some of that being done in projects like AITemplate.

I will note one other thing though - fusing operators with matmuls is not as big of a bottleneck in training, this optimization primarily helps in inference.
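As a toy illustration of what "fusing other operators with matmuls" means, the sketch below applies a bias-add and ReLU epilogue to each output element while it is still in the accumulator, instead of in a second pass over the result. It is plain Python for readability, not how CUTLASS, Triton, or AITemplate implement it, and the `epilogue` hook is a made-up name.

```python
def matmul(a, b, epilogue=None):
    """Naive matmul; an optional epilogue(acc, col) runs before each element is stored."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Fused path: transform the accumulator before it is written out.
            out[i][j] = epilogue(acc, j) if epilogue else acc
    return out


def bias_relu(bias):
    """Epilogue that adds a per-column bias and applies ReLU."""
    return lambda acc, j: max(acc + bias[j], 0.0)


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]

# Unfused: matmul writes the full result, then a second pass re-reads it.
tmp = matmul(a, b)
unfused = [[max(v + bias[j], 0.0) for j, v in enumerate(row)] for row in tmp]

# Fused: the same math with no intermediate result stored.
fused = matmul(a, b, epilogue=bias_relu(bias))
print(fused)
```

Both paths give the same numbers; on a GPU the fused one saves a round trip through global memory for the intermediate tensor.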

danielfm123 t1_is7vrsw wrote on October 13, 2022 at 10:59 PM

#97,794

Still very happy on my 1070ti ... 1080p is a dream for someone that started with voodoo 2.

KelseyFrog t1_is8b2cq wrote on October 14, 2022 at 12:56 AM

#98,590

Replying to pan_berbelek (#95,674)

10x-100x

kajladk t1_is8gtdt wrote on October 14, 2022 at 1:39 AM

#98,899

Replying to ReginaldIII (#95,137)

Can anyone explain how taking the max value from 100 runs is a good benchmark, when for most other benchmarks (gaming fps etc.) the average fps across multiple runs gives a more realistic picture of performance and eliminates any outliers?

ReginaldIII t1_is8j7x8 wrote on October 14, 2022 at 1:57 AM

#99,020

Replying to kajladk (#98,899)

I would say there's precedent for lucky run benchmark scores. Consider 3dmark as an example.

https://benchmarks.ul.com/hall-of-fame-2/timespy+3dmark+score+performance+preset/version+1.0

All of those runs with different system configurations are people's luckiest runs.

kajladk t1_is8lqr0 wrote on October 14, 2022 at 2:16 AM

#99,116

Replying to ReginaldIII (#99,020)

But isn't this different? We are comparing a raw metric (fps, images/sec) with an aggregate score, which might already have built-in ways to eliminate or regularize some outlier metrics.

yashdes t1_is8r2sr wrote on October 14, 2022 at 2:58 AM

#99,379

Replying to Sirisian (#96,255)

Those must be cooking with how hot the gddr6 modules get

Prinzessid t1_is9spkm wrote on October 14, 2022 at 10:32 AM

#100,982

Replying to danielfm123 (#97,794)

1080p? This is about machine learning performance, not graphics

ReginaldIII t1_isal2mt wrote on October 14, 2022 at 2:36 PM

#102,359

Replying to kajladk (#99,116)

Nvidia and Puget want to report lucky runs. Lots of people do this. They're being fully transparent that they are reporting lucky runs. And it makes sense from their perspective to report their best theoretical performance.

It honestly just doesn't bother me to see them doing it because it's very normal and lots of people report this way. Even if we think an average with an error bar would be fairer.
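The two reporting styles being contrasted here can be put side by side in a few lines. The per-step throughput numbers below are invented for illustration and are not taken from the article.

```python
import statistics


def summarize(samples):
    """Report a 'lucky run' figure next to a mean with a spread estimate."""
    return {
        "best": max(samples),
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),   # sample standard deviation
    }


# Hypothetical images/sec measured on five training steps.
steps = [1390.0, 1402.0, 1398.0, 1405.0, 1395.0]
s = summarize(steps)

print(f"best: {s['best']:.0f} img/s")
print(f"mean: {s['mean']:.0f} +/- {s['stdev']:.1f} img/s")
```

When the spread is this small the two figures tell nearly the same story; the choice matters more when individual steps vary a lot.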

afireohno t1_isblrse wrote on October 14, 2022 at 6:42 PM

#104,163

Replying to kajladk (#98,899)

>average fps across multiple runs gives a more realistic performance and eliminates any outliers

Thanks for the laugh. I'll just leave this here so you can read about why the mean (average) is not a robust measure of central tendency because it is easily skewed by outliers.
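The point about robustness is easy to check numerically. In the sketch below (invented numbers, not benchmark data), a single stalled step drags the mean down by hundreds of images/sec while the median barely moves.

```python
import statistics

# Hypothetical images/sec for five well-behaved steps.
clean = [1400.0, 1402.0, 1398.0, 1401.0, 1399.0]
# The same steps plus one stall (e.g. a data-loading hiccup).
with_outlier = clean + [200.0]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_clean = statistics.median(clean)
median_outlier = statistics.median(with_outlier)

print(mean_clean, mean_outlier)       # the mean shifts by 200
print(median_clean, median_outlier)   # the median shifts by 0.5
```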