Comments

da_yu t1_is55aez wrote

We probably need to wait for driver and library updates with Ada-specific optimizations (CUDA 12) to get a fair picture. TensorFlow benchmarks without XLA should (in my opinion) be taken with a grain of salt too.

But if the results stay the same, the improvement (especially in fp16) is a disappointment.
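
For what it's worth, turning on XLA is usually a small change. A rough sketch of an XLA-compiled training step in TF2, with the model, optimizer, and loss purely as placeholders:

```python
import tensorflow as tf

# Placeholder model/optimizer/loss just to illustrate the jit_compile flag.
model = tf.keras.applications.ResNet50(weights=None)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# jit_compile=True asks TensorFlow to compile the whole step with XLA,
# which fuses ops and can change the performance picture substantially.
@tf.function(jit_compile=True)
def train_step(images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```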

52

mlaprise t1_is5gl5b wrote

Yeah, and benchmarking with TF 1.15 in 2022 is kinda strange.

30

ReginaldIII t1_is6e8rx wrote

Not really. There are still a lot of models in production written for the old TF graph API.

And if you've tested every prior GPU against that standard benchmark model for years, you keep doing it so you can compare the new card against all those old results.

Edit: And as is this sub's tradition of callous downvoting when your knee-jerk reaction wasn't correct... here's the relevant part of the article for you:

> TensorFlow 1.15.5 ResNet50
>
> This is the NVIDIA maintained version 1 of TensorFlow which typically offers somewhat better performance than version 2. The benchmark is training 100 steps of the ResNet 50 layer convolution neural network (CNN). The result is the highest images-per-second value from the run steps. FP32 and FP16 (tensorcore) jobs were run.

It's a standard benchmark model! And it performs better than those written for TF2. What more do you want?

https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5
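
For anyone unsure what "the highest images-per-second value from the run steps" means in practice, here's a rough, framework-agnostic sketch of that reporting scheme (the warmup handling is my own assumption, not from the article):

```python
import time

def benchmark(train_step, batch_size, num_steps=100, warmup=10):
    """Run num_steps training steps and report the best per-step throughput."""
    throughputs = []
    for step in range(num_steps):
        start = time.perf_counter()
        train_step()                       # one optimizer step on one batch
        elapsed = time.perf_counter() - start
        if step >= warmup:                 # skip compilation/warmup steps
            throughputs.append(batch_size / elapsed)
    return max(throughputs)                # "highest images per second"
```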

11

kajladk t1_is8gtdt wrote

Can anyone explain how taking the max value from 100 runs is a good benchmark, when for most other benchmarks (gaming fps, etc.) the average fps across multiple runs gives a more realistic performance and eliminates any outliers?

0

afireohno t1_isblrse wrote

>average fps across multiple runs gives a more realistic performance and eliminates any outliers

Thanks for the laugh. Look up robust measures of central tendency: the mean (average) is not one, precisely because it is easily skewed by outliers.
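
A toy example of why that claim is backwards: one bad run drags the mean down hard, the median barely notices, and the max ignores it entirely (numbers made up):

```python
import numpy as np

# Hypothetical per-run throughput in images/sec; one run hit throttling.
runs = np.array([1480, 1495, 1502, 1488, 1491, 410])

print(np.mean(runs))    # 1311.0  -> dragged far below typical performance
print(np.median(runs))  # 1489.5  -> robust to the single bad run
print(np.max(runs))     # 1502    -> the "lucky run" number
```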

1

kajladk t1_isd009f wrote

Umm, I know the mean is more easily skewed by outliers than the median, but it's still "better" than taking the best value.

1

ReginaldIII t1_is8j7x8 wrote

I would say there's precedent for lucky-run benchmark scores. Consider 3DMark as an example.

https://benchmarks.ul.com/hall-of-fame-2/timespy+3dmark+score+performance+preset/version+1.0

All of those runs, across different system configurations, are people's luckiest runs.

0

kajladk t1_is8lqr0 wrote

But isn't this different? We are comparing a raw metric (fps, images/sec) with an aggregate score, which might already have built-in ways to eliminate or regularize outliers.

1

ReginaldIII t1_isal2mt wrote

Nvidia and Puget want to report the lucky run. Lots of people do this, and they're being fully transparent that that's what they're reporting. It also makes sense from their perspective to report their best observed performance.

Honestly, it doesn't bother me to see them doing it, because it's very common and lots of people report this way, even if an average with an error bar would arguably be fairer.

0

lostmsu t1_is5jeeb wrote

The transformer is awfully small (2 blocks, embedding dim 200, sequence length 35). I would discard that result as useless. They should be testing on GPT-2-117M or something similar.
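
Back-of-the-envelope parameter counts (my own rough math, ignoring embeddings, biases, and layer norms) to show the scale gap:

```python
def approx_transformer_params(d_model, n_layers):
    # Per block: ~4*d^2 for the attention projections + ~8*d^2 for a 4x-wide MLP.
    return 12 * d_model**2 * n_layers

print(approx_transformer_params(200, 2))    # ~0.96M params (the benchmarked model)
print(approx_transformer_params(768, 12))   # ~85M params (GPT-2-117M, before embeddings)
```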

28

pan_berbelek t1_is6pz3i wrote

Disappointing? Those results look great; the improvement is really good. What exactly were you expecting?

23

AlmightySnoo t1_is5twx8 wrote

You're memory-bound on neural network workloads, since frameworks usually perform multiple loads/stores to and from the GPU's global memory at each layer/activation. Operator fusion, as done for example by PyTorch's JIT compiler, helps a bit, but it cannot fuse operators into a matrix multiplication, since the latter is usually delegated to cuBLAS. NN frameworks need to rethink this "efficient matmul algorithms aren't trivial, so let's hand it off to a black box like cuBLAS" mentality; I think it's a shameful waste of chip power and it caps the potential of GPUs.
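
To make that concrete, a toy sketch (not production code): TorchScript can fuse the pointwise chain below with itself, but the matmul is an opaque cuBLAS call, so the intermediate makes a round trip through global memory before the fused part runs.

```python
import torch

@torch.jit.script
def mlp_layer(x: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    y = x @ w                       # dispatched to cuBLAS as a black-box kernel
    return torch.relu(y + b) * 0.5  # add/relu/scale can be fused with each other,
                                    # but not into the matmul's epilogue
```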

17

programmerChilli t1_is7vgbp wrote

I mean... it's hard to write efficient matmuls :)

But... recent developments (e.g. CUTLASS and Triton) do allow NN frameworks to write efficient matmuls themselves, so I think you'll start seeing other operators fused into them :)

You can already see some of that being done in projects like AITemplate.

I will note one other thing, though: the lack of matmul fusion is not as big a bottleneck in training; this optimization primarily helps in inference.
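
For the curious, a stripped-down Triton-style sketch of what "fusing an op into the matmul" means: the ReLU is applied to the accumulator while it is still in registers, so the pre-activation never touches global memory. Heavily simplified (fp32, no masking, assumes M/N/K divisible by the block size), so treat it as an illustration rather than a drop-in kernel.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK + tl.arange(0, BLOCK)
    offs_n = pid_n * BLOCK + tl.arange(0, BLOCK)
    offs_k = tl.arange(0, BLOCK)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK)):
        acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))
        a_ptrs += BLOCK * stride_ak
        b_ptrs += BLOCK * stride_bk
    acc = tl.maximum(acc, 0.0)  # fused ReLU epilogue: applied in registers,
                                # no extra global-memory round trip
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

def matmul_relu(a, b, block=32):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (M // block, N // block)  # simplification: dims divisible by block
    matmul_relu_kernel[grid](a, b, c, M, N, K,
                             a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                             c.stride(0), c.stride(1), BLOCK=block)
    return c
```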

4

MohamedRashad t1_is68bbw wrote

The RTX 3090 is now being sold for as low as $1,000... I think it will be the best option for a lot of researchers here.

9

computing_professor t1_itnwsxg wrote

What about 2x3090 vs 1x4090? Cost vs. performance?

1

MohamedRashad t1_itnxalb wrote

The bigger VRAM (2x3090) is the better deal in my opinion, and you get to distribute your training and run more experiments.
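
As a toy illustration of the VRAM argument (layer sizes made up): even without pooled memory, you can split a model that doesn't fit on one card across two of them.

```python
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Naive model parallelism: first half on cuda:0, second half on cuda:1."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(4096, 8192), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(8192, 4096), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop between cards here
```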

1

computing_professor t1_itnzkvv wrote

I guess it's shareable via NVLink. Usually a pair of GeForce cards can't combine VRAM.

2

ThomasBudd93 t1_is5fwt4 wrote

We also have to wait for the improvements from fp8 to kick in. NVIDIA recently published a paper demonstrating that it is feasible to train with fp8, and the new tensor cores are compatible with that format. The software just isn't there yet.
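
Once the software does land, fp8 training will presumably be exposed through something like NVIDIA's Transformer Engine. A rough sketch based on my reading of its docs; treat the exact argument names as approximate:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for forward tensors, E5M2 for gradients in the backward pass.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(16, 1024, device="cuda", requires_grad=True)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)        # matmul runs on the fp8 tensor cores
y.sum().backward()
```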

8

Sirisian t1_is70wwy wrote

I'm hoping Samsung gets its GDDR7 modules out quickly and into the Ti models. If so, the memory bottleneck will basically be gone: bandwidth would go from ~1 TB/s to 1.728 TB/s.
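
The arithmetic behind those numbers, for reference (the 36 Gbps GDDR7 pin rate is the assumption that makes the 1.728 TB/s figure work out):

```python
def peak_bandwidth_tbs(bus_width_bits, pin_rate_gbps):
    # bytes/s = (bus width in bits / 8) * per-pin data rate
    return bus_width_bits / 8 * pin_rate_gbps / 1000

print(peak_bandwidth_tbs(384, 21))  # ~1.008 TB/s (RTX 4090, 21 Gbps GDDR6X)
print(peak_bandwidth_tbs(384, 36))  # 1.728 TB/s (same bus, 36 Gbps GDDR7)
```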

5

yashdes t1_is8r2sr wrote

Those must be cooking, given how hot the GDDR6 modules already get.

1

labloke11 t1_iskly8g wrote

Any benchmark on Arc GPUs?

1

nomadiclizard t1_is7svro wrote

I'd rather get a pair of 3090s cheap!

0

danielfm123 t1_is7vrsw wrote

Still very happy with my 1070 Ti... 1080p is a dream for someone who started with a Voodoo 2.

−3

Prinzessid t1_is9spkm wrote

1080p? This is about machine learning performance, not graphics.

3