Comments

da_yu t1_is55aez wrote

We probably need to wait for driver and library updates with Ada-specific optimizations (CUDA 12) to get a fair picture. TensorFlow benchmarks without XLA should (in my opinion) be taken with a grain of salt too.

But if the results stay the same, the improvement (especially fp16) is a disappointment.

52

ThomasBudd93 t1_is5fwt4 wrote

We also have to wait for the improvements from fp8 to kick in. NVIDIA recently published a paper demonstrating that training with fp8 is feasible, and the new tensor cores are compatible with that format. The software just isn't there yet.

8

lostmsu t1_is5jeeb wrote

The transformer is awfully small (2 blocks, 200 embedding, 35 seq length). I would discard that result as useless. They should be testing on GPT2-117M or something similar.
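To put "awfully small" in numbers, here is a rough sketch that counts only the weights in the transformer blocks. The feed-forward width of 4x the embedding size is an assumption (the benchmark's value isn't stated here), and embeddings, biases and layer norms are ignored. The GPT-2 small shape (12 blocks, width 768) is the published configuration; the helper names are made up for illustration.

```python
def block_params(d_model: int, d_ff: int) -> int:
    """Approximate weight count of one transformer block.

    Ignores biases, layer norms and embeddings.
    """
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff   # two linear layers
    return attention + feed_forward


def stack_params(n_blocks: int, d_model: int, d_ff_multiplier: int = 4) -> int:
    """Weights in a stack of identical blocks (d_ff = multiplier * d_model, assumed)."""
    return n_blocks * block_params(d_model, d_ff_multiplier * d_model)


# Benchmark model from the comment: 2 blocks, embedding size 200.
tiny = stack_params(n_blocks=2, d_model=200)

# GPT-2 small: 12 blocks, width 768.
gpt2_small = stack_params(n_blocks=12, d_model=768)

ratio = gpt2_small / tiny
print(f"tiny: {tiny:,} block weights")
print(f"GPT-2 small: {gpt2_small:,} block weights")
print(f"ratio: {ratio:.1f}x")
```

Under these assumptions the benchmark model has under a million block weights, so kernel launch overhead and data loading probably dominate over actual GPU compute.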

28

AlmightySnoo t1_is5twx8 wrote

You're memory-bound on neural network problems, as frameworks usually perform multiple loads/stores from/to the GPU's global memory at each activation/layer. Operator fusion, as done for example by PyTorch's JIT compiler, helps a bit, but it cannot fuse operators with a matrix multiplication since the latter is usually done using cuBLAS. NN frameworks need to rethink this "okay, efficient matrix multiplication algos aren't trivial, so let's delegate this to black-box code like cuBLAS" mentality, as I think it's a shameful waste of chip power and caps the potential of GPUs.
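A back-of-the-envelope sketch of the memory-bound argument, using arithmetic intensity (FLOPs per byte moved). The peak compute and bandwidth figures below are round illustrative numbers, not measured specs for any particular card, and the traffic model assumes every unfused op reads and writes global memory once.

```python
BYTES_FP16 = 2


def elementwise_intensity(flops_per_element: float = 1.0,
                          bytes_per_element: int = BYTES_FP16) -> float:
    """FLOPs per byte for an elementwise op: one read and one write per element."""
    return flops_per_element / (2 * bytes_per_element)


def matmul_intensity(n: int, bytes_per_element: int = BYTES_FP16) -> float:
    """FLOPs per byte for an n x n matmul: read A and B, write C, each once."""
    flops = 2 * n ** 3
    traffic = 3 * n * n * bytes_per_element
    return flops / traffic


def unfused_traffic(n_elements: int, n_ops: int,
                    bytes_per_element: int = BYTES_FP16) -> int:
    """Bytes moved when each of n_ops elementwise ops round-trips global memory."""
    return n_ops * 2 * n_elements * bytes_per_element


def fused_traffic(n_elements: int, bytes_per_element: int = BYTES_FP16) -> int:
    """Bytes moved when the whole chain is fused: one read, one write."""
    return 2 * n_elements * bytes_per_element


def is_memory_bound(intensity: float, peak_flops: float, bandwidth: float) -> bool:
    """An op is memory-bound if its intensity is below the machine balance."""
    return intensity < peak_flops / bandwidth


# Illustrative round numbers (assumptions): 80 TFLOP/s compute, 1 TB/s bandwidth.
PEAK_FLOPS = 80e12
BANDWIDTH = 1e12

print("elementwise memory-bound:",
      is_memory_bound(elementwise_intensity(), PEAK_FLOPS, BANDWIDTH))
print("4096^3 matmul memory-bound:",
      is_memory_bound(matmul_intensity(4096), PEAK_FLOPS, BANDWIDTH))
print("traffic saved by fusing 3 ops:",
      unfused_traffic(1_000_000, 3) - fused_traffic(1_000_000), "bytes")
```

The large matmul sits well above the machine balance, while the activations around it sit far below, which is why fusing them into the matmul's epilogue is the interesting case.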

17

MohamedRashad t1_is68bbw wrote

The RTX 3090 is now being sold for as low as $1000 ... I think it will be the best option for a lot of researchers here.

9

ReginaldIII t1_is6e8rx wrote

Not really. There's still a lot of models being used in production written for the old TF graph API.

And if you've tested every prior GPU against that standard benchmark model for years you keep doing it so you can see what happens.

Edit: And as is this sub's tradition of callous downvoting because your knee-jerk reaction wasn't correct... Here's the relevant part of the article for you:

> TensorFlow 1.15.5 ResNet50
>
> This is the NVIDIA maintained version 1 of TensorFlow which typically offers somewhat better performance than version 2. The benchmark is training 100 steps of the ResNet 50 layer convolution neural network (CNN). The result is the highest images-per-second value from the run steps. FP32 and FP16 (tensorcore) jobs were run.

It's a standard benchmark model! And it performs better than those written for TF2. What more do you want?

https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Classification/ConvNets/resnet50v1.5

11

pan_berbelek t1_is6pz3i wrote

Disappointing? Those results look great, the improvement is really good. What exactly were you expecting?

23

Sirisian t1_is70wwy wrote

I'm hoping Samsung gets their GDDR7 modules out fast into the Ti models. If so the memory bottleneck will be basically gone. It'll go from 1 TB/s to 1.728 TB/s.
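For anyone wondering where those two figures come from, peak bandwidth is just per-pin data rate times bus width. A quick sketch, assuming a 384-bit bus with 21 Gbps per pin today and 36 Gbps per pin for GDDR7 (my assumptions; they reproduce the numbers above):

```python
def memory_bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin rate (Gbit/s) * bus width / 8 bits per byte."""
    return pin_rate_gbps * bus_width_bits / 8


BUS_WIDTH_BITS = 384  # assumed bus width

current = memory_bandwidth_gb_s(21, BUS_WIDTH_BITS)  # assumed current per-pin rate
gddr7 = memory_bandwidth_gb_s(36, BUS_WIDTH_BITS)    # assumed GDDR7 per-pin rate

print(f"current: {current / 1000:.3f} TB/s")
print(f"GDDR7:   {gddr7 / 1000:.3f} TB/s")
print(f"uplift:  {gddr7 / current:.2f}x")
```

These are peak numbers; whether the bottleneck is "basically gone" still depends on how much compute grows alongside it.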

5

nomadiclizard t1_is7svro wrote

I'd rather get a pair of 3090s cheap!

0

programmerChilli t1_is7vgbp wrote

I mean... it's hard to write efficient matmuls :)

But... recent developments (e.g. CUTLASS and Triton) do allow NN frameworks to write efficient matmuls, so I think you'll start seeing them being used to fuse other operators with them :)

You can already see some of that being done in projects like AITemplate.

I will note one other thing though: fusing operators with matmuls is not as big of a bottleneck in training; this optimization primarily helps in inference.

4

danielfm123 t1_is7vrsw wrote

Still very happy on my 1070ti ... 1080p is a dream for someone that started with voodoo 2.

−3

kajladk t1_is8gtdt wrote

Can anyone explain how taking the max value from 100 runs is a good benchmark, when for most other benchmarks (gaming fps etc.) the average fps across multiple runs gives a more realistic picture of performance and eliminates any outliers?

0

kajladk t1_is8lqr0 wrote

But isn't this different? We are comparing a raw metric (fps, images/sec) with an aggregate score, which might already have ways to eliminate or regularize some outlier metrics built in.

1

ReginaldIII t1_isal2mt wrote

Nvidia and Puget want to report the lucky run. Lots of people do this. They're being fully transparent that they are reporting lucky runs. And it makes sense from their perspective to report their best-case performance.

It honestly just doesn't bother me to see them doing it because it's very normal and lots of people report this way. Even if we think an average with an error bar would be fairer.

0

afireohno t1_isblrse wrote

>average fps across multiple runs gives a more realistic performance and eliminates any outliers

Thanks for the laugh. I'll just leave this here so you can read about why the mean (average) is not a robust measure of central tendency because it is easily skewed by outliers.
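A quick illustration with made-up numbers (the throughput samples below are hypothetical, not taken from the article): one lucky step drags the mean up, barely moves the median, and is exactly what a "max" benchmark reports.

```python
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Return mean, median and max of per-step throughput samples."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


# Hypothetical images/sec per step; the last value is a single lucky outlier.
throughput = [1400, 1410, 1395, 1405, 1402, 1398, 1407, 1401, 1399, 2100]

with_outlier = summarize(throughput)
without_outlier = summarize(throughput[:-1])

for name, stats in (("with outlier", with_outlier),
                    ("without outlier", without_outlier)):
    print(name, {k: round(v, 1) for k, v in stats.items()})
```

The median stays within a point of its outlier-free value, while the mean shifts by roughly 70 images/sec from one sample.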

1

labloke11 t1_iskly8g wrote

Any benchmark on Arc GPUs?

1