Ulfgardleo

Ulfgardleo t1_irm2bsq wrote

Yes. I think at this point it is important to realize that the moment you got hired by a company, your role changed.

You were the guy with a PhD straight from university who did top-notch research. Now, you are the guy hired to make this project work.

If your job description does not include "active research" or "follow the most recent advances in ML research" then it is not your job to know what is up - especially if it is an advancement in a subfield of ML your project is not actively interested in.

2

Ulfgardleo t1_ir9xy3t wrote

You seem to be confused.

  1. Experiment 1 uses small 5x5 matrices. Not block-matrices. There they only count the number of mults. These are not faster than SIMD implementations of 5x5 matrix mults, otherwise they would have shown it off proudly.

  2. Experiment 2 was about 4x4 block-matrices. But here the "10-20% faster than the COMMONLY used algorithms" claim actually overstates the results. For GPUs, their implementation is only 5% faster than their default JAX implementation of Strassen. The difference on TPU could just mean that their JAX compiler sucks for TPUs. (//Edit: by now I low-key assume that the 10-20% refers to standard cBLAS, because I do not get 20% compared to Strassen for any result in Figure 5. And how could they, given that they never even get more than 20% improvement over cBLAS?)

  3. They do not cite any of the papers concerned with efficient implementations of Strassen, especially the memory-efficient scheme from 1994: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.6887 . It is unclear whether a GPU implementation of that would be faster, since they do not even discuss the GPU implementation of their Strassen variant. They do not claim that their algorithm has better complexity, so we rely entirely on their implementation of Strassen making sense.

4

Ulfgardleo t1_ir997hv wrote

The worst thing, however, is that they do not even cite the practically relevant memory-efficient implementation of Strassen (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.6887 ). One can argue that all matmul algorithms with better complexity than Strassen are irrelevant due to their constants, but not even comparing to the best memory-efficient implementation is odd, especially as they don't show any improvement in asymptotic complexity.

5

Ulfgardleo t1_ir7508n wrote

No, because these algorithms are terribly inefficient to implement with SIMD. They have nasty data access patterns and need many more FLOPs once additions are taken into account (just the last step of adding the elements to the result matrix takes more than twice the additions of a standard matmul for the results shown here).
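
To put rough numbers on the addition overhead, here is a minimal sketch that counts scalar operations for classic Strassen versus the schoolbook product. Strassen is only a stand-in here, not one of the algorithms from the paper; the function names and the `cutoff` parameter are made up for illustration, and n is assumed to be a power of two.

```python
def standard_ops(n):
    """Scalar (multiplications, additions) for the schoolbook n x n product."""
    return n**3, n**2 * (n - 1)


def strassen_ops(n, cutoff=1):
    """Scalar (multiplications, additions) for classic Strassen on n x n.

    Assumes n is a power of two. Blocks of size <= cutoff are counted
    as schoolbook products.
    """
    if n <= cutoff:
        return standard_ops(n)
    half = n // 2
    mults, adds = strassen_ops(half, cutoff)
    # One level: 7 recursive block products plus 18 block additions,
    # each block addition touching half * half elements.
    return 7 * mults, 7 * adds + 18 * half * half


for n in (2, 4, 8):
    print(n, "schoolbook:", standard_ops(n), "strassen:", strassen_ops(n))
```

Under these assumptions `strassen_ops(2)` gives 7 multiplications and 18 additions against 8 and 4 for the schoolbook product, so the one saved multiplication is paid for with 14 extra additions.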

21

Ulfgardleo t1_ir72pix wrote

Why is this a nature paper?

  1. Strassen is already known not to be the fastest known algorithm in terms of floating-point multiplications: https://en.wikipedia.org/wiki/Computational_complexity_of_matrix_multiplication

  2. Strassen itself is already barely used because its implementation is inefficient except for the largest matrices. Indeed, Strassen is often implemented with a standard matmul for the smallest blocks and only used for very large matrices.

  3. Measuring an implementation's complexity in floating-point multiplications is kinda meaningless if you pay for it with a multiple of floating-point additions (see 2.).
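
Point 2 can be sketched in plain Python: a Strassen recursion that hands small blocks to a standard matmul. This is a toy illustration assuming square matrices whose size is a power of two; the helper names and the `cutoff` default are hypothetical, and a real implementation would pick the cutoff by benchmarking.

```python
def matmul(A, B):
    """Schoolbook product of two square list-of-lists matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add(A, B):
    """Elementwise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Elementwise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(A):
    """Split a 2h x 2h matrix into its four h x h quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def strassen(A, B, cutoff=2):
    """Strassen recursion; blocks of size <= cutoff go to matmul."""
    n = len(A)
    if n <= cutoff:
        return matmul(A, B)
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # 7 block products, fed by 10 block additions/subtractions.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    # 8 more block additions/subtractions to assemble the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

The quadrant slicing and the temporaries for the block sums already hint at the extra memory traffic compared with the single triple loop in `matmul`.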

54