boadie

boadie t1_j8hhtwi wrote

In the opposite direction from your question is a very interesting project, TinyNN all implemented as close to the metal as possible and very fast: https://github.com/NVlabs/tiny-cuda-nn

Also in the vague neighbourhood of your question is the Triton compiler, while on the surface being a Python jit compiler is language coverage is much smaller than Python and you can view it as a small dsl, all the interesting bits are way below that level: https://openai.com/blog/triton/

1