Viewing a single comment thread. View all comments

Logon1028 OP t1_j0ifail wrote

Not really. The strided result is a 5 dimensional array. [depth][output_x][output_y][stride_x][stride_y]. I am basically applying an argmax (to the last two axes) then an unravel to get the 2d array of indices for every element in [depth][x][y]. You can see the unravel method (for 2D) on this page...https://numpy.org/doc/stable/reference/generated/numpy.argmax.html

My problem is I basically have a 3d array of strides. And each stride itself is also a 2d array of values. I need to apply the unravel and argmax to every stride (hence the triple nested for loop to iterate over the first 3 dimensions). I don't see how reshaping would allow me to apply a function to the first 3 dimensions more efficiently.

I have been reading to see if I can somehow vectorize the unravel and argmax then apply it to the first 3 dimensions.

1

elbiot t1_j0ivmop wrote

Yeah, I was just thinking in 1D. Im not at a computer so I can't try anything but roughly what I'm thinking is you have a (H, W, D) array and use stride tricks to get a (H, W, D, wx, wy). If you could get that to be (H, W, D, wx*wy) then argmax could give you a (H, W, D) array of indices. I dunno if you can reshape a strided array or use strides to get the shape in question

1

Logon1028 OP t1_j0jmq8c wrote

Well I actually did end up making a performance improvement after thinking over your suggestion a little bit. Essentially inside my triple for loop I was doing...

np.unravel_index(np.argmax(strided_result[depth][x][y], axis=None), strided_result[depth][x][y].shape)

But if I take part of your suggestion and squash the last two dimensions of the strided array I can perform argmax on axis 3 all at once OUTSIDE the for loop (which numpy probably parallelizes). This resulted in a roughly 30% improvement in performance.

layers = [
    Convolutional((1, 28, 28), 5, 2),
    Relu(),
    MaxPooling2D((2, 24, 24), 2, stride=(2,2), padding=(0,0)),
    Convolutional((2, 12, 12), 5, 2),
    Relu(),
    Flatten((2, 8, 8)),
    Dense(2 * 8 * 8, 10),
    Softmax()
]

The above model in my library takes about 10 minutes to train on the entire mnist dataset (for 5 epochs with batch size 1). Which in my opinion is acceptable as this is just an educational library.

@elbiot However, I still need the triple nested for loop. Unfortunately np.unravel_index returns a python tuple for some strange reason instead of a numpy array. Which makes it extremely awkward to work with. Not sure if you have any suggestions on a np.unravel_index function that returns a numpy result for better parallelization?

1

elbiot t1_j0k3evv wrote

I'm away from a computer for a while but you could cast the tuple to an array I assume. And since creating an array is expensive and you'll keep needing an array of the same shape every step, you could just hold onto it and assign values into it instead of re-creating it every time

1

Logon1028 OP t1_j0k4avf wrote

I had thought about that already, but I decided not to cast the tuple just because I can't see it being faster that what I already have. The problem is how restrictive numpy.unravel_index is. It only operates on 1D arrays and you can't pick an axis. So I have to use for loops to account for that. And I am already saving the unravelled coordinates into two numpy arrays. One for the x axis and one for the y axis of the coordinates (arrays created on layer initialization only for efficiency). I see no way to improve beyond what I currently have without having an alternative to the numpy.unravel_index function. Do you know any better alternatives to unravel_index? Or am I just screwed unless I write my own function to unravel the indices.

I don't really have anyone in my life that knows deep learning implementation. Which is why I am asking random people on reddit lol. It is a sad world for me sometimes.

1

elbiot t1_j0k91nm wrote

I think unravel is a tuple so you can just star unpack it to use it as indices without having to do anything else with it

1

Logon1028 OP t1_j0lxkae wrote

That's what I am doing currently. But I have to unpack it in a triple nested for loop because numpy doesn't accept tuples. So I don't gain the benefits of numpy's parallelization. Which is why I was searching for a possible alternative. I am not trying to like super optimize this function, but I want all the low hanging fruit I can get. I want people to be able to use the library to train small models for learning purposes.

1

elbiot t1_j0m3a67 wrote

Huh?

idx = unravel_indices(indices, shape) Values=arr[*idx]

No loop required. If you're referring to the same loop you were using to get the argmax, you can just adjust your indices first so they apply to the unstrided array

1

Logon1028 OP t1_j0mbuxc wrote

Yes, but that unravel_indices has to be applied to EVERY SINGLE ELEMENT of the last axis independently. i.e.

        for depth in range(strided_result.shape[0]):
            for x in range(strided_result.shape[1]):
                for y in range(strided_result.shape[2]):
                    local_stride_index = np.unravel_index(argmax_arr[depth][x][y], strided_result[depth][x][y].shape)

unravel_indices only takes a 1d array as input. In order to apply it to only the last axis of the 4D array you have to use a 4 loop. unravel_indices has no axis parameter.

1

Logon1028 OP t1_j0mcyle wrote

What I ended up doing is using np.indices (multiplied by the stride) to apply a mask to the x and y argmax arrays using an elementwise multiplication. Then I used elementwise division and modulus to calculate the input indexes myself. The only for loop I have in the forward pass now is a simple one for the depth of the input. The backward pass still uses a triple for loop, but I can live with that.

The model I showed in the previous comment now trains in just under 4 minutes. So now I have a roughly 3x performance increase from my original implementation. And I think that is where I am going to leave it.

Thank you for your help. Even though I didn't use all your suggestions directly, it definitely guided me in the right direction. My current implementation is FAR more efficient than any examples I could find online unfortunately.

1