Submitted by t3_ysah21 in MachineLearning

Hey,

In timm's implementation of stochastic depth, the tensor is scaled by the probability of keeping the block. I don't understand why this is done, especially since it is not mentioned in the paper.

Can anyone explain this to me, please?

Thanks!

The code:

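    def drop_path(x, drop_prob: float = 0., training: bool = False, scale_by_keep: bool = True):
        # Nothing to drop at inference time or when drop_prob is zero
        if drop_prob == 0. or not training:
            return x
        keep_prob = 1 - drop_prob
        # One mask value per sample, broadcast over all remaining dimensions
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = x.new_empty(shape).bernoulli_(keep_prob)
        if keep_prob > 0.0 and scale_by_keep:
            # This is the scaling I am asking about
            random_tensor.div_(keep_prob)
        return x * random_tensor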
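
To make the question concrete, here is a minimal sketch of what the scaling changes numerically. It is my own check, not timm's code: `drop_path_sketch` is a hypothetical stand-in that always behaves as if in training mode, and it assumes PyTorch is installed. It compares the mean of the output with and without `scale_by_keep` on an all-ones input.

```python
import torch


def drop_path_sketch(x, drop_prob=0.0, scale_by_keep=True):
    # Hypothetical stand-in for the function above, always in "training" mode.
    keep_prob = 1 - drop_prob
    # One Bernoulli draw per sample, broadcast over the other dimensions.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = x.new_empty(shape).bernoulli_(keep_prob)
    if keep_prob > 0.0 and scale_by_keep:
        # Kept samples are multiplied by 1 / keep_prob.
        mask.div_(keep_prob)
    return x * mask


torch.manual_seed(0)
# Many samples so that the empirical mean is stable.
x = torch.ones(100_000, 4)

mean_scaled = drop_path_sketch(x, drop_prob=0.2, scale_by_keep=True).mean().item()
mean_unscaled = drop_path_sketch(x, drop_prob=0.2, scale_by_keep=False).mean().item()

print(f"with scaling:    {mean_scaled:.3f}")
print(f"without scaling: {mean_unscaled:.3f}")
```

With `drop_prob=0.2`, the scaled mean should come out near 1.0 and the unscaled mean near 0.8, so the scaling appears to keep the expected value of the output equal to the input.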


