squidward2022 t1_j9au3bg wrote on February 20, 2023 at 4:01 PM

Reply to comment by mrwafflezzz in [D] Relu + sigmoid output activation by mrwafflezzz

Yup! If you look at the graph of tanh you will see relu(tanh) will smush the left half of the graph to 0. The right half of the graph, on (0,infty), ranges in value from 0 to 1, but you can see saturation towards 1 starts to occur around 2-2.5. Since relu leaves this half unchanged, you’ll be able to approach 1 very effectively with reasonable finite values.
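
If you'd rather check the saturation numerically than eyeball the graph, here is a quick sketch using only the standard library. The sample points are arbitrary picks of mine; relu(tanh(w)) comes out at roughly 0.964 for w = 2 and roughly 0.987 for w = 2.5.

```python
import math


def relu(v: float) -> float:
    """Standard ReLU: negative values go to 0, everything else is unchanged."""
    return max(0.0, v)


# Arbitrary sample points on the positive half, chosen only for illustration.
sample_points = [0.5, 1.0, 2.0, 2.5, 5.0]

# relu leaves the positive half of tanh untouched, so these are just tanh(w).
saturation = {w: relu(math.tanh(w)) for w in sample_points}

for w, value in saturation.items():
    print(f"w = {w:>4}: relu(tanh(w)) = {value:.4f}")
```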

squidward2022 t1_j97veu5 wrote on February 19, 2023 at 10:46 PM

Reply to [D] Relu + sigmoid output activation by mrwafflezzz

Shifting the domain of sigmoid S from (-infty,infty) to (0,infty) is going to be kind of weird. In the first (original) case we would have S(-infty) = 0, S(0) = 1/2, S(infty) = 1, so any finite logit value w your network outputs lies between -infty and infty and S(w) gives something meaningful. Now if you mentally shift S to be defined on (0,infty), you get S(0) = 0 and S(infty) = 1. What value w would be needed to achieve S(w) = 1/2? infty / 2? It seems important that sigmoid is defined on the open interval (-infty,infty), not just because we wish logits to be arbitrary valued, but also because we want S to be "expressive" around the logit values we see in practice, which must be finite.

Here is something you could do that doesn't require a shifted sigmoid: You have a network f(x) = w which maps an input x to a score w. Take tanh(f(x)) and you get something with range (-1,1). Any negative w is mapped to a negative value in the range (-1,0). Now just take the ReLU of this, relu(tanh(f(x))), and all negative values from the tanh, which come from negative w's, go to 0, while all the positive values from the tanh, which come from positive w's, are unaffected.

In this way we have negative w --> (-1,0) --> 0 and positive w --> (0,1) --> (0,1).
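
A minimal sketch of that mapping in plain Python, in case it helps to see it concretely. The names `relu_tanh` and `scores` are just mine for illustration, and the scores are made-up stand-ins for whatever your f(x) outputs; in a real framework you would compose the library's own tanh and relu ops instead.

```python
import math


def relu(v: float) -> float:
    """Standard ReLU: negative values go to 0, everything else is unchanged."""
    return max(0.0, v)


def relu_tanh(w: float) -> float:
    """Map a raw score w to [0, 1): negative w -> 0, positive w -> tanh(w)."""
    return relu(math.tanh(w))


# Hypothetical raw scores standing in for f(x); not from any real network.
scores = [-3.0, -0.5, 0.0, 0.5, 3.0]
outputs = [relu_tanh(w) for w in scores]

for w, out in zip(scores, outputs):
    print(f"w = {w:>5}: relu(tanh(w)) = {out:.4f}")
```

One thing to keep in mind with this setup: for negative w the output is constant at 0, so no gradient flows back through the activation for those inputs.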

squidward2022 t1_j5vmb95 wrote on January 25, 2023 at 9:41 PM

Reply to [D] Self-Supervised Contrastive Approaches that don’t use large batch size. by shingekichan1996

(https://arxiv.org/pdf/2106.04156.pdf) This was a cool paper from NeurIPS 2021 which aimed to theoretically explain the success of CL by relating it to spectral clustering. They present a loss with a very similar form to InfoNCE, which they use for their theory. One of the plus sides they found was that it worked well with small batch sizes.
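
For anyone curious what "a very similar form to InfoNCE" looks like: as I recall it, the spectral contrastive loss in that paper is -2 E[f(x)^T f(x+)] + E[(f(x)^T f(x'))^2], i.e. pull positive pairs together through their dot product and penalize the squared dot product of independent pairs. Below is a rough pure-Python sketch of a batch estimate. The function names, the two-view batch layout, and using the other examples in the batch as the independent pairs are my assumptions, not the paper's reference code.

```python
def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def spectral_contrastive_loss(z1, z2):
    """Batch estimate of -2 E[f(x)^T f(x+)] + E[(f(x)^T f(x'))^2].

    z1[i] and z2[i] are assumed to be embeddings of two augmented views of
    example i; cross-view pairs (i, j) with i != j stand in for independent
    pairs (an assumption of this sketch).
    """
    n = len(z1)
    if n < 2 or len(z2) != n:
        raise ValueError("need two batches of equal size with at least 2 examples")

    # Attraction term: mean dot product over the n positive pairs.
    positive_term = sum(dot(z1[i], z2[i]) for i in range(n)) / n

    # Repulsion term: mean squared dot product over the n * (n - 1) other pairs.
    negative_term = sum(
        dot(z1[i], z2[j]) ** 2 for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))

    return -2.0 * positive_term + negative_term
```

Unlike InfoNCE there is no log-sum-exp over the batch here; the repulsion term is a plain average over pairs.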

(https://arxiv.org/abs/2110.06848) I skimmed this work a while back; one of their main claims is that this approach works with small batch sizes.