vwings

vwings t1_j5gowwn wrote

For such retrieval systems, you would usually use top-1, top-5, or top-k accuracy. Concretely, you have a list of product types (embeddings) in your database (say 100 or 100,000, whatever). Then you take your product description, embed it with your ANN, and compare it with all the product-type embeddings. Then you check at which rank the correct product type ends up. From that you can calculate mean rank or top-k accuracy.
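A minimal sketch of what that evaluation could look like, assuming the embeddings are plain numpy arrays and cosine similarity is the retrieval score (names and shapes are just illustrative):

```python
import numpy as np

def top_k_accuracy(query_embs, type_embs, true_idx, k=5):
    """Fraction of queries whose correct product type appears among the
    k most similar product-type embeddings (cosine similarity)."""
    # Normalize so the dot product equals cosine similarity
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = type_embs / np.linalg.norm(type_embs, axis=1, keepdims=True)
    sims = q @ t.T                        # (n_queries, n_types)
    ranked = np.argsort(-sims, axis=1)    # best match first
    hits = [true_idx[i] in ranked[i, :k] for i in range(len(q))]
    return float(np.mean(hits))

# e.g. 1,000 embedded product descriptions against 100 product types:
# acc5 = top_k_accuracy(desc_embs, type_embs, labels, k=5)
```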

8

vwings t1_j1nmx5z wrote

I assume they mean that each paper gets a list of reviewers it cannot be assigned to because of conflicts of interest. But the original question could also be about a general blacklist of reviewers who have shown misconduct before (e.g. not responding to the author rebuttal). I don't think a general blacklist exists... But it might be that the conference organizers indeed have one.

14

vwings t1_j1nmbol wrote

On OpenReview at least, you can enter persons you are somehow connected with (this is a kind of blacklist). Furthermore, OpenReview automatically prevents your coauthors, or people with the same email domain (e.g. *@google.com), from being assigned as reviewers. OpenReview is actually pretty smart, but I don't know whether authors actively try to trick it...

7

vwings t1_j0osaa7 wrote

Now you are completely deviating from the original scope of the discussion. We discussed which is more general, but, since you changed the scope, you agree with me on that point.

About "guarantees": also for CP, it is easy to construct examples where it fails. If the distribution of the new data is different from the calibration set, it's not exchangeable anymore, and the guarantee is gone.

1

vwings t1_j0ljgfr wrote

That's what the CP guys say. :)

I would even say that Platt scaling generalizes CP. Whereas CP focuses on the empirical distribution of the prediction scores only around a particular location in the tails, e.g. at a confidence level of 5%, Platt scaling tries to mold the whole empirical distribution into calibrated probabilities; it thus considers the entire range of the score distribution.
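To make that concrete, here is a minimal sketch of Platt scaling, assuming you have raw scores and binary labels on a held-out calibration set (variable names are made up); it simply fits a 1-D logistic regression from scores to probabilities:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(cal_scores, cal_labels):
    # Platt scaling: fit a sigmoid (1-D logistic regression) that maps
    # raw classifier scores on the calibration set to probabilities
    lr = LogisticRegression()
    lr.fit(np.asarray(cal_scores).reshape(-1, 1), cal_labels)
    return lr

def platt_probs(lr, test_scores):
    # Calibrated probability of the positive class for new scores
    return lr.predict_proba(np.asarray(test_scores).reshape(-1, 1))[:, 1]
```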

4

vwings t1_j0l5778 wrote

A hard problem indeed. The methods in your list assume different settings. Deep Ensembles and MC Dropout don't require a calibration set. Prior Networks (I love this paper) assume that OOD samples are available during training. Conformal prediction assumes the availability of a calibration set that follows the distribution of future data... For the other methods, I would have to check ...

2

vwings t1_j0l4gq9 wrote

Great question and comment! I think the first thing to state here is that CNNs are usually overconfident.

One thing that the original post is looking for is calibration of the classifier on a calibration set: on that set, the softmax values can be re-adjusted so that they are close to actual probabilities. This is essentially what conformal prediction and Platt scaling do.
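For illustration, a rough sketch of split conformal prediction for a classifier, assuming you already have softmax outputs for a calibration set (the 1-minus-true-class-softmax score is just one common choice of nonconformity score):

```python
import numpy as np

def conformal_threshold(cal_softmax, cal_labels, alpha=0.05):
    """Score threshold computed on the calibration set (split conformal)."""
    n = len(cal_labels)
    # Nonconformity score: 1 - softmax probability of the true class
    scores = 1.0 - cal_softmax[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level (capped at 1 for small n)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_softmax, threshold):
    """Classes whose nonconformity stays below the threshold form the set."""
    return [np.where(1.0 - row <= threshold)[0] for row in test_softmax]
```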

I strongly recommend this year's talk on Conformal Prediction which provides insights into these problems. Will try to find the link...

1

vwings t1_iw768hl wrote

I think it's valuable, but not huge. There have been several recent works that use this concept of describing a sample by similar samples to enrich its representation:

  • the cross-attention mechanism in Transformers does this to some extent
  • AlphaFold: a protein is enriched with similar (by multiple sequence alignment) proteins
  • CLOOB: a sample is enriched with similar samples from the current batch
  • MHNfs: a sample is enriched with similar samples from a large context.

This paper uses this concept, but does it differently: it uses the vector of cosine similarities, which in other works is softmaxed and then used as weights for averaging, directly as the representation. That this works, and that you can backprop through it, is remarkable, but not huge... Just my two cents... [Edits: typos, grammar]
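A rough sketch of that contrast, with made-up names (x is a query embedding, context is the set of similar/retrieved samples): the usual route softmaxes the cosine similarities and uses them as averaging weights, whereas here the similarity vector itself becomes the representation:

```python
import torch
import torch.nn.functional as F

def enrich(x, context):
    """x: (d,) query embedding; context: (n, d) similar/support embeddings."""
    sims = F.cosine_similarity(x.unsqueeze(0), context, dim=1)   # (n,)

    # Common variant (attention-style): softmax the similarities and use
    # them as weights to average the context embeddings.
    weights = F.softmax(sims, dim=0)
    weighted_avg = weights @ context                             # (d,)

    # Variant described above: the vector of cosine similarities itself
    # is used directly as the representation (and is backpropped through).
    sim_repr = sims                                              # (n,)
    return weighted_avg, sim_repr
```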

3

vwings t1_iud9fb8 wrote

The best way is probably to use a feature encoding and plug this into a Transformer. First sample, with 200 features of group A and 5 features of group B: encode it as the set {[A feats, encoding for A] W_A, [B feats (possibly repeated), encoding for B] W_B}. Second sample, with B and C features: {[C feats, encoding for C] W_C, [B feats (possibly repeated), encoding for B] W_B}. The linear mappings W_A, W_B, and W_C must map to the same dimension. The order of the feature groups does not play a role (permutation invariance of the Transformer). Note that this also learns a feature or feature-group embedding.
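A rough PyTorch sketch of that idea, with illustrative names and sizes: each feature group gets its own linear map (W_A, W_B, ...) plus a learned group embedding, the resulting tokens form a set, and a Transformer encoder without positional encodings keeps the whole thing permutation invariant:

```python
import torch
import torch.nn as nn

class FeatureGroupEncoder(nn.Module):
    def __init__(self, group_dims, d_model=128):
        super().__init__()
        # One linear map per feature group (A, B, C, ...), all to d_model
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in group_dims])
        # Learned embedding telling the model which group a token came from
        self.group_emb = nn.Embedding(len(group_dims), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, groups):
        # groups: list of (group_id, feature_vector) pairs; which groups
        # are present may differ from sample to sample
        tokens = [self.proj[g](x) + self.group_emb(torch.tensor(g))
                  for g, x in groups]
        tokens = torch.stack(tokens).unsqueeze(0)   # (1, n_present, d_model)
        # No positional encoding, so feature-group order does not matter
        return self.encoder(tokens).mean(dim=1)     # (1, d_model)

# enc = FeatureGroupEncoder(group_dims=[200, 5, 7])     # dims of A, B, C
# sample_1 = enc([(0, a_feats), (1, b_feats)])          # groups A and B
# sample_2 = enc([(2, c_feats), (1, b_feats)])          # groups C and B
```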

−2

vwings t1_is99t4r wrote

Check contrastive pre-training of graph neural networks. The clustering methods should work well on those pretrained representations. Furthermore: do the graphs come from a particular domain? You might then find some useful pre-training objectives.

1