Submitted by mostlyhydrogen t3_10rvkru in MachineLearning

Are there tools or techniques that permit you to run a joint query using more than one query vector?

Use case: iterative ANN search refinement, where I start with a seed vector, select matches, and re-query with more examples to improve the search results.

I tried doing this with FAISS, but it performs a "batch query" that returns a separate set of results for each query vector (not a joint query).

5

Comments

RingoCatKeeper t1_j70tokb wrote

Maybe you can take a look at ScaNN.

2

mostlyhydrogen OP t1_j724ctr wrote

That was an interesting read, but I don't think it solves my problem. Their examples don't show joint vector searches: https://github.com/google-research/google-research/blob/master/scann/docs/example.ipynb

1

RingoCatKeeper t1_j79vmlw wrote

See the section "ScaNN interface features". You will find that you can search with a batch of queries. Might this be similar to your problem?

1

mostlyhydrogen OP t1_j7fxwyx wrote

>ScaNN interface features

Nope. Notice that the results have shape (10000, 20) instead of (20,). That is just doing a batched query i.e. "for each of these 10k input vectors, find me 20 neighbors". What I need is a joint query, i.e. "given these 10k positive examples, give me an additional 20 candidate samples".
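To make the distinction concrete, here is a minimal brute-force sketch in plain Python. It makes no FAISS or ScaNN calls; `batch_search` and `joint_search` are hypothetical helper names, and summing similarities over the positives is only one assumed way to define a joint score.

```python
import math

def normalize(v):
    """Scale v to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def batch_search(index, queries, k):
    """Batched query: one independent top-k list per query vector."""
    results = []
    for q in queries:
        ranked = sorted(range(len(index)), key=lambda i: dot(index[i], q), reverse=True)
        results.append(ranked[:k])
    return results  # shape (len(queries), k)

def joint_search(index, positives, k):
    """Joint query: a single top-k list, scored by summed similarity to all positives."""
    scores = [sum(dot(item, p) for p in positives) for item in index]
    ranked = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]  # shape (k,)

# Toy 2-D index of unit vectors (made-up data for illustration).
raw = [[1.0, 0.0], [0.9, 0.1], [0.7, 0.7], [0.1, 0.9], [0.0, 1.0], [-1.0, 0.0]]
index = [normalize(v) for v in raw]
positives = [index[0], index[4]]

batch = batch_search(index, positives, k=2)  # two separate result lists
joint = joint_search(index, positives, k=2)  # one shared result list
```

Because summing inner products over the positives equals the inner product with their sum, this particular joint score ranks items exactly as a single query with the mean of the positives would, so it could be sent to an ordinary ANN index as one vector.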

2

nobody202342 t1_j70bcfk wrote

Would taking a mean of the vectors work?

1

mostlyhydrogen OP t1_j70koyk wrote

No, because the embeddings are on a unit hypersphere. But taking the average vector on the surface of the hypersphere might work.

2

nobody202342 t1_j70loma wrote

Yup, averaging in the metric space of your embeddings should work, as far as I can tell.
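A minimal sketch of that idea, assuming the embeddings are unit-length and that "average on the surface of the hypersphere" is approximated by taking the arithmetic mean and projecting it back to unit length. The name `spherical_mean` is hypothetical, and the code uses only the standard library.

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 length; fail loudly if the positives cancel out."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def spherical_mean(vectors):
    """Average unit vectors, then project the result back onto the unit sphere."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return l2_normalize(mean)

a = [1.0, 0.0]
b = [0.0, 1.0]
plain_mean = [(x + y) / 2 for x, y in zip(a, b)]  # lies inside the sphere
query = spherical_mean([a, b])                    # lies on the sphere
```

For inner-product or cosine ranking, rescaling the query does not change the order of the results, so the normalization mostly matters when the index or downstream code assumes unit-length inputs.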

1

linverlan t1_j70oz53 wrote

You want to query with multiple vectors but don’t want to query with the vectors separately and don’t want to query with the mean of the vectors? You are going to need to give more details about what you want to do then.

1

mostlyhydrogen OP t1_j7238p8 wrote

As you probably know, ANN search often returns irrelevant data. How might I iteratively refine the search with human feedback, marking samples as "relevant" or "irrelevant" and repeating the search?

I've done a lit search and haven't found anything, maybe because I am using the wrong keywords.
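One classical technique from information retrieval that matches this description is Rocchio relevance feedback. Nobody in the thread proposes it, so treat the following as a sketch under assumptions: the function names are hypothetical, the weights are the commonly cited textbook defaults, and the re-normalization step assumes unit-length embeddings.

```python
import math

def centroid(vectors, dim):
    """Mean of a list of vectors; the zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the relevant centroid and away from the irrelevant one."""
    dim = len(query)
    pos = centroid(relevant, dim)
    neg = centroid(irrelevant, dim)
    updated = [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]
    norm = math.sqrt(sum(x * x for x in updated))
    # Re-normalize so the new query stays on the unit sphere (assumption).
    return [x / norm for x in updated] if norm > 0 else list(query)

# One feedback round on made-up 2-D data.
seed = [1.0, 0.0]
relevant = [[0.8, 0.6]]     # results the annotator marked as relevant
irrelevant = [[0.8, -0.6]]  # results the annotator marked as irrelevant
new_query = rocchio_update(seed, relevant, irrelevant)
```

The updated vector is a single query, so each round can be run against an ordinary ANN index: search with `new_query`, collect new labels, and call `rocchio_update` again.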

1

BiryaniSenpai t1_j70x9mq wrote

Maybe have your vectors attend to each other and learnably output your final query vector?

1

mostlyhydrogen OP t1_j723us3 wrote

What does it mean for a vector to attend to another vector?

1

BiryaniSenpai t1_j7249ok wrote

I mean pass your queries through a self-attention layer and then some fully connected layers, and have it output your final query vector.
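As a rough illustration of what "attending" means here, the sketch below runs scaled dot-product self-attention over a set of example vectors and pools the result into one query. It is untrained and simplified: a real layer would apply learned query, key, and value projections followed by trained fully connected layers, whereas this version uses identity projections and mean pooling, both of which are assumptions made for brevity. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each output row is a softmax-weighted mix of all input vectors."""
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in vectors:
        # How strongly this vector "attends" to every vector in the set.
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in vectors])
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)])
    return outputs

def pooled_query(vectors):
    """Mean-pool the attended vectors and rescale to unit length."""
    attended = self_attention(vectors)
    dim = len(vectors[0])
    mean = [sum(row[i] for row in attended) / len(attended) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

examples = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # made-up positive examples
query = pooled_query(examples)
```

Without training, this behaves much like a weighted average of the examples; the benefit of the learned version would come from fitting the projections to labeled relevance data.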

1

YOLOBOT666 t1_j72cncj wrote

Iterative as in continuing until there are no more neighbours left, as you continuously add neighbours to your index and query?

1

mostlyhydrogen OP t1_j73k4xe wrote

Not exactly. I have millions of points, most of which are not related to my query vectors. I want to iteratively refine my search: search, mark results as "relevant" or "irrelevant", repeat search with updated query.

1

YOLOBOT666 t1_j75i86x wrote

Out of curiosity, what are you trying to achieve? As in, when is the iterative process going to stop, and what would the heuristics be? I would appreciate it if you could share some papers on this!

1

mostlyhydrogen OP t1_j7fydvb wrote

The goal is to harvest training data for ML. If there is a difficult edge case the model is struggling with, the best way to improve model performance is to harvest additional training data for that edge case. You stop when the model performance meets your requirements.

1

YOLOBOT666 t1_j7iov1k wrote

Nice! I guess the heuristic part is how you use the queries at every iteration and make them “usable” in your iterative approach. What are the size and dimensionality of your dataset? These graph-based ANNs are memory intensive, so I'm wondering what you can do at your dimensions.

If it’s a public repo/planning to release it on GitHub, I’d be happy to join!

1

mostlyhydrogen OP t1_j7km5j2 wrote

Thanks for the offer! This is a work project, though. I'm working with images. I can't give too many details due to confidentiality, but we're at sub-billion-image scale.

Usability is determined by trained annotators. If they find an object of interest and want to harvest more training data, they do a reverse image search across the whole training data and tag true matches.

1