mostlyhydrogen
mostlyhydrogen OP t1_j7fydvb wrote
Reply to comment by YOLOBOT666 in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
The goal is to harvest training data for ML. If there is a difficult edge case the model is struggling with, the best way to improve model performance is to harvest additional training data for that edge case. You stop when the model performance meets your requirements.
mostlyhydrogen OP t1_j7fxwyx wrote
Reply to comment by RingoCatKeeper in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
>ScaNN interface features
Nope. Notice that the results have shape (10000, 20) instead of (20,). That is just a batched query, i.e. "for each of these 10k input vectors, find me 20 neighbors". What I need is a joint query, i.e. "given these 10k positive examples, give me an additional 20 candidate samples".
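To make the shape difference concrete, here is a minimal brute-force NumPy sketch, not ScaNN's API. The function names `batched_query` and `joint_query` are hypothetical, and scoring each candidate by its mean cosine similarity to the positive set is only one assumed way to define a joint query.

```python
import numpy as np

def batched_query(db, queries, k):
    """One independent top-k search per query; returns indices of shape (n_queries, k)."""
    scores = queries @ db.T                      # (n_queries, n_db) cosine similarities
    return np.argsort(-scores, axis=1)[:, :k]

def joint_query(db, positives, k):
    """A single top-k search against the whole positive set; returns indices of shape (k,)."""
    scores = (positives @ db.T).mean(axis=0)     # (n_db,) mean similarity to the positives
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 16))
db /= np.linalg.norm(db, axis=1, keepdims=True)  # unit-normalise every row
positives = db[:50]                              # pretend these were tagged as positive

batched = batched_query(db, positives, k=20)     # shape (50, 20)
joint = joint_query(db, positives, k=20)         # shape (20,)
```

The joint result will usually include the positives themselves, so in practice they would need to be filtered out before showing candidates to an annotator.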
mostlyhydrogen OP t1_j73k4xe wrote
Reply to comment by YOLOBOT666 in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
Not exactly. I have millions of points, most of which are not related to my query vectors. I want to iteratively refine my search: search, mark results as "relevant" or "irrelevant", repeat search with updated query.
mostlyhydrogen OP t1_j727t8z wrote
Reply to comment by Kacper-Lukawski in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
Thanks for the link!
mostlyhydrogen OP t1_j724ctr wrote
Reply to comment by RingoCatKeeper in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
That was an interesting read, but I don't think it solves my problem. Their examples don't show joint vector searches: https://github.com/google-research/google-research/blob/master/scann/docs/example.ipynb
mostlyhydrogen OP t1_j723ya3 wrote
Reply to comment by nobody202342 in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
What about marking samples as "irrelevant"?
mostlyhydrogen OP t1_j723us3 wrote
Reply to comment by BiryaniSenpai in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
What does it mean for a vector to attend to another vector?
mostlyhydrogen OP t1_j7238p8 wrote
Reply to comment by linverlan in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
As you probably know, ANN search often returns irrelevant data. How might I iteratively refine the search with human feedback, i.e. mark samples as "relevant" or "irrelevant" and then repeat the search?
I've done a lit search and haven't found anything, maybe because I am using the wrong keywords.
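One classical technique in this space is Rocchio relevance feedback from information retrieval. Whether it suits this use case is an assumption; nothing in the thread confirms it. The sketch below moves the query toward the mean of the "relevant" vectors and away from the mean of the "irrelevant" ones, then renormalises. The function name and the default weights (`alpha`, `beta`, `gamma`) are illustrative choices.

```python
import numpy as np

def rocchio_update(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated unit-length query from relevance feedback.

    query:      (d,) current query vector
    relevant:   (n_pos, d) vectors the annotator marked as relevant
    irrelevant: (n_neg, d) vectors the annotator marked as irrelevant
    """
    new_query = alpha * query
    if len(relevant):
        new_query = new_query + beta * relevant.mean(axis=0)
    if len(irrelevant):
        new_query = new_query - gamma * irrelevant.mean(axis=0)
    # Renormalise so the result is comparable with unit-length embeddings.
    return new_query / np.linalg.norm(new_query)

query = np.array([1.0, 0.0])
relevant = np.array([[0.0, 1.0]])
irrelevant = np.array([[0.0, -1.0]])
updated = rocchio_update(query, relevant, irrelevant)
```

Each round would then run an ordinary nearest-neighbor search with `updated`, collect new labels, and call `rocchio_update` again.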
mostlyhydrogen OP t1_j70koyk wrote
Reply to comment by nobody202342 in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
No, because the embeddings are on a unit hypersphere. But taking the average vector on the surface of the hypersphere might work.
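One way to read "average on the surface of the hypersphere" is to take the ordinary mean and project it back to unit length. That projected vector is the unit vector with the highest total cosine similarity to the inputs. A minimal sketch, with `spherical_mean` as a hypothetical helper name:

```python
import numpy as np

def spherical_mean(vectors):
    """Average unit vectors, then project the result back onto the unit hypersphere."""
    mean = vectors.mean(axis=0)          # lies inside the sphere, norm <= 1
    norm = np.linalg.norm(mean)
    if norm == 0:
        raise ValueError("vectors cancel out; the mean direction is undefined")
    return mean / norm

vectors = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
centroid = spherical_mean(vectors)
```

If the positives form several distinct clusters, a single mean direction can land between them, so this is only a reasonable query when the positives are fairly tightly grouped.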
Submitted by mostlyhydrogen t3_10rvkru in MachineLearning
mostlyhydrogen OP t1_j7km5j2 wrote
Reply to comment by YOLOBOT666 in [D] Querying with multiple vectors during embedding nearest neighbor search? by mostlyhydrogen
Thanks for the offer! This is a work project, though. I'm working with images. I can't give too many details due to confidentiality, but we're at sub-billion-image scale.
Usability is determined by trained annotators. If they find an object of interest and want to harvest more training data, they do a reverse image search across the whole training data and tag true matches.