Submitted by hasiemasie t3_10qv7r7 in MachineLearning
sad_potato00 t1_j6w92uy wrote
So we had a similar problem, where building names were written in different ways (some abbreviated, some as the full name, some as the full name plus the type of building). Something that worked for me was using Sentence-BERT and computing the cosine similarity between the embeddings. Deciding on a cutoff value was easier than deciding how many clusters to use. Sadly, manual labeling and checking is still needed.
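A rough sketch of the cutoff idea, not the exact code I used. To keep it runnable without downloading a model, `embed` here is a stand-in that counts character trigrams; in practice you would swap it for Sentence-BERT embeddings (for example `SentenceTransformer(...).encode(...)` from the sentence-transformers package). The function names, the greedy grouping strategy, and the default cutoff of 0.5 are all assumptions for illustration.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Stand-in embedding: counts of character n-grams (NOT Sentence-BERT).

    Replace this with real sentence embeddings for actual use.
    """
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_cutoff(names, cutoff=0.5):
    """Greedy grouping: a name joins the first group whose representative
    (its first member) is at least `cutoff` similar, else starts a new group.
    No number of clusters has to be chosen up front.
    """
    groups = []  # list of (representative vector, member names)
    for name in names:
        vec = embed(name)
        for rep, members in groups:
            if cosine(vec, rep) >= cutoff:
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]
```

You call it like `group_by_cutoff(list_of_names, cutoff=0.5)` and then eyeball the groups. The cutoff is something you tune by looking at borderline pairs, which is where the manual checking comes in.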