I’m looking to do some aggregation on academic research and news articles to see what insights I can get from them. I’m using TextRazor for named entity recognition on the documents, but I’m getting a lot of dirty labels with slightly different wording, for example Tesla, Tesla ltd, and Tesla Ltd. As a result, my aggregations contain a lot of duplicate results.

The dataset consists of about 4M labels, so the solution has to be efficient to be viable. I was thinking of putting the labels through word2vec and then clustering them based on the embedding distances, but then the problem arises of how many clusters to use.

I’ve also tried simple regex preprocessing to get rid of the company abbreviations but there are other examples that cannot be solved that easily.
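
For reference, the kind of regex preprocessing described above might look like the following minimal sketch. The suffix list and the function name `normalize_label` are illustrative assumptions, not the exact rules used in the post, and a pass like this only catches the easy cases (legal-form suffixes, casing, stray punctuation).

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete list of legal-form suffixes.
_SUFFIXES = r"(?:ltd|limited|inc|incorporated|corp|corporation|llc|plc|gmbh|co)"
# One or more trailing suffixes, each preceded by whitespace or a comma.
_SUFFIX_RE = re.compile(rf"(?:[\s,]+{_SUFFIXES}\.?)+$", re.IGNORECASE)


def normalize_label(label: str) -> str:
    """Collapse whitespace, drop trailing legal-form suffixes, and casefold."""
    cleaned = re.sub(r"\s+", " ", label).strip()
    cleaned = _SUFFIX_RE.sub("", cleaned)
    return cleaned.strip(" .,").casefold()


labels = ["Tesla", "Tesla ltd", "Tesla Ltd.", "Tesla, Inc.", "Costco"]
counts = Counter(normalize_label(label) for label in labels)
```

Because each label is processed independently, this runs in a single linear pass and scales to millions of labels, but it cannot merge variants that differ by more than a suffix.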

Comments

wind_dude t1_j6sj0ix wrote on February 1, 2023 at 4:18 PM

I solved a similar issue by building a knowledge graph. It took some manual curation and starting with a good base, but suggestions for misspellings and alternate names were generated by comparing vectors. The suggester runs as a batch over new entities after my ETL batch is done.

hasiemasie OP t1_j6u2u97 wrote on February 1, 2023 at 10:01 PM

Interesting, will try it. Thanks!

Aggravating_Group251 t1_j6tcogf wrote on February 1, 2023 at 7:20 PM

For the clustering approach, would HAC be viable?

hasiemasie OP t1_j6u2pk0 wrote on February 1, 2023 at 10:00 PM

Yes, tried that but with little success :(

Blutorangensaft t1_j6uiygz wrote on February 1, 2023 at 11:50 PM

Disclaimer: no help, more a request

Once you're done with this project, would you mind sharing your speed and accuracy? I'm kind of on the lookout for a good English NER model. Problem is, spaCy has some issues with casing.

sad_potato00 t1_j6w92uy wrote on February 2, 2023 at 9:31 AM

So we had a similar problem, where building names were written in different ways (an abbreviation, the full name, or the full name plus its type). Something that worked for me was using Sentence-BERT and computing cosine similarity. Deciding on a cutoff value was easier than deciding how many clusters to use. Sadly, manual labeling and checking are still needed.
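
A rough sketch of the cutoff idea, under stated assumptions: to keep it dependency-free, character trigram counts stand in for Sentence-BERT embeddings, and the names `char_ngrams`, `group_by_cutoff`, and the cutoff of 0.5 are all hypothetical. Labels whose cosine similarity reaches the cutoff are linked, and linked labels end up in the same group, so no cluster count has to be chosen up front.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Stand-in embedding (assumption): character n-gram counts of the padded, lowercased label."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_by_cutoff(labels, cutoff=0.5, embed=char_ngrams):
    """Link every pair with similarity >= cutoff and return the connected groups."""
    vectors = [embed(label) for label in labels]
    parent = list(range(len(labels)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison is quadratic: fine for a demo, not for millions of labels.
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if cosine(vectors[i], vectors[j]) >= cutoff:
                parent[find(j)] = find(i)

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(find(i), []).append(label)
    return list(groups.values())


groups = group_by_cutoff(["Tesla", "Tesla ltd", "Tesla Ltd.", "Microsoft"])
```

To use real sentence embeddings, the `embed` argument and `cosine` would be swapped for a model's encoder and a dense similarity. At the 4M-label scale mentioned in the post, the all-pairs loop would also need to be replaced by blocking or an approximate nearest-neighbour index, and the resulting groups would still need spot checks by hand.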