suflaj t1_j4l1i5l wrote on January 16, 2023 at 1:49 PM

Likely not enough, at least not for what is considered good. But I fail to see why you'd want to train it yourself; there are plenty of readily available w2v weights and vocabularies.

bhargavkartik t1_j4mqdsf wrote on January 16, 2023 at 8:28 PM

This.

BellyDancerUrgot t1_j4nw2ku wrote on January 17, 2023 at 1:06 AM

The Gensim documentation itself lists them, along with the arguments you need to download and use them.
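
For reference, here is a minimal sketch of loading pretrained vectors through `gensim.downloader`. The helper names `list_pretrained_models` and `load_pretrained` are made up for illustration, and it assumes gensim is installed and that the model name below is still the one published in the gensim-data catalogue. The Google News model is a large download, so try a smaller model first if you only want to see how it works.

```python
# Sketch only: the helper names are hypothetical. The real gensim calls
# used here are gensim.downloader.info() and gensim.downloader.load().

# Assumed catalogue name for the Google News word2vec vectors.
MODEL_NAME = "word2vec-google-news-300"


def list_pretrained_models():
    """Return the names of the pretrained models gensim can download."""
    import gensim.downloader as api  # imported lazily; needs gensim installed

    return sorted(api.info()["models"].keys())


def load_pretrained(name=MODEL_NAME):
    """Download the model on first use (then cached) and return its vectors."""
    import gensim.downloader as api

    return api.load(name)


# Example usage (triggers a large download the first time):
# vectors = load_pretrained()
# print(vectors.most_similar("revenue", topn=5))
```

Checking how words from your own transcripts behave in the pretrained vectors is a cheap first step before deciding whether training from scratch is worth it.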

elf7979 OP t1_j4o933j wrote on January 17, 2023 at 2:38 AM

I will check the Gensim documentation. Thank you.

elf7979 OP t1_j4o90u3 wrote on January 17, 2023 at 2:37 AM

I think transcripts from companies' conference calls have certain characteristics of their own, since business professionals may use particular verbs or expressions. I haven't checked out the w2v datasets you mentioned yet. Is there an existing corpus that's business-oriented?

What if the dataset size increases to 1 gigabyte? Is that big enough?

suflaj t1_j4pdd6o wrote on January 17, 2023 at 9:11 AM

You're closer, but not quite there yet: the smaller Google News dataset that W2V is trained on is 10 GB. The full one is around 300 GB, IIRC.