Hello.

I'd like to ask for opinions.

I collected NASDAQ transcript text and aggregated it into a single txt file of about 100 megabytes.

I'm planning to train a word2vec model on this corpus. Then I'd like the trained model to suggest verb replacements for other people's written input.

I copied some Python code from the web that uses a word2vec model, but the results weren't satisfying.

This is the code. Do you think I should enlarge the dataset? If so, could you give me some color on the baseline dataset size? Or are there any workarounds to implement my plan?

import string

import nltk
from gensim.models.word2vec import Word2Vec

## stopword list (the download is skipped if the data is already present)
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

with open('content.txt', 'rt') as f:
    text = f.read()

## remove punctuation
trans_table = str.maketrans('', '', string.punctuation)

## one token list per sentence: lowercased, punctuation stripped, stopwords dropped
## note: sentences are split only where a period ends a line
clean = [[x.lower() for x in each.translate(trans_table).split() if x.lower() not in stop_words] for each in text.split('.\n')]
#print(clean)

## train word2vec model
model = Word2Vec(sentences=clean, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
print(model.wv.most_similar('innovation', topn=5))
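One rough way to judge whether the corpus is large enough is to count how many distinct words occur often enough to get a usable vector. The sketch below is only an illustration, not part of the original script: `tokenize_sentences` and `vocab_coverage` are hypothetical helper names, the regex sentence splitter is a simplification, and the `min_count` threshold is an assumption you would tune yourself.

```python
import re
from collections import Counter


def tokenize_sentences(text):
    """Split raw text into lowercase token lists, one list per sentence."""
    # Simplified splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    return [tokens for tokens in tokenized if tokens]


def vocab_coverage(sentences, min_count=5):
    """Return (total_tokens, vocab_size, words_kept) for a given min_count."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts.values())
    kept = sum(1 for c in counts.values() if c >= min_count)
    return total, len(counts), kept


sample = "We expect growth. We expect margins to improve! Growth will continue."
sents = tokenize_sentences(sample)
print(vocab_coverage(sents, min_count=2))  # (11, 8, 3)
```

If most of the vocabulary falls below the threshold, the vectors for those words are trained on very few examples, which may explain unsatisfying `most_similar` results with `min_count=1`.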

suflaj t1_j4l1i5l wrote on January 16, 2023 at 1:49 PM

Likely not enough, at least not for what is considered good. But I fail to see why you'd want to train it yourself; there are plenty of readily available w2v weights or vocabularies.

bhargavkartik t1_j4mqdsf wrote on January 16, 2023 at 8:28 PM

This.

BellyDancerUrgot t1_j4nw2ku wrote on January 17, 2023 at 1:06 AM

The Gensim documentation itself highlights them, along with the arguments needed to download and use them.
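For readers who want a starting point, here is a minimal sketch of loading pretrained vectors through Gensim's downloader module. The model name `glove-wiki-gigaword-100` is just one of the published options and is an assumption about what suits this use case. `load_pretrained` and `suggest_replacements` are hypothetical helper names, and the first call downloads the vectors, so it needs network access.

```python
def load_pretrained(name="glove-wiki-gigaword-100"):
    """Download (on first use) and return pretrained KeyedVectors."""
    # Imported lazily so the helper below can be used without gensim installed.
    import gensim.downloader as api
    return api.load(name)


def suggest_replacements(wv, word, topn=5):
    """Return up to topn nearest neighbours of word, or [] if it is out of vocabulary."""
    if word not in wv.key_to_index:
        return []
    return [candidate for candidate, _score in wv.most_similar(word, topn=topn)]


# Usage (triggers a download on first run):
# wv = load_pretrained()
# print(suggest_replacements(wv, "increase"))
```

The neighbours returned this way are not restricted to verbs, so a verb-replacement tool would still need some filtering step on top.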

elf7979 OP t1_j4o933j wrote on January 17, 2023 at 2:38 AM

I will check the Gensim documentation. Thank you.

elf7979 OP t1_j4o90u3 wrote on January 17, 2023 at 2:37 AM

I think transcripts from company conference calls have certain characteristics, since business professionals may use particular verbs or expressions. I haven't checked out the w2v datasets you mentioned yet. Is there an existing corpus that's business-oriented?

What if the dataset size increases to 1 gigabyte? Is that big enough?

suflaj t1_j4pdd6o wrote on January 17, 2023 at 9:11 AM

You're closer, but not quite there yet - the smaller Google News dataset that W2V is trained on is 10 GB. The full one used is around 300 GB, IIRC.

Is a 100 megabyte text corpus big enough to train?
