Submitted by elf7979 t3_10de78o in deeplearning

Hello.

I'd like to ask for opinions.

I collected NASDAQ transcript text and aggregated it into a single 100 MB text file.

My plan is to train a word2vec model on this corpus, and then have the trained model suggest verb replacements for text that other people have written.

I copied some Python code from the web that uses a word2vec model, but the results weren't satisfying.

This is the code. Do you think I should enlarge the dataset? If so, could you give me an idea of what the baseline dataset size is? Or are there any workarounds to implement my plan?

import string

import nltk
from gensim.models.word2vec import Word2Vec

# Download and load the English stopword list
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

with open('content.txt', 'rt') as f:
    text = f.read()

# Translation table that strips punctuation
trans_table = str.maketrans('', '', string.punctuation)

# One token list per sentence: lowercased, punctuation and stopwords removed
clean = [[x.lower() for x in each.translate(trans_table).split() if x.lower() not in stop_words] for each in text.split('.\n')]

# Train the word2vec model
model = Word2Vec(sentences=clean, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
print(model.wv.most_similar('innovation', topn=5))

1

Comments

suflaj t1_j4l1i5l wrote

Likely not enough, at least not for what is considered good. But I fail to see why you'd want to train it yourself; there are plenty of readily available w2v weights and vocabularies.

3

BellyDancerUrgot t1_j4nw2ku wrote

The Gensim documentation itself highlights them, along with the arguments needed to download and use them.
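
For reference, a minimal sketch of what that could look like with Gensim's downloader module. It assumes a Gensim 4.x install and the `word2vec-google-news-300` entry from the gensim-data catalogue (a large download, well over a gigabyte); `load_pretrained` and `suggest_replacements` are hypothetical helper names, not part of Gensim.

```python
def load_pretrained(name="word2vec-google-news-300"):
    """Download (on first use) and return pretrained vectors as KeyedVectors."""
    # Imported lazily so the helper below can be used without Gensim being loaded.
    import gensim.downloader as api
    return api.load(name)


def suggest_replacements(vectors, word, topn=5):
    """Return up to `topn` (word, similarity) pairs, or [] if the word is unknown."""
    # key_to_index is the Gensim 4.x vocabulary mapping on KeyedVectors.
    if word not in vectors.key_to_index:
        return []
    return vectors.most_similar(positive=[word], topn=topn)


# Example use (triggers the download on first run):
# kv = load_pretrained()
# print(suggest_replacements(kv, "innovation"))
```

Note that the nearest neighbours are not restricted to verbs, so a verb-replacement tool would likely need a separate part-of-speech filter on top of this.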

1

elf7979 OP t1_j4o933j wrote

I will check the Gensim documentation. Thank you.

1

elf7979 OP t1_j4o90u3 wrote

I think transcripts of company conference calls have certain characteristics of their own, since business professionals may use particular verbs or expressions. I haven't checked out the w2v datasets you mentioned yet. Is there an existing corpus that's business-oriented?

What if the dataset size increases to 1 GB? Is that big enough?

1

suflaj t1_j4pdd6o wrote

You're closer, but not quite there yet - the smaller Google News dataset that W2V is trained on is 10 GB. The full one is around 300 GB, IIRC.
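
Since file size is only a proxy, it may be more telling to count tokens and see how many distinct words occur often enough to get a usable vector. A rough sketch using only the standard library: `corpus_stats` is a hypothetical helper, the whitespace tokenization is deliberately crude, and the threshold of 5 is simply Gensim's default `min_count`, not a recommendation.

```python
from collections import Counter


def corpus_stats(path, min_count=5):
    """Stream a text file and return (total_tokens, vocab_size, vocab_kept)."""
    counts = Counter()
    with open(path, "rt", encoding="utf-8") as f:
        for line in f:  # line by line, so a large file never sits fully in memory
            counts.update(line.lower().split())
    total_tokens = sum(counts.values())
    # Words seen fewer than min_count times would be dropped by that setting.
    vocab_kept = sum(1 for c in counts.values() if c >= min_count)
    return total_tokens, len(counts), vocab_kept


# Example use on the file from the original post:
# total, vocab, kept = corpus_stats("content.txt")
```

If most of the vocabulary falls below the threshold, the vectors for those words are likely to be noisy regardless of the total size on disk.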

1