Submitted by elf7979 t3_10de78o in deeplearning
Hello.
I'd like to ask for opinions.

I collected NASDAQ transcript text and aggregated it into a single txt file of about 100 megabytes.
My plan is to train a word2vec model on this corpus and then have the trained model suggest verb replacements for other people's written input.
I adapted some Python code from the web that uses a word2vec model, but the result wasn't satisfying.
This is the code. Do you think I should enlarge the dataset? If so, could you give me some color on what a reasonable baseline dataset size would be? Or are there any workarounds to implement my plan?
import nltk
from gensim.models.word2vec import Word2Vec
import string
## stopword list
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
with open('content.txt', 'rt') as f:
    text = f.read()
## remove punctuation
trans_table = str.maketrans('', '', string.punctuation)
## split the text on '.\n', strip punctuation, lowercase, and drop stopwords
clean = [
    [w.lower() for w in sent.translate(trans_table).split() if w.lower() not in stop_words]
    for sent in text.split('.\n')
]
#print(clean)
## train word2vec model
model = Word2Vec(sentences=clean, vector_size=100, window=5, min_count=1, workers=4)
model.save("word2vec.model")
print(model.wv.most_similar('innovation', topn=5))
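For the verb-replacement part of the plan, here is a rough sketch of what I had in mind (my own, untested on this corpus): load the saved model, query most_similar for a verb from the input, and keep only candidates that nltk's POS tagger also labels as verbs. The helper name suggest_verb_replacements and the tagger choice are assumptions, not something I've validated.

import nltk
from gensim.models.word2vec import Word2Vec
## POS tagger used to filter candidates down to verbs
nltk.download('averaged_perceptron_tagger')
model = Word2Vec.load("word2vec.model")

def suggest_verb_replacements(verb, topn=20, keep=5):
    ## hypothetical helper: nearest neighbours of the verb, filtered to verb-tagged words
    if verb not in model.wv:
        return []
    candidates = model.wv.most_similar(verb, topn=topn)
    verbs = [(w, s) for w, s in candidates if nltk.pos_tag([w])[0][1].startswith('VB')]
    return verbs[:keep]

print(suggest_verb_replacements('accelerate'))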
suflaj t1_j4l1i5l wrote
Likely not enough, at least not for what is considered good. But I fail to see why you'd want to train it yourself; there are plenty of readily available w2v weights and vocabularies.
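As a minimal sketch of that route, gensim's downloader can pull pretrained vectors directly; the Google News model below is just one of the available options, and it's a large (~1.6 GB) download on first use.

import gensim.downloader as api
## load pretrained 300-dimensional Google News vectors instead of training from scratch
wv = api.load('word2vec-google-news-300')
print(wv.most_similar('innovation', topn=5))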