Submitted by AutoModerator t3_z07o4c in MachineLearning
jon-chin t1_ix93k8o wrote
Please bear with me since I'm pretty new:
I'm doing topic modeling on a set of tweets using GSDMM. To do that, I need to tokenize and stem them. I can get the clusters, their document sizes, and their stem counts.
However, I'd also like to pull in metadata, namely the timestamps of the tweets. Is there a way to do this easily? Right now, I'm doing a second pass after the modeling is done and guessing which cluster each of the original tweets belongs to. Is there a better way to have GSDMM aggregate this metadata while it does the modeling?
trnka t1_ixew7z9 wrote
It's hacky, but you could transform the timestamps into words. I've used that trick a few times successfully.
Something like TweetTimestampRangeA, TweetTimestampRangeB, ... One downside is that you'd need to commit to a strategy for the time ranges up front (either chop the data into N equal ranges, or else emit tokens for month, year, etc.).
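A minimal sketch of that trick, assuming equal-width buckets over a known date span (the bucket count, dates, and token naming here are made up for illustration):

```python
from datetime import datetime

def timestamp_token(ts, start, end, n_ranges=4):
    """Map a timestamp to one of n_ranges equal-width buckets over
    [start, end] and return a pseudo-word token for that bucket."""
    span = (end - start).total_seconds()
    offset = (ts - start).total_seconds()
    # clamp so ts == end still lands in the last bucket
    bucket = min(int(offset / span * n_ranges), n_ranges - 1)
    return f"TweetTimestampRange{chr(ord('A') + bucket)}"

# Hypothetical corpus spanning one year, split into four ranges
start = datetime(2022, 1, 1)
end = datetime(2023, 1, 1)

# Stemmed tokens for one tweet, plus its timestamp pseudo-word;
# GSDMM then clusters on the time token like any other word.
tokens = ["climat", "protest"]
tokens.append(timestamp_token(datetime(2022, 7, 15), start, end))
```

After this, each cluster's stem counts include the range tokens, so the temporal distribution falls out of the model instead of needing a second pass.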