Submitted by AutoModerator t3_z07o4c in MachineLearning
jon-chin t1_ix93k8o wrote
Please bear with me since I'm pretty new:
I'm doing topic modeling on a set of tweets using GSDMM. To do that, I need to tokenize and stem them. I can get the clusters, their document sizes, and their stem counts.
However, I'd also like to pull in metadata, namely the timestamps of the tweets. Is there a way to do this easily? Right now, I'm doing a second pass after the modeling is done and guessing which cluster each of the original tweets belongs to. Is there a better way to have GSDMM aggregate this metadata while it does the modeling?
trnka t1_ixew7z9 wrote
It's hacky, but you could transform the timestamps into words. I've used that trick a few times successfully.
Something like TweetTimestampRangeA, TweetTimestampRangeB, ... One downside is that you'd need to commit to a strategy for the time ranges up front (either chop the data into N equal ranges, or else emit tokens for month, year, etc.).
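A minimal sketch of that trick, assuming equal-width buckets over a known date span (the bucket count, dates, and token naming here are made up for illustration):

```python
from datetime import datetime

def timestamp_token(ts, start, end, n_ranges=4):
    """Map a timestamp to one of n_ranges equal-width buckets over
    [start, end] and return a pseudo-word token for that bucket."""
    span = (end - start).total_seconds()
    offset = (ts - start).total_seconds()
    # clamp so ts == end still lands in the last bucket
    bucket = min(int(offset / span * n_ranges), n_ranges - 1)
    return f"TweetTimestampRange{chr(ord('A') + bucket)}"

# Hypothetical corpus spanning one year, split into four ranges
start = datetime(2022, 1, 1)
end = datetime(2023, 1, 1)

# Stemmed tokens for one tweet, plus its timestamp pseudo-word;
# GSDMM then clusters on the time token like any other word.
tokens = ["climat", "protest"]
tokens.append(timestamp_token(datetime(2022, 7, 15), start, end))
```

After this, each cluster's stem counts include the range tokens, so the temporal distribution falls out of the model instead of needing a second pass.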