[D] Best way to deal with a varying number of inputs, each of variable size, using an RNN? (for an NLP task)
Submitted by danilo62 t3_126ndw1 in deeplearning
Hello, I am trying to do personality trait prediction from Facebook posts, and I'm currently facing an issue: I have multiple users, each with a different number of posts, and each post has a different length as well.
I am using BERT to get embeddings for each word in a post, and several other feature extraction methods to get additional features per post (sentiment analysis, TF-IDF, etc.). The prediction would be made per user, so the input consists of that user's posts, and each post is comprised of N word embeddings (N = number of words in the post) plus the additional features.
The issue I'm facing is that I don't know how to design my prediction model to deal with these two varying input sizes. If I had a variable number of inputs of the same size I could simply use an RNN, but that doesn't work here since the number of words per post varies too. What architecture could I use for this?
I considered using an RNN to process the word embeddings and TF-IDF scores (the features whose size varies) into a fixed-size output, which would then be combined with the other features and fed into a second RNN to predict the personality scores.
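That two-level idea can be sketched directly: a word-level RNN turns each post into a fixed-size vector, and a post-level RNN runs over those vectors. The sketch below uses a plain Elman RNN in numpy with made-up sizes (`EMB`, `HID`, `EXTRA`, `TRAITS` are all assumptions; in practice the word embeddings would be 768-dim BERT outputs and the RNNs would be trained layers in a framework like PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)

EMB = 8      # word-embedding size (BERT would give 768; small here for illustration)
HID = 6      # hidden size of both RNNs (assumed)
EXTRA = 3    # per-post extra features (sentiment score, TF-IDF stats, etc.)
TRAITS = 5   # e.g. Big Five personality scores

def rnn_last_hidden(xs, Wx, Wh, b):
    """Run a plain Elman RNN over a sequence and return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

# Hypothetical (untrained) parameters for the two levels plus an output layer.
Wx_w, Wh_w, b_w = rng.normal(size=(HID, EMB)), rng.normal(size=(HID, HID)), np.zeros(HID)
Wx_p, Wh_p, b_p = rng.normal(size=(HID, HID + EXTRA)), rng.normal(size=(HID, HID)), np.zeros(HID)
W_out, b_out = rng.normal(size=(TRAITS, HID)), np.zeros(TRAITS)

def predict_user(posts):
    """posts: list of (word_embeddings of shape (T_i, EMB), extra_features of shape (EXTRA,))."""
    # Level 1: word RNN collapses each variable-length post to one HID vector,
    # which is then concatenated with that post's fixed-size extra features.
    post_vecs = [np.concatenate([rnn_last_hidden(emb, Wx_w, Wh_w, b_w), extra])
                 for emb, extra in posts]
    # Level 2: post RNN collapses the variable number of posts to one user vector.
    user_vec = rnn_last_hidden(post_vecs, Wx_p, Wh_p, b_p)
    return W_out @ user_vec + b_out

# One user with 3 posts of different lengths: neither level needs padding.
posts = [(rng.normal(size=(t, EMB)), rng.normal(size=EXTRA)) for t in (4, 9, 2)]
scores = predict_user(posts)
print(scores.shape)  # (5,)
```

Both levels consume sequences of arbitrary length, so the two varying sizes (words per post, posts per user) are each absorbed by one RNN.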
Another option I considered is simply padding the input, but I don't know whether this would hurt accuracy significantly.
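Padding by itself need not hurt accuracy if the padded positions are masked out, so the model never attends to them. A minimal sketch, assuming masked mean-pooling as the aggregation (the helper names are made up):

```python
import numpy as np

def pad_and_mask(posts, emb_dim):
    """Pad variable-length posts to (num_posts, max_len, emb_dim) plus a boolean mask."""
    max_len = max(len(p) for p in posts)
    batch = np.zeros((len(posts), max_len, emb_dim))
    mask = np.zeros((len(posts), max_len), dtype=bool)
    for i, p in enumerate(posts):
        batch[i, :len(p)] = p
        mask[i, :len(p)] = True
    return batch, mask

def masked_mean(batch, mask):
    """Mean-pool over words, dividing by the true length so padding has no effect."""
    counts = mask.sum(axis=1, keepdims=True)
    return (batch * mask[..., None]).sum(axis=1) / counts

rng = np.random.default_rng(1)
posts = [rng.normal(size=(t, 4)) for t in (3, 7, 1)]
batch, mask = pad_and_mask(posts, 4)
pooled = masked_mean(batch, mask)
# pooled[i] equals the plain mean of post i; the padded zeros are ignored:
print(np.allclose(pooled[0], posts[0].mean(axis=0)))  # True
```

RNN frameworks offer the same idea natively (e.g. PyTorch's `pack_padded_sequence`), so the recurrence simply stops at each sequence's true length instead of running over padding.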
--dany-- t1_jedo4gy wrote
How about using an embedding of the whole post? Then you just have to train a model to predict traits from a single post. A person's overall trait scores can be the average of the traits predicted from all of their posts. I don't see the point of using an RNN over posts.
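The suggested aggregation is easy to sketch: predict traits per post, then average over a user's posts. Assuming each post has already been reduced to a fixed-size vector (e.g. BERT's [CLS] embedding) and using a hypothetical linear predictor:

```python
import numpy as np

def predict_post_traits(post_vec, W, b):
    """Hypothetical per-post trait predictor: one linear layer over a pooled post embedding."""
    return W @ post_vec + b

def predict_user_traits(post_vecs, W, b):
    """Average the per-post predictions to get the user's trait scores."""
    preds = np.stack([predict_post_traits(v, W, b) for v in post_vecs])
    return preds.mean(axis=0)

rng = np.random.default_rng(2)
EMB, TRAITS = 8, 5                      # assumed sizes for illustration
W, b = rng.normal(size=(TRAITS, EMB)), np.zeros(TRAITS)
post_vecs = [rng.normal(size=EMB) for _ in range(4)]  # any number of posts works
user_traits = predict_user_traits(post_vecs, W, b)
print(user_traits.shape)  # (5,)
```

The variable number of posts per user disappears into the averaging step, which is why no sequence model over posts is needed under this approach.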