farmingvillein t1_jbk1pv7 wrote
> What is the best way to build a custom text classifier leveraging your own data?
"Best" is subjective, but if you are truly new, check out huggingfaces--it will probably be "easiest" (and still high quality), which is what you need as a beginner.
> Also what is the best starting LLM for this purpose- smaller model like Roberta or larger ones like GPT?
Really depends on how much training hardware you have, and how important it is to be "the best".
RoBERTa is probably going to be the best starting point, from an effort:return perspective.
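To make that concrete, here is roughly what the RoBERTa route looks like with the Hugging Face `Trainer` API (a minimal sketch only--the imdb dataset, binary labels, and hyperparameters are placeholders you'd swap for your own data):

```python
# Minimal sketch: fine-tuning roberta-base as a binary text classifier with
# the Hugging Face Trainer API. Dataset, label count, and hyperparameters
# below are placeholders, not recommendations.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# imdb is just a stand-in; point this at your own labeled dataset.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-classifier",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```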
The above all said--
The other thing I'd encourage you to do is to start by just exploring text classification without doing any custom training. Simply take a couple of capable LLMs off the shelf (gpt-3.5-turbo via the API and the open-source FLAN-T5-XXL being obvious ones), experiment with how to prompt them well, and evaluate the results from there.
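A rough sketch of what that prompting baseline can look like, using a small FLAN-T5 checkpoint so it runs on modest hardware (the prompt template and label set are just examples--tuning the wording is most of the work, and you'd swap in flan-t5-xxl or an API model for real comparisons):

```python
# Sketch: zero-shot classification by prompting an off-the-shelf LLM.
# flan-t5-base is used here only so the example runs on a laptop.
from transformers import pipeline

classifier = pipeline("text2text-generation", model="google/flan-t5-base")

def classify(text, labels=("positive", "negative")):
    prompt = (
        f"Classify the following text as {' or '.join(labels)}.\n\n"
        f"Text: {text}\n"
        f"Label:"
    )
    # The model answers with free text; take its short generation as the label.
    return classifier(prompt, max_new_tokens=5)[0]["generated_text"].strip()

print(classify("The battery died after two days and support never replied."))
```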
This will probably be even faster than training something custom, and will give you a good baseline--even if the cost is higher than you want to pay in production, it will help you understand what behavior can look like, and the inference dollars you pay will likely be a fraction of any production training/inference costs.
If, e.g., you get 60% F1 with a "raw" LLM, then you can/should expect RoBERTa (assuming you have decent training data) to land somewhere around that (and this is an extremely rough back-of-envelope estimate; reality can be quite different, of course). If you then go and train a RoBERTa model and get, say, 30%, then you probably did something wrong--or the classification task requires a ton of nuance that is actually really hard, and you really should consider baselining on LLMs.
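To keep that comparison honest, score both approaches on the same held-out set with the same metric (dummy labels below, just to show the shape of the comparison):

```python
# Sketch: comparing the LLM baseline and the fine-tuned model on the same
# eval set with macro F1. The labels/predictions here are placeholders.
from sklearn.metrics import f1_score

y_true = ["positive", "negative", "negative", "positive"]
llm_preds = ["positive", "negative", "positive", "positive"]
roberta_preds = ["positive", "negative", "negative", "negative"]

print("LLM baseline F1:", f1_score(y_true, llm_preds, average="macro"))
print("RoBERTa F1:     ", f1_score(y_true, roberta_preds, average="macro"))
```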
Good luck!
The biggest takeaway you should have, as a beginner:
- Figure out what lets you get results fastest at every step, and prioritize that. Experimentation is still very much key in this field.
2muchnet42day t1_jcjsy5s wrote
Why do you suggest RoBERTa and not something like LLaMA or Stanford Alpaca?
farmingvillein t1_jckjsyr wrote
- Much more off-the-shelf right now (although that is changing rapidly)
- No/minimal IP issues/concerns (although maybe OP doesn't care about that)
2muchnet42day t1_jckjy9i wrote
Thank you
farmingvillein t1_jckm5r2 wrote
Although note that OP does say that his data isn't labeled...and you of course need labels to train RoBERTa. So you're going to need to bootstrap that process via manual labeling or--ideally, if able--via an LLM labeling process.
If you go through the effort to set up an LLM labeling pipeline, you might just find that it is easier to use the LLM as a classifier, instead of fine-tuning yet another model (depending on cost, quality, etc. concerns).
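If you do go the LLM-labeling route, the pipeline itself can be pretty simple (a rough sketch; the model, prompt, and label set are stand-ins, and in practice you'd spot-check a sample of the silver labels before training on them):

```python
# Sketch: bootstrapping labels for unlabeled text with an LLM, keeping only
# outputs that match the expected label set, to use as training data later.
from transformers import pipeline

labeler = pipeline("text2text-generation", model="google/flan-t5-base")
LABELS = {"positive", "negative"}

def llm_label(texts):
    labeled = []
    for text in texts:
        prompt = f"Classify as positive or negative.\n\nText: {text}\nLabel:"
        guess = labeler(prompt, max_new_tokens=5)[0]["generated_text"].strip().lower()
        if guess in LABELS:  # drop anything answered off-format
            labeled.append({"text": text, "label": guess})
    return labeled

silver_data = llm_label([
    "Shipping was fast and the fit is perfect.",
    "Arrived broken and the seller ignored my emails.",
])
print(silver_data)
```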