
FastestLearner OP t1_j4uilg8 wrote

Yes. I initially thought of having a neural net trained on the audio track of a particular YT video, but I think the transcripts would provide just enough information, and fine-tuning existing language models should work quite well, especially with the recent tremendous growth of NLP. Collecting the audio would also require far more storage space than text, and would probably need more RAM, VRAM, and compute.
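Roughly, I'm imagining something like the sketch below: fine-tune a small pretrained model on transcript sentences labeled sponsor vs. not. The tiny dataset, the DistilBERT checkpoint, and the training settings here are just placeholders, not a worked-out recipe:

```python
# Hypothetical sketch: fine-tune a small pretrained model to classify
# transcript sentences as sponsor-read (1) or regular content (0).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy labeled data; in practice this would come from community-submitted
# sponsor timestamps aligned against YouTube transcripts.
data = {
    "text": [
        "This video is sponsored by ExampleVPN.",
        "Let's take a look at the benchmark results.",
    ],
    "label": [1, 0],
}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

ds = Dataset.from_dict(data).map(
    lambda ex: tokenizer(ex["text"], truncation=True,
                         padding="max_length", max_length=64)
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sponsor-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=ds,
)
trainer.train()
```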

If you are leaning towards crowd-sourcing the inference, I think it would be possible to do that using JS libs (such as TensorFlow.js), although I have no experience with them. The good thing is, once you run inference on a video, you just upload the results to the central server and everyone can get them for free (no further inference cost).
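The flow I have in mind is something like this sketch (the server URL and endpoints are made up; the real API would depend on how the central server is designed):

```python
# Hypothetical client flow: ask the central server for cached sponsor
# segments first; only run local inference (and upload) on a cache miss.
import requests

SERVER = "https://example-sponsor-server.org/api"  # placeholder URL


def get_sponsor_segments(video_id: str, run_local_inference) -> list:
    resp = requests.get(f"{SERVER}/segments", params={"videoID": video_id})
    if resp.ok and resp.json():
        return resp.json()                       # someone already did the work

    segments = run_local_inference(video_id)     # expensive, but done once
    requests.post(f"{SERVER}/segments",
                  json={"videoID": video_id, "segments": segments})
    return segments
```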

1

Philpax t1_j4uk4ws wrote

Honestly, I'm not convinced it needs a hugely complex language model, as (to me) it seems like primarily a classification task, and not one that would need a deep level of understanding. It'd be a level or two above standard spam filters, maybe?

The two primary NN-in-web solutions I'm aware of are tf.js and ONNX Runtime Web, both of which do CPU inference, but the latter is developing some GPU inference. As you say, it only needs to be done once, so having a button that scans through the transcript, classifies each sentence as sponsor-read or not, and then automatically selects segment boundaries from those probabilities seems readily doable (see the sketch below). Even if it takes a noticeable amount of time for the user, it's pretty quickly amortised across the entire viewing population.
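For the boundary-selection step, a minimal sketch, assuming per-sentence sponsor probabilities and timestamps are already available from the classifier (the threshold and the tuple format are placeholders):

```python
# Hypothetical: turn per-sentence sponsor probabilities into time segments
# by thresholding and merging consecutive sponsor-flagged sentences.
def select_sponsor_segments(sentences, threshold=0.5):
    """sentences: list of (start_sec, end_sec, sponsor_probability)."""
    segments, current = [], None
    for start, end, prob in sentences:
        if prob >= threshold:
            if current is None:
                current = [start, end]   # open a new segment
            else:
                current[1] = end         # extend the open segment
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments

# e.g. [(0, 5, 0.02), (5, 12, 0.91), (12, 20, 0.88), (20, 30, 0.05)]
# -> [(5, 20)]
```

In practice you'd probably also want to bridge short gaps (one mis-classified sentence inside a sponsor read) before reporting the segment, but that's a detail.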

The only real concern I'd have at that point is... is it worth it for the average user over just hitting the right arrow twice and/or manually submitting the timestamps themselves? I suspect that's why it hasn't been done yet.

2

FastestLearner OP t1_j4z74l7 wrote

Yes, I too agree that a large model is not required for detecting simple phrases like "Please subscribe to our channel" or "Here is the sponsor of our video". I also have another idea which I think should help in getting better accuracy: use the channel's unique identifier (UID) or the channel's name as input (and generate probabilities conditioned on the channel's UID). This should help because any particular YouTube channel almost always uses the same phrase to introduce its sponsors in almost all of its videos. Think of LinusTechTips: you always hear the same thing, "here's the segue to our sponsor yada yada." So this should definitely allow the model to do more accurate inference. Alternatively, you could just reduce the model complexity to save the client's resources.
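A cheap way to get that conditioning, as a sketch (the tag format and the channel UID below are placeholders), is to just prepend the channel identifier to each sentence so the classifier can pick up per-channel phrasing:

```python
# Hypothetical: condition the classifier on the channel by prepending its
# UID/name to each sentence, so recurring per-channel sponsor-intro phrases
# can be learned alongside the sentence text.
def make_model_input(channel_uid: str, sentence: str) -> str:
    return f"[CHANNEL={channel_uid}] {sentence}"

# Placeholder UID; any stable channel identifier would do.
text = make_model_input("UC_some_channel_id",
                        "Here's the segue to our sponsor.")
# This string then goes to the same sentence classifier as before.
```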

On the other thing you mentioned, about the average user just hitting the right arrow twice: my hypothesis is that the number of users running ad-blocking software is increasing monotonically, because once a user gets to savour the internet without ads, they don't go back. Only the older folks and the absolutely-not-computer-savvy people don't use adblockers, and IMO that population is shrinking and in the (near) future will simply vanish. This is similar to what Steve Jobs said when he was asked whether people would ever use the mouse; look at it now, everyone uses the mouse.

Coming to sponsor blocking: not having to hit the right arrow at all is just more convenient than hitting it twice. Sometimes hitting it x number of times doesn't get the job done and you need to hit it again, and you might also skip past the beginning of the non-sponsored segment, so you need to hit the left arrow once too. All of this is made convenient by the current SOTA SponsorBlock extension. It has just begun its journey, and I have no doubt that, just like the ad-blocking extensions, sponsor blocking is going to take off and see exponential growth.

2