Submitted by FastestLearner t3_10f2joc in MachineLearning
[removed]
I am not well acquainted with NLP tasks, so I have no idea how many resources it would take to train a transformer on it (or to finetune an existing model like BERT on the dataset). If resources are a concern, one could do crowd-sourced training, like LeelaChessZero. I think it's only a matter of time before someone comes along and does this, because blocking ads is the inevitable future of the internet. Also, some company/startup could do it on a subscription model, like the already existing paid adblocking software. It's a potential startup idea IMO.
Training isn't the main issue wrt cost. Inference is.
For this, I'd infer on the client (especially if you train on the YouTube transcript, so that you don't need to run Whisper over the audio track). Of course, it's much harder to make it a paid product then.
Yes. I initially thought of having a neural net trained on the audio track of a particular YT video, but I think the transcripts would provide just enough information, and fine tuning existing language models would work quite well especially with the recent tremendous growth of NLP. Collecting the audio would also require far more storage space than text, and would probably require more RAM, VRAM and compute.
If you are leaning towards crowd-sourcing the inference, I think it would be possible to do that using JS libs (such as TensorFlow.js), although I have no experience with these. The good thing is, once you do an inference on a video, you just upload the resulting timestamps to the central server and everyone can get them for free (not requiring further inference costs).
Honestly, I'm not convinced it needs a hugely complex language model, as (to me) it seems like a primarily classification task, and not one that would need a deep level of understanding. It'd be a level or two above standard spam filters, maybe?
The two primary NN-in-web solutions I'm aware of are tf.js and ONNX Runtime Web, both of which do CPU inference, but the latter is developing some GPU inference. As you say, it only needs to be done once, so having a button that scans through the transcript and classifies sentence probabilities as sponsor-read or not, and then automatically selects the boundaries of the probabilities seems readily doable. Even if it takes some noticeable amount of time for the user, it's pretty quickly amortised across the entire viewing population.
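To make the "select the boundaries" step concrete, here is a minimal sketch in Python. It assumes some classifier has already given each transcript sentence a sponsor probability; the function name `sponsor_segments`, the 0.5 threshold and the 2-second merge gap are all made up for illustration, not anything an existing extension does.

```python
from typing import List, Tuple


def sponsor_segments(
    sentences: List[Tuple[float, float, float]],
    threshold: float = 0.5,  # assumed cut-off, would need tuning
    max_gap: float = 2.0,    # assumed: merge hits at most this many seconds apart
) -> List[Tuple[float, float]]:
    """Merge consecutive high-probability sentences into (start, end) spans.

    Each input item is (start_seconds, end_seconds, sponsor_probability),
    sorted by start time.
    """
    segments: List[Tuple[float, float]] = []
    for start, end, prob in sentences:
        if prob < threshold:
            continue  # not sponsor-like enough, skip this sentence
        if segments and start - segments[-1][1] <= max_gap:
            # Close enough to the previous hit: extend that segment.
            segments[-1] = (segments[-1][0], end)
        else:
            # Otherwise open a new segment.
            segments.append((start, end))
    return segments
```

The returned spans are the sort of (start, end) timestamps that could be uploaded once and then reused by everyone else watching the same video.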
The only real concern I'd have at that point is... is it worth it for the average user over just hitting the right arrow two times and/or manually submitting the timestamps themselves? I suspect that's why it hasn't been done yet
Yes. I too agree that a large model is not required for detecting simple phrases like "Please subscribe to our channel" or "Here is the sponsor of our video". I also have another idea which I think should help in getting better accuracy. Use the channel's unique identifier (UID) or the channel's name as input (and generate probabilities conditioned on the channel's UID). This should help because any particular YouTube channel almost always uses the same phrase to introduce its sponsors in almost all of its videos. Think of LinusTechTips, you always hear the same thing, "here's the segue to our sponsor yada yada." So this should definitely allow the model to do more accurate inference. Alternatively, you can just reduce the model complexity to save the client's resources.
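As a rough illustration of conditioning on the channel, here is a toy count-based estimator in Python. It is only a stand-in for a real model: it memorises how often an exact (normalised) sentence was labelled as sponsor for a given channel, and falls back to counts pooled over all channels when the channel has never used that sentence. The class name and the add-one smoothing are my own assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Key = Tuple[Optional[str], str]


class ChannelConditionedCounts:
    """Toy estimate of P(sponsor | channel, sentence) from labelled counts."""

    def __init__(self) -> None:
        # key -> [non_sponsor_count, sponsor_count]; channel None = all channels
        self.counts: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])

    @staticmethod
    def _normalise(sentence: str) -> str:
        # Lowercase and collapse whitespace so trivial variations still match.
        return " ".join(sentence.lower().split())

    def update(self, channel_id: str, sentence: str, is_sponsor: bool) -> None:
        norm = self._normalise(sentence)
        self.counts[(channel_id, norm)][int(is_sponsor)] += 1
        self.counts[(None, norm)][int(is_sponsor)] += 1

    def prob(self, channel_id: str, sentence: str) -> float:
        norm = self._normalise(sentence)
        # Prefer channel-specific counts, else fall back to the pooled counts.
        pair = self.counts.get((channel_id, norm)) or self.counts.get((None, norm), [0, 0])
        neg, pos = pair
        return (pos + 1) / (pos + neg + 2)  # add-one smoothing
```

In a neural model the same idea would more likely be a learned channel embedding or a channel token prepended to the text, but the effect is similar: the same sentence can get a different score depending on who says it.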
The other thing you mentioned, about the average user just hitting the right arrow two times: I think (and this is my hypothesis) the graph of users using adblocking software is just increasing monotonically, because once a user gets to savour the internet without ads, they don't go back. Only older folks and the absolutely-not-computer-savvy people don't use adblockers, and IMO that population is decreasing and in the (near) future it would simply vanish. This is similar to what Steve Jobs said when he was asked whether people would ever use the mouse. Look at it now, everyone uses the mouse. Coming to sponsor blocking, not hitting the right arrow is just more convenient than hitting the right arrow two times. Sometimes hitting it x number of times does not get the job done and you need to hit it further. Also, you might miss the beginning of the non-sponsored segment, so you need to hit the left arrow once too. All of this is made convenient by the current SOTA SponsorBlock extension. It has just begun its journey and I have no doubt that just like the adblocking extensions, sponsorblocking is going to take off and see exponential growth.
Yes. I did think about that and potential solutions could be:
(1) A startup offering services in exchange of a small fee - The good thing about it is that once you do an inference on a video, you can serve it to thousands of customers with no additional cost (except for server maintenance and bandwidth, but no extra GPU cost other than the first time you ran it on a particular video).
(2) Crowd sourced inference - The current state of the sponsor-blocking extension is that it requires manual user input which it sources from the crowd and collects at a central server. So it's basically crowd-sourced (or peer-sourced) manual labour. I'm sure if someone could come up with an automated version like an executable which runs in the background with very small resource usage, then inference can be done via crowd-sourcing too, the timestamps can then be collected to a central server and distributed across the planet. The good thing about this is that as more and more people join in to participate in the peer-sourced inference, the lower would be the cost of keeping any one peer's GPU busy.
2 seems doable. Not everybody has to have a GPU, but I bet lots of people, including me, would rather spin up the GPU in their personal computer for a few seconds than manually specify where skippable segments are.
The one central server thing bugs me. I'd prefer something like "query your nearest neighbors and choose the one with the most recent data." No idea how to do that though; not a systems person.
Yes. Your first point is something that I would happily engage in as well. I have no problems contributing to the community. Moreover, the extension can have several additional options like:
(i) Do not perform any kind of inference on the client, i.e. always query existing timestamps from the online database. This will be helpful for users with low power devices like laptops.
or
(ii) Perform inference (only) for the video that the client wants. This is, of course, necessary if the video does not have any timestamps on the server. It does the inference and uploads the results on the central server.
or
(iii) Keep performing inference for new videos (even ones that are not watched by the particular user) - Some folks who run powerful enough hardware and are eager to donate their computation time can choose this option. I am pretty sure some folks will emerge who are willing to do this. The LeelaChessZero project banked entirely on this particular idea. For this option, there could be a slider to let the user control how much of the resources to keep actively engaged (maybe by limiting thread count).
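The three options above could be sketched roughly like this in Python. Everything here is hypothetical: the `ClientMode` names, the plain dict standing in for the central timestamp database, and the `infer` callback standing in for whatever model runs on the client. The `budget` argument plays the role of the resource slider.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple

Segments = List[Tuple[float, float]]


class ClientMode(Enum):
    LOOKUP_ONLY = "lookup_only"          # option (i)
    INFER_ON_DEMAND = "infer_on_demand"  # option (ii)
    DONATE_COMPUTE = "donate_compute"    # option (iii)


def get_segments(
    video_id: str,
    mode: ClientMode,
    database: Dict[str, Segments],       # stand-in for the central server
    infer: Callable[[str], Segments],    # stand-in for on-device inference
) -> Optional[Segments]:
    """Return timestamps for the video being watched, inferring if allowed."""
    if video_id in database:
        return database[video_id]        # someone already paid the inference cost
    if mode is ClientMode.LOOKUP_ONLY:
        return None                      # low-power device: never infer locally
    segments = infer(video_id)
    database[video_id] = segments        # "upload" so nobody has to redo it
    return segments


def donate_compute(
    pending: List[str],
    database: Dict[str, Segments],
    infer: Callable[[str], Segments],
    budget: int,
) -> int:
    """Option (iii): process up to `budget` videos nobody has analysed yet."""
    done = 0
    for video_id in pending:
        if done >= budget:
            break
        if video_id not in database:
            database[video_id] = infer(video_id)
            done += 1
    return done
```

A real extension would replace the dict with network calls and would need some way to validate uploaded results, which this sketch ignores.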
The second point that you mentioned could be implemented with a peer-to-peer communication protocol, but if the neural network's weights don't change, then there would be no difference between the most recent and stale timestamps. Also, in P2P you'd still need trackers to keep track of peers, which could be a central server or be decentralized and serverless depending on the implementation. One potential problem could be latency though.
I've thought about this and it seems doable (especially with the availability of both YouTube transcripts and Whisper), but the cost of training would be quite steep for a hobbyist. Am excited to see if anyone tackles it, though.
I don't have much experience with the cost of training NLP models (I work mostly in Vision). But I think if you can get a product out with just enough accuracy to get heads turning in your favour, you could always scale up the model later down the road. Alternatively, you could have a donate button on the extension's settings page (which many extensions do), and if you do get some donations you could use them to update the model later on. It could be crowd-sourced and crowd-funded simultaneously.
I may try making this
Godspeed to you. I think the first person to get it to the chrome/firefox extension store would get the most downloads and pave the future for all other adblocking/sponsorblocking extensions (coz no other extension currently does that, AFAIK).
...or...you could just get the youtube adblock/sponsorblock skip extension (dunno exactly what it's called SkipAdTrigger or something? I cannot check my home machine at the moment...but it's available for Firefox and I'm pretty sure something similar must exist for other browsers as well).
Works well in my experience. It automatically skips sponsorblocks and marks them as green on the time bar (so you can manually watch them if you're into that kinda thing. Hey, there's all kinds of kinks out there. Don't judge.)
You don't really need a separate extension, do you? Your bot can just be another user submitting the timestamps.
Though it would help if the extension developer provided a list of videos that are being watched by their users but have no timestamps yet, so your bot isn't spending time scraping through unpopular videos.
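If such a list existed, the bot's queue could be as simple as the Python sketch below. The inputs are assumptions: a dict of per-video watch counts supplied by the extension developer and a set of video IDs that already have timestamps. Neither is something the current extension is known to publish.

```python
from typing import Dict, List, Set


def prioritise(
    watch_counts: Dict[str, int],
    have_timestamps: Set[str],
    limit: int,
) -> List[str]:
    """Return up to `limit` most-watched videos that still lack timestamps."""
    missing = [
        (count, video_id)
        for video_id, count in watch_counts.items()
        if video_id not in have_timestamps
    ]
    # Most watched first; break ties by video ID so the order is deterministic.
    missing.sort(key=lambda item: (-item[0], item[1]))
    return [video_id for _, video_id in missing[:limit]]
```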
Moderators, why did you delete the post? We were having such a good discussion.
CallFromMargin t1_j4uehtz wrote
How well would it work, and how much would it cost? GPU instances are not cheap, and each minute thousands of hours of YouTube videos are uploaded.