
FastestLearner OP t1_j4ufxgc wrote

I am not well acquainted with NLP tasks, so I have no idea how much compute it would take to train a transformer on it (or fine-tune an existing model like BERT on the dataset). If resources are a concern, one could do crowd-sourced training, like LeelaChessZero. I think it's only a matter of time before someone comes along and does this, because blocking ads is the inevitable future of the internet. Also, a company/startup could do it on a subscription model, like the existing paid ad-blocking software. It's a potential startup idea IMO.

1

C0hentheBarbarian t1_j4ug85t wrote

Training isn't the main issue wrt cost. Inference is.

2

Philpax t1_j4uhmlc wrote

For this, I'd infer on the client (especially if you train on the YouTube transcript, so that you don't need to run Whisper over the audio track). Of course, it's much harder to make it a paid product then 😅

2

FastestLearner OP t1_j4uilg8 wrote

Yes. I initially thought of training a neural net on the audio track of a particular YT video, but I think the transcripts would provide just enough information, and fine-tuning existing language models should work quite well, especially given the recent tremendous growth of NLP. Collecting the audio would also require far more storage than text, and would probably need more RAM, VRAM, and compute.
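
The fine-tuning idea can be sketched in miniature. Below, a bag-of-words Naive Bayes stands in for a fine-tuned transformer — the real system would use something like BERT, but the shape is the same: labeled transcript sentences in, a sponsor/not-sponsor classifier out. All training sentences are invented examples.

```python
# Toy stand-in for fine-tuning a language model on transcript sentences:
# a bag-of-words Naive Bayes sponsor classifier. All data is invented.
from collections import Counter
import math

def tokenize(s):
    return s.lower().split()

def train(examples):
    """examples: list of (sentence, label), label in {"sponsor", "content"}."""
    counts = {"sponsor": Counter(), "content": Counter()}
    totals = {"sponsor": 0, "content": 0}
    for sentence, label in examples:
        for tok in tokenize(sentence):
            counts[label][tok] += 1
            totals[label] += 1
    vocab = set(counts["sponsor"]) | set(counts["content"])
    return counts, totals, vocab

def classify(model, sentence):
    counts, totals, vocab = model
    scores = {}
    for label in counts:
        # Log-probability with add-one (Laplace) smoothing.
        score = 0.0
        for tok in tokenize(sentence):
            score += math.log((counts[label][tok] + 1) /
                              (totals[label] + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train([
    ("this video is sponsored by our friends at nordlayer", "sponsor"),
    ("use code TECH for ten percent off your first order", "sponsor"),
    ("today we are reviewing the new graphics card", "content"),
    ("the benchmark results are really surprising", "content"),
])
print(classify(model, "this episode is sponsored by nordlayer"))  # sponsor
```

A transformer would generalise far better than this word-counting toy, but even this captures why transcripts alone carry enough signal: sponsor reads reuse a small, distinctive vocabulary.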

If you are leaning towards crowd-sourcing the inference, I think it would be possible using JS libs (such as TensorFlow.js), although I have no experience with these. The good thing is, once you do inference on a video, you just upload the result to the central server and everyone can get it for free (no further inference cost).
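
The "infer once, serve everyone" flow is simple to sketch. Here the central server is simulated with a plain dict and the model with a stub; names like `fetch_timestamps` and `upload_timestamps` are hypothetical, not part of any real SponsorBlock API.

```python
# Sketch of the infer-once-then-share flow. The "server" is a dict and
# the "model" a stub; all names and values are hypothetical.
SERVER_DB = {}  # video_id -> list of (start_sec, end_sec) sponsor segments

def fetch_timestamps(video_id):
    return SERVER_DB.get(video_id)

def upload_timestamps(video_id, segments):
    SERVER_DB[video_id] = segments

def run_local_inference(video_id):
    # Stand-in for running the model on the video's transcript.
    return [(30.0, 95.0)]

def get_sponsor_segments(video_id):
    cached = fetch_timestamps(video_id)
    if cached is not None:
        return cached          # free for everyone after the first run
    segments = run_local_inference(video_id)
    upload_timestamps(video_id, segments)
    return segments

first = get_sponsor_segments("abc123")   # triggers local inference
second = get_sponsor_segments("abc123")  # served from the central server
print(first == second)  # True
```

Only the first viewer of a video pays the inference cost; every later request is a cheap lookup, which is what makes the economics work.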

1

Philpax t1_j4uk4ws wrote

Honestly, I'm not convinced it needs a hugely complex language model, as (to me) it seems like a primarily classification task, and not one that would need a deep level of understanding. It'd be a level or two above standard spam filters, maybe?

The two primary NN-in-web solutions I'm aware of are tf.js and ONNX Runtime Web, both of which do CPU inference, though the latter is developing GPU inference. As you say, it only needs to be done once, so a button that scans through the transcript, classifies each sentence's probability of being sponsor-read or not, and then automatically derives the segment boundaries from those probabilities seems readily doable. Even if it takes a noticeable amount of time for the user, it's pretty quickly amortised across the entire viewing population.
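
Turning per-sentence probabilities into skippable segments can be as simple as thresholding each sentence and merging consecutive flagged sentences into (start, end) boundaries. The probabilities and timings below are made up.

```python
# One way to select segment boundaries from per-sentence sponsor
# probabilities: threshold, then merge consecutive flagged sentences.
def select_segments(sentences, threshold=0.5):
    """sentences: list of (start_sec, end_sec, sponsor_probability)."""
    segments = []
    current = None
    for start, end, prob in sentences:
        if prob >= threshold:
            if current is None:
                current = [start, end]   # open a new segment
            else:
                current[1] = end         # extend the open segment
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments

scored = [
    (0.0,  5.0, 0.02),
    (5.0, 12.0, 0.91),   # "this video is sponsored by..."
    (12.0, 20.0, 0.88),  # sponsor read continues
    (20.0, 26.0, 0.10),
]
print(select_segments(scored))  # [(5.0, 20.0)]
```

A production version might smooth the probabilities first (sponsor reads rarely flip on and off sentence-by-sentence), but the merge logic stays this shape.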

The only real concern I'd have at that point is... is it worth it for the average user over just hitting the right arrow twice and/or manually submitting the timestamps themselves? I suspect that's why it hasn't been done yet.

2

FastestLearner OP t1_j4z74l7 wrote

Yes. I too agree that a large model is not required for detecting simple phrases like "Please subscribe to our channel" or "Here is the sponsor of our video". I also have another idea which I think should help get better accuracy: use the channel's unique identifier (UID) or the channel's name as input, and generate probabilities conditioned on the channel's UID. This should help because any particular YouTube channel almost always uses the same phrase to introduce its sponsors in almost all of its videos. Think of LinusTechTips: you always hear the same thing, "here's the segue to our sponsor yada yada." So this should definitely allow the model to do more accurate inference. Alternatively, it could let you reduce the model's complexity to save the client's resources.
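
One lightweight way to get this channel conditioning is to keep a per-channel memory of phrases that previously marked sponsor segments, and boost the base model's probability when a sentence closely matches one. The channel ID, phrase list, and base scores below are all invented for illustration.

```python
# Sketch of channel-conditioned probabilities: interpolate the base
# model's score toward 1.0 when a sentence resembles a phrase this
# channel has used to introduce sponsors before. All data is invented.
from difflib import SequenceMatcher

CHANNEL_PHRASES = {
    "UC_LinusTechTips": ["here's the segue to our sponsor"],  # hypothetical UID
}

def channel_conditioned_prob(base_prob, channel_id, sentence):
    best = 0.0
    for phrase in CHANNEL_PHRASES.get(channel_id, []):
        sim = SequenceMatcher(None, phrase, sentence.lower()).ratio()
        best = max(best, sim)
    # The closer the match to a known sponsor-intro phrase, the harder
    # we pull the probability toward 1.0; no match leaves it unchanged.
    return base_prob + (1.0 - base_prob) * best

p = channel_conditioned_prob(0.4, "UC_LinusTechTips",
                             "Here's the segue to our sponsor, Pulseway")
print(p > 0.4)  # True: the channel prior raises the score
```

A real model would learn this conditioning end-to-end (e.g. by feeding a channel embedding alongside the sentence), but even this string-similarity prior captures the intuition that channels reuse their sponsor-intro phrasing.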

On the other thing you mentioned, about the average user just hitting the right arrow: I think (and this is my hypothesis) the number of users running ad-blocking software is increasing monotonically, because once a user gets to savour the internet without ads, they don't go back. Only older folks and the absolutely-not-computer-savvy don't use adblockers, and IMO that population is shrinking and will, in the (near) future, simply vanish. This is similar to what Steve Jobs said when asked whether people would ever use the mouse; look at now, everyone uses the mouse. Coming to sponsor blocking: not having to hit the right arrow at all is just more convenient than hitting it twice. Sometimes hitting it x times doesn't get the job done and you need to hit it again; also, you might overshoot the start of the non-sponsored segment, so you need to hit the left arrow once too. All of this is made convenient by the current SOTA SponsorBlock extension. It has just begun its journey, and I have no doubt that, just like the ad-blocking extensions, sponsor blocking is going to take off and see exponential growth.

2

FastestLearner OP t1_j4uhkbm wrote

Yes. I did think about that, and potential solutions could be:

(1) A startup offering the service in exchange for a small fee. The good thing is that once you do inference on a video, you can serve the result to thousands of customers at no additional cost (except server maintenance and bandwidth; no extra GPU cost beyond the first run on a particular video).

(2) Crowd-sourced inference. The current sponsor-blocking extension relies on manual user input, which it sources from the crowd and collects at a central server, so it's basically crowd-sourced (or peer-sourced) manual labour. I'm sure if someone came up with an automated version, like an executable that runs in the background with very small resource usage, inference could be crowd-sourced too; the timestamps could then be collected at a central server and distributed across the planet. The good thing is that as more and more people join the peer-sourced inference, the lower the cost of keeping any one peer's GPU busy.

1

float16 t1_j4uqlls wrote

2 seems doable. Not everybody has to have a GPU, but I bet lots of people, including me, would rather spin up the GPU in their personal computer for a few seconds than manually specify where skippable segments are.

The one central server thing bugs me. I'd prefer something like "query your nearest neighbors and choose the one with the most recent data." No idea how to do that though; not a systems person.

1

FastestLearner OP t1_j4x7rvp wrote

Yes. Your first point is something I would happily engage in as well; I have no problem contributing to the community. Moreover, the extension could have several additional options:

(i) Do not perform any inference on the client, i.e. always query existing timestamps from the online database. This would be helpful for users with low-power devices like laptops.

or

(ii) Perform inference only for the videos that the client watches. This is, of course, necessary when a video does not yet have timestamps on the server: the client does the inference and uploads the results to the central server.

or

(iii) Keep performing inference for new videos (even ones the particular user never watches). Folks who run powerful enough hardware and are eager to donate their computation time could choose this option, and I am pretty sure some such folks will emerge; the LeelaChessZero project banked entirely on this idea. For this option, there could be a slider to let the user control how much of their resources stays engaged (maybe by limiting the thread count).

The second point you mentioned could be implemented with a peer-to-peer communication protocol, but if the neural network's weights don't change, there would be no difference between the most recent and stale timestamps. Also, in P2P you'd still need trackers to keep track of peers, which could be a central server or be decentralized and serverless depending on the implementation. One potential problem could be latency, though.

1