reallyfuckingay
reallyfuckingay t1_jd2lhmv wrote
Reply to comment by [deleted] in The Internet Archive is defending its digital library in court today by OutlandishnessOk2452
Despite recent developments in AI suggesting otherwise, OCR tools, at least the ones available to the general public without paying for a license, are still imperfect enough that some manual cleanup is required afterwards. For larger bodies of text, that cleanup is often unmanageable for a single person in a short timeframe. There's a reason people are actually paid for this.
reallyfuckingay t1_ityzhkh wrote
Reply to comment by ked_man in TSMC says efforts to rebuild US semiconductor industry are doomed to fail by 0wed12
Wouldn't it be the other way around? Wouldn't Taiwan stand to lose if Americans were confident they didn't have to rely on them for chip production?
reallyfuckingay t1_jd7m3b1 wrote
Reply to comment by [deleted] in The Internet Archive is defending its digital library in court today by OutlandishnessOk2452
Late reply. I think you're overestimating the reliability of these tools based on an anecdote. Google Lens can achieve such accuracy on smaller pieces of text because it has been trained to guess what the next word will be based on the words that precede it. The OCR itself doesn't have to be perfect so long as the text follows a predictable pattern, which most real-life prose does.
When dealing with fictional settings, however, with names and terms that were made up by the author, or that are otherwise literary in nature and uncommon in colloquial English, this accuracy can drop quite significantly. It might mistake an obscure word for a much more common one with a completely different meaning, or parse speech that has intentionally been given an unorthographic affectation as random gibberish.
I've used tesseract to extract text from garbled PDFs in the past. It still took a painstaking number of reviews to catch all the errors that seemed to fit a sentence at a glance but were actually different from the original. It can definitely cut down on the amount of work needed, but it still isn't feasible to instantly and accurately transcribe bodies of text as large as entire books; otherwise you'd see it being used much more often.
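
To make the failure mode above concrete, here's a rough sketch in Python. It is not how Google Lens or tesseract work internally; it's just a toy "snap unknown words to the closest common word" step built on the standard library's `difflib`. The vocabulary, the `snap_to_vocabulary` name, and the 0.6 cutoff are all made up for illustration.

```python
import difflib

# Hypothetical stand-in for the "common words" a language-aware
# correction step would prefer. A real system would use a far larger
# vocabulary or a statistical language model.
COMMON_WORDS = ["the", "road", "to", "border", "was", "long"]


def snap_to_vocabulary(tokens, vocabulary, cutoff=0.6):
    """Replace each unknown token with its closest vocabulary word.

    Returns the corrected tokens plus a list of
    (index, original, replacement) tuples, so a human reviewer can
    check every substitution that was made.
    """
    corrected = []
    changes = []
    for index, token in enumerate(tokens):
        if token in vocabulary:
            corrected.append(token)
            continue
        # get_close_matches returns the best matches whose similarity
        # ratio is at least `cutoff`, most similar first.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changes.append((index, token, matches[0]))
        else:
            # Nothing similar enough: leave the token untouched.
            corrected.append(token)
    return corrected, changes


if __name__ == "__main__":
    # "mordor" is an invented name, so it is not in the vocabulary.
    scanned = ["the", "road", "to", "mordor", "was", "long"]
    corrected, changes = snap_to_vocabulary(scanned, COMMON_WORDS)
    print(" ".join(corrected))
    print(changes)
```

With this toy vocabulary the invented name gets swapped for "border", and the result still reads like a perfectly fine sentence, which is exactly the kind of error that is easy to miss at a glance. Keeping the `changes` list around at least gives a reviewer something to check instead of rereading the whole text.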