Thinking about that time Berkeley delisted thousands of recordings of course content as a result of a lawsuit complaining that they could not be utilized by deaf individuals. Can this be resolved with current technology? Google's auto-captioning has been abysmal up to this point, and I've often wondered what it would cost Google to run modern tech over the entire backlog of YouTube. At least then they might have a new source of training data.
What a silly requirement. Since 1% cannot benefit, let's remove it for the 99%.
kleiba 84 days ago [-]
Note that Berkeley is in theory not required to remove the video archive. It's just that by law they are required to add captions. So, if they wanted to keep it up, that's what they could do. Except that it's not really a choice - the cost of doing so would be prohibitive. So, really, Berkeley is left with no choice: "make the recordings accessible or don't offer them at all" means - in practice - "don't offer them at all".
Clearly the result of a regulation that meant well. But the road to hell is paved with good intentions.
It's a bit reminiscent of a law that prevents institutions from continually offering employees non-permanent work contracts. As in, after two fixed-term contracts, the third one must be permanent. The idea is to guarantee workers more stable and long-term perspectives. The result, however, is that the employee's contract won't get renewed at all after the second one, and instead someone else will be hired on a non-permanent contract.
freedomben 84 days ago [-]
> the road to hell is paved with good intentions
The longer I live the more the truth of this gets reinforced. We humans really are kind of bad at designing systems and/or solving problems (especially problems of our own making). Most of us are like Ralph Wiggum with a crayon sticking out of our noses saying, "I'm helping!"
Thorrez 84 days ago [-]
In the past, my university was publishing and mailing me a print magazine, and making it available in pdf form online. Then they stopped making the pdf available. I emailed them and asked why. They said it's because the pdf wasn't accessible.
But the print form was even less accessible, and they kept publishing that...
giancarlostoro 84 days ago [-]
ADA compliance will cost you.
3abiton 85 days ago [-]
It's one of those "to motivate the horse to run 1% faster, you add a shit ton of weight on top of it" strategies.
IanCal 85 days ago [-]
The problem is that having that rule results in those 1%s always being excluded. It's probably worth just going back and looking at the arguments for laws around accessibility.
mst 84 days ago [-]
Yeah, every time I try and figure out an approach that could've avoided this being covered by the rules without making it easy for everybody to screw over deaf people entirely, I end up coming to the conclusion that there probably isn't one.
I'm somewhat tempted to think that whoever sued Berkeley and had the whole thing taken down in this specific case was just being a knob, but OTOH there's issues even with that POV in terms of letting precedents be set that will de facto still become "screw over deaf people entirely" even when everybody involved is doing their best to act in good faith.
Hopefully speech-to-text and text-to-speech will make the question moot in the medium term.
freedomben 84 days ago [-]
> Hopefully speech-to-text and text-to-speech will make the question moot in the medium term.
I really think this and other tech advances are going to be our saviors. It's still early days and it sometimes gets things wrong, but it's going to get good and it will basically allow us to have our cake and eat it too (as long as we can prevent having automated solutions banned).
mst 84 days ago [-]
Yeah, my hopes have the caveat of "this requires regulations to catch up to where technology is at rather than making everything worse" and in addition to my generally low opinion of politicians (the ones I've voted for absolutely included) there's a serious risk of a "boomers versus technology" incident spannering it even if everything else goes right ... but I can still *hope* even if I can see a number of possible futures where said hopes will turn out to be in vain.
andai 85 days ago [-]
Didn't YouTube have auto-captions at the time this was discussed? Yeah they're a bit dodgy but I often watch videos in public with sound muted and 90% of the time you can guess what word it was meant to be from context. (And indeed more recent models do way, way, way better on accuracy.)
zehaeva 85 days ago [-]
I have a few Deaf/Hard of Hearing friends who find the auto-captions to be basically useless.
Anything that's even remotely domain specific becomes a garbled mess. Even documentaries about light engineering/archeology/history subjects are hilariously bad. Names of historical places and people are randomly correct and almost never consistent.
The second anyone has a bit of an accent, it's completely useless.
I keep them on partially because I'm of the "everything needs to have subtitles else I can't hear the words they're saying" cohort. So I can figure out what they really mean, but if you couldn't hear anything I can see it being hugely distracting/distressing/confusing/frustrating.
hunter2_ 85 days ago [-]
With this context, it seems as though correction-by-LLM might be a net win among your Deaf/HoH friends even if it would be a net loss for you, since you're able to correct on the fly better than an LLM probably would, while the opposite is more often true for them, due to differences in experience with phonetics?
Soundex [0] is a prevailing method of codifying phonetic similarity, but unfortunately it's focused on names exclusively. Any correction-by-LLM really ought to generate substitution probabilities weighted heavily on something like that, I would think.
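For the curious, the bucketing Soundex does is simple enough to sketch; this is a minimal version of the textbook scheme (keep the first letter, map consonants to digit classes, collapse runs, pad to four characters), simplified in that it skips the H/W edge-case rule:

    # Minimal Soundex sketch - illustrative only, skips the H/W adjacency rule.
    CODES = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }

    def soundex(word: str) -> str:
        word = word.lower()
        digits = [CODES.get(c, "") for c in word]
        collapsed = []
        for d in digits:
            if d and (not collapsed or collapsed[-1] != d):
                collapsed.append(d)      # new code, keep it
            elif not d:
                collapsed.append("")     # vowel/h/w/y breaks a run
        code = word[0].upper() + "".join(d for d in collapsed[1:] if d)
        return (code + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))  # R163 R163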
You can also download the audio only with yt-dlp and then remake subs with whisper or whatever other model you want. GPU-compute-wise it will probably be less than asking an LLM to try to correct a garbled transcript.
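If anyone wants to try that, a rough sketch using the yt-dlp and openai-whisper Python packages (the URL, filenames and model size are placeholders; needs ffmpeg on PATH):

    # Rough sketch: grab audio with yt-dlp, transcribe with openai-whisper.
    import yt_dlp
    import whisper

    URL = "https://www.youtube.com/watch?v=EXAMPLE"  # placeholder

    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": "lecture.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([URL])

    model = whisper.load_model("large-v3")  # or a smaller model on modest GPUs
    result = model.transcribe("lecture.mp3")
    for seg in result["segments"]:
        print(f"{seg['start']:8.2f}  {seg['text']}")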
ldenoue 85 days ago [-]
The current Flash-8B model I use costs $1 per 500 hours of transcript.
andai 84 days ago [-]
If I read OpenAI's pricing right, then Google's thing is 200 times cheaper?
HPsquared 85 days ago [-]
I suppose the gold standard would be a multimodal model that also looks at the screen (maybe only if the captions aren't making much sense).
schrodinger 85 days ago [-]
I'd assume Soundex is too basic and English-centric to be a practical solution for an international company like Google. I was taught it and implemented it in a freshman-level CS course in 2004; it can't be anywhere near state of the art!
shakna 85 days ago [-]
Soundex is fast, but inaccurate. It only prevails because of the computational cost of things like Levenshtein distance.
creato 85 days ago [-]
I use youtube closed captions all the time when I don't want to have audio. The captions are almost always fine. I definitely am not watching videos that would have had professional/human edited captions either.
There may be mistakes like the ones you mentioned (getting names wrong/inconsistent), but if I know what was intended, it's pretty easy to ignore that. I think expecting "textual" correctness is unreasonable. Usually when there are mistakes, they are "phonetic", i.e. if you spoke the caption out loud, it would sound pretty similar to what was spoken in the video.
dqv 85 days ago [-]
> I think expecting "textual" correctness is unreasonable.
Of course you think that, you don't have to rely solely on closed captions! It's usually not even posed as an expectation, but as a request to correct captions that don't make sense. Especially now that we have auto-captioning and tools that auto-correct the captions, running through and tweaking them to near-perfect accuracy is not an undue burden.
> if you spoke the caption out loud, it would sound pretty similar to what was spoken in the video.
Yes, but most deaf people can't do that. Even if they can, they shouldn't have to.
beeboobaa6 84 days ago [-]
There's helping people and there's infantilizing them. Being deaf doesn't mean you're stupid. They can figure it out.
Deleting thousands of hours of course material because you're worried they're not able to understand autogenerated captions just ensures everyone loses. Don't be so ridiculous.
mst 84 days ago [-]
They continue to be the worst automated transcripts I encounter and personally I find them sufficiently terribad that every time I try them I end up filing them under "nope, still more trouble than it's worth, gonna find a different source for this information and give them another go in six months."
Even mentally sounding them out (which is fine for me since I have no relevant disabilities, I just despise trying to take in any meaningful quantity of information from a video) when they look weird doesn't make them tolerable *for me*.
It's still a good thing overall that they're tolerable for you, though, and I hope other people are on average finding the experience closer to how you find it than how I find it ... but I definitely don't, yet.
Hopefully in a year or so I'll be in the same camp as you are, though, overall progress in the relevant class of tech seems to've hit a pretty decent velocity these days.
GaggiX 84 days ago [-]
YouTube captions have improved massively in recent years; they are flawless in most cases, with only occasional errors (almost entirely in reporting numbers).
I think that the biggest problem is that the subtitles do not distinguish between the speakers.
ldenoue 85 days ago [-]
Definitely: and just giving the LLM context before correcting (in this case the title and description of the video, often written by a person) creates much better transcripts.
jonas21 85 days ago [-]
Yes, but the DOJ determined that the auto-generated captions were "inaccurate and incomplete, making the content inaccessible to individuals with hearing disabilities." [1]
If the automatically-generated captions are now of a similar quality as human-generated ones, then that changes things.
Probably quite expensive over the whole catalog, but the Berkeley content would be cheap to do.
If it's, say, 5000 hours, then through the best model at assembly.ai with no discounts it would cost less than $2000. I know someone could do whisper for cheaper, and there likely would be discounts at this rate, but worst case it seems very doable even for an individual.
ldenoue 85 days ago [-]
My repo doesn't reprocess the audio track: instead it makes the raw ASR text transcript better by feeding the LLM additional info (title and description) and asking it to fix errors.
It is not perfect - it sometimes replaces words with a synonym - but it is much faster and cheaper.
Gemini 1.5 Flash-8B is so cheap that it costs $1 per 500 hours of transcript.
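Not the repo's actual code, but the general shape of that approach with the google-generativeai Python client (the prompt wording and function name are mine):

    # Sketch: hand the LLM the video's title/description as context, then ask it
    # to fix ASR errors in a transcript chunk without paraphrasing.
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash-8b")

    def clean_chunk(title: str, description: str, chunk: str) -> str:
        prompt = (
            "You are correcting an automatic speech recognition transcript.\n"
            f"Video title: {title}\n"
            f"Video description: {description}\n"
            "Fix misrecognized words, names and punctuation, but do not "
            "paraphrase or summarize. Return only the corrected text.\n\n"
            f"Transcript chunk:\n{chunk}"
        )
        return model.generate_content(prompt).text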
ei23 85 days ago [-]
With an RTX 4090 and insanely-fast-whisper running whisper-large-v3-turbo (see Whisper-WebUI for easy testing) you can transcribe 5000h on consumer hardware in about 50h, with timestamps.
So, yeah. I also know someone.
IanCal 84 days ago [-]
I can also run this all locally; my point was more that, at worst, right now the most advanced model (afaik - I'm not personally benchmarking), paid for at the headline rates for a huge content library, costs such a reasonable amount that an individual can do it. I've donated more to single charities than this would cost; while it's not an insignificant sum, it's a "find one person who cares enough" level problem.
Grabbing the audio from thousands of hours of video, or even just managing getting the content from wherever it's stored, is probably more of an issue than actually creating the transcripts.
If anyone reading this has access to the original recordings, this is a pretty great time to get transcriptions.
delusional 85 days ago [-]
That's a legal issue. If humans wanted that content to be up, we just could have agreed to keep it up. Legal issues don't get solved by technology.
jazzyjackson 85 days ago [-]
Well. The legal complaint was that transcripts don't exist. The issue was that it was prohibitively expensive to resolve the complaint. Now that transcription is 0.1% of the cost it was 8 years ago, maybe the complaint could have been resolved.
Is building a ramp to meet ADA requirements not using technology to solve a legal issue?
delusional 85 days ago [-]
Nowhere on the linked page at least does it say that it was due to cost. It would seem more likely to me that it was a question of nobody wanting to bother standing up for the videos. If nobody wants to take the fight, the default judgement becomes to take it down.
Building a ramp solves a problem. Pointing at a ramp 5 blocks away 7 years later and asking "doesn't this solve this issue" doesn't.
pests 85 days ago [-]
Yet this feels very Harrison Bergeron to me. To handicap those with ability so we all can be at the same level.
fuzzy_biscuit 84 days ago [-]
Right. The judgment doesn't help people with disabilities at all. It only punishes the rest of the population.
yard2010 85 days ago [-]
Yet. Legal issues don't get solved by tech yet!
wood_spirit 85 days ago [-]
As an aside, has anyone else had some big hallucinations with the Gemini meet summaries? Have been using it a week or so and loving the quality of the grammar of the summary etc, but noticed two recurring problems: omitting what was actually the most important point raised, and hallucinating things like “person x suggested y do z” when, really, that is absolutely the last thing x would really suggest!
leetharris 85 days ago [-]
The Google ASR is one of the worst on the internet. We run benchmarks of the entire industry regularly and the only hyperscaler with a good ASR is Azure. They acquired Nuance for $20b a while ago and they have a solid lead in the cloud space.
And to run it on a "free" product they probably use a very tiny, heavily quantized version of their already weak ASR.
There's lots and lots of better meeting bots if you don't mind paying, or have low enough usage that a free tier works. At Rev we give away something like 300 minutes a month.
jll29 85 days ago [-]
Interesting. Do you have any peer reviewed scientific publications or technical reports regarding this work?
We also compared Amazon, Google, Microsoft Azure as well as a bunch of smaller players (from Edinburgh and Cambridge) and - consistent with what you reported - we also found Google ranked worst - but that was a one-off study from 2019 (unpublished) on financial news.
Word Error Rate (WER), the standard metric for the task, is not everything. For some applications, the ability to upload custom lexicons is paramount (ASR systems that are word-based (almost all), as opposed to phoneme-based, require each word to be defined ahead of being able to recognize said word).
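For reference, WER is just word-level edit distance divided by the reference length; a minimal sketch (my own, not any particular toolkit's, and with no text normalization):

    # WER = (substitutions + deletions + insertions) / number of reference words,
    # computed here with a plain dynamic-programming edit distance.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.17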
aftbit 85 days ago [-]
Are there any self-hosted options that are even remotely competitive? I have tried Whisper2 a fair bit, and it seems to work okay in very clean situations, like adding subtitles to movie dialog, but not so well when dealing with multiple speakers or poor audio quality.
albertzeyer 85 days ago [-]
K2/Kaldi uses more traditional ASR technology. It's probably more difficult to set up, but you will get more reliable outputs (no hallucinations or the like).
baxtr 85 days ago [-]
Very interesting. Thanks for sharing.
Since you have experience in this, I’d like to hear your thoughts on a common assumption.
It goes like this: don't build anything that would be a feature for a hyperscaler, because ultimately they win.
I guess a lot of it is a question of timing?
leetharris 85 days ago [-]
I think it really depends on whether or not you can offer a competitive solution and what your end goals are. Do you want an indie hacker business, do you want a lifestyle business, do you want a big exit, do you want to go public, etc?
It is hard to compete with these hyperscalers because they use pseudo anti-competitive tactics that honestly should be illegal.
For example, I know some ASR providers have lost deals to GCP or AWS because those providers will basically throw in ASR for free if you sign up for X amount of EC2 or Y amount of S3, services that have absurd margins for the cloud providers.
Still, stuff like Supabase, Twilio, etc show there is a market. But it's likely shrinking as consolidation continues, exits slow, and the DOJ turns a blind eye to all of this.
hackernewds 85 days ago [-]
Counter argument: Zoom, DocuSign
But you do have to be next to amazing at execution
mst 84 days ago [-]
I think those are cases of successfully becoming *the* company for the thing in the minds of decision makers before the hyperscalers decide to try and turn your product into a bundleable feature.
Which is not to disagree with you, only to "yes, and" to emphasise that it's a fairly narrow path and 'amazing at execution' is necessary but not sufficient.
depr 84 days ago [-]
Have you tested their new Chirp v2 model? Curious if there's any improvement there.
>the only hyperscaler with a good ASR is Azure
How would you say the non-hyperscalers compare? Speechmatics for example?
hunter2_ 85 days ago [-]
It can simultaneously be [the last thing x would suggest] and [a conclusion that an uninvolved person tasked with summarizing might mistakenly draw, with slightly higher probability of making this mistake than not making it] and theoretically an LLM attempts to output the latter. The same exact principle applies to missing the most important point.
tombh 85 days ago [-]
ASR: Automatic Speech Recognition
thaumasiotes 85 days ago [-]
Is that different from "speech-to-text"?
simsla 82 days ago [-]
Same thing, but ASR is the 'official' term for it.
joshdavham 85 days ago [-]
I was too afraid to ask!
throwaway106382 85 days ago [-]
Not to be confused with "Autonomous Sensory Meridian Response" (ASMR) - a popular category of video on Youtube.
hackernewds 85 days ago [-]
How would they be confused?
throwaway106382 84 days ago [-]
I think more people actually know what ASMR is as opposed to ASR. Lots of ASMR videos are people speaking/whispering at extremely low volume.
I don't think it's quite out of the realm of possibility to have interpreted it as "Gemini LLM corrects ASMR YouTube transcripts". Because, you know... they're whispering, so it might be hard to understand or transcribe.
wodenokoto 85 days ago [-]
I can't explain the how, but I thought it was the ASMR thing the title referred to.
xanth 85 days ago [-]
This was a clever jape; a good example of ironic anti-humor. But I don't think you were confused by that either ;)
djmips 85 days ago [-]
Clever japes are not desired on HN - there's Reddit for that, my friend.
sidcool 85 days ago [-]
This is pretty cool. But at the risk of a digression, I can't imagine sharing my API keys with a random website on HN. There has to be a safe approach to this. Like limited use API keys, rate limited API keys or unsafe API keys etc.
mst 84 days ago [-]
I'm aware this isn't a *proper* solution, but "throw your current API key at it, then as soon as you're done playing around, execute a test of your API key rotation scripting" isn't a terrible workaround, especially if you're the sort of person who really *meant* to have tested said scripting recently but kept not getting around to it ("hi").
thomasahle 85 days ago [-]
Can't you just create a new API key with a limited budget?
sidcool 85 days ago [-]
The risk of leakage is very high. If Anthropic, Google, OpenAI can provide dispensable keys, it will be great.
thomasahle 84 days ago [-]
Both OpenAI and Anthropic let you disable and delete keys. I'd be surprised if Google doesn't.
ldenoue 85 days ago [-]
I should do that, let me try.
alsetmusic 85 days ago [-]
Seems like one of the places where LLMs make a lot of sense. I see some boneheaded transcriptions in videos pretty regularly. Comparing them against "more-likely" words or phrases seems like an ideal use case.
leetharris 85 days ago [-]
A few problems with this approach:
1. It brings everything back to the "average." Any outliers get discarded. For example, someone who is a circus performer plays fetch with their frog. An LLM would think this is an obvious error and correct it to "dog."
2. LLMs want to format everything as internet text which does not align well to natural human speech.
3. Hallucinations still happen at scale, regardless of model quality.
We've done a lot of experiments on this at Rev and it's still useful for the right scenario, but not as reliable as you may think.
ldenoue 85 days ago [-]
Do you have something to read about your study, experiments? Genuinely interested. Perhaps the prompts can be made to tell the LLM it's specifically handling human speech, not written text?
falcor84 85 days ago [-]
Regarding the frog, I would assume that the way to address this would be to feed the LLM screenshots from the video, if the budget allows.
leetharris 85 days ago [-]
Generally yes. That being said, sometimes multimodal LLMs show decreased performance with extra modalities.
The extra dimensions of analysis cause increased hallucination at times. So maybe it solves the frog problem, but now it's hallucinating in another section because it got confused by another frame's tokens.
One thing we've wanted to explore lately has been video-based diarization. If I have a video to accompany some audio, can I help with cross talk and sound separation by matching lips with audio and assigning the correct speaker more accurately? There's likely something there.
Those transcriptions are already done by LLMs in the first place - in fact, audio transcription was one of the very first large-scale commercial uses of the technology in its current iteration.
This is just like playing a game of Markov telephone, where the step in OP's solution likely has a higher compute cost than the step YT uses, because YT is interested in minimizing costs.
albertzeyer 85 days ago [-]
Probably just "regular" LMs, not large LMs - I assume some LM with 10-100M params or so, which is cheap to use (and very standard for ASR).
devmor 84 days ago [-]
Could be. I ran through some offline LMs for voice assisted home automation a couple years ago and they were subpar compared to even the pathetic offering that Youtube provides - but Google of course has much more focused resources to fine tune a small dataset model.
dylan604 85 days ago [-]
What about the cases where the human speaking is actually using nonsense words during a meandering off topic bit of "weaving"? Replacing those nonsense words would be a disservice as it would totally change the tone of the speech.
petesergeant 85 days ago [-]
Also useful, I think, for checking human-entered transcriptions, which even on expensively produced shows can often be garbage or just wrong. One human + two separate LLMs, and something to tie-break, and we could possibly finally get decent subtitles for stuff.
icelancer 85 days ago [-]
Nice use of an LLM - we use Groq 70b models for this in our pipelines at work. (After using WhisperX ASR on meeting files and such)
One of the better reasons to use Cerebras/Groq that I've found - you can get huge amounts of clean text back fast for processing in other ways.
ldenoue 85 days ago [-]
Although Gemini accepts very long input context, I found that sending more than 512 or so words at a time to the LLM for "cleaning up the text" yields hallucinations. That's why I chunk the raw transcript into 512-word chunks.
Are you saying it works with 70B models on Groq? Mixtral, Llama? Other?
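The chunking step described above is trivial to sketch (word-based, no overlap; 512 is the figure from the comment, not a magic number):

    # Split a raw ASR transcript into ~512-word chunks before LLM cleanup,
    # since longer inputs were observed to start hallucinating.
    def chunk_words(text: str, size: int = 512) -> list[str]:
        words = text.split()
        return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]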
bob_theslob646 79 days ago [-]
When you did this, I am assuming you cut the audio off around 5 mins?
Yeah, I've had no issues sending tokens up to the context limit. I cut it off with a 10% buffer but that's just to ensure I don't run into tokenization miscounting between tiktoken and whatever tokenizer my actual LLM uses.
I have had little success with Gemini and long videos. My pipeline is video -> ffmpeg strip audio -> whisperX ASR -> groq (L3-70b-specdec) -> gpt-4o/sonnet-3.5 for summarization. Works great.
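The ffmpeg step of a pipeline like that looks roughly like this (filenames are placeholders; 16 kHz mono is the usual input format for Whisper-family models):

    # Strip the audio track and downsample to 16 kHz mono WAV for ASR.
    import subprocess

    subprocess.run(
        ["ffmpeg", "-i", "talk.mp4",  # input video (placeholder)
         "-vn",                       # drop the video stream
         "-ac", "1", "-ar", "16000",  # mono, 16 kHz
         "talk.wav"],
        check=True,
    )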
sorenjan 85 days ago [-]
Using an LLM to correct text is a good idea, but the text transcript doesn't have information about how confident the speech-to-text conversion is. Whisper can output a confidence for each word; this would probably make for a better pipeline. It would surprise me if Google doesn't do something like this soon, although maybe a good speech-to-text model is too computationally expensive for YouTube at the moment.
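As a sketch of what that could look like with openai-whisper's word timestamps (the 0.5 threshold is arbitrary and the filenames are placeholders):

    # Flag low-confidence words from Whisper so a later LLM pass can focus on
    # them instead of rewriting the whole transcript.
    import whisper

    model = whisper.load_model("medium")
    result = model.transcribe("lecture.wav", word_timestamps=True)

    for seg in result["segments"]:
        for w in seg["words"]:
            if w["probability"] < 0.5:  # arbitrary cutoff
                print(f"{w['start']:7.2f}s  {w['word']!r} looks unreliable")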
dylan604 85 days ago [-]
Depends on your purpose for the transcript. If you are expecting the exact form of the words spoken in written form, then any deviation from that is no longer a transcription. At that point it is text loosely based on the spoken content.
Once you accept it's okay for the LLM to just replace words in a transcript, you might as well just let it make up a story based on character names you've provided.
falcor84 85 days ago [-]
> any deviation from that is no longer a transcription
That's a wild exaggeration. Professional transcripts often have small (and not so small) mistakes, caused by typos, mishearing or lack of familiarity with the subject matter. Depending on the case, these are then manually proofread, but even after proofreading, some mistakes often remain, and occasionally even introduced.
dylan604 85 days ago [-]
Maybe, but typos are not even the same thing as an LLM thinking of a better next choice of words instead of just transcribing what was heard.
Timwi 85 days ago [-]
Can I use this to generate subtitles for my own videos? I would love to have subtitles on them but I can't be bothered to do all the timing synchronization by hand. Surely there must be a way to automate that?
geor9e 85 days ago [-]
That's called YouTube Automatic Speech Recognition (captioning), and is what this tool uses as input. You can turn those on in YouTube Studio.
leetharris 85 days ago [-]
The main challenge with using LLMs pretrained on internet text for transcript correction is that you reduce verbatimicity due to the nature of an LLM wanting to format every transcript as internet text.
Talking has a lot of nuances to it. Just try to read a Donald Trump transcript. A professional author would never write a book's dialogue like that.
Using a generic LLM on transcripts almost always reduces accuracy as a whole. We have endless benchmark data to demonstrate this at RevAI. It does, however, help with custom vocabulary, rare words, proper nouns, and some people prefer the "readability" of an LLM-formatted transcript. It will read more like a wikipedia page or a book as opposed to the true nature of a transcript, which can be ugly, messy, and hard to parse at times.
phrotoma 84 days ago [-]
I googled "verbatimicity" and all I could find was stuff published by rev.ai which didn't (at a quick glance) define the term. Can you clarify what this means?
depr 84 days ago [-]
Most likely they mean the degree of being verbatim or exact in reproduction.
dylan604 85 days ago [-]
> A professional author would never write a book's dialogue like that.
That's a bit too far. Ever read Huck Finn?
kelvinjps 85 days ago [-]
Google should have the tech needed for good AI transcription, so why don't they integrate it into their auto-captioning instead of offering those crappy auto subtitles?
summerlight 85 days ago [-]
YT is using USM, which is supposed to be their SOTA ASR model. Gemini has much better linguistic knowledge, but it's likely prohibitively expensive to be used on all YT videos uploaded every day. But this "correction" approach does seem to be a nice, cost-effective way to apply an LLM.
briga 85 days ago [-]
Are they crappy though? Most of the time they get things right, even if they aren't as accurate as a human. And sure, Google probably has better techniques for this, but are they cost-effective to run at YouTube scale? I think their current solution is good enough for most purposes, even if it isn't perfect.
InsideOutSanta 85 days ago [-]
I'm watching YouTube videos with subtitles for my wife, who doesn't speak English. For videos on basic topics where people speak clear, unaccented English, they work fine (i.e. you usually get what people are saying). If the topic is in any way unusual, the recording quality is poor, or people have accents, the results very quickly turn into a garbled mess that is incomprehensible at best, and misleading (i.e. the subtitles seem coherent, but are wrong) at worst.
wahnfrieden 85 days ago [-]
Japanese auto captions suck
pachico 84 days ago [-]
Hmm, so this is expecting me to upload a personal API Key...
ldenoue 83 days ago [-]
It’s not uploaded anywhere: the client calls Gemini’s servers directly from your browser.
But I understand it can be difficult to trust: that’s why the project is on GitHub so you can run it on your own machine and look at how the key is used.
I will try to offer a version that doesn’t require any key.
dr_dshiv 85 days ago [-]
The first time I used Gemini, I gave it a youtube link and asked for a transcript. It told me how I could transcribe it myself. Honestly, I haven't used it since. Was that unfair of me?
robrenaud 85 days ago [-]
Gemini is much worse as a product than 4o or Claude. I recommend using it from Google AI studio rather than the official consumer facing interface. But for tasks with large audio/visual input, it's better than 4o or Claude.
Whether you want to deal with it being annoying is your call.
replwoacause 84 days ago [-]
No, it’s a terrible product that is embarrassingly bad compared to the competition. I ditched it after paying for a month of Gemini Advanced because it was so much worse than the other offerings.
andai 85 days ago [-]
GPT told me the same thing when I asked it to make an API call, or do an image search, or download a transcript of a YouTube video, or...
Spooky23 85 days ago [-]
The consumer Gemini is very prudish and optimized against risk to Google.
replwoacause 84 days ago [-]
In my experience Gemini Advanced is still so far behind ChatGPT and Claude. Recently it flat out refused to answer my fairly straightforward question by saying “I am just a large language model and cannot help you with that”. The conversation was totally benign but it flat out shit the bed so I canceled my subscription right then and there.
ldenoue 83 days ago [-]
Did you see this using the API or the online Gemini product?
replwoacause 83 days ago [-]
The online product, I haven’t tried the API.
https://news.berkeley.edu/2017/02/24/faq-on-legacy-public-co...
Discussed at the time (2017) https://news.ycombinator.com/item?id=13768856
[0] https://en.wikipedia.org/wiki/Soundex
[1] https://news.berkeley.edu/wp-content/uploads/2016/09/2016-08...
IME youtube transcripts are completely devoid of meaningful information, especially when domain-specific vocabulary is used.
It would be great if they were annotated and served in a more user-friendly fashion.
As a bonus link, one of my favorite courses from the time: https://archive.org/details/ucberkeley_webcast_itunesu_35482...
https://research.google/blog/looking-to-listen-audio-visual-...
https://github.com/google-gemini/generative-ai-js/issues/269...