Current SOTA in local FOSS speech to text?

solrize@lemmy.ml · 3 days ago

Current SOTA in local FOSS speech to text?

TehPers@beehaw.org · 3 days ago

The README lists the VRAM requirements for their different models if you plan to run with a GPU. Without a GPU, you can translate those roughly to system RAM.

Note that ML models pretty much always run faster on a GPU, because the work is mostly large matrix operations that GPUs parallelize well. If you have the option to run on a GPU, you probably should just do that. Even their largest model only requires ~10GB VRAM based on their table, and if you only need English, you can use a smaller one specialized for English (like medium.en).
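
As a rough illustration of matching a model size to available memory, here is a minimal sketch. It assumes the README being discussed is the openai/whisper one; the memory figures are approximate values recalled from that table and should be checked against the current README. `pick_model` is a hypothetical helper, not part of any library.

```python
# Approximate memory needed per model size, in GB.
# ASSUMPTION: rough figures in the style of the openai/whisper README table;
# verify them against the README before relying on them.
APPROX_MEM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]

# Sizes assumed to have an English-only ".en" variant.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model(available_gb, english_only=False):
    """Return the largest model name whose approximate need fits in available_gb."""
    best = None
    for name, need_gb in APPROX_MEM_GB:  # ordered smallest to largest
        if need_gb <= available_gb:
            best = name
    if best is None:
        raise ValueError("not enough memory for even the smallest model")
    if english_only and best in ENGLISH_ONLY_SIZES:
        best += ".en"
    return best


if __name__ == "__main__":
    print(pick_model(6, english_only=True))
    print(pick_model(16))
```

The same numbers work as a rough guide for system RAM when running on CPU, with the caveat that the OS and other programs need memory too.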

solrize@lemmy.ml · 3 days ago

Thanks, I don’t have a GPU so if the program required one, I’d rent one by the hour (vast.ai has lots of affordable rentals). But it’s easier if I can run CPU-only. If I were doing this a lot (I don’t expect to), I could see getting a Ryzen APU server, if the GPU in those is supported.

TehPers@beehaw.org · 3 days ago

If you have an integrated GPU in your processor (I’m assuming you do unless you have no graphical output at all), you can also try to run one of their really small models on it. Otherwise, their smaller models are also faster, so I’d recommend trying those on your CPU to start with.

solrize@lemmy.ml · 3 days ago

Thanks, I’m more interested in quality than speed, but will try the smaller models to see if the results are usable. My CPU has Intel HD Graphics 4000 (this is in an i7-3770, which is way old by now) and I kind of doubt PyTorch supports that, though it’s possible. I can imagine upgrading to a newer processor but probably not a real GPU. Actually it looks like everything is way more expensive than it used to be, no big surprise.
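
One quick way to settle the "does PyTorch see my GPU" question is to ask it directly and fall back to CPU otherwise. The sketch below assumes PyTorch and the `openai-whisper` package are installed and that the tool under discussion is Whisper; the file name is a placeholder. Note that `torch.cuda.is_available()` only reports CUDA (or ROCm) devices, so an old Intel integrated GPU would most likely fall through to the CPU path.

```python
def pick_device():
    """Return "cuda" if PyTorch reports a usable CUDA/ROCm device, else "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed; nothing to accelerate with.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def transcribe(path, model_name="small"):
    """Transcribe an audio file and return the text.

    ASSUMPTION: uses the openai-whisper package (`pip install openai-whisper`).
    """
    import whisper  # imported lazily so pick_device() works without it

    device = pick_device()
    model = whisper.load_model(model_name, device=device)
    # Half precision only makes sense on a GPU; on CPU use fp32.
    result = model.transcribe(path, fp16=(device == "cuda"))
    return result["text"]


if __name__ == "__main__":
    print("device:", pick_device())
    # print(transcribe("recording.mp3"))  # placeholder file name
```

Starting with `small` on CPU and moving up a size if the transcript quality isn't good enough keeps the experiment cheap.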