Are there currently usable FOSS tools for speech-to-text conversion (transcription) available under GNU/Linux? The purpose is transcribing things like downloaded podcasts. I don’t need or want any kind of GUI tool, just a CLI program that takes an audio file and converts it to text. I know there are various proprietary systems that do this, such as YouTube transcription. One of my questions is whether the free stuff that’s out there is anywhere near as good. I’m not too concerned about the input format (I can convert with ffmpeg), or about CPU time within reason (I don’t mind letting my server spend all night crunching a 1-hour audio file). I’d prefer not to require a GPU, but if that helps a lot, I can get hold of one as needed.
The question is about speech-to-text (STT). I’m not asking about the opposite, text-to-speech (TTS). For some reason people often confuse the two.
Thanks!


If you have an integrated GPU in your processor (I’m assuming you do unless you have no graphical output at all), you can also try running one of their really small models on it. Failing that, their smaller models are also faster, so I’d recommend starting with those on your CPU.
Thanks, I’m more interested in quality than speed, but will try the smaller models to see if the results are usable. My CPU has Intel HD Graphics 4000 (this is in an i7-3770, which is way old by now) and I kind of doubt PyTorch supports that, though it’s possible. I can imagine upgrading to a newer processor but probably not a real GPU. Actually it looks like everything is way more expensive than it used to be, no big surprise.
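
For the "convert with ffmpeg" step mentioned in the question, here is a minimal sketch of a pre-processing helper. It assumes the transcription tool wants 16 kHz mono 16-bit PCM WAV, which is a common input format for speech models but is not something this thread confirms for any particular tool, so adjust the rate and channels as needed. The function names `build_ffmpeg_cmd` and `convert` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst):
    """Return an ffmpeg argument list that writes dst as 16 kHz mono PCM WAV."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", str(src),        # input: any format ffmpeg can decode
        "-vn",                 # drop video / cover-art streams
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz (assumed target rate)
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        str(dst),
    ]


def convert(src, dst=None):
    """Run ffmpeg on src and return the path of the converted file."""
    src = Path(src)
    # Default output name sits next to the input, e.g. episode_16k.wav
    dst = Path(dst) if dst else src.with_name(src.stem + "_16k.wav")
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True)
    return dst
```

Calling `convert("episode.mp3")` would leave `episode_16k.wav` beside the original, ready to hand to whichever CLI transcriber ends up being used. Building the command as a list rather than a shell string avoids quoting problems with podcast filenames that contain spaces.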