Voice assistants are one of the hottest technologies right now. Siri, Alexa, and Google Assistant all aim to help you talk to computers instead of just touching and typing. Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU/NLP) are the key technologies enabling them. If you are just-a-programmer like me, you might be itching to get a piece of the action and hack something together. You are at the right place; read on.
These technologies are hard and the learning curve is steep, but they are becoming increasingly accessible. Last month, Mozilla released DeepSpeech 0.6 along with models for US English. It has smaller and faster models than ever before, and even has a TensorFlow Lite model that runs faster than real time on a single core of a Raspberry Pi 4. There are several interesting aspects, but right now I am going to focus on its refreshingly simple batch and stream APIs in C, .NET, Java, JavaScript, and Python for converting speech to text. By the end of this blog post, you will have built a voice transcriber. No kidding 🙂
Let’s get started
You need a computer with Python 3.6.5+ installed, a good internet connection, and elementary Python programming skills. Even if you do not know Python, read along; it is not so hard. If you don’t want to install anything, you can try out the DeepSpeech APIs in the browser using this code lab.
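If you are not sure which Python you have, a quick check along these lines can tell you. This is only a sketch: the helper name `meets_minimum` and the constant `MIN_PYTHON` are my own and are not part of DeepSpeech.

```python
import sys

# Minimum interpreter version stated in this post (an assumption baked
# into this sketch, not something DeepSpeech itself exposes).
MIN_PYTHON = (3, 6, 5)


def meets_minimum(version, minimum=MIN_PYTHON):
    """Return True if a (major, minor, micro) version is at least `minimum`."""
    # Tuples compare element by element, so (3, 10, 0) > (3, 6, 5).
    return tuple(version[:3]) >= tuple(minimum)


if __name__ == "__main__":
    label = "{}.{}.{}".format(*sys.version_info[:3])
    if meets_minimum(sys.version_info):
        print("Python " + label + " is new enough.")
    else:
        print("Python " + label + " is too old; please install 3.6.5 or later.")
```

Save it to a file and run it with the same `python3` you plan to use for the rest of the post, so you are checking the right interpreter.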
Let’s do the needed setup: