AI-Based Audio Classification with TensorFlow
YAMNet is a pretrained deep neural network that predicts 521 audio event classes from the AudioSet-YouTube corpus, using the MobileNet v1 depthwise-separable convolution architecture.
In practical terms, audio event classification is no longer limited to a handful of sound types recorded with little or no background noise. We can now analyze, in real time, local or remote audio streams that contain multiple simultaneous sounds such as music, speech, and safety hazards.
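As a minimal sketch of what this looks like in practice, the snippet below loads the TensorFlow Hub release of YAMNet and classifies a single WAV file. The filename clip.wav is a placeholder, and the file is assumed to already be mono, 16 kHz, 16-bit PCM, which is the input format YAMNet expects:

```python
import csv

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from scipy.io import wavfile

# Load the pretrained YAMNet model from TensorFlow Hub.
model = hub.load('https://tfhub.dev/google/yamnet/1')

# Read the 521 class names that ship with the model.
class_map_path = model.class_map_path().numpy()
with tf.io.gfile.GFile(class_map_path) as csvfile:
    class_names = [row['display_name'] for row in csv.DictReader(csvfile)]

# YAMNet expects a mono, 16 kHz, float32 waveform in [-1.0, 1.0].
# 'clip.wav' is a placeholder; we assume 16-bit PCM samples here.
sample_rate, wav_data = wavfile.read('clip.wav')
waveform = wav_data.astype(np.float32) / np.iinfo(wav_data.dtype).max

# scores has shape (num_frames, 521); averaging over frames gives a
# clip-level score for each of the 521 classes.
scores, embeddings, spectrogram = model(waveform)
mean_scores = scores.numpy().mean(axis=0)
for i in np.argsort(mean_scores)[::-1][:5]:
    print(f'{class_names[i]}: {mean_scores[i]:.3f}')
```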
The seven 10-second video clips below were generated from AudioSet, a dataset released by the Sound and Video Understanding teams pursuing Machine Perception research at Google. A PCM WAV file was then extracted from each video clip and classified with YAMNet running on a Raspberry Pi 3 Model B+.
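The exact extraction tooling is an assumption on our part, but one common way to produce such a WAV file is to have ffmpeg downmix and resample the clip's audio track; the filenames below are hypothetical:

```python
import subprocess

# Hypothetical filenames; ffmpeg must be installed on the system.
# -ac 1 downmixes to mono, -ar 16000 resamples to the 16 kHz rate
# YAMNet expects, and pcm_s16le writes a 16-bit PCM WAV file.
subprocess.run([
    'ffmpeg', '-i', 'clip.mp4',
    '-ac', '1', '-ar', '16000',
    '-acodec', 'pcm_s16le',
    'clip.wav',
], check=True)
```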
After loading the software modules and the YAMNet model, the time required to classify each 10-second audio segment from the video clips below was under 4 seconds, less than half the length of each clip.
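A simple way to reproduce this kind of measurement, reusing the model and waveform names from the sketch above, might look like the following. One warm-up call is excluded from the timing, mirroring the fact that the figure above was measured after loading the modules and model:

```python
import time

# Warm-up call: the first invocation of a TF Hub model includes graph
# tracing, so it is run once before the measurement starts.
model(waveform)

start = time.perf_counter()
scores, embeddings, spectrogram = model(waveform)
elapsed = time.perf_counter() - start
print(f'Classified {len(waveform) / 16000:.1f} s of audio in {elapsed:.2f} s')
```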
This means that a Raspberry Pi 3 Model B+ is capable of predicting 521 audio event classes in real time, with classification results available within seconds, depending on the desired length of the audio sample.
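For longer or continuous streams, one hypothetical approach is to classify fixed-length chunks as they arrive, again reusing the model, waveform, and class_names names from the sketches above. The chunk length is the knob referred to here: shorter chunks mean results arrive sooner, at the cost of less audio context per prediction:

```python
# Hypothetical chunked classification: slice a long waveform into
# fixed-length segments and classify each one in turn.
CHUNK_SECONDS = 10
chunk_len = CHUNK_SECONDS * 16000  # samples at YAMNet's 16 kHz input rate

for start_idx in range(0, len(waveform), chunk_len):
    chunk = waveform[start_idx:start_idx + chunk_len]
    if len(chunk) < 16000:
        continue  # skip a trailing fragment too short to frame
    scores, _, _ = model(chunk)
    top = class_names[scores.numpy().mean(axis=0).argmax()]
    print(f'{start_idx / 16000:6.1f} s: {top}')
```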
OTTStreamingVideo specializes in AI-based audio classification, speech recognition, and optical recognition for embedded Linux and server-based applications. Please contact us for more information.