AI-Based Audio Classification with TensorFlow
YAMNet is a pretrained deep neural network that predicts 521 audio event classes from the AudioSet-YouTube corpus, using the MobileNet v1 depthwise-separable convolution architecture.
In practical terms, audio event classification is no longer limited to a handful of sound types recorded with little or no background noise. We can now analyze, in real time, local or remote audio streams that contain multiple simultaneous sounds such as music, speech, and safety hazards.
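As a minimal sketch of what this looks like in practice, the snippet below loads the TensorFlow Hub release of YAMNet and classifies a single WAV file. The filename clip.wav is a placeholder, and the file is assumed to already be mono, 16 kHz, 16-bit PCM, which is the input format YAMNet expects:

```python
import csv

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from scipy.io import wavfile

# Load the pretrained YAMNet model from TensorFlow Hub.
model = hub.load('https://tfhub.dev/google/yamnet/1')

# Read the 521 class names that ship with the model.
class_map_path = model.class_map_path().numpy()
with tf.io.gfile.GFile(class_map_path) as csvfile:
    class_names = [row['display_name'] for row in csv.DictReader(csvfile)]

# YAMNet expects a mono, 16 kHz, float32 waveform in [-1.0, 1.0].
# 'clip.wav' is a placeholder; we assume 16-bit PCM samples here.
sample_rate, wav_data = wavfile.read('clip.wav')
waveform = wav_data.astype(np.float32) / np.iinfo(wav_data.dtype).max

# scores has shape (num_frames, 521); averaging over frames gives a
# clip-level score for each of the 521 classes.
scores, embeddings, spectrogram = model(waveform)
mean_scores = scores.numpy().mean(axis=0)
for i in np.argsort(mean_scores)[::-1][:5]:
    print(f'{class_names[i]}: {mean_scores[i]:.3f}')
```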
The seven 10-second video clips below were generated from AudioSet, a dataset released by the Sound and Video Understanding teams pursuing Machine Perception research at Google. A PCM WAV file was then extracted from each video clip and classified with YAMNet running on a Raspberry Pi 3 Model B+.
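The exact extraction tooling is an assumption on our part, but one common way to produce such a WAV file is to have ffmpeg downmix and resample the clip's audio track; the filenames below are hypothetical:

```python
import subprocess

# Hypothetical filenames; ffmpeg must be installed on the system.
# -ac 1 downmixes to mono, -ar 16000 resamples to the 16 kHz rate
# YAMNet expects, and pcm_s16le writes a 16-bit PCM WAV file.
subprocess.run([
    'ffmpeg', '-i', 'clip.mp4',
    '-ac', '1', '-ar', '16000',
    '-acodec', 'pcm_s16le',
    'clip.wav',
], check=True)
```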
After loading the software modules and the YAMNet model, the time required to classify each 10-second audio segment from the video clips below was under 4 seconds, less than half the length of each clip.
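A simple way to reproduce this kind of measurement, reusing the model and waveform names from the sketch above, might look like the following. One warm-up call is excluded from the timing, mirroring the fact that the figure above was measured after loading the modules and model:

```python
import time

# Warm-up call: the first invocation of a TF Hub model includes graph
# tracing, so it is run once before the measurement starts.
model(waveform)

start = time.perf_counter()
scores, embeddings, spectrogram = model(waveform)
elapsed = time.perf_counter() - start
print(f'Classified {len(waveform) / 16000:.1f} s of audio in {elapsed:.2f} s')
```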
This means that a Raspberry Pi 3 Model B+ is capable of predicting 521 audio event classes in real time, with classification results available within seconds, depending on the desired length of the audio sample.
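For longer or continuous streams, one hypothetical approach is to classify fixed-length chunks as they arrive, again reusing the model, waveform, and class_names names from the sketches above. The chunk length is the knob referred to here: shorter chunks mean results arrive sooner, at the cost of less audio context per prediction:

```python
# Hypothetical chunked classification: slice a long waveform into
# fixed-length segments and classify each one in turn.
CHUNK_SECONDS = 10
chunk_len = CHUNK_SECONDS * 16000  # samples at YAMNet's 16 kHz input rate

for start_idx in range(0, len(waveform), chunk_len):
    chunk = waveform[start_idx:start_idx + chunk_len]
    if len(chunk) < 16000:
        continue  # skip a trailing fragment too short to frame
    scores, _, _ = model(chunk)
    top = class_names[scores.numpy().mean(axis=0).argmax()]
    print(f'{start_idx / 16000:6.1f} s: {top}')
```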
OTTStreamingVideo specializes in AI-based audio classification, speech recognition, and optical recognition for embedded Linux and server-based applications. Please contact us for more information.