Reading the "Hello Edge: Keyword Spotting on Microcontrollers" Paper and Source Code, with Experiment Conclusions

A colleague at Qualcomm wanted to run some voice wake-up experiments, and I was interested in the topic myself, so I read the relevant paper and ran an experiment. This post mainly walks through the source code.

Paper Conclusions

  • LSTM uses the least memory
  • Overall: DS-CNN > CRNN > GRU > LSTM > Basic_LSTM > CNN > DNN

Reading the KWS_for_FIXLEN Code

Parameters

Audio-related

  • words_list:_silence_,_unknown_,yes,no,up,down,left,right,on,off,stop,go
    • label_count(How many classes are to be recognized):12
    • wanted_words: yes,no,up,down,left,right,on,off,stop,go
  • sample_rate (must match the sample rate of the wav files provided): 16000
  • clip_duration_ms (duration of each recording): 1000
    • desired_samples (sample_rate * clip_duration_ms / 1000, the number of audio samples needed): 16000
  • window_size_ms (frame length): 40.0
    • window_size_samples (sample_rate * window_size_ms / 1000): 640
  • window_stride_ms (frame shift): 40.0
    • window_stride_samples (sample_rate * window_stride_ms / 1000): 640
    • spectrogram_length (number of frames in the clip: 0 if desired_samples - window_size_samples < 0, otherwise 1 + int((desired_samples - window_size_samples) / window_stride_samples)): 25
  • dct_coefficient_count (number of MFCC coefficients per frame): 10
    • fingerprint_size (dct_coefficient_count * spectrogram_length; see the sketch after this list): 250
  • background_volume (volume of the background noise, default 0.1. This is a data-augmentation technique: adding noise to the speech improves the model's ability to generalize)
  • background_frequency (fraction of the training data that gets noise added): 0.8
  • silence_percentage(How much of the training data should be silence):10.0
  • unknown_percentage(How much of the training data should be unknown words):10.0
  • validation_percentage(What percentage of wavs to use as a validation set):10
  • testing_percentage(What percentage of wavs to use as a test set):10
  • time_shift_ms (every recording is a 1-second file, but in real use the moment the user starts speaking is not fixed; to simulate this, each recording is randomly shifted forward or backward in time, and this parameter sets the shift range. Default 100 (ms), i.e. the data is shifted randomly within [-100, 100] ms): 100.0
    • time_shift_samples (time_shift_ms * sample_rate / 1000): 1600
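
A minimal sketch, in plain Python, of how the derived values above follow from the flag settings (names and values taken from this list; the repo computes them in a prepare_model_settings-style helper):

```python
sample_rate = 16000
clip_duration_ms = 1000
window_size_ms = 40.0
window_stride_ms = 40.0
dct_coefficient_count = 10
time_shift_ms = 100.0

desired_samples = int(sample_rate * clip_duration_ms / 1000)        # 16000
window_size_samples = int(sample_rate * window_size_ms / 1000)      # 640
window_stride_samples = int(sample_rate * window_stride_ms / 1000)  # 640
length_minus_window = desired_samples - window_size_samples
spectrogram_length = 0 if length_minus_window < 0 else 1 + length_minus_window // window_stride_samples  # 25
fingerprint_size = dct_coefficient_count * spectrogram_length       # 250
time_shift_samples = int(time_shift_ms * sample_rate / 1000)        # 1600

print(desired_samples, spectrogram_length, fingerprint_size, time_shift_samples)
```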

Model-related

  • Network parameters
    • model_architecture:single_fc, conv, low_latency_conv, low_latency_svdf, dnn, cnn, basic_lstm, lstm, gru, crnn, ds_cnn
    • model_size_info:[144, 144, 144]
  • Training parameters
    • learning_rate: [0.0005, 0.0001, 0.00002]
    • how_many_training_steps: [10000,10000,10000]
    • summaries_dir
    • train_dir
    • eval_step_interval: 400
    • save_step_interval:100

Data Preparation

  • python train.py --mode prepare --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir data/speech_commands

    • If the raw data needs to be downloaded first: python train.py --mode prepare --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_url http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz --data_dir data/speech_commands
    • prepare_data_index: hashes each audio filename into one of the splits (so that, as far as possible, the same speaker stays in a single split), adds silence to each split according to the configured ratio, adds some unknown wavs to each split according to the configured ratio, and finally shuffles within each split (wavs under the background folder are ignored for now).

      • This involves a small trick for splitting the train/validation/test sets. Training data usually keeps growing; if we split it randomly by ratio, then after adding a new file and re-splitting, data that used to be in the training set could end up in the test set. Since the model may be trained incrementally, that would effectively mean the test data had been trained on. So we need a "stable" split: data that was once in the training set stays in the training set. The trick used here is to hash the filename and take the hash modulo a total to decide which split the file goes into. This guarantees that a file assigned to the training set the first time will always be assigned to the training set. It only approximately preserves the split ratios rather than guaranteeing them exactly. (A sketch of this hashing trick follows this block.)

      • Another point: a certain proportion (silence_percentage / unknown_percentage) of silence and unknown words is added to every split

      • input:silence_percentage, unknown_percentage, validation_percentage, testing_percentage, wanted_words

      • output: data_index: {'validation': [{'label': word, 'file': wav_path}], 'testing': [...], 'training': [...]}

        • words_list:[_silence_, _unknown_, yes, no, up, down, left, right, on, off, stop, go]

        • word_to_index (words that don't need to be recognized are all mapped to 1): {'marvin': 1, 'tree': 1, 'learn': 1, 'dog': 1, 'sheila': 1, 'bird': 1, 'right': 7, 'off': 9, 'backward': 1, 'six': 1, 'two': 1, 'no': 3, 'yes': 2, 'one': 1, 'follow': 1, 'up': 4, 'three': 1, 'forward': 1, 'happy': 1, 'nine': 1, 'bed': 1, 'zero': 1, 'house': 1, 'visual': 1, 'five': 1, 'seven': 1, 'cat': 1, 'left': 6, 'stop': 10, 'go': 11, 'four': 1, 'on': 8, 'wow': 1, 'down': 5, 'eight': 1}
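
To make the "stable split" concrete, here is a sketch of the hash-based assignment, modeled on the which_set helper in TensorFlow's speech_commands example (which prepare_data_index follows); the constant and the '_nohash_' filename convention come from that example:

```python
import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # keeps the modulo arithmetic well-behaved

def which_set(filename, validation_percentage, testing_percentage):
    """Assign a wav file to 'training', 'validation' or 'testing' by hashing its name."""
    base_name = os.path.basename(filename)
    # Drop everything after '_nohash_' so all clips from one speaker hash the same
    # way and therefore always land in the same split.
    hash_name = re.sub(r'_nohash_.*$', '', base_name)
    hash_hex = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = (int(hash_hex, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) * (100.0 / MAX_NUM_WAVS_PER_CLASS)
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < (testing_percentage + validation_percentage):
        return 'testing'
    return 'training'

print(which_set('data/speech_commands/yes/377e916b_nohash_0.wav', 10, 10))
```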

    • prepare_background_data:
      • Handles the audio under the background_noise folder separately
      • input: BACKGROUND_NOISE_DIR_NAME
      • output:
        • background_data: a (16000, 1) float tensor
        • Key function: wav_decoder = contrib_audio.decode_wav(wav_loader, desired_channels=1): decodes a 16-bit PCM WAVE file into a float tensor
    • prepare_processing_graph: builds the TensorFlow graph for audio decoding up front; it is executed later when session.run is called
      • Load a WAVE file —> decode (a (16000, 1) float tensor) —> scale the volume (element-wise multiply, still a (16000, 1) float tensor) —> time-shift the speech —> add background noise (a (desired_samples,) float tensor in [-1.0, 1.0]) —> compute the spectrogram ((1, 25, 513), where the first dimension is the channel count, 1 for mono) —> build the MFCC fingerprint ((1, 25, 10))
      • Time shift: shifting right means padding zeros on the left; shifting left means padding on the right. The amount of padding is generated dynamically when producing each batch, so it is also defined as a placeholder. Since the speech tensor has shape (16000, 1), the padding is a [2, 2] tensor, although normally only the first (time) dimension is padded. For example, to shift right by 100 samples, the tensor passed in is [[100, 0], [0, 0]]. For a left shift, besides padding we also have to slice off part of the left side, so a time_shift_offset_placeholder_ is passed in as well; for a right shift its value is zero. For example, to shift left by 100 samples, time_shift_padding_placeholder_ should be [[0, 100], [0, 0]] and time_shift_offset_placeholder_ should be [100].
      • Mixing in noise: the placeholder background_data_placeholder_ holds the noise, and background_volume_placeholder_ is the mixing volume (a ratio); if background_volume_placeholder_ is zero, no noise is added. Multiplying them gives background_mul, which is added to sliced_foreground to get background_add. Since the sum can exceed a volume of 1, values above 1 must be capped, which is done with clip_by_value to keep the signal in the range [-1, 1]. (A numpy sketch of this shift-and-mix pipeline follows this list.)
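
A numpy sketch of the shift-and-mix arithmetic described above (the real prepare_processing_graph builds these steps as TensorFlow ops fed through placeholders; the function name and signature here are illustrative):

```python
import numpy as np

def shift_and_mix(foreground, background, time_shift, background_volume, desired_samples=16000):
    """foreground/background: float arrays in [-1, 1] with at least desired_samples samples.
    time_shift > 0 shifts right (pad zeros on the left), < 0 shifts left (pad on the right)."""
    if time_shift >= 0:
        padded = np.concatenate([np.zeros(time_shift), foreground])
        sliced = padded[:desired_samples]
    else:
        padded = np.concatenate([foreground, np.zeros(-time_shift)])
        sliced = padded[-time_shift:-time_shift + desired_samples]
    mixed = sliced + background_volume * background[:desired_samples]
    # Mixing can push the signal outside the valid range, so clamp to [-1, 1]
    # (clip_by_value in the TensorFlow graph).
    return np.clip(mixed, -1.0, 1.0)

rng = np.random.default_rng(0)
fg = rng.uniform(-0.5, 0.5, 16000)
bg = rng.uniform(-0.5, 0.5, 16000)
print(shift_and_mix(fg, bg, time_shift=100, background_volume=0.1).shape)  # (16000,)
```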

Model Definition

  • Inputs and outputs
    • input: fingerprint_input(batch_size, fingerprint_size), model_settings, model_size_info, is_training
    • output: logits, dropout_prob (if training; this is a placeholder)
  • dnn: create_dnn_model
    • Parameters:
      • W1:(fingerprint_size, model_size_info[0])
      • b1: (model_size_info[0])
      • W2: (model_size_info[0], model_size_info[1])
      • b2: (model_size_info[1])
      • W3:(model_size_info[1], model_size_info[2])
      • b3: (model_size_info[2])
      • weights: (model_size_info[2], label_count)
      • bias: (label_count)
      • Total parameters: 250*144 + 144 + 144*144 + 144 + 144*144 + 144 + 144*12 + 12 = 79644
      • Number of operations:
    • Forward pass (fingerprint_input is written as flow below; * denotes matrix multiplication; see the Keras sketch after this item):
      • flow —>(batch_size, fingerprint_size)
      • flow1 = dropout(relu(flow * W1 + b1 )) —>(batch_size, model_size_info[0])
      • flow2 = dropout(relu(flow1 * W2 + b2 )) —> (batch_size, model_size_info[1])
      • flow3 = dropout(relu(flow2 * W3 + b3 )) —> (batch_size, model_size_info[2])
      • logits = flow3 * weights + bias —> (batch_size, label_count)
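
A minimal Keras sketch equivalent to the DNN above, just to reproduce the shapes and the 79,644-parameter count (the repo builds the graph with raw TF variables; the dropout rate here is illustrative and adds no parameters):

```python
import tensorflow as tf

fingerprint_size, label_count = 250, 12
model_size_info = [144, 144, 144]

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(model_size_info[0], activation='relu', input_shape=(fingerprint_size,)))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(model_size_info[1], activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(model_size_info[2], activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(label_count))  # logits; softmax is applied by the loss

print(model.count_params())  # 79644
```
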
  • conv: create_conv_model, see [2]
    • Parameters:
      • first_weights: (first_filter_height=20, first_filter_width=8, 1, first_filter_count=64) note: the convolution kernel
      • first_bias:(first_filter_count,)
      • second_weights: (second_filter_height=10, second_filter_width=4, first_filter_count=64, second_filter_count=64)
      • second_bias: (second_filter_count,)
      • final_fc_weights:(13*5*64=4160, label_count)
      • final_fc_bias:(label_count,)
    • Forward pass:
      • flow —>(batch_size, fingerprint_size)
      • fingerprint_4d = reshape(flow) —> (batch_size, spectrogram_length=25, dct_coefficient_count=10, 1) note: like an image's [batch_size, in_height, in_width, in_channels] layout
      • first_conv = dropout(relu(conv2d(fingerprint_4d, first_weights, [1, 1, 1, 1], 'SAME') + first_bias)) —> (batch_size, spectrogram_length=25, dct_coefficient_count=10, first_filter_count=64)
      • max_pool = max_pool(first_conv, [1, 2, 2, 1], [1, 2, 2, 1], 'SAME') —> (batch_size, ceil(25/2)=13, ceil(10/2)=5, first_filter_count=64)
      • second_conv = dropout(relu(conv2d(max_pool, second_weights, [1,1,1,1], 'SAME') + second_bias)) —> (batch_size, 13, 5, 64)
      • flattened_second_conv = reshape(second_conv) —> (batch_size, 13*5*64=4160)
      • final_fc = flattened_second_conv * final_fc_weights + final_fc_bias —> (batch_size, label_count)
      • Total parameters: 224140 = 20*8*64 + 64 + 10*4*64*64 + 64 + 4160*12 + 12 (see the Keras check after this item)
      • Total operations
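
A rough Keras equivalent of the conv model, just to sanity-check the shapes and the 224,140 figure (the original uses tf.nn.conv2d directly; dropout is omitted here since it adds no parameters):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (20, 8), padding='same', activation='relu',
                           input_shape=(25, 10, 1)),                          # (25, 10, 64)
    tf.keras.layers.MaxPool2D(pool_size=(2, 2), strides=(2, 2),
                              padding='same'),                                # (13, 5, 64)
    tf.keras.layers.Conv2D(64, (10, 4), padding='same', activation='relu'),   # (13, 5, 64)
    tf.keras.layers.Flatten(),                                                # 13*5*64 = 4160
    tf.keras.layers.Dense(12),                                                # logits
])
print(model.count_params())  # 10304 + 163904 + 49932 = 224140
```
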
  • low_latency_conv: create_low_latency_conv_model, see [2]; fewer parameters than conv, at some cost in accuracy
    • Parameters
      • first_weights:(spectrogram_length=25, first_filter_width=8, 1, first_filter_count=186)
      • first_bias:(first_filter_count,)
      • first_fc_weights:(first_conv_element_count=558, first_fc_output_channels = 128)
      • first_fc_bias:(first_fc_output_channels, )
      • second_fc_weights:(first_fc_output_channels=128, second_fc_output_channels=128)
      • second_fc_bias:(second_fc_output_channels=128, )
      • final_fc_weights:(second_fc_output_channels=128, label_count=12)
      • final_fc_bias:(label_count,)
      • Total parameters: 126998 (see the arithmetic check after this item)
      • Total operations
    • Forward pass
      • flow —>(batch_size, fingerprint_size)
      • fingerprint_4d = reshape(flow) —> (batch_size, spectrogram_length=25, dct_coefficient_count=10, 1)
      • first_conv = dropout(relu(conv2d(fingerprint_4d, first_weights, [1, 1, 1, 1], 'VALID') + first_bias)) —> (batch_size, 1, 3, 186)
      • flattened_first_conv = reshape(first_conv) —> (batch_size, 3 * 186=558)
      • first_fc = dropout(flattened_first_conv * first_fc_weights + first_fc_bias) —> (batch_size, first_fc_output_channels = 128)
      • second_fc = dropout(first_fc * second_fc_weights + second_fc_bias) —> (batch_size, second_fc_output_channels = 128)
      • final_fc = second_fc * final_fc_weights + final_fc_bias —> (batch_size, label_count)
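
A quick arithmetic check of the 126,998 total, in plain Python, following the shapes listed above:

```python
# low_latency_conv parameter count
first_conv = 25 * 8 * 1 * 186 + 186     # VALID conv spanning all 25 frames -> output (1, 3, 186)
first_fc   = (1 * 3 * 186) * 128 + 128  # flattened 558 inputs -> 128
second_fc  = 128 * 128 + 128
final_fc   = 128 * 12 + 12
print(first_conv + first_fc + second_fc + final_fc)  # 126998
```
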
  • basic_lstm: same as lstm except that use_peepholes and num_proj are not used; the parameter count is 43916 = 10*98*4 + 98*98*4 + 98*4 + 98*12 + 12
  • lstm: model_size_info[0] is the projection size, model_size_info[1] is the number of memory cells in the LSTM
    • Parameters (? at first the count did not seem to add up once peepholes were added; see the sketch after this item)
      • lstm_cell: (the hidden state goes through a fully-connected projection, plus peephole connections)
        • W_f、W_i、W_o、W_c:(dct_coefficient_count, model_size_info[1])
        • U_f、U_i、U_o、U_c: (model_size_info [0], model_size_info[1])
        • bias_f、bias_i、bias_o、bias_c: (model_size_info[1])
        • W_h: (model_size_info[1], model_size_info[0])
        • bias_h: (model_size_info[0])
      • W_o:(model_size_info[0]=98, label_count=12)
      • b_o: (label_count=12,)
      • Total parameters: 78516 = 5760 + 56448 + 576 + 432 (the three peephole vectors, 3*144) + 14112 (projection, 144*98) + 1176 + 12 (the earlier sum of 78182 was missing the peephole vectors and included a spurious 98)
      • Total operations
    • Forward pass (* is matrix multiplication, ∘ is element-wise multiplication)
      • flow —>(batch_size, fingerprint_size)
      • fingerprint_4d = reshape(flow) —> (batch_size, spectrogram_length=25, dct_coefficient_count=10)
        • Input at one time step
          • hidden state h_t-1: (batch_size, model_size_info[0])
          • x_t:(batch_size, dct_coefficient_count=10)
          • c_t-1: (batch_size, model_size_info[1])
        • f_t = sigmoid(x_t W_f + h_t-1 U_f + bias_f) —>(batch_size,model_size_info[1])
        • i_t = sigmoid(x_t W_i + h_t-1 U_i + bias_i) —> (batch_size,model_size_info[1])
        • o_t = sigmoid(x_t W_o + h_t-1 U_o + bias_o ) —> (batch_size,model_size_info[1])
        • c’ = tanh(x_t W_c + h_t-1 U_c + bias_c) —> (batch_size,model_size_info[1])
        • c_t = f_t ∘ c_t-1 + i_t ∘ c’ —> (batch_size, model_size_info[1])
        • h_t = tanh(o_t ∘ c_t) —> (batch_size, model_size_info[1])
        • h_t = h_t * W_h + bias_h —> (batch_size, model_size_info[0])
      • flow = the last output of dynamic_rnn(lstm_cell, fingerprint_4d) —> (batch_size, model_size_info[0]=98)
      • logits = flow * W_o + b_o —> (batch_size, label_count)
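
A plain-Python count of the LSTM variables that resolves the question above: a TF LSTMCell with use_peepholes=True and num_proj has, besides the gate kernel and bias, three peephole vectors of length num_units and a num_units x num_proj projection matrix (a sketch under the sizes used in this post):

```python
dct, label_count = 10, 12

# basic_lstm: num_units = 98, no peepholes, no projection
units = 98
basic_lstm = 4 * (dct + units) * units + 4 * units + units * label_count + label_count
print(basic_lstm)  # 43916

# lstm: num_proj = 98 (model_size_info[0]), num_units = 144 (model_size_info[1]), peepholes on
proj, units = 98, 144
kernel = 4 * (dct + proj) * units   # gate weights acting on [x_t, h_{t-1}]
bias = 4 * units
peepholes = 3 * units               # w_i_diag, w_f_diag, w_o_diag
projection = units * proj           # W_h, projecting the cell output back down to 98
output_layer = proj * label_count + label_count
print(kernel + bias + peepholes + projection + output_layer)  # 78516
```
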
  • gru:

    • Parameters: model_size_info[0] is num_layers, model_size_info[1] is gru_units
      • With layer_normalize
      • Without layer_normalize
        • W_rx、W_zx、W_hx: (dct_coefficient_count=10, gru_units=154)
        • W_rh、W_zh、W_hh: (gru_units, gru_units)
        • bias_r、bias_z、bias_h: (gru_units)
        • W_o:(gru_units, label_count)
        • b_o:(label_count)
        • Total parameters: 10*154*3 + 154*154*3 + 154*3 + 154*12 + 12 = 78090 (? the actual count appears to be 91950, which this sum does not explain)
    • Forward pass (a numpy sketch of one GRU step follows this item)

      • flow —>(batch_size, fingerprint_size)
      • fingerprint_4d = reshape(flow) —> (batch_size, spectrogram_length=25, dct_coefficient_count=10)

        • Input at one time step
          • hidden state h_t-1: (batch_size, gru_units)
          • x_t: (batch_size, dct_coefficient_count=10)
        • With layer_normalize

          • Unidirectional
          • Bidirectional: concatenation of the forward and backward logits
        • Without layer_normalize

          • Unidirectional
            • r_t = sigmoid(x_t W_rx + h_t-1 W_rh + bias_r) —> (batch_size, gru_units)
            • z_t = sigmoid(x_t W_zx + h_t-1 W_zh + bias_z) —> (batch_size, gru_units)
            • h’ = tanh(x_t W_hx + (r_t ∘ h_t-1) W_hh + bias_h) —> (batch_size, gru_units)
            • h_t = (1 - z_t) ∘ h_t-1 + z_t ∘ h’ —> (batch_size, gru_units)
            • flow = the last output of dynamic_rnn(gru_cell, fingerprint_4d) —> (batch_size, gru_units)
            • logits = flow * W_o + b_o —> (batch_size, label_count)
          • Bidirectional
            • Concatenation of the forward and backward logits
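
A numpy sketch of one GRU time step exactly as written above (unidirectional, without layer_normalize; weight names and shapes follow the parameter list, the initialization is arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; x_t: (batch, 10), h_prev: (batch, gru_units)."""
    W_rx, W_zx, W_hx, W_rh, W_zh, W_hh, b_r, b_z, b_h = params
    r_t = sigmoid(x_t @ W_rx + h_prev @ W_rh + b_r)
    z_t = sigmoid(x_t @ W_zx + h_prev @ W_zh + b_z)
    h_cand = np.tanh(x_t @ W_hx + (r_t * h_prev) @ W_hh + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_cand  # (batch, gru_units)

gru_units, dct, batch = 154, 10, 32
rng = np.random.default_rng(0)
shapes = [(dct, gru_units)] * 3 + [(gru_units, gru_units)] * 3 + [(gru_units,)] * 3
params = [rng.standard_normal(s) * 0.1 for s in shapes]
h = np.zeros((batch, gru_units))
for _ in range(25):                 # 25 frames per clip
    h = gru_step(rng.standard_normal((batch, dct)), h, params)
print(h.shape)                      # (32, 154); logits = h @ W_o + b_o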
  • crnn
  • ds_cnn

Model Training

  • Define the loss: softmax cross entropy
    • Define the optimizer: Adam (see the training sketch after the command list below)
    • The key part is audio_processor.get_data, the procedure that randomly generates one batch of training data
      • input:
        • how_many: batch size; -1 returns all of the data
        • offset: when data is generated non-randomly, this specifies the starting offset
        • background_frequency: a value between 0.0 and 1.0, the fraction of samples that get noise mixed in
        • background_volume_range: the volume of the background noise
        • time_shift: the shift range, [-time_shift, time_shift]
        • sess: used to run the data-producing operations built earlier; see the prepare_processing_graph function
      • output:
        • data:
        • labels
        • See the comments in the code for the details of the process
    • Experimental results:
      • Final test accuracy = 84.68% (N=3081): python train.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/DNN/DNN1/retrain_logs --train_dir work/DNN/DNN1/training
      • python train.py --model_architecture conv --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/CNN/CNN1/retrain_logs --train_dir work/CNN/CNN1/training
      • python train.py --model_architecture low_latency_conv --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/CNN/CNN2/retrain_logs --train_dir work/CNN/CNN2/training
      • python train.py --model_architecture basic_lstm --model_size_info 98 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/LSTM/LSTM1/retrain_logs --train_dir work/LSTM/LSTM1/training
      • python train.py --model_architecture lstm --model_size_info 98 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/LSTM/LSTM2/retrain_logs --train_dir work/LSTM/LSTM2/training
      • python train.py --model_architecture gru --model_size_info 1 154 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/GRU/retrain_logs --train_dir work/GRU/training
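
For reference, a sketch of how the loss, the Adam optimizer, and the paired --learning_rate / --how_many_training_steps lists are typically wired together in this style of train.py (TF1-style code; the stand-in dense layer and the variable names are illustrative, not the repo's actual graph):

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

fingerprint_input = tf.placeholder(tf.float32, [None, 250], name='fingerprint_input')
logits = tf.layers.dense(fingerprint_input, 12)  # stand-in for create_model(...)
ground_truth = tf.placeholder(tf.int64, [None], name='groundtruth_input')

loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=ground_truth, logits=logits))
learning_rate_input = tf.placeholder(tf.float32, [], name='learning_rate_input')
train_op = tf.train.AdamOptimizer(learning_rate_input).minimize(loss)

# --how_many_training_steps 10000,10000,10000 and --learning_rate 0.0005,0.0001,0.00002
training_steps_list = [10000, 10000, 10000]
learning_rates_list = [0.0005, 0.0001, 0.00002]

def lr_for_step(step):
    """Pick the learning rate of the segment this global training step falls into."""
    total = 0
    for steps, lr in zip(training_steps_list, learning_rates_list):
        total += steps
        if step <= total:
            return lr
    return learning_rates_list[-1]

print(lr_for_step(1), lr_for_step(15000), lr_for_step(25000))  # 0.0005 0.0001 2e-05
```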

Model Testing

  • Using a checkpoint
    • python test.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --checkpoint work/DNN/DNN1/training/best/dnn_8503.ckpt-24000
    • Using the frozen graph: produces a graph that can be run on Android and iOS
      • python freeze.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir ./data/speech_commands --checkpoint work/DNN/DNN1/training/best/dnn_8503.ckpt-24000 --output_file work/DNN/DNN1/training/best/dnn.pb
        • Build the graph: create_inference_graph turns the wav-file-path placeholder, through a series of ops, into reshaped_input, calls create_model to get logits, and then applies softmax to get the output
        • load weights:models.load_variables_from_checkpoint
        • Replace variables with inline constants: graph_util.convert_variables_to_constants
        • Write out the target graph file: tf.train.write_graph
      • python test_pb.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir ./data/speech_commands --graph work/DNN/DNN1/training/best/dnn.pb
  • Predicting a single audio clip (using the .pb file; a sketch of what this does follows below)
  • python predict_pb.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir ./data/speech_commands --graph work/DNN/DNN1/training/best/dnn.pb --test_wave data/speech_commands/yes/377e916b_nohash_0.wav
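
A sketch of what single-clip prediction with the frozen graph boils down to (TF1-style; the node names 'wav_data:0' and 'labels_softmax:0' are the ones used by TensorFlow's speech_commands label_wav.py and are an assumption here; check freeze.py / predict_pb.py in this repo for the actual names):

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

graph_path = 'work/DNN/DNN1/training/best/dnn.pb'
wav_path = 'data/speech_commands/yes/377e916b_nohash_0.wav'
labels = ['_silence_', '_unknown_', 'yes', 'no', 'up', 'down',
          'left', 'right', 'on', 'off', 'stop', 'go']

with tf.gfile.GFile(graph_path, 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name='')

with tf.gfile.GFile(wav_path, 'rb') as f:
    wav_data = f.read()

with tf.Session() as sess:
    softmax = sess.graph.get_tensor_by_name('labels_softmax:0')  # assumed output node
    predictions, = sess.run(softmax, {'wav_data:0': wav_data})   # assumed input node
    for i in predictions.argsort()[-3:][::-1]:
        print(labels[i], predictions[i])
```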

Some Thoughts

  • Could KWS be made two-stage: a small model does a coarse check first (producing a probability), and then a slightly larger model inside the speech system does a finer check (using that probability plus acoustic features)?

  • The security of KWS: can anyone wake the device? Could it adapt itself based on a few utterances from the speaker?

  • When the device is online, KWS could upload false-alarm, missed-detection, and correctly-triggered audio to a server for labeling and training, but can the model then be dynamically updated on the hardware? If not, this is a one-shot deal; perhaps such data could first be collected in an app like a children's dictionary.

  • The network's output: should the start and end positions (of the keyword) be marked?

References