- LSTM uses the least memory
- Overall ranking: DS-CNN > CRNN > GRU > LSTM > Basic_LSTM > CNN > DNN
- words_list:_silence_,_unknown_,yes,no,up,down,left,right,on,off,stop,go
- label_count(How many classes are to be recognized):12
- wanted_words: yes,no,up,down,left,right,on,off,stop,go
- sample_rate (must match the sample rate of the provided wav files): 16000
- clip_duration_ms (length of each recording): 1000
- desired_samples (sample_rate * clip_duration_ms / 1000, the number of samples per clip): 16000
- window_size_ms (frame length): 40.0
- window_size_samples (sample_rate * window_size_ms / 1000): 640
- window_stride_ms (frame shift): 40.0
- window_stride_samples (sample_rate * window_stride_ms / 1000): 640
- spectrogram_length (number of frames in a clip: 0 if desired_samples - window_size_samples < 0, otherwise 1 + int((desired_samples - window_size_samples) / window_stride_samples)): 25
- dct_coefficient_count (number of MFCC coefficients per frame): 10
- fingerprint_size (dct_coefficient_count * spectrogram_length): 250
- background_volume (volume of the background noise, default 0.1. This is a data augmentation technique: adding noise to the speech improves the model's ability to generalize)
- background_frequency (fraction of the training data that gets noise mixed in): 0.8
- silence_percentage(How much of the training data should be silence):10.0
- unknown_percentage(How much of the training data should be unknown words):10.0
- validation_percentage(What percentage of wavs to use as a validation set):10
- testing_percentage(What percentage of wavs to use as a test set):10
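- time_shift_ms (every recording is a 1-second file, but at prediction time the moment the user starts speaking is not fixed. To simulate this, each recording is randomly shifted forward or backward in time, and this parameter sets the shift range. The default of 100 ms means the data is shifted by a random amount in [-100, 100]): 100.0
- time_shift_samples (time_shift_ms * sample_rate / 1000): 1600
- Network parameters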
- model_architecture:single_fc, conv, low_latency_conv, low_latency_svdf, dnn, cnn, basic_lstm, lstm, gru, crnn, ds_cnn
- model_size_info:[144, 144, 144]
- Training parameters
- learning_rate: [0.0005, 0.0001, 0.00002]
- how_many_training_steps: [10000,10000,10000]
- summaries_dir
- train_dir
- eval_step_interval: 400
- save_step_interval:100
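The derived values in the parameter list above follow from a handful of base settings. A minimal sketch of that arithmetic, assuming the formulas quoted in the list (the function name `derive_settings` is hypothetical and not taken from the training scripts):

```python
def derive_settings(sample_rate=16000, clip_duration_ms=1000,
                    window_size_ms=40.0, window_stride_ms=40.0,
                    dct_coefficient_count=10, time_shift_ms=100.0):
    """Recompute the derived values from the listed base parameters."""
    desired_samples = int(sample_rate * clip_duration_ms / 1000)
    window_size_samples = int(sample_rate * window_size_ms / 1000)
    window_stride_samples = int(sample_rate * window_stride_ms / 1000)
    length_minus_window = desired_samples - window_size_samples
    if length_minus_window < 0:
        # The clip is shorter than one analysis window: no frames at all.
        spectrogram_length = 0
    else:
        spectrogram_length = 1 + int(length_minus_window / window_stride_samples)
    return {
        "desired_samples": desired_samples,
        "window_size_samples": window_size_samples,
        "window_stride_samples": window_stride_samples,
        "spectrogram_length": spectrogram_length,
        "fingerprint_size": dct_coefficient_count * spectrogram_length,
        "time_shift_samples": int(time_shift_ms * sample_rate / 1000),
    }


settings = derive_settings()
for name, value in settings.items():
    print(name, value)
```

With the listed values, a 100 ms shift at 16 kHz corresponds to 1600 samples.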
python train.py --mode prepare --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir data/speech_commands
- If the raw data needs to be downloaded: python train.py --mode prepare --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_url http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz --data_dir data/speech_commands
prepare_data_index: hashes each audio file name into one of the sets (so that clips from the same speaker tend to land in the same set), adds silence entries to each set in the configured proportion, adds some unknown-word wavs to each set in the configured proportion, and finally shuffles within each set (wavs under the background folder are ignored at this stage).
input: silence_percentage, unknown_percentage, validation_percentage, testing_percentage, wanted_words
output: data_index: {'validation': [{'label': word, 'file': wav_path}], 'testing': […], 'training': […]}
words_list: [_silence_, _unknown_, yes, no, up, down, left, right, on, off, stop, go]
word_to_index (every word that does not need to be recognized maps to 1): {'marvin': 1, 'tree': 1, 'learn': 1, 'dog': 1, 'sheila': 1, 'bird': 1, 'right': 7, 'off': 9, 'backward': 1, 'six': 1, 'two': 1, 'no': 3, 'yes': 2, 'one': 1, 'follow': 1, 'up': 4, 'three': 1, 'forward': 1, 'happy': 1, 'nine': 1, 'bed': 1, 'zero': 1, 'house': 1, 'visual': 1, 'five': 1, 'seven': 1, 'cat': 1, 'left': 6, 'stop': 10, 'go': 11, 'four': 1, 'on': 8, 'wow': 1, 'down': 5, 'eight': 1}
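The set assignment and the label mapping described above can be sketched in plain Python. This is an illustration under assumptions, not the project's code: the `_nohash_` stripping, the SHA-1 hash, and the constant `MAX_NUM_WAVS_PER_CLASS` are modelled on the usual speech-commands data loader, and `build_word_to_index` is a hypothetical helper name.

```python
import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # assumed constant, about 134 million


def which_set(filename, validation_percentage, testing_percentage):
    """Assign a wav file to a set based only on a hash of its name."""
    base_name = os.path.basename(filename)
    # Drop everything from "_nohash_" onward so that all clips recorded by
    # one speaker hash to the same value and land in the same set.
    hash_name = re.sub(r"_nohash_.*$", "", base_name)
    hashed = hashlib.sha1(hash_name.encode("utf-8")).hexdigest()
    percentage_hash = ((int(hashed, 16) % (MAX_NUM_WAVS_PER_CLASS + 1))
                       * (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage_hash < validation_percentage:
        return "validation"
    if percentage_hash < validation_percentage + testing_percentage:
        return "testing"
    return "training"


def build_word_to_index(wanted_words, all_words):
    """Index 0 is _silence_, 1 is _unknown_, wanted words start at 2."""
    words_list = ["_silence_", "_unknown_"] + list(wanted_words)
    wanted_index = {word: i + 2 for i, word in enumerate(wanted_words)}
    word_to_index = {word: wanted_index.get(word, 1) for word in all_words}
    return words_list, word_to_index


WANTED = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
words_list, word_to_index = build_word_to_index(WANTED, WANTED + ["marvin", "tree"])
print(words_list)
print(word_to_index)
```

Because the hash depends only on the file name, the split is stable across runs and does not change when new files are added.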
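- prepare_background_data:
- Handles the audio under the background_noise folder separately
- output:
- background_data: a float tensor (the notes give its shape as (16000, 0))
- Key function: wav_decoder = contrib_audio.decode_wav(wav_loader, desired_channels=1): decodes a 16-bit PCM WAVE file into a single-channel float tensor
- prepare_processing_graph: builds the TensorFlow graph for audio decoding up front; it is executed later by session.run
- Load a WAVE file -> decode (float tensor of shape (16000, 1)) -> scale the volume (element-wise multiply, still a (16000, 1) float tensor) -> time-shift the audio -> mix in background noise (float tensor of shape (desired_samples,) with values in [-1.0, 1.0]) -> compute the spectrogram ((1, 25, 513); the first dimension is the channel count, which is 1 for mono) -> build the MFCC fingerprint ((1, 25, 10))
- Time shift: a right shift pads zeros on the left, and a left shift pads zeros on the right. The padding amount is generated dynamically when a batch is built, so it is also defined as a placeholder. Because the audio tensor has shape (16000, 1), the padding is a [2, 2] tensor, although normally only the first (time) dimension is padded. For a right shift of 100 samples, the tensor passed in is [[100, 0], [0, 0]]. A left shift needs padding and also has to cut off the left part, so a time_shift_offset_placeholder_ is passed in as well; for a right shift this value is zero. For a left shift of 100 samples, time_shift_padding_placeholder_ should be [[0, 100], [0, 0]] and time_shift_offset_placeholder_ should be [100].
- Noise mixing: the placeholder background_data_placeholder_ holds the noise and background_volume_placeholder_ holds the mixing volume (a ratio); a background_volume_placeholder_ of zero means no noise. Multiplying the two gives background_mul, and adding that to sliced_foreground gives background_add. Because the sum can exceed 1 in magnitude, clip_by_value is used to limit the result to the range [-1, 1].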
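The pad-then-slice shift and the clipped noise mix described above can be illustrated on plain Python lists. This is a minimal sketch rather than the graph code itself: `time_shift` and `mix_background` are hypothetical names, and a one-dimensional list stands in for the (16000, 1) tensor.

```python
def time_shift(samples, shift):
    """Shift right (shift > 0) or left (shift < 0), keeping the length fixed."""
    n = len(samples)
    if shift > 0:
        # Right shift: pad zeros on the left, keep the first n samples.
        pad_left, pad_right, offset = shift, 0, 0
    else:
        # Left shift: pad zeros on the right, cut off the first -shift samples.
        pad_left, pad_right, offset = 0, -shift, -shift
    padded = [0.0] * pad_left + list(samples) + [0.0] * pad_right
    return padded[offset:offset + n]


def mix_background(foreground, background, background_volume):
    """Add scaled noise to the foreground and clip the sum to [-1, 1]."""
    mixed = [f + b * background_volume for f, b in zip(foreground, background)]
    return [max(-1.0, min(1.0, m)) for m in mixed]


clip = [1.0, 2.0, 3.0, 4.0]
print(time_shift(clip, 2))
print(time_shift(clip, -2))
print(mix_background([0.9, -0.9, 0.1], [1.0, -1.0, 0.0], 0.5))
```

A background volume of 0 leaves the foreground unchanged, which matches the "no noise" case in the description.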
- Inputs and outputs
- input: fingerprint_input(batch_size, fingerprint_size), model_settings, model_size_info, is_training
- output: logits, dropout_prob (a placeholder, returned only when training)
- dnn: create_dnn_model
- Parameters:
- W1:(fingerprint_size, model_size_info[0])
- b1: (model_size_info[0])
- W2: (model_size_info[0], model_size_info[1])
- b2: (model_size_info[1])
- W3:(model_size_info[1], model_size_info[2])
- b3: (model_size_info[2])
- weights: (model_size_info[2], label_count)
- bias: (label_count)
- Total parameter count: 250 * 144 + 144 + 144 * 144 + 144 + 144 * 144 + 144 + 144 * 12 + 12 = 79644
- Operation count:
- Forward computation (fingerprint_input is written as flow; * denotes matrix multiplication):
- flow -> (batch_size, fingerprint_size)
- flow1 = dropout(relu(flow * W1 + b1)) -> (batch_size, model_size_info[0])
- flow2 = dropout(relu(flow1 * W2 + b2)) -> (batch_size, model_size_info[1])
- flow3 = dropout(relu(flow2 * W3 + b3)) -> (batch_size, model_size_info[2])
- logits = flow3 * weights + bias -> (batch_size, label_count)
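The DNN parameter total is the sum of a weight matrix and a bias vector for each pair of consecutive layer widths. A small sketch of that count (the helper name `dense_param_count` is hypothetical):

```python
def dense_param_count(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each consecutive pair."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# fingerprint_size -> three hidden layers of 144 units -> label_count
dnn_params = dense_param_count([250, 144, 144, 144, 12])
print(dnn_params)
```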
- conv: create_conv_model, see [2]
- Parameters:
- first_weights: (first_filter_height=20, first_filter_width=8, 1, first_filter_count=64) (note: the convolution kernel)
- first_bias: (first_filter_count,)
- second_weights: (second_filter_height=10, second_filter_width=4, first_filter_count=64, second_filter_count=64)
- second_bias: (second_filter_count,)
- final_fc_weights: (13 * 5 * 64 = 4160, label_count)
- final_fc_bias: (label_count,)
- Forward computation:
- flow -> (batch_size, fingerprint_size)
- fingerprint_4d = reshape(flow) -> (batch_size, spectrogram_length=25, dct_coefficient_count=10, 1) (note: laid out like an image, [batch_size, in_height, in_width, in_channels])
- first_conv = dropout(relu(conv2d(fingerprint_4d, first_weights, [1, 1, 1, 1], 'SAME') + first_bias)) -> (batch_size, spectrogram_length=25, dct_coefficient_count=10, first_filter_count=64)
- max_pool = max_pool(first_conv, [1, 2, 2, 1], [1, 2, 2, 1], 'SAME') -> (batch_size, ceil(25 / 2) = 13, ceil(10 / 2) = 5, first_filter_count=64)
- second_conv = dropout(relu(conv2d(max_pool, second_weights, [1, 1, 1, 1], 'SAME') + second_bias)) -> (batch_size, 13, 5, 64)
- flattened_second_conv = reshape(second_conv) -> (batch_size, 13 * 5 * 64 = 4160)
- final_fc = flattened_second_conv * final_fc_weights + final_fc_bias -> (batch_size, label_count)
- Total parameter count: 224140 = 20 * 8 * 64 + 64 + 10 * 4 * 64 * 64 + 64 + 4160 * 12 + 12
- Total operation count
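The shapes and the parameter total for the conv model can be checked with two small helpers. This sketch assumes TensorFlow's usual output-size rules for 'SAME' and 'VALID' padding; the helper names are hypothetical.

```python
import math


def conv_output_size(size, kernel, stride, padding):
    """Output length along one dimension for a conv or pooling op."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError("unknown padding: " + padding)


def conv_param_count(height, width, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return height * width * in_channels * out_channels + out_channels


# first conv (SAME, stride 1) keeps 25 x 10; max pool (SAME, stride 2) halves it
pool_h = conv_output_size(conv_output_size(25, 20, 1, "SAME"), 2, 2, "SAME")
pool_w = conv_output_size(conv_output_size(10, 8, 1, "SAME"), 2, 2, "SAME")
flat = pool_h * pool_w * 64  # second conv (SAME, stride 1) keeps 13 x 5

conv_params = (conv_param_count(20, 8, 1, 64)
               + conv_param_count(10, 4, 64, 64)
               + flat * 12 + 12)
print(pool_h, pool_w, flat, conv_params)
```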
- low_latency_conv: create_low_latency_conv_model, see [2]. It has fewer parameters than conv, at some cost in accuracy
- Parameters
- first_weights: (spectrogram_length=25, first_filter_width=8, 1, first_filter_count=186)
- first_bias: (first_filter_count,)
- first_fc_weights: (first_conv_element_count=558, first_fc_output_channels=128)
- first_fc_bias: (first_fc_output_channels,)
- second_fc_weights: (first_fc_output_channels=128, second_fc_output_channels=128)
- second_fc_bias: (second_fc_output_channels=128,)
- final_fc_weights: (second_fc_output_channels=128, label_count=12)
- final_fc_bias: (label_count,)
- Total parameter count: 126998
- Total operation count
- Forward computation
- flow -> (batch_size, fingerprint_size)
- fingerprint_4d = reshape(flow) -> (batch_size, spectrogram_length=25, dct_coefficient_count=10, 1)
- first_conv = dropout(relu(conv2d(fingerprint_4d, first_weights, [1, 1, 1, 1], 'VALID') + first_bias)) -> (batch_size, 1, 3, 186)
- flattened_first_conv = reshape(first_conv) -> (batch_size, 3 * 186 = 558)
- first_fc = dropout(flattened_first_conv * first_fc_weights + first_fc_bias) -> (batch_size, first_fc_output_channels=128)
- second_fc = dropout(first_fc * second_fc_weights + second_fc_bias) -> (batch_size, second_fc_output_channels=128)
- final_fc = second_fc * final_fc_weights + final_fc_bias -> (batch_size, label_count)
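The low_latency_conv total of 126998 can be reproduced from the listed shapes. A short sketch, assuming the 'VALID' rule (size - kernel) // stride + 1 for the single convolution; the variable names are illustrative only:

```python
def valid_conv_size(size, kernel, stride=1):
    """Output length of a 'VALID' convolution along one dimension."""
    return (size - kernel) // stride + 1


out_h = valid_conv_size(25, 25)    # the kernel spans all 25 frames
out_w = valid_conv_size(10, 8)     # 10 coefficients, kernel width 8
first_conv_element_count = out_h * out_w * 186

low_latency_params = (
    25 * 8 * 1 * 186 + 186                    # first_weights + first_bias
    + first_conv_element_count * 128 + 128    # first_fc
    + 128 * 128 + 128                         # second_fc
    + 128 * 12 + 12                           # final_fc
)
print(out_h, out_w, first_conv_element_count, low_latency_params)
```

Most of the parameters sit in first_fc, which is why shrinking the convolution output to 1 x 3 matters for the model size.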
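- basic-lstm: differs from lstm by not having use_peepholes and num_proj; parameter count 43916 = 10 * 98 * 4 + 98 * 98 * 4 + 98 * 4 + 98 * 12 + 12
- lstm: model_size_info[0] is the projection size, model_size_info[1] is the number of memory cells in the LSTM
- Parameters (open question: the count still looks off after adding peepholes)
- lstm_cell: (fully connected projection on the hidden layer, peephole connections)
- W_f, W_i, W_o, W_c: (dct_coefficient_count, model_size_info[1])
- U_f, U_i, U_o, U_c: (model_size_info[0], model_size_info[1])
- bias_f, bias_i, bias_o, bias_c: (model_size_info[1])
- W_h: (model_size_info[1], model_size_info[0])
- bias_h: (model_size_info[0])
- W_o: (model_size_info[0]=98, label_count=12)
- b_o: (label_count=12,)
- Total parameter count: 5760 + 56448 + 576 + 14112 + 98 + 1176 + 12 = 78182, which does not match the 78516 that was expected. The gap of 334 would be accounted for by adding three peephole vectors of size model_size_info[1] (+432) and dropping bias_h (-98), assuming the projection layer has no bias.
- Total operation count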
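The LSTM counts above can be recomputed under different assumptions about peepholes and the projection bias. The sketch below is illustrative; the function `lstm_param_count` and its flags are hypothetical, and whether the real cell has a projection bias is an assumption, not something these notes confirm.

```python
def lstm_param_count(input_size, num_units, label_count, num_proj=None,
                     use_peepholes=False, proj_bias=False):
    """Count LSTM cell parameters plus the final output layer."""
    # The recurrent input is the projected state when a projection is used.
    recurrent_size = num_proj if num_proj is not None else num_units
    # Four gates, each with input weights, recurrent weights and a bias.
    count = 4 * (input_size * num_units + recurrent_size * num_units + num_units)
    if use_peepholes:
        count += 3 * num_units  # one diagonal vector each for f, i, o
    if num_proj is not None:
        count += num_units * num_proj
        if proj_bias:
            count += num_proj
    count += recurrent_size * label_count + label_count  # output layer
    return count


basic_lstm = lstm_param_count(10, 98, 12)
lstm_with_proj_bias = lstm_param_count(10, 144, 12, num_proj=98, proj_bias=True)
lstm_with_peepholes = lstm_param_count(10, 144, 12, num_proj=98, use_peepholes=True)
print(basic_lstm, lstm_with_proj_bias, lstm_with_peepholes)
```

Under these assumptions the first variant reproduces 43916, the second 78182, and the third 78516.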
- Forward computation (* is matrix multiplication, ∘ is element-wise multiplication)
- flow -> (batch_size, fingerprint_size)
- fingerprint_4d = reshape(flow) -> (batch_size, spectrogram_length=25, dct_coefficient_count=10)
- Inputs for one time step
- hidden state h_t-1: (batch_size, model_size_info[0])
- x_t: (batch_size, dct_coefficient_count=10)
- c_t-1: (batch_size, model_size_info[1])
- f_t = sigmoid(x_t * W_f + h_t-1 * U_f + bias_f) -> (batch_size, model_size_info[1])
- i_t = sigmoid(x_t * W_i + h_t-1 * U_i + bias_i) -> (batch_size, model_size_info[1])
- o_t = sigmoid(x_t * W_o + h_t-1 * U_o + bias_o) -> (batch_size, model_size_info[1])
- c' = tanh(x_t * W_c + h_t-1 * U_c + bias_c) -> (batch_size, model_size_info[1])
- c_t = f_t ∘ c_t-1 + i_t ∘ c' -> (batch_size, model_size_info[1])
- h_t = o_t ∘ tanh(c_t) -> (batch_size, model_size_info[1])
- h_t = h_t * W_h + bias_h -> (batch_size, model_size_info[0])
- flow = the last output of dynamic_rnn(lstm_cell, fingerprint_4d) -> (batch_size, model_size_info[0]=98)
- logits = flow * W_o + b_o -> (batch_size, label_count)
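One time step of the projected LSTM described above can be written out for a single example with plain Python lists. This is a sketch of the standard cell equations with the output computed as o_t ∘ tanh(c_t); peephole terms are left out, and the helper names and the parameter dictionary layout are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(x, W, h, U, b):
    """Return x * W + h * U + b for row vectors x, h and matrices W, U."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x))
            + sum(hi * U[i][j] for i, hi in enumerate(h))
            + b[j]
            for j in range(len(b))]


def lstm_step(x_t, h_prev, c_prev, p):
    """One step for a single example; returns (projected h_t, c_t)."""
    f_t = [sigmoid(v) for v in affine(x_t, p["W_f"], h_prev, p["U_f"], p["bias_f"])]
    i_t = [sigmoid(v) for v in affine(x_t, p["W_i"], h_prev, p["U_i"], p["bias_i"])]
    o_t = [sigmoid(v) for v in affine(x_t, p["W_o"], h_prev, p["U_o"], p["bias_o"])]
    c_cand = [math.tanh(v) for v in affine(x_t, p["W_c"], h_prev, p["U_c"], p["bias_c"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    h_full = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    # Projection from the cell size down to the projection size.
    h_t = [sum(h_full[i] * p["W_h"][i][j] for i in range(len(h_full))) + p["bias_h"][j]
           for j in range(len(p["bias_h"]))]
    return h_t, c_t


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Tiny illustrative sizes: 2 inputs, 3 memory cells, projection size 2.
params = {name: zeros(2, 3) for name in
          ["W_f", "W_i", "W_o", "W_c", "U_f", "U_i", "U_o", "U_c"]}
params.update({name: [0.0] * 3 for name in ["bias_f", "bias_i", "bias_o", "bias_c"]})
params["W_h"] = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
params["bias_h"] = [0.1, 0.2]

h_t, c_t = lstm_step([0.3, -0.3], [0.0, 0.0], [1.0, 1.0, 1.0], params)
print(h_t, c_t)
```

With all gate weights at zero every gate equals 0.5, so the cell state is simply halved, which makes the step easy to check by hand.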
- gru: parameters; model_size_info[0] is num_layers, model_size_info[1] is gru_units
- With layer normalization
- Without layer normalization
- W_rx, W_zx, W_hx: (dct_coefficient_count=10, gru_units=154)
- W_rh, W_zh, W_hh: (gru_units, gru_units)
- bias_r, bias_z, bias_h: (gru_units)
- W_o: (gru_units, label_count)
- b_o: (label_count)
- Total parameter count: 10 * 154 * 3 + 154 * 154 * 3 + 154 * 3 + 154 * 12 + 12 = 78090 (the notes compare this against 91950 with a question mark; the difference is not explained here)
- Forward computation
- flow -> (batch_size, fingerprint_size)
- fingerprint_4d = reshape(flow) -> (batch_size, spectrogram_length=25, dct_coefficient_count=10)
- Inputs for one time step
- hidden state h_t-1: (batch_size, gru_units)
- x_t: (batch_size, dct_coefficient_count=10)
- Unidirectional
- r_t = sigmoid(x_t * W_rx + h_t-1 * W_rh + bias_r) -> (batch_size, gru_units)
- z_t = sigmoid(x_t * W_zx + h_t-1 * W_zh + bias_z) -> (batch_size, gru_units)
- h' = tanh(x_t * W_hx + (r_t ∘ h_t-1) * W_hh + bias_h) -> (batch_size, gru_units)
- h_t = (1 - z_t) ∘ h_t-1 + z_t ∘ h' -> (batch_size, gru_units)
- flow = the last output of dynamic_rnn(gru_cell, fingerprint_4d) -> (batch_size, gru_units)
- logits = flow * W_o + b_o -> (batch_size, label_count)
- Bidirectional
- Concatenation of the forward and backward logits
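The unidirectional GRU step and its parameter count (without layer normalization) can be sketched the same way. The code follows the four equations listed above for a single example; the helper names and the parameter dictionary are hypothetical, and the sizes in the example are deliberately tiny.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(v, M):
    """Row vector times matrix: v * M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def gru_step(x_t, h_prev, p):
    """One GRU step for a single example, following the listed equations."""
    r_t = [sigmoid(a + b + c) for a, b, c in
           zip(matvec(x_t, p["W_rx"]), matvec(h_prev, p["W_rh"]), p["bias_r"])]
    z_t = [sigmoid(a + b + c) for a, b, c in
           zip(matvec(x_t, p["W_zx"]), matvec(h_prev, p["W_zh"]), p["bias_z"])]
    gated = [r * h for r, h in zip(r_t, h_prev)]
    h_cand = [math.tanh(a + b + c) for a, b, c in
              zip(matvec(x_t, p["W_hx"]), matvec(gated, p["W_hh"]), p["bias_h"])]
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_cand)]


def gru_param_count(input_size, gru_units, label_count):
    """Three gates (input weights, recurrent weights, bias) plus the output layer."""
    cell = 3 * (input_size * gru_units + gru_units * gru_units + gru_units)
    return cell + gru_units * label_count + label_count


# Tiny illustrative sizes: 1 input, 2 units, all weights zero.
params = {name: [[0.0, 0.0]] for name in ["W_rx", "W_zx", "W_hx"]}
params.update({name: [[0.0, 0.0], [0.0, 0.0]] for name in ["W_rh", "W_zh", "W_hh"]})
params.update({name: [0.0, 0.0] for name in ["bias_r", "bias_z", "bias_h"]})

h_t = gru_step([0.3], [1.0, -1.0], params)
gru_params = gru_param_count(10, 154, 12)
print(h_t, gru_params)
```

With zero weights the update gate is 0.5 and the candidate state is 0, so the new state is half the previous one.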
- crnn
- ds_cnn
- Loss: softmax cross entropy
- Optimizer: Adam
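For reference, the softmax cross-entropy loss for one example with an integer label can be computed as below. This is a generic, numerically stable sketch in plain Python, not the TensorFlow op used by the training script.

```python
import math


def softmax_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed with the log-sum-exp trick."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[label]


uniform_loss = softmax_cross_entropy([0.0] * 12, 3)     # 12 equally likely classes
confident_loss = softmax_cross_entropy([10.0, 0.0, 0.0], 0)
print(uniform_loss, confident_loss)
```

An untrained 12-class model that outputs equal logits starts at a loss of ln(12), roughly 2.48, which is a useful sanity check for the first training steps.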
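- The key part is audio_processor.get_data, which randomly generates one batch of training data
- input:
- how_many: batch size; -1 returns everything
- offset: where to start when the data is generated deterministically
- background_frequency: a value between 0.0 and 1.0, the fraction of samples that get noise mixed in
- background_volume_range: volume of the background noise
- time_shift: shift range, [-time_shift, time_shift]
- sess: the session used to run the data-producing operations built earlier; see prepare_processing_graph
- output:
- data:
- labels
- For the procedure, see the comments in the code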
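The per-sample randomization that get_data feeds into the processing graph can be sketched as follows. This is an assumption-laden illustration rather than the function itself: `sample_augmentation` is a hypothetical name, the shift is drawn inclusively from [-time_shift, time_shift], and the noise volume is assumed to be drawn uniformly from [0, background_volume_range].

```python
import random


def sample_augmentation(time_shift, background_frequency,
                        background_volume_range, rng=random):
    """Draw the time-shift and noise settings for one training sample."""
    shift = rng.randint(-time_shift, time_shift) if time_shift > 0 else 0
    if shift > 0:
        # Right shift: pad on the left, no slicing offset.
        padding, offset = [[shift, 0], [0, 0]], [0, 0]
    else:
        # Left shift (or none): pad on the right, slice off the start.
        padding, offset = [[0, -shift], [0, 0]], [-shift, 0]
    # Only a fraction of the samples get any background noise.
    if rng.random() < background_frequency:
        volume = rng.uniform(0.0, background_volume_range)
    else:
        volume = 0.0
    return {"time_shift_padding": padding,
            "time_shift_offset": offset,
            "background_volume": volume}


rng = random.Random(0)
batch = [sample_augmentation(1600, 0.8, 0.1, rng) for _ in range(1000)]
print(batch[0])
```

Each dictionary corresponds to the values that would be fed to the time-shift and background placeholders for one sample.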
- Experimental results:
- Final test accuracy = 84.68% (N=3081): python train.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/DNN/DNN1/retrain_logs --train_dir work/DNN/DNN1/training
- python train.py --model_architecture conv --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/CNN/CNN1/retrain_logs --train_dir work/CNN/CNN1/training
- python train.py --model_architecture low_latency_conv --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/CNN/CNN2/retrain_logs --train_dir work/CNN/CNN2/training
- python train.py --model_architecture basic_lstm --model_size_info 98 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/LSTM/LSTM1/retrain_logs --train_dir work/LSTM/LSTM1/training
- python train.py --model_architecture lstm --model_size_info 98 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/LSTM/LSTM2/retrain_logs --train_dir work/LSTM/LSTM2/training
- python train.py --model_architecture gru --model_size_info 1 154 --window_size_ms 40 --window_stride_ms 40 --learning_rate 0.0005,0.0001,0.00002 --how_many_training_steps 10000,10000,10000 --data_dir ./data/speech_commands --summaries_dir work/GRU/retrain_logs --train_dir work/GRU/training
- Using a checkpoint
- python test.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --checkpoint work/DNN/DNN1/training/best/dnn_8503.ckpt-24000
- Using the frozen graph: this produces a graph that can run on Android and iOS
- python freeze.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir ./data/speech_commands --checkpoint work/DNN/DNN1/training/best/dnn_8503.ckpt-24000 --output_file work/DNN/DNN1/training/best/dnn.pb
- Build the graph: create_inference_graph turns the wave_data file path placeholder into reshaped_input through a series of operations, builds the model with create_model to get the logits, then applies softmax to get the output
- Load the weights: models.load_variables_from_checkpoint
- Replace the variables with inline constants: graph_util.convert_variables_to_constants
- Write the result to the target graph file: tf.train.write_graph
- python test_pb.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir ./data/speech_commands --graph work/DNN/DNN1/training/best/dnn.pb
- Predicting a single audio file with the model (using the pb file)
- python predict_pb.py --model_architecture dnn --model_size_info 144 144 144 --dct_coefficient_count 10 --window_size_ms 40 --window_stride_ms 40 --data_dir ./data/speech_commands --graph work/DNN/DNN1/training/best/dnn.pb --test_wave data/speech_commands/yes/377e916b_nohash_0.wav
