Speech Recognition with an RNN (LSTM) on TensorFlow 2.10.0

I came across a speech recognition example built on an RNN (LSTM), so I tried it on Windows 11 with TensorFlow-GPU 2.10.0.
Introduction to speech recognition with TensorFlow

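My machine with a GPU (GTX-1070) runs Windows 11, so I upgraded TensorFlow 2 and ran the example on tensorflow-gpu 2.10.0.
I originally planned to use the GPU build of TensorFlow 2.12.0, but from what I read, 2.10.0 appears to be the last TensorFlow 2 release with a GPU build for Windows 11,
so I went with that version.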

Environment:
Windows 11
Python 3.10.6
tensorflow-gpu 2.10.0
GTX-1070
CUDA Toolkit 11.2
cuDNN SDK 8.1.0

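On Windows 11, the latest GPU builds of TensorFlow apparently assume that you run them inside a virtualized Linux environment (WSL) with Ubuntu or similar.
It would probably be simpler to use Ubuntu from the start.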

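I trained for about 21 epochs with train.py and then tested with inferencModel.py.
Below is the output of a slightly modified inferencModel.py that prints each input sentence (speech) together with its recognition result.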

>text:mas ginastics compulsory after work meeting usually political information meeting
>>>>>:mass gymnastics compulsory afterwork meeting usually political information meeting
>text:the poor sol than joined the dor ind prayer and never did eywitness more contrition at any condemned sermone than he then evinsed
>>>>>:the poor soul then joined the doctor in prayer and never did i witness more contrition at any condemned sermon than he then evinced
>text:but apparently was not able to spendas much time with them as he would have liked because of the ahe gaps of five and seven years
>>>>>:but apparently was not able to spend as much time with them as he would have liked because of the age gaps of five and seven years
>text:from which he rose to be assistant registrar with the special duties of transfering shares
>>>>>:from which he rose to be assistant registrar with the special duties of transferring shares
>text:but he escated through a back door on to the river and road off in aboat to a hiding place in the wods
>>>>>:but he escaped through a back door on to the river and rowed off in a boat to a hidingplace in the woods
>text:there were nine wards in all on the female side one of them in the attic
>>>>>:there were nine wards in all on the female side one of them in the attic
>text:she boarded the marsalis bus at ste pal and elm streets to return home she testified further quote
>>>>>:she boarded the marsalis bus at st paul and elm streets to return home she testified further quote

>text: is the sentence of the input audio
>>>>>: is the recognition result for it
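
To put a number on how close the two lines of each pair are, one option is the character error rate (CER), the quantity that train.py monitors as val_CER. The following is a minimal plain-Python sketch based on edit distance. It is written for illustration only and is not the mltu CERMetric implementation; the function names edit_distance and cer are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


# The one pair in the output above where both lines are identical.
line = "there were nine wards in all on the female side one of them in the attic"
print(cer(line, line))
```

Averaging cer over all pairs would give a rough single score for a test run.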

Pretty impressive.
It presumably does not handle Japanese, though.

1. Training.
The original script could not resume once it had been stopped partway through, so I modified it to support resuming.
train.py
import tensorflow as tf
try: [tf.config.experimental.set_memory_growth(gpu, True) for gpu in tf.config.experimental.list_physical_devices("GPU")]
except: pass

import os
import sys
import tarfile
import pandas as pd
from tqdm import tqdm
from urllib.request import urlopen
from io import BytesIO

from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, TensorBoard
from mltu.preprocessors import WavReader

from mltu.tensorflow.dataProvider import DataProvider
from mltu.transformers import LabelIndexer, LabelPadding, SpectrogramPadding
from mltu.tensorflow.losses import CTCloss
from mltu.tensorflow.callbacks import Model2onnx, TrainLogger
from mltu.tensorflow.metrics import CERMetric, WERMetric

from model import train_model
from configs import ModelConfigs

from keras.models import load_model

import matplotlib.pyplot as plt
import numpy as np


def plot_spectrogram(spectrogram: np.ndarray, title:str = "", transpose: bool = True, invert: bool = True) -> None:
    """Plot the spectrogram of a WAV file

    Args:
        spectrogram (np.ndarray): Spectrogram of the WAV file.
        title (str, optional): Title of the plot. Defaults to None.
        transpose (bool, optional): Transpose the spectrogram. Defaults to True.
        invert (bool, optional): Invert the spectrogram. Defaults to True.
    """
    if transpose:
        spectrogram = spectrogram.T

    if invert:
        spectrogram = spectrogram[::-1]

    plt.figure(figsize=(15, 5))
    plt.imshow(spectrogram, aspect="auto", origin="lower")
    plt.title(f"Spectrogram: {title}")
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    #plt.colorbar()
    plt.tight_layout()
    plt.show()


def download_and_unzip(url, extract_to="Datasets", chunk_size=1024*1024):
    http_response = urlopen(url)

    data = b""
    iterations = http_response.length // chunk_size + 1
    for _ in tqdm(range(iterations)):
        data += http_response.read(chunk_size)

    tarFile = tarfile.open(fileobj=BytesIO(data), mode="r|bz2")
    tarFile.extractall(path=extract_to)
    tarFile.close()


if __name__ == "__main__":
    from mltu.configs import BaseModelConfigs

    CONT_F=False
    test_date="202306281257"
    initial_epoch=0     # start 0
    checkpoint_latest_dir = os.path.join("training_2")

    if CONT_F==False:
        dataset_path = os.path.join("Datasets", "LJSpeech-1.1")
        if not os.path.exists(dataset_path):
            download_and_unzip("https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2", extract_to="Datasets")

        dataset_path = "Datasets/LJSpeech-1.1"
        metadata_path = dataset_path + "/metadata.csv"
        wavs_path = dataset_path + "/wavs/"

        # Read metadata file and parse it
        metadata_df = pd.read_csv(metadata_path, sep="|", header=None, quoting=3)
        metadata_df.columns = ["file_name", "transcription", "normalized_transcription"]
        metadata_df = metadata_df[["file_name", "normalized_transcription"]]

        # structure the dataset where each row is a list of [wav_file_path, sound transcription]
        dataset = [[f"Datasets/LJSpeech-1.1/wavs/{file}.wav", label.lower()] for file, label in metadata_df.values.tolist()]

        # Create a ModelConfigs object to store model configurations
        configs = ModelConfigs()

        max_text_length, max_spectrogram_length = 0, 0
        for file_path, label in tqdm(dataset):
            spectrogram = WavReader.get_spectrogram(file_path, frame_length=configs.frame_length, frame_step=configs.frame_step, fft_length=configs.fft_length)
            valid_label = [c for c in label if c in configs.vocab]
            max_text_length = max(max_text_length, len(valid_label))
            max_spectrogram_length = max(max_spectrogram_length, spectrogram.shape[0])
            configs.input_shape = [max_spectrogram_length, spectrogram.shape[1]]

        configs.max_spectrogram_length = max_spectrogram_length
        configs.max_text_length = max_text_length
        configs.save()

        # Create a data provider for the dataset
        data_provider = DataProvider(
            dataset=dataset,
            skip_validation=True,
            batch_size=configs.batch_size,
            data_preprocessors=[
                WavReader(frame_length=configs.frame_length, frame_step=configs.frame_step, fft_length=configs.fft_length),
                ],
            transformers=[
                SpectrogramPadding(max_spectrogram_length=configs.max_spectrogram_length, padding_value=0),
                LabelIndexer(configs.vocab),
                LabelPadding(max_word_length=configs.max_text_length, padding_value=len(configs.vocab)),
                ],
        )

        train_csv_file=os.path.join(configs.model_path, "train.csv")
        val_csv_file=os.path.join(configs.model_path, "val.csv")
        #if os.path.isfile(train_csv_file):
        #    i=0

        # Split the dataset into training and validation sets
        train_data_provider, val_data_provider = data_provider.split(split = 0.9)

    else:
        configs = BaseModelConfigs.load("Models/05_sound_to_text/"+test_date+"/configs.yaml")

        dataset_train = pd.read_csv("Models/05_sound_to_text/"+test_date+"/train.csv").values.tolist()
        dataset_val = pd.read_csv("Models/05_sound_to_text/"+test_date+"/val.csv").values.tolist()

        train_data_provider = DataProvider(
            dataset=dataset_train,
            skip_validation=True,
            batch_size=configs.batch_size,
            data_preprocessors=[
                WavReader(frame_length=configs.frame_length, frame_step=configs.frame_step, fft_length=configs.fft_length),
                ],
            transformers=[
                SpectrogramPadding(max_spectrogram_length=configs.max_spectrogram_length, padding_value=0),
                LabelIndexer(configs.vocab),
                LabelPadding(max_word_length=configs.max_text_length, padding_value=len(configs.vocab)),
                ],
        )

        val_data_provider = DataProvider(
            dataset=dataset_val,
            skip_validation=True,
            batch_size=configs.batch_size,
            data_preprocessors=[
                WavReader(frame_length=configs.frame_length, frame_step=configs.frame_step, fft_length=configs.fft_length),
                ],
            transformers=[
                SpectrogramPadding(max_spectrogram_length=configs.max_spectrogram_length, padding_value=0),
                LabelIndexer(configs.vocab),
                LabelPadding(max_word_length=configs.max_text_length, padding_value=len(configs.vocab)),
                ],
        )

    # Creating TensorFlow model architecture
    model = train_model(
        input_dim = configs.input_shape,
        output_dim = len(configs.vocab),
        dropout=0.5
    )

    # Compile the model and print summary
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=configs.learning_rate),
        loss=CTCloss(),
        metrics=[
            CERMetric(vocabulary=configs.vocab),
            WERMetric(vocabulary=configs.vocab)
            ],
        run_eagerly=False
    )

    # Include the epoch in the file name (uses `str.format`)
    checkpoint_path = "training_2/cp-{epoch:04d}.ckpt"
    #checkpoint_path = os.path.join("Models/05_sound_to_text", datetime.strftime(datetime.now(), "%Y%m%d%H%M"))
    checkpoint_dir = os.path.dirname(checkpoint_path)

    if CONT_F:
        latest = tf.train.latest_checkpoint(checkpoint_latest_dir)
        #Model_dir=os.path.join('Models','05_sound_to_text','202306180353','model.h5')
        print(latest)
        # training_2\cp-0002.ckpt
        model.load_weights(latest)
        #f_fname = os.path.basename(latest)
        basename_without_ext = os.path.splitext(os.path.basename(latest))[0]
        initial_epoch=int(basename_without_ext.split('-')[1])
        print('initial_epoch:',initial_epoch)

    model.summary(line_length=110)

    if False:
        ddt,lv=train_data_provider.__getitem__(1)
        print(ddt.shape)
        # (8, 1392, 193)
        #print(ddt[0,:])
        print('--')
        # (8, 186)
        print(lv.shape)
        # (20000,)
        #print(lv[0,:])

        from IPython.display import Audio
        wave_audio = np.sin(np.linspace(0, 3000, 20000))
        print(wave_audio.shape)
        #Audio(wave_audio, rate=20000)

        #av_dd=np.ravel(ddt[1])
        #print(av_dd.shape)
        #print(av_dd)
        #Audio(av_dd, rate=44100)

        plot_spectrogram(ddt[0])
        sys.exit()

    # Define callbacks
    #earlystopper = EarlyStopping(monitor="val_CER", patience=20, verbose=1, mode="min")
    earlystopper = EarlyStopping(monitor="val_CER", patience=3, verbose=1, mode="min")
    checkpoint = ModelCheckpoint(f"{configs.model_path}/model.h5", monitor="val_CER", verbose=1, save_best_only=True, mode="min")
    trainLogger = TrainLogger(configs.model_path)
    tb_callback = TensorBoard(f"{configs.model_path}/logs", update_freq=1)
    reduceLROnPlat = ReduceLROnPlateau(monitor="val_CER", factor=0.8, min_delta=1e-10, patience=5, verbose=1, mode="auto")
    model2onnx = Model2onnx(f"{configs.model_path}/model.h5")

    #batch_size = 32
    batch_size = configs.batch_size

    # Create a callback that saves the model's weights every 5 epochs
    cp_callback = ModelCheckpoint(
        filepath=checkpoint_path,
        #monitor="loss",
        #monitor="CER",
        monitor="val_CER",
        verbose=1,
        save_best_only=True,
        save_weights_only=True,
        #save_freq=20*batch_size,
        mode="min")

    # Save training and validation datasets as csv files
    train_data_provider.to_csv(os.path.join(configs.model_path, "train.csv"))
    val_data_provider.to_csv(os.path.join(configs.model_path, "val.csv"))

    # Train the model
    model.fit(
        train_data_provider,
        validation_data=val_data_provider,
        epochs=configs.train_epochs,
        initial_epoch=initial_epoch,
        callbacks=[earlystopper, checkpoint, trainLogger, reduceLROnPlat, tb_callback, model2onnx, cp_callback],
        workers=configs.train_workers
    )

    train_csv_file=os.path.join(configs.model_path, "train.csv")
    val_csv_file=os.path.join(configs.model_path, "val.csv")
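
The resume logic in train.py depends on recovering the epoch number from a checkpoint name such as training_2\cp-0002.ckpt and passing it to model.fit as initial_epoch. Here is a small standalone sketch of that one step. The helper name epoch_from_checkpoint is hypothetical, and the fallback to 0 when no checkpoint exists is an added assumption (train.py itself does not guard against tf.train.latest_checkpoint returning None).

```python
import re
from typing import Optional


def epoch_from_checkpoint(path: Optional[str]) -> int:
    """Return the epoch encoded in a name like 'cp-0002.ckpt', or 0 if there is none."""
    if not path:
        return 0  # no checkpoint found: start from the first epoch
    match = re.search(r"cp-(\d+)\.ckpt", path)
    return int(match.group(1)) if match else 0


# The path printed in the comment inside train.py.
print(epoch_from_checkpoint(r"training_2\cp-0002.ckpt"))
```

Using a regular expression rather than os.path keeps the result the same whether the path uses Windows or POSIX separators.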

I wish someone would publish a training dataset for Japanese.

2. Training dataset.
The dataset used for training is:
The LJ Speech Dataset

Would training on a Japanese dataset instead make the model handle Japanese?
Kaggle has the following:
Japanese Single Speaker Speech Dataset

A two-stage approach looks like it could produce proper Japanese text (with kanji):
meian/meian_0000.wav|この前探った時は、途中に瘢痕の隆起があったので、ついそこが行きどまりだとばかり思って、ああ云ったんですが、|kono mae sagut ta toki wa 、 tochu- ni hankon no ryu-ki ga at ta node 、 tsui soko ga yukidomari da to bakari omot te 、 a- yut ta n desu ga 、|8.77
meian/meian_0001.wav|今日疎通を好くするために、そいつをがりがり掻き落して見ると、まだ奥があるんです」|kyo- sotsu- wo yoku suru tame ni 、 soitsu wo garigari kaki otoshi te miru to 、 mada oku ga aru n desu|7.48
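
As a rough sketch of how such rows could feed the first stage (audio to romaji), the snippet below splits one row on "|" and builds a [wav_path, label] pair in the same shape as the dataset list in train.py. The field meanings (path, original text, romaji, duration in seconds) are inferred from the two sample rows above, and the names Utterance and parse_row are hypothetical.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    wav_path: str
    text: str       # original Japanese sentence (kanji and kana)
    romaji: str     # romanized reading, a candidate label for stage one
    seconds: float  # clip duration (assumed meaning of the last field)


def parse_row(line: str) -> Utterance:
    """Split one pipe-separated metadata row into its four fields."""
    wav_path, text, romaji, seconds = line.rstrip("\n").split("|")
    return Utterance(wav_path, text, romaji, float(seconds))


row = parse_row(
    "meian/meian_0001.wav"
    "|今日疎通を好くするために、そいつをがりがり掻き落して見ると、まだ奥があるんです」"
    "|kyo- sotsu- wo yoku suru tame ni 、 soitsu wo garigari kaki otoshi te miru to 、 mada oku ga aru n desu"
    "|7.48"
)

# Stage one would learn audio -> romaji; stage two would map romaji -> row.text.
stage_one_pair = [row.wav_path, row.romaji]
print(stage_one_pair[0], row.seconds)
```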

3. References.
The following book helped me understand the model used here:
『Python と Keras によるディープラーニング』 (Mynavi Publishing)
6.3.8 Bidirectional recurrent neural networks
7.1 Beyond the Sequential model: the Keras Functional API

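4. Verification.
As an experiment, I played the test data through a speaker, captured it with a microphone, and had the model predict on the captured audio.
As expected, the results were nowhere near usable.

I looked into the cause a little.
The problem seems to be that the spectrogram of the test data as sent to the speaker and the spectrogram of the same audio captured by the microphone and converted into model input
do not match well.

Unless the two match, correct results cannot be expected.
The key issue seems to be how accurately the microphone audio can be captured.
My current microphone input code is probably rather poor.

Then again, a perfect match may be close to impossible. A model that tolerates imperfect input may be what is needed.
To begin with, the training data appears to contain only one female voice, which is a problem. It should be data read by many different speakers.

It is still far from usable in real situations.
Turning the speaker volume up considerably seems to improve things a little.
Setting the equalizer to "Voice" in the Windows 11 Realtek Audio Console also improved things a little.
When comparing the speaker-side and microphone-side audio as spectrograms, the two audio clips must have the same length.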
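
One way to make the speaker-versus-microphone comparison concrete is to trim both spectrograms to a common number of frames and compute a single similarity score. The sketch below uses the Pearson correlation of the flattened arrays. This is an illustrative choice and not what I actually ran; the function name spectrogram_similarity is hypothetical, and it assumes both clips start at the same instant (no time alignment is attempted).

```python
import numpy as np


def spectrogram_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation of two (time, freq) spectrograms over their common length."""
    if a.shape[1] != b.shape[1]:
        raise ValueError("both spectrograms need the same number of frequency bins")
    frames = min(a.shape[0], b.shape[0])   # equalize the data length first
    x = a[:frames].ravel() - a[:frames].mean()
    y = b[:frames].ravel() - b[:frames].mean()
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y / denom) if denom > 0 else 0.0


# Stand-in data: random values in place of real spectrograms.
rng = np.random.default_rng(0)
played = rng.random((50, 193))
captured = played[:40]                     # a shorter capture of the same signal
print(spectrogram_similarity(played, captured))
```

A score near 1 means the two spectrograms have nearly the same shape. A low score for a speaker and microphone pair would support the mismatch explanation above.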

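4.1 Discovered that the spectrograms in the training dataset are packed toward the end.
There seems to be a bug in class SpectrogramPadding in mltu/transformers.py.
Short spectrograms appear to end up packed toward the end (padding in front) rather than toward the front (padding at the end).
As an experiment, I modified class SpectrogramPadding so that the data is packed toward the front.
I need to rerun the tests with this change.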
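
The difference between the two layouts can be shown with a few lines of NumPy. This is a standalone sketch, not the mltu SpectrogramPadding code or my actual patch; the function pad_spectrogram and its pad_front flag are hypothetical names used only for illustration.

```python
import numpy as np


def pad_spectrogram(spec: np.ndarray, max_len: int,
                    padding_value: float = 0.0, pad_front: bool = False) -> np.ndarray:
    """Pad a (time, freq) spectrogram along the time axis up to max_len frames.

    pad_front=False keeps the data at the front and appends the padding.
    pad_front=True puts the padding first, so the data ends up at the back.
    """
    missing = max_len - spec.shape[0]
    if missing < 0:
        raise ValueError("spectrogram is longer than max_len")
    time_pad = (missing, 0) if pad_front else (0, missing)
    return np.pad(spec, (time_pad, (0, 0)), mode="constant", constant_values=padding_value)


spec = np.ones((2, 3))
print(pad_spectrogram(spec, 4))                  # data first, padding last
print(pad_spectrogram(spec, 4, pad_front=True))  # padding first, data last
```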

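4.2 With the bug above fixed, I will also switch from the plain spectrogram to a mel spectrogram and try again.

The advantages of using a mel spectrogram are described here:
畳込みニューラルネットワークの基本技術を比較する　ー音でもやってみたー (Comparing basic convolutional neural network techniques: trying it with sound too)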
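
For reference, converting a linear-frequency spectrogram to a mel spectrogram amounts to multiplying it by a bank of triangular filters spaced evenly on the mel scale. The NumPy sketch below uses the common 2595 * log10(1 + f / 700) mel formula. It is an illustration under assumptions and not the code I will use: the function names are hypothetical, 40 mel bins is an arbitrary choice, the 22050 Hz sample rate is that of LJ Speech, and 193 frequency bins matches the shape printed in the train.py debug comment.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(num_mel_bins: int, num_fft_bins: int, sample_rate: float) -> np.ndarray:
    """Triangular mel filters as a (num_fft_bins, num_mel_bins) weight matrix."""
    nyquist = sample_rate / 2.0
    fft_freqs = np.linspace(0.0, nyquist, num_fft_bins)
    # num_mel_bins + 2 edge frequencies, equally spaced on the mel scale
    edges = mel_to_hz(np.linspace(0.0, float(hz_to_mel(nyquist)), num_mel_bins + 2))
    lower, center, upper = edges[:-2], edges[1:-1], edges[2:]
    rising = (fft_freqs[:, None] - lower) / (center - lower)
    falling = (upper - fft_freqs[:, None]) / (upper - center)
    return np.maximum(0.0, np.minimum(rising, falling))


def to_mel(spectrogram: np.ndarray, sample_rate: float, num_mel_bins: int = 40) -> np.ndarray:
    """Project a (time, freq) magnitude spectrogram onto mel bins."""
    weights = mel_filterbank(num_mel_bins, spectrogram.shape[1], sample_rate)
    return spectrogram @ weights


mel = to_mel(np.ones((10, 193)), sample_rate=22050)
print(mel.shape)
```

If the model input changes this way, configs.input_shape would have to use the number of mel bins instead of 193.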

I would also like to try adding noise as augmentation for the audio data.
Data augmentation is covered in ディープラーニングで音声分類 (Audio classification with deep learning).
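
A simple form of noise augmentation is to add white Gaussian noise at a chosen signal-to-noise ratio. The sketch below is one possible version and not taken from the article mentioned above. The function name add_noise and the 10 dB value are assumptions, and the test signal is the same sine wave that appears in the train.py debug block.

```python
import numpy as np


def add_noise(audio: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Return audio plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


wave = np.sin(np.linspace(0, 3000, 20000))
noisy = add_noise(wave, snr_db=10.0, rng=np.random.default_rng(0))
print(noisy.shape)
```

Applying this to the raw waveform before the spectrogram is computed, with a randomly drawn SNR per sample, might also make the model more tolerant of the imperfect microphone input seen in section 4.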

For both, my earlier sound classify experiment came in handy.

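5. Onchan's notes.
1) Orange Pi5のNPUを使用してyolo（高速？）を動かしてみる(rknn-toolkit2) (Running YOLO, possibly fast, on the Orange Pi 5 NPU with rknn-toolkit2)

2) Introduction to LSTM and GRU.
直感で理解するLSTM・GRU入門 - 機械学習の基礎をマスターしよう！ (An intuitive introduction to LSTM and GRU)
The remade video version of the page above gives a clear overall picture of RNN, BasicLSTM, LSTM, and GRU.

3) There is apparently also speech recognition (ASR) using a TensorFlow transformer.
TensorFlow の transformer を使った音声認識(ASR)のプログラムを改修して日本語学習させてみました。 (Modifying a TensorFlow transformer ASR program and training it on Japanese)
It seems to be based on the following:
Automatic Speech Recognition with Transformer

4) OpenSeq2Seq

I have drifted quite far from robot development.
I need to get back to robot development.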
