Artificial Intelligence With Python Tutorial
AI with Python – Speech Recognition
In this chapter, we will learn about speech recognition using AI with Python.
Speech is the most basic means of human communication. The basic goal of speech processing is to provide an interaction between a human and a machine.
A speech processing system has mainly three tasks −
- First, speech recognition that allows the machine to catch the words, phrases and sentences we speak
- Second, natural language processing to allow the machine to understand what we speak, and
- Third, speech synthesis to allow the machine to speak.
This chapter focuses on speech recognition, the process of understanding the words spoken by human beings. Remember that speech signals are captured with the help of a microphone and must then be understood by the system.
Building a Speech Recognizer
Speech Recognition or Automatic Speech Recognition (ASR) is the center of attention for AI projects like robotics. Without ASR, it is not possible to imagine a cognitive robot interacting with a human. However, building a speech recognizer is not quite easy.
Difficulties in developing a speech recognition system
Developing a high-quality speech recognition system is a genuinely difficult problem. The difficulty of speech recognition technology can be broadly characterized along the dimensions discussed below −
- Size of the vocabulary − The size of the vocabulary impacts the ease of developing an ASR. In general, the smaller the vocabulary, the easier the recognition task; a vocabulary of tens of thousands of words is far harder to handle than one of a few dozen words.
- Channel characteristics − Channel quality is also an important dimension. For example, human speech contains high bandwidth with a full frequency range, while telephone speech consists of low bandwidth with a limited frequency range. Note that recognition is harder in the latter case.
- Speaking mode − Ease of developing an ASR also depends on the speaking mode, that is, whether the speech is in isolated word mode, connected word mode, or continuous speech mode. Note that continuous speech is harder to recognize.
- Speaking style − Read speech may be in a formal style, or spontaneous and conversational in a casual style. The latter is harder to recognize.
- Speaker dependency − Speech can be speaker dependent, speaker adaptive, or speaker independent. A speaker-independent system is the hardest to build.
- Type of noise − Noise is another factor to consider while developing an ASR. The signal-to-noise ratio may fall in various ranges, depending on whether the acoustic environment contributes less or more background noise (a short sketch of computing the SNR follows this list).
- Microphone characteristics − The quality of the microphone may be good, average, or below average. Also, the distance between the mouth and the microphone can vary. These factors should also be considered for recognition systems.
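To make the noise dimension concrete, here is a minimal sketch (an addition to the tutorial; the 440 Hz tone, one-second duration and noise level are arbitrary illustrative choices) that computes the signal-to-noise ratio in decibels for a synthetic signal −

import numpy as np

# Illustrative clean signal: one second of a 440 Hz tone at 44100 samples/sec
clean_signal = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 44100))

# Additive Gaussian noise with an assumed standard deviation of 0.1
noise = np.random.normal(0, 0.1, clean_signal.shape)

# SNR in dB is ten times the base-10 log of the signal-to-noise power ratio
snr_db = 10 * np.log10(np.mean(clean_signal ** 2) / np.mean(noise ** 2))
print('SNR:', round(snr_db, 2), 'dB')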
Despite these difficulties, researchers have worked extensively on various aspects of speech, such as understanding the speech signal, identifying the speaker, and recognizing accents.
You will have to follow the steps given below to build a speech recognizer −
Visualizing Audio Signals - Reading from a File and Working on it
This is the first step in building a speech recognition system, as it gives an understanding of how an audio signal is structured. Some common steps that can be followed to work with audio signals are as follows −
Recording
To have an audio signal you can read from a file, you first need to record it using a microphone.
Sampling
When recording with a microphone, the signals are stored in a digitized form. But to work on them, the machine needs them in discrete numeric form. Hence, we should perform sampling at a certain frequency and convert the signal into discrete numerical form. Choosing a high sampling frequency implies that when humans listen to the signal, they perceive it as a continuous audio signal.
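As a minimal illustration of this idea (an addition; the 5 Hz tone and the two sampling rates are assumed values), the sketch below samples the same sine wave at a low and a high rate. The higher rate yields far more discrete values per second, which is why the reconstructed signal sounds continuous to a listener −

import numpy as np

# Sample one second of a 5 Hz sine wave at two different rates
for rate in (10, 1000):
   t = np.arange(0, 1, 1.0 / rate)          # discrete sampling instants
   samples = np.sin(2 * np.pi * 5 * t)      # sampled signal values
   print(rate, 'Hz sampling yields', len(samples), 'discrete values')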
Example
The following example shows a stepwise approach to analyzing an audio signal, using Python, that is stored in a file. The frequency of this audio signal is 44,100 Hz.
Import the necessary packages as shown here −
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
Now, read the stored audio file. It will return two values: the sampling frequency and the audio signal. Provide the path of the audio file where it is stored, as shown here −
frequency_sampling, audio_signal = wavfile.read("/Users/admin/audio_file.wav")
Display parameters like the sampling frequency of the audio signal, the data type of the signal and its duration, using the commands shown −
print('\nSignal shape:', audio_signal.shape)
print('Signal Datatype:', audio_signal.dtype)
print('Signal duration:', round(audio_signal.shape[0] /
float(frequency_sampling), 2), 'seconds')
This step involves normalizing the signal as shown below −
audio_signal = audio_signal / np.power(2, 15)
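Dividing by 2^15 = 32768 rescales the 16-bit integer samples, whose range is -32768 to 32767, into floating-point values lying roughly in [-1, 1).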
In this step, we extract the first 100 values from this signal to visualize it. Use the following commands for this purpose −
audio_signal = audio_signal[:100]
time_axis = 1000 * np.arange(0, len(audio_signal), 1) / float(frequency_sampling)
Now, visualize the signal using the commands given below −
plt.plot(time_axis, audio_signal, color='blue')
plt.xlabel('Time (milliseconds)')
plt.ylabel('Amplitude')
plt.title('Input audio signal')
plt.show()
You will be able to see an output graph and the data extracted for the above audio signal, as shown in the image here −
Signal shape: (132300,)
Signal Datatype: int16
Signal duration: 3.0 seconds
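The reported duration follows directly from the sample count and the sampling frequency: 132300 samples / 44100 Hz = 3.0 seconds.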
Characterizing the Audio Signal: Transforming to Frequency Domain
Characterizing an audio signal involves converting the time-domain signal into the frequency domain and understanding its frequency components. This is an important step because it gives a lot of information about the signal. You can use a mathematical tool like the Fourier Transform to perform this transformation.
Example
The following example shows, step by step, how to characterize a signal, using Python, that is stored in a file. Note that here we are using the Fourier Transform to convert it into the frequency domain.
Import the necessary packages, as shown here −
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
Now, read the stored audio file. It will return two values: the sampling frequency and the audio signal. Provide the path of the audio file where it is stored, as shown in the command here −
frequency_sampling, audio_signal = wavfile.read("/Users/admin/sample.wav")
In this step, we will display parameters like the sampling frequency of the audio signal, the data type of the signal and its duration, using the commands given below −
print('\nSignal shape:', audio_signal.shape)
print('Signal Datatype:', audio_signal.dtype)
print('Signal duration:', round(audio_signal.shape[0] /
float(frequency_sampling), 2), 'seconds')
In this step, we need to normalize the signal, as shown in the following command −
audio_signal = audio_signal / np.power(2, 15)
This step involves extracting the length and half length of the signal. Use the following commands for this purpose −
length_signal = len(audio_signal)
half_length = np.ceil((length_signal + 1) / 2.0).astype(int)
Now, we need to apply a mathematical tool to transform the signal into the frequency domain. Here we are using the Fourier Transform.
signal_frequency = np.fft.fft(audio_signal)
Now, normalize the frequency-domain signal and square it −
signal_frequency = abs(signal_frequency[0:half_length]) / length_signal
signal_frequency **= 2
Next, extract the length of the frequency-transformed signal −
len_fts = len(signal_frequency)
Note that the Fourier-transformed signal must be adjusted for both even- and odd-length cases.
if length_signal % 2:
   signal_frequency[1:len_fts] *= 2
else:
   signal_frequency[1:len_fts-1] *= 2
Now, extract the power in decibels (dB) −
signal_power = 10 * np.log10(signal_frequency)
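Note that np.log10 returns negative infinity for any frequency bin whose power is exactly zero. A common guard (our assumption, not part of the original code) is to add a tiny offset before taking the logarithm −

signal_power = 10 * np.log10(signal_frequency + 1e-12)   # epsilon avoids log10(0)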
Adjust the frequency in kHz for the X-axis −
x_axis = np.arange(0, len_fts, 1) * (frequency_sampling / length_signal) / 1000.0
Now, visualize the characterization of the signal as follows −
plt.figure()
plt.plot(x_axis, signal_power, color='black')
plt.xlabel('Frequency (kHz)')
plt.ylabel('Signal power (dB)')
plt.show()
You can observe the output graph of the above code as shown in the image below −
Generating Monotone Audio Signal
The two steps you have seen so far are important for learning about signals. Now, this step will be useful if you want to generate an audio signal with some predefined parameters. Note that this step will save the audio signal to an output file.
Example
In the following example, we are going to generate a monotone signal, using Python, which will be stored in a file. For this, you will have to take the following steps −
Import the necessary packages as shown −
import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write
Provide the file where the output should be saved −
output_file = 'audio_signal_generated.wav'
Now, specify the parameters of your choice, as shown −
duration = 4 # in seconds
frequency_sampling = 44100 # in Hz
frequency_tone = 784
min_val = -4 * np.pi
max_val = 4 * np.pi
In this step, we can generate the audio signal, as shown −
t = np.linspace(min_val, max_val, duration * frequency_sampling)
audio_signal = np.sin(2 * np.pi * frequency_tone * t)
Now, scale the signal into the 16-bit integer range and save the audio to the output file −
signal_scaled = np.int16(audio_signal / np.max(np.abs(audio_signal)) * 32767)
write(output_file, frequency_sampling, signal_scaled)
Extract the first 100 values for our graph, as shown −
audio_signal = audio_signal[:100]
time_axis = 1000 * np.arange(0, len(audio_signal), 1) / float(frequency_sampling)
Now, visualize the generated audio signal as follows −
plt.plot(time_axis, audio_signal, color='blue')
plt.xlabel('Time in milliseconds')
plt.ylabel('Amplitude')
plt.title('Generated audio signal')
plt.show()
You can observe the plot as shown in the figure given here −
Feature Extraction from Speech
This is the most important step in building a speech recognizer because, after converting the speech signal into the frequency domain, we must convert it into a usable feature-vector form. We can use different feature extraction techniques, like MFCC, PLP, PLP-RASTA, etc., for this purpose.
Example
In the following example, we are going to extract features from a signal, step by step, using the MFCC technique in Python.
Import the necessary packages, as shown here −
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from python_speech_features import mfcc, logfbank
Now, read the stored audio file. It will return two values − the sampling frequency and the audio signal. Provide the path of the audio file where it is stored.
frequency_sampling, audio_signal = wavfile.read("/Users/admin/audio_file.wav")
Note that here we are taking the first 15000 samples for analysis.
audio_signal = audio_signal[:15000]
Use the MFCC technique and execute the following command to extract the MFCC features −
features_mfcc = mfcc(audio_signal, frequency_sampling)
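The call above relies on the library's default analysis settings. If you want to make them explicit, the same call can be written with its main parameters spelled out; the values below are, as far as we know, the python_speech_features defaults, shown here only for illustration −

features_mfcc = mfcc(audio_signal, frequency_sampling,
   winlen=0.025,    # 25 ms analysis window
   winstep=0.01,    # 10 ms step between successive windows
   numcep=13)       # 13 cepstral coefficients per window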
Now, print the MFCC parameters, as shown −
print('\nMFCC:\nNumber of windows =', features_mfcc.shape[0])
print('Length of each feature =', features_mfcc.shape[1])
Now, plot and visualize the MFCC features using the commands given below −
features_mfcc = features_mfcc.T
plt.matshow(features_mfcc)
plt.title('MFCC')
In this step, we work with the filter bank features as shown −
Extract the filter bank features −
filterbank_features = logfbank(audio_signal, frequency_sampling)
Now, print the filterbank parameters.
print('\nFilter bank:\nNumber of windows =', filterbank_features.shape[0])
print('Length of each feature =', filterbank_features.shape[1])
Now, plot and visualize the filterbank features.
filterbank_features = filterbank_features.T
plt.matshow(filterbank_features)
plt.title('Filter bank')
plt.show()
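Assuming an unmodified python_speech_features installation, each MFCC frame contains 13 coefficients while each filter-bank frame contains 26 log filter-bank energies, which is why the two plots differ in height.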
As a result of the steps above, you can observe the following outputs: Figure 1 for MFCC and Figure 2 for Filter Bank.
Recognition of Spoken Words
Speech recognition means that when humans speak, a machine understands it. Here, we are using the Google Speech API in Python to make it happen. We need to install the following packages for this −
- PyAudio − It can be installed by using the pip install PyAudio command.
- SpeechRecognition − This package can be installed by using pip install SpeechRecognition.
- Google-Speech-API − It can be installed by using the command pip install google-api-python-client.
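Note that recognize_google sends the captured audio to Google's web service for transcription, so an active internet connection is required when the script runs.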
Example
Observe the following example to understand recognition of spoken words −
Import the necessary packages as shown −
import speech_recognition as sr
Create a Recognizer object as shown below −
recording = sr.Recognizer()
Now, the Microphone() module will take the voice as input −
with sr.Microphone() as source:
   recording.adjust_for_ambient_noise(source)
   print("Please Say something:")
   audio = recording.listen(source)
Now, the Google API will recognize the voice and give the output.
try:
   print("You said: \n" + recording.recognize_google(audio))
except Exception as e:
   print(e)
You can see the following output −
Please Say something:
You said:
For example, if you said tutorialspoint.com, then the system recognizes it correctly as follows −
tutorialspoint.com
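The same recognizer can also transcribe audio that is already stored in a file, which ties this section back to the earlier file-based examples. The following sketch is our own variant (the file path is a hypothetical placeholder) using the AudioFile interface of the SpeechRecognition package −

import speech_recognition as sr

recording = sr.Recognizer()

# Load a stored WAV file; the path below is a hypothetical placeholder
with sr.AudioFile("/Users/admin/audio_file.wav") as source:
   audio = recording.record(source)

try:
   print("File contains: \n" + recording.recognize_google(audio))
except Exception as e:
   print(e)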