User:Lapischera/Mel-frequency cepstrum

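The mel-frequency cepstrum (MFC) is a feature representation used in sound processing. It represents the short-term power spectrum of a sound by applying a linear cosine transform to the log power spectrum on a nonlinear mel scale of frequency.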

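Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up an MFC.^[1] They are derived from a cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC the frequency bands are equally spaced on the mel scale, which approximates the response of the human auditory system more closely than the linearly spaced frequency bands used in the ordinary spectrum. This frequency warping allows a better representation of sound, for example in audio compression that reduces the transmission bandwidth and storage requirements of audio signals.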

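MFCCs are commonly derived as follows:^[2]^[3]

Take the Fourier transform of (a windowed excerpt of) the audio signal.
Map the powers of the spectrum obtained above onto the mel scale, using overlapping windows.
Take the log of the power at each mel frequency.
Take the discrete cosine transform of the log-mel spectrogram, treating it as a signal.
The MFCCs are the amplitudes of the resulting spectrum.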
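The five steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the frame length, hop size, number of filters and number of coefficients are assumed values, the triangular filter shape is one common choice, and the mel mapping is the $1000\log_2(1+f/1000)$ form that appears in the speaker recognition section of this article. All function names are hypothetical.

```python
import numpy as np


def hz_to_mel(f):
    # Mel mapping in the form 1000 * log2(1 + f / 1000); other variants exist.
    return 1000.0 * np.log2(1.0 + np.asarray(f, dtype=float) / 1000.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 1000.0 * (2.0 ** (np.asarray(m, dtype=float) / 1000.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular, overlapping filters whose edges are equally spaced in mel."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, fft_freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc(signal, sample_rate, frame_len=400, hop=160, n_filters=26, n_coeffs=13):
    signal = np.asarray(signal, dtype=float)
    # Step 1: cut into overlapping windowed frames and take the Fourier transform.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Step 2: map the power spectrum onto the mel scale.
    mel_power = power @ mel_filterbank(n_filters, frame_len, sample_rate).T
    # Step 3: log of the power in each mel band (small floor avoids log(0)).
    log_mel = np.log(mel_power + 1e-10)
    # Steps 4 and 5: unnormalised DCT-II along the mel axis; the leading
    # amplitudes (index 0 included here) are returned as the MFCCs.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (n[None, :] + 0.5) / n_filters)
    return log_mel @ basis.T


sr = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # one second of a 440 Hz tone
features = mfcc(tone, sr)
print(features.shape)  # (98, 13): 98 frames, 13 coefficients each
```

Each row of the result describes one frame. Production libraries differ in details such as DCT normalisation, pre-emphasis and liftering, so their values will not match this sketch exactly.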

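Beyond this basic procedure there are many variations, for example differences in the shape or spacing of the windows used to map the scale,^[4] or the addition of dynamics features such as "delta" and "delta-delta" (first- and second-order frame-to-frame difference) coefficients.^[5]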
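As an illustration of the dynamics features, the sketch below appends first- and second-order differences to a matrix of MFCCs. It assumes one row per frame and uses `np.gradient` (central differences, one-sided at the ends) as the difference operator; practical systems often use a regression over several neighbouring frames instead. The function name `add_deltas` is hypothetical.

```python
import numpy as np


def add_deltas(mfccs):
    """Append delta and delta-delta coefficients.

    mfccs has shape (n_frames, n_coeffs); the result has 3 * n_coeffs columns.
    """
    mfccs = np.asarray(mfccs, dtype=float)
    # Central differences inside the sequence, one-sided differences at the ends.
    delta = np.gradient(mfccs, axis=0)
    delta_delta = np.gradient(delta, axis=0)
    return np.hstack([mfccs, delta, delta_delta])


# Toy input: the first coefficient changes over time, the second is constant.
static = np.array([[0.0, 1.0], [1.0, 1.0], [4.0, 1.0], [9.0, 1.0]])
features = add_deltas(static)
print(features.shape)  # (4, 6)
```

The constant coefficient yields zero delta and delta-delta columns, which is the intended behaviour: these features respond only to change between frames.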

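Applications

MFCCs are commonly used as features in speech recognition systems,^[6] such as systems that automatically recognize numbers spoken into a telephone.

MFCCs are also used in music information retrieval, for tasks such as genre classification and audio similarity measurement.^[7]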

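MFCC for speaker recognition

Because the frequency bands in MFCCs are evenly distributed on the mel scale, which closely resembles the response of the human auditory system, MFCCs can be used efficiently to characterize speakers. For instance, they can be used to recognize the model of the speaker's cell phone and, beyond that, details of the speaker.

This kind of device recognition is possible because the electronic components of a phone are manufactured with tolerances, so different realizations of the same circuit do not have exactly the same transfer function. The dissimilarity between transfer functions becomes more pronounced when the circuits performing the task come from different manufacturers. Each cell phone therefore introduces a convolutional distortion on the input speech that leaves a unique imprint on its recordings. In the frequency domain, the original spectrum is multiplied by the transfer function specific to each phone, and signal processing techniques are then applied to the result. Thus, by using MFCCs one can characterize cell phone recordings to identify the brand and model of the phone.

Consider the recording section of a cell phone as a linear time-invariant (LTI) filter with impulse response $h(n)$; the recorded speech signal $y(n)$ is the output of the filter in response to the input $x(n)$.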

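Hence $y(n)=x(n)*h(n)$ (convolution). Because speech is not a stationary signal, it is divided into overlapping frames within which the signal is assumed to be stationary. The $p$-th short-term segment (frame) of the recorded input speech can then be written as

$y_{p}w(n)=[x(n)w(pW-n)]*h(n),$

where $w(n)$ is a window function of length $W$.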

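The footprint of the cell phone in the recorded speech is thus the convolutional distortion, which helps to identify the recording phone. The identity of the phone embedded in the signal needs to be converted into a more easily identifiable form, so the short-time Fourier transform (STFT) is taken:

$Y_{p}w(f)=X_{p}w(f)H(f)$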

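$H(f)$ can be regarded as a transfer function cascaded with the system that produced the input speech, so that the recorded speech $Y_{p}w(f)$ can be viewed as speech originating from the cell phone itself. The equivalent transfer function of the vocal tract and the cell phone recorder is therefore treated as the original source of the recorded speech. Therefore,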

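$X_{p}w(f)=Xe_{p}w(f)X_{v}(f),\quad H'(f)=H(f)X_{v}(f),$

where $Xe_{p}w(f)$ is the excitation function, $X_{v}(f)$ is the vocal tract transfer function for the speech in the $p$-th frame, and $H'(f)$ is the equivalent transfer function that characterizes the cell phone. Substituting gives

$Y_{p}w(f)=Xe_{p}w(f)H'(f)$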

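Because device identification and speaker identification are closely connected, this approach is useful for speaker recognition. Emphasis is placed on the envelope of the spectrum, which is multiplied by a filter bank (a mel-scale filter bank, suitable for the cepstrum). After smoothing by the filter bank with transfer function $U(f)$, the log of the output energies is:

$\log[|Y_{p}w(f)|]=\log[|U(f)||Xe_{p}w(f)||H'(f)|]$

Writing $H_{w}(f)=U(f)H'(f)$, this becomes

$\log[|Y_{p}w(f)|]=\log[|Xe_{p}w(f)|]+\log[|H_{w}(f)|]$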

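MFCCs owe their success to this nonlinear transformation and its additive property.

Transforming back to the time domain:

$c_{y}(j)=c_{e}(j)+c_{w}(j)$

where $c_{y}(j)$, $c_{e}(j)$ and $c_{w}(j)$ are the cepstra of the recorded speech, of the excitation, and of the weighted equivalent impulse response of the cell phone recorder that characterizes the phone, respectively, and $j$ indexes the coefficients, up to the number of filters in the filter bank. More precisely, the device-specific information is contained in the recorded speech, converted into an additive form suitable for identification. $c_{y}(j)$ is processed further to identify the recording phone.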
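The chain from convolution to additive cepstra can be checked numerically. The sketch below is an illustration under assumptions: it uses a random vector in place of a windowed speech frame, a made-up four-tap impulse response in place of a real device, and the plain real cepstrum (inverse FFT of the log magnitude) rather than the mel-filtered version, so it omits the smoothing by $U(f)$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)              # stand-in for one windowed speech frame
h = np.array([1.0, 0.5, 0.25, 0.125])    # hypothetical device impulse response
y = np.convolve(x, h)                    # y(n) = x(n) * h(n), length 64 + 4 - 1 = 67

# With an FFT length of at least len(y), convolution becomes an exact product.
n_fft = 128
X, H, Y = (np.fft.rfft(s, n_fft) for s in (x, h, y))
assert np.allclose(Y, X * H)


def real_cepstrum(spectrum):
    # Log turns the product of magnitudes into a sum; the inverse transform is
    # linear, so the sum carries over to the cepstrum.
    return np.fft.irfft(np.log(np.abs(spectrum)))


c_x, c_h, c_y = real_cepstrum(X), real_cepstrum(H), real_cepstrum(Y)
assert np.allclose(c_y, c_x + c_h)
```

The device term `c_h` is the same for every frame recorded on that device, which is what makes it usable as a fingerprint.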

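A commonly used frame length is 20 ms.

Commonly used window functions are the Hamming and Hanning windows.

The mel scale is a commonly used frequency scale that is approximately linear up to 1000 Hz and logarithmic above it.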

The centre frequencies of the filters on the mel scale are computed from:

$f_{mel}=1000\log(1+f/1000)/\log 2,$

with logarithms taken to base 10.
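The formula can be evaluated directly. The sketch below implements it along with its inverse, which follows from solving the formula for $f$, and uses both to place filter centres at equal mel intervals. The helper names and the choice of 0 Hz as the lower edge are assumptions made for illustration.

```python
import math


def hz_to_mel(f_hz):
    """f_mel = 1000 * log(1 + f / 1000) / log(2), logarithms to base 10."""
    return 1000.0 * math.log10(1.0 + f_hz / 1000.0) / math.log10(2.0)


def mel_to_hz(f_mel):
    """Inverse mapping, obtained by solving the formula above for f."""
    return 1000.0 * (2.0 ** (f_mel / 1000.0) - 1.0)


def filter_centres_hz(n_filters, f_max_hz):
    """Centre frequencies equally spaced in mel between 0 Hz and f_max_hz."""
    step = hz_to_mel(f_max_hz) / (n_filters + 1)
    return [mel_to_hz(step * k) for k in range(1, n_filters + 1)]


print(hz_to_mel(1000.0))                # 1000 Hz maps to about 1000 mel
print(filter_centres_hz(5, 4000.0))     # spacing in Hz widens with frequency
```

Because the formula is a ratio of two logarithms, the result does not depend on the base used, as long as both logarithms use the same one.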

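Basic procedure for MFCC calculation:

Logarithmic filter bank outputs are produced and multiplied by 20 to obtain the spectral envelope in decibels.
MFCCs are obtained by taking the discrete cosine transform (DCT) of the spectral envelope.
The cepstrum coefficients are obtained as:

$c_{i}=\sum_{n=1}^{N_{f}}S_{n}\cos\left[i(n-0.5)\left(\frac{\pi}{N_{f}}\right)\right],\quad i=1,2,\ldots,L,$

where $c_{i}=c_{y}(i)$ is the $i$-th MFCC coefficient, $N_{f}$ is the number of triangular filters in the filter bank, $S_{n}$ is the log energy output of the $n$-th filter, and $L$ is the number of MFCC coefficients to be calculated.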
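The summation translates directly into code. The sketch below is a literal transcription of the formula in plain Python, preceded by the decibel conversion from the first step. The input amplitudes are made-up values and the function names are hypothetical.

```python
import math


def to_decibels(filter_amplitudes):
    """20 * log10 of each filter-bank output amplitude."""
    return [20.0 * math.log10(a) for a in filter_amplitudes]


def mfcc_from_log_energies(log_energies, n_coeffs):
    """c_i = sum_{n=1}^{N_f} S_n cos[i (n - 0.5) pi / N_f], for i = 1..L."""
    n_f = len(log_energies)
    return [
        sum(s_n * math.cos(i * (n - 0.5) * math.pi / n_f)
            for n, s_n in enumerate(log_energies, start=1))
        for i in range(1, n_coeffs + 1)
    ]


# Hypothetical filter-bank output amplitudes, for illustration only.
amplitudes = [1.0, 10.0, 100.0, 10.0]
s = to_decibels(amplitudes)                      # 0, 20, 40 and 20 dB
coeffs = mfcc_from_log_energies(s, n_coeffs=2)
print(coeffs)
```

Since $i$ starts at 1, a constant offset in all $S_{n}$ (an overall gain change) leaves every coefficient unchanged: each cosine basis function sums to zero over the $N_{f}$ filters.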

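Inversion

An MFCC can be approximately inverted to audio in four steps.

(a1) Invert the DCT to obtain a log-mel (decibel) spectrogram.

(a2) Map to power to obtain a mel power spectrogram.

(b1) Rescale to obtain short-time Fourier transform magnitudes.

(b2) Reconstruct the phase and synthesize audio using Griffin-Lim.

Each step corresponds to one step of the MFCC calculation.^[8]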
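Steps (a1) and (a2) can be sketched as below. The example assumes the forward transform was an orthonormal DCT-II applied to a decibel-scaled mel spectrogram, so that the transpose of the truncated basis serves as the approximate inverse; steps (b1) and (b2) are not covered. The function names are hypothetical, and a library routine such as the `librosa.feature.inverse.mfcc_to_audio` function cited above covers the full chain.

```python
import numpy as np


def dct_matrix(n_coeffs, n_mels):
    """First n_coeffs rows of the orthonormal DCT-II matrix of size n_mels."""
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (n[None, :] + 0.5) / n_mels)
    basis *= np.sqrt(2.0 / n_mels)
    basis[0] /= np.sqrt(2.0)
    return basis


def mfcc_to_mel_power(mfccs, n_mels):
    """Steps (a1) and (a2): MFCCs -> log-mel in dB -> mel power."""
    basis = dct_matrix(mfccs.shape[1], n_mels)
    log_mel_db = mfccs @ basis            # (a1) discarded coefficients count as zero
    return 10.0 ** (log_mel_db / 10.0)    # (a2) decibels back to power


# Two toy frames with four mel bands each.
mel_power = np.array([[1.0, 4.0, 9.0, 16.0], [2.0, 2.0, 2.0, 2.0]])
log_mel_db = 10.0 * np.log10(mel_power)
full = log_mel_db @ dct_matrix(4, 4).T    # forward DCT keeping every coefficient

# With all coefficients kept the round trip is exact; truncation makes it lossy.
assert np.allclose(mfcc_to_mel_power(full, 4), mel_power)
approx = mfcc_to_mel_power(full[:, :2], 4)
```

The inversion is only approximate in practice because MFCCs usually keep far fewer coefficients than there are mel bands, and because the phase discarded in the forward direction has to be estimated.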

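Noise sensitivity

MFCC values are not very robust in the presence of additive noise, so speech recognition systems commonly normalize them to lessen the influence of noise. Some researchers have proposed modifications to the basic MFCC algorithm to improve robustness, such as raising the log-mel amplitudes to a suitable power (around 2 or 3) before taking the discrete cosine transform (DCT), which reduces the influence of low-energy components.^[9]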
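The text does not say which normalization is meant. One widely used option, shown here purely as an assumed example, is cepstral mean and variance normalization over the frames of an utterance. The function name `cmvn` is hypothetical.

```python
import numpy as np


def cmvn(mfccs, eps=1e-8):
    """Per-coefficient mean and variance normalisation over all frames.

    mfccs has shape (n_frames, n_coeffs).
    """
    mfccs = np.asarray(mfccs, dtype=float)
    mean = mfccs.mean(axis=0, keepdims=True)
    std = mfccs.std(axis=0, keepdims=True)
    return (mfccs - mean) / (std + eps)   # eps guards against constant columns


rng = np.random.default_rng(0)
clean = rng.standard_normal((50, 13))     # stand-in for the MFCCs of one utterance
offset = clean + 3.0                      # same frames with a constant cepstral offset
assert np.allclose(cmvn(clean), cmvn(offset))
```

Subtracting the mean removes any term that is constant across frames, such as a fixed channel contribution in the cepstral domain. Additive noise is not constant in this domain, so this normalization reduces its effect without removing it.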

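History

Paul Mermelstein^[10]^[11] is generally credited with the development of the MFC. Mermelstein credits Bridle and Brown^[12] for the idea:

Bridle and Brown used a set of 19 weighted spectrum-shape coefficients given by the cosine transform of the outputs of a set of nonuniformly spaced bandpass filters. The filter spacing is chosen to be logarithmic above 1 kHz and the filter bandwidths are increased there as well. We will, therefore, call these the mel-based cepstral parameters.^[10]

Sometimes both are cited as early originators.^[13]

Many authors, including Davis and Mermelstein,^[11] have commented that the spectral basis functions of the cosine transform in the MFC are very similar to the principal components of the log spectra, which were applied to speech representation and recognition much earlier by Pols and his colleagues.^[14]^[15]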

References

[1] Min Xu (2004). "HMM-based audio keyword generation". In Kiyoharu Aizawa; Yuichi Nakamura; Shin'ichi Satoh. Advances in Multimedia Information Processing – PCM 2004: 5th Pacific Rim Conference on Multimedia. Springer. ISBN 978-3-540-23985-7. Archived from the original on 2007-05-10.
[2] Sahidullah, Md.; Saha, Goutam (May 2012). "Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition". Speech Communication 54 (4): 543–565. doi:10.1016/j.specom.2011.11.004.
[3] Abdulsatar, Assim Ara; Davydov, V V; Yushkova, V V; Glinushkin, A P; Rud, V Yu (2019-12-01). "Age and gender recognition from speech signals". Journal of Physics: Conference Series 1410 (1): 012073. Bibcode: 2019JPhCS1410a2073A. doi:10.1088/1742-6596/1410/1/012073. ISSN 1742-6588.
[4] Fang Zheng, Guoliang Zhang and Zhanjiang Song (2001), "Comparison of Different Implementations of MFCC," J. Computer Science & Technology, 16(6): 582–589.
[5] S. Furui (1986), "Speaker-independent isolated word recognition based on emphasized spectral dynamics"
[6] T. Ganchev, N. Fakotakis, and G. Kokkinakis (2005), "Comparative evaluation of various MFCC implementations on the speaker verification task," in 10th International Conference on Speech and Computer (SPECOM 2005), Vol. 1, pp. 191–194. Archived 2011-07-17 at the Wayback Machine.
[7] Meinard Müller (2007). Information Retrieval for Music and Motion. Springer. p. 65. ISBN 978-3-540-74047-6
[8] "librosa.feature.inverse.mfcc_to_audio — librosa 0.10.0 documentation". librosa.org. Retrieved 2023-09-29.
[9] V. Tyagi and C. Wellekens (2005), "On desensitizing the Mel-Cepstrum to spurious spectral components for Robust Speech Recognition," in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on, vol. 1, pp. 529–532.
[10] P. Mermelstein (1976), "Distance measures for speech recognition, psychological and instrumental," in Pattern Recognition and Artificial Intelligence, C. H. Chen, Ed., pp. 374–388. Academic, New York.
[11] S.B. Davis, and P. Mermelstein (1980), "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," in IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), pp. 357–366.
[12] J. S. Bridle and M. D. Brown (1974), "An Experimental Automatic Word-Recognition System", JSRU Report No. 1003, Joint Speech Research Unit, Ruislip, England.
[13] Nelson Morgan; Hervé Bourlard & Hynek Hermansky (2004). "Automatic Speech Recognition: An Auditory Perspective". In Steven Greenberg & William A. Ainsworth. Speech Processing in the Auditory System. Springer. p. 315. ISBN 978-0-387-00590-4
[14] L. C. W. Pols (1966), "Spectral Analysis and Identification of Dutch Vowels in Monosyllabic Words," Doctoral dissertation, Free University, Amsterdam, the Netherlands
[15] R. Plomp, L. C. W. Pols, and J. P. van de Geer (1967). "Dimensional analysis of vowel spectra." J. Acoustical Society of America, 41(3):707–712.

[[Category:信号処理]]
