Erlang for Speech Recognition

Ivan Uemlianin
Dear All

I am developing a suite of software tools for use in language technology
(primarily speech recognition).  A lot of the procedures in speech
technology have a very "map-reduce" feel to them and I think Erlang
would be a good fit.

Below I list and briefly describe the tools I'm developing.  Does anyone
know if there are similar current Erlang projects? (I have looked).

As the whole lot is needed for a speech rec system, I don't quite know
how I should proceed: should I write the easiest component first
(probably the language model builder/server), the hardest (probably the
audio preprocessor), the most useful outside of speech recognition
(probably the hidden Markov model builder/server), ...?


** Audio Preprocessor

Automatic Speech Recognition (ASR) is essentially mapping a sequence of
integers (i.e., acoustic signals) onto a sequence of linguistic symbols
(i.e., phonemes (units of linguistic sound) or words).  The raw audio
data (e.g. from a wav file or a microphone) is not terribly useful for
this and the first step is to convert this data into a more useful
abstract representation.  Every 10ms or so, a short window of sound is
transformed into a feature vector of 39 features, known as Mel-Frequency
Cepstral Co-efficients (MFCCs).

The first step in recognition, or in training a recogniser, is always to
convert the audio like this.
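
As a rough illustration of the framing part of this step, here is a minimal sketch in Python (chosen only for brevity; it is not the make_feats code from the post or from Sphinx).  It assumes a 25ms window advanced every 10ms, and it assumes the 39 features are 13 static coefficients plus their first and second differences.  Both are common conventions but are assumptions here, and the names frame_signal and stack_deltas are hypothetical.  The cepstral analysis itself is left out.

```python
def frame_signal(samples, sample_rate, win_ms=25, hop_ms=10):
    """Split a sequence of PCM samples into overlapping frames."""
    win = sample_rate * win_ms // 1000   # samples per frame
    hop = sample_rate * hop_ms // 1000   # samples between frame starts
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample")
    frames = []
    start = 0
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames


def deltas(vectors):
    """Central differences over a list of equal-length vectors.

    The first and last vectors are repeated at the edges.  Real front
    ends usually use a wider regression window; this is a simplification.
    """
    n = len(vectors)
    out = []
    for t in range(n):
        prev = vectors[max(t - 1, 0)]
        nxt = vectors[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_deltas(static):
    """Append deltas and delta-deltas to each static feature vector."""
    d = deltas(static)
    dd = deltas(d)
    return [s + x + y for s, x, y in zip(static, d, dd)]
```

With 13 static coefficients per frame, stack_deltas yields vectors of length 13 * 3 = 39.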

A while ago I wrote a little script to read and write wav files [1].  I
also have a 'dummy' make_feats.erl which sets out the imagined tasks for
converting audio data into MFCCs.  However, there are two possible ways
ahead:

1. Write the whole lot from scratch in Erlang.

2. There is a mature, respected open-source speech recognition toolkit,
written in C, called Sphinx [2].  Sphinx has a make_feats function which
could be called as a port or a NIF.  The Sphinx make_feats doesn't work
quite how I'd like, so some changes would have to be made (Sphinx is
released under a BSD-style license).

(2) would avoid working out how to implement the maths, but (1) would
avoid fiddling around inside someone else's C.  Any advice?


** Hidden Markov Model Builder/Server

A hidden Markov model (HMM) is essentially a finite state machine with
probabilities given to the edges connecting states.  We train up an HMM
for each phoneme of a language (HMMs are trained using MFCC sequences).

The foundational recognition task is recognising a single phoneme (this
is then conditioned by probabilities of different phoneme sequences).
So, we take an MFCC sequence, match it up against each HMM in turn and
ask, "What is the probability that this HMM could have produced this
MFCC sequence?"
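
The question "what is the probability that this HMM produced this sequence?" is usually answered with the forward algorithm.  The sketch below, in Python for brevity, handles only discrete symbol sequences (as in the toy recogniser mentioned later), not real-valued MFCC vectors.  The two toy models and all names in it are made up for illustration.

```python
def forward_probability(obs, start_p, trans_p, emit_p):
    """P(obs | model) for a discrete-output HMM, by the forward algorithm."""
    if not obs:
        return 1.0
    states = list(start_p)
    # alpha[s] = P(observations so far, current state is s)
    alpha = {s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s].get(symbol, 0.0)
            * sum(alpha[r] * trans_p[r].get(s, 0.0) for r in states)
            for s in states
        }
    return sum(alpha.values())


def best_model(obs, models):
    """Name of the model that gives obs the highest probability."""
    return max(models, key=lambda name: forward_probability(obs, *models[name]))


# Two hypothetical "phoneme" models over the symbols 'a' and 'b'.
# Each entry is (start probabilities, transitions, emissions).
TRANS = {"s1": {"s1": 0.5, "s2": 0.5}, "s2": {"s2": 1.0}}
TOY_MODELS = {
    "A": ({"s1": 1.0, "s2": 0.0}, TRANS,
          {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}),
    "B": ({"s1": 1.0, "s2": 0.0}, TRANS,
          {"s1": {"a": 0.1, "b": 0.9}, "s2": {"a": 0.8, "b": 0.2}}),
}

if __name__ == "__main__":
    print(best_model(["a", "a", "b"], TOY_MODELS))
```

For long sequences the raw probabilities underflow, so practical implementations work with log probabilities or rescale alpha at each step.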

Although used mainly in ASR, HMMs can be used in speech synthesis (aka
text-to-speech), and they are used outside speech tech of course (e.g.
in finance [3]).

I have written a toy HMM trainer and recogniser, for simple symbol
sequences.  I think the sensible next step would be to tone this up and
test it against simple real-world data (I could compare performance and
results with the R HMM package [4]).  Once that seems stable, enhance
the code to work with sequences of real number vectors, build a phoneme
recogniser and compare with Sphinx.

The set of HMMs is referred to as the Acoustic Model (AM) of the
language.  Other mathematical models can be used but, since at least the
mid-80s, HMMs dominate.  There is some interesting recent work using
dynamic Bayesian networks, and using conditional random fields.


** Language Model Builder/Server

The Language Model (LM) furnishes probabilities of various linguistic
structures, and sits on top of, or collaborates with, the AM.

Sequences of phonemes are dealt with by a simple pronunciation
dictionary, which is just a list mapping sequences of phonemes to words.

Just as HMMs dominate AMs, the dominant model for syntactic structures
is the ngram grammar, which assigns probabilities to sequences of words
(the most common 'n' is 3, often called a trigram).

As well as their use in ASR, LMs are an essential component in
statistical machine translation.

I have written a toy LM builder, which assigns probabilities to trigrams
based on a given corpus.  I think the sensible next step would be to
tone this up to work with large corpora, and compare performance and
results against a standard open-source LM builder [5].
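
To show what "assigning probabilities to trigrams based on a corpus" can look like in its simplest form, here is a sketch in Python (for brevity) using plain relative frequencies.  It is not the toy LM builder from the post; the function names and the sentence padding symbols are assumptions, and there is no smoothing or backoff, which any builder intended for large corpora would need.

```python
from collections import Counter


def train_trigram_lm(sentences):
    """Count trigrams and their two-word contexts over tokenised sentences."""
    trigrams = Counter()
    contexts = Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            contexts[tuple(padded[i - 2:i])] += 1
    return trigrams, contexts


def trigram_prob(trigrams, contexts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 if context unseen."""
    seen = contexts[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
    tri, ctx = train_trigram_lm(corpus)
    print(trigram_prob(tri, ctx, "the", "cat", "sat"))
```

Because unseen trigrams get probability zero here, comparing against a real LM builder would mainly be a comparison of smoothing methods.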


** References

[1]
http://llaisdy.wordpress.com/2010/06/01/wave-erl-an-erlang-script-to-read-and-write-wav-files/

[2] http://cmusphinx.sourceforge.net/

[3]
http://www.optirisk-systems.com/events/application-of-hidden-markov-models-and-filters-to-financial-time-series-data.asp

[4] http://r-forge.r-project.org/projects/rhmm/

[5] i.e. irstlm (http://sourceforge.net/projects/irstlm/).  There are
several semi-open-source LM builders available for research use only,
but afaik irstlm is the only one that is bona fide open-source.


--
============================================================
Ivan A. Uemlianin
Speech Technology Research and Development

                     [hidden email]
                      www.llaisdy.com
                          llaisdy.wordpress.com
                      www.linkedin.com/in/ivanuemlianin

     "Froh, froh! Wie seine Sonnen, seine Sonnen fliegen"
                      (Schiller, Beethoven)
============================================================

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

Re: Erlang for Speech Recognition

Bob Paddock
> As the whole lot is needed for a speech rec system, I don't quite know how I
> should proceed: should I write the easiest component first (probably the
> language model builder/server), the hardest (probably the audio
> preprocessor), the most useful outside of speech recognition (probably the
> hidden Markov model builder/server), ...?

You start with the audio processor, and the rest of the front end.
If that doesn't work then any downstream work you do is wasted time.


> ** Audio Preprocessor
>
> The raw audio data
> (e.g. from a wav file or a microphone) is not terribly useful for this and
> the first step is to convert this data into a more useful abstract
> representation.

Consider a different approach such as Extrema Processing.
EP removes the identity of the speaker from the intelligence of what
the speaker said, which makes downstream matching easier.
If you are trying to do a voice identification security application,
where you need the identity of the speaker this approach would be
useless.

EP takes a signal input and converts it to the time domain with a
differentiator.  For example, a pure sine wave input would give a
transition at each peak and each valley of the sine, where the slope
changes sign.  Now your template matcher matches these time domain
signals.
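
Going only by the description above (this is not a reference implementation of Extrema Processing, and the function name is hypothetical), the differentiator step could be sketched in Python as finding the sample positions where the slope changes sign:

```python
import math


def extrema_indices(samples):
    """Indices of peaks and valleys: where the slope changes sign."""
    found = []
    prev_slope = 0
    for i in range(1, len(samples)):
        diff = samples[i] - samples[i - 1]
        slope = (diff > 0) - (diff < 0)      # +1 rising, -1 falling, 0 flat
        if slope != 0:
            if prev_slope != 0 and slope != prev_slope:
                found.append(i - 1)          # the sample where direction turned
            prev_slope = slope
    return found


if __name__ == "__main__":
    # One period of a sine wave sampled at 100 points.
    sine = [math.sin(2 * math.pi * k / 100) for k in range(100)]
    print(extrema_indices(sine))
```

The gaps between successive indices are then the timing information that a template matcher would work with.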

Adding out-of-band 'noise' may also help, similarly to the concept of
dithering in an A/D converter.

I've been meaning to build such a system Real Soon Now for far too long...

> The foundational recognition task is recognizing a single phoneme

If I remember my theory correctly, phonemes can be further broken
down into allophones.
That matters for the identity vs intelligence of speech.


--
http://blog.softwaresafety.net/
http://www.designer-iii.com/
http://www.wearablesmartsensors.com/

Re: Erlang for Speech Recognition

Banibrata Dutta
In reply to this post by Ivan Uemlianin
On Sun, Jun 19, 2011 at 2:36 PM, Ivan Uemlianin <[hidden email]> wrote:

<snip>
 
** Audio Preprocessor

Automatic Speech Recognition (ASR) is essentially mapping a sequence of integers (i.e., acoustic signals) onto a sequence of linguistic symbols (i.e., phonemes (units of linguistic sound) or words).  The raw audio data (e.g. from a wav file or a microphone) is not terribly useful for this and the first step is to convert this data into a more useful abstract representation.  Every 10ms or so, a short window of sound is transformed into a feature vector of 39 features, known as Mel-Frequency Cepstral Co-efficients (MFCCs).

<snip>

I don't remember where I'd read this, but the topic was something like "what Erlang as a language is probably not well suited for...", and a significant amount of mathematical calculation was probably one of those. Has that stance changed? Could it be a factor to consider?

--
regards,
BDutta


Re: Erlang for Speech Recognition

Ivan Uemlianin
In reply to this post by Ivan Uemlianin
Dear Kenji

Thanks for the tip-off.  Husky looks interesting.

My Japanese is not what it should be, but it looks like Husky uses HMMs produced by ATT and HTK, is that right?

    http://www.furui.cs.titech.ac.jp/~shinot/husky/runhusky.html

Where it says:

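    認識ネットワークやHMM状態ファイルのパラメタは日本語話し言葉コーパスという音声データベースからHTKやATTツールキットを用いて作成しています.

    (Roughly: "The recognition network and the HMM state file parameters are built from a speech database called the Corpus of Spontaneous Japanese, using the HTK and ATT toolkits.")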

But whatever part Husky does, it'll be very interesting to see how it's done in Haskell.  I'll check out
Takahiro Shinozaki's papers and see what he's written about Husky.

Best wishes

Ivan


On 19/06/11 13:22, Kenji Rikitake wrote:
> Takahiro Shinozaki has published a Haskell speech recognition program
> called Husky.
> http://www.furui.cs.titech.ac.jp/~shinot/husky/husky.hs
>
> FYI.
> Kenji Rikitake
> Note: I know very very little about speech recognition.



Re: Erlang for Speech Recognition

Ivan Uemlianin
In reply to this post by Bob Paddock
Dear Bob

Thanks for your comments.

On 19/06/11 15:07, Bob Paddock wrote:
>
> You start with the audio processor, and the rest of the front end.
> If that doesn't work then any downstream work you do is wasted time.

Sensible.  The audio processor is in the lead so far.

> Consider a different approach such as Extrema Processing...

I'm afraid I'm not familiar with Extrema Processing.  Can you give me
some pointers?  Do you know if it's used in speech recognition?

Using Mel-Frequency Cepstral Coefficients removes many speaker-dependent
properties of the signal (like voice pitch).  I don't know much about
voice identification, but I imagine you'd abstract out a different set
of features (e.g., you'd probably want to keep voice pitch).

Best wishes

Ivan



Re: Erlang for Speech Recognition

Ivan Uemlianin
In reply to this post by Banibrata Dutta
Dear Banibrata

Thanks for your comment.

On 19/06/11 15:45, Banibrata Dutta wrote:
>
> I don't remember where I'd read this, but the topic was something
> like "what Erlang as a language is probably not well suited for...", and
> a significant amount of mathematical calculation was probably one of
> those. Has that stance changed? Could it be a factor to consider?

Yes, that's another factor leaning me towards using the Sphinx code
(written in C) as a basis for a make_feats function and calling it as a
port or a NIF.

Best wishes

Ivan



Thanks! Re: Erlang for Speech Recognition

Ivan Uemlianin
In reply to this post by Ivan Uemlianin
Dear All

Thank you for your comments.

I'm going to tackle the audio preprocessor first, converting PCM audio
into MFCCs.  I'll use Sphinx's make_feats as a basis.  This will mean,
on the C side, writing a function to call what I need from Sphinx, and
on the Erlang side, writing the necessary Port/Linked-in Driver/NIF.
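
If the port route is taken with the {packet, 2} option, every message between Erlang and the external program is prefixed with a 2-byte big-endian length.  The sketch below shows that framing from the external program's side.  It is written in Python only to illustrate the byte layout (the real program in this plan would be C), and the function names are hypothetical.

```python
import struct


def encode_packet(payload: bytes) -> bytes:
    """Prefix payload with its length as a 2-byte big-endian integer."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too large for a 2-byte length header")
    return struct.pack(">H", len(payload)) + payload


def decode_packets(buffer: bytes):
    """Split a byte buffer into complete payloads plus any leftover bytes."""
    packets = []
    offset = 0
    while len(buffer) - offset >= 2:
        (length,) = struct.unpack_from(">H", buffer, offset)
        if len(buffer) - offset - 2 < length:
            break                      # incomplete packet, wait for more data
        packets.append(buffer[offset + 2:offset + 2 + length])
        offset += 2 + length
    return packets, buffer[offset:]
```

A NIF or linked-in driver avoids this framing altogether, at the cost of running the C code inside the Erlang VM.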

With thanks and best wishes

Ivan

