Extract phonemes from an audio file

Aug 18, 2013 at 2:39pm
I have finished my speech recognition program in C++, and now I am building the dictionary. I have a problem with this and I need some help. Training is very tiring: I recorded and trained word by word, so now I want to automate it. In other words, I want to extract phonemes from an audio file. Has anyone done this yet? Please guide me. (I want to implement it myself, without using any other software.) I don't think I can work directly on the values read through the sndfile.h library...
Aug 18, 2013 at 11:37pm
Which library are you using for the speech recognition?
Aug 19, 2013 at 4:54am
I used sndfile.h and fftw3.h
Last edited on Aug 19, 2013 at 5:28am
Aug 19, 2013 at 8:36pm
When you say extract the "phoneme", what exactly do you mean? Please make sure you're using the correct term... If you are, sorry, it must be my misunderstanding.
Last edited on Aug 19, 2013 at 8:37pm
Aug 20, 2013 at 10:18am
I mean I want to extract every word in an audio file. In other words, I want to filter out the noise and split my audio file into several audio files, each containing only one word from a sentence that I spoke.
P/s: Sorry for my poor English :(
Aug 20, 2013 at 4:09pm
¿word or phoneme?
for words you could simply detect silence

for phonemes, I've used the second LSP coefficient to approximate f0. So detect the vowel and take a little bit of time before it starts
(had a consonant-vowel scheme, it was having issues with `m' and `v')
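To make the "detect silence" idea for words concrete, here is a minimal sketch, not anyone's actual code from this thread. It assumes the samples are already in memory as floats (for example, as read through sndfile.h) and measures the RMS level of fixed-size frames. The names `frameRms` and `silentFrames`, the frame length, and the threshold are all hypothetical and need tuning for a given recording.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square level of each non-overlapping frame of `frameLen` samples.
// A trailing partial frame is ignored.
std::vector<double> frameRms(const std::vector<float>& samples, std::size_t frameLen)
{
    std::vector<double> rms;
    if (frameLen == 0) return rms;
    for (std::size_t start = 0; start + frameLen <= samples.size(); start += frameLen) {
        double sumSq = 0.0;
        for (std::size_t i = start; i < start + frameLen; ++i)
            sumSq += static_cast<double>(samples[i]) * samples[i];
        rms.push_back(std::sqrt(sumSq / frameLen));
    }
    return rms;
}

// True for every frame whose RMS is below `threshold` (treated as silence).
std::vector<bool> silentFrames(const std::vector<double>& rms, double threshold)
{
    std::vector<bool> silent;
    silent.reserve(rms.size());
    for (double r : rms) silent.push_back(r < threshold);
    return silent;
}
```

A run of consecutive silent frames can then be treated as a gap between words. Measuring a whole frame rather than a single sample makes the decision less sensitive to individual noise spikes.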
Aug 20, 2013 at 4:49pm
Hey ne555! I should have known you would know what a phoneme is! Boy does that take me back!
Aug 20, 2013 at 5:50pm
I did this exact exercise a couple of years ago. I wish I still had the source that I wrote. Here's what I did:

1. A script containing a list of words. Each word was on a newline.
2. A WAVE file where my wife read each word (and separated each one with some space)
3. A squelch level that would define the "noise" level. The noise level varied from day-to-day depending on the traffic outside.
4. An estimate of the minimum time between words.

1. Read in each word from the script and store it in a queue.
2. Load the wave file into memory.
3. Create a new wave file and use the first name in the queue as its name before the .wav.
4. Start going through the wave file.
5. When the sound level exceeds the "squelch" limit, mark the start of the word (you may want to mark the start of the word a few samples earlier).
6. When the sound level is below the "squelch" limit for the pre-defined minimum time between words, mark the end of the word at the current time minus the time between words.
7. Copy all of the samples between your two marks into your new wave file and save it.
8. Repeat at step 3.

If there were 50 words in your original script, you will (hopefully) have 50 .wav files in your output directory. You may need to play with the squelch and the delay between words to get a good match. You also need to ensure that each word is not cut off, so you may need to play with offsets on the start and end of each word.

At one point I found it was useful to make all of the bookmarks first, check if the number of words found in the script matched the number of words found in the wave, and THEN do the copying.

The best test I had for doing this was to count from one to ten. A number like "Ten" will have a very good start and end detection, however "Six" has a very slow start and so the "S" often got cut off, hence you should check this. I think "Four" was a very quiet one, so sometimes it didn't get recognized at all if the squelch was too high.
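Since I no longer have the original source, here is a rough sketch of the bookmarking part (steps 4 to 6) as described above. It is a reconstruction under assumptions, not the code I actually wrote: samples are floats already loaded into memory, and the names `WordMark`, `findWords`, `squelch`, `minGap` and `preRoll` are made up for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One detected word: sample indices [start, end) in the source recording.
struct WordMark {
    std::size_t start;
    std::size_t end;
};

// Scans `samples` and returns one mark per detected word.
// squelch : samples with |x| <= squelch count as noise
// minGap  : number of consecutive quiet samples that ends a word
// preRoll : samples to back up from the first loud sample (helps soft onsets like "Six")
std::vector<WordMark> findWords(const std::vector<float>& samples,
                                float squelch,
                                std::size_t minGap,
                                std::size_t preRoll)
{
    std::vector<WordMark> marks;
    bool inWord = false;
    std::size_t start = 0;
    std::size_t lastLoud = 0;  // index of the most recent sample above the squelch

    for (std::size_t i = 0; i < samples.size(); ++i) {
        const bool loud = std::fabs(samples[i]) > squelch;
        if (loud) {
            if (!inWord) {
                inWord = true;
                start = (i > preRoll) ? i - preRoll : 0;
                // Never reach back into the previous word.
                if (!marks.empty() && start < marks.back().end)
                    start = marks.back().end;
            }
            lastLoud = i;
        } else if (inWord && i - lastLoud >= minGap) {
            // Quiet for minGap samples: the word ended right after the last loud sample.
            marks.push_back({start, lastLoud + 1});
            inWord = false;
        }
    }
    if (inWord)
        marks.push_back({start, lastLoud + 1});  // recording ended mid-word
    return marks;
}
```

Because this only collects bookmarks, you can compare `marks.size()` against the number of words in the script before copying any samples, and adjust `squelch`, `minGap` or `preRoll` if the counts disagree.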
Aug 20, 2013 at 8:16pm
@kooth: ¿ah?
Aug 21, 2013 at 3:06pm
@ne555: but I don't have a digital microphone right now (I used the laptop's built-in mic), so it's hard to estimate the silence value.
@Stewbond: how can you determine the "squelch" limit? I read the wave file with sndfile.h, then wrote all the values to a txt file to find the squelch limit. But I couldn't see anything :(. I have no idea what I am doing :(, can you explain it to me?
Aug 21, 2013 at 5:32pm
¿which SNR do you expect to work with?
Aug 22, 2013 at 3:06pm
What do you mean? I think I have to find out which SNR I am working with rather than expect one :O? I am a newbie at this, please explain more clearly.
Aug 25, 2013 at 7:59am
Thank you, guys. I've finally finished :D. But I can't estimate the silence value reliably; each time I speak, I have to change that value :(. Is there any way to fix this?
Aug 25, 2013 at 4:46pm
Audacity is a great tool you can download to analyze waveforms. Record your wave file, then open it with Audacity. This will show you what your voice looks like as a waveform and give you an idea of the noise level of your setup. When you are not talking, the noise level should not really change.

Visualizing the waveform I think will really help you.
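If you would rather not retune the value by hand for every recording, one possible approach is to measure the noise at the start of each file and derive the squelch from it. This is only a sketch under a big assumption: that you stay quiet for the first part of every recording. The names `noiseFloorRms` and `squelchFromNoise` and the `margin` multiplier are hypothetical, and the right margin has to be found by experiment.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// RMS level of the first `noiseLen` samples, ASSUMED to contain no speech.
double noiseFloorRms(const std::vector<float>& samples, std::size_t noiseLen)
{
    const std::size_t n = std::min(noiseLen, samples.size());
    if (n == 0) return 0.0;
    double sumSq = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sumSq += static_cast<double>(samples[i]) * samples[i];
    return std::sqrt(sumSq / n);
}

// Squelch level derived from the measured noise floor.
// `margin` is a hand-tuned multiplier (hypothetical starting point: around 3.0).
double squelchFromNoise(const std::vector<float>& samples,
                        std::size_t noiseLen,
                        double margin)
{
    return margin * noiseFloorRms(samples, noiseLen);
}
```

For example, with a 16 kHz recording you could pass `noiseLen = 8000` to use the first half second as the noise reference. If the derived squelch still cuts off quiet words, lower `margin`; if noise gets picked up as words, raise it.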