Forced Alignment with InproTK (and Sphinx)

Forced alignment is the task of determining start and end times for words within an audio file, given a reference text. For example, if I record myself on a microphone to a wav file saying the words “hello world” I can then pass that wav file and the text “hello world” to a program that can do forced alignment and it will be able to tell me at what point in the file each of the two words started and ended.

In this tutorial, we will use InproTK (which in turn uses CMU’s Sphinx4 speech recognizer) to perform the forced alignment (henceforth, FA). We will see how it is done with InproTK’s SimpleReco, then we will see how one can do FA in other languages. Below, two common problems are addressed.

It was explained in a previous post how to “install” InproTK in Eclipse. Please refer to that explanation for downloading InproTK and getting it to run (it might not hurt to check out the develop branch). Once you have it working as a project in Eclipse, please continue with the steps below.

FA in InproTK is quite simple. In Eclipse, navigate to src, then inpro.apps.SimpleReco and open it. The Java source file should appear in the editor. Click on the little arrow next to the green play button at the top, then Run As -> Java Application. You should see a bit of output in the console explaining what the command line parameters should be.

The command line parameters that are necessary for FA are:
-F <URI to audio file>
-fa "<reference text>"
-Lp <path to inc_reco output file>

In order to test FA on your system, please download this: audio. Put it in a folder that you know the path to.

In Eclipse, click on the arrow next to the Run icon, then click on “Run Configurations”. Find SimpleReco under Java Applications and click on it. Then click on the tab “Arguments”. In the Program arguments box, copy in the following, then change the two paths within the brackets to reflect your own setup:

-F file://<path to audio>/audio.wav -fa "der rote stricht oben rechts richtig" -L -Lp <path for output file>

Click “Apply” then “Run”.  You should see output in the command line. You should also see an inc_reco file appear in the <path for output file>. Open that file and you will see lines that look like this:

Time: 6.80
0.000    2.930    <sil>
2.930    3.450    der
3.450    3.990    rote
3.990    4.200    <sil>
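Once you have an inc_reco file, the word alignments are easy to consume programmatically. The following is a minimal Python sketch based on the format shown above (it assumes the file consists of “Time:” headers followed by start/end/word lines, with the last block holding the final hypothesis; this parsing logic is my own, not part of InproTK):

```python
def parse_inc_reco(text):
    """Parse inc_reco-style output and return the word alignments of the
    last (i.e. final) hypothesis as (start, end, word) tuples."""
    blocks = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Time:"):
            # each "Time:" header starts a new incremental hypothesis
            current = []
            blocks.append(current)
        elif current is not None:
            start, end, word = line.split()
            current.append((float(start), float(end), word))
    return blocks[-1] if blocks else []

example = """\
Time: 6.80
0.000\t2.930\t<sil>
2.930\t3.450\tder
3.450\t3.990\trote
3.990\t4.200\t<sil>
"""

# drop the silence markers to keep only actual words
words = [(s, e, w) for (s, e, w) in parse_inc_reco(example) if w != "<sil>"]
```

From `words` you can then compute, e.g., per-word durations or feed the alignments into further processing.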

What if I want to perform FA on many files? You can call InproTK from the command line. Make sure all the necessary jars in the lib folder are on the classpath; it can then be called from, e.g., a shell or Python script:

java -classpath inprotk/lib/oaa2/*:inprotk/lib/common-for-convenience/*:inprotk/lib/*:inprotk/lib/sphinx4/* inpro.apps.SimpleReco -c file:inprotk/config.xml -F file:<path to audio file> -fa "<reference text>" -L -Lp <path to inc_reco output file>
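A small Python driver can loop over many audio/transcript pairs and invoke SimpleReco once per file. This is a sketch: the classpath and config path are taken from the command above, and the output naming scheme is my own choice; adjust everything to your setup.

```python
import subprocess
from pathlib import Path

# classpath as in the java command above; adjust to where InproTK lives
CLASSPATH = ("inprotk/lib/oaa2/*:inprotk/lib/common-for-convenience/*:"
             "inprotk/lib/*:inprotk/lib/sphinx4/*")

def fa_command(wav_path, reference_text, out_path):
    """Build the SimpleReco command line for one forced-alignment run."""
    return ["java", "-classpath", CLASSPATH, "inpro.apps.SimpleReco",
            "-c", "file:inprotk/config.xml",
            "-F", "file:" + str(wav_path),
            "-fa", reference_text,
            "-L", "-Lp", str(out_path)]

def align_all(pairs, out_dir):
    """Run forced alignment for a list of (wav_path, reference_text) pairs,
    writing one inc_reco file per input into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for wav, text in pairs:
        out = out_dir / (Path(wav).stem + ".inc_reco")
        subprocess.run(fa_command(wav, text, out), check=True)
```

Passing the argument list directly to `subprocess.run` (rather than a shell string) avoids quoting problems with the reference text.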

FA with InproTK in English

The above example was audio from a German speaker. What if I want to use this in another language? No problem. You will, of course, need three things for that language: an acoustic model, a language model, and a dictionary. All three kinds of models exist for various languages. We will not look at how to create models here. Rather, we will see an example of how an already existing model for English can be used in InproTK.

In the inpro.apps folder, you will see a config.xml. In that file, there are references to other xml configuration files. Note the sphinx-de.xml. Replacing the information in that file with the information for the English model will tell Sphinx/InproTK where to find it. To do so, please change sphinx-de.xml to sphinx-en.xml and save the config.xml file.

Fortunately, InproTK already has the WSJ acoustic models and dictionary. You will also need a language model file, which you can download here. Download the wsj5kc.Z.DMP file and put it into InproTK’s res folder. Next, create a file called sphinx-en.xml (next to sphinx-de.xml) and paste the following contents into it:

<property name="wordInsertionProbability" value="0.7"/>
<property name="languageWeight" value="11.5"/>

<!-- ******************************************************** -->
<!-- The Grammar configuration                                -->
<!-- ******************************************************** -->
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="grammarLocation" value="file:src/inpro/domains/pentomino/resources/"/>
    <property name="grammarName" value="pento"/>
    <property name="logMath" value="logMath"/>
</component>

<component name="ngram" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="dictionary" value="dictionary"/>
    <property name="logMath" value="logMath"/>
    <!-- this can be overridden with the -lm switch in SimpleReco -->
    <property name="location" value="file://res/wsj5kc.Z.DMP"/>
    <property name="maxDepth" value="3"/>
    <property name="unigramWeight" value=".7"/>
</component>

<!-- this second ngram model can be used to blend multiple language models using interpolatedLM defined in sphinx.xml -->
<component name="ngram2" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="dictionary" value="dictionary"/>
    <property name="logMath" value="logMath"/>
    <property name="location" value="file://res/wsj5kc.Z.DMP"/>
    <property name="maxDepth" value="3"/>
    <property name="unigramWeight" value=".7"/>
</component>

<!-- ******************************************************** -->
<!-- The Dictionary configuration                             -->
<!-- ******************************************************** -->
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FullDictionary">
    <property name="dictionaryPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
    <property name="fillerPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
    <!--
    <property name="addSilEndingPronunciation" value="false"/>
    <property name="allowMissingWords" value="true"/>
    <property name="createMissingWords" value="true"/>
    <property name="g2pModelPath" value="file:///home/timo/uni/projekte/inpro-git/res/Cocolab_DE-g2p-4.fst.ser"/>
    <property name="g2pMaxPron" value="2"/>
    -->
    <property name="unitManager" value="unitManager"/>
</component>

<!-- ******************************************************** -->
<!-- The acoustic model and unit manager configuration        -->
<!-- ******************************************************** -->
<component name="sphinx3Loader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
    <property name="modelDefinition" value="etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef"/>
    <property name="dataLocation" value="cd_continuous_8gau/"/>
</component>

Now InproTK can perform FA on English audio. Go back to the SimpleReco run configuration (arrow next to the Run icon, then “Run Configurations”, then SimpleReco), point the command line arguments at an English audio file, and change the reference text to reflect that audio. You can change back to German by opening up the config.xml file and replacing sphinx-en.xml with sphinx-de.xml.

Problem! I don’t see the inc_reco output file anywhere!! It could be the case that your LabelWriter isn’t working properly. In inpro.apps you will find iu-config.xml. Open that and find labelWriter2. Make sure the component definition for that looks like this:

<component name="labelWriter2" type="inpro.incremental.sink.LabelWriter">
    <property name="writeToStdOut" value="false"/>
    <property name="writeToFile" value="true"/>
</component>

If not, then copy the above text over the original labelWriter2 definition. Save the iu-config.xml file.

Problem! My output contains warnings like “Can’t find pronunciation for <word>”. That means one of the words in your reference text does not exist in the dictionary. The dictionary for the German model is in the lib/Cocolab_DE_8gau_13dCep_16k_40mel_130Hz_6800Hz jar, in the Cocolab_DE_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict folder. You can view its contents in Eclipse by going to the InproTK project, then Referenced Libraries, then Cocolab_DE_ …, then dict, then Cocolab_DE.lex. Here is an example of the contents:

zog    t s o: k
Sekretär    z e k r e t E: 6
tanzen    t a n t s n
tanzen(1)    t a n t s @ n
Ladentür    l a: d n t y: 6
kochten    k O x t n

where each line has a word, a tab, then the phonetic transcription. To add words to the dictionary (ones that appear in the above-mentioned warning in the console), add the word to the bottom, press tab, then add a phonetic transcription (we will not see how a phonetic transcription is made here; for now just find other words that look similar to the one you are adding and form the transcription based on those). You may need to open the jar with an archive tool, add the missing word+transcription, save the dictionary file, then re-make the jar and replace the old one. Alternatively, you can make your own dictionary file and set the dictionary path to that file in sphinx-de.xml.
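Before running a large batch, it can save time to check which words of your reference texts are missing from the dictionary up front, rather than discovering them one warning at a time. Here is a small Python sketch that reads a .lex file in the tab-separated format shown above (the variant-marker handling, e.g. for “tanzen(1)”, is an assumption based on that example):

```python
def load_lexicon(path):
    """Read a .lex dictionary file with one 'word<TAB>phones' entry per
    line. Variant entries like 'tanzen(1)' are mapped to the base word,
    so each word collects a list of its pronunciations."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            word, phones = line.split("\t", 1)
            base = word.split("(")[0]  # strip variant marker, e.g. "(1)"
            lexicon.setdefault(base, []).append(phones)
    return lexicon

def missing_words(reference_text, lexicon):
    """Return the words in the reference text with no dictionary entry."""
    return [w for w in reference_text.split() if w not in lexicon]
```

Any word reported by `missing_words` is one you would need to add to the dictionary (or handle via a separate dictionary file) before alignment will succeed.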

2 Interspeech 2015 Papers accepted!

We had two papers accepted to the Interspeech Conference in Dresden in August:

Title: Micro-Structure of Disfluencies: Basics for Conversational Speech Synthesis

Authors: Simon Betz, Petra Wagner and David Schlangen

Abstract: Incremental dialogue systems can produce fast responses and can interact in a human-like fashion. However, these systems occasionally produce erroneous material or run out of things to say. Humans in such situations use disfluencies to remedy their ongoing production and signal this to the listener. We devised a new model for inserting disfluencies into synthesis and evaluated this approach in a perception test. It showed that lengthenings and silent pauses can be built for speech synthesis with low effort and high output quality. Synthesized word fragments and filled pauses, while potentially useful in incremental dialogue systems, appear more difficult to handle for listeners. While we were able to get consistently high ratings for certain types of disfluencies, the need for more basic research on their micro structure became apparent in order to be able to synthesize the fine phonetic detail of disfluencies. For this, we analysed corpus data with regard to distributional and durational aspects of lengthenings, word fragments and pauses. Based on these natural speaking strategies, we explored further to what extent speech can be delayed using disfluency strategies, and how to handle difficult disfluency elements by determining the appropriate amount of durational variation applicable.


Title: Recurrent Neural Networks for Incremental Disfluency Detection

Authors: Julian Hough and David Schlangen

Abstract: For dialogue systems to become robust, they must be able to detect disfluencies accurately and with minimal latency. To meet this challenge, here we frame incremental disfluency detection as a word-by-word tagging task and, following their recent success in Spoken Language Understanding tasks, we test the performance of Recurrent Neural Networks (RNNs). We experiment with different inputs for RNNs to explore the effect of context on their ability to detect edit terms and repair disfluencies effectively, and also experiment with different tagging schemes. Although not eclipsing the state of the art in terms of utterance-final performance, RNNs achieve good detection results, requiring no feature engineering and using simple input vectors representing the incoming utterance as their training input. Furthermore, RNNs show very good incremental properties with low latency and very good output stability, surpassing previously reported results in these measures.