Free Wisdom Online: Audio Transcription on Linux

A large part of my work involves recorded interviews: transcribing them, verifying transcriptions done by others, analysing passages, looking for quotes to illustrate specific points. I do this on Linux and have tried a few different combinations of tools in the past, for a while settling on Emacs or OpenOffice in combination with the curses version of mplayer, switching between the two windows with Alt-Tab. This worked, but not well, since every pause or rewind meant leaving the document. So, having recently upgraded to Ubuntu 7.10 (Gutsy), I decided to look at what new options are available. It turned out that Gutsy has pretty much all the pieces for a good transcription setup. The rest of this document explains how to do the following on Gutsy:

Control audio (play/pause and skip +/– 5 or 30 seconds) with the keyboard from any application, i.e., without having to switch between windows. (I mapped F1 through F4 for those functions.)
Insert time stamps into any document. (In my setup I now need to press F6 to put the timestamp into the clipboard and then Ctrl-V to paste the timestamp.)
Play audio at a timestamp. (In my setup, I just need to select a piece of text like “[00:43:10]” in any document and press F6 – the audio will then jump to 43 minutes 10 seconds into the current audio file.)

Controlling Audio

First of all we need a way to play audio. We'll use XMMS2, which has a client-server architecture. There is a daemon (“xmms2d”) that actually plays the audio and maintains a playlist and there are several front-ends that you can choose from. The seemingly standard one is GXMMS2, but I didn’t like it too much. The playlist interface is a bit too complicated, and the minimized version is too tall. Instead, I settled on Esperanza. What I liked most about Esperanza is that it minimizes to a very slim bar, just like a toolbar. I adjusted the size and position of my OpenOffice window so that it fits right under the Esperanza bar. This way I can see where my audio is at (and which file was loaded) while typing into OpenOffice:

(“Rafael” is a pseudonym.)

Esperanza and XMMS2 can be installed with apt-get:

sudo apt-get install xmms2 esperanza

Even though the Esperanza bar is visible while OpenOffice is active, it doesn’t have focus and thus will not catch keyboard events. (And if it did, it wouldn’t help, since it doesn’t have a keyboard shortcut for going back or skipping by 5 seconds – a key feature when doing transcription.) However, we don’t need to use it for controlling the audio – just for selecting a file and for showing where we are. Instead, we can control the XMMS2 daemon using “xmms2”, a command-line client. For instance, typing “xmms2 pause” at the command line pauses the audio and “xmms2 seek -5” takes us 5 seconds back. But you won’t actually have to type out those commands every time. Gnome 2 lets us bind those commands to keys, and those bindings will be active in any application. Open gconf-editor (press Alt-F2 and type “gconf-editor”), and navigate to apps/metacity/global_keybindings.

We'll pick keys for “run_command_1” through “run_command_7”, typing “F1”, “<Shift>F1”, etc. After that, navigate to apps/metacity/keybinding_commands and set the commands “command_1” through “command_7” that correspond to those keys. For example, if we set “run_command_1” to “F1” in global_keybindings and set “command_1” to “xmms2 seek -5” in keybinding_commands, pressing F1 will have the same effect as calling xmms2 seek -5 at the command line: it will move the audio back 5 seconds.

In my case the six playback bindings are configured as follows (the remaining binding, F6, is used for the timestamp script in the next section):

binding	key	action	command
run_command_1	F1	go 5 seconds back	xmms2 seek -5
run_command_2	F2	pause	xmms2 pause
run_command_3	F3	play	xmms2 play
run_command_4	F4	forward 5 seconds	xmms2 seek +5
run_command_5	<Shift>F1	go 30 seconds back	xmms2 seek -30
run_command_6	<Shift>F4	forward 30 seconds	xmms2 seek +30

I skipped F5 since I use it often in OpenOffice – to bring up the Navigator. F1 – F4 and F6, on the other hand, didn’t serve any function that I could remember. Of course, if you use those for something else, you should pick different keys.
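The same bindings could probably also be set from a script instead of clicking through gconf-editor. The following is only a sketch: it assumes the Metacity keys are named run_command_N and command_N (as gconf-editor displays them) and that gconftool-2, the command-line counterpart of gconf-editor, is installed. The names BINDINGS and gconf_calls are made up for this example. It only prints the calls so that they can be reviewed before running them.

```python
# Sketch: build the gconftool-2 calls that would recreate the key bindings
# from the table above. Key names (run_command_N, command_N) are assumed to
# match what gconf-editor shows under apps/metacity.
try:
    from shlex import quote   # Python 3
except ImportError:
    from pipes import quote   # Python 2

BINDINGS = [
    # (key, command)
    ("F1", "xmms2 seek -5"),
    ("F2", "xmms2 pause"),
    ("F3", "xmms2 play"),
    ("F4", "xmms2 seek +5"),
    ("<Shift>F1", "xmms2 seek -30"),
    ("<Shift>F4", "xmms2 seek +30"),
]

KEY_DIR = "/apps/metacity/global_keybindings"
CMD_DIR = "/apps/metacity/keybinding_commands"


def gconf_calls(bindings):
    """Return one argv list per GConf value that needs to be set."""
    calls = []
    for index, (key, command) in enumerate(bindings):
        n = index + 1  # the bindings are numbered from 1
        calls.append(["gconftool-2", "--type", "string", "--set",
                      "%s/run_command_%d" % (KEY_DIR, n), key])
        calls.append(["gconftool-2", "--type", "string", "--set",
                      "%s/command_%d" % (CMD_DIR, n), command])
    return calls


if __name__ == "__main__":
    # Print shell-quoted command lines for review; nothing is executed here.
    for argv in gconf_calls(BINDINGS):
        print(" ".join(quote(arg) for arg in argv))
```

Each printed line can be pasted into a terminal once it has been checked against what gconf-editor shows on your system.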

After making this configuration, I can control audio playback from any application, including OpenOffice.

Getting the timestamp and playing audio at the timestamp

Another thing that comes in handy in transcription is being able to insert the current audio position into the document and to play the audio from a recorded position. Luckily, one can control XMMS2 from Python, using python-xmmsclient (“sudo apt-get install python-xmmsclient”). To get the timestamps in and out of documents (OpenOffice or other), we can use the Python bindings for GTK (“sudo apt-get install python-gtk2”) to get text in and out of the clipboard.

The following script checks whether the current selection contains something that looks like a timestamp, and if so moves the audio to that position. (It can handle timestamps that look like “hh:mm:ss” or “mm:ss”, with or without brackets around them.) Otherwise, it captures the current position and saves it in the clipboard, so that it can then be pasted into any document.

# Jump to the selected timestamp, or copy the current audio position
# to the clipboard if the selection is not a timestamp.
import re

# clipboard access (PyGTK 2)
import pygtk
pygtk.require('2.0')
import gtk

# XMMS2 client bindings
import xmmsclient

# "mm:ss" or "hh:mm:ss", optionally wrapped in square brackets
TIMESTAMP_RE = re.compile(r"^\[?(?:(\d+):)?(\d{1,2}):(\d{1,2})\]?$")

class Controller :

    def __init__ (self) :
        self.xmms = xmmsclient.XMMS()
        self.xmms.connect()
        # CLIPBOARD is what Ctrl-V pastes; PRIMARY holds the current selection
        self.clipboard = gtk.clipboard_get()
        self.selection_clipboard = gtk.clipboard_get(selection="PRIMARY")
        selection = self.selection_clipboard.wait_for_text()
        # self.match stays None unless the selection looks like a timestamp
        self.match = None
        if selection :
            self.match = TIMESTAMP_RE.match(selection.strip())

    def get_time_from_xmms(self) :
        w = self.xmms.playback_playtime()
        w.wait()
        t = w.value() // 1000    # playtime is reported in milliseconds
        return "[%02d:%02d:%02d]" % (t // 3600, (t % 3600) // 60, t % 60)

    def seek_to_timestamp(self) :
        # the hours group is None for "mm:ss"
        h, m, s = [int(g or 0) for g in self.match.groups()]
        ms = (h*3600 + m*60 + s) * 1000
        # start playback first, then jump to the requested position
        w = self.xmms.playback_start()
        w.wait()
        w = self.xmms.playback_seek_ms(ms)
        w.wait()

    def push_time_to_clipboard(self) :
        time = self.get_time_from_xmms()
        self.clipboard.set_text(time)
        self.clipboard.store()

    def dispatch(self) :
        if self.match :
            self.seek_to_timestamp()
        else :
            self.push_time_to_clipboard()

Controller().dispatch()

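Save this script somewhere (e.g., in “~/clipboard2xmms.py”). Now we just need to bind the script to a key, in the same way as the playback commands above: pick a key for the next unused binding in global_keybindings and set the matching command in keybinding_commands to run the script with Python, for example “python” followed by the full path to the script (I would use the full path, since “~” may not be expanded there). In my case I use F6.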

After that, pressing F6 either moves the audio to the timestamp in the current selection (in any text; e.g., you should be able to select this timestamp, [00:10:15], to start playing the audio at 10 minutes 15 seconds) or puts the current audio position into the clipboard, from where you can paste it into any document.
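The timestamp handling can also be tried on its own, without XMMS2 or GTK installed. The following is a minimal sketch rather than part of the script. The names parse_timestamp and format_timestamp are made up for this example, and it assumes that a regular expression is an acceptable way to recognize “hh:mm:ss” and “mm:ss”, with or without brackets.

```python
import re

# Accept "mm:ss" or "hh:mm:ss", optionally wrapped in square brackets.
TIMESTAMP_RE = re.compile(r"^\[?(?:(\d+):)?(\d{1,2}):(\d{1,2})\]?$")


def parse_timestamp(text):
    """Return the position in milliseconds, or None if text is not a timestamp."""
    match = TIMESTAMP_RE.match(text.strip())
    if match is None:
        return None
    # The hours group is None when the text has the form "mm:ss".
    hours, minutes, seconds = [int(group or 0) for group in match.groups()]
    return (hours * 3600 + minutes * 60 + seconds) * 1000


def format_timestamp(ms):
    """Format a position in milliseconds as "[hh:mm:ss]"."""
    total = ms // 1000
    return "[%02d:%02d:%02d]" % (total // 3600, (total % 3600) // 60, total % 60)


if __name__ == "__main__":
    for sample in ["[00:43:10]", "43:10", "3 apples"]:
        print("%r -> %r" % (sample, parse_timestamp(sample)))
```

Rejecting anything that does not match the pattern means that selecting ordinary text that merely starts with a digit falls through to the “copy the current position” branch instead of seeking.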

Remaining issues

Ideally, I would also want to have a different audio file associated with each OpenOffice document, and to be able to load the right one automatically. So far I haven’t figured out how to do that, partly because I can’t get the Python XMMS2 client to change audio files. For the time being, Esperanza provides a reasonable interface for selecting audio files. I just load all of my interviews into a playlist, which makes it relatively easy to switch between them. (Note that the playlist is preserved even between reboots of the system and I don’t use XMMS2 for playing actual music.)
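One possible direction, untested and offered only as a sketch: look up the audio file by the document’s name and hand it to the “xmms2” command-line client instead of the Python client. Everything here is an assumption: the naming convention (audio file and document share a base name), the extension list, the function names, and the “xmms2 clear” and “xmms2 add” subcommands, which should be checked against “xmms2 help” on your system. Note that clearing would replace the playlist of interviews, and the question of how to detect which document is currently open remains unsolved.

```python
import os

# Hypothetical list of audio extensions to try, in order.
AUDIO_EXTENSIONS = (".mp3", ".ogg", ".wav")


def audio_for_document(doc_path, audio_dir):
    """Return the file in audio_dir that shares the document's base name, or None."""
    base = os.path.splitext(os.path.basename(doc_path))[0]
    for ext in AUDIO_EXTENSIONS:
        candidate = os.path.join(audio_dir, base + ext)
        if os.path.isfile(candidate):
            return candidate
    return None


def load_commands(audio_path):
    """Command lines that would replace the playlist with one file and play it.

    The subcommand names are assumptions; nothing is executed here.
    """
    return [
        ["xmms2", "clear"],
        ["xmms2", "add", audio_path],
        ["xmms2", "play"],
    ]
```

The returned command lines could be run with subprocess.call once they have been confirmed by hand at the command line.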