Paper Cut Audio Editing for Radio Journalism
April 1st 2024 | #English #Open Source
tl;dr: It's no longer necessary to slowly listen to audio when editing interviews; you can do it with simple text files. Work as fast as you can read!
There is now a way to edit (automatically generated) audio transcripts in a text editor like Vim(!) and then tell the computer to apply all the changes in the text to the corresponding audio(s). No more slow cuts in audio tools like Audacity/Tenacity or painful transcriptions by hand. Instead fast, efficient and nerdy. This changes my radio production workflow fundamentally. And the best part is, we don't need expensively licensed software from some startup to do it, it's possible with open source tools that are easily available and adaptable to our specific needs.
Disclaimer: The following text describes a workflow that might seem daunting for people who don't use Linux or have never opened a command line interface. I've tried to make it an interesting read anyway. If you're less technically inclined, you can stick to parts I and III and just skip part II.
Part I: Current Practice and Utopia
As a radio journalist, I spend a lot of time listening to audio over and over, choosing what to cut out and what to keep in, and how to rearrange the audios I have. This is a very rewarding process, as well-crafted features and reports can engage the senses in a very artful way when done right. But it is also an extremely time-consuming process. Up till now, editing audio meant editing at the same speed as you can listen. This means that with an industry-standard DAW or Audacity/Tenacity, you'd need at least one hour to edit a one-hour interview, and depending on how many edits are necessary, the amount of time needed will easily skyrocket.
I admit to usually editing at 1.25x to 1.5x the normal playback speed, and I've gotten quite fast at editing decisions, so I'd usually need 40-50 minutes for that same hour-long interview. Now we've got this crazy little thing called "AI", which gives us some fancy tools that appear to halve that editing time. Okay, quick disclaimer: I'm not talking about these fancy programs editing for me; I do actually want to retain creative control, after all. But thanks to large language models (for our purposes: speech recognition), we can now easily let the computer automatically transcribe our audio, so why not just edit the resulting text, and then tell the computer to adjust the audio accordingly? This, in essence, is the application of paper editing to audio production. And let me tell you, it's awesome! At least for me, I work a hell of a lot faster with text, which comes as no surprise, as most people read twice as fast as they speak.
So, what's the great idea? Well, editing on paper is the fastest and easiest way I know to work on complex stories. But I'll let Prabhas Pokharel explain:
Paper editing is a process in which you review transcripts, identify the quotes you'll want to use, and lay out the story using the dialogue. The final product (the "paper edit") consists of quotes from the transcripts, arranged in order.
Radio journalists working on a story will often transcribe audio snippets from interviews, print them out and then physically arrange them, adding some more text as they go. A more demanding story, integrating several interviews and other audio snippets, can quickly grow to several meters of taped-together paper. Believe me, I've seen it and I've done it, and it helps greatly (though we'll usually be doing it in a text editor, I mean, it's the 21st century after all, even if we're still waiting for our hoverboards...).
So paper editing is great, but with audio it's a lot of work. You can't just copy/paste a quote from some official statement on the internet. Instead, we need to pick the audio snippet from an interview, which requires listening to the audio until we find the right part, and either transcribing it completely or summarizing it for our paper edit. And once we have our finished script, we still need to craft all the audios together with our favorite digital audio workstation, once again listening a lot to these audios. A lot of this could be automated; the creative process is mainly the act of editing and the finer tweaks in the final audio production.
So what we want, for paper editing to be easily applicable to audio production, is the following workflow:
Autotranscribe Audio → Edit in Textfile → Autocompile Roughcut → Postproduction → Publish.
Thanks to advances in publicly available large language models, transcription can now be automated. And thanks to Scateu we can take a command line Linux text editor, Vim, and make it do the actual edit in text form and then automatically create our audio rough cut.
Part II: How does this work?
Well, all in all, it's just a great combination of well-established, simple tools, Whisper and some useful scripts to glue it all together. I'll only give instructions that work on Linux, as that's all I've tried. The main work on Vim is from Scateu's GitHub page, and I've gotten some helpful pointers from Zach "earboxer" DeCook. Thanks, y'all!
Here are the programs you'll want to install with your favorite linux package manager or the one your linux distribution of choice happens to come with:
dos2unix sox ffmpeg jq socat vim mpv whisper.cpp
whisper.cpp is a C/C++ implementation of OpenAI's Whisper, a speech recognition model for transcribing audio. Depending on your Linux distribution, you also might need to install the relevant model for Whisper, either small, medium or large. The larger the model, the better your transcriptions will be, but you'll also need more space on your hard drive. I've had good results with medium so far. You could also use some other speech-to-text implementation, as long as it can give you .srt files (subtitle files with timestamps). To get the rest of the software and scripts we want to use, run the following commands in a terminal:
$ mkdir -p ~/.vim/pack/plugins/start; cd ~/.vim/pack/plugins/start
$ git clone https://github.com/scateu/tsv_edl.vim
$ git clone https://github.com/vim-airline/vim-airline
$ git clone https://github.com/pR0Ps/molokai-dark
$ cd ~/.vim/pack/plugins/start/tsv_edl.vim
$ make install-utils
And we're set. Next: Let's try this!
Open a terminal and navigate to where your audios are. You'll need .wav audio files with a sample rate of 16 kHz for Whisper to be able to process them (the Whisper models were trained on audio resampled to 16 kHz). In case that's not your default (and it shouldn't be, please always work with 44.1 kHz), this is how you can make any audio fit Whisper's requirements:
$ ffmpeg -i [NAME_OF_YOUR_AUDIO] -ar 16000 [NAME_OF_YOUR_AUDIO].wav
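For a whole folder of recordings, the conversion can be wrapped in a small helper. This is only a sketch under stated assumptions: the function names and the "whisper16k" output folder are made up for illustration, and the extra flags (-ac 1 for mono, -c:a pcm_s16le for 16-bit PCM) follow the conversion command suggested in the whisper.cpp README.

```shell
#!/bin/sh
# Sketch: make 16 kHz mono copies for transcription, keep the originals untouched.
# Function names and the default folder "whisper16k" are hypothetical.

# Print the path of the low-resolution copy for a given recording.
whisper_wav_path() (
    name=$(basename "$1")
    printf '%s/%s.wav\n' "${2:-whisper16k}" "${name%.*}"
)

# Convert one recording to 16 kHz, mono, 16-bit PCM WAV.
to_whisper_wav() (
    out=$(whisper_wav_path "$1" "${2:-}")
    mkdir -p "$(dirname "$out")"
    ffmpeg -i "$1" -ar 16000 -ac 1 -c:a pcm_s16le "$out"
)

# Usage (uncomment to convert every MP3 in the current folder):
# for f in *.mp3; do to_whisper_wav "$f"; done
```

Because the copies land in their own folder under the same base name, the high-quality originals stay where they are and can be used for the actual edit later.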
*Side note on audio quality: If you dump your 44.1 kHz WAVs somewhere safe, and then switch out the lower quality files we needed for transcribing with Whisper after you have your transcriptions, then you will be editing the better quality files!
Creating the correct audio files should be quite fast. Then we can watch the magic happen:
$ whisper.cpp-medium [NAME_OF_YOUR_AUDIO].wav -osrt -l german
This should take a while, depending on your hardware. A good old X260 ThinkPad will take roughly as long as the audio is long, so go get some coffee. Ah, and while we wait: the "-osrt" tells Whisper to output the transcription in said .srt file format, which we need for the timestamps, and "-l" is the language modifier. The default is English; I just wanted to show that you can also use it for German.
Ah, the whispers have stopped? Great, you'll now have a .srt file of your audio. Read through it and compare it to the audio to get an impression of how well this works! Before we start editing, we have one last command to perform:
$ srt2tsv [NAME_OF_YOUR_AUDIO].srt
This will create a .tsv file, with which we can now edit:
$ vim [NAME_OF_YOUR_AUDIO].tsv
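To get a feeling for what the conversion step hands to Vim, here is a rough re-implementation of the idea in awk. It is not the srt2tsv script that ships with tsv_edl.vim: the function name is hypothetical, and the column layout (EDL marker, start, end, clip name wrapped in "| |", text) is an assumption based on the description in this post, so compare it with a file produced by the real tool.

```shell
#!/bin/sh
# Sketch: turn SRT cues (read from stdin) into tab-separated EDL rows.
# $1 is the clip name written into every row. The column layout is an assumption.
srt_to_tsv_sketch() {
    awk -v clip="$1" 'BEGIN { RS = ""; FS = "\n"; OFS = "\t" }
    {
        # Each blank-line separated cue: line 1 = index, line 2 = "start --> end",
        # remaining lines = subtitle text (joined with spaces).
        split($2, t, " --> ")
        text = $3
        for (i = 4; i <= NF; i++) text = text " " $i
        print "EDL", t[1], t[2], "| " clip " |", text
    }'
}

# Usage: srt_to_tsv_sketch interview < interview.srt > interview.tsv
```

Files with Windows line endings would need a pass through dos2unix first, which is presumably why it is on the package list.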
Yes, Vim is weird and nobody should use it...but it's what the nerds scripted an integrated audio playback for, so let's get to it. If you've never used Vim, like any normal person, it's not your usual text editor. To actually edit text, you'll have to press "i" and enter "insert mode" before you can type. The escape key gets you back to "normal" mode, where you can give Vim commands. To close Vim, type ":quit", although if you have changed the opened file and want to throw those changes away, you'll need to add an "!" to that (":wq" saves and quits). Fun stuff! A tutorial on Vim can be found here.
What is important is: There's a bunch of keys that do the things we want for editing our audio. We can move up and down the lines with the arrow keys, and we can cut lines with backspace (the "EDL" in the beginning will become "xxx" to indicate that this line will be cut). And the best part is, we can also listen to the audio to check if we're cutting at the right spot, but for that we need to do one small thing: We need to remove the ".wav" from the filename in every line, between the timestamp and the transcription. Find & replace should do the trick, here is the Vim command you want:
:.,$s/filename.wav/filename/
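The ":" tells Vim that we're issuing a command. The ".,$" is the range: from the current line (".") to the last line ("$"), so run it with the cursor on the first line (or use "%" as the range instead) to cover the whole document. The "s" stands for substitute, and then we have the search term followed by the replacement term, separated by "/". Strictly speaking, the "." in the search term matches any character, which is harmless here but can be written as "\." to match only a literal dot.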
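The same clean-up can also be done before opening the file in Vim. A minimal sketch with awk, assuming the clip name sits in the fourth tab-separated column (check this against your own .tsv); the function name is made up for illustration:

```shell
#!/bin/sh
# Sketch: remove ".wav" from the clip-name column only, leaving the transcript text alone.
# Assumes tab-separated rows with the clip name in column 4.
strip_wav_from_clipname() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF >= 4 { sub(/\.wav/, "", $4) }   # only rows that have a clip-name column
         { print }'
}

# Usage: strip_wav_from_clipname < interview.tsv > interview.clean.tsv
```

Restricting the replacement to one column means a ".wav" that happens to be mentioned in the spoken text stays untouched.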
Now we can also listen to the line currently below our cursor with Shift + Tab (as long as the audio file is in the same directory). Now you can probably see how editing this way is so blazingly fast: just skim the text and delete lines you don't need, and listen in at any point as needed... There's obviously a bit more; here is the reference card for audio editing with vim for delving deeper into the magic. You can even copy and paste lines from different transcripts together into one new file, adding some script in between for the narration, thus creating a classic composed program.
To now create the actual edited audio file, we can select all the lines that we want to process for our audio with the shortcut "V" (add lines by moving the cursor along) and press the spacebar. Any lines marked "xxx" in our selection will be cut from the audio. Vim will ask us how to name the new file (yes, it's non-destructive!) and then it will create the audio for us. If you also recorded, transcribed and inserted your own narration, you can even create rough cuts of a complete composed program or feature, not just interviews.
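To estimate how long the rough cut will be before rendering it, the kept lines can be summed up outside of Vim. This is a sketch, not part of tsv_edl.vim: the function name is hypothetical, and it assumes tab-separated rows with the marker in column 1 and the start and end timestamps (HH:MM:SS,mmm) in columns 2 and 3.

```shell
#!/bin/sh
# Sketch: add up the duration of all rows still marked "EDL" (rows marked "xxx" are cut).
# Reads the .tsv from stdin and prints the total in seconds.
edl_kept_seconds() {
    awk 'BEGIN { FS = "\t" }
    # Convert "HH:MM:SS,mmm" to seconds; p is a local scratch array.
    function secs(ts,   p) {
        split(ts, p, /[:,]/)
        return p[1] * 3600 + p[2] * 60 + p[3] + p[4] / 1000
    }
    $1 == "EDL" { total += secs($3) - secs($2) }
    END { printf "%.3f\n", total }'
}

# Usage: edl_kept_seconds < interview.tsv
```

Narration lines or anything else that does not start with "EDL" is simply ignored by this count.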
Part III: Some Insights and Ideas
So obviously, editing audio like this is huge news. Not just because it only took me ~40 minutes to cut a two-hour interview (it's now one hour and 17 minutes long), simply because reading is so much faster than listening, but also because it's possible to combine several audio files, cherry-pick the lines you want to use and arrange them according to your narrative idea, then even add newly recorded narration, before the computer edits everything together for you. In this way, any format that isn't live, like reports, interviews, composed programs and features, can be created in a more focused manner.
I believe the workflow for radio journalism will greatly improve through this method in the long term, and it's awesome that Linux users can now do this with Whisper and Vim, while the BBC is still testing its own approach as an internal beta (as of March 2024).
It's not just the workflow that could improve. I also see huge advantages for cross-media publishing, as it would be a piece of cake to do one last correction of the complete script and publish your story as text and audio on your respective media website, even adding automated translations for more languages. The potential for multilingual and accessible journalism is definitely worth working towards.
Still, there are some drawbacks currently. Let's address those now.
First, the most obvious problem: As this write-up shows, there's a lot of technical expertise involved (the Linux command line and Vim), which requires substantial motivation from people who aren't developers or tech enthusiasts. Although I believe many people could learn the current workflow, I can guarantee you from my experience in the world of community radio that many will refuse. This isn't ill will; these people are very dedicated and nerdy in their own area, creating many wonderful hours of radio shows in their free time. But making radio is something a very diverse group of people do, and to enable them to do that, they need software that just works (and stays the same over the years). If someone crafted the same features described here onto any traditional GUI-based text editor with clickable icons, I'd see no reason why people wouldn't immediately start using it.
Secondly, having a rough cut is great, and as radio journalists, it's no problem to then do one more production round. But it would greatly help if at least some form of crossfading were applied. Crossfades might even make many rough cuts become final cuts by default (apart from a bit of volume work, EQs and dynamic effects). Another approach would be to make sure the spliced audio pieces were clearly separated in the rough cut. This would make fine-tuning the audio for complex productions way easier, as it would be clearly visible where I have to work on transitions and stuff. Thinking about this led to another thought: Would it be possible to also create an Audacity/Tenacity project file, with the audios on alternating tracks, completely prepped for a last round of edits and postproduction? If anyone can do that, I'll gladly test and debug it!
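Until something like that exists, two already rendered pieces can be joined by hand with ffmpeg's acrossfade filter. This is a stopgap sketch outside of tsv_edl.vim, not a feature of it; the function name and the 50 ms default fade are assumptions:

```shell
#!/bin/sh
# Sketch: join two clips with a short crossfade using ffmpeg's acrossfade filter.
# $1, $2: input clips; $3: output file; $4: optional fade length in seconds (default 0.05).
crossfade_pair() {
    ffmpeg -i "$1" -i "$2" -filter_complex "acrossfade=d=${4:-0.05}" "$3"
}

# Usage: crossfade_pair part1.wav part2.wav joined.wav 0.1
```

Both clips need to be longer than the fade, and a longer program would have to be chained pair by pair, so this does not replace a proper postproduction round.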
Third, and less in regard to the workflow and Vim itself, but more in regard to the automated transcription: Whisper is neat and all, but obviously it will produce mistakes or miss important parts of your audio. Using LLMs for easier access to recordings and relying on them are two very different things. Going forward, it will be essential for radio journalists like me to also double-check that we're not missing important parts of our stories just because we're only working with what the LLM could extract from our recordings. A combination of notes on important aspects we notice while gathering our material in the field and double-checking audio transcriptions against the source material should go a long way in addressing this issue, but it's essential that we take the time needed for this double-checking of our work. Never forget, working as a journalist means due diligence, checking your sources and making sure the stories you tell are not just meaningful but also factual. No so-called "AI" can replace this part of journalistic work. Let me illustrate this point with one closing example. I have one whole interview that is clearly audible and understandable for human ears, but the transcript looks like this:
00:27:42,000 00:27:52,000 [background noise]
00:27:52,000 00:28:02,000 [background noise]
00:28:02,000 00:28:12,000 [background noise]
00:28:12,000 00:28:22,000 [background noise]