Notes on Loudness Normalization
You know the situation: Dialogue in a movie is barely audible, so you turn the volume all the way up. The next scene has an explosion and your ears explode.
To prevent this, there are algorithms to normalize loudness. I wasn't really interested in reading everything there is to know about loudness normalization though. Instead, I just experimented with a few options. I also looked into how they can be used on a Linux desktop.
PipeWire filters
PipeWire has recently replaced older sound servers like PulseAudio or Jack on Linux desktops. It provides backwards compatibility with the old systems, so you can for example use the Pulse Volume Control GUI. It also provides features similar to Jack (or even PureData) where you can create different audio processing nodes and connect them together.
Creating a filter node was easy enough. However, I had to manually connect it to the audio streams I wanted to process. For stereo audio, that meant manually creating 4 links (2 links from the movie to the filter and 2 links from the filter to the speakers). I tried to create these links automatically via the API, to no avail. I also tried to fiddle with WirePlumber, with similar results.
Finally I found filter chains, an apparently completely unrelated PipeWire feature that creates a virtual sink in front of the filter and automatically connects its output to the default sink. This makes it really easy to use the filter with standard GUIs.
Filter chains are configured using a syntax that looks like JSON without commas. The documentation says they should be saved to ~/.config/pipewire/filter-chain.conf.d/, but for me they didn't load unless I saved them to ~/.config/pipewire/pipewire.conf.d/.
If there is any error in the configuration, the filter will just be ignored. I added ExecStart=/usr/bin/pipewire -vvv to /usr/lib/systemd/user/pipewire.service to get some debug output, which helped a little but not much.
For the filters themselves you have a couple of options:
- a couple of builtin low-level primitives like multiplication or logarithms
- LADSPA/LV2 plugins
- SOFA filters for spatially oriented audio
- EBU R 128 filters (we will get to that)
Out of all of these, LADSPA/LV2 plugins provide the most flexibility. However, I didn't get them to work, so I was mostly stuck building my filters from the builtin primitives.
This whole experience was a bit bumpy. Once I got this to work it was a joy, but documentation and the debug experience could certainly be improved.
Reshaping curves
My first idea was to apply a function directly to the audio signal. I landed on f(x) = 1.5x - 0.5x³. This function is symmetric around (0, 0), boosts small values, and compresses larger values so that the maximum value is still at 1.
It also reshapes the sound waves. A pure sine wave would be distorted when sent through this filter. I was curious to hear how that would affect the sound.
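To get a feel for the curve, here is a small Python sketch (my own illustration, not part of the filter) that evaluates f(x) = 1.5x - 0.5x³ at a few sample values:

```python
import numpy as np

def reshape(x):
    """Reshaping curve: boosts small values, compresses values near 1."""
    return 1.5 * x - 0.5 * x ** 3

samples = np.array([0.01, 0.1, 0.5, 0.9, 1.0])
for x, y in zip(samples, reshape(samples)):
    print(f"f({x}) = {y:.4f}")

# Small values are boosted by roughly 1.5x, while f(1.0) stays exactly at 1.0.
```

Note that the curve only stays in [-1, 1] for inputs in [-1, 1], which is fine here since PipeWire processes audio as floats in that range.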
This is the PipeWire configuration I came up with:
context.modules = [
    {
        name = libpipewire-module-filter-chain
        args = {
            node.description = "compressor"
            media.name = "compressor"
            filter.graph = {
                nodes = [
                    {
                        type = builtin
                        name = copy
                        label = copy
                    }
                    {
                        type = builtin
                        name = cube
                        label = mult
                    }
                    {
                        type = builtin
                        name = mixer
                        label = mixer
                        control {
                            "Gain 1" = 1.5
                            "Gain 2" = -0.5
                        }
                    }
                ]
                links = [
                    { output = "copy:Out" input = "cube:In 1" }
                    { output = "copy:Out" input = "cube:In 2" }
                    { output = "copy:Out" input = "cube:In 3" }
                    { output = "copy:Out" input = "mixer:In 1" }
                    { output = "cube:Out" input = "mixer:In 2" }
                ]
            }
            audio.channels = 2
            capture.props = {
                node.name = "effect_input.compressor"
                media.class = Audio/Sink
            }
            playback.props = {
                node.name = "effect_output.compressor"
                node.passive = true
            }
        }
    }
]
The result sounded OK, but also not quite like what I had in mind: the compression of larger values was barely noticeable because the audio data doesn't actually contain many large values. On the plus side, this meant that the wave distortion effect was small. But overall it didn't do much beyond increasing the volume.
Fourier Transforms
It is a fun exercise to apply techniques from image processing to sound or the other way around.
I had experimented with optimizing images by spreading each of the red, green, and blue channels so that the minimum value for each is 0% and the maximum value is 100%. That technique turned out useful to remove color casts from old photos.
To apply this technique to sound, my approach was to first do a Fourier transform to get the strength of each frequency, spread these strengths, and then do the inverse Fourier transform.
The minimum turned out to be 0 in most cases. But I thought this might also be a good chance to do some additional noise reduction. So I shifted the minimum anyway.
On the other end, I didn't want to cancel out all differences in loudness. So instead of stretching the maximum to 100% everywhere, I opted to just push it slightly in that direction by applying a square root.
Finally, I didn't want to have abrupt changes in loudness. So I smoothed the minimum and maximum by mixing it with the previous values.
Because I didn't know how to implement this using PipeWire filter chains, I prototyped it in python instead:
import sys

import numpy as np
import soundfile as sf

CHUNK_SIZE = 2048
KEEP = 0.9
CUTOFF = 0.02
BOOST = 0.5

audio_data, sample_rate = sf.read(sys.argv[1])

chunks = []
min_magnitude = 0
max_magnitude = 1

for start in range(0, len(audio_data), CHUNK_SIZE):
    end = min(start + CHUNK_SIZE, len(audio_data))

    # transform along the time axis (sf.read returns (samples, channels) for stereo)
    fft_data = np.fft.fft(audio_data[start:end], axis=0)
    magnitude = np.abs(fft_data)

    # smooth min/max by mixing with the previous values
    min_magnitude = np.min(magnitude) * (1 - KEEP) + min_magnitude * KEEP
    max_magnitude = np.max(magnitude) * (1 - KEEP) + max_magnitude * KEEP

    spread_magnitude = (
        (magnitude - min_magnitude - (max_magnitude - min_magnitude) * CUTOFF)
        / (max_magnitude - min_magnitude - (max_magnitude - min_magnitude) * CUTOFF)
        * (max_magnitude ** BOOST)
    )
    spread_magnitude = np.clip(spread_magnitude, 0, 1)

    # keep the phases, replace the magnitudes
    new_fft_data = spread_magnitude * np.exp(1j * np.angle(fft_data))
    processed_chunk = np.fft.ifft(new_fft_data, axis=0)

    chunks.append(np.real(processed_chunk))

processed = np.concatenate(chunks)
sf.write('processed.flac', processed, sample_rate)
The result sounded OK (no noticeable distortion) but the quiet parts were still too quiet.
EBU R 128
In the meantime I did some reading on the last kind of filter that PipeWire had to offer. I had never heard of EBU R 128 before. It turns out it has quite an interesting story.
EBU is short for "European Broadcasting Union". That is the same organization that runs the Eurovision Song Contest, so the story already starts out glamorous.
In the last few decades, there was a thing called the Loudness War: Audio producers who wanted their songs and jingles to be more noticeable used compression to increase the average loudness of the sound, while leaving the peaks at the same level. EBU R 128 provides loudness recommendations for its member organizations, which effectively stopped the loudness war.
We shouldn't give too much credit to EBU though. Much of the specification is in turn based on ITU-R BS.1770-5 by the International Telecommunication Union. This might actually be one of the best standards I have ever read. It first gives a conceptual overview, then provides all normative formulas, and then goes deep into the rationale and methodology. It was a very interesting and at the same time approachable read.
The only downside is of course the name. I can understand why EBU R 128 is more commonly used.
Loudness is typically measured as the logarithm of power, which in turn is calculated as the integral over the squared audio signal. In the case of ITU-R BS.1770-5:

    L_K = -0.691 + 10 · log10( Σ_i G_i · z_i )    with    z_i = (1/T) ∫ y_i² dt

where y_i is the frequency-weighted signal of channel i and G_i is a per-channel weight (1.0 for the front channels, 1.41 for the surround channels).
The unit for loudness is LKFS (Loudness, K-weighted, relative to full scale). EBU uses the same unit, but calls it LUFS (Loudness units relative to full scale).
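As a rough sanity check, here is a Python sketch of the core formula. I skip the K-weighting pre-filter (a simplification on my part; the real standard filters the signal first), so the well-known reference signal of a full-scale 997 Hz sine comes out at about -3.7 instead of the standard's -3.01 LUFS:

```python
import numpy as np

def loudness_lufs(signal):
    """Unweighted loudness per ITU-R BS.1770: -0.691 + 10*log10(mean square).
    Note: the real standard applies K-weighting first; it is omitted here."""
    mean_square = np.mean(signal ** 2)
    return -0.691 + 10 * np.log10(mean_square)

sample_rate = 48000
t = np.arange(sample_rate) / sample_rate
sine = np.sin(2 * np.pi * 997 * t)  # full-scale 997 Hz test tone, 1 second

print(loudness_lufs(sine))  # about -3.7 without K-weighting
```

The missing 0.69 dB is exactly what the K-weighting contributes at 997 Hz; the constant -0.691 in the formula compensates for it.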
Before all that is calculated, frequencies are weighted to account for human hearing. The industry standard is a curve simply called A-weighting. ITU-R BS.1770-5 however cites a study by Soulodre which found that applying no weighting at all performs better than A-weighting, and that a new curve called RLB performs better still.
In addition to the frequency weighting curve, ITU-R BS.1770-5 also specifies an algorithm to calculate "gated" loudness. In this version, power is calculated as the average over many small chunks. Chunks that are too quiet are ignored.
On top of this, EBU Tech 3341 defines three profiles:
- "Momentary Loudness" is measured over a 400ms window without gating
- "Short-term Loudness" is measured over a 3s window without gating
- "Integrated Loudness" is measured over the complete audio with gating
If you want to use this system with PipeWire, its repository contains an example of how to use the ebur128 filter. Fair warning though: the current version has a typo, so "Shortterm" must be written as "Shorttem" instead.
I have used this filter with some success. This really does normalize loudness. However, there are still some issues. With the "Short-term" profile there is a noticeable ramp when going from a quiet section to a loud section or the other way around. So when there is a sudden bang after a quiet section, it gets amplified even further.
Conclusion
I want to be able to hear all dialogue, but I don't want loud explosions or background noise to be amplified. It is tricky to make that distinction with these simple techniques. I feel like I could get lost in trying to tweak all the parameters to perfection, so I better stop here.
PipeWire turned out to be extremely flexible in theory, but also very limited in practice. For example, I wish there was a builtin power filter (there are builtins like mult and log, but no generic pow), or that it was possible to apply these filters to control values (e.g. the gain factor generated by ebur128). While the documentation is decent, I still had issues finding relevant information.
At this point this is just a collection of notes. I will use the ebur128 filter for a while and then maybe come back to this topic with some new ideas.