Comprehensive Guide to VoIP Voice Quality

From VoIPmonitor.org
Revision as of 16:12, 11 December 2025 by Maintenance script (Created comprehensive article on VoIP Impairments based on academic research)

This article provides a comprehensive technical reference on voice quality impairments in telephony and VoIP (Voice over IP) networks. Understanding these degradation factors is essential for network engineers, VoIP administrators, and anyone involved in voice quality monitoring and optimization.

Introduction

Voice quality in telecommunications is affected by numerous impairment factors that can degrade the user experience. These factors can be broadly categorized into:

  • Traditional telephony impairments - factors inherent to analog and digital telephone networks
  • IP network impairments - factors specific to packet-switched networks

The Mean Opinion Score (MOS) is the standard metric for measuring voice quality, ranging from 1 (bad) to 5 (excellent). The R-factor (from the E-model, ITU-T G.107) provides an objective quality measure ranging from 0 to 100.

The relationship between MOS and R-factor is defined by:

$$\mathrm{MOS} = 1 + 0.035\,R + R\,(R - 60)\,(100 - R) \times 7 \times 10^{-6}$$

Figure: MOS to R-factor conversion curve
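
The conversion above can be sketched in a few lines of Python. The function name `r_to_mos` is hypothetical. The clamping to 1.0 for R ≤ 0 and to 4.5 for R ≥ 100 follows the convention in ITU-T G.107 and is an addition not stated in this article.

```python
def r_to_mos(r: float) -> float:
    """Estimate MOS from an E-model R-factor using the cubic mapping."""
    if r <= 0:
        return 1.0   # lower clamp (G.107 convention, not stated in the article)
    if r >= 100:
        return 4.5   # upper clamp (G.107 convention, not stated in the article)
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


if __name__ == "__main__":
    for r in (50, 70, 80, 93.2):
        print(f"R = {r:5.1f} -> MOS = {r_to_mos(r):.2f}")
```

For example, an R-factor of 93.2 maps to a MOS of roughly 4.41, and an R-factor of 50 maps to about 2.58.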

Traditional Telephony Impairments

Loudness

Loudness, expressed through send and receive loudness ratings, is one of the most important factors affecting voice quality. The Overall Loudness Rating (OLR) is the sum of:

  • SLR (Send Loudness Rating) - transmitting side attenuation
  • RLR (Receive Loudness Rating) - receiving side attenuation

OLR=SLR+RLR

According to ITU-T recommendations:

  • Optimal OLR range: 5-15 dB
  • Values below 5 dB cause discomfort due to excessive loudness
  • Values above 15 dB make speech difficult to understand

Figure: Mean Opinion Score as a function of attenuation (OLR) at different circuit noise levels

Circuit Noise

Circuit noise refers to unwanted electrical signals that interfere with voice transmission. The main types include:

  • White noise (thermal noise) - constant background noise
  • Intermodulation noise - caused by nonlinear mixing of signals
  • Impulse noise - short-duration high-amplitude spikes
  • Crosstalk - interference from adjacent circuits

Circuit noise is measured in dBmp (decibels relative to 1 milliwatt, psophometrically weighted). According to ITU-T P.11:

Circuit Noise Impact on Voice Quality

Noise Level (dBmp)    Quality Impact
-80 to -70            Negligible impact
-70 to -65            Minor degradation
-65 to -55            Noticeable degradation
-55 to -40            Significant degradation
> -40                 Severe degradation

Figure: Mean Opinion Score as a function of circuit noise level

Sidetone

Sidetone is the controlled coupling of the speaker's voice from the microphone to their own earpiece. This feedback helps speakers:

  • Regulate their speaking volume
  • Confirm the connection is active
  • Maintain natural conversation rhythm

The Sidetone Masking Rating (STMR) measures sidetone level:

  • Optimal STMR: 7-15 dB
  • Too low (< 7 dB, sidetone too loud): Speakers perceive their voice as too loud and speak quietly
  • Too high (> 15 dB, sidetone too quiet): Speakers compensate by speaking louder (Lombard effect)

ITU-T G.121 provides detailed recommendations for sidetone design.

Attenuation Distortion

Attenuation distortion (also called frequency distortion) occurs when different frequencies are attenuated unequally across the transmission path. This is particularly problematic in:

  • Analog subscriber lines
  • Frequency-division multiplexing systems
  • Audio codecs with limited bandwidth

ITU-T G.111 specifies the allowed frequency response deviations:

  • Reference frequency: 1020 Hz
  • Bandwidth: 300 Hz - 3400 Hz
  • Maximum deviation: ±3 dB (within band)

Group Delay Distortion

Group delay (or envelope delay) distortion occurs when different frequency components of a signal travel at different speeds through the transmission medium. This is caused by:

  • Non-linear phase response of filters
  • Transmission line characteristics
  • Codec processing

While less perceptible than other impairments in narrowband telephony, group delay distortion becomes more significant in wideband and video communications.

Absolute Delay

Absolute (one-way) delay is the total time required for a voice signal to travel from the speaker's mouth to the listener's ear. In VoIP, this includes:

  • Codec delay - encoding/decoding processing time
  • Packetization delay - assembling voice samples into packets
  • Network delay - transmission through the IP network
  • Jitter buffer delay - buffering to smooth packet arrival variations
  • Playout delay - final processing and D/A conversion

ITU-T G.114 Delay Recommendations

One-way Delay    Quality Assessment
0-150 ms         Acceptable for most applications
150-400 ms       Acceptable with caution (users notice delay)
> 400 ms         Unacceptable for normal conversation

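As a rough illustration, the delay components listed in this section can be summed into a one-way budget and checked against the G.114 bands from the table. This is a minimal sketch: the function name `assess_one_way_delay` and the per-component millisecond values are hypothetical and chosen only to show the arithmetic.

```python
def assess_one_way_delay(delay_ms: float) -> str:
    """Classify a one-way delay using the G.114 bands quoted in this article."""
    if delay_ms < 0:
        raise ValueError("delay must be non-negative")
    if delay_ms <= 150:
        return "acceptable"
    if delay_ms <= 400:
        return "acceptable with caution"
    return "unacceptable"


# Hypothetical delay budget in milliseconds (illustrative values only).
budget_ms = {
    "codec": 5,
    "packetization": 20,
    "network": 60,
    "jitter_buffer": 40,
    "playout": 5,
}
total_ms = sum(budget_ms.values())

if __name__ == "__main__":
    print(total_ms, assess_one_way_delay(total_ms))
```

With these assumed values the budget totals 130 ms, which falls in the first band. Raising the network delay to 200 ms would push the call into the "acceptable with caution" band.
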
Talker Echo

Talker echo occurs when a speaker hears their own voice reflected back from the far end of the connection. This is primarily caused by impedance mismatch at the hybrid (2-wire to 4-wire converter).

Figure: 2-wire/4-wire hybrid converter - source of talker echo

The severity of talker echo depends on:

  • Echo return loss (ERL) - attenuation of the reflected signal
  • Round-trip delay - time between speaking and hearing the echo
  • TELR (Talker Echo Loudness Rating) - combined measure of echo perception

Figure: TELR tolerance limits as a function of round-trip delay (ITU-T G.131)

According to ITU-T G.131:

  • Echo becomes perceptible when delay exceeds 25 ms
  • Echo cancellation is required when delay exceeds 25-30 ms
  • Modern VoIP systems typically require echo cancellers due to inherent network delay

Listener Echo

Listener echo occurs when the listener hears the talker's speech followed by one or more delayed copies of it. This is typically caused by:

  • Acoustic coupling in speakerphones
  • Feedback in headsets
  • Microphone pickup of speaker output

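The Listener Echo Loudness Rating (LELR) is defined as:

LELR=STMR+Dr

where:

  • STMR = Sidetone Masking Rating
  • Dr = Room noise factor

Note that ITU-T G.107 uses the same sum, STMR + Dr, for the Listener Sidetone Rating (LSTR), where Dr is the D-value of the receive-side telephone. In G.107, listener echo itself is characterized by the Weighted Echo Path Loss (WEPL).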

Nonlinear Distortion

Nonlinear distortion occurs when the output signal is not proportional to the input signal. This includes:

  • Harmonic distortion - generation of harmonic frequencies
  • Intermodulation distortion - mixing products of different frequencies
  • Clipping - signal amplitude exceeding system limits

ITU-T O.41 specifies measurement methods for nonlinear distortion. Total Harmonic Distortion (THD) should be below 1% for acceptable quality.

Quantization Distortion

Quantization distortion (quantization noise) is inherent to analog-to-digital conversion. The qdu (quantization distortion unit) measures this impairment.

For PCM systems:

  • A-law/μ-law companding (G.711): 1 qdu per encoding
  • Tandem connections: each additional encoding adds its qdu to the total, so quantization noise accumulates along the path

Figure: E-model reference connection showing all impairment parameters

IP Network Impairments

Delay and Jitter

In IP networks, delay consists of:

  • Propagation delay - signal travel time through physical medium
  • Serialization delay - time to transmit packet bits
  • Processing delay - router/switch processing time
  • Queuing delay - waiting time in network buffers

Jitter (delay variation) is the inconsistency in packet arrival times. Two common metrics are:

IPDV (IP Packet Delay Variation):

$$\mathrm{IPDV}(i) = D(i) - D(i-1)$$

PDV (Packet Delay Variation):

$$\mathrm{PDV}(i) = D(i) - D_{\min}$$

where D(i) is the delay of packet i, and Dmin is the minimum observed delay.
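
Both metrics are simple differences over a list of per-packet delays. The sketch below assumes the one-way delays are already known in milliseconds. The function names and the sample values are hypothetical.

```python
def ipdv(delays):
    """IPDV(i) = D(i) - D(i-1), defined from the second packet onward."""
    return [delays[i] - delays[i - 1] for i in range(1, len(delays))]


def pdv(delays):
    """PDV(i) = D(i) - Dmin, where Dmin is the minimum observed delay."""
    d_min = min(delays)
    return [d - d_min for d in delays]


if __name__ == "__main__":
    delays_ms = [40, 45, 42, 60, 41]   # illustrative one-way delays
    print(ipdv(delays_ms))             # [5, -3, 18, -19]
    print(pdv(delays_ms))              # [0, 5, 2, 20, 1]
```

IPDV can be negative because it compares consecutive packets, while PDV is never negative because it is measured against the minimum delay.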

Figure: Delay and jitter measurement showing packet delay variation over time

Jitter Buffer

The jitter buffer (de-jitter buffer) compensates for delay variation by:

  1. Buffering incoming packets
  2. Smoothing out arrival time variations
  3. Providing packets to the decoder at regular intervals

Figure: De-jitter buffer architecture in VoIP receive path

Jitter buffer types:

  • Static - fixed buffer size (simpler, higher latency)
  • Adaptive - dynamic buffer size based on network conditions (lower latency, more complex)

Trade-offs:

  • Larger buffer → lower packet loss, higher delay
  • Smaller buffer → higher packet loss, lower delay
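
The trade-off can be illustrated with a small simulation of a static buffer. This is a sketch under simplifying assumptions: packets are sent at a fixed interval, playout of the first packet is scheduled `buffer_ms` after it arrives, and any packet arriving after its playout time counts as lost. The function name and delay values are hypothetical.

```python
def late_packets(delays_ms, buffer_ms, interval_ms=20):
    """Count packets that miss their playout deadline in a static jitter buffer."""
    first_arrival = delays_ms[0]          # packet 0 is sent at t = 0
    late = 0
    for i, delay in enumerate(delays_ms):
        arrival = i * interval_ms + delay
        playout = first_arrival + buffer_ms + i * interval_ms
        if arrival > playout:
            late += 1                     # too late to play, treated as lost
    return late


if __name__ == "__main__":
    delays = [40, 45, 42, 75, 41, 95, 43, 40]   # illustrative network delays
    for size in (20, 40, 60):
        print(f"buffer {size} ms -> {late_packets(delays, size)} late packet(s)")
```

With these sample delays, a 20 ms buffer discards two packets, a 40 ms buffer discards one, and a 60 ms buffer discards none, at the cost of 60 ms of added delay.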

Packet Loss

Packet loss in IP networks can be:

  • Random loss - uncorrelated packet drops
  • Burst loss - consecutive packet drops

The Packet Loss Percentage (Ppl) is calculated as:

$$P_{pl} = \frac{N_{\mathrm{lost}}}{N_{\mathrm{total}}} \times 100$$

The Burst Ratio (BurstR) characterizes loss patterns:

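$$\mathrm{BurstR} = \frac{\mathrm{MBL}_B}{\mathrm{MBL}_R}$$

where:

  • MBLR = Mean Burst Length under random loss model
  • MBLB = Mean Burst Length actually observed

BurstR equals 1 for purely random loss and exceeds 1 when losses are bursty.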

Figure: Mean Opinion Score degradation as a function of packet loss percentage

Packet Loss Impact on Voice Quality

Loss Rate    Quality Impact
0-1%         Negligible with good PLC
1-3%         Minor degradation noticeable
3-5%         Noticeable degradation
5-10%        Significant degradation
> 10%        Severe degradation, often unacceptable

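Both loss metrics can be computed from a per-packet received/lost trace. The sketch below follows the ITU-T G.107 convention, where the burst ratio is the observed mean burst length divided by the mean burst length expected under random loss. For independent random loss with probability p, that expected length is 1/(1 - p). The function name `loss_stats` and the sample trace are hypothetical.

```python
def loss_stats(received):
    """Return (Ppl in percent, BurstR) for a list of booleans (True = received)."""
    total = len(received)
    lost = received.count(False)
    ppl = 100.0 * lost / total

    # Collect the lengths of consecutive-loss runs (bursts).
    bursts, run = [], 0
    for ok in received:
        if ok:
            if run:
                bursts.append(run)
            run = 0
        else:
            run += 1
    if run:
        bursts.append(run)

    if not bursts or lost == total:
        return ppl, None                  # burst ratio is undefined here

    mean_observed = sum(bursts) / len(bursts)
    mean_random = total / (total - lost)  # equals 1 / (1 - p) with p = lost / total
    return ppl, mean_observed / mean_random


if __name__ == "__main__":
    trace = [c == "1" for c in "11110001111111011111"]
    print(loss_stats(trace))
```

In this trace, 4 of 20 packets are lost (20%) in bursts of lengths 3 and 1, giving a burst ratio of about 1.6, which indicates burstier loss than the random model predicts.
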
Packet Loss Concealment (PLC)

PLC algorithms mitigate the impact of lost packets by:

  • Packet repetition - replaying the last received packet
  • Interpolation - generating replacement samples based on surrounding data
  • Codec-specific PLC - built-in concealment in codecs like G.729

Figure: MOS comparison of different codecs under varying packet loss conditions
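
Packet repetition, the simplest of these techniques, can be sketched as follows. This is an illustrative model only: frames are lists of samples, a lost frame is represented by `None`, and the function name is hypothetical. Practical concealment algorithms typically also fade out repeated audio during long loss bursts, which this sketch does not do.

```python
def conceal_by_repetition(frames, silence_frame):
    """Replace each lost frame (None) with a copy of the last received frame."""
    output = []
    last_good = silence_frame             # used if the very first frame is lost
    for frame in frames:
        if frame is None:
            output.append(list(last_good))   # replay the previous frame
        else:
            last_good = frame
            output.append(frame)
    return output


if __name__ == "__main__":
    stream = [None, [1, 2], None, None, [3, 4]]
    print(conceal_by_repetition(stream, silence_frame=[0, 0]))
```

In the example, the first lost frame becomes silence, and the two consecutive lost frames both replay `[1, 2]`.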

Packet Reordering

Packet reordering occurs when packets arrive in a different order than they were sent. This can be caused by:

  • Load balancing across multiple network paths
  • Router buffering variations
  • Network congestion causing rerouting

Reordered packets may be:

  • Treated as lost (if outside jitter buffer window)
  • Buffered and resequenced (requires additional buffering)
  • Discarded as duplicates
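
The three outcomes can be illustrated with a small resequencing buffer keyed on sequence numbers. This is a minimal sketch, not a description of any particular implementation: the function name and the `window` parameter are hypothetical, and sequence-number wraparound (RTP sequence numbers are 16 bits wide) is ignored for brevity.

```python
import heapq


def resequence(arrivals, window=2):
    """Reorder packets by sequence number, holding at most `window` packets.

    Returns (played, dropped), where dropped holds (sequence, reason) tuples.
    """
    heap, seen = [], set()
    played, dropped = [], []
    last_played = None
    for seq in arrivals:
        if seq in seen:
            dropped.append((seq, "duplicate"))
            continue
        seen.add(seq)
        if last_played is not None and seq < last_played:
            dropped.append((seq, "late"))     # its playout slot has already passed
            continue
        heapq.heappush(heap, seq)
        if len(heap) > window:                # buffer full: release the oldest packet
            last_played = heapq.heappop(heap)
            played.append(last_played)
    while heap:                               # flush the remainder at end of stream
        played.append(heapq.heappop(heap))
    return played, dropped


if __name__ == "__main__":
    print(resequence([1, 2, 4, 3, 5, 8, 6, 7, 6, 9]))
    print(resequence([1, 2, 5, 6, 7, 3, 4, 8]))
```

In the first trace, all reordering fits in the window and only the repeated packet 6 is dropped. In the second, packets 3 and 4 arrive after packet 5 has been played and are treated as lost.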

Voice Quality Assessment Methods

Subjective Assessment

Subjective testing involves human listeners rating voice quality. The standard is MOS-LQS (Mean Opinion Score - Listening Quality Subjective) defined in ITU-T P.800.

Objective Assessment

Objective methods use algorithms to predict perceived quality:

Figure: Speech quality assessment hierarchy - subjective and objective methods

Intrusive Methods (Double-ended)

Require reference signal:

  • PESQ (Perceptual Evaluation of Speech Quality) - ITU-T P.862
  • POLQA (Perceptual Objective Listening Quality Analysis) - ITU-T P.863

Figure: PESQ perceptual model architecture

Figure: Correlation between subjective MOS and PESQ scores

Non-intrusive Methods (Single-ended)

Do not require reference signal:

  • P.563 (Single-ended method) - ITU-T P.563
  • E-model (Parametric model) - ITU-T G.107

Figure: P.563 speech quality parameters extraction

E-model

The E-model (ITU-T G.107) calculates the R-factor based on transmission parameters:

$$R = R_0 - I_s - I_d - I_e + A$$

where:

  • R0 = Basic signal-to-noise ratio (with the G.107 default parameter values, the overall rating R evaluates to about 93.2)
  • Is = Simultaneous impairment factor (loudness, sidetone, noise)
  • Id = Delay impairment factor
  • Ie = Equipment impairment factor (codec degradation, packet loss)
  • A = Advantage factor (user expectation adjustment)
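
A minimal sketch of the calculation, assuming the impairment factors have already been determined. Computing Is, Id and Ie from raw transmission parameters is the bulk of G.107 and is not shown. The function names and all input values are hypothetical and serve only to illustrate the arithmetic. The MOS clamping at R ≤ 0 and R ≥ 100 follows the G.107 convention.

```python
def r_factor(r0, i_s, i_d, i_e, a=0.0):
    """R = R0 - Is - Id - Ie + A."""
    return r0 - i_s - i_d - i_e + a


def r_to_mos(r):
    """Cubic R-to-MOS mapping from this article, clamped to [1.0, 4.5]."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


if __name__ == "__main__":
    # Illustrative inputs only, not measured or standardized values.
    r = r_factor(r0=94.8, i_s=1.4, i_d=10.0, i_e=15.0, a=0.0)
    print(f"R = {r:.1f}, MOS = {r_to_mos(r):.2f}")
```

With these assumed inputs, R is about 68.4, which corresponds to a MOS of roughly 3.5.
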

Figure: Voice quality measurement methodology overview

Figure: PESQ scores for different codecs under packet loss conditions

Figure: 3D visualization of MOS as a function of packet loss probability and burst ratio

References

  • ITU-T G.107 - The E-model: a computational model for use in transmission planning
  • ITU-T G.111 - Loudness ratings (LRs) in an international connection
  • ITU-T G.114 - One-way transmission time
  • ITU-T G.121 - Loudness ratings (LRs) of national systems
  • ITU-T G.131 - Talker echo and its control
  • ITU-T P.11 - Effect of transmission impairments
  • ITU-T P.800 - Methods for subjective determination of transmission quality
  • ITU-T P.862 - PESQ algorithm
  • ITU-T P.863 - POLQA algorithm
  • ITU-T O.41 - Psophometer for use on telephone-type circuits

See Also