Emergency procedures

Revision as of 05:27, 6 January 2026


This guide covers emergency procedures for recovering your VoIPmonitor system from critical failures, including runaway processes, high CPU usage, and system unresponsiveness.

Emergency: VoIPmonitor Process Consuming Excessive CPU or System Unresponsive

When a VoIPmonitor process consumes excessive CPU (e.g., ~3000% or more) or causes the entire system to become unresponsive, follow these immediate steps:

Immediate Action: Force-Terminate Runaway Process

If the system is still minimally responsive via SSH or requires out-of-band management (iDRAC, IPMI, console):

1. Identify the Process ID (PID)
# Using htop (if available)
htop

# Or using ps
ps aux | grep voipmonitor

Look for the voipmonitor process consuming the most CPU resources. Note down the PID (process ID number).

2. Forcefully terminate the process
kill -9 <PID>

Replace <PID> with the actual process ID number identified in step 1.

3. Verify system recovery
# Check CPU usage has returned to normal
top

# Check if the process was terminated
ps aux | grep voipmonitor

The system should become responsive again immediately after the process is killed. CPU utilization should drop significantly.

Optional: Stop and Restart the Service (for persistent issues)

If the problem persists or the service needs to be cleanly restarted:

# Stop the voipmonitor service
systemctl stop voipmonitor

# Verify no zombie processes remaining
killall voipmonitor

# Restart the service
systemctl start voipmonitor

# Verify service status
systemctl status voipmonitor

Caution: When using systemd service management, avoid using the deprecated service command as it can cause systemd to lose track of the daemon. Always use systemctl commands or direct process commands like killall.

Root Cause Analysis: Why Did the CPU Spike?

After recovering the system, investigate the root cause to prevent recurrence. Common causes include:

SIP REGISTER Flood / Spamming Attack

Massive volumes of SIP REGISTER messages from malicious IPs can overwhelm the VoIPmonitor process.

  • Detection: Check recent alert triggers in the VoIPmonitor GUI > Alerts > Sent Alerts for SIP REGISTER flood alerts
  • Immediate mitigation: Block attacker IPs at the network edge (SBC, firewall, iptables); a minimal iptables sketch is shown below
  • Long-term prevention: Configure anti-fraud rules with custom scripts to auto-block attackers; see SIP REGISTER Flood Mitigation
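
A minimal sketch of blocking a flooding source with iptables; the address 203.0.113.10 and UDP SIP port 5060 are example values, and the equivalent rule can be applied on your SBC or edge firewall instead:

# Drop all SIP traffic from a single flooding source (example IP)
iptables -I INPUT -s 203.0.113.10 -p udp --dport 5060 -j DROP

# Or drop a whole offending /24 prefix
iptables -I INPUT -s 203.0.113.0/24 -p udp --dport 5060 -j DROP
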
Packet Capture Overload (pcapcommand)

The pcapcommand feature forks a program for every call, which can generate up to 500,000 interrupts per second.

  • Detection: Check /etc/voipmonitor.conf for a pcapcommand line
  • Immediate fix: Comment out or remove the pcapcommand directive and restart the service
  • Alternative: Use the built-in spool cleaning functionality (maxpoolsize, cleanspool) instead, as sketched below
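
A minimal /etc/voipmonitor.conf sketch assuming the built-in cleaner replaces pcapcommand; the maxpoolsize value is only an example (the parameter caps the spool size in MB):

[general]
# Comment out or remove any existing pcapcommand line:
# pcapcommand = ...

# Let the built-in spool cleaner cap the spool at roughly 100 GB (example value)
maxpoolsize = 102400
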
Excessive RTP Processing Threads

High concurrent call volumes can overload RTP processing threads.

  • Detection: Check performance logs for high tRTP_CPU values (sum of all RTP threads)
  • Mitigation:
callslimit = 2000  # Limit max concurrent calls
Audio Feature Overhead

Silence detection and audio conversion are CPU-intensive operations.

  • Detection: Check if silencedetect or saveaudio are enabled
  • Mitigation:
  silencedetect = no
  # saveaudio = wav  # Comment out if not needed
  

See Scaling and Performance Tuning for detailed performance optimization strategies.

Preventive Measures

Once the root cause is identified, implement these preventive configurations:

Monitor CPU Trends

Use collectd or your existing monitoring system to track CPU usage over time and receive alerts before critical thresholds are reached.
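
Where no monitoring stack is available yet, a lightweight sketch using pidstat from the sysstat package can log the sniffer's CPU usage over time (the 60-second interval and log path are arbitrary choices):

# Append per-process CPU statistics for the main voipmonitor process every 60 seconds
pidstat -p $(pgrep -o -x voipmonitor) 60 >> /var/log/voipmonitor-cpu.log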

Anti-Fraud Auto-Blocking

Configure Anti-Fraud rules with custom scripts to automatically block attacker IPs when a flood is detected. See the Anti-Fraud documentation for PHP script examples using iptables or ipset.

Network Edge Protection

Block SIP REGISTER spam and floods at your network edge (SBC, firewall) before traffic reaches VoIPmonitor. This provides better performance and reduces CPU load on the monitoring system.

Emergency: GUI and CLI Frequently Inaccessible Due to Memory Exhaustion

When the VoIPmonitor GUI and CLI become frequently inaccessible or the server becomes unresponsive due to Out of Memory (OOM) conditions, follow these steps to identify and resolve the issue.

Diagnose OOM Events

The Linux kernel out-of-memory (OOM) killer terminates processes when RAM is exhausted.

Check the kernel ring buffer for OOM events
dmesg -T | grep -i killed

If you see messages like "Out of memory: Kill process" or "invoked oom-killer", your system is running out of physical RAM.

Immediate Relief: Reduce Memory Allocation

Reduce memory consumption by tuning both MySQL and VoIPmonitor parameters.

1. Reduce MySQL Buffer Pool Size

Edit the MySQL configuration file (typically /etc/my.cnf.d/mysql-server.cnf or /etc/mysql/my.cnf for Debian/Ubuntu):

[mysqld]
# Reduce from 8GB to 6GB (adjust based on available RAM)
innodb_buffer_pool_size = 6G

A good starting point is innodb_buffer_pool_size = RAM * 0.5 - max_buffer_mem * 0.8. For example, on a 16 GB server with max_buffer_mem left at its 2000 MB (2 GB) default, this gives roughly 8 GB - 1.6 GB ≈ 6 GB for innodb_buffer_pool_size.

2. Reduce VoIPmonitor Buffer Memory

Edit /etc/voipmonitor.conf and decrease the max_buffer_mem value:

[general]
# Reduce from 8000 to 6000 (adjust based on available RAM)
max_buffer_mem = 6000

The max_buffer_mem parameter limits the maximum RAM allocation for the packet buffer. Typical values range from 2000-8000 MB depending on traffic volume and call rates.

3. Restart the affected services
systemctl restart mysqld
systemctl restart voipmonitor

Monitor the system to confirm stability.
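
Optionally, confirm the buffer pool size MySQL is actually using (the value is reported in bytes; add -u/-p credentials if your setup requires them):

mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"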

Long-term Solution: Increase RAM

For sustained production operation, increase the server's physical RAM:

  • Minimum: Add at least 16 GB of additional RAM to eliminate OOM conditions
  • Performance benefit: After the RAM upgrade, you can safely increase innodb_buffer_pool_size to improve MySQL performance
  • Recommended settings: Set innodb_buffer_pool_size to 50-70% of total RAM and max_buffer_mem based on your traffic requirements

See Sniffer Configuration for details on VoIPmonitor memory settings.

Emergency: Diagnosing System Hangs and Collecting Core Dump Evidence

When the VoIPmonitor system hangs, packet buffer (heap) spikes to 100%, and a single CPU core is pegged at 100%, you need to diagnose the issue and collect evidence for developer analysis before restarting.

1. Identify the Problematic Thread

Use the Manager API to identify which sniffer thread is consuming excessive CPU resources.

# Query thread statistics from the sensor
echo 'sniffer_threads' | nc <sensor_ip> 5029

Replace <sensor_ip> with the actual IP address of your VoIPmonitor sensor. Look for a thread showing approximately 100% CPU usage. This indicates the specific processing thread that is causing the hang.

2. Generate Core Dump for Developer Analysis

If a thread is pegged at 100% and the system needs to be analyzed by VoIPmonitor developers, generate a core dump before restarting:

1. Find the VoIPmonitor process ID (PID)
ps aux | grep voipmonitor | grep -v grep
2. Attach to the process with gdb and generate a core dump
gdb -p <PID_of_voipmonitor>
# Within gdb, generate the core dump
gcore <output_file>

# Example:
gdb -p 12345
(gdb) gcore /tmp/voipmonitor_hang.core

The core dump file provides developers with a complete snapshot of the process state at the moment of the hang, including memory, registers, and stack traces.

3. Detach from gdb and quit
(gdb) detach
(gdb) quit
3. Restore Service and Collect Evidence

After collecting the diagnostic evidence, restart the service to restore operation:

systemctl restart voipmonitor

Provide the following files to VoIPmonitor support for analysis:

  • Core dump file (from gcore command)
  • Thread statistics output (from sniffer_threads command)
  • Performance logs (/var/log/syslog showing the hang period)
  • Configuration file (/etc/voipmonitor.conf)

Important: Core dump files can be very large (several GB depending on max_buffer_mem). Ensure you have sufficient disk space and consider compressing the file before transferring it to support.
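
A sketch of compressing the dump before uploading it, assuming gzip is available (zstd or xz work equally well):

# Compress the core dump in place and check the resulting size
gzip -9 /tmp/voipmonitor_hang.core
ls -lh /tmp/voipmonitor_hang.core.gz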

Emergency: System Freezes on Every Update Attempt

If the VoIPmonitor sensor becomes unresponsive or hangs each time you attempt to update it through the Web GUI:

1. SSH into the sensor host
2. Execute the following commands to forcefully stop and restart
killall voipmonitor
systemctl stop voipmonitor
systemctl start voipmonitor

This sequence ensures zombie processes are terminated, systemd is fully stopped, and a clean service restart occurs. Verify the sensor status in the GUI to confirm it is responding correctly.

Emergency: Binary Not Found After Crash

If the VoIPmonitor service fails to start after a crash with error "Binary not found" for /usr/local/sbin/voipmonitor:

1. Check for a renamed binary
ls -l /usr/local/sbin/voipmonitor_*

The crash recovery process may have renamed the binary with an underscore suffix.

2. If found, rename it back
mv /usr/local/sbin/voipmonitor_ /usr/local/sbin/voipmonitor
3. Restart the service
systemctl start voipmonitor
systemctl status voipmonitor
Verify the service starts correctly.

Out-of-Band Management Scenarios

When the system is completely unresponsive and cannot be accessed via SSH:

  • Use your server's out-of-band management system:
 * Dell iDRAC
 * HP iLO
 * Supermicro IPMI
 * Other vendor-specific BMC/management tools
  • Actions available via OBM:
 * Access virtual console (KVM-over-IP)
 * Send NMI (Non-Maskable Interrupt) for system dump
 * Force power cycle
 * Monitor hardware health

See Sniffer Troubleshooting for more diagnostic procedures.

Emergency: Service Restart Loop with "packetbuffer: MEMORY IS FULL" and "Cannot bind to port"

If the VoIPmonitor service enters a restart loop, logging packetbuffer: MEMORY IS FULL and displaying Cannot bind to port [5029] errors, the issue can have multiple root causes. The "MEMORY IS FULL" error message is ambiguous and can indicate either RAM exhaustion or disk I/O bottleneck.

Critical: Distinguish Between RAM and Disk I/O Issues

The symptoms appear identical, but the root causes and solutions are different:

RAM-Based Memory Issue:
  • Cause: The memory buffer fills due to excessive concurrent calls or traffic floods
  • Solution: Increase max_buffer_mem, enable packetbuffer_compress, or limit concurrent calls

Disk I/O Performance Issue:
  • Cause: The memory buffer fills because the disk cannot write fast enough to drain it
  • Solution: Upgrade storage, move the spool to a faster disk, or resolve the I/O bottleneck

Step 0: Check Kernel Messages for Storage Errors (Critical First Step!)

Before investigating performance issues, check the kernel message buffer for storage hardware or filesystem errors. This is the first diagnostic step to distinguish between hardware/failure problems and performance bottlenecks.

1. Check kernel messages for storage errors
# Check the kernel message buffer for storage-related errors
dmesg -T | grep -i -E "i/o error|disk|storage|filesystem|ext4|xfs|nfs|scsi"
  • What to look for:
 * I/O errors (e.g., "Buffer I/O error", "critical medium error")
 * Filesystem errors (e.g., "EXT4-fs error", "XFS error")
 * NFS-specific errors (e.g., "NFS: server not responding", "NFS: device not ready")
 * SCSI/SATA errors (e.g., "Task abort", "Device failed")
 * ATA SMART errors indicating disk degradation
2. If kernel errors are present
    • This indicates a hardware or filesystem issue, not a performance bottleneck
    • Solutions depend on the specific error:
 * Replace failing disk hardware
 * Repair filesystem (fsck)
 * Resolve NFS connectivity issues (network, server availability)
 * Check RAID controller for failures
 * Fix underlying kernel/storage configuration issues
3. If kernel messages are clean (no errors)
    • Proceed to Step 1 below to investigate disk I/O performance bottlenecks

For more detailed kernel event investigation, use:

# View all recent kernel messages with timestamps
dmesg -T | tail -100

# Filter for time range (example: last 1 hour)
journalctl -k --since "1 hour ago"

Step 1: Check for Disk I/O Bottleneck (Performance Issue)

If dmesg -T shows no storage errors (Step 0), the issue is likely a performance bottleneck in the storage subsystem. Check for disk I/O problems on the spool directory (typically /var/spool/voipmonitor).

1. Monitor disk utilization with iostat
# Monitor disk I/O in real-time (1-second intervals)
iostat -x 1
  • What to look for: A value near 100% in the %util column indicates the disk is operating at maximum capacity
  • Symptoms: High %util, high await (average wait time), or high queue depth
2. Perform a write speed test to the spool directory
# Test sequential write speed (adjust count based on available disk space)
# Note: dd test uses O_DIRECT to bypass cache for accurate measurement
dd if=/dev/zero of=/var/spool/voipmonitor/testfile bs=1M count=1024 oflag=direct conv=fdatasync

# Clean up test file
rm /var/spool/voipmonitor/testfile
  • Interpretation: A very slow write speed (e.g., less than 50 MB/s on HDDs or significantly lower than expected SSD speed) confirms a storage bottleneck
  • For SSD/NVMe, expect 400+ MB/s sequential writes
  • For HDDs, expect 80-150 MB/s sequential writes (7200 RPM)
3. Check for I/O wait (Linux monitoring)
# Check if the system is spending significant time waiting for I/O
# High 'wa' (wait) percentage indicates disk bottleneck
top
# or
vmstat 1
  • Look for high %wa (I/O wait) in the CPU section

Step 2: Resolve Disk I/O Bottleneck

If disk I/O tests confirm the issue:

  • Option 1: Upgrade storage hardware
 * Move /var/spool/voipmonitor to a faster local SSD or NVMe drive
 * Consider RAID 10 for better performance and redundancy
 * If using NFS, move the spool to local storage instead of a network-mounted filesystem
  • Option 2: Tune storage configuration
 * Check if the disk is operating in degraded mode (RAID rebuild in progress)
 * Verify the storage controller firmware is up to date
 * Disable unnecessary monitoring or indexing (e.g., updatedb, antivirus scanning) on the spool directory
  • Option 2a: NFS Network Storage Performance
 If /var/spool/voipmonitor is mounted on NFS (a quick way to confirm the mount type is shown after this list):
 * Check network latency to the NFS server:
     # Ping test to NFS server
     ping -c 10 <nfs_server_ip>

     # Measure NFS-specific latency/mount stats
     # Requires nfsiostat from the nfs-utils package
     nfsiostat 1
 * Check NFS server response time and network congestion
 * Consider upgrading the network (e.g., 10GbE) for higher NFS throughput
 * Use TCP mount options for reliability (e.g., mount -t nfs -o tcp)
 * Verify the NFS server has sufficient disk I/O performance
 * If NFS is the bottleneck, move the spool directory to local SSD storage
  • Option 3: Move spool directory to faster volume
  # Stop service
  systemctl stop voipmonitor

  # Mount faster disk to /var/spool/voipmonitor
  # Or create symlink:
  mv /var/spool/voipmonitor /var/spool/voipmonitor.backup
  ln -s /path/to/fast/disk/voipmonitor /var/spool/voipmonitor

  # Restart service
  systemctl start voipmonitor
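
If you are unsure whether the spool directory is NFS-mounted (see Option 2a above), a quick check, assuming findmnt (util-linux) and nfsstat (nfs-utils) are installed:

# Show the filesystem that actually backs the spool directory
findmnt -T /var/spool/voipmonitor

# List NFS mounts together with their mount options
nfsstat -m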

For detailed disk performance benchmarking, see I/O Performance Measurement for advanced testing with fio and ioping.
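
For orientation only, a minimal fio random-write sketch against the spool directory; all parameters are illustrative and the linked page describes proper benchmarking methodology:

# 30-second direct random-write test with 64 KB blocks in the spool directory
fio --name=spooltest --directory=/var/spool/voipmonitor --rw=randwrite \
    --bs=64k --size=1G --direct=1 --numjobs=1 --runtime=30 --time_based

# Remove the test file created by fio
rm -f /var/spool/voipmonitor/spooltest*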

Step 3: Check for RAM-Based Memory Issue

If disk I/O is healthy but the error persists, the issue is RAM-based memory exhaustion.

1. Check RAM allocation
# Check current memory usage
free -h
2. Increase memory buffer limits

Edit /etc/voipmonitor.conf:

Recommended Values for "MEMORY IS FULL" Errors:
  • ringbuffer: For very high traffic (>200 Mbps) or severe packet loss scenarios, increase to 2000 MB (maximum allowed). Default is 50 MB; 500 MB is recommended for >100 Mbit traffic.
  • max_buffer_mem: For high concurrent call loads (5000+ calls) or persistent buffer issues, increase to 8000 MB. Default is 2000 MB; typical tuning is 4000 MB for moderate loads.
  • packetbuffer_compress: Enable if RAM is constrained (increases CPU usage to reduce memory footprint).
[general]
# HIGH TRAFFIC CONFIGURATION - Prevent "MEMORY IS FULL" errors
# Max ringbuffer for very high traffic / serious packet loss
ringbuffer = 2000

# Increase buffer memory for high concurrent call loads
max_buffer_mem = 8000

# Enable compression to save RAM at CPU cost
packetbuffer_compress = yes

# Optional: Limit concurrent calls to prevent overload
callslimit = 2000

Alternative: Moderate Traffic Configuration

[general]
# For moderate traffic (100-200 Mbit, 2000-5000 concurrent calls)
ringbuffer = 500
max_buffer_mem = 4000
packetbuffer_compress = yes
3. Restart and monitor
systemctl restart voipmonitor
journalctl -u voipmonitor -f
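
To confirm the buffer-pressure messages have stopped after the change, search the system log (the path assumes Debian/Ubuntu; use journalctl on journald-only systems):

# Look for recent buffer-pressure messages
grep -E "MEMORY IS FULL|HEAP FULL" /var/log/syslog | tail -20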

See Sniffer Configuration for more memory tuning options.

Step 4: Alternative Root Cause - Adaptive Jitterbuffer Overload

If the "packetbuffer: MEMORY IS FULL" and "HEAP FULL" errors occur even after adjusting max_buffer_mem, the issue may be caused by the adaptive jitterbuffer feature consuming excessive memory during processing. The adaptive jitterbuffer (which simulates jitter up to 500ms) is CPU and memory-intensive and can trigger heap exhaustion on high-traffic systems.

1. Check if jitterbuffer_adapt is enabled
# Check voipmonitor.conf for jitter buffer settings
grep jitterbuffer /etc/voipmonitor.conf

If jitterbuffer_adapt = yes is set, this feature may be causing the memory exhaustion.

2. Disable adaptive jitterbuffer

Edit /etc/voipmonitor.conf and set:

[general]
# Disable adaptive jitterbuffer to prevent memory/CPU exhaustion
jitterbuffer_adapt = no
3. Restart the service
systemctl restart voipmonitor
4. Verify the error is resolved
# Monitor for MEMORY IS FULL errors
journalctl -u voipmonitor -f

Important Trade-offs:

  • Disabling jitterbuffer_adapt removes the CPU/memory overhead but also disables MOS_adaptive score calculation
  • Fixed jitterbuffer modes (jitterbuffer_f1 for 50ms, jitterbuffer_f2 for 200ms) remain available and consume significantly fewer resources
  • If MOS quality scoring is required, consider using jitterbuffer_f2 = yes instead

This solution is particularly effective when the system crashes with both "MEMORY IS FULL" and "HEAP FULL" errors simultaneously, indicating the adaptive jitterbuffer heap is overflowing during real-time packet processing.
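
If MOS scoring is still needed, a minimal /etc/voipmonitor.conf sketch that keeps the fixed 200ms jitterbuffer while disabling the adaptive one, based on the parameters discussed above:

[general]
# Disable the CPU/memory-intensive adaptive jitterbuffer
jitterbuffer_adapt = no
# Keep the fixed 200ms jitterbuffer so quality scoring based on it remains available
jitterbuffer_f2 = yes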

Step 5: Clear Stale Port 5029 Bindings

The "Cannot bind to port [5029]" error occurs when a zombie process still holds the Manager API port. This prevents clean restarts.

# Force kill all VoIPmonitor processes
killall -9 voipmonitor

# Ensure service is stopped
systemctl stop voipmonitor

# Verify no processes are running
ps aux | grep voipmonitor

# Restart service
systemctl start voipmonitor

After clearing zombie processes and addressing the root cause (I/O or RAM), the service should start successfully without the bind error.

Related Documentation

For performance tuning and scaling guidance, see Scaling and Performance Tuning.

AI Summary for RAG

Summary: This article provides emergency procedures for recovering VoIPmonitor from critical failures. It covers steps to force-terminate runaway processes consuming excessive CPU (including kill -9 and systemctl commands), root cause analysis for CPU spikes (SIP REGISTER floods, pcapcommand, RTP threads, audio features), OOM memory exhaustion troubleshooting (checking dmesg for killed processes, reducing innodb_buffer_pool_size and max_buffer_mem), preventive measures (monitoring, anti-fraud auto-blocking, network edge protection), recovery procedures for system freezes during updates and binary issues after crashes, out-of-band management scenarios, and CRITICAL troubleshooting for service restart loop with "packetbuffer: MEMORY IS FULL" and "Cannot bind to port [5029]" errors. The MEMORY IS FULL error has multiple root causes: (1) Kernel storage errors (Step 0: check dmesg -T for I/O errors, filesystem errors, NFS errors, SCSI/SATA errors, SMART errors before investigating performance) or (2) Disk I/O performance bottleneck (Step 1: check iostat -x 1 for 100% utilization, test write speed with dd to /var/spool/voipmonitor with oflag=direct; Step 2: resolve by upgrading storage, moving spool, or for NFS check network latency with ping and nfsiostat) or (3) RAM-based memory exhaustion (Step 3: increase max_buffer_mem, enable packetbuffer_compress, ringbuffer, callslimit) or (4) Adaptive jitterbuffer overload (Step 4: check jitterbuffer settings with grep jitterbuffer /etc/voipmonitor.conf, disable jitterbuffer_adapt=no if enabled, which also disables MOS_adaptive scoring but keeps jitterbuffer_f1/f2 available). The "Cannot bind to port [5029]" error (Step 5) requires clearing zombie processes (killall -9 voipmonitor, systemctl stop voipmonitor). For NFS storage, use ping and nfsiostat to diagnose network latency.

Keywords: emergency recovery, high CPU, system unresponsive, runaway process, kill process, kill -9, systemctl, SIP REGISTER flood, pcapcommand, performance optimization, out-of-band management, iDRAC, iLO, IPMI, crash recovery, OOM, out of memory, memory exhaustion, dmesg -T, dmesg, kernel messages, storage errors, I/O errors, filesystem errors, ext4 errors, xfs errors, NFS errors, SCSI errors, SATA errors, SMART errors, innodb_buffer_pool_size, max_buffer_mem, MEMORY IS FULL, HEAP FULL, packetbuffer, disk I/O, I/O bottleneck, iostat -x 1, iostat, disk utilization, %util, write speed test, dd oflag=direct, spool directory, SSD, NVMe, RAID, Cannot bind to port 5029, zombie process, Manager API port, port 5029, restart loop, storage performance, I/O wait, %wa, jitterbuffer, jitterbuffer_adapt, adaptive jitterbuffer, jitterbuffer_f1, jitterbuffer_f2, MOS_adaptive, CPU intensive, memory exhaustion, NFS, NFS latency, ping, nfsiostat, network storage, 10GbE, packetbuffer_compress, ringbuffer, callslimit, fsck

Key Questions:

  • What to do when VoIPmonitor consumes 3000% CPU or system becomes unresponsive?
  • How to forcefully terminate a runaway VoIPmonitor process?
  • What are common causes of CPU spikes in VoIPmonitor?
  • How to mitigate SIP REGISTER flood attacks causing high CPU?
  • How to diagnose OOM (Out of Memory) events?
  • How to fix GUI and CLI frequently inaccessible due to memory exhaustion?
  • How to reduce memory usage of MySQL and VoIPmonitor?
  • What is max_buffer_mem and how to configure it?
  • How to restart VoIPmonitor service after a crash?
  • What to do if service binary is not found after crash?
  • How to prevent VoIPmonitor from freezing during GUI updates?
  • What tools can help diagnose VoIPmonitor performance issues?
  • What causes "packetbuffer: MEMORY IS FULL" error message?
  • How to distinguish between RAM exhaustion and disk I/O bottleneck?
  • What is the first diagnostic step for "MEMORY IS FULL" errors?
  • How to use dmesg -T to check for storage errors?
  • What type of errors to look for in dmesg when MEMORY IS FULL occurs?
  • How to check for I/O errors, filesystem errors, NFS errors in kernel messages?
  • What to do if kernel dmesg shows storage errors vs no errors?
  • How to check for disk I/O performance issues causing restart loops?
  • How to use iostat to diagnose disk utilization?
  • How to perform write speed test to /var/spool/voipmonitor directory?
  • What does "Cannot bind to port [5029]" error mean?
  • How to clear zombie processes holding port 5029?
  • How to resolve disk I/O bottleneck for VoIPmonitor?
  • How to move spool directory to faster storage?
  • What is the correct dd command to test disk write speed?
  • What causes "HEAP FULL" errors in VoIPmonitor?
  • How is jitterbuffer_adapt related to MEMORY IS FULL errors?
  • What is the solution for MEMORY IS FULL + HEAP FULL crashes caused by jitterbuffer_adapt?
  • Why should I disable jitterbuffer_adapt?
  • What happens when I set jitterbuffer_adapt = no?
  • What is the trade-off when disabling jitterbuffer_adapt?
  • Can I still use jitterbuffer_f1 and jitterbuffer_f2 with jitterbuffer_adapt disabled?
  • How to check NFS network latency causing MEMORY IS FULL?
  • What tools to use for NFS diagnostics (ping, nfsiostat)?
  • How to improve NFS storage performance for VoIPmonitor?