{{DISPLAYTITLE:Emergency Procedures & System Recovery}}

'''This guide covers emergency procedures for recovering your VoIPmonitor system from critical failures, including runaway processes, high CPU usage, and system unresponsiveness.'''

== Emergency: "Too High Load" Error in Live Sniffer ==

When the Live Sniffer or the sniffer service stops with a "too high load" error message, this is an intentional '''protection feature''', not a crash. The monitor proactively shuts down to maintain data integrity when the system load is too high to process packets reliably.

=== Understanding the "Too High Load" Message ===

The "too high load" termination is a '''safety feature''' designed to:
* Prevent corrupted packet capture under excessive system load
* Maintain data integrity by stopping before data loss occurs
* Protect against system instability

This is NOT a crash or bug - it is the expected behavior when the system cannot keep up with the traffic load.
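
To confirm that a shutdown was triggered by this protection rather than by a crash, search the service logs for the load message. The exact wording can vary between versions, so treat the pattern below as a sketch:

<syntaxhighlight lang="bash">
# Search the sniffer journal for load-protection messages (adjust the pattern to your version's wording)
journalctl -u voipmonitor --since "2 hours ago" | grep -i "too high load"

# On syslog-based systems
grep -i "too high load" /var/log/syslog
</syntaxhighlight>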

=== Step 1: Assess True CPU Load with htop ===

Overall "system CPU usage" can be misleading because it averages across all cores. To understand the actual load:

;Use htop to view per-thread and per-core CPU usage:
<syntaxhighlight lang="bash">
htop
</syntaxhighlight>

* In htop, view CPU load for individual threads/cores, not just the average
* Load is displayed per CPU core (each bar represents one core)
* A single core running at 100% can appear as 25% overall on a 4-core system
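
If you prefer a non-interactive view, <code>mpstat</code> from the <code>sysstat</code> package (assumed to be installed) prints per-core utilization that can be captured in a log; a single saturated core stands out immediately:

<syntaxhighlight lang="bash">
# Per-core CPU utilization, 5 samples at 1-second intervals (requires the sysstat package)
mpstat -P ALL 1 5
# One core near 100% with a low "all" average confirms a single-threaded bottleneck
</syntaxhighlight>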

=== Step 2: Compare Load Average Against Core Count ===

Understanding load average (LA) is critical:

* Load average represents the average number of processes waiting for CPU time
* Compare the load average against your total number of CPU cores
* Generally, a load average lower than the core count is acceptable

For example:
* 8-core system: Load average of 4.0 (50% capacity) is acceptable
* 4-core system: Load average of 8.0 (200% capacity) is overloaded
* 16-core system: Load average of 12.0 (75% capacity) is acceptable
 
Display load average:
<syntaxhighlight lang="bash">
# View current load average
uptime
 
# Or use top/htop (shown in summary line)
# Format: load average: 1 min, 5 min, 15 min
</syntaxhighlight>
 
=== Step 3: Identify the Bottleneck ===
 
If load average exceeds core count consistently:
 
* Check if a single process is pegging one core at 100% (htop will show this clearly)
* Verify if the issue affects multiple cores (system-wide overload)
* Review performance logs for thread-specific CPU usage
 
For system-wide overload solutions, see [[Scaling|Scaling and Performance Tuning]].
 
== Troubleshooting: Packet Loss on Sensors Due to CPU Overload ==
 
When packet loss occurs on specific sensors (31, 32, 33, or any others), the most common cause is CPU overload. The sensor cannot process packets fast enough, causing packets to drop.
 
=== Step 1: Check VoIPmonitor Logs for CPU Overload Indicators ===
 
Check the VoIPmonitor logs for specific signs of CPU overload:
 
;Check for Load Average (LA) and heap usage:
<syntaxhighlight lang="bash">
# Check recent VoIPmonitor logs for CPU/heap indicators
journalctl -u voipmonitor -n 200 | grep -E "Load Average|heap|MEMORY"
</syntaxhighlight>
 
;What to look for in the logs:
* '''High Load Average (LA)''': Values exceeding the number of CPU cores (e.g., LA of 8.0 on a 4-core system)
* '''Heap usage approaching 100%''': Indicates packet buffer is filling up
* '''"MEMORY IS FULL"''': Critical memory exhaustion condition
 
<syntaxhighlight lang="bash">
# Check current system Load Average directly
uptime
# Output example: load average: 3.50, 3.20, 2.90
# Compare this value to your number of CPU cores
 
# Alternative: Real-time monitoring
htop
# Look at the "Load average" line (typically displayed at top)
</syntaxhighlight>
 
=== Step 2: Interpret Load Average Correctly ===
 
Load Average is NOT a percentage. It represents the average number of processes waiting for CPU time.
 
{| class="wikitable" style="background:#e8f4f8; border:1px solid #4A90E2;"
|-
|-
! colspan="2" style="background:#4A90E2; color: white;" | Load Average Interpretation Guide
! Metric !! What to Look For !! Indicates
|-
|-
| ! CPU Cores
| '''SQL Cache''' || Consistently increasing during peak hours, never decreasing || Database cannot keep up with insert rate
| ! Acceptable Load Average Range
|-
|-
| 4 cores || Below 4.0 (ideal: 0.70 - 2.0)
| '''SQL Cache Files''' || Growing over time during peak usage || Database buffer pool too small or storage too slow
|-
|-
| 8 cores || Below 8.0 (ideal: 1.4 - 4.0)
| '''CPU Load (mysqld)''' || Near 100% during peak hours || CPU bottleneck on database server
|-
|-
| 16 cores || Below 16.0 (ideal: 2.8 - 8.0)
| '''Disk I/O (mysql)''' || High or saturated during peak hours || Storage bottleneck (magnetic disks instead of SSDs)
|-
| 32 cores || Below 32.0 (ideal: 5.6 - 16.0)
|}
|}


* '''Load Average < Core Count''': System has idle capacity - acceptable
If you see SQL cache or SQL cache files growing consistently during peak traffic periods, the database server is the bottleneck.
* '''Load Average exceeding Core Count''': System is overloaded - packets may be dropping
* '''Load Average 2x+ Core Count''': Severe overload - immediate action required
 
'''Example:'''
* 8-core system with Load Average of 4.0 = 50% utilized (acceptable)
* 4-core system with Load Average of 8.0 = 200% overloaded (packet loss likely)
* 16-core system with Load Average of 15.0 = 94% utilized (near capacity, upgrade recommended)
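
As a quick numeric check, the snippet below (a sketch using only standard tools) prints the 1-minute load average next to the core count so the comparison is explicit:

<syntaxhighlight lang="bash">
# Print core count and 1-minute load average side by side
echo "cores: $(nproc)  load(1m): $(cut -d' ' -f1 /proc/loadavg)"

# Flag overload when the 1-minute load exceeds the number of cores
awk -v cores="$(nproc)" '{ print (($1 > cores) ? "OVERLOADED" : "OK") }' /proc/loadavg
</syntaxhighlight>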
 
=== Step 3: Identify Which Sensor is Affected ===
 
;1. Check each sensor individually:
<syntaxhighlight lang="bash">
# On each sensor (31, 32, 33, etc.), check load
uptime
 
# Check VoIPmonitor service status
systemctl status voipmonitor
 
# Check for memory issues
journalctl -u voipmonitor -n 100 | grep -i "memory\|heap"
</syntaxhighlight>
 
;2. Use the Management API to check packet drop counters:
<syntaxhighlight lang="bash">
echo 'sniffer_stat' | nc <sensor_ip> 5029 | jq '.packets_dropped'
</syntaxhighlight>
 
Look for a non-zero value in <code>packets_dropped</code>, which indicates internal packet loss on that sensor.
 
=== Step 4: Resolve CPU Overload by Adding CPU Cores ===
 
The permanent solution for CPU overload is to increase the sensor's CPU capacity.


;On physical servers:
* Add more physical CPU cores
* Upgrade to a higher-spec processor

;On virtual machines (VMware, KVM, Proxmox):
<syntaxhighlight lang="bash">
# Check current CPU allocation
lscpu | grep "^CPU(s)"

# Add vCPUs through the hypervisor management interface,
# then verify the additional cores are visible:
lscpu | grep "^CPU(s)"
</syntaxhighlight>


=== Step 5: Monitor After Upgrade ===

After adding CPU cores, monitor the system to confirm the issue is resolved:

;1. Verify Load Average is within acceptable limits:
<syntaxhighlight lang="bash">
# Monitor for several minutes or hours
watch -n 10 "uptime"
</syntaxhighlight>


;2. Verify packet loss has ceased:
<syntaxhighlight lang="bash">
# Check "packets_dropped" via the Management API
echo 'sniffer_stat' | nc <sensor_ip> 5029 | jq '.packets_dropped'

# Should show: 0 or a stable low value
</syntaxhighlight>
 
;3. Monitor VoIPmonitor logs for recurrence:
<syntaxhighlight lang="bash">
# Check for new MEMORY IS FULL or heap issues
journalctl -u voipmonitor -f | grep -i "memory\|heap"
</syntaxhighlight>
 
=== Temporary Mitigation (Before Hardware Upgrade) ===
 
If you cannot immediately add CPU cores, apply these temporary tuning options to reduce load:


{| class="wikitable"
{| class="wikitable"
|-
|-
! Configuration !! Description !! Typical Values
! Current RAM !! Recommended Upgrade !! Expected Impact
|-
| '''callslimit''' || Limit max concurrent calls || Reduces load by setting max: <code>callslimit = 2000</code>
|-
|-
| '''ringbuffer''' || Increase packet buffer memory || For high traffic: <code>ringbuffer = 500-2000</code> MB
| 32GB || 64GB or 128GB || Significantly reduces cache growth
|-
|-
| '''silencedetect''' || Disable CPU-intensive audio features || <code>silencedetect = no</code>
| 64GB || 128GB or 256GB || Handles much higher peak loads
|-
|-
| '''saveaudio''' || Skip audio conversion (if not needed) || Comment out <code>saveaudio</code> lines
| 128GB || 256GB || Suitable for large deployments
|}
|}


These are '''workarounds only'''. The proper solution is to add CPU cores to match your traffic volume.
After adding RAM, tune <code>innodb_buffer_pool_size</code> in your MySQL configuration:
 
See [[Sniffer_configuration|Sniffer Configuration]] and [[Scaling|Scaling and Performance Tuning]] for detailed tuning options.
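
For reference, a combined <code>/etc/voipmonitor.conf</code> sketch applying the mitigations from the table above; the values are illustrative and should be adapted to your traffic, and the service must be restarted after editing:

<syntaxhighlight lang="ini">
[general]
# Temporary load-reduction settings (illustrative values)
# Cap concurrent calls
callslimit = 2000
# Larger packet buffer for traffic bursts (MB)
ringbuffer = 1000
# Disable CPU-intensive silence detection
silencedetect = no
# Leave audio conversion disabled unless it is required
# saveaudio = wav
</syntaxhighlight>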
 
== Emergency: VoIPmonitor Process Consuming Excessive CPU or System Unresponsive ==
 
When a VoIPmonitor process consumes excessive CPU (e.g., ~3000% or more) or causes the entire system to become unresponsive, follow these immediate steps:
 
=== Immediate Action: Force-Terminate Runaway Process ===
 
If the system is still minimally responsive via SSH or requires out-of-band management (iDRAC, IPMI, console):
 
;1. Identify the Process ID (PID):
<syntaxhighlight lang="bash">
# Using htop (if available)
htop
 
# Or using ps
ps aux | grep voipmonitor
</syntaxhighlight>
 
Look for the voipmonitor process consuming the most CPU resources. Note down the PID (process ID number).
 
;2. Forcefully terminate the process:
<syntaxhighlight lang="bash">
kill -9 <PID>
</syntaxhighlight>
 
Replace <PID> with the actual process ID number identified in step 1.
 
;3. Verify system recovery:
<syntaxhighlight lang="bash">
# Check CPU usage has returned to normal
top
 
# Check if the process was terminated
ps aux | grep voipmonitor
</syntaxhighlight>
 
The system should become responsive again immediately after the process is killed. CPU utilization should drop significantly.
 
=== Optional: Stop and Restart the Service (for persistent issues) ===
 
If the problem persists or the service needs to be cleanly restarted:
 
<syntaxhighlight lang="bash">
# Stop the voipmonitor service
systemctl stop voipmonitor
 
# Verify no zombie processes remaining
killall voipmonitor
 
# Restart the service
systemctl start voipmonitor
 
# Verify service status
systemctl status voipmonitor
</syntaxhighlight>
 
'''Caution:''' When using <code>systemd</code> service management, avoid using the deprecated <code>service</code> command as it can cause systemd to lose track of the daemon. Always use <code>systemctl</code> commands or direct process commands like <code>killall</code>.
 
=== Root Cause Analysis: Why Did the CPU Spike? ===
 
After recovering the system, investigate the root cause to prevent recurrence. Common causes include:
 
;SIP REGISTER Flood / Spamming Attack
Massive volumes of SIP REGISTER messages from malicious IPs can overwhelm the VoIPmonitor process.
 
* '''Detection:''' Check recent alert triggers in the VoIPmonitor GUI > Alerts > Sent Alerts for SIP REGISTER flood alerts
* '''Immediate mitigation:''' Block attacker IPs at the network edge (SBC, firewall, iptables)
* '''Long-term prevention:''' Configure anti-fraud rules with custom scripts to auto-block, see [[Anti-fraud#SIP REGISTER Flood/Attack|SIP REGISTER Flood Mitigation]]
 
;Packet Capture Overload (pcapcommand)
The <code>pcapcommand</code> feature forks a program for ''every'' call, which can generate up to 500,000 interrupts per second.
 
* '''Detection:''' Check <code>/etc/voipmonitor.conf</code> for a <code>pcapcommand</code> line
* '''Immediate fix:''' Comment out or remove the <code>pcapcommand</code> directive and restart the service
* '''Alternative:''' Use the built-in cleaning spool functionality (<code>maxpoolsize</code>, <code>cleanspool</code>) instead
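
A quick way to check for the directive (a plain grep, nothing VoIPmonitor-specific):

<syntaxhighlight lang="bash">
# Show any active pcapcommand directive with its line number
grep -n "^[[:space:]]*pcapcommand" /etc/voipmonitor.conf
</syntaxhighlight>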
 
;Excessive RTP Processing Threads
High concurrent call volumes can overload RTP processing threads.
 
* '''Detection:''' Check performance logs for high <code>tRTP_CPU</code> values (sum of all RTP threads)
* '''Mitigation:'''
  <pre>callslimit = 2000  # Limit max concurrent calls</pre>
 
;Audio Feature Overhead
Silence detection and audio conversion are CPU-intensive operations.
 
* '''Detection:''' Check if <code>silencedetect</code> or <code>saveaudio</code> are enabled
* '''Mitigation:'''
  <pre>
  silencedetect = no
  # saveaudio = wav  # Comment out if not needed
  </pre>
 
See [[Scaling|Scaling and Performance Tuning]] for detailed performance optimization strategies.
 
=== Preventive Measures ===
 
Once the root cause is identified, implement these preventive configurations:
 
;Monitor CPU Trends:
Use [[Collectd_installation|collectd]] or your existing monitoring system to track CPU usage over time and receive alerts before critical thresholds are reached.
 
;Anti-Fraud Auto-Blocking:
Configure [[Anti-fraud|Anti-Fraud rules]] with custom scripts to automatically block attacker IPs when a flood is detected. See the [[Anti-fraud|Anti-Fraud documentation]] for PHP script examples using iptables or ipset.
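
As an illustration only (not the Anti-Fraud module's own script), a blocking action can be as simple as an iptables or ipset rule; the address below is a documentation placeholder:

<syntaxhighlight lang="bash">
# Block a single attacker IP (placeholder address) with iptables
iptables -I INPUT -s 203.0.113.10 -j DROP

# Or maintain a set of blocked IPs with ipset (requires the ipset package)
ipset create blocked_sips hash:ip 2>/dev/null
ipset add blocked_sips 203.0.113.10
iptables -I INPUT -m set --match-set blocked_sips src -j DROP
</syntaxhighlight>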
 
;Network Edge Protection:
Block SIP REGISTER spam and floods at your network edge (SBC, firewall) before traffic reaches VoIPmonitor. This provides better performance and reduces CPU load on the monitoring system.
 
== Emergency: GUI and CLI Frequently Inaccessible Due to Memory Exhaustion ==
 
When the VoIPmonitor GUI and CLI become frequently inaccessible or the server becomes unresponsive due to Out of Memory (OOM) conditions, follow these steps to identify and resolve the issue.
 
=== Diagnose OOM Events ===
 
The Linux kernel out-of-memory (OOM) killer terminates processes when RAM is exhausted.
 
;Check the kernel ring buffer for OOM events:
<syntaxhighlight lang="bash">
dmesg -T | grep -i killed
</syntaxhighlight>
 
If you see messages like "Out of memory: Kill process" or "invoke-oom-killer", your system is running out of physical RAM.
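
To see which processes were consuming the most memory around the OOM event (standard tools, nothing VoIPmonitor-specific):

<syntaxhighlight lang="bash">
# Top memory consumers, sorted by resident memory usage
ps aux --sort=-%mem | head -n 10

# Overall memory and swap usage
free -h
</syntaxhighlight>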
 
=== Immediate Relief: Reduce Memory Allocation ===
 
Reduce memory consumption by tuning both MySQL and VoIPmonitor parameters.
 
;1. Reduce MySQL Buffer Pool Size:
 
Edit the MySQL configuration file (typically <code>/etc/my.cnf.d/mysql-server.cnf</code> or <code>/etc/mysql/my.cnf</code> for Debian/Ubuntu):


<syntaxhighlight lang="ini">
<syntaxhighlight lang="ini">
[mysqld]
# /etc/mysql/my.cnf
# Reduce from 8GB to 6GB (adjust based on available RAM)
# Set to 50-70% of total RAM on dedicated database server
innodb_buffer_pool_size = 6G
innodb_buffer_pool_size = 128G
</syntaxhighlight>
</syntaxhighlight>


A good starting point is <code>innodb_buffer_pool_size = RAM * 0.5 - max_buffer_mem * 0.8</code>. For example, on a 16GB server with 8GB allocated to max_buffer_mem, set innodb_buffer_pool_size to approximately 6GB.
For more tuning guidance, see [[Scaling#Optimizing_Database_Performance_.28MySQL.2FMariaDB.29|Scaling - Database Performance]].


;2. Reduce VoIPmonitor Buffer Memory:

Edit <code>/etc/voipmonitor.conf</code> and decrease the <code>max_buffer_mem</code> value:

<syntaxhighlight lang="ini">
[general]
# Reduce from 8000 to 6000 (adjust based on available RAM)
max_buffer_mem = 6000
</syntaxhighlight>

The <code>max_buffer_mem</code> parameter limits the maximum RAM allocation for the packet buffer. Typical values range from 2000-8000 MB depending on traffic volume and call rates.
 
;3. Restart the affected services:
 
<syntaxhighlight lang="bash">
systemctl restart mysqld
systemctl restart voipmonitor
</syntaxhighlight>
 
Monitor the system to confirm stability.
 
=== Long-term Solution: Increase RAM ===
 
For sustained production operation, increase the server's physical RAM:
 
* '''Minimum''': Add at least 16 GB of additional RAM to eliminate OOM conditions
* '''Performance benefit''': After the RAM upgrade, you can safely increase <code>innodb_buffer_pool_size</code> to improve MySQL performance
* '''Recommended settings''': Set <code>innodb_buffer_pool_size</code> to 50-70% of total RAM and <code>max_buffer_mem</code> based on your traffic requirements
 
See [[Sniffer_configuration#max_buffer_mem|Sniffer Configuration]] for details on VoIPmonitor memory settings.
 
== Emergency: Diagnosing System Hangs and Collecting Core Dump Evidence ==
 
When the VoIPmonitor system hangs, packet buffer (heap) spikes to 100%, and a single CPU core is pegged at 100%, you need to diagnose the issue and collect evidence for developer analysis before restarting.
 
=== Identify the Problematic Thread ===
 
Use the Manager API to identify which sniffer thread is consuming excessive CPU resources.
 
<syntaxhighlight lang="bash">
# Query thread statistics from the sensor
echo 'sniffer_threads' | nc <sensor_ip> 5029
</syntaxhighlight>
 
Replace <sensor_ip> with the actual IP address of your VoIPmonitor sensor. Look for a thread showing approximately 100% CPU usage. This indicates the specific processing thread that is causing the hang.
 
=== Generate Core Dump for Developer Analysis ===
 
If a thread is pegged at 100% and the system needs to be analyzed by VoIPmonitor developers, generate a core dump before restarting:
 
;1. Find the VoIPmonitor process ID (PID):
<syntaxhighlight lang="bash">
ps aux | grep voipmonitor | grep -v grep
</syntaxhighlight>
 
;2. Attach to the process with gdb and generate a core dump:
<syntaxhighlight lang="bash">
gdb -p <PID_of_voipmonitor>
# Within gdb, generate the core dump
gcore <output_file>
 
# Example:
gdb -p 12345
(gdb) gcore /tmp/voipmonitor_hang.core
</syntaxhighlight>
 
The core dump file provides developers with a complete snapshot of the process state at the moment of the hang, including memory, registers, and stack traces.
 
;3. Detach from gdb and quit:
<syntaxhighlight lang="bash"> detach
quit
</syntaxhighlight>
 
=== Restore Service and Collect Evidence ===
 
After collecting the diagnostic evidence, restart the service to restore operation:
 
<syntaxhighlight lang="bash">
systemctl restart voipmonitor
</syntaxhighlight>
 
Provide the following files to VoIPmonitor support for analysis:
 
* Core dump file (from gcore command)
* Thread statistics output (from sniffer_threads command)
* Performance logs (/var/log/syslog showing the hang period)
* Configuration file (/etc/voipmonitor.conf)
 
'''Important:''' Core dump files can be very large (several GB depending on max_buffer_mem). Ensure you have sufficient disk space and consider compressing the file before transferring it to support.
 
== Emergency: System Freezes on Every Update Attempt ==
 
If the VoIPmonitor sensor becomes unresponsive or hangs each time you attempt to update it through the Web GUI:
 
;1. SSH into the sensor host
;2. Execute the following commands to forcefully stop and restart:
<syntaxhighlight lang="bash">
killall voipmonitor
systemctl stop voipmonitor
systemctl start voipmonitor
</syntaxhighlight>
 
This sequence ensures that zombie processes are terminated, the service is fully stopped under systemd, and a clean restart occurs. Verify the sensor status in the GUI to confirm it is responding correctly.
 
== Emergency: Binary Not Found After Crash ==
 
If the VoIPmonitor service fails to start after a crash with error "Binary not found" for <code>/usr/local/sbin/voipmonitor</code>:
 
;1. Check for a renamed binary:
<syntaxhighlight lang="bash">
ls -l /usr/local/sbin/voipmonitor_*
</syntaxhighlight>
 
The crash recovery process may have renamed the binary with an underscore suffix.
 
;2. If found, rename it back:
<syntaxhighlight lang="bash">
mv /usr/local/sbin/voipmonitor_ /usr/local/sbin/voipmonitor
</syntaxhighlight>
 
;3. Restart the service:
<syntaxhighlight lang="bash">
systemctl start voipmonitor
systemctl status voipmonitor
</syntaxhighlight>
 
Verify the service starts correctly.
 
== Out-of-Band Management Scenarios ==
 
When the system is completely unresponsive and cannot be accessed via SSH:
 
* '''Use your server's out-of-band management system:'''
  * Dell iDRAC
  * HP iLO
  * Supermicro IPMI
  * Other vendor-specific BMC/management tools
 
* '''Actions available via OBM:'''
  * Access virtual console (KVM-over-IP)
  * Send NMI (Non-Maskable Interrupt) for system dump
  * Force power cycle
  * Monitor hardware health
 
See [[Sniffer_troubleshooting|Sniffer Troubleshooting]] for more diagnostic procedures.
 
== Emergency: Service Restart Loop with "packetbuffer: MEMORY IS FULL" and "Cannot bind to port" ==
 
If the VoIPmonitor service enters a restart loop, logging <code>packetbuffer: MEMORY IS FULL</code> and displaying <code>Cannot bind to port [5029]</code> errors, the issue can have '''multiple root causes'''. The "MEMORY IS FULL" error message is ambiguous and can indicate either RAM exhaustion or disk I/O bottleneck.
 
=== Critical: Distinguish Between RAM and Disk I/O Issues ===
 
The symptoms appear identical, but the root causes and solutions are different:
 
{|
|-
! style="background:#ffc107;" | RAM-Based Memory Issue
! style="background:#ffc107;" | Disk I/O Performance Issue
! style="background:#ffc107;" | Network Throughput Bottleneck
|-
| Memory buffer fills due to excessive concurrent calls or traffic floods
| Memory buffer fills because disk cannot write fast enough to drain it
| Probe fills packet buffer while sending packets to central server (insufficient network bandwidth)
|-
| Solution: Increase <code>max_buffer_mem</code>, enable <code>packetbuffer_compress</code>, or limit concurrent calls
| Solution: Upgrade storage, move spool to faster disk, or resolve I/O bottleneck
| Solution: Switch to Local Processing mode (<code>packetbuffer_sender=no</code>) or use <code>packetbuffer_compress=yes</code> to reduce network traffic. See Step 6 below.
|}
 
=== Step 0: Check Kernel Messages for Storage Errors (Critical First Step!) ===
 
Before investigating performance issues, check the kernel message buffer for storage hardware or filesystem errors. This is the '''first diagnostic step''' to distinguish between hardware/failure problems and performance bottlenecks.
 
;1. Check kernel messages for storage errors:
<syntaxhighlight lang="bash">
# Check the kernel message buffer for storage-related errors
dmesg -T | grep -i -E "i/o error|disk|storage|filesystem|ext4|xfs|nfs|scsi"
</syntaxhighlight>
 
* '''What to look for:'''
  * I/O errors (e.g., "Buffer I/O error", "critical medium error")
  * Filesystem errors (e.g., "EXT4-fs error", "XFS error")
  * NFS-specific errors (e.g., "NFS: server not responding", "NFS: device not ready")
  * SCSI/SATA errors (e.g., "Task abort", "Device failed")
  * ATA SMART errors indicating disk degradation
 
;2. If kernel errors are present:
** This indicates a hardware or filesystem issue, not a performance bottleneck
** Solutions depend on the specific error:
  * Replace failing disk hardware
  * Repair filesystem (fsck)
  * Resolve NFS connectivity issues (network, server availability)
  * Check RAID controller for failures
  * Fix underlying kernel/storage configuration issues
 
;3. If kernel messages are clean (no errors):
** Proceed to '''Step 1''' below to investigate disk I/O performance bottlenecks
 
For more detailed kernel event investigation, use:
<syntaxhighlight lang="bash">
# View all recent kernel messages with timestamps
dmesg -T | tail -100
 
# Filter for time range (example: last 1 hour)
journalctl -k --since "1 hour ago"
</syntaxhighlight>
 
=== Step 1: Check for Disk I/O Bottleneck (Performance Issue) ===
 
If <code>dmesg -T</code> shows no storage errors (Step 0), the issue is likely a performance bottleneck in the storage subsystem. Check for disk I/O problems on the spool directory (typically <code>/var/spool/voipmonitor</code>).
 
;1. Monitor disk utilization with iostat:
<syntaxhighlight lang="bash">
# Monitor disk I/O in real-time (1-second intervals)
iostat -x 1
</syntaxhighlight>
* '''What to look for:''' A value near 100% in the <code>%util</code> column indicates the disk is operating at maximum capacity
* '''Symptoms:''' High %util, high await (average wait time), or high queue depth
 
;2. Perform a write speed test to the spool directory:
<syntaxhighlight lang="bash">
# Test sequential write speed (adjust count based on available disk space)
# Note: dd test uses O_DIRECT to bypass cache for accurate measurement
dd if=/dev/zero of=/var/spool/voipmonitor/testfile bs=1M count=1024 oflag=direct conv=fdatasync
 
# Clean up test file
rm /var/spool/voipmonitor/testfile
</syntaxhighlight>
* '''Interpretation:''' A very slow write speed (e.g., less than 50 MB/s on HDDs or significantly lower than expected SSD speed) confirms a storage bottleneck
* For SSD/NVMe, expect 400+ MB/s sequential writes
* For HDDs, expect 80-150 MB/s sequential writes (7200 RPM)
 
;3. Check for I/O wait (Linux monitoring):
<syntaxhighlight lang="bash">
# Check if the system is spending significant time waiting for I/O
# High 'wa' (wait) percentage indicates disk bottleneck
top
# or
vmstat 1
</syntaxhighlight>
* Look for high <code>%wa</code> (I/O wait) in the CPU section
 
=== Step 2: Resolve Disk I/O Bottleneck ===
 
If disk I/O tests confirm the issue:
 
* '''Option 1: Upgrade storage hardware'''
  ** Move <code>/var/spool/voipmonitor</code> to a faster local SSD or NVMe drive
  ** Consider RAID 10 for better performance and redundancy
  ** If using NFS, move spool to local storage instead of network-mounted filesystem
 
* '''Option 2: Tune storage configuration'''
  ** Check if the disk is operating in degraded mode (RAID rebuild in progress)
  ** Verify the storage controller firmware is up to date
  ** Disable unnecessary monitoring or indexing (e.g., updatedb, antivirus scanning) on the spool directory
 
* '''Option 2a: NFS Network Storage Performance'''
  If <code>/var/spool/voipmonitor</code> is mounted on NFS:
  ** Check network latency to NFS server:
    <syntaxhighlight lang="bash">
    # Ping test to NFS server
    ping -c 10 <nfs_server_ip>
 
    # Measure NFS-specific latency/mount stats
    # Requires nfsiostat from nfs-utils package
    nfsiostat 1
    </syntaxhighlight>
  ** Check NFS server response time and network congestion
  ** Consider upgrading network (e.g., 10GbE) for higher NFS throughput
  ** Use TCP mount options for reliability (e.g., <code>mount -t nfs -o tcp</code>)
  ** Verify NFS server has sufficient disk I/O performance
  ** If NFS is the bottleneck, move spool directory to local SSD storage
 
* '''Option 3: Move spool directory to faster volume'''
  <syntaxhighlight lang="bash">
  # Stop service
  systemctl stop voipmonitor
 
  # Mount faster disk to /var/spool/voipmonitor
  # Or create symlink:
  mv /var/spool/voipmonitor /var/spool/voipmonitor.backup
  ln -s /path/to/fast/disk/voipmonitor /var/spool/voipmonitor
 
  # Restart service
  systemctl start voipmonitor
  </syntaxhighlight>
 
For detailed disk performance benchmarking, see [[IO_Measurement|I/O Performance Measurement]] for advanced testing with <code>fio</code> and <code>ioping</code>.
 
=== Step 3: Check for RAM-Based Memory Issue ===
 
If disk I/O is healthy but the error persists, the issue is RAM-based memory exhaustion.
 
;1. Check RAM allocation:
<syntaxhighlight lang="bash">
# Check current memory usage
free -h
</syntaxhighlight>
 
;2. Increase memory buffer limits:
Edit <code>/etc/voipmonitor.conf</code>:
 
{| class="wikitable" style="background:#fff3cd; border:1px solid #ffc107;"
|-
|-
! colspan="2" style="background:#ffc107;" | Recommended Values for "MEMORY IS FULL" Errors
! Current Storage !! Recommended Upgrade !! Expected Speedup
|-
|-
| '''ringbuffer''' || For very high traffic (>200Mbps) or severe packet loss scenarios, increase to 2000 MB (maximum allowed). Default is 50 MB, recommended for >100Mbit traffic is 500 MB.
| 10K RPM SATA HDD || NVMe SSD array || 10-50x faster I/O
|-
|-
| '''max_buffer_mem''' || For high concurrent call loads (5000+ calls) or persistent buffer issues, increase to 8000 MB. Default is 2000 MB, typical tuning is 4000 MB for moderate loads.
| 10K RPM SAS HDD || Enterprise SSD (SAS/SATA) || 5-20x faster I/O
|-
|-
| '''packetbuffer_compress''' || Enable if RAM is constrained (increases CPU usage to reduce memory footprint).
| Older SSD || Modern NVMe (PCIe 4.0+) || 2-5x faster I/O
|}
|}


<syntaxhighlight lang="ini">
For high-traffic deployments, '''NVMe storage is recommended for the database host'''.
[general]
# HIGH TRAFFIC CONFIGURATION - Prevent "MEMORY IS FULL" errors
# Max ringbuffer for very high traffic traffic/serious packet loss
ringbuffer = 2000
 
# Increase buffer memory for high concurrent call loads
max_buffer_mem = 8000
 
# Enable compression to save RAM at CPU cost
packetbuffer_compress = yes
 
# Optional: Limit concurrent calls to prevent overload
callslimit = 2000
</syntaxhighlight>
 
'''Alternative: Moderate Traffic Configuration'''
<syntaxhighlight lang="ini">
[general]
# For moderate traffic (100-200 Mbit, 2000-5000 concurrent calls)
ringbuffer = 500
max_buffer_mem = 4000
packetbuffer_compress = yes
</syntaxhighlight>
 
;3. Restart and monitor:
<syntaxhighlight lang="bash">
systemctl restart voipmonitor
journalctl -u voipmonitor -f
</syntaxhighlight>
 
'''IMPORTANT: This guidance applies to RAM-based memory issues where the local server cannot process traffic fast enough. For distributed deployments where probes send packets to a central server, see Step 6 below - the solution is typically to switch to Local Processing mode, not to increase max_buffer_mem.'''


See [[Sniffer_configuration#max_buffer_mem|Sniffer Configuration]] for more memory tuning options.


=== Step 4: Alternative Root Cause - Adaptive Jitterbuffer Overload ===

If the "packetbuffer: MEMORY IS FULL" and "HEAP FULL" errors occur even after adjusting <code>max_buffer_mem</code>, the issue may be caused by the adaptive jitterbuffer feature consuming excessive memory during processing. The adaptive jitterbuffer (which simulates jitter up to 500ms) is CPU- and memory-intensive and can trigger heap exhaustion on high-traffic systems.

;1. Check if jitterbuffer_adapt is enabled:
<syntaxhighlight lang="bash">
# Check voipmonitor.conf for jitter buffer settings
grep jitterbuffer /etc/voipmonitor.conf
</syntaxhighlight>

If <code>jitterbuffer_adapt = yes</code> is set, this feature may be causing the memory exhaustion.

;2. Disable adaptive jitterbuffer:
Edit <code>/etc/voipmonitor.conf</code> and set:
<syntaxhighlight lang="ini">
[general]
# Disable adaptive jitterbuffer to prevent memory/CPU exhaustion
jitterbuffer_adapt = no
</syntaxhighlight>

;3. Restart the service:
<syntaxhighlight lang="bash">
systemctl restart voipmonitor
</syntaxhighlight>

;4. Verify the error is resolved:
<syntaxhighlight lang="bash">
# Monitor for MEMORY IS FULL errors
journalctl -u voipmonitor -f
</syntaxhighlight>

'''Important Trade-offs:'''

* Disabling <code>jitterbuffer_adapt</code> removes the CPU/memory overhead but also disables <code>MOS_adaptive</code> score calculation
* Fixed jitterbuffer modes (<code>jitterbuffer_f1</code> for 50ms, <code>jitterbuffer_f2</code> for 200ms) remain available and consume significantly fewer resources
* If MOS quality scoring is required, consider using <code>jitterbuffer_f2 = yes</code> instead

This solution is particularly effective when the system crashes with both "MEMORY IS FULL" and "HEAP FULL" errors simultaneously, indicating the adaptive jitterbuffer heap is overflowing during real-time packet processing.
 
=== Step 5: Clear Stale Port 5029 Bindings ===
 
The "Cannot bind to port [5029]" error occurs when a zombie process still holds the Manager API port. This prevents clean restarts.
 
<syntaxhighlight lang="bash">
# Force kill all VoIPmonitor processes
killall -9 voipmonitor
 
# Ensure service is stopped
systemctl stop voipmonitor
 
# Verify no processes are running
ps aux | grep voipmonitor
 
# Restart service
systemctl start voipmonitor
</syntaxhighlight>
 
After clearing zombie processes and addressing the root cause (I/O or RAM), the service should start successfully without the bind error.
 
=== Step 6: Network Throughput Bottleneck in Distributed Deployments ===
 
In distributed client-server mode (remote probes sending data to a central server), a different type of "MEMORY IS FULL" error can occur when the '''network throughput between probes and the central server''' becomes the bottleneck, particularly during peak traffic hours.


{| class="wikitable" style="background:#fff3cd; border:1px solid #ffc107;"
{| class="wikitable" style="background:#fff3cd; border:1px solid #ffc107;"
|-
|-
! colspan="2" style="background:#ffc107;" | Distributed Mode: Network Bottleneck Scenario
! colspan="2" style="background:#ffc107;" | Incorrect Solutions When Database is the Bottleneck
|-
| style="vertical-align: top;" | '''Symptoms:'''
| * "packetbuffer: MEMORY IS FULL" errors on remote probes<br>* Missing CDRs and significant delays in CDR display, especially during peak traffic<br>* VoIP Monitor server shows extremely high memory utilization (99%)<br>* Problems occur during peak hours when network traffic is highest
|-
| style="vertical-align: top;" | '''Root Cause:'''
| Using Packet Mirroring mode (<code>packetbuffer_sender=yes</code>) on probes with insufficient network bandwidth to the central server. The probe's packetbuffer fills because it cannot send raw packets fast enough over the network.
|-
| style="vertical-align: top;" | '''Solution:'''
| Switch to Local Processing mode (<code>packetbuffer_sender=no</code>) if probe hardware has sufficient CPU and disk resources.
|}
 
==== Identify Network Bottleneck vs Local Resource Issues ====
 
To confirm the bottleneck is network-related rather than local resource constraints:
 
;1. Check probe configuration:
<syntaxhighlight lang="bash">
# On remote probe - check current packetbuffer_sender setting
grep packetbuffer_sender /etc/voipmonitor.conf
</syntaxhighlight>
 
If <code>packetbuffer_sender = yes</code>, the probe may be sending raw packets to central server, requiring high network bandwidth.
 
;2. Verify probe local resources are healthy:
<syntaxhighlight lang="bash">
# Check if probe has sufficient free RAM
free -h
 
# Check if probe disk I/O is not a bottleneck
iostat -x 1
 
# Check CPU load
top
</syntaxhighlight>
 
If the probe has sufficient free RAM, healthy disk I/O, and reasonable CPU load, the bottleneck is likely network throughput.
 
;3. Check network throughput during peak hours:
Measure the actual network utilization between probe and central server during peak traffic:
<syntaxhighlight lang="bash">
# Monitor network interface throughput (e.g., eth0)
sar -n DEV 1
 
# or use iftop for real-time per-connection monitoring (if available)
iftop -i eth0
</syntaxhighlight>
 
If network utilization approaches or saturates the link capacity during peak traffic, the network is the bottleneck.
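
To verify the raw bandwidth available between a probe and the central server, a short <code>iperf3</code> test run outside peak hours is a simple sketch (assumes iperf3 is installed on both hosts and its default port 5201 is reachable):

<syntaxhighlight lang="bash">
# On the central server: start an iperf3 server
iperf3 -s

# On the remote probe: measure throughput towards the central server for 30 seconds
iperf3 -c <central_server_ip> -t 30
</syntaxhighlight>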
 
==== Solution: Switch to Local Processing Mode ====
 
If probes have sufficient CPU and disk resources, switching from Packet Mirroring to Local Processing mode eliminates network throughput issues.
 
{| class="wikitable" style="background:#e8f4f8; border:1px solid #4A90E2;"
|-
! colspan="2" style="background:#4A90E2; color: white;" | Impact of Switching to Local Processing Mode
|-
| colspan="2" | '''Before (<code>packetbuffer_sender=yes</code>):''' Probe sends raw packets over network to central server. Bandwidth requirement equals VoIP traffic volume. <br>'''After (<code>packetbuffer_sender=no</code>):''' Probe analyzes packets locally and sends only CDRs. Network traffic is minimal.
|-
| style="vertical-align: top;" | '''Pros:'''
| * Drastically reduces network load between probes and central server<br>* Eliminates MEMORY IS FULL errors caused by network bottlenecks<br>* CDRs appear promptly (no network delay)<br>* Scales better with multiple probes
|-
| style="vertical-align: top;" | '''Cons / Trade-offs:'''
| * Increases CPU and RAM usage on probes (they perform full analysis)<br>* PCAP files are stored on probes, not central server<br>* Delay when downloading PCAPs for replay (file must be transferred from remote probe on demand)<br>* Requires sufficient probe resources (CPU, disk, RAM)
|}
 
;Configuration change on remote probes:
 
Edit <code>/etc/voipmonitor.conf</code> on each affected probe:
<syntaxhighlight lang="ini">
[general]
# Switch from Packet Mirroring to Local Processing
packetbuffer_sender = no
 
# Ensure probe has MySQL credentials to write CDRs directly to database
# Or configure central server's server_bind to accept CDRs from probes
</syntaxhighlight>
 
;Restart the probe service:
<syntaxhighlight lang="bash">
systemctl restart voipmonitor
</syntaxhighlight>
 
==== Prerequisites for Local Processing Mode ====
 
Before switching to <code>packetbuffer_sender=no</code>, ensure probes meet these requirements:
 
{| class="wikitable"
|-
! Requirement !! Why It Matters
|-
|-
| '''Sufficient CPU''' || Probes perform full SIP/RTP analysis, which is CPU-intensive
| style="vertical-align: top;" | '''Reducing PHP memory_limit'''
| This does NOT fix the root cause. PHP waits for the database; less memory means processes crash sooner.
|-
|-
| '''Sufficient RAM''' || Probes need <code>max_buffer_mem</code> and <code>ringbuffer</code> resources for their traffic volume
| style="vertical-align: top;" | '''Tuning PHP-FPM worker counts'''
| More workers will pile up waiting for slow database queries, consuming even more memory.
|-
|-
| '''Fast local disk''' || PCAP files are stored on probes. Disk I/O performance affects capture reliability
| style="vertical-align: top;" | '''Reducing innodb_buffer_pool_size'''
| This makes the database slower, not faster. It causes more disk I/O and longer query times.
|-
|-
| '''Database connectivity''' || Probes write CDRs directly to MySQL/MariaDB (via configured database credentials) or send to central server via client-server protocol
| style="vertical-align: top;" | '''Adding RAM to the GUI server'''
| If the bottleneck is the database, adding RAM to the GUI won't help. The database is the limiting factor.
|}
|}


If probes lack sufficient CPU or disk resources, Local Processing mode may not be viable. In this case, consider:
* Upgrading probe hardware
* Improving network bandwidth (e.g., 10GbE)
* Reducing traffic volume per probe (add more probes)
 
==== Tuning max_buffer_mem for Network Throughput Bottlenecks ====
 
If you must continue using Packet Mirroring mode (<code>packetbuffer_sender=yes</code>) temporarily before switching to Local Processing, you can tune <code>max_buffer_mem</code> differently than for RAM-based memory issues:
 
{| class="wikitable" style="background:#fff3cd; border:1px solid #ffc107;"
|-
! colspan="2" style="background:#ffc107;" | max_buffer_mem Guidance by Bottleneck Type
|-
| style="vertical-align: top;" | '''RAM-Based Memory Issue (Step 3)'''
| Local server cannot process traffic fast enough.<br>'''Solution:''' INCREASE <code>max_buffer_mem</code> (e.g., 4000-8000 MB) to give more headroom
|-
| style="vertical-align: top;" | '''Network Bottleneck (distributed mode)'''
| Probe cannot send packets to central server fast enough.<br>'''Solution:''' DECREASE <code>max_buffer_mem</code> (e.g., from 8000 to 2000 MB) so the buffer fails faster without exhausting RAM, enabling quicker recovery from network congestion.
|}
 
Edit <code>/etc/voipmonitor.conf</code> on the probe:
<syntaxhighlight lang="ini">
[general]
# REDUCE max_buffer_mem for network throughput bottlenecks
# This causes the buffer to fail sooner (releasing memory) instead of
# continuing to occupy RAM waiting for a network connection that cannot keep up
max_buffer_mem = 2000
 
# Enable compression to reduce network traffic
packetbuffer_compress = yes
</syntaxhighlight>
 
Restart the probe service:
<syntaxhighlight lang="bash">
systemctl restart voipmonitor
</syntaxhighlight>
 
For detailed documentation on distributed architecture configuration, see [[Sniffer_distributed_architecture|Distributed Architecture: Client-Server Mode]].


=== Related Issues ===

For performance tuning and scaling guidance, see:
* [[Scaling|Scaling and Performance Tuning Guide]]
* [[IO_Measurement|I/O Performance Measurement]]
* [[High-Performance_VoIPmonitor_and_MySQL_Setup_Manual|High-Performance Setup]]

= Diagnosing Database Bottlenecks Using Sensor RRD Charts =

If your VoIPmonitor GUI becomes unresponsive or PHP processes are being terminated by the OOM (Out of Memory) killer, the root cause may be a '''database performance bottleneck''', not a PHP configuration issue.

This guide explains how to use the sensor's RRD (Round-Robin Database) charts to identify whether the database server is the limiting factor.

== Symptoms of Database Bottlenecks Affecting the GUI ==

* GUI becomes extremely slow or unresponsive during peak hours
* PHP processes are killed by the OOM killer on the GUI server
* Dashboard and CDR views take a long time to load
* Alerts and reports fail during high traffic periods
* System appears fine during off-peak hours but degrades during peak usage

'''Note:''' These symptoms often occur when the GUI server is waiting for database queries to complete, causing PHP processes to pile up and consume excessive memory.

== Understanding Sensor RRD Charts ==

The VoIPmonitor sensor generates performance charts (RRD files) that track system metrics over time. These charts are accessible through the GUI and provide visual indicators of where bottlenecks are occurring.

To access sensor RRD charts:
# Navigate to '''Settings > Sensors''' in the GUI
# Click the graph icon next to the sensor
# Select the time range covering the problematic peak hours

== Diagnostic Flowchart ==

<kroki lang="mermaid">
flowchart TD
    A[GUI Unresponsive / OOM Errors] --> B{Check Sensor RRD Charts}
    B --> C{SQL Cache Growing<br/>During Peak Hours?}
    C -->|No| D[Issue is NOT<br/>database bottleneck]
    C -->|Yes| E[Database Bottleneck<br/>Confirmed]
    E --> F{Identify<br/>Bottleneck Type}
    F --> G{mysqld CPU<br/>near 100%?}
    G -->|Yes| H[CPU Bottleneck]
    H --> I[Add CPU cores<br/>or upgrade CPU]
    G -->|No| J{Buffer pool full?<br/>Swap usage?}
    J -->|Yes| K[Memory Bottleneck]
    K --> L[Add RAM<br/>Tune innodb_buffer_pool_size]
    J -->|No| M{High iowait?<br/>Magnetic disks?}
    M -->|Yes| N[Storage I/O Bottleneck]
    N --> O[Upgrade to SSD/NVMe]

    style A fill:#f9f,stroke:#333
    style E fill:#ff9,stroke:#333
    style H fill:#f96,stroke:#333
    style K fill:#f96,stroke:#333
    style N fill:#f96,stroke:#333
</kroki>

== Diagnostic Step 1: Look for Growing SQL Cache ==

The most critical indicator of a database bottleneck is a '''growing SQL cache''' or growing '''SQL cache files''' during peak hours.

{| class="wikitable"
|-
! Metric !! What to Look For !! Indicates
|-
| '''SQL Cache''' || Consistently increasing during peak hours, never decreasing || Database cannot keep up with the insert rate
|-
| '''SQL Cache Files''' || Growing over time during peak usage || Database buffer pool too small or storage too slow
|-
| '''CPU Load (mysqld)''' || Near 100% during peak hours || CPU bottleneck on the database server
|-
| '''Disk I/O (mysql)''' || High or saturated during peak hours || Storage bottleneck (magnetic disks instead of SSDs)
|}

If you see SQL cache or SQL cache files growing consistently during peak traffic periods, the database server is the bottleneck.

== Diagnostic Step 2: Determine the Bottleneck Type ==

After identifying that the database is the issue, determine which resource is the limiting factor:

=== CPU Bottleneck ===
* Check database CPU usage during peak hours
* If mysqld is at or near 100% CPU, you need more CPU cores or faster CPUs

=== Memory Bottleneck ===
* Check if the SQL cache grows because the buffer pool is too small
* The database runs out of RAM for caching, forcing disk reads
* The SQL cache chart shows a pattern of filling up and staying full

=== Storage I/O Bottleneck (Most Common) ===
* High disk I/O wait times for the mysqld process
* Disk latency (iowait) increases during peak hours
* Database storage on magnetic disks (e.g., 10K SAS) instead of SSD/NVMe
* SQL cache grows because data cannot be written/read fast enough

== Solutions for Database Performance Bottlenecks ==

=== Solution 1: Add More RAM to the Database Server ===

This is often the most effective fix for memory-related bottlenecks.

{| class="wikitable"
|-
! Current RAM !! Recommended Upgrade !! Expected Impact
|-
| 32GB || 64GB or 128GB || Significantly reduces cache growth
|-
| 64GB || 128GB or 256GB || Handles much higher peak loads
|-
| 128GB || 256GB || Suitable for large deployments
|}

After adding RAM, tune <code>innodb_buffer_pool_size</code> in your MySQL configuration (for example <code>/etc/mysql/my.cnf</code>): on a dedicated database server, set it to 50-70% of total RAM, e.g. <code>innodb_buffer_pool_size = 128G</code>. For more tuning guidance, see [[Scaling#Optimizing_Database_Performance_.28MySQL.2FMariaDB.29|Scaling - Database Performance]].

'''Warning:''' Do NOT reduce <code>innodb_buffer_pool_size</code> on the GUI server when the database is the bottleneck. This will make the problem worse.

=== Solution 2: Upgrade Database Storage to SSD/NVMe ===

If your database storage is on magnetic disks (e.g., 10K SATA or SAS), upgrading to SSDs is often the single most effective improvement.

{| class="wikitable"
|-
! Current Storage !! Recommended Upgrade !! Expected Speedup
|-
| 10K RPM SATA HDD || NVMe SSD array || 10-50x faster I/O
|-
| 10K RPM SAS HDD || Enterprise SSD (SAS/SATA) || 5-20x faster I/O
|-
| Older SSD || Modern NVMe (PCIe 4.0+) || 2-5x faster I/O
|}

For high-traffic deployments, '''NVMe storage is recommended for the database host'''.

See [[Hardware#Database_Storage|Hardware - Storage Selection]] for detailed recommendations.

=== Solution 3: Temporary Mitigation - Schedule Alerts/Reports Outside Peak Hours ===

If you cannot immediately upgrade the database server hardware, temporarily reduce the load by scheduling intensive tasks during off-peak hours.

'''1. Disable or reduce alert frequency''' during peak hours:
* Navigate to '''GUI > Alerts'''
* Temporarily disable high-frequency alerts
* Set alerts to run during off-peak periods (e.g., 2am-4am)

'''2. Schedule reports outside peak usage:'''
* Navigate to '''GUI > Reports'''
* Configure scheduled reports for off-peak hours
* Avoid generating reports during the busiest part of the day

'''3. Reduce dashboard complexity''' during peak hours:
* Simplify dashboards that query large ranges of data
* Avoid "All time" statistics during peak loads
* Use cached dashboards or static displays when possible

=== Solution 4: Consider Component Separation ===

If the database server is a bottleneck and upgrading is not feasible, consider moving to a dedicated database architecture.

In a component separation deployment (see [[Scaling#Scaling_Through_Component_Separation|Scaling - Component Separation]]):
* '''Host 1:''' Dedicated database server with maximum RAM and SSD/NVMe storage
* '''Host 2:''' GUI web server
* '''Host 3:''' Sensor(s)

This allows you to independently scale the database with more powerful hardware without affecting the GUI.

== Common Pitfalls to Avoid ==

{| class="wikitable" style="background:#fff3cd; border:1px solid #ffc107;"
|-
! colspan="2" style="background:#ffc107;" | Incorrect Solutions When Database is the Bottleneck
|-
| style="vertical-align: top;" | '''Reducing PHP memory_limit'''
| This does NOT fix the root cause. PHP waits for the database; less memory means processes crash sooner.
|-
| style="vertical-align: top;" | '''Tuning PHP-FPM worker counts'''
| More workers will pile up waiting for slow database queries, consuming even more memory.
|-
| style="vertical-align: top;" | '''Reducing innodb_buffer_pool_size'''
| This makes the database slower, not faster. It causes more disk I/O and longer query times.
|-
| style="vertical-align: top;" | '''Adding RAM to the GUI server'''
| If the bottleneck is the database, adding RAM to the GUI won't help. The database is the limiting factor.
|}

== Verification Checklist ==

After implementing a database upgrade to fix the bottleneck:

# Monitor the SQL cache charts during the next peak traffic period
# Check that the SQL cache does not grow uncontrollably
# Verify GUI responsiveness during peak hours
# Confirm no OOM killer events
# Check database query latency (slow queries should be minimal)


== Related Documentation ==

* [[Scaling|Scaling and Performance Tuning Guide]] - For performance optimization
* [[Scaling#Optimizing_Database_Performance_.28MySQL.2FMariaDB.29|Optimizing Database Performance]] - MySQL tuning parameters
* [[Scaling#Scaling_Through_Component_Separation|Component Separation]] - Dedicated database architecture
* [[Hardware]] - Hardware sizing recommendations for different deployment sizes
* [[Anti-fraud|Anti-Fraud Rules]] - For attack detection and mitigation
* [[Sniffer_troubleshooting|Sniffer Troubleshooting]] - For systematic diagnostic procedures
* [[High-Performance_VoIPmonitor_and_MySQL_Setup_Manual|High-Performance Setup]] - For optimizing high-traffic deployments
* [[Systemd_for_voipmonitor_service_management|Systemd Service Management]] - For service management best practices


== AI Summary for RAG ==


'''Summary:''' This article provides emergency procedures for recovering VoIPmonitor from critical failures. It covers the "too high load" error in Live Sniffer (an intentional protection feature, not a crash, designed to maintain data integrity by shutting down when system load is too high), steps to assess true CPU load using htop (view per-thread and per-core CPU usage, not just averages), how to compare load average (LA) against total CPU cores (load average lower than core count is generally acceptable), how to view load average with uptime or top/htop, and identifying bottlenecks when load exceeds core count. It also covers steps to force-terminate runaway processes consuming excessive CPU (including kill -9 and systemctl commands), root cause analysis for CPU spikes (SIP REGISTER floods, pcapcommand, RTP threads, audio features), OOM memory exhaustion troubleshooting (checking dmesg for killed processes, reducing innodb_buffer_pool_size and max_buffer_mem), preventive measures (monitoring, anti-fraud auto-blocking, network edge protection), recovery procedures for system freezes during updates and binary issues after crashes, out-of-band management scenarios, and CRITICAL troubleshooting for service restart loop with "packetbuffer: MEMORY IS FULL" and "Cannot bind to port [5029]" errors. The MEMORY IS FULL error has multiple root causes: (1) Kernel storage errors (Step 0: check dmesg -T for I/O errors, filesystem errors, NFS errors, SCSI/SATA errors, SMART errors before investigating performance) or (2) Disk I/O performance bottleneck (Step 1: check iostat -x 1 for 100% utilization, test write speed with dd to /var/spool/voipmonitor with oflag=direct; Step 2: resolve by upgrading storage, moving spool, or for NFS check network latency with ping and nfsiostat) or (3) RAM-based memory exhaustion (Step 3: increase max_buffer_mem, enable packetbuffer_compress, ringbuffer, callslimit) or (4) Adaptive jitterbuffer overload (Step 4: check jitterbuffer settings with grep jitterbuffer /etc/voipmonitor.conf, disable jitterbuffer_adapt=no if enabled, which also disables MOS_adaptive scoring but keeps jitterbuffer_f1/f2 available) or (5) Network throughput bottleneck in distributed deployments (Step 6: occurs when using Packet Mirroring mode packetbuffer_sender=yes on remote probes with insufficient network bandwidth to central server, causing packetbuffer to fill during peak traffic. Symptoms include MEMORY IS FULL errors, missing CDRs, and significant CDR display delays, especially during peak hours. Solution: Switch probes to Local Processing mode packetbuffer_sender=no if they have sufficient CPU and disk resources. This eliminates network bottleneck by having probes analyze packets locally and send only CDRs. Trade-offs include increased probe CPU/RAM usage, PCAPs stored on probes (delay for replay/download), and requires sufficient probe resources. Prerequisites: sufficient probe CPU for full SIP/RTP analysis, sufficient RAM for max_buffer_mem and ringbuffer, fast local disk for PCAP storage, database connectivity. If probes lack resources, consider upgrading probe hardware, improving network bandwidth (10GbE), or adding more probes to reduce traffic per probe). Check probe configuration with grep packetbuffer_sender, verify probe local resources (free -h, iostat -x 1, top), monitor network utilization with sar -n DEV or iftop). The "Cannot bind to port [5029]" error (Step 5) requires clearing zombie processes (killall -9 voipmonitor, systemctl stop voipmonitor). 
For NFS storage, use ping and nfsiostat to diagnose network latency. It also covers troubleshooting packet loss on specific sensors (31, 32, 33, or any others) due to CPU overload. Check VoIPmonitor logs for CPU overload indicators: journalctl -u voipmonitor -n 200 | grep -E "Load Average|heap|MEMORY" to look for high Load Average exceeding CPU cores (e.g., LA of 8.0 on a 4-core system), heap usage approaching 100%, or "MEMORY IS FULL". Check current load with uptime or htop. Load Average is NOT a percentage - it represents average number of processes waiting for CPU time. Acceptable range is below core count (4-core: below 4.0, 8-core: below 8.0, 16-core: below 16.0, 32-core: below 32.0). Load Average exceeding core count indicates overload with packet loss. Check each sensor: uptime, systemctl status voipmonitor, journalctl for memory/heap issues. Use sniffer_stat Management API: echo 'sniffer_stat' | nc <sensor_ip> 5029 | jq '.packets_dropped'. Non-zero packets_dropped indicates internal packet loss. Resolve CPU overload by adding CPU cores (physical or virtual with lscpu). Monitor after upgrade: watch -n 10 "uptime", verify packets_dropped is 0 or stable, monitor journalctl. Temporary mitigation before hardware upgrade: adjust configurations like callslimit=2000 (max calls), ringbuffer=500-2000 MB (packet buffer), silencedetect=no (disable CPU-intensive audio), and comment out saveaudio lines to skip audio conversion.
'''Summary (Diagnosing Database Bottlenecks Using Sensor RRD Charts):''' Guide for diagnosing database bottlenecks affecting VoIPmonitor GUI using sensor RRD charts. Symptoms: GUI unresponsive during peak hours, OOM killer terminating PHP processes, slow dashboard/CDR views. KEY DIAGNOSTIC: Check sensor RRD charts (Settings > Sensors > graph icon) for growing SQL cache during peak hours - primary indicator of database bottleneck. Bottleneck types: CPU (mysqld at 100%), Memory (buffer pool too small), Storage I/O (most common - high iowait, magnetic disks). Solutions: (1) Add RAM to database server and tune innodb_buffer_pool_size to 50-70% of RAM; (2) Upgrade storage from HDD to SSD/NVMe (10-50x speedup); (3) Schedule alerts/reports outside peak hours; (4) Component separation with dedicated database server. INCORRECT solutions: Do NOT reduce PHP memory_limit, do NOT tune PHP-FPM workers, do NOT reduce innodb_buffer_pool_size, do NOT add RAM to GUI server instead of database.


'''Keywords:''' emergency recovery, high CPU, system unresponsive, runaway process, kill process, kill -9, systemctl, SIP REGISTER flood, pcapcommand, performance optimization, out-of-band management, iDRAC, iLO, IPMI, crash recovery, OOM, out of memory, memory exhaustion, dmesg -T, dmesg, kernel messages, storage errors, I/O errors, filesystem errors, ext4 errors, xfs errors, NFS errors, SCSI errors, SATA errors, SMART errors, innodb_buffer_pool_size, max_buffer_mem, MEMORY IS FULL, HEAP FULL, packetbuffer, disk I/O, I/O bottleneck, iostat -x 1, iostat, disk utilization, %util, write speed test, dd oflag=direct, spool directory, SSD, NVMe, RAID, Cannot bind to port 5029, zombie process, Manager API port, port 5029, restart loop, storage performance, I/O wait, %wa, jitterbuffer, jitterbuffer_adapt, adaptive jitterbuffer, jitterbuffer_f1, jitterbuffer_f2, MOS_adaptive, CPU intensive, NFS, NFS latency, ping, nfsiostat, network storage, 10GbE, packetbuffer_compress, ringbuffer, callslimit, fsck, too high load, sniffer crash, Live Sniffer, htop, load average, LA, CPU cores, per-thread CPU monitoring, per-core CPU usage, data integrity, protection feature, distributed mode, client-server mode, remote probes, central server, network throughput, network bottleneck, packetbuffer_sender, packetbuffer_sender=yes, packetbuffer_sender=no, Local Processing, Packet Mirroring, peak traffic, missing CDRs, CDR delay, display delay, network utilization, sar -n DEV, iftop, probe resources, central server bottleneck, packet loss sensor, CPU overload sensor, heap usage, heap approaching 100%, sniffer_stat, packets_dropped
'''Keywords (Diagnosing Database Bottlenecks Using Sensor RRD Charts):''' database bottleneck, RRD charts, sensor performance, SQL cache, SQL cache files, peak hours, OOM killer, GUI unresponsive, dashboard slow, RAM upgrade, SSD upgrade, NVMe, iowait, innodb_buffer_pool_size, component separation, dedicated database


'''Key Questions:'''
* Packet loss occurring on specific sensors (31, 32, 33, or others) - how to troubleshoot?
* How do I diagnose database bottlenecks in VoIPmonitor?
* How to check VoIPmonitor logs for CPU overload indicators on sensors?
* What do growing SQL cache files in RRD charts indicate?
* How to check VoIPmonitor logs for Load Average and heap usage approaching 100%?
* Why is my VoIPmonitor GUI slow during peak hours?
* What is Load Average and how to interpret it for VoIPmonitor sensors?
* How to fix OOM killer terminating PHP processes?
* Is Load Average a percentage or absolute value?
* Should I upgrade RAM on GUI server or database server?
* What is acceptable Load Average range for my CPU cores?
* What storage is recommended for VoIPmonitor database?
* What happens when Load Average exceeds number of CPU cores?
* How to access sensor RRD charts in VoIPmonitor GUI?
* How to check which specific sensor is experiencing packet loss?
* What are incorrect solutions for database bottlenecks?
* How to use sniffer_stat Management API to check for packet drops on sensors?
* How much RAM should innodb_buffer_pool_size be set to?
* How to add CPU cores to resolve packet loss on sensors?
* When should I consider component separation for VoIPmonitor?
* How to monitor sensors after CPU upgrade to verify packet loss is resolved?
* What temporary configurations can reduce CPU load before hardware upgrade (callslimit, ringbuffer, silencedetect, saveaudio)?
* What does "too high load" error mean in Live Sniffer?
* Is "too high load" crash a bug or a feature?
* Why does the sniffer terminate with "too high load" error?
* How to assess true CPU load in VoIPmonitor?
* How to use htop to monitor per-thread and per-core CPU usage?
* Why is overall system CPU usage misleading?
* How to interpret load average (LA) against CPU cores?
* What is a good load average value for my system?
* How to check load average with uptime or top/htop?
* How to identify CPU bottlenecks when load exceeds core count?
* What to do when VoIPmonitor consumes 3000% CPU or system becomes unresponsive?
* How to forcefully terminate a runaway VoIPmonitor process?
* What are common causes of CPU spikes in VoIPmonitor?
* How to mitigate SIP REGISTER flood attacks causing high CPU?
* How to diagnose OOM (Out of Memory) events?
* How to fix GUI and CLI frequently inaccessible due to memory exhaustion?
* How to reduce memory usage of MySQL and VoIPmonitor?
* What is max_buffer_mem and how to configure it?
* How to restart VoIPmonitor service after a crash?
* What to do if service binary is not found after crash?
* How to prevent VoIPmonitor from freezing during GUI updates?
* What tools can help diagnose VoIPmonitor performance issues?
* What causes "packetbuffer: MEMORY IS FULL" error message?
* How to distinguish between RAM exhaustion and disk I/O bottleneck?
* What is the first diagnostic step for "MEMORY IS FULL" errors?
* How to use dmesg -T to check for storage errors?
* What type of errors to look for in dmesg when MEMORY IS FULL occurs?
* How to check for I/O errors, filesystem errors, NFS errors in kernel messages?
* What to do if kernel dmesg shows storage errors vs no errors?
* How to check for disk I/O performance issues causing restart loops?
* How to use iostat to diagnose disk utilization?
* How to perform write speed test to /var/spool/voipmonitor directory?
* What does "Cannot bind to port [5029]" error mean?
* How to clear zombie processes holding port 5029?
* How to resolve disk I/O bottleneck for VoIPmonitor?
* How to move spool directory to faster storage?
* What is the correct dd command to test disk write speed?
* What causes "HEAP FULL" errors in VoIPmonitor?
* How is jitterbuffer_adapt related to MEMORY IS FULL errors?
* What is the solution for MEMORY IS FULL + HEAP FULL crashes caused by jitterbuffer_adapt?
* Why should I disable jitterbuffer_adapt?
* What happens when I set jitterbuffer_adapt = no?
* What is the trade-off when disabling jitterbuffer_adapt?
* Can I still use jitterbuffer_f1 and jitterbuffer_f2 with jitterbuffer_adapt disabled?
* How to check NFS network latency causing MEMORY IS FULL?
* What tools to use for NFS diagnostics (ping, nfsiostat)?
* How to improve NFS storage performance for VoIPmonitor?
* Can MEMORY IS FULL errors be caused by network throughput bottlenecks in distributed mode?
* What causes MEMORY IS FULL errors on remote probes in client-server mode?
* How to identify if MEMORY IS FULL is caused by network bottleneck in distributed deployment?
* How does packetbuffer_sender mode affect network traffic between probes and central server?
* What is the difference between Local Processing and Packet Mirroring in distributed mode?
* How to switch from Packet Mirroring to Local Processing mode on remote probes?
* What is the solution for MEMORY IS FULL caused by insufficient network bandwidth between probes and central server?
* What are the trade-offs when switching to Local Processing mode (packetbuffer_sender=no)?
* What are the prerequisites for using Local Processing mode on remote probes?
* How to check if network throughput is the bottleneck in distributed VoIPmonitor deployment?
* How to monitor network utilization between remote probes and central server (sar -n DEV, iftop)?
* How to verify probe resources (CPU, RAM, disk) are sufficient for Local Processing mode?
* What configuration change fixes MEMORY IS FULL errors caused by bandwidth limitations in distributed mode?
* What happens when probes use packetbuffer_sender=yes with insufficient network bandwidth?
* Why do CDRs have significant delays during peak traffic in distributed mode?
* How to fix missing CDRs and MEMORY IS FULL errors on remote probes?

Revision as of 11:25, 6 January 2026

= Diagnosing Database Bottlenecks Using Sensor RRD Charts =

If your VoIPmonitor GUI becomes unresponsive or PHP processes are being terminated by the OOM (Out of Memory) killer, the root cause may be a '''database performance bottleneck''', not a PHP configuration issue.

This guide explains how to use the sensor's RRD (Round-Robin Database) charts to identify whether the database server is the limiting factor.

== Symptoms of Database Bottlenecks affecting the GUI ==

* GUI becomes extremely slow or unresponsive during peak hours
* PHP processes are killed by the OOM killer on the GUI server
* Dashboard and CDR views take a long time to load
* Alerts and reports fail during high traffic periods
* System appears fine during off-peak hours but degrades during peak usage

'''Note:''' These symptoms often occur when the GUI server is waiting for database queries to complete, causing PHP processes to pile up and consume excessive memory.
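If you want to confirm this symptom pattern from the shell before looking at the charts, the following sketch checks for recent OOM killer activity and lists the heaviest PHP-FPM workers. The process name (for example <code>php-fpm8.1</code> versus <code>php-fpm7.4</code>) depends on your distribution, so treat it as an illustration rather than an exact command set:

<syntaxhighlight lang="bash">
# Recent OOM killer events, with human-readable timestamps
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process'

# Largest PHP-FPM workers by resident memory (process name may differ,
# e.g. php-fpm8.1 or php-fpm7.4 depending on the distribution)
ps -eo pid,rss,etime,args --sort=-rss | grep -E '[p]hp-fpm' | head -n 10
</syntaxhighlight>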

== Understanding Sensor RRD Charts ==

The VoIPmonitor sensor generates performance charts (RRD files) that track system metrics over time. These charts are accessible through the GUI and provide visual indicators of where bottlenecks are occurring.

To access sensor RRD charts:
# Navigate to '''Settings > Sensors''' in the GUI
# Click the graph icon next to the sensor
# Select the time range covering the problematic peak hours

== Diagnostic Flowchart ==

''(Flowchart: GUI unresponsive or OOM errors → check the sensor RRD charts → is the SQL cache growing during peak hours? If yes, the database bottleneck is confirmed; next, identify whether CPU, memory, or storage I/O is the limiting resource.)''

== Diagnostic Step 1: Look for Growing SQL Cache ==

The most critical indicator of a database bottleneck is a growing '''SQL cache''' or '''SQL cache files''' metric during peak hours.

{| class="wikitable"
! Metric !! What to Look For !! Indicates
|-
| SQL Cache || Consistently increasing during peak hours, never decreasing || Database cannot keep up with insert rate
|-
| SQL Cache Files || Growing over time during peak usage || Database buffer pool too small or storage too slow
|-
| CPU Load (mysqld) || Near 100% during peak hours || CPU bottleneck on database server
|-
| Disk I/O (mysql) || High or saturated during peak hours || Storage bottleneck (magnetic disks instead of SSDs)
|}

If you see the SQL cache or SQL cache files growing consistently during peak traffic periods, '''the database server is the bottleneck'''.
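You can also watch this from the sensor's shell: the sniffer writes periodic status lines to syslog that include its heap and SQL queue counters. A minimal sketch, assuming the default syslog path; on RHEL/CentOS systems the file is typically /var/log/messages, and the exact field labels vary between sniffer versions:

<syntaxhighlight lang="bash">
# Follow the sensor's periodic status lines and watch whether the SQL
# queue/cache counters keep climbing during peak hours
tail -f /var/log/syslog | grep --line-buffered voipmonitor
</syntaxhighlight>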

== Diagnostic Step 2: Determine the Bottleneck Type ==

After confirming that the database is the issue, determine which resource is the limiting factor (a command-line sketch for these checks follows the lists below):

=== CPU Bottleneck ===
* Check database CPU usage during peak hours
* If mysqld is at or near 100% CPU, you need more CPU cores or faster CPUs

=== Memory Bottleneck ===
* Check whether the SQL cache grows because the buffer pool is too small
* The database runs out of RAM for caching, forcing disk reads
* The SQL cache chart shows a pattern of filling up and staying full

=== Storage I/O Bottleneck (Most Common) ===
* High disk I/O wait times for the mysqld process
* Disk latency (iowait) increases during peak hours
* Database storage on magnetic disks (e.g., 10K SAS) instead of SSD/NVMe
* SQL cache grows because data cannot be written/read fast enough
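These checks can usually be confirmed directly on the database host. A minimal sketch, assuming the sysstat package (pidstat, iostat) is installed and the server process is named mysqld (on newer MariaDB builds it may be mariadbd):

<syntaxhighlight lang="bash">
# CPU: is the database process pegged near (number of cores x 100) percent?
pidstat -u -p "$(pidof mysqld)" 5 3

# Storage I/O: look for %util close to 100 and rising await on the DB volume
iostat -x 5 3

# Overall I/O wait for the whole host ('wa' column)
vmstat 5 3
</syntaxhighlight>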

== Solutions for Database Performance Bottlenecks ==

=== Solution 1: Add More RAM to the Database Server ===

This is often the most effective fix for memory-related bottlenecks.

{| class="wikitable"
! Current RAM !! Recommended Upgrade !! Expected Impact
|-
| 32GB || 64GB or 128GB || Significantly reduces cache growth
|-
| 64GB || 128GB or 256GB || Handles much higher peak loads
|-
| 128GB || 256GB || Suitable for large deployments
|}

After adding RAM, tune <code>innodb_buffer_pool_size</code> in your MySQL configuration:

<syntaxhighlight lang="ini">
# /etc/mysql/my.cnf
# Set to 50-70% of total RAM on dedicated database server
innodb_buffer_pool_size = 128G
</syntaxhighlight>

For more tuning guidance, see Scaling - Database Performance.

'''Warning:''' Do NOT reduce <code>innodb_buffer_pool_size</code> on the GUI server when the database is the bottleneck. This will make the problem worse.
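Before and after resizing the buffer pool, you can check what is currently configured and how much of the pool is free at peak. A minimal sketch, assuming the mysql client can connect with sufficient privileges:

<syntaxhighlight lang="bash">
# Currently configured buffer pool size, in GiB
mysql -e "SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;"

# Free vs. total buffer pool pages; a persistently tiny 'free' value during
# peak hours suggests the pool is too small for the working set
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_%';"
</syntaxhighlight>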

=== Solution 2: Upgrade Database Storage to SSD/NVMe ===

If your database storage is on magnetic disks (e.g., 10K SATA or SAS), upgrading to SSDs is often the single most effective improvement.

{| class="wikitable"
! Current Storage !! Recommended Upgrade !! Expected Speedup
|-
| 10K RPM SATA HDD || NVMe SSD array || 10-50x faster I/O
|-
| 10K RPM SAS HDD || Enterprise SSD (SAS/SATA) || 5-20x faster I/O
|-
| Older SSD || Modern NVMe (PCIe 4.0+) || 2-5x faster I/O
|}

For high-traffic deployments, NVMe storage is recommended for the database host.

See Hardware - Storage Selection for detailed recommendations.
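To get a rough baseline before deciding on the upgrade, measure uncached write throughput on the filesystem that holds the MySQL datadir and watch device latency during peak hours. A hedged sketch that assumes the datadir is under /var/lib/mysql; it writes and then removes a 1 GiB test file:

<syntaxhighlight lang="bash">
# Direct (page-cache bypassing) sequential write test on the database filesystem
dd if=/dev/zero of=/var/lib/mysql/ddtest.tmp bs=1M count=1024 oflag=direct
rm -f /var/lib/mysql/ddtest.tmp

# Device-level utilization and latency; look for %util near 100 and high await
iostat -x 5
</syntaxhighlight>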

=== Solution 3: Temporary Mitigation - Schedule Alerts/Reports Outside Peak Hours ===

If you cannot immediately upgrade the database server hardware, temporarily reduce the load by scheduling intensive tasks during off-peak hours.

'''1. Disable or reduce alert frequency during peak hours:'''
* Navigate to '''GUI > Alerts'''
* Temporarily disable high-frequency alerts
* Set alerts to run during off-peak periods (e.g., 2am-4am)

'''2. Schedule reports outside peak usage:'''
* Navigate to '''GUI > Reports'''
* Configure scheduled reports for off-peak hours
* Avoid generating reports during the busiest part of the day

'''3. Reduce dashboard complexity during peak hours:'''
* Simplify dashboards that query large ranges of data
* Avoid "All time" statistics during peak loads
* Use cached dashboards or static displays when possible

=== Solution 4: Consider Component Separation ===

If the database server is a bottleneck and upgrading is not feasible, consider moving to a dedicated database architecture.

In a component separation deployment (see Scaling - Component Separation):
* '''Host 1:''' Dedicated database server with maximum RAM and SSD/NVMe storage
* '''Host 2:''' GUI web server
* '''Host 3:''' Sensor(s)

This allows you to independently scale the database with more powerful hardware without affecting the GUI.
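Once the database runs on its own host, each sensor and the GUI must be pointed at it and must be able to reach it over the network. A quick reachability check, using a hypothetical hostname db.example.com and the VoIPmonitor database user:

<syntaxhighlight lang="bash">
# Run from the GUI and sensor hosts; replace the hostname and credentials
# with your own values
mysql -h db.example.com -u voipmonitor -p -e "SELECT VERSION();"
</syntaxhighlight>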

== Common Pitfalls to Avoid ==

{| class="wikitable"
|+ Incorrect Solutions When the Database is the Bottleneck
! Incorrect Solution !! Why It Does Not Help
|-
| Reducing PHP memory_limit || This does NOT fix the root cause. PHP waits for the database; less memory means processes crash sooner.
|-
| Tuning PHP-FPM worker counts || More workers will pile up waiting for slow database queries, consuming even more memory.
|-
| Reducing innodb_buffer_pool_size || This makes the database slower, not faster. It causes more disk I/O and longer query times.
|-
| Adding RAM to the GUI server || If the bottleneck is the database, adding RAM to the GUI won't help. The database is the limiting factor.
|}

== Verification Checklist ==

After implementing a database upgrade to fix the bottleneck:

# Monitor SQL cache charts during the next peak traffic period
# Check that the SQL cache does not grow uncontrollably
# Verify GUI responsiveness during peak hours
# Confirm there are no OOM killer events
# Check database query latency (slow queries should be minimal)
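Some of these checks can also be scripted from the shell. A minimal sketch, assuming the mysql client can reach the database server:

<syntaxhighlight lang="bash">
# No new OOM killer events should appear after the upgrade
dmesg -T | grep -i -E 'out of memory|oom-killer'

# The slow query counter should grow only very slowly during peak hours
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
</syntaxhighlight>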

== Related Documentation ==
* Scaling - Database Performance
* Scaling - Component Separation
* Hardware - Storage Selection

== AI Summary for RAG ==

Summary: Guide for diagnosing database bottlenecks affecting VoIPmonitor GUI using sensor RRD charts. Symptoms: GUI unresponsive during peak hours, OOM killer terminating PHP processes, slow dashboard/CDR views. KEY DIAGNOSTIC: Check sensor RRD charts (Settings > Sensors > graph icon) for growing SQL cache during peak hours - primary indicator of database bottleneck. Bottleneck types: CPU (mysqld at 100%), Memory (buffer pool too small), Storage I/O (most common - high iowait, magnetic disks). Solutions: (1) Add RAM to database server and tune innodb_buffer_pool_size to 50-70% of RAM; (2) Upgrade storage from HDD to SSD/NVMe (10-50x speedup); (3) Schedule alerts/reports outside peak hours; (4) Component separation with dedicated database server. INCORRECT solutions: Do NOT reduce PHP memory_limit, do NOT tune PHP-FPM workers, do NOT reduce innodb_buffer_pool_size, do NOT add RAM to GUI server instead of database.

Keywords: database bottleneck, RRD charts, sensor performance, SQL cache, SQL cache files, peak hours, OOM killer, GUI unresponsive, dashboard slow, RAM upgrade, SSD upgrade, NVMe, iowait, innodb_buffer_pool_size, component separation, dedicated database

Key Questions:
* How do I diagnose database bottlenecks in VoIPmonitor?
* What do growing SQL cache files in RRD charts indicate?
* Why is my VoIPmonitor GUI slow during peak hours?
* How to fix OOM killer terminating PHP processes?
* Should I upgrade RAM on GUI server or database server?
* What storage is recommended for VoIPmonitor database?
* How to access sensor RRD charts in VoIPmonitor GUI?
* What are incorrect solutions for database bottlenecks?
* How much RAM should innodb_buffer_pool_size be set to?
* When should I consider component separation for VoIPmonitor?