Emergency procedures: Difference between revisions

From VoIPmonitor.org
{{DISPLAYTITLE:Emergency Procedures: GUI Performance Crisis}}
[[Category:Troubleshooting]]
[[Category:Database]]


= GUI Performance Crisis: Database Bottleneck Diagnosis =


When VoIPmonitor GUI becomes unresponsive or PHP processes are killed by OOM, the root cause is often a '''database bottleneck''', not PHP configuration. This guide shows how to diagnose using sensor RRD charts.


{{Note|For general troubleshooting, see [[Database_troubleshooting]], [[GUI_troubleshooting]], or [[Sniffer_troubleshooting]].}}


== Symptoms ==


* GUI extremely slow or unresponsive during peak hours
* PHP processes killed by OOM killer
* Dashboard and CDR views take a long time to load
* Alerts/reports fail during high traffic
* System fine during off-peak, degrades during peak


== Diagnostic Flowchart ==
<kroki lang="mermaid">
%%{init: {'flowchart': {'nodeSpacing': 15, 'rankSpacing': 35}}}%%
flowchart TD
    A[GUI Slow / OOM Errors] --> B{Check RRD Charts<br/>Settings → Sensors → 📊}
    B --> C{SQL Cache growing<br/>during peak?}
    C -->|No| D[NOT database bottleneck<br/>Check PHP/Apache config]
    C -->|Yes| E[Database Bottleneck<br/>Confirmed]
    E --> F{mysqld CPU ~100%?}
    F -->|Yes| G[CPU Bottleneck<br/>→ Upgrade CPU]
    F -->|No| H{High iowait?<br/>HDD storage?}
    H -->|Yes| I[I/O Bottleneck<br/>→ Upgrade to SSD/NVMe]
    H -->|No| J[Memory Bottleneck<br/>→ Add RAM + tune buffer_pool]

    style A fill:#f9f,stroke:#333
    style E fill:#ff9,stroke:#333
</kroki>


== Step 1: Access RRD Charts ==
# Navigate to '''Settings → Sensors'''
# Click the '''graph icon''' (📊) next to the sensor
# Select time range covering problematic peak hours


== Step 2: Identify Growing SQL Cache ==
The key indicator is '''SQL Cache''' or '''SQL Cache Files''' growing during peak hours:


{| class="wikitable"
|-
! Metric !! What to Look For !! Indicates
|-
| '''SQL Cache''' || Consistently increasing, never decreasing || DB cannot keep up with inserts
|-
| '''SQL Cache Files''' || Growing over time || Buffer pool too small or storage too slow
|-
| '''mysqld CPU''' || Near 100% || CPU bottleneck
|-
| '''Disk I/O (mysql)''' || High/saturated || Storage bottleneck (HDD vs SSD)
|}
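The same queue is also visible from the shell: the sensor writes periodic status lines to syslog that include an <code>SQLq[...]</code> counter (the exact line format varies by sensor version). The sketch below greps a fabricated sample excerpt; on a live system, point the same grep at <code>/var/log/syslog</code> or <code>journalctl -u voipmonitor</code> instead.

```shell
# Fabricated sample of sensor status lines (format varies by version);
# in production, grep your syslog rather than this sample file.
cat <<'EOF' > /tmp/vm_status_sample.log
voipmonitor[1234]: calls[312][312] SQLq[48] heap[0|0|0]
voipmonitor[1234]: calls[395][395] SQLq[1520] heap[0|0|0]
EOF
# Extract the SQL queue readings; a steadily rising number during peak
# hours matches the "SQL Cache growing" indicator above.
grep -o 'SQLq\[[0-9]*\]' /tmp/vm_status_sample.log
rm -f /tmp/vm_status_sample.log
```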


{{Warning|1=If SQL Cache is NOT growing, the problem is likely NOT the database. Check PHP/Apache configuration instead.}}


== Step 3: Identify Bottleneck Type ==
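The chart-based classification below can be cross-checked from a shell on the database host: <code>top</code> shows the live <code>mysqld</code> CPU and iowait ("wa") figures, and <code>iostat -x 5</code> shows per-device saturation. As a minimal scripted sketch (Linux only; assumes the standard <code>/proc/stat</code> field layout), the cumulative iowait share since boot can be computed directly:

```shell
# First line of /proc/stat: cpu user nice system idle iowait irq ...
# (all values in clock ticks, summed across cores since boot).
read -r _cpu user nice system idle iowait _rest < /proc/stat
total=$(( user + nice + system + idle + iowait ))
echo "iowait since boot: $(( 100 * iowait / total ))%"
```

This is an average since boot, so it understates peak-hour spikes; use <code>top</code> or <code>iostat</code> during the actual peak for a live view.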


=== CPU Bottleneck ===
* <code>mysqld</code> at or near 100% CPU
* '''Solution:''' More CPU cores or faster CPU

=== Memory Bottleneck ===
* SQL cache fills up and stays full
* Buffer pool too small for dataset
* '''Solution:''' Add RAM, tune <code>innodb_buffer_pool_size</code>

=== Storage I/O Bottleneck (Most Common) ===
* High <code>iowait</code> during peak hours
* Database on magnetic disks (HDD)
* '''Solution:''' Upgrade to SSD/NVMe (10-50x improvement)

== Solutions ==


=== Add RAM to Database Server ===


<syntaxhighlight lang="ini">
# /etc/mysql/my.cnf
# Set to 50-70% of total RAM on dedicated DB server
innodb_buffer_pool_size = 64G
</syntaxhighlight>


{| class="wikitable"
|-
! Current RAM !! Recommended !! <code>innodb_buffer_pool_size</code>
|-
| 32GB || 64-128GB || 32-64G
|-
| 64GB || 128-256GB || 64-128G
|}
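As a quick sanity check of the 50-70% rule above, the target value can be computed from the host's RAM. The snippet below uses a hard-coded example of 128 GB rather than autodetecting memory:

```shell
# Example sizing for a dedicated DB host with 128 GB RAM, using 60%
# (middle of the recommended 50-70% range).
ram_gb=128
pool_gb=$(( ram_gb * 60 / 100 ))
echo "innodb_buffer_pool_size = ${pool_gb}G"
# → innodb_buffer_pool_size = 76G
```

Adjust <code>ram_gb</code> and the percentage for your host, and leave headroom for the OS and MySQL's other buffers; on a shared GUI+DB host, stay at the low end of the range.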


=== Upgrade to SSD/NVMe ===


{| class="wikitable"
|-
! Current !! Upgrade To !! Expected Speedup
|-
| 10K RPM HDD || NVMe SSD || 10-50x
|-
| SAS HDD || Enterprise SSD || 5-20x
|-
| Older SSD || NVMe (PCIe 4.0+) || 2-5x
|}
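To get a rough feel for the difference before committing to hardware, a synchronous-write probe on the MySQL data volume can help. This is only a crude sketch using GNU <code>dd</code>; a proper benchmark tool such as <code>fio</code> gives far more reliable numbers:

```shell
# oflag=dsync forces every 4K block to be committed to disk, mimicking
# the fsync-heavy write pattern of InnoDB. HDDs typically sustain only
# a few hundred dsync writes/s; NVMe manages tens of thousands.
# Run on the MySQL data volume, not /tmp, for a meaningful result.
dd if=/dev/zero of=/tmp/dsync_probe bs=4k count=200 oflag=dsync 2>&1 | tail -n 1
rm -f /tmp/dsync_probe
```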


=== Temporary Mitigation ===
If an immediate hardware upgrade is not possible:
* '''Alerts:''' Reduce frequency or schedule during off-peak (2am-4am)
* '''Reports:''' Schedule for off-peak hours
* '''Dashboards:''' Simplify queries, avoid "All time" ranges


=== Component Separation ===


For persistent issues, consider a dedicated database server:
* '''Host 1:''' Database (max RAM + SSD/NVMe)
* '''Host 2:''' GUI web server
* '''Host 3:''' Sensor(s)


See [[Scaling#Scaling_Through_Component_Separation|Scaling - Component Separation]].


== Common Mistakes ==


{{Warning|1=These do NOT fix database bottlenecks:}}


{| class="wikitable"
|-
! Wrong Action !! Why It Fails
|-
| Reducing PHP <code>memory_limit</code> || PHP waits for DB; less memory = earlier crashes
|-
| Adding more PHP-FPM workers || More workers pile up waiting for slow DB
|-
| Reducing <code>innodb_buffer_pool_size</code> || Makes DB slower, increases disk I/O
|-
| Adding RAM to GUI server || Bottleneck is DB server, not GUI
|}


== Verification ==
After implementing a fix:
# Monitor SQL cache during next peak period
# Verify SQL cache does NOT grow uncontrollably
# Confirm GUI responsiveness
# Check for OOM killer events in system logs
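For step 4, OOM-killer events can be counted directly from the kernel log. The sketch below runs against a fabricated log excerpt; on a live GUI server, run the same grep against <code>journalctl -k</code> output or <code>/var/log/kern.log</code>:

```shell
# Fabricated kernel-log sample showing one OOM kill of a PHP worker;
# replace with real journalctl/kern.log output in production.
cat <<'EOF' > /tmp/kern_sample.log
kernel: Out of memory: Killed process 2143 (php-fpm7.4) total-vm:1982332kB
kernel: oom_reaper: reaped process 2143 (php-fpm7.4), now anon-rss:0kB
EOF
# Count OOM events (case-insensitive); zero means the fix is holding.
grep -ci 'out of memory' /tmp/kern_sample.log
rm -f /tmp/kern_sample.log
```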


== See Also ==


* [[Database_troubleshooting]] - SQL queue issues, CDR delays
* [[Scaling]] - Performance tuning and database optimization
* [[GUI_troubleshooting]] - HTTP 500, login issues, debug mode


== AI Summary for RAG ==


'''Summary:''' Emergency guide for diagnosing database bottlenecks affecting VoIPmonitor GUI using sensor RRD charts. Symptoms: GUI unresponsive during peak hours, OOM killer terminating PHP, slow dashboards. KEY DIAGNOSTIC: Access RRD charts (Settings → Sensors → graph icon), look for growing SQL cache during peak hours - primary indicator of DB bottleneck. Bottleneck types: CPU (mysqld at 100%), Memory (buffer pool full), Storage I/O (most common - high iowait, HDD storage). Solutions: (1) Add RAM and set innodb_buffer_pool_size to 50-70% of RAM; (2) Upgrade HDD to SSD/NVMe (10-50x speedup); (3) Schedule alerts/reports off-peak; (4) Component separation with dedicated DB server. WRONG solutions: Do NOT reduce PHP memory_limit, do NOT add PHP-FPM workers, do NOT reduce innodb_buffer_pool_size, do NOT add RAM to GUI server.


'''Keywords:''' database bottleneck, RRD charts, SQL cache, peak hours, OOM killer, GUI slow, GUI unresponsive, innodb_buffer_pool_size, SSD upgrade, NVMe, iowait, component separation, emergency procedures, performance crisis


'''Key Questions:'''
* How to diagnose database bottlenecks in VoIPmonitor?
* What does growing SQL cache in RRD charts indicate?
* Why is VoIPmonitor GUI slow during peak hours?
* How to fix OOM killer terminating PHP processes?
* Should I upgrade RAM on GUI or database server?
* What storage is recommended for VoIPmonitor database?
* How to access sensor RRD charts?
* What are wrong solutions for database bottlenecks?

Latest revision as of 16:48, 8 January 2026

