{{DISPLAYTITLE:Troubleshooting: No Calls Being Sniffed}}
= Sniffer Troubleshooting =


'''This guide provides a systematic, step-by-step process to diagnose why the VoIPmonitor sensor might not be capturing calls, and covers common sniffer/sensor problems organized by symptom.''' For configuration reference, see [[Sniffer_configuration]]. For performance tuning, see [[Scaling]].


== Critical First Step: Is Traffic Reaching the Interface? ==

{{Warning|Before any sensor tuning, verify packets are reaching the network interface. If packets aren't there, no amount of sensor configuration will help.}}

<syntaxhighlight lang="bash">
# Check for SIP traffic on the capture interface
tcpdump -i eth0 -nn "host <PROBLEMATIC_IP> and port 5060" -c 10

# If no packets: Network/SPAN issue - contact network admin
# If packets visible: Proceed with sensor troubleshooting below
</syntaxhighlight>

<kroki lang="mermaid">
graph TD
    A[No Calls Recorded] --> B{Packets on interface?<br/>tcpdump -i eth0 port 5060}
    B -->|No packets| C[Network Issue]
    C --> C1[Check SPAN/mirror config]
    C --> C2[Verify VLAN tagging]
    C --> C3[Check cable/port]
    B -->|Packets visible| D[Sensor Issue]
    D --> D1[Check voipmonitor.conf]
    D --> D2[Check GUI Capture Rules]
    D --> D3[Check logs for errors]
</kroki>

== Troubleshooting Flowchart ==

<mermaid>
flowchart TD
    A[No Calls Being Captured] --> B{Step 1: Service Running?}
    B -->|No| B1[systemctl restart voipmonitor]
    B -->|Yes| C{Step 2: Traffic on Interface?<br/>tshark -i eth0 -Y 'sip'}
    C -->|No packets| D[Step 3: Network Issue]
    D --> D1{Interface UP?}
    D1 -->|No| D2[ip link set dev eth0 up]
    D1 -->|Yes| D3{SPAN/RSPAN?}
    D3 -->|Yes| D4[Enable promisc mode]
    D3 -->|ERSPAN/GRE/TZSP| D5[Check tunnel config]
    C -->|Packets visible| E[Step 4: VoIPmonitor Config]
    E --> E1{interface correct?}
    E1 -->|No| E2[Fix interface in voipmonitor.conf]
    E1 -->|Yes| E3{sipport correct?}
    E3 -->|No| E4[Add port: sipport = 5060,5080]
    E3 -->|Yes| E5{BPF filter blocking?}
    E5 -->|Maybe| E6[Comment out filter directive]
    E5 -->|No| F[Step 5: GUI Capture Rules]
    F --> F1{Rules with Skip: ON?}
    F1 -->|Yes| F2[Remove/modify rules + reload sniffer]
    F1 -->|No| G[Step 6: Check Logs]
    G --> H{OOM Events?}
    H -->|Yes| H1[Step 7: Add RAM / tune MySQL]
    H -->|No| I{Large SIP packets?}
    I -->|Yes| I1{External SIP source?<br/>Kamailio/HAProxy mirror}
    I1 -->|No| I2[Increase snaplen in voipmonitor.conf]
    I1 -->|Yes| I3[Fix external source: Kamailio siptrace or HAProxy tee]
    I2 --> I4[If snaplen change fails, recheck with tcpdump -s0]
    I4 --> I1
    I -->|No| J[Contact Support]
</mermaid>

== Quick Diagnostic Checklist ==

{| class="wikitable"
|-
! Check !! Command !! Expected Result
|-
| Service running || <code>systemctl status voipmonitor</code> || Active (running)
|-
| Traffic on interface || <code>tshark -i eth0 -c 5 -Y "sip"</code> || SIP packets displayed
|-
| Interface errors || <code>ip -s link show eth0</code> || No RX errors/drops
|-
| Promiscuous mode || <code>ip link show eth0</code> || PROMISC flag present
|-
| Logs || <code>tail -100 /var/log/syslog \| grep voip</code> || No critical errors
|-
| GUI rules || Settings → Capture Rules || No unexpected "Skip" rules
|}
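If you repeat these checks often, a small wrapper can run them in one pass. This is a minimal sketch, assuming the capture interface is <code>eth0</code> and a Debian-style <code>/var/log/syslog</code>; adjust both for your system:

<syntaxhighlight lang="bash">
#!/bin/bash
# Quick diagnostic sweep - adjust IFACE and the log path to match your deployment
IFACE=eth0

echo "== Service status ==";        systemctl is-active voipmonitor
echo "== SIP packets on $IFACE =="; timeout 15 tshark -i "$IFACE" -c 5 -Y "sip" 2>/dev/null
echo "== Interface counters ==";    ip -s link show "$IFACE"
echo "== Promiscuous flag ==";      ip link show "$IFACE" | grep -o PROMISC
echo "== Recent sensor log ==";     grep -i voipmonitor /var/log/syslog | tail -10
</syntaxhighlight>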


== Step 1: Is the VoIPmonitor Service Running Correctly? ==
First, confirm that the sensor process is active and has loaded the correct configuration file.

;1. Check the service status (systemd systems):
<syntaxhighlight lang="bash">
# Check status
systemctl status voipmonitor

# View recent logs
journalctl -u voipmonitor --since "10 minutes ago"

# Start/restart
systemctl restart voipmonitor
</syntaxhighlight>
Look for a line that says <code>Active: active (running)</code>. If the service is inactive or failed, restart it and check the status again.

Common startup failures:
* '''Interface not found''': Check that <code>interface</code> in voipmonitor.conf matches the <code>ip a</code> output
* '''Port already in use''': Another process is using the management port
* '''License issue''': See [[License]] for activation problems

;2. Verify the running process:
<syntaxhighlight lang="bash">
ps aux | grep voipmonitor
</syntaxhighlight>
This shows the running process and the exact command-line arguments it was started with. Critically, ensure it is using the correct configuration file, for example <code>--config-file /etc/voipmonitor.conf</code>. If it is not, there may be an issue with your startup script.

;3. Check the interface and SIP port configuration:
<syntaxhighlight lang="bash">
# Check current config
grep -E "^interface|^sipport" /etc/voipmonitor.conf

# Example correct config:
# interface = eth0
# sipport = 5060
</syntaxhighlight>
{{Tip|For multiple SIP ports: <code>sipport = 5060,5061,5080</code>}}
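To confirm which configuration file the running sensor actually loaded, and what it sets for the parameters checked above, a short sketch like the following can help. It assumes the process was started with an explicit <code>--config-file</code> argument and falls back to the default path otherwise:

<syntaxhighlight lang="bash">
# Locate the config file used by the running process (default path if none is given)
CONF=$(ps -eo args | grep "[v]oipmonitor" | grep -oP -- '--config-file\s+\K\S+' | head -1)
CONF=${CONF:-/etc/voipmonitor.conf}
echo "Using config: $CONF"

# Show the capture-related settings discussed in this guide
grep -E "^(interface|sipport|filter|id_sensor)" "$CONF"
</syntaxhighlight>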
If the service is running, the next step is to verify if the VoIP packets (SIP/RTP) are actually arriving at the server's network interface. The best tool for this is <code>tshark</code> (the command-line version of Wireshark).
 
;1. Install tshark:
<syntaxhighlight lang="bash">
# For Debian/Ubuntu
apt-get update && apt-get install tshark

# For CentOS/RHEL/AlmaLinux
yum install wireshark
</syntaxhighlight>

;2. Listen for SIP traffic on the correct interface:
Replace <code>eth0</code> with the interface name you have configured in <code>voipmonitor.conf</code>.
<syntaxhighlight lang="bash">
tshark -i eth0 -Y "sip || rtp" -n
</syntaxhighlight>
* '''If you see a continuous stream of SIP and RTP packets''', traffic is reaching the server and the problem is likely in VoIPmonitor's configuration (see Step 4).
* '''If you see NO packets''', the problem lies with your network configuration. Proceed to Step 3.

The following quick checks cover the most common reasons calls are still not recorded:

=== SPAN/Mirror Not Configured ===
If <code>tcpdump</code> shows no traffic:
# Verify the switch SPAN/mirror port configuration
# Check that both directions (ingress and egress) are mirrored
# Confirm VLAN tagging is preserved if needed
# Test physical connectivity (cable, port status)

See [[Sniffing_modes]] for SPAN, RSPAN, and ERSPAN configuration.

=== Filter Parameter Too Restrictive ===
If <code>filter</code> is set in voipmonitor.conf, it may exclude the traffic you want to capture:
<syntaxhighlight lang="bash">
# Check filter
grep "^filter" /etc/voipmonitor.conf

# Temporarily disable to test:
# comment out the filter line and restart the sniffer
</syntaxhighlight>

=== GUI Capture Rules Blocking ===
Navigate to '''Settings → Capture Rules''' and check for rules with the action "Skip" that may be blocking calls. Rules are processed in order - a Skip rule early in the list will block matching calls.

See [[Capture_rules]] for detailed configuration.

=== Missing id_sensor Parameter ===
'''Symptom''': SIP packets visible in the Capture/PCAP section but missing from CDR, SIP messages, and Call flow.

'''Cause''': The <code>id_sensor</code> parameter is not configured or is missing. This parameter is required to associate captured packets with the CDR database.

'''Solution''':
<syntaxhighlight lang="bash">
# Check if id_sensor is set
grep "^id_sensor" /etc/voipmonitor.conf

# Add or correct the parameter
echo "id_sensor = 1" >> /etc/voipmonitor.conf

# Restart the service
systemctl restart voipmonitor
</syntaxhighlight>

{{Tip|Use a unique numeric identifier (1-65535) for each sensor. Essential for multi-sensor deployments. See [[Sniffer_configuration#id_sensor|id_sensor documentation]].}}

== Step 3: Troubleshoot Network and Interface Configuration ==
If <code>tshark</code> shows no traffic, it means the packets are not being delivered to the operating system correctly.
;1. Check if the interface is UP:
Ensure the network interface is active.
<syntaxhighlight lang="bash">
ip link show eth0
</syntaxhighlight>
The output should contain the word <code>UP</code>. If it doesn't, bring it up with:
<syntaxhighlight lang="bash">
ip link set dev eth0 up
</syntaxhighlight>

;2. Check for Promiscuous Mode (for SPAN/RSPAN Mirrored Traffic):
'''Important:''' Promiscuous mode requirements depend on your traffic mirroring method:

* '''SPAN/RSPAN (Layer 2 mirroring):''' The network interface '''must''' be in promiscuous mode. Mirrored packets retain their original MAC addresses, so the interface would normally ignore them. Promiscuous mode forces the interface to accept all packets regardless of destination MAC.

* '''ERSPAN/GRE/TZSP/VXLAN (Layer 3 tunnels):''' Promiscuous mode is '''NOT required'''. These tunneling protocols encapsulate the mirrored traffic inside IP packets that are addressed directly to the sensor's IP address. The operating system receives these packets normally, and VoIPmonitor automatically decapsulates them to extract the inner SIP/RTP traffic.

For SPAN/RSPAN deployments, check the current promiscuous mode status:
<syntaxhighlight lang="bash">
ip link show eth0
</syntaxhighlight>
Look for the <code>PROMISC</code> flag.

Enable promiscuous mode manually if needed:
<syntaxhighlight lang="bash">
ip link set eth0 promisc on
</syntaxhighlight>
If this solves the problem, you should make the change permanent. The <code>install-script.sh</code> for the sensor usually attempts to do this, but it can fail.

;3. Verify Your SPAN/Mirror/TAP Configuration:
This is the most common cause of no traffic. Double-check your network switch or hardware tap configuration to ensure:
* The correct source ports (where your PBX/SBC is connected) are being monitored.
* The correct destination port (where your VoIPmonitor sensor is connected) is configured.
* If you are monitoring traffic across different VLANs, ensure your mirror port is configured to carry all necessary VLAN tags (often called "trunk" mode).

;4. Check for Non-Call SIP Traffic Only:
If you see SIP traffic but it consists only of OPTIONS, NOTIFY, SUBSCRIBE, or MESSAGE methods (without any INVITE packets), there are no calls to generate CDRs. This can occur in environments that use SIP for non-call purposes like heartbeat checks or instant messaging.

You can configure VoIPmonitor to process and store these non-call SIP messages. See [[SIP_OPTIONS/SUBSCRIBE/NOTIFY]] and [[MESSAGES]] for configuration details.

Enable non-call SIP message processing in '''/etc/voipmonitor.conf''':
<syntaxhighlight lang="ini">
# Process SIP OPTIONS (qualify pings). Default: no
sip-options = yes

# Process SIP MESSAGE (instant messaging). Default: yes
sip-message = yes

# Process SIP SUBSCRIBE requests. Default: no
sip-subscribe = yes

# Process SIP NOTIFY requests. Default: no
sip-notify = yes
</syntaxhighlight>

Note that enabling these for processing and storage can significantly increase database load in high-traffic scenarios. Use with caution and monitor SQL queue growth. See [[SIP_OPTIONS/SUBSCRIBE/NOTIFY#Performance_Tuning|Performance Tuning]] for optimization tips.

== Missing Audio / RTP Issues ==

=== One-Way Audio (Asymmetric Mirroring) ===
'''Symptom''': SIP recorded but only one RTP direction captured.

'''Cause''': SPAN port configured for only one direction.

'''Diagnosis''':
<syntaxhighlight lang="bash">
# Count RTP packets per direction
tshark -i eth0 -Y "rtp" -T fields -e ip.src -e ip.dst | sort | uniq -c
</syntaxhighlight>

If one direction shows 0 or very few packets, configure the switch to mirror both ingress and egress traffic.

=== RTP Not Associated with Call ===
'''Symptom''': Audio plays in sniffer but not in GUI, or RTP listed under wrong call.

'''Possible causes''':

'''1. SIP and RTP on different interfaces/VLANs''':
<syntaxhighlight lang="ini">
# voipmonitor.conf - enable automatic RTP association
auto_enable_use_blocks = yes
</syntaxhighlight>

'''2. NAT not configured''':
<syntaxhighlight lang="ini">
# voipmonitor.conf - for NAT scenarios
natalias = <public_ip> <private_ip>

# If not working, try reversed order:
natalias = <private_ip> <public_ip>
</syntaxhighlight>

'''3. External device modifying media ports''':

If the SDP advertises one port but RTP arrives on a different port (SBC/media server issue):
<syntaxhighlight lang="bash">
# Compare SDP ports vs actual RTP
tshark -r call.pcap -Y "sip.Method == INVITE" -V | grep "m=audio"
tshark -r call.pcap -Y "rtp" -T fields -e udp.dstport | sort -u
</syntaxhighlight>

If the ports don't match, the external device must be configured to preserve SDP ports - VoIPmonitor cannot compensate.

=== RTP Incorrectly Associated with Wrong Call (PBX Port Reuse) ===
'''Symptom''': RTP streams from one call appear associated with a different CDR when your PBX aggressively reuses the same IP:port across multiple calls.

'''Cause''': When a PBX reuses media ports, VoIPmonitor may incorrectly correlate RTP packets to the wrong call based on weaker correlation methods.

'''Solution''': Enable <code>rtp_check_both_sides_by_sdp</code> to require verification of both source and destination IP:port against the SDP:
<syntaxhighlight lang="ini">
# voipmonitor.conf - require both source and destination to match SDP
rtp_check_both_sides_by_sdp = yes

# Alternative (strict) mode - allows initial unverified packets
rtp_check_both_sides_by_sdp = strict
</syntaxhighlight>

{{Warning|Enabling this may prevent RTP association for calls using NAT, as the source IP:port will not match the SDP. Use <code>natalias</code> mappings or the <code>strict</code> setting to mitigate this.}}

=== Snaplen Truncation ===
'''Symptom''': Large SIP messages truncated, incomplete headers.

'''Solution''':
<syntaxhighlight lang="ini">
# voipmonitor.conf - increase packet capture size
snaplen = 8192
</syntaxhighlight>

For Kamailio siptrace, also check <code>trace_msg_fragment_size</code> in the Kamailio config. See [[Sniffer_configuration#snaplen|snaplen documentation]].
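To check whether your live SIP traffic actually exceeds the configured <code>snaplen</code>, you can compare the largest observed SIP frame against the configured value. A rough sketch (samples 200 SIP packets on eth0; adjust the interface and port to your environment):

<syntaxhighlight lang="bash">
# Largest SIP frame seen on the wire vs. configured snaplen
MAX=$(timeout 60 tshark -i eth0 -c 200 -f "port 5060" -Y "sip" -T fields -e frame.len 2>/dev/null | sort -n | tail -1)
CFG=$(grep -oP '^snaplen\s*=\s*\K[0-9]+' /etc/voipmonitor.conf)
echo "largest SIP frame: ${MAX:-n/a} bytes, configured snaplen: ${CFG:-default (3200)}"
</syntaxhighlight>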


== Step 4: Check the VoIPmonitor Configuration ==
If <code>tshark</code> sees traffic but VoIPmonitor does not, the problem is almost certainly in <code>voipmonitor.conf</code>.

;1. Check the <code>interface</code> directive:
:Make sure the <code>interface</code> parameter in <code>/etc/voipmonitor.conf</code> exactly matches the interface where you see traffic with <code>tshark</code>. For example: <code>interface = eth0</code>.

;2. Check the <code>sipport</code> directive:
:By default, VoIPmonitor only listens on port 5060. If your PBX uses a different port for SIP, you must add it. For example:
:<code>sipport = 5060,5080</code>

:'''For distributed/probe setups:''' If you are using a remote sensor (probe) with Packet Mirroring, the <code>sipport</code> configuration must match on BOTH the probe AND the central analysis host. See [[Sniffer_distributed_architecture#Troubleshooting_Distributed_Deployments|Distributed Architecture: Troubleshooting]] for details.

;3. Check for a restrictive <code>filter</code>:
:If you have a BPF <code>filter</code> configured, ensure it is not accidentally excluding the traffic you want to see. For debugging, try commenting out the <code>filter</code> line entirely and restarting the sensor.

== Step 5: Check GUI Capture Rules (Causing Call Stops) ==
If <code>tshark</code> sees SIP traffic and the sniffer configuration appears correct, but the probe stops processing calls or shows traffic only on the network interface, GUI capture rules may be the culprit.

Capture rules configured in the GUI can instruct the sniffer to ignore ("skip") all processing for matched calls. This includes calls matching specific IP addresses or telephone number prefixes.

;1. Review existing capture rules:
:Navigate to '''GUI -> Capture rules''' and examine all rules for any that might be blocking your traffic.
:Look specifically for rules with the '''Skip''' option set to '''ON''' (displayed as "Skip: ON"). The Skip option instructs the sniffer to completely ignore matching calls (no files, RTP analysis, or CDR creation).

;2. Test by temporarily removing all capture rules:
:To isolate the issue, first create a backup of your GUI configuration:
:* Navigate to '''Tools -> Backup & Restore -> Backup GUI -> Configuration tables'''
:* This saves your current settings including capture rules
:* Delete all capture rules from the GUI
:* Click the '''Apply''' button to save changes
:* Reload the sniffer by clicking the green '''"reload sniffer"''' button in the control panel
:* Test if calls are now being processed correctly
:* If resolved, restore the configuration from the backup and systematically investigate the rules to identify the problematic one

;3. Identify the problematic rule:
:* After restoring your configuration, remove rules one at a time and reload the sniffer after each removal
:* When calls start being processed again, you have identified the problematic rule
:* Review the rule's match criteria (IP addresses, prefixes, direction) against your actual traffic pattern
:* Adjust the rule's conditions or Skip setting as needed

;4. Verify rules are reloaded:
:After making changes to capture rules, remember that changes are '''not automatically applied''' to the running sniffer. You must click the '''"reload sniffer"''' button in the control panel, or the rules will continue using the previous configuration.

For more information on capture rules, see [[Capture_rules]].

== Step 6: Check VoIPmonitor Logs for Errors ==
Finally, VoIPmonitor's own logs are the best source for clues. Check the system log for any error messages generated by the sensor on startup or during operation.
<syntaxhighlight lang="bash">
# For Debian/Ubuntu
journalctl -u voipmonitor -f
# or
tail -f /var/log/syslog | grep voipmonitor

# For CentOS/RHEL/AlmaLinux
tail -f /var/log/messages | grep voipmonitor
</syntaxhighlight>
Look for errors like:
* "pcap_open_live(eth0) error: eth0: No such device" (wrong interface name)
* "Permission denied" (the sensor is not running with sufficient privileges)
* Errors related to database connectivity
* Messages about dropping packets

== PACKETBUFFER Saturation ==

'''Symptom''': Log shows <code>PACKETBUFFER: memory is FULL</code>, truncated RTP recordings.

{{Warning|This alert refers to VoIPmonitor's '''internal packet buffer''' (<code>max_buffer_mem</code>), '''NOT system RAM'''. High system memory availability does not prevent this error. The root cause is always a downstream bottleneck (disk I/O or CPU) preventing packets from being processed fast enough.}}

'''Before testing solutions''', gather diagnostic data:
* Check sensor logs: <code>/var/log/syslog</code> (Debian/Ubuntu) or <code>/var/log/messages</code> (RHEL/CentOS)
* Generate debug log via GUI: '''Tools → Generate debug log'''

=== Diagnose: I/O vs CPU Bottleneck ===

{{Warning|Do not guess the bottleneck source. Use proper diagnostics first to identify whether the issue is disk I/O, CPU, or database-related. Disabling storage as a test is valid but should be used to '''confirm''' findings, not as the primary diagnostic method.}}

==== Step 1: Check IO[] Metrics (v2026.01.3+) ====

'''Starting with version 2026.01.3''', VoIPmonitor includes built-in disk I/O monitoring that directly shows disk saturation status:

<syntaxhighlight lang="text">
[283.4/283.4Mb/s] IO[B1.1|L0.7|U45|C75|W125|R10|WI1.2k|RI0.5k]
</syntaxhighlight>

'''Quick interpretation:'''
{| class="wikitable"
|-
! Metric !! Meaning !! Problem Indicator
|-
| '''C''' (Capacity) || % of disk's sustainable throughput used || '''C ≥ 80% = Warning''', '''C ≥ 95% = Saturated'''
|-
| '''L''' (Latency) || Current write latency in ms || '''L ≥ 3× B''' (baseline) = Saturated
|-
| '''U''' (Utilization) || % time disk is busy || '''U > 90%''' = Disk at limit
|}

'''If you see <code>DISK_SAT</code> or <code>WARN</code> after IO[]:'''
<syntaxhighlight lang="text">
IO[B1.1|L8.5|U98|C97|W890|R5|WI12.5k|RI0.1k] DISK_SAT
</syntaxhighlight>

→ This confirms an I/O bottleneck. Skip to [[#Solution:_I.2FO_Bottleneck|I/O Bottleneck Solutions]].

'''For older versions or additional confirmation''', continue with the steps below.

{{Note|See [[Syslog_Status_Line#IO.5B....5D_-_Disk_I.2FO_Monitoring_.28v2026.01.3.2B.29|Syslog Status Line - IO[] section]] for detailed field descriptions.}}

==== Step 2: Read the Full Syslog Status Line ====

VoIPmonitor outputs a status line every 10 seconds. This is your first diagnostic tool:
<syntaxhighlight lang="bash">
# Monitor in real-time
journalctl -u voipmonitor -f
# or
tail -f /var/log/syslog | grep voipmonitor
</syntaxhighlight>

'''Example status line:'''
<syntaxhighlight lang="text">
calls[424] PS[C:4 S:41 R:13540] SQLq[C:0 M:0] heap[45|30|20] comp[48] [25.6Mb/s] t0CPU[85%] t1CPU[12%] t2CPU[8%] tacCPU[8|8|7|7%] RSS/VSZ[365|1640]MB
</syntaxhighlight>

'''Key metrics for bottleneck identification:'''
{| class="wikitable"
|-
! Metric !! What It Indicates !! I/O Bottleneck Sign !! CPU Bottleneck Sign
|-
| <code>heap[A&#124;B&#124;C]</code> || Buffer fill % (primary / secondary / processing) || High A with low t0CPU || High A with high t0CPU
|-
| <code>t0CPU[X%]</code> || Packet capture thread (single-core, cannot parallelize) || Low (<50%) || High (>80%)
|-
| <code>comp[X]</code> || Active compression threads || Very high (maxed out) || Normal
|-
| <code>SQLq[C:X M:Y]</code> || Pending SQL queries || Growing = database bottleneck || Stable
|-
| <code>tacCPU[...]</code> || TAR compression threads || All near 100% = compression bottleneck || Normal
|}

'''Interpretation flowchart:'''

<kroki lang="mermaid">
graph TD
    A[heap values rising] --> B{Check t0CPU}
    B -->|t0CPU > 80%| C[CPU Bottleneck]
    B -->|t0CPU < 50%| D{Check comp and tacCPU}
    D -->|comp maxed, tacCPU high| E[I/O Bottleneck<br/>Disk cannot keep up with writes]
    D -->|comp normal| F{Check SQLq}
    F -->|SQLq growing| G[Database Bottleneck]
    F -->|SQLq stable| H[Mixed/Other Issue]

    C --> C1[Solution: CPU optimization]
    E --> E1[Solution: Faster storage]
    G --> G1[Solution: MySQL tuning]
</kroki>
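To watch how the <code>heap[...]</code> and <code>t0CPU[...]</code> values evolve over time without reading every full status line, a small extraction sketch can help. It assumes the status lines land in <code>/var/log/syslog</code> with the field layout shown above:

<syntaxhighlight lang="bash">
# Print timestamp, heap[] and t0CPU[] from each status line as it arrives
tail -f /var/log/syslog | grep --line-buffered "voipmonitor" | \
  awk '{
    heap = ""; t0 = "";
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^heap\[/)  heap = $i;
      if ($i ~ /^t0CPU\[/) t0   = $i;
    }
    if (heap != "") { print $1, $2, $3, heap, t0; fflush(); }
  }'
</syntaxhighlight>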
 
==== Step 3: Linux I/O Diagnostics ====

Use these standard Linux tools to confirm an I/O bottleneck:

'''Install required tools:'''
<syntaxhighlight lang="bash">
# Debian/Ubuntu
apt install sysstat iotop ioping

# CentOS/RHEL
yum install sysstat iotop ioping
</syntaxhighlight>

'''3a) iostat - Disk utilization and wait times'''
<syntaxhighlight lang="bash">
# Run for 10 intervals of 2 seconds
iostat -xz 2 10
</syntaxhighlight>

'''Key output columns:'''
<syntaxhighlight lang="text">
Device  r/s    w/s  rkB/s  wkB/s  await  %util
sda    12.50  245.30  50.00  1962.40  45.23  98.50
</syntaxhighlight>

{| class="wikitable"
|-
! Column !! Description !! Problem Indicator
|-
| <code>%util</code> || Device utilization percentage || '''> 90%''' = disk saturated
|-
| <code>await</code> || Average I/O wait time (ms) || '''> 20ms''' for SSD, '''> 50ms''' for HDD = high latency
|-
| <code>w/s</code> || Writes per second || Compare with disk's rated IOPS
|}

'''3b) iotop - Per-process I/O usage'''
<syntaxhighlight lang="bash">
# Show I/O by process (run as root)
iotop -o
</syntaxhighlight>

Look for <code>voipmonitor</code> or <code>mysqld</code> dominating I/O. If voipmonitor shows high DISK WRITE and the system-wide <code>%util</code> is near 100%, the disk cannot keep up.

'''3c) ioping - Quick latency check'''
<syntaxhighlight lang="bash">
# Test latency on the VoIPmonitor spool directory
cd /var/spool/voipmonitor
ioping -c 20 .
</syntaxhighlight>

'''Expected results:'''
{| class="wikitable"
|-
! Storage Type !! Healthy Latency !! Problem Indicator
|-
| NVMe SSD || < 0.5 ms || > 2 ms
|-
| SATA SSD || < 1 ms || > 5 ms
|-
| HDD (7200 RPM) || < 10 ms || > 30 ms
|}
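For a single quick reading that combines device utilization and spool-directory latency, a sketch like this works once the tools above are installed (the spool path is the default; adjust if yours differs):

<syntaxhighlight lang="bash">
# One-shot reading: device utilization from iostat and spool-directory latency from ioping
iostat -xz 2 2 | tail -20
ioping -c 5 /var/spool/voipmonitor
</syntaxhighlight>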


==== Step 4: Linux CPU Diagnostics ====

'''4a) top - Overall CPU usage'''
<syntaxhighlight lang="bash">
# Press '1' to show per-core CPU
top
</syntaxhighlight>

Look for:
* An individual CPU core at 100% (the t0 thread is single-threaded)
* High <code>%wa</code> (I/O wait) vs high <code>%us/%sy</code> (CPU-bound)

'''4b) Verify voipmonitor threads'''
<syntaxhighlight lang="bash">
# Show voipmonitor threads with CPU usage
top -H -p $(pgrep voipmonitor)
</syntaxhighlight>

If one thread shows ~100% CPU while others are low, you have a CPU bottleneck on the capture thread (t0).

==== Step 5: Decision Matrix ====

{| class="wikitable"
|-
! Observation !! Likely Cause !! Go To
|-
| <code>heap</code> high, <code>t0CPU</code> > 80%, iostat <code>%util</code> low || '''CPU Bottleneck''' || [[#Solution: CPU Bottleneck|CPU Solution]]
|-
| <code>heap</code> high, <code>t0CPU</code> < 50%, iostat <code>%util</code> > 90% || '''I/O Bottleneck''' || [[#Solution: I/O Bottleneck|I/O Solution]]
|-
| <code>heap</code> high, <code>t0CPU</code> < 50%, iostat <code>%util</code> < 50%, <code>SQLq</code> growing || '''Database Bottleneck''' || [[#SQL Queue Overload|Database Solution]]
|-
| <code>heap</code> normal, <code>comp</code> maxed, <code>tacCPU</code> all ~100% || '''Compression Bottleneck''' (type of I/O) || [[#Solution: I/O Bottleneck|I/O Solution]]
|}

==== Step 6: Confirmation Test (Optional) ====

After identifying the likely cause with the tools above, you can confirm with a storage disable test:

<syntaxhighlight lang="ini">
# /etc/voipmonitor.conf - temporarily disable all storage
savesip = no
savertp = no
savertcp = no
savegraph = no
</syntaxhighlight>

<syntaxhighlight lang="bash">
systemctl restart voipmonitor
# Monitor for 5-10 minutes during peak traffic
journalctl -u voipmonitor -f | grep heap
</syntaxhighlight>

* If <code>heap</code> values drop to near zero → confirms '''I/O bottleneck'''
* If <code>heap</code> values remain high → confirms '''CPU bottleneck'''

{{Warning|Remember to re-enable storage after testing! This test causes call recordings to be lost.}}
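If you prefer to script the confirmation test, the following sketch backs up the configuration, disables storage, waits, and restores the original file. It assumes the four <code>save*</code> parameters already exist in <code>/etc/voipmonitor.conf</code>; review it before running on a production sensor, since recordings made during the test window are lost:

<syntaxhighlight lang="bash">
# Temporarily disable storage to confirm an I/O bottleneck, then restore the original config
cp /etc/voipmonitor.conf /etc/voipmonitor.conf.bak

sed -i -E 's/^(savesip|savertp|savertcp|savegraph)[[:space:]]*=.*/\1 = no/' /etc/voipmonitor.conf
systemctl restart voipmonitor

# Watch heap[] in the status line for ~10 minutes, then put the original config back
sleep 600
cp /etc/voipmonitor.conf.bak /etc/voipmonitor.conf
systemctl restart voipmonitor
</syntaxhighlight>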
 
=== Solution: I/O Bottleneck ===

{{Note|If you see <code>IO[...] DISK_SAT</code> or <code>WARN</code> in the syslog status line (v2026.01.3+), disk saturation is already confirmed. See [[Syslog_Status_Line#IO.5B....5D_-_Disk_I.2FO_Monitoring_.28v2026.01.3.2B.29|IO[] Metrics]] for details.}}

'''Quick confirmation (for older versions):'''

Temporarily save only RTP headers to reduce disk write load:
<syntaxhighlight lang="ini">
# /etc/voipmonitor.conf
savertp = header
</syntaxhighlight>

Restart the sniffer and monitor. If heap usage stabilizes and "MEMORY IS FULL" errors stop, the issue is confirmed to be storage I/O.

'''Check storage health before upgrading:'''
<syntaxhighlight lang="bash">
# Check drive health
smartctl -a /dev/sda

# Check for I/O errors in system logs
dmesg | grep -i "i/o error\|sd.*error\|ata.*error"
</syntaxhighlight>

Look for reallocated sectors, pending sectors, or I/O errors. Replace failing drives before considering upgrades.

'''Storage controller cache settings:'''
{| class="wikitable"
|-
! Storage Type !! Recommended Cache Mode
|-
| HDD / NAS || WriteBack (requires battery-backed cache)
|-
| SSD || WriteThrough (or WriteBack with power loss protection)
|}

Use vendor-specific tools to configure the cache policy (<code>megacli</code>, <code>ssacli</code>, <code>perccli</code>).

'''Storage upgrades (in order of effectiveness):'''
{| class="wikitable"
|-
! Solution !! IOPS Improvement !! Notes
|-
| '''NVMe SSD''' || 50-100x vs HDD || Best option, handles 10,000+ concurrent calls
|-
| '''SATA SSD''' || 20-50x vs HDD || Good option, handles 5,000+ concurrent calls
|-
| '''RAID 10 with BBU''' || 5-10x vs single disk || Enable WriteBack cache (requires battery backup)
|-
| '''Separate storage server''' || Variable || Use [[Sniffer_distributed_architecture|client/server mode]]
|}

'''Filesystem tuning (ext4):'''
<syntaxhighlight lang="bash">
# Check current mount options
mount | grep voipmonitor

# Recommended mount options for /var/spool/voipmonitor
# Add to /etc/fstab: noatime,data=writeback,barrier=0
# WARNING: barrier=0 requires battery-backed RAID
</syntaxhighlight>
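As a follow-up to the filesystem tuning above, you can apply <code>noatime</code> immediately without editing <code>/etc/fstab</code> first; this sketch assumes the spool directory is its own mount point and the change lasts only until the next remount or reboot:

<syntaxhighlight lang="bash">
# Remount the spool filesystem with noatime to cut metadata writes (temporary)
mount -o remount,noatime /var/spool/voipmonitor
</syntaxhighlight>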
'''Verify improvement:'''
<syntaxhighlight lang="bash">
# After changes, monitor iostat
iostat -xz 2 10
# %util should drop below 70%, await should decrease
</syntaxhighlight>


=== Solution: CPU Bottleneck ===

==== Identify CPU Bottleneck Using Manager Commands ====

VoIPmonitor provides manager commands to monitor thread CPU usage in real-time. This is essential for identifying which thread is saturated.

'''Connect to manager interface:'''
<syntaxhighlight lang="bash">
# Via Unix socket (local, recommended)
echo 'sniffer_threads' | nc -U /tmp/vm_manager_socket

# Via TCP port 5029 (remote or local)
echo 'sniffer_threads' | nc 127.0.0.1 5029

# Monitor continuously (every 2 seconds)
watch -n 2 "echo 'sniffer_threads' | nc -U /tmp/vm_manager_socket"
</syntaxhighlight>

{{Note|1=TCP port 5029 is encrypted by default. For unencrypted access, set <code>manager_enable_unencrypted = yes</code> in voipmonitor.conf (security risk on public networks).}}

'''Example output:'''
<syntaxhighlight lang="text">
t0 - binlog1 fifo pcap read          (  12345) :  78.5  FIFO  99    1234
t2 - binlog1 pb write                (  12346) :  12.3              456
rtp thread binlog1 binlog1 0        (  12347) :  8.1              234
rtp thread binlog1 binlog1 1        (  12348) :  6.2              198
t1 - binlog1 call processing        (  12349) :  4.5              567
tar binlog1 compression 0            (  12350) :  3.2                89
</syntaxhighlight>

'''Column interpretation:'''
{| class="wikitable"
|-
! Column !! Description
|-
| Thread name || Descriptive name (t0=capture, t1=call processing, t2=packet buffer write)
|-
| (TID) || Linux thread ID (useful for <code>top -H -p TID</code>)
|-
| CPU % || Current CPU usage percentage - '''key metric'''
|-
| Sched || Scheduler type (FIFO = real-time, empty = normal)
|-
| Priority || Thread priority
|-
| CS/s || Context switches per second
|}

'''Critical threads to watch:'''
{| class="wikitable"
|-
! Thread !! Role !! If at 90-100%
|-
| '''t0''' (pcap read) || Packet capture from NIC || '''Single-core limit reached!''' Cannot parallelize. Need DPDK/Napatech.
|-
| '''t2''' (pb write) || Packet buffer processing || Processing bottleneck. Check t2CPU breakdown.
|-
| '''rtp thread''' || RTP packet processing || Threads auto-scale. If still saturated, consider DPDK/Napatech.
|-
| '''tar compression''' || PCAP archiving || I/O bottleneck (compression waiting for disk)
|-
| '''mysql store''' || Database writes || Database bottleneck. Check SQLq metric.
|}

{{Warning|If the '''t0 thread is at 90-100%''', you have hit the fundamental single-core capture limit. The t0 thread reads packets from the kernel and '''cannot be parallelized'''. Disabling features like jitterbuffer will NOT help - those run on different threads. The only solutions are:
* '''Reduce captured traffic''' using <code>interface_ip_filter</code> or a BPF <code>filter</code>
* '''Use kernel bypass''' ([[DPDK]] or [[Napatech]]) which eliminates kernel overhead entirely}}

==== Interpreting t2CPU Detailed Breakdown ====

The syslog status line shows <code>t2CPU</code> with detailed sub-metrics:
<syntaxhighlight lang="text">
t2CPU[pb:10/ d:39/ s:24/ e:17/ c:6/ g:6/ r:7/ rm:24/ rh:16/ rd:19/]
</syntaxhighlight>

{| class="wikitable"
|-
! Code !! Function !! High Value Indicates
|-
| '''pb''' || Packet buffer output || Buffer management overhead
|-
| '''d''' || Dispatch || Structure creation bottleneck
|-
| '''s''' || SIP parsing || Complex/large SIP messages
|-
| '''e''' || Entity lookup || Call table lookup overhead
|-
| '''c''' || Call processing || Call state machine processing
|-
| '''g''' || Register processing || High REGISTER volume
|-
| '''r, rm, rh, rd''' || RTP processing stages || High RTP volume (threads auto-scale)
|}

'''Thread auto-scaling:''' VoIPmonitor automatically spawns additional threads when load increases:
* If '''d''' > 50% → SIP parsing thread ('''s''') starts
* If '''s''' > 50% → Entity lookup thread ('''e''') starts
* If '''e''' > 50% → Call/register/RTP threads start

==== Configuration for High Traffic (>10,000 calls/sec) ====

<syntaxhighlight lang="ini">
# /etc/voipmonitor.conf

# Increase buffer to handle processing spikes (value in MB)
# 10000 = 10 GB - can go higher (20000, 30000+) if RAM allows
# Larger buffer absorbs I/O and CPU spikes without packet loss
max_buffer_mem = 10000

# Use IP filter instead of BPF (more efficient)
interface_ip_filter = 10.0.0.0/8
interface_ip_filter = 192.168.0.0/16
# Comment out any 'filter' parameter
</syntaxhighlight>

==== CPU Optimizations ====

<syntaxhighlight lang="ini">
# /etc/voipmonitor.conf

# Reduce jitterbuffer calculations to save CPU (keeps MOS-F2 metric)
jitterbuffer_f1 = no
jitterbuffer_f2 = yes
jitterbuffer_adapt = no

# If MOS metrics are not needed at all, disable everything:
# jitterbuffer_f1 = no
# jitterbuffer_f2 = no
# jitterbuffer_adapt = no
</syntaxhighlight>

==== Kernel Bypass Solutions (Extreme Loads) ====

When the t0 thread hits 100% on a standard NIC, kernel bypass is the only solution:

{| class="wikitable"
|-
! Solution !! Type !! CPU Reduction !! Use Case
|-
| '''[[DPDK]]''' || Open-source || ~70% || Multi-gigabit on commodity hardware
|-
| '''[[Napatech]]''' || Hardware SmartNIC || >97% (< 3% at 10Gbit) || Extreme performance requirements
|}

==== Verify Improvement ====

<syntaxhighlight lang="bash">
# Monitor thread CPU after changes
watch -n 2 "echo 'sniffer_threads' | nc -U /tmp/vm_manager_socket | head -10"

# Or monitor syslog
journalctl -u voipmonitor -f
# t0CPU should drop, heap values should stay < 20%
</syntaxhighlight>

{{Note|1=After changes, monitor the syslog <code>heap[A&#124;B&#124;C]</code> values - they should stay below 20% during peak traffic. See [[Syslog_Status_Line]] for detailed metric explanations.}}
== Step 8: Missing CDRs for Calls with Large Packets ==
If VoIPmonitor is capturing some calls successfully but missing CDRs for specific calls (especially those with larger SIP packets, such as INVITEs with extensive SDP), there are three common causes to investigate.

=== Cause 1: snaplen Packet Truncation (VoIPmonitor Configuration) ===
The <code>snaplen</code> parameter in <code>voipmonitor.conf</code> limits how many bytes of each packet are captured. If a SIP packet exceeds <code>snaplen</code>, it is truncated and the sniffer may fail to parse the call correctly.

;1. Check your current snaplen setting:
<syntaxhighlight lang="bash">
grep snaplen /etc/voipmonitor.conf
</syntaxhighlight>
The default is 3200 bytes (6000 if SSL/HTTP is enabled).

;2. Test if packet truncation is the issue:
Use <code>tcpdump</code> with <code>-s0</code> (unlimited snap length) to capture full packets:
<syntaxhighlight lang="bash">
# Capture SIP traffic with full packet length
tcpdump -i eth0 -s0 -nn port 5060 -w /tmp/test_capture.pcap

# Analyze packet sizes with Wireshark or tshark
tshark -r /tmp/test_capture.pcap -T fields -e frame.len -Y "sip" | sort -n | tail -10
</syntaxhighlight>
If you see SIP packets larger than your <code>snaplen</code> value (e.g., 4000+ bytes), increase <code>snaplen</code> in <code>voipmonitor.conf</code>:
<syntaxhighlight lang="ini">
snaplen = 65535
</syntaxhighlight>
Then restart the sniffer: <code>systemctl restart voipmonitor</code>.

For more information on the <code>snaplen</code> parameter, see [[Sniffer_configuration#Network_Interface_.26_Sniffing|Sniffer Configuration]].

=== Cause 2: MTU Mismatch (Network Infrastructure) ===
If packets are being lost or fragmented due to MTU mismatches in the network path, VoIPmonitor may never receive the complete packets, regardless of <code>snaplen</code> settings.

;1. Diagnose MTU-related packet loss:
Capture traffic with tcpdump and analyze it in Wireshark:
<syntaxhighlight lang="bash">
# Capture traffic on the VoIPmonitor host
tcpdump -i eth0 -s0 host <pbx_ip_address> -w /tmp/mtu_test.pcap
</syntaxhighlight>
Open the pcap in Wireshark and look for:
* Reassembled PDUs marked as incomplete
* TCP retransmissions for the same packet
* ICMP "Fragmentation needed" messages (Type 3, Code 4)

;2. Verify packet completeness:
In Wireshark, examine large SIP INVITE packets. If the SIP headers or SDP appear cut off or incomplete, packets are likely being lost in transit due to MTU issues.

;3. Identify the MTU bottleneck:
The issue is typically a network device with a lower MTU than the end devices. Common locations:
* VPN concentrators
* Firewalls
* Routers with tunnel interfaces
* Cloud provider gateways (typically 1500 bytes vs. standard 9000 jumbo frames)

To locate the problematic device, trace the MTU along the network path from the PBX to the VoIPmonitor sensor.

;4. Resolution options:
* Increase the MTU on the bottleneck device to match the rest of the network (e.g., from 1500 to 9000 for jumbo frame environments)
* Enable Path MTU Discovery (PMTUD) on intermediate devices
* Ensure your switching infrastructure supports jumbo frames end-to-end if you are using them

=== Cause 3: External Source Packet Truncation (Traffic Mirroring/LBS Modules) ===
If packets are truncated or corrupted BEFORE they reach VoIPmonitor, changing <code>snaplen</code> will NOT fix the issue. This scenario occurs when using external SIP sources that have their own packet size limitations.

; Symptoms to identify this scenario:
* Large SIP packets (e.g., a WebRTC INVITE with big Authorization headers ~4k) appear truncated
* Packets show as corrupted or malformatted in the VoIPmonitor GUI
* Changing <code>snaplen</code> in <code>voipmonitor.conf</code> has no effect
* Using TCP instead of UDP in the external system does not resolve the issue

; Common external sources that may truncate packets:
# Kamailio <code>siptrace</code> module
# FreeSWITCH <code>sip_trace</code> module
# OpenSIPS tracing modules
# Custom HEP/HOMER agent implementations
# Load balancers or proxy servers with traffic mirroring

; Diagnose external source truncation:
Use <code>tcpdump</code> with <code>-s0</code> (unlimited snap length) on the VoIPmonitor sensor to compare packet sizes:
<syntaxhighlight lang="bash">
# Capture traffic received by VoIPmonitor
sudo tcpdump -i eth0 -s0 -nn port 5060 -w /tmp/voipmonitor_input.pcap

# Analyze actual packet sizes received
tshark -r /tmp/voipmonitor_input.pcap -T fields -e frame.len -Y "sip.Method == INVITE" | sort -n | tail -10
</syntaxhighlight>

If:
* You see packets with truncated SIP headers or incomplete SDP
* The packet length is much smaller than expected (e.g., 1500 bytes instead of 4000+ bytes)
* Truncation is consistent across all calls

Then the external source is truncating packets before they reach VoIPmonitor.

; Solutions for Kamailio siptrace truncation:
If you are using Kamailio's <code>siptrace</code> module with traffic mirroring:

1. Configure Kamailio to use TCP transport for siptrace (may help in some cases):
<pre>
# In kamailio.cfg
modparam("siptrace", "duplicate_uri", "sip:voipmonitor_ip:port;transport=tcp")
</pre>

2. If Kamailio reports "Connection refused", VoIPmonitor does not open a TCP listener by default. Manually open one:
<syntaxhighlight lang="bash">
# Open a TCP listener using socat; the payload can be discarded because the
# sniffer captures the packets directly from the interface
socat TCP-LISTEN:5888,fork,reuseaddr /dev/null &
</syntaxhighlight>
Then update kamailio.cfg to use the specified port instead of the standard SIP port.

3. Use the HAProxy traffic 'tee' function (recommended):
If your architecture includes HAProxy in front of Kamailio, use its traffic mirroring to send a copy of the WebSocket traffic directly to VoIPmonitor's standard SIP listening port. This bypasses the siptrace module entirely and preserves the original packets:

<pre>
# In haproxy.cfg, within your frontend/backend configuration
# Send a copy of traffic to VoIPmonitor
option splice-response
tcp-request inspect-delay 5s
tcp-request content accept if { req_ssl_hello_type 1 }
use-server voipmonitor if { req_ssl_hello_type 1 }
listen voipmonitor_mirror
    bind :5888
    mode tcp
    server voipmonitor <voipmonitor_sensor_ip>:5060 send-proxy
</pre>

Note: The exact HAProxy configuration depends on your architecture and whether you are mirroring TCP (WebSocket) or UDP traffic.

; Solutions for other external sources:
# Check the external system's documentation for packet size limits or truncation settings
# Consider using standard network mirroring (SPAN/ERSPAN/GRE) instead of SIP tracing modules
# Ensure the external system captures full packet lengths (disable any internal packet size caps)
# Verify that the external system does not reassemble or modify SIP packets before forwarding


== Storage Hardware Failure ==

'''Symptom''': Sensor shows disconnected (red X) with "DROPPED PACKETS" at low traffic volumes.

'''Diagnosis''':
<syntaxhighlight lang="bash">
# Check disk health
smartctl -a /dev/sda

# Check RAID status (if applicable)
cat /proc/mdstat
mdadm --detail /dev/md0
</syntaxhighlight>

Look for reallocated sectors, pending sectors, or a degraded RAID state. Replace the failing disk.

== Step 9: Probe Timeout Due to Virtualization Timing Issues ==

If remote probes are intermittently disconnecting from the central server with timeout errors, even on a high-performance network with low load, the issue may be related to virtualization host timing problems rather than network connectivity.

=== Diagnosis: Check System Log Timing Intervals ===

The VoIPmonitor sensor generates status log messages approximately every 10 seconds during normal operation. If the timing system on the probe is inconsistent, the interval between these status messages can exceed 30 seconds, triggering a connection timeout.

;1. Monitor the system log on the affected probe:
<syntaxhighlight lang="bash">
# Monitor for regular status messages
tail -f /var/log/syslog | grep voipmonitor
</syntaxhighlight>

;2. Examine the timestamps of voipmonitor status messages:
Look for repeating log entries that should appear approximately every 10 seconds during normal operation.

;3. Identify timing irregularities:
Calculate the time interval between successive status log entries. '''If the interval exceeds 30 seconds''', this indicates a timing system problem that will cause connection timeouts with the central server.

=== Root Cause: Virtualization Host RDTSC Issues ===

This problem is '''not''' network-related. It is a host-level timing issue that impacts the application's internal timers.

The issue typically occurs on virtualized probes where the host's CPU timekeeping is inconsistent. Specifically, problems with the RDTSC (Read Time-Stamp Counter) CPU instruction on the virtualization host can cause:
* Irregular system clock behavior on the guest VM
* Application timers that do not fire consistently
* Sporadic timeouts in client-server connections

=== Resolution ===

;1. Investigate the virtualization host configuration:
Check the host's hypervisor or virtualization platform documentation for known timekeeping issues related to RDTSC.

Common virtualization platforms with known timing considerations:
* KVM/QEMU: Check CPU passthrough and TSC mode settings
* VMware: Verify time synchronization between guest and host
* Hyper-V: Review Integration Services time sync configuration
* Xen: Check TSC emulation settings

;2. Apply host-level fixes:
These are host-level fixes, not changes to the guest VM configuration. Consult your virtualization platform's documentation for specific steps to address RDTSC timing issues.

Typical solutions include:
* Enabling appropriate TSC modes on the host
* Configuring CPU feature passthrough correctly
* Adjusting hypervisor timekeeping parameters

;3. Verify the fix:
After applying the host-level configuration changes, monitor the probe's status logs again to confirm that the timing intervals are consistently around 10 seconds and never exceed 30 seconds.
<syntaxhighlight lang="bash">
# Monitor for regular status messages
tail -f /var/log/syslog | grep voipmonitor
</syntaxhighlight>
Once the timing is corrected, probe connections to the central server should remain stable without intermittent timeouts.
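To measure the gaps between status messages directly, a small sketch like the following prints the interval in seconds between consecutive voipmonitor lines. It assumes the traditional three-field syslog timestamp format and GNU <code>date</code>; adjust the parsing if your system logs ISO timestamps:

<syntaxhighlight lang="bash">
# Seconds between consecutive voipmonitor status lines (values well above 10 indicate timing problems)
grep voipmonitor /var/log/syslog | tail -50 | \
  awk '{
    cmd = "date -d \"" $1 " " $2 " " $3 "\" +%s"; cmd | getline t; close(cmd);
    if (prev) print t - prev " s";
    prev = t;
  }'
</syntaxhighlight>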
== OOM (Out of Memory) ==

If VoIPmonitor suddenly stops processing CDRs and a service restart temporarily restores functionality, the system may be experiencing OOM (Out of Memory) killer events. The Linux OOM killer terminates processes when available RAM is exhausted, and MySQL (<code>mysqld</code>) is a common target due to its memory-intensive nature.

=== Identify OOM Victim ===
<syntaxhighlight lang="bash">
# Check for OOM kills
dmesg | grep -i "out of memory\|oom\|killed process"
journalctl --since "1 hour ago" | grep -i oom

# Distribution-specific kernel logs
grep -i "out of memory\|killed process" /var/log/syslog | tail -20     # Debian/Ubuntu
grep -i "out of memory\|killed process" /var/log/messages | tail -20   # CentOS/RHEL/AlmaLinux
</syntaxhighlight>

Typical OOM killer messages look like:
<pre>
Out of memory: Kill process 1234 (mysqld) score 123 or sacrifice child
Killed process 1234 (mysqld) total-vm: 12345678kB, anon-rss: 1234567kB
</pre>

=== Monitor Current Memory Usage ===
<syntaxhighlight lang="bash">
# Check available memory (look for low 'available' or 'free' values)
free -h

# Check per-process memory usage (sorted by RSS)
ps aux --sort=-%mem | head -15

# Check MySQL memory usage in bytes
cat /proc/$(pgrep mysqld)/status | grep -E "VmSize|VmRSS"
</syntaxhighlight>

Warning signs:
* Available memory consistently below 500MB during operation
* MySQL consuming most of the available RAM
* Swap usage near 100% (if swap is enabled)
* Frequent process restarts without clear error messages

=== MySQL Killed by OOM ===
Reduce the InnoDB buffer pool:
<syntaxhighlight lang="ini">
# /etc/mysql/my.cnf
innodb_buffer_pool_size = 2G  # Reduce from default
</syntaxhighlight>

For servers limited to '''16GB RAM''' or when experiencing repeated MySQL OOM kills:
<syntaxhighlight lang="ini">
# /etc/my.cnf or /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
# On a 16GB server: 6GB buffer pool + 6GB MySQL overhead = 12GB total
# Leaves 4GB for OS + GUI, preventing OOM
innodb_buffer_pool_size = 6G

# Enable write buffering (may lose up to 1s of data on crash but reduces memory pressure)
innodb_flush_log_at_trx_commit = 2
</syntaxhighlight>

Restart MySQL after changes:
<syntaxhighlight lang="bash">
systemctl restart mysql
# or
systemctl restart mariadb
</syntaxhighlight>

=== Voipmonitor Killed by OOM ===
Reduce buffer sizes in voipmonitor.conf:
<syntaxhighlight lang="ini">
max_buffer_mem = 2000  # Reduce from default
ringbuffer = 50        # Reduce from default
</syntaxhighlight>

=== Runaway External Process ===
<syntaxhighlight lang="bash">
# Find memory-hungry processes
ps aux --sort=-%mem | head -20

# Kill orphaned/runaway process
kill -9 <PID>
</syntaxhighlight>

=== Long-Term Fix: Increase Physical Memory ===
The definitive solution for OOM-related CDR processing issues is to upgrade the server's physical RAM. After upgrading:
* Verify memory improvements with <code>free -h</code>
* Monitor for several days to ensure OOM events stop
* Consider tuning <code>innodb_buffer_pool_size</code> in your MySQL configuration to use the additional memory effectively

Additional mitigation strategies (while planning the RAM upgrade):
* Reduce MySQL's memory footprint by lowering <code>innodb_buffer_pool_size</code> (e.g., from 16GB to 8GB)
* Disable or limit non-essential VoIPmonitor features (e.g., packet capture storage, RTP analysis)
* Ensure swap space is properly configured as a safety buffer (though swap is much slower than RAM)
* Use <code>sysctl vm.swappiness=10</code> to favor RAM over swap when some memory is still available

=== SQL Queue Growth from Non-Call Data ===
If <code>sip-register</code>, <code>sip-options</code>, or <code>sip-subscribe</code> are enabled, non-call SIP messages (OPTIONS, REGISTER, SUBSCRIBE, NOTIFY) can accumulate in the database and cause the SQL queue to grow unbounded. This increases MySQL memory usage and leads to OOM kills of mysqld.

{{Warning|1=Even with a reduced <code>innodb_buffer_pool_size</code>, the SQL queue will grow indefinitely without cleanup of non-call data.}}

'''Solution: Enable automatic cleanup of old non-call data'''
<syntaxhighlight lang="ini">
# /etc/voipmonitor.conf
# cleandatabase=2555 automatically deletes partitions older than 7 years
# Covers: CDR, register_state, register_failed, and sip_msg (OPTIONS/SUBSCRIBE/NOTIFY)
cleandatabase = 2555
</syntaxhighlight>

Restart the sniffer after changes:
<syntaxhighlight lang="bash">
systemctl restart voipmonitor
</syntaxhighlight>

{{Note|See [[Data_Cleaning]] for detailed configuration options and other <code>cleandatabase_*</code> parameters.}}
== Service Startup Failures ==

=== Interface No Longer Exists ===

After an OS upgrade, interface names may change (eth0 → ensXXX):

<syntaxhighlight lang="bash">
# Find current interface names
ip a

# Update all config locations
grep -r "interface" /etc/voipmonitor.conf /etc/voipmonitor.conf.d/

# Also check GUI: Settings → Sensors → Configuration
</syntaxhighlight>

=== Missing Dependencies ===

<syntaxhighlight lang="bash">
# Install common missing package
apt install libpcap0.8  # Debian/Ubuntu
yum install libpcap     # RHEL/CentOS
</syntaxhighlight>

== Network Interface Issues ==

=== Promiscuous Mode ===

Required for SPAN port monitoring:
<syntaxhighlight lang="bash">
# Enable
ip link set eth0 promisc on

# Verify
ip link show eth0 | grep PROMISC
</syntaxhighlight>

{{Note|Promiscuous mode is NOT required for ERSPAN/GRE tunnels where traffic is addressed to the sensor.}}

=== Interface Drops ===

<syntaxhighlight lang="bash">
# Check for drops
ip -s link show eth0 | grep -i drop

# If drops are present, increase the ring buffer
ethtool -G eth0 rx 4096
</syntaxhighlight>

=== Bonded/EtherChannel Interfaces ===

'''Symptom''': False packet loss when monitoring bond0 or br0.

'''Solution''': Monitor the physical interfaces, not the logical one:
<syntaxhighlight lang="ini">
# voipmonitor.conf - use physical interfaces
interface = eth0,eth1
</syntaxhighlight>

=== Network Offloading Issues ===

'''Symptom''': Kernel errors like <code>bad gso: type: 1, size: 1448</code>

<syntaxhighlight lang="bash">
# Disable offloading on the capture interface
ethtool -K eth0 gso off tso off gro off lro off
</syntaxhighlight>
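To see whether drops are still accumulating after a change, a quick sketch can compare the kernel's RX drop counter over a short window (assumes the capture interface is <code>eth0</code>):

<syntaxhighlight lang="bash">
# Report RX drops accumulated over a 10-second window
A=$(cat /sys/class/net/eth0/statistics/rx_dropped)
sleep 10
B=$(cat /sys/class/net/eth0/statistics/rx_dropped)
echo "RX drops in the last 10s: $((B - A))"
</syntaxhighlight>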
 
== Packet Ordering Issues ==

If SIP messages appear out of sequence:

'''First''': Rule out a Wireshark display artifact - disable "Analyze TCP sequence numbers" in Wireshark. See [[FAQ]].

'''If genuine reordering''': Usually caused by packet bursts in the network infrastructure. Use tcpdump to verify that packets arrive out of order at the interface. Work with your network admin to implement QoS or traffic shaping. For persistent issues, consider a dedicated capture card with hardware timestamping (see [[Napatech]]).

{{Note|For out-of-order packets in '''client/server mode''' (multiple sniffers), see [[Sniffer_distributed_architecture]] for <code>pcap_queue_dequeu_window_length</code> configuration.}}

=== Solutions for SPAN/Mirroring Reordering ===

If packets arrive out of order at the SPAN/mirror port (e.g., 302 responses before INVITE causing "000 no response" errors):

1. '''Configure the switch to preserve packet order''': Many switches allow configuring SPAN/mirror ports to maintain packet ordering. Consult your switch documentation for packet ordering guarantees in the mirroring configuration.

2. '''Replace SPAN with a TAP or packet broker''': Unlike software-based SPAN mirroring, hardware TAPs and packet brokers guarantee packet order. Consider upgrading to a dedicated TAP or packet broker device for mission-critical monitoring.

== Database Issues ==

=== SQL Queue Overload ===

'''Symptom''': Growing <code>SQLq</code> metric, potential coredumps.

<syntaxhighlight lang="ini">
# voipmonitor.conf - reduce SQL queue pressure
mysqlstore_concat_limit_cdr = 1000
cdr_check_exists_callid = 0
</syntaxhighlight>

=== Error 1062 - Lookup Table Limit ===

'''Symptom''': <code>Duplicate entry '16777215' for key 'PRIMARY'</code>

'''Quick fix''':
<syntaxhighlight lang="ini">
# voipmonitor.conf
cdr_reason_string_enable = no
</syntaxhighlight>

See [[Database_troubleshooting#Database_Error_1062_-_Lookup_Table_Auto-Increment_Limit|Database Troubleshooting]] for the complete solution.
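A growing <code>SQLq</code> value is easiest to spot by extracting just that field from the status line over time. A minimal sketch, assuming the sensor logs to <code>/var/log/syslog</code>:

<syntaxhighlight lang="bash">
# Watch the SQLq values reported by the sensor (steadily growing numbers = database bottleneck)
tail -f /var/log/syslog | grep --line-buffered voipmonitor | grep --line-buffered -o "SQLq\[[^]]*\]"
</syntaxhighlight>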
 
== Bad Packet Errors ==

'''Symptom''': <code>bad packet with ether_type 0xFFFF detected on interface</code>

'''Diagnosis''':
<syntaxhighlight lang="bash">
# Run diagnostic (let it run 30-60 seconds, then kill it)
voipmonitor --check_bad_ether_type=eth0

# Find and kill the diagnostic process
ps ax | grep voipmonitor
kill -9 <PID>
</syntaxhighlight>

Causes: corrupted packets, driver issues, VLAN tagging problems. Check <code>ethtool -S eth0</code> for interface errors.
 
== Useful Diagnostic Commands ==

=== tshark Filters for SIP ===

<syntaxhighlight lang="bash">
# All SIP INVITEs
tshark -r capture.pcap -Y "sip.Method == INVITE"

# Find specific phone number
tshark -r capture.pcap -Y 'sip contains "5551234567"'

# Get Call-IDs
tshark -r capture.pcap -Y "sip.Method == INVITE" -T fields -e sip.Call-ID

# SIP errors (4xx, 5xx)
tshark -r capture.pcap -Y "sip.Status-Code >= 400"
</syntaxhighlight>

=== Interface Statistics ===
<syntaxhighlight lang="bash">
# Detailed NIC stats
ethtool -S eth0

# Watch packet rates
watch -n 1 'cat /proc/net/dev | grep eth0'
</syntaxhighlight>

== Appendix: tshark Display Filter Syntax for SIP ==
When using <code>tshark</code> to analyze SIP traffic, it is important to use the '''correct Wireshark display filter syntax'''. Below are common filter examples.

=== Basic SIP Filters ===
<syntaxhighlight lang="bash">
# Show all SIP INVITE messages
tshark -r capture.pcap -Y "sip.Method == INVITE"

# Show all SIP messages (any method)
tshark -r capture.pcap -Y "sip"

# Show SIP and RTP traffic
tshark -r capture.pcap -Y "sip || rtp"
</syntaxhighlight>

=== Search for Specific Phone Number or Text ===
<syntaxhighlight lang="bash">
# Find calls containing a specific phone number (e.g., 5551234567)
tshark -r capture.pcap -Y 'sip contains "5551234567"'

# Find INVITE messages for a specific number
tshark -r capture.pcap -Y 'sip.Method == INVITE && sip contains "5551234567"'
</syntaxhighlight>

=== Extract Call-ID from Matching Calls ===
<syntaxhighlight lang="bash">
# Get Call-ID for calls matching a phone number
tshark -r capture.pcap -Y 'sip.Method == INVITE && sip contains "5551234567"' -T fields -e sip.Call-ID

# Get Call-ID along with From and To headers
tshark -r capture.pcap -Y 'sip.Method == INVITE' -T fields -e sip.Call-ID -e sip.from.user -e sip.to.user
</syntaxhighlight>

=== Filter by IP Address ===
<syntaxhighlight lang="bash">
# SIP traffic from a specific source IP
tshark -r capture.pcap -Y "sip && ip.src == 192.168.1.100"

# SIP traffic between two hosts
tshark -r capture.pcap -Y "sip && ip.addr == 192.168.1.100 && ip.addr == 10.0.0.50"
</syntaxhighlight>

=== Filter by SIP Response Code ===
<syntaxhighlight lang="bash">
# Show all 200 OK responses
tshark -r capture.pcap -Y "sip.Status-Code == 200"

# Show all 4xx and 5xx error responses
tshark -r capture.pcap -Y "sip.Status-Code >= 400"

# Show 486 Busy Here responses
tshark -r capture.pcap -Y "sip.Status-Code == 486"
</syntaxhighlight>

=== Important Syntax Notes ===
* '''Field names are case-sensitive:''' Use <code>sip.Method</code>, <code>sip.Call-ID</code>, <code>sip.Status-Code</code> (not <code>sip.method</code> or <code>sip.call-id</code>)
* '''String matching uses <code>contains</code>:''' Use <code>sip contains "text"</code> (not <code>sip.contains()</code>)
* '''Use double quotes for strings:''' <code>sip contains "number"</code> (not single quotes)
* '''Boolean operators:''' Use <code>&&</code> (and), <code>||</code> (or), <code>!</code> (not)

For a complete reference, see the [https://www.wireshark.org/docs/dfref/s/sip.html Wireshark SIP Display Filter Reference].

== See Also ==
* [[Sniffer_configuration]] - Complete configuration reference for voipmonitor.conf
* [[Sniffer_distributed_architecture]] - Client/server deployment and troubleshooting
* [[Capture_rules]] - GUI-based selective recording configuration
* [[Sniffing_modes]] - Traffic forwarding methods (SPAN, ERSPAN, GRE, TZSP)
* [[Scaling]] - Performance tuning and optimization
* [[Database_troubleshooting]] - Database issues
* [[FAQ]] - Common questions and Wireshark display issues



Latest revision as of 19:08, 22 January 2026

Sniffer Troubleshooting

This page covers common VoIPmonitor sniffer/sensor problems organized by symptom. For configuration reference, see Sniffer_configuration. For performance tuning, see Scaling.

Critical First Step: Is Traffic Reaching the Interface?

⚠️ Warning: Before any sensor tuning, verify packets are reaching the network interface. If packets aren't there, no amount of sensor configuration will help.

# Check for SIP traffic on the capture interface
tcpdump -i eth0 -nn "host <PROBLEMATIC_IP> and port 5060" -c 10

# If no packets: Network/SPAN issue - contact network admin
# If packets visible: Proceed with sensor troubleshooting below

Quick Diagnostic Checklist

Check Command Expected Result
Service running systemctl status voipmonitor Active (running)
Traffic on interface tshark -i eth0 -c 5 -Y "sip" SIP packets displayed
Interface errors ip -s link show eth0 No RX errors/drops
Promiscuous mode ip link show eth0 PROMISC flag present
Logs grep voip No critical errors
GUI rules Settings → Capture Rules No unexpected "Skip" rules

No Calls Being Recorded

Service Not Running

# Check status
systemctl status voipmonitor

# View recent logs
journalctl -u voipmonitor --since "10 minutes ago"

# Start/restart
systemctl restart voipmonitor

Common startup failures:

  • Interface not found: Check interface in voipmonitor.conf matches ip a output
  • Port already in use: Another process using the management port
  • License issue: Check License for activation problems

Wrong Interface or Port Configuration

# Check current config
grep -E "^interface|^sipport" /etc/voipmonitor.conf

# Example correct config:
# interface = eth0
# sipport = 5060

💡 Tip: If SIP signaling runs on non-standard or multiple ports, list every port in use, for example sipport = 5060,5080.

GUI Capture Rules Blocking

Navigate to Settings → Capture Rules and check for rules with action "Skip" that may be blocking calls. Rules are processed in order - a Skip rule early in the list will block matching calls.

See Capture_rules for detailed configuration.

SPAN/Mirror Not Configured

If tcpdump shows no traffic:

  1. Verify switch SPAN/mirror port configuration
  2. Check that both directions (ingress + egress) are mirrored
  3. Confirm VLAN tagging is preserved if needed
  4. Test physical connectivity (cable, port status)

See Sniffing_modes for SPAN, RSPAN, and ERSPAN configuration.

Filter Parameter Too Restrictive

If filter is set in voipmonitor.conf, it may exclude traffic:

# Check filter
grep "^filter" /etc/voipmonitor.conf

# Temporarily disable to test
# Comment out the filter line and restart
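
If a filter is needed at all, make sure it covers both signaling and media. A minimal sketch (the ports and RTP range are assumptions - adjust to your deployment):

# /etc/voipmonitor.conf - BPF filter covering SIP plus a typical RTP range
filter = udp port 5060 or udp port 5080 or udp portrange 10000-20000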


Missing id_sensor Parameter

Symptom: SIP packets visible in Capture/PCAP section but missing from CDR, SIP messages, and Call flow.

Cause: The id_sensor parameter is not configured or is missing. This parameter is required to associate captured packets with the CDR database.

Solution:

# Check if id_sensor is set
grep "^id_sensor" /etc/voipmonitor.conf

# Add or correct the parameter
echo "id_sensor = 1" >> /etc/voipmonitor.conf

# Restart the service
systemctl restart voipmonitor

💡 Tip: Use a unique numeric identifier (1-65535) for each sensor. Essential for multi-sensor deployments. See id_sensor documentation.

Missing Audio / RTP Issues

One-Way Audio (Asymmetric Mirroring)

Symptom: SIP recorded but only one RTP direction captured.

Cause: SPAN port configured for only one direction.

Diagnosis:

# Count RTP packets per direction
tshark -i eth0 -Y "rtp" -T fields -e ip.src -e ip.dst | sort | uniq -c

If one direction shows 0 or very few packets, configure the switch to mirror both ingress and egress traffic.

RTP Not Associated with Call

Symptom: Audio plays in sniffer but not in GUI, or RTP listed under wrong call.

Possible causes:

1. SIP and RTP on different interfaces/VLANs:

# voipmonitor.conf - enable automatic RTP association
auto_enable_use_blocks = yes

2. NAT not configured:

# voipmonitor.conf - for NAT scenarios
natalias = <public_ip> <private_ip>

# If not working, try reversed order:
natalias = <private_ip> <public_ip>

3. External device modifying media ports:

If SDP advertises one port but RTP arrives on different port (SBC/media server issue):

# Compare SDP ports vs actual RTP
tshark -r call.pcap -Y "sip.Method == INVITE" -V | grep "m=audio"
tshark -r call.pcap -Y "rtp" -T fields -e udp.dstport | sort -u

If ports don't match, the external device must be configured to preserve SDP ports - VoIPmonitor cannot compensate.

RTP Incorrectly Associated with Wrong Call (PBX Port Reuse)

Symptom: RTP streams from one call appear associated with a different CDR when your PBX aggressively reuses the same IP:port across multiple calls.

Cause: When PBX reuses media ports, VoIPmonitor may incorrectly correlate RTP packets to the wrong call based on weaker correlation methods.

Solution: Enable rtp_check_both_sides_by_sdp to require verification of both source and destination IP:port against SDP:

# voipmonitor.conf - require both source and destination to match SDP
rtp_check_both_sides_by_sdp = yes

# Alternative (strict) mode - allows initial unverified packets
rtp_check_both_sides_by_sdp = strict

⚠️ Warning: Enabling this may prevent RTP association for calls using NAT, as the source IP:port will not match the SDP. Use natalias mappings or the strict setting to mitigate this.

Snaplen Truncation

Symptom: Large SIP messages truncated, incomplete headers.

Solution:

# voipmonitor.conf - increase packet capture size
snaplen = 8192

For Kamailio siptrace, also check trace_msg_fragment_size in Kamailio config. See snaplen documentation.
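
To confirm where the truncation happens, capture with tcpdump with no snap length limit and compare packet sizes: if the packets are already short on the wire, the source (e.g., a siptrace/HEP agent) is truncating them and changing snaplen on the sensor will not help. A minimal sketch (interface, port, and file path are assumptions):

# Capture full frames with no local truncation
tcpdump -i eth0 -s0 -nn -w /tmp/sip_full.pcap port 5060

# Inspect INVITE frame sizes and compare against the expected message size
tshark -r /tmp/sip_full.pcap -Y "sip.Method == INVITE" -T fields -e frame.len -e sip.Call-ID | sort -nr | head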

PACKETBUFFER Saturation

Symptom: Log shows PACKETBUFFER: memory is FULL, truncated RTP recordings.

⚠️ Warning: This alert refers to VoIPmonitor's internal packet buffer (max_buffer_mem), NOT system RAM. High system memory availability does not prevent this error. The root cause is always a downstream bottleneck (disk I/O or CPU) preventing packets from being processed fast enough.

Before testing solutions, gather diagnostic data:

  • Check sensor logs: /var/log/syslog (Debian/Ubuntu) or /var/log/messages (RHEL/CentOS)
  • Generate debug log via GUI: Tools → Generate debug log

Diagnose: I/O vs CPU Bottleneck

⚠️ Warning: Do not guess the bottleneck source. Use proper diagnostics first to identify whether the issue is disk I/O, CPU, or database-related. Disabling storage as a test is valid but should be used to confirm findings, not as the primary diagnostic method.

Step 1: Check IO[] Metrics (v2026.01.3+)

Starting with version 2026.01.3, VoIPmonitor includes built-in disk I/O monitoring that directly shows disk saturation status:

[283.4/283.4Mb/s] IO[B1.1|L0.7|U45|C75|W125|R10|WI1.2k|RI0.5k]

Quick interpretation:

Metric Meaning Problem Indicator
C (Capacity) % of disk's sustainable throughput used C ≥ 80% = Warning, C ≥ 95% = Saturated
L (Latency) Current write latency in ms L ≥ 3× B (baseline) = Saturated
U (Utilization) % time disk is busy U > 90% = Disk at limit

If you see DISK_SAT or WARN after IO[]:

IO[B1.1|L8.5|U98|C97|W890|R5|WI12.5k|RI0.1k] DISK_SAT

→ This confirms I/O bottleneck. Skip to I/O Bottleneck Solutions.

For older versions or additional confirmation, continue with the steps below.

ℹ️ Note: See Syslog Status Line - IO[] section for detailed field descriptions.
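
To watch only the I/O summary without the rest of the status line, a simple grep is enough (the path assumes syslog-based logging; use journalctl on journal-only hosts):

# Show only the IO[] block and any DISK_SAT / WARN flag as it is logged
tail -f /var/log/syslog | grep -Eo 'IO\[[^]]*\]( DISK_SAT| WARN)?'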

Step 2: Read the Full Syslog Status Line

VoIPmonitor outputs a status line every 10 seconds. This is your first diagnostic tool:

# Monitor in real-time
journalctl -u voipmonitor -f
# or
tail -f /var/log/syslog | grep voipmonitor

Example status line:

calls[424] PS[C:4 S:41 R:13540] SQLq[C:0 M:0] heap[45|30|20] comp[48] [25.6Mb/s] t0CPU[85%] t1CPU[12%] t2CPU[8%] tacCPU[8|8|7|7%] RSS/VSZ[365|1640]MB

Key metrics for bottleneck identification:

  • heap[A|B|C] - Buffer fill % (primary / secondary / processing). I/O bottleneck sign: high A with low t0CPU. CPU bottleneck sign: high A with high t0CPU.
  • t0CPU[X%] - Packet capture thread (single-core, cannot parallelize). I/O bottleneck sign: low (<50%). CPU bottleneck sign: high (>80%).
  • comp[X] - Active compression threads. I/O bottleneck sign: very high (maxed out). CPU bottleneck sign: normal.
  • SQLq[C:X M:Y] - Pending SQL queries. Growing queue = database bottleneck; stable otherwise.
  • tacCPU[...] - TAR compression threads. All near 100% = compression bottleneck (a form of I/O); normal otherwise.

Interpretation: combine these metrics with the Linux diagnostics below and the decision matrix in Step 5.

Step 3: Linux I/O Diagnostics

Use these standard Linux tools to confirm I/O bottleneck:

Install required tools:

# Debian/Ubuntu
apt install sysstat iotop ioping

# CentOS/RHEL
yum install sysstat iotop ioping

3a) iostat - Disk utilization and wait times

# Run for 10 intervals of 2 seconds
iostat -xz 2 10

Key output columns:

Device   r/s     w/s   rkB/s   wkB/s  await  %util
sda     12.50  245.30  50.00  1962.40  45.23  98.50
  • %util - Device utilization percentage. > 90% = disk saturated.
  • await - Average I/O wait time (ms). > 20 ms for SSD or > 50 ms for HDD = high latency.
  • w/s - Writes per second. Compare with the disk's rated IOPS.

3b) iotop - Per-process I/O usage

# Show I/O by process (run as root)
iotop -o

Look for voipmonitor or mysqld dominating I/O. If voipmonitor shows high DISK WRITE but system %util is 100%, disk cannot keep up.

2c) ioping - Quick latency check

# Test latency on VoIPmonitor spool directory
cd /var/spool/voipmonitor
ioping -c 20 .

Expected results:

Storage Type Healthy Latency Problem Indicator
NVMe SSD < 0.5 ms > 2 ms
SATA SSD < 1 ms > 5 ms
HDD (7200 RPM) < 10 ms > 30 ms
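
If a spot check is not enough, fio can generate a sustained write load similar to pcap spooling. This is only a sketch - block size, test size, and runtime are arbitrary assumptions, and it writes real data to the spool volume:

# Sustained random-write test on the spool volume (requires the fio package)
fio --name=spooltest --directory=/var/spool/voipmonitor --rw=randwrite --bs=64k --size=512M --runtime=30 --time_based --numjobs=1 --group_reporting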

Step 4: Linux CPU Diagnostics

4a) top - Overall CPU usage

# Press '1' to show per-core CPU
top

Look for:

  • Individual CPU core at 100% (t0 thread is single-threaded)
  • High %wa (I/O wait) vs high %us/%sy (CPU-bound)

4b) Verify voipmonitor threads

# Show voipmonitor threads with CPU usage
top -H -p $(pgrep voipmonitor)

If one thread shows ~100% CPU while others are low, you have a CPU bottleneck on the capture thread (t0).
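
pidstat from the sysstat package gives the same per-thread view over time, which helps spot a thread that only saturates during traffic bursts (sampling interval and count are arbitrary):

# Per-thread CPU usage, 5 samples at 2-second intervals
pidstat -t -p $(pgrep voipmonitor) 2 5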

Step 5: Decision Matrix

  • heap high, t0CPU > 80%, iostat %util low → CPU Bottleneck (see CPU Solution)
  • heap high, t0CPU < 50%, iostat %util > 90% → I/O Bottleneck (see I/O Solution)
  • heap high, t0CPU < 50%, iostat %util < 50%, SQLq growing → Database Bottleneck (see Database Solution)
  • heap normal, comp maxed, tacCPU all ~100% → Compression Bottleneck, a type of I/O (see I/O Solution)

Step 6: Confirmation Test (Optional)

After identifying the likely cause with the tools above, you can confirm with a storage disable test:

# /etc/voipmonitor.conf - temporarily disable all storage
savesip = no
savertp = no
savertcp = no
savegraph = no
systemctl restart voipmonitor
# Monitor for 5-10 minutes during peak traffic
journalctl -u voipmonitor -f | grep heap
  • If heap values drop to near zero → confirms I/O bottleneck
  • If heap values remain high → confirms CPU bottleneck

⚠️ Warning: Remember to re-enable storage after testing! This test causes call recordings to be lost.

Solution: I/O Bottleneck

ℹ️ Note: If you see IO[...] DISK_SAT or WARN in the syslog status line (v2026.01.3+), disk saturation is already confirmed. See IO[] Metrics for details.

Quick confirmation (for older versions):

Temporarily save only RTP headers to reduce disk write load:

# /etc/voipmonitor.conf
savertp = header

Restart the sniffer and monitor. If heap usage stabilizes and "MEMORY IS FULL" errors stop, the issue is confirmed to be storage I/O.

Check storage health before upgrading:

# Check drive health
smartctl -a /dev/sda

# Check for I/O errors in system logs
dmesg | grep -i "i/o error\|sd.*error\|ata.*error"

Look for reallocated sectors, pending sectors, or I/O errors. Replace failing drives before considering upgrades.

Storage controller cache settings:

Storage Type Recommended Cache Mode
HDD / NAS WriteBack (requires battery-backed cache)
SSD WriteThrough (or WriteBack with power loss protection)

Use vendor-specific tools to configure cache policy (megacli, ssacli, perccli).

Storage upgrades (in order of effectiveness):

Solution IOPS Improvement Notes
NVMe SSD 50-100x vs HDD Best option, handles 10,000+ concurrent calls
SATA SSD 20-50x vs HDD Good option, handles 5,000+ concurrent calls
RAID 10 with BBU 5-10x vs single disk Enable WriteBack cache (requires battery backup)
Separate storage server Variable Use client/server mode

Filesystem tuning (ext4):

# Check current mount options
mount | grep voipmonitor

# Recommended mount options for /var/spool/voipmonitor
# Add to /etc/fstab: noatime,data=writeback,barrier=0
# WARNING: barrier=0 requires battery-backed RAID
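
A hypothetical fstab entry for a dedicated spool volume might look like the line below (the UUID and the data=writeback option are assumptions; keep barriers enabled unless the controller has battery-backed or power-loss-protected cache):

# /etc/fstab - example entry, adjust UUID and options to your hardware
UUID=1234-abcd  /var/spool/voipmonitor  ext4  noatime,data=writeback  0  2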

Verify improvement:

# After changes, monitor iostat
iostat -xz 2 10
# %util should drop below 70%, await should decrease

Solution: CPU Bottleneck

Identify CPU Bottleneck Using Manager Commands

VoIPmonitor provides manager commands to monitor thread CPU usage in real-time. This is essential for identifying which thread is saturated.

Connect to manager interface:

# Via Unix socket (local, recommended)
echo 'sniffer_threads' | nc -U /tmp/vm_manager_socket

# Via TCP port 5029 (remote or local)
echo 'sniffer_threads' | nc 127.0.0.1 5029

# Monitor continuously (every 2 seconds)
watch -n 2 "echo 'sniffer_threads' | nc -U /tmp/vm_manager_socket"

ℹ️ Note: TCP port 5029 is encrypted by default. For unencrypted access, set manager_enable_unencrypted = yes in voipmonitor.conf (security risk on public networks).

Example output:

t0 - binlog1 fifo pcap read          (  12345) :  78.5  FIFO  99     1234
t2 - binlog1 pb write                (  12346) :  12.3               456
rtp thread binlog1 binlog1 0         (  12347) :   8.1               234
rtp thread binlog1 binlog1 1         (  12348) :   6.2               198
t1 - binlog1 call processing         (  12349) :   4.5               567
tar binlog1 compression 0            (  12350) :   3.2                89

Column interpretation:

Column Description
Thread name Descriptive name (t0=capture, t1=call processing, t2=packet buffer write)
(TID) Linux thread ID (useful for top -H -p TID)
CPU % Current CPU usage percentage - key metric
Sched Scheduler type (FIFO = real-time, empty = normal)
Priority Thread priority
CS/s Context switches per second

Critical threads to watch:

  • t0 (pcap read) - Packet capture from the NIC. At 90-100%: single-core limit reached; cannot be parallelized; needs DPDK/Napatech.
  • t2 (pb write) - Packet buffer processing. At 90-100%: processing bottleneck; check the t2CPU breakdown.
  • rtp thread - RTP packet processing. Threads auto-scale; if still saturated, consider DPDK/Napatech.
  • tar compression - PCAP archiving. At 90-100%: I/O bottleneck (compression waiting for disk).
  • mysql store - Database writes. At 90-100%: database bottleneck; check the SQLq metric.

⚠️ Warning: If t0 thread is at 90-100%, you have hit the fundamental single-core capture limit. The t0 thread reads packets from the kernel and cannot be parallelized. Disabling features like jitterbuffer will NOT help - those run on different threads. The only solutions are:

  • Reduce captured traffic using interface_ip_filter or BPF filter
  • Use kernel bypass (DPDK or Napatech) which eliminates kernel overhead entirely

Interpreting t2CPU Detailed Breakdown

The syslog status line shows t2CPU with detailed sub-metrics:

t2CPU[pb:10/ d:39/ s:24/ e:17/ c:6/ g:6/ r:7/ rm:24/ rh:16/ rd:19/]
Code Function High Value Indicates
pb Packet buffer output Buffer management overhead
d Dispatch Structure creation bottleneck
s SIP parsing Complex/large SIP messages
e Entity lookup Call table lookup overhead
c Call processing Call state machine processing
g Register processing High REGISTER volume
r, rm, rh, rd RTP processing stages High RTP volume (threads auto-scale)

Thread auto-scaling: VoIPmonitor automatically spawns additional threads when load increases:

  • If d > 50% → SIP parsing thread (s) starts
  • If s > 50% → Entity lookup thread (e) starts
  • If e > 50% → Call/register/RTP threads start

Configuration for High Traffic (>10,000 calls/sec)

# /etc/voipmonitor.conf

# Increase buffer to handle processing spikes (value in MB)
# 10000 = 10 GB - can go higher (20000, 30000+) if RAM allows
# Larger buffer absorbs I/O and CPU spikes without packet loss
max_buffer_mem = 10000

# Use IP filter instead of BPF (more efficient)
interface_ip_filter = 10.0.0.0/8
interface_ip_filter = 192.168.0.0/16
# Comment out any 'filter' parameter

CPU Optimizations

# /etc/voipmonitor.conf

# Reduce jitterbuffer calculations to save CPU (keeps MOS-F2 metric)
jitterbuffer_f1 = no
jitterbuffer_f2 = yes
jitterbuffer_adapt = no

# If MOS metrics are not needed at all, disable everything:
# jitterbuffer_f1 = no
# jitterbuffer_f2 = no
# jitterbuffer_adapt = no

Kernel Bypass Solutions (Extreme Loads)

When t0 thread hits 100% on standard NIC, kernel bypass is the only solution:

Solution Type CPU Reduction Use Case
DPDK Open-source ~70% Multi-gigabit on commodity hardware
Napatech Hardware SmartNIC >97% (< 3% at 10Gbit) Extreme performance requirements

Verify Improvement

# Monitor thread CPU after changes
watch -n 2 "echo 'sniffer_threads' | nc -U /tmp/vm_manager_socket | head -10"

# Or monitor syslog
journalctl -u voipmonitor -f
# t0CPU should drop, heap values should stay < 20%

ℹ️ Note: After changes, monitor syslog heap[A|B|C] values - should stay below 20% during peak traffic. See Syslog_Status_Line for detailed metric explanations.

Storage Hardware Failure

Symptom: Sensor shows disconnected (red X) with "DROPPED PACKETS" at low traffic volumes.

Diagnosis:

# Check disk health
smartctl -a /dev/sda

# Check RAID status (if applicable)
cat /proc/mdstat
mdadm --detail /dev/md0

Look for reallocated sectors, pending sectors, or RAID degraded state. Replace failing disk.

OOM (Out of Memory)

Identify OOM Victim

# Check for OOM kills
dmesg | grep -i "out of memory\|oom\|killed process"
journalctl --since "1 hour ago" | grep -i oom

MySQL Killed by OOM

Reduce InnoDB buffer pool:

# /etc/mysql/my.cnf
innodb_buffer_pool_size = 2G  # Reduce from default

Voipmonitor Killed by OOM

Reduce buffer sizes in voipmonitor.conf:

max_buffer_mem = 2000  # Reduce from default
ringbuffer = 50        # Reduce from default
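
As a stop-gap while tuning buffers, the kernel's OOM killer can be told to prefer other processes over the sniffer via a systemd drop-in (the drop-in path is illustrative; OOMScoreAdjust itself is a standard systemd directive):

# /etc/systemd/system/voipmonitor.service.d/oom.conf
[Service]
OOMScoreAdjust=-500

# Apply the drop-in
systemctl daemon-reload
systemctl restart voipmonitor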

Runaway External Process

# Find memory-hungry processes
ps aux --sort=-%mem | head -20

# Kill orphaned/runaway process
kill -9 <PID>

For servers limited to 16GB RAM or when experiencing repeated MySQL OOM kills:

# /etc/my.cnf or /etc/mysql/mariadb.conf.d/50-server.cnf
[mysqld]
# On 16GB server: 6GB buffer pool + 6GB MySQL overhead = 12GB total
# Leaves 4GB for OS + GUI, preventing OOM
innodb_buffer_pool_size = 6G

# Enable write buffering (may lose up to 1s of data on crash but reduces memory pressure)
innodb_flush_log_at_trx_commit = 2

Restart MySQL after changes:

systemctl restart mysql
# or
systemctl restart mariadb

SQL Queue Growth from Non-Call Data

If sip-register, sip-options, or sip-subscribe are enabled, non-call SIP messages (OPTIONS, REGISTER, SUBSCRIBE, NOTIFY) can accumulate in the database and cause the SQL queue to grow without bound. This increases MySQL memory usage and leads to OOM kills of mysqld.

⚠️ Warning: Even with reduced innodb_buffer_pool_size, SQL queue will grow indefinitely without cleanup of non-call data.

Solution: Enable automatic cleanup of old non-call data

# /etc/voipmonitor.conf
# cleandatabase=2555 automatically deletes partitions older than 7 years
# Covers: CDR, register_state, register_failed, and sip_msg (OPTIONS/SUBSCRIBE/NOTIFY)
cleandatabase = 2555

Restart the sniffer after changes:

systemctl restart voipmonitor

ℹ️ Note: See Data_Cleaning for detailed configuration options and other cleandatabase_* parameters.

Service Startup Failures

Interface No Longer Exists

After OS upgrade, interface names may change (eth0 → ensXXX):

# Find current interface names
ip a

# Update all config locations
grep -r "interface" /etc/voipmonitor.conf /etc/voipmonitor.conf.d/

# Also check GUI: Settings → Sensors → Configuration
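
Once the new name is known, a one-liner can update the config (old and new interface names below are examples):

# Replace the stale interface name, then restart
sed -i 's/^interface = eth0/interface = ens192/' /etc/voipmonitor.conf
systemctl restart voipmonitor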

Missing Dependencies

# Install common missing package
apt install libpcap0.8  # Debian/Ubuntu
yum install libpcap     # RHEL/CentOS

Network Interface Issues

Promiscuous Mode

Required for SPAN port monitoring:

# Enable
ip link set eth0 promisc on

# Verify
ip link show eth0 | grep PROMISC

ℹ️ Note: Promiscuous mode is NOT required for ERSPAN/GRE tunnels where traffic is addressed to the sensor.
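
ip link set ... promisc on does not survive a reboot. One way to persist it is a small oneshot systemd unit (the unit name and path are illustrative):

# /etc/systemd/system/promisc-eth0.service
[Unit]
Description=Enable promiscuous mode on eth0
After=network.target

[Service]
Type=oneshot
ExecStart=/sbin/ip link set eth0 promisc on

[Install]
WantedBy=multi-user.target

# Enable and start it
systemctl enable --now promisc-eth0.service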

Interface Drops

# Check for drops
ip -s link show eth0 | grep -i drop

# If drops present, increase ring buffer
ethtool -G eth0 rx 4096
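
The 4096 value above is only valid if the hardware supports it - check the reported maximum first:

# Show current and maximum supported ring sizes
ethtool -g eth0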

Bonded/EtherChannel Interfaces

Symptom: False packet loss when monitoring bond0 or br0.

Solution: Monitor physical interfaces, not logical:

# voipmonitor.conf - use physical interfaces
interface = eth0,eth1

Network Offloading Issues

Symptom: Kernel errors like bad gso: type: 1, size: 1448

# Disable offloading on capture interface
ethtool -K eth0 gso off tso off gro off lro off
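
Offload settings also reset on reboot. On Debian/Ubuntu with ifupdown, a post-up hook in the interface stanza is one way to persist them (the stanza below is a sketch for a capture-only interface):

# /etc/network/interfaces
auto eth0
iface eth0 inet manual
    post-up ethtool -K eth0 gso off tso off gro off lro off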

Packet Ordering Issues

If SIP messages appear out of sequence:

First: Rule out Wireshark display artifact - disable "Analyze TCP sequence numbers" in Wireshark. See FAQ.

If genuine reordering: Usually caused by packet bursts in network infrastructure. Use tcpdump to verify packets arrive out of order at the interface. Work with network admin to implement QoS or traffic shaping. For persistent issues, consider dedicated capture card with hardware timestamping (see Napatech).

ℹ️ Note: For out-of-order packets in client/server mode (multiple sniffers), see Sniffer_distributed_architecture for pcap_queue_dequeu_window_length configuration.

Solutions for SPAN/Mirroring Reordering

If packets arrive out of order at the SPAN/mirror port (e.g., 302 responses before INVITE causing "000 no response" errors):

1. Configure switch to preserve packet order: Many switches allow configuring SPAN/mirror ports to maintain packet ordering. Consult your switch documentation for packet ordering guarantees in mirroring configuration.

2. Replace SPAN with TAP or packet broker: Unlike software-based SPAN mirroring, hardware TAPs and packet brokers guarantee packet order. Consider upgrading to a dedicated TAP or packet broker device for mission-critical monitoring.

Database Issues

SQL Queue Overload

Symptom: Growing SQLq metric, potential coredumps.

# voipmonitor.conf - batch CDR inserts and skip the duplicate Call-ID check
mysqlstore_concat_limit_cdr = 1000
cdr_check_exists_callid = 0

Error 1062 - Lookup Table Limit

Symptom: Duplicate entry '16777215' for key 'PRIMARY'

Quick fix:

# voipmonitor.conf
cdr_reason_string_enable = no

See Database Troubleshooting for complete solution.

Bad Packet Errors

Symptom: bad packet with ether_type 0xFFFF detected on interface

Diagnosis:

# Run diagnostic (let run 30-60 seconds, then kill)
voipmonitor --check_bad_ether_type=eth0

# Find and kill the diagnostic process
ps ax | grep voipmonitor
kill -9 <PID>

Causes: corrupted packets, driver issues, VLAN tagging problems. Check ethtool -S eth0 for interface errors.

Useful Diagnostic Commands

tshark Filters for SIP

# All SIP INVITEs
tshark -r capture.pcap -Y "sip.Method == INVITE"

# Find specific phone number
tshark -r capture.pcap -Y 'sip contains "5551234567"'

# Get Call-IDs
tshark -r capture.pcap -Y "sip.Method == INVITE" -T fields -e sip.Call-ID

# SIP errors (4xx, 5xx)
tshark -r capture.pcap -Y "sip.Status-Code >= 400"

Interface Statistics

# Detailed NIC stats
ethtool -S eth0

# Watch packet rates
watch -n 1 'cat /proc/net/dev | grep eth0'

See Also

  • Sniffer_configuration - Configuration parameter reference
  • Sniffer_distributed_architecture - Client/server deployment
  • Capture_rules - GUI-based recording rules
  • Sniffing_modes - SPAN, ERSPAN, GRE, TZSP setup
  • Scaling - Performance optimization
  • Database_troubleshooting - Database issues
  • FAQ - Common questions and Wireshark display issues

AI Summary for RAG

Summary

Comprehensive troubleshooting guide for VoIPmonitor sniffer/sensor problems. Covers: verifying traffic reaches interface (tcpdump/tshark), diagnosing no calls recorded (service, config, capture rules, SPAN), missing audio/RTP issues (one-way audio, NAT, natalias, rtp_check_both_sides_by_sdp), PACKETBUFFER FULL errors (I/O vs CPU bottleneck diagnosis using syslog metrics heap/t0CPU/SQLq and Linux tools iostat/iotop/ioping), manager commands for thread monitoring (sniffer_threads via socket or port 5029), t0 single-core capture limit and solutions (DPDK/Napatech kernel bypass), I/O solutions (NVMe/SSD, async writes, pcap_dump_writethreads), CPU solutions (max_buffer_mem 10GB+, jitterbuffer tuning), OOM issues (MySQL buffer pool, voipmonitor buffers), network interface problems (promiscuous mode, drops, offloading), packet ordering, database issues (SQL queue, Error 1062).

Keywords

troubleshooting, sniffer, sensor, no calls, missing audio, one-way audio, RTP, PACKETBUFFER FULL, memory is FULL, buffer saturation, I/O bottleneck, CPU bottleneck, heap, t0CPU, t1CPU, t2CPU, SQLq, comp, tacCPU, iostat, iotop, ioping, sniffer_threads, manager socket, port 5029, thread CPU, t0 thread, single-core limit, DPDK, Napatech, kernel bypass, NVMe, SSD, async write, pcap_dump_writethreads, tar_maxthreads, max_buffer_mem, jitterbuffer, interface_ip_filter, OOM, out of memory, innodb_buffer_pool_size, promiscuous mode, interface drops, ethtool, packet ordering, SPAN, mirror, SQL queue, Error 1062, natalias, NAT, id_sensor, snaplen, capture rules, tcpdump, tshark

Key Questions

  • Why are no calls being recorded in VoIPmonitor?
  • How to diagnose PACKETBUFFER FULL or memory is FULL error?
  • How to determine if bottleneck is I/O or CPU?
  • What do heap values in syslog mean?
  • What does t0CPU percentage indicate?
  • How to use sniffer_threads manager command?
  • How to connect to manager socket or port 5029?
  • What to do when t0 thread is at 100%?
  • How to fix one-way audio or missing RTP?
  • How to configure natalias for NAT?
  • How to increase max_buffer_mem for high traffic?
  • How to disable jitterbuffer to save CPU?
  • What causes OOM kills of voipmonitor or MySQL?
  • How to check disk I/O performance with iostat?
  • How to enable promiscuous mode on interface?
  • How to fix packet ordering issues with SPAN?
  • What is Error 1062 duplicate entry?
  • How to verify traffic reaches capture interface?