Systems Inside: VMware Troubleshoot Commands, Logs and Performance

In this post I will show you some commands and logs that help troubleshoot errors in a VMware environment. I'll give special attention to a very useful command called esxtop.

What I'm going to show is the following:

Important Log Files

Useful ESXi Commands

Some ESXi Configuration Files

ESXTOP Command

Let's Begin!

Important Log Files

Host abruptly rebooted

/var/log/vmksummary.log

Slow boot issues

/var/log/boot.gz (you can also enable serial logging: press Shift + O at the boot loader to edit the boot options)

ESXi not responding

/var/log/hostd.log
/var/log/hostd-probe.log

VM issues

vmware.log (this file is located in the virtual machine folder, alongside the .vmx file)

Storage issues

/var/log/vmkernel.log

Network and storage issues

/var/log/vobd.log

HA issues

/var/log/fdm.log
/opt/vmware/fdm/fdm/prettyPrint.sh hostlist | less

Unable to log in because of a root account lockout

/var/log/vobd.log
/var/log/auth.log
By default, the account is locked after a maximum of 10 failed attempts. The lockout is only active for SSH and Web Services access.
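If you keep this list in a script, a small lookup table is enough. The sketch below only encodes the symptom-to-log pairs listed above; the function name `logs_for` and the lowercase symptom keys are my own choices for illustration, not anything shipped with ESXi.

```python
# Symptom-to-log lookup built only from the list in this post.
# The keys and the helper name are illustrative, not a VMware API.
LOGS_BY_SYMPTOM = {
    "host abruptly rebooted": ["/var/log/vmksummary.log"],
    "slow boot issues": ["/var/log/boot.gz"],
    "esxi not responding": ["/var/log/hostd.log", "/var/log/hostd-probe.log"],
    "vm issues": ["vmware.log"],  # lives in the VM folder next to the .vmx file
    "storage issues": ["/var/log/vmkernel.log", "/var/log/vobd.log"],
    "network issues": ["/var/log/vobd.log"],
    "ha issues": ["/var/log/fdm.log"],
    "root account lockout": ["/var/log/vobd.log", "/var/log/auth.log"],
}


def logs_for(symptom):
    """Return the log files to check first for a symptom (case-insensitive)."""
    return LOGS_BY_SYMPTOM.get(symptom.strip().lower(), [])


if __name__ == "__main__":
    for path in logs_for("Storage issues"):
        print(path)
```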

Useful ESXi Commands

Basically these are the most used commands:

Monitor & configure ESXi

esxcli

Manage ESXi and VM config

vim-cmd

VMFS volumes & virtual disks

vmkfstools

Detailed memory stats

memstats

Monitor and identify performance issues

esxtop (see below for more details)

Let's see some examples:

List delta disks

find /path/to/vm/folder -iname "*delta*"

Restart ESXi services

services.sh restart

Restart vCenter agent

/etc/init.d/vpxa restart

ESXi services state

cat /etc/chkconfig.db

Ping from VMKernel network

vmkping -I vmk1 <IP_Address>

Clone Disk test.vmdk to testclone.vmdk

vmkfstools -i test.vmdk testclone.vmdk

NIC Info

esxcli network nic list

vSwitches Info

esxcli network vswitch standard list

Software and Drivers Info

esxcli software vib list

Maintenance Mode

esxcli system maintenanceMode set --enable <true | false>

ESXi Version

esxcli system version get

Installation Date

esxcli system stats installtime get

List Local Users

esxcli system account list

Shutdown/Restart Host

esxcli system shutdown reboot -d 10 -r "Patch Updates"

IPv4 Interfaces

esxcli network ip interface ipv4 get

Virtual Machine State

vim-cmd vmsvc/power.getstate vmid

List all running VMs on a host (note the World ID)

esxcli vm process list

Kill Virtual Machine

esxcli vm process kill -w ID -t <soft | hard | force>
vim-cmd vmsvc/power.off vmid

List Datastores

esxcli storage filesystem list

Copy VMDK between Hosts (Offline)

Start the SSH service on the source and destination ESXi hosts - /etc/init.d/SSH start
scp local_filename.vmdk user@server:/path/final

VMkernel sysinfo

vsish get /bios
hardwareinfo

Number of failed login attempts

pam_tally2 --user root

Clear the password lockout

pam_tally2 --user root --reset
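Several of the commands above work as a pair: `esxcli vm process list` gives you the World ID, and `esxcli vm process kill` consumes it. Here is a minimal sketch of that hand-off. It assumes the list output contains lines of the form `World ID: <n>` and `Display Name: <name>` for each VM; the sample text and the helper names are made up for illustration, so check them against what your build actually prints.

```python
import re

VALID_KILL_TYPES = ("soft", "hard", "force")

# Hypothetical sample of `esxcli vm process list` output (assumed layout).
SAMPLE_OUTPUT = """\
web01
   World ID: 1000123
   Process ID: 0
   Display Name: web01
db01
   World ID: 1000456
   Process ID: 0
   Display Name: db01
"""


def parse_world_ids(output):
    """Map each VM display name to its World ID."""
    worlds = {}
    current_id = None
    for raw in output.splitlines():
        line = raw.strip()
        match = re.match(r"World ID:\s*(\d+)$", line)
        if match:
            current_id = int(match.group(1))
            continue
        match = re.match(r"Display Name:\s*(.+)$", line)
        if match and current_id is not None:
            worlds[match.group(1)] = current_id
            current_id = None
    return worlds


def build_kill_command(world_id, kill_type="soft"):
    """Build the argument list for `esxcli vm process kill` (does not run it)."""
    if kill_type not in VALID_KILL_TYPES:
        raise ValueError("kill_type must be one of: " + ", ".join(VALID_KILL_TYPES))
    return ["esxcli", "vm", "process", "kill", "-w", str(world_id), "-t", kill_type]


if __name__ == "__main__":
    ids = parse_world_ids(SAMPLE_OUTPUT)
    print(" ".join(build_kill_command(ids["web01"])))
```

The builder only returns the argument list, so you can review the command before running it. Starting with `soft` and escalating to `hard` or `force` only if needed is the usual order.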

Some Configuration Files

Storage, networking, HW info

/etc/vmware/esx.conf

VM inventory

/etc/vmware/hostd/vminventory.xml

vCenter to ESXi host connection

/etc/vmware/hostd/authorization.xml

vCenter and ESXi connectivity

/etc/vmware/vpxa/vpxa.cfg

iSCSI configuration

/etc/vmware/vmkiscsid/iscsi.conf

HA configuration

/etc/vmware/fdm

License configuration

/etc/vmware/license.cfg

ESXTOP

esxtop can be run from the ESXi console or remotely over a Secure Shell (SSH) session. Started without options, it takes you straight into the interactive display mode. ESXTOP can be run in the following three modes:

Interactive – Displays the collected performance information in real time. It is the default mode, and the stats displayed are refreshed every 5 seconds by default;

Batch – Collects the performance data and saves it to a file. This is useful for collecting statistics over a long period of time for later analysis. The collected data can be analyzed using tools like Excel, Perfmon, and other third-party tools;

Replay – Interactively replays the data that was collected using the vm-support tool, which is commonly used to capture logs and config data to send to VMware Support. One important thing to note is that you cannot interactively replay data that was collected in batch mode.
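Batch mode is usually started with something like `esxtop -b -d 5 -n 120 > capture.csv` (`-b` for batch, `-d` for the delay in seconds, `-n` for the number of samples). The result is a wide CSV file with one column per counter. As a rough sketch of the "analyze later" step, the helper below averages every column whose header contains a given text. The header names in the sample are invented in a Perfmon-like style and are only an assumption about what a real capture looks like.

```python
import csv
import io

# Hypothetical two-sample capture; real header names may differ.
SAMPLE_CSV = "\n".join([
    r'"Time","\\esx01\Group Cpu(1:web01)\% Ready","\\esx01\Group Cpu(2:db01)\% Ready"',
    '"05/03/2018 10:00:00","4.0","12.0"',
    '"05/03/2018 10:00:05","6.0","14.0"',
])


def average_matching_columns(csv_text, needle):
    """Average every column whose header contains `needle` (case-insensitive)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    wanted = [i for i, name in enumerate(header) if needle.lower() in name.lower()]
    totals = {i: 0.0 for i in wanted}
    counts = {i: 0 for i in wanted}
    for row in reader:
        for i in wanted:
            try:
                totals[i] += float(row[i])
                counts[i] += 1
            except (ValueError, IndexError):
                continue  # skip empty or non-numeric cells
    return {header[i]: totals[i] / counts[i] for i in wanted if counts[i]}


if __name__ == "__main__":
    for name, value in average_matching_columns(SAMPLE_CSV, "% ready").items():
        print(name, value)
```

For a real capture, read the file contents and pass them in as `csv_text`. Very large captures are better processed row by row from disk instead.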

I will focus on the default mode (Interactive).

A good way to find your way around in interactive mode is to press "h", which brings up a menu of the available commands. Notice the "Switch Display" section, which lists the different display options. These options let us switch between the resources we would like to view. For example, let’s say we want to take a look at the resources on our disk adapters: the "Switch Display" section shows that "d" is the command to switch to the disk adapter display, so we simply type "d".

This can be a lot of information to view at one time, so we can remove the columns we aren’t interested in. To do this, use the "f" command. The items that have an asterisk (*) next to them are the ones currently displayed. To remove or add a field, type the corresponding letter.

If we want, we can even save this modified view for the next time we access interactive mode. To do this, press "W" (make sure it’s uppercase) to save the settings.

What we can look for with ESXTOP:

CPU

%RDY - Indicates the percentage of time a VM was ready to run but could not because there weren’t enough CPU resources available. Could be due to too many vCPUs, vSMP VMs or a CPU limit enforced on a VM. Threshold: Higher than 10.

%SWPWT - Indicates the percentage of time a VM has to wait for the host to swap memory. This could be a sign of overcommitted memory. Threshold: Higher than 5.

%MLMTD - Indicates the percentage of time a VM or world was not scheduled because of a limit setting. Unless a limit on a resource pool or VM was purposely configured by design, there shouldn’t be anything higher than 0 in this field. Threshold: Greater than 0.

%CSTP - Indicates the percentage of time a VM spends in a ready, co-deschedule state. This field really only applies to virtual machines that are using vSMP and indicates that one vCPU is being used a lot more than the other vCPU allocated to the VM. Threshold: Higher than 3.
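The CPU thresholds above are easy to encode if you collect the counters yourself. This is only a sketch: the threshold values come from this post, while the `sample` dictionary and the function name are assumptions for illustration.

```python
# Thresholds taken from the CPU list in this post; a value is flagged
# only when it is strictly higher than its threshold.
CPU_THRESHOLDS = {
    "%RDY": 10.0,
    "%SWPWT": 5.0,
    "%MLMTD": 0.0,
    "%CSTP": 3.0,
}


def cpu_warnings(sample):
    """Return the counters in `sample` that exceed their threshold."""
    return {
        name: value
        for name, value in sample.items()
        if name in CPU_THRESHOLDS and value > CPU_THRESHOLDS[name]
    }


if __name__ == "__main__":
    # Hypothetical readings for one VM.
    print(cpu_warnings({"%RDY": 12.5, "%SWPWT": 0.0, "%MLMTD": 0.0, "%CSTP": 3.0}))
```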

Memory

MCTLSZ (MB) - Indicates the amount of physical memory the ESXi host is reclaiming through the balloon driver. Could possibly be a sign of overcommitted memory. Threshold: Greater than 0.

ZIP/s (MB/s) - Indicates the amount of memory that is compressed per second on the host. If the host is compressing memory pages it’s an indicator of memory contention issues and is usually due to overcommitted memory. Threshold: greater than 0.

UNZIP/s (MB/s) - Indicates the amount of memory that is decompressed per second on the host. Can be a sign of overcommitted memory. Threshold: greater than 0.

SWCUR (MB) - Memory that is currently swapped by the VM or resource pool. Points to overcommitted memory. Threshold: greater than 0.

CACHEUSD (MB) - Amount of compressed memory currently held in the compression cache by the ESXi host. Could be an indication of overcommitted memory. Threshold: greater than 0.

SWW/s and SWR/s - Indicate the rate at which the ESXi host swaps memory out to disk (SWW/s) or in from disk (SWR/s). Possible cause would be overcommitted memory. Threshold: greater than 0.
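Since every memory counter above uses the same "greater than 0" rule, a check can simply list which ones are non-zero together with a short hint. A minimal sketch, with hint texts paraphrased from this post and a hypothetical `sample` dictionary as input:

```python
# Counter names and hints paraphrased from the memory list in this post.
MEMORY_HINTS = {
    "MCTLSZ": "balloon driver is reclaiming memory",
    "ZIP/s": "host is compressing memory",
    "UNZIP/s": "host is decompressing memory",
    "SWCUR": "memory is currently swapped",
    "CACHEUSD": "compression cache is in use",
    "SWR/s": "host is reading from swap",
    "SWW/s": "host is writing to swap",
}


def memory_pressure_signs(sample):
    """List every known memory counter in `sample` that is greater than 0."""
    return [
        f"{name}={sample[name]}: {hint}"
        for name, hint in MEMORY_HINTS.items()
        if sample.get(name, 0) > 0
    ]


if __name__ == "__main__":
    for sign in memory_pressure_signs({"MCTLSZ": 0, "SWCUR": 128, "SWR/s": 0.4}):
        print(sign)
```

An empty list means none of the listed counters show signs of overcommitted memory in that sample.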

Network

%DRPTX - Percentage of transmitted packets dropped. Values higher than 0 could be a sign of high network utilization. Threshold: greater than 0.

%DRPRX - Percentage of received packets dropped. Values higher than 0 could be a sign of high network utilization. Threshold: greater than 0.

Used-by and Team-PNIC - These two fields are very useful for identifying which physical NIC a VM is using.

Disk

DAVG/cmd - Indicates the average device latency per command at the device driver level, in milliseconds. High values point to storage performance issues. Threshold: Over 25

ABRTS/s - Commands aborted per second. Aborts are issued by the guest OS when storage stops responding. Windows has a default timeout of 60 seconds. Possible cause could be an issue with the storage fabric or array. Threshold: Anything over 0

KAVG/cmd - Average VMkernel latency per command. A high value indicates I/O is being throttled between the guest OS and storage; the best bet is to check with the vendor for performance tuning options or an updated firmware release. Threshold: Over 2

GAVG/cmd - Average guest operating system latency per command. This value is the sum of DAVG and KAVG. Threshold: Over 25

Resets/s - Command resets per second. A reset command gets issued when the operation fails to reach the target. Threshold: Anything over 0
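The relationship between the three latency counters is simple arithmetic: GAVG is DAVG plus KAVG. The sketch below applies that sum and the thresholds from this post to one set of readings; the function name and the input values are hypothetical.

```python
# Latency thresholds from the disk list in this post.
DISK_THRESHOLDS = {"DAVG": 25.0, "KAVG": 2.0, "GAVG": 25.0}


def disk_report(davg, kavg, aborts_per_s=0.0, resets_per_s=0.0):
    """Return (gavg, flagged), where flagged lists the counters over threshold."""
    gavg = davg + kavg  # guest latency is device latency plus VMkernel latency
    values = {"DAVG": davg, "KAVG": kavg, "GAVG": gavg}
    flagged = [name for name, value in values.items() if value > DISK_THRESHOLDS[name]]
    if aborts_per_s > 0:
        flagged.append("ABRTS/s")
    if resets_per_s > 0:
        flagged.append("Resets/s")
    return gavg, flagged


if __name__ == "__main__":
    print(disk_report(24.0, 1.5))
```

Note that DAVG and KAVG can each sit just under their own threshold while GAVG is already over 25, which is why it is worth looking at all three together.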

That's all for now. See you next time!

03 May 2018