
03 May 2018

VMware Troubleshoot Commands, Logs and Performance - ESXTOP



In this post I will show you some commands and logs that help troubleshoot errors in a VMware environment. I'll give special attention to a very useful command called esxtop.
 
What I'm going to show is the following: 
  • Important Log Files
  • Useful ESXi Commands
  • Some ESXi Configuration Files
  • ESXTOP Command

Let's Begin!

Important Log Files

 
  • Host abruptly rebooted
    • /var/log/vmksummary.log
  • Slow boot issues
    • /var/log/boot.gz (you can also enable serial logging from the boot options prompt, Shift + O)
  • ESXi not responding
    • /var/log/hostd.log
    • /var/log/hostd-probe.log
  • VM issues
    • vmware.log (This file is located in virtual machine folder along with the vmx file)
  • Storage issues
    • /var/log/vmkernel.log
  • Network and storage issues
    • /var/log/vobd.log
  • HA issues
    • /var/log/fdm.log
    • /opt/vmware/fdm/prettyprint.sh hostlist | less
  • Unable to log in because of a root account lockout
    • /var/log/vobd.log
    • /var/log/auth.log
    • By default the account locks after 10 failed attempts (the default may differ between ESXi versions). The lockout applies only to SSH and Web Services access.
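
The list above is essentially a lookup from symptom to log file, so it can be kept in a small script for quick reference. The following is only a sketch: the symptom keys (`host_reboot`, `ha`, and so on) and both helper names are made up for this example, and only the file paths come from the list.

```python
# Symptom-to-log lookup built from the list above.
# The dictionary keys are hypothetical short names chosen for this sketch.
LOGS_BY_SYMPTOM = {
    "host_reboot": ["/var/log/vmksummary.log"],
    "slow_boot": ["/var/log/boot.gz"],
    "host_not_responding": ["/var/log/hostd.log", "/var/log/hostd-probe.log"],
    "storage": ["/var/log/vmkernel.log", "/var/log/vobd.log"],
    "network": ["/var/log/vobd.log"],
    "ha": ["/var/log/fdm.log"],
    "root_lockout": ["/var/log/vobd.log", "/var/log/auth.log"],
}


def logs_for(symptom):
    """Return the log files to check first for a symptom, or an empty list."""
    return LOGS_BY_SYMPTOM.get(symptom, [])


def matching_lines(lines, keyword):
    """Return the lines that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


print(logs_for("root_lockout"))
```

`matching_lines` takes any iterable of text lines, so it can be fed an open log file once you have copied it off the host.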

 

Useful ESXi Commands

 
  •  Basically these are the most used commands:
    • Monitor & configure ESXi
      • esxcli
    • Manage ESXi and VM config
      • vim-cmd
    • VMFS volumes & virtual disks
      • vmkfstools
    • Detailed memory stats
      • memstats
    • Monitor and identify performance issues
      • esxtop (see below for details)
 
Let's see some examples:
  • List delta disks
    • find /path/to/vm/folder -iname "*delta*"
  • Restart ESXi services
    • services.sh restart
  • Restart vCenter agent
    • /etc/init.d/vpxa restart
  • ESXi services state
    • cat /etc/chkconfig.db
  • Ping from VMKernel network
    • vmkping -I vmk1 <IP_Address>
  • Clone Disk test.vmdk to testclone.vmdk
    • vmkfstools -i test.vmdk testclone.vmdk
  • NIC Info
    • esxcli network nic list
  • vSwitches Info
    • esxcli network vswitch standard list
  • Software and Drivers Info
    • esxcli software vib list
  • Maintenance Mode
    • esxcli system maintenanceMode set --enable <true | false>
  • ESXi Version
    • esxcli system version get
  • Installation Date
    • esxcli system stats installtime get
  • List Local Users
    • esxcli system account list
  • Shutdown/Restart Host
    • esxcli system shutdown reboot -d 10 -r "Patch Updates"
  • IPv4 Interfaces
    • esxcli network ip interface ipv4 get
  • Virtual Machine State
    • vim-cmd vmsvc/power.getstate <vmid> (list the IDs with vim-cmd vmsvc/getallvms)
  • List all VMs in one Host (Check World ID)
    • esxcli vm process list
  • Kill Virtual Machine
    • esxcli vm process kill -w <World_ID> -t <soft | hard | force>
    • vim-cmd vmsvc/power.off <vmid>
  • List Datastores
    • esxcli storage filesystem list
  • Copy VMDK between Hosts (Offline)
    • Start the SSH service on the source and destination hosts: /etc/init.d/SSH start on ESXi (service sshd restart on classic ESX)
    • scp local_filename.vmdk user@server:/path/final
  • VMkernel sysinfo
    • vsish get /bios
    • hardwareinfo
  • Number of failed login attempts
    • pam_tally2 --user root
  • Clear the password lockout
    • pam_tally2 --user root --reset
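
To kill a VM you first need its World ID from `esxcli vm process list`. The sketch below shows one way to pull the World IDs out of that output and build the kill command. It assumes the output is a series of blocks, each with an unindented name line followed by indented `Key: Value` lines. The sample text, the VM names and both function names are invented for illustration.

```python
def parse_vm_process_list(text):
    """Parse 'esxcli vm process list' style output into a list of dicts.

    Assumes each VM is a block that starts with an unindented name line
    followed by indented 'Key: Value' lines.
    """
    vms = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new VM block.
            current = {"Name": line.strip()}
            vms.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return vms


def kill_command(world_id, kill_type="soft"):
    """Build the esxcli kill command from the post for a given World ID."""
    if kill_type not in ("soft", "hard", "force"):
        raise ValueError("kill_type must be soft, hard or force")
    return ["esxcli", "vm", "process", "kill", "-w", str(world_id), "-t", kill_type]


# Hypothetical sample output, for illustration only.
SAMPLE = """\
web01
   World ID: 69632
   Display Name: web01
   Config File: /vmfs/volumes/datastore1/web01/web01.vmx

db01
   World ID: 70211
   Display Name: db01
   Config File: /vmfs/volumes/datastore1/db01/db01.vmx
"""

vms = parse_vm_process_list(SAMPLE)
world_ids = {vm["Display Name"]: int(vm["World ID"]) for vm in vms}
print(world_ids)
print(" ".join(kill_command(world_ids["web01"])))
```

Starting with `soft` and escalating to `hard` and then `force` only if the VM does not stop is the safer order, which is why `soft` is the default here.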
 
 

Some Configuration Files

 
  • Storage, networking, HW info
    • /etc/vmware/esx.conf
  • VM inventory
    • /etc/vmware/hostd/vmInventory.xml
  • vCenter to ESXi host connection
    • /etc/vmware/hostd/authorization.xml
  • vCenter and ESXi connectivity
    • /etc/vmware/vpxa/vpxa.cfg
  • iSCSI configuration
    • /etc/vmware/vmkiscsid/iscsi.conf
  • HA configuration
    • /etc/vmware/fdm
  • License configuration
    • /etc/vmware/license.cfg

 

ESXTOP


esxtop can be run from the ESXi console or remotely over a Secure Shell (SSH) session. Run without arguments, it starts in interactive mode. It supports three modes:
  • Interactive – Displays the collected performance information in real time. It is the default mode, and the statistics are refreshed every 5 seconds by default;
  • Batch – Collects the performance data and saves it to a file. This is useful for gathering statistics over a long period for later analysis with tools such as Excel, Perfmon, and other third-party tools;
  • Replay – Interactively replays data that was collected with the vm-support tool, which is commonly used to capture logs and configuration data to send to VMware Support. Note that data collected in batch mode cannot be replayed interactively.
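
As an illustration of the batch workflow, a capture is typically taken with something like `esxtop -b -d 5 -n 12 > stats.csv` (12 samples, 5 seconds apart) and then analyzed offline. The sketch below scans such a CSV for counters whose peak exceeds a limit. It assumes Perfmon-style column headers ending in the counter name, for example `% Ready`. The sample data, host name and VM names are invented, and `columns_over` is a hypothetical helper.

```python
import csv
import io


def columns_over(csv_text, suffix, threshold):
    """Return {column_name: max_value} for columns whose header ends with
    suffix and whose maximum sample exceeds threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(suffix)]
    peaks = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed samples
            name = header[i]
            peaks[name] = max(peaks.get(name, value), value)
    return {name: peak for name, peak in peaks.items() if peak > threshold}


# Hypothetical two-sample capture; host and VM names are made up.
SAMPLE = (
    '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx01\\Group Cpu(1001:web01)\\% Ready",'
    '"\\\\esx01\\Group Cpu(1002:db01)\\% Ready"\n'
    '"05/03/2018 10:00:00","2.10","12.50"\n'
    '"05/03/2018 10:00:05","3.40","14.00"\n'
)

print(columns_over(SAMPLE, "% Ready", 10))
```

With the sample above, only the db01 column is reported, because its peak of 14.0 is over the limit of 10 while web01 stays below it.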
I will focus on the default mode (Interactive).
 
A good way to find your way around interactive mode is to press "h", which shows a menu of the available commands. Notice the "Switch Display" section, which lists the keys for switching between the resources we want to view. For example, to look at the disk adapters, the "Switch Display" section shows that "d" switches to the disk adapter display, so we simply type "d".
Each display can show a lot of information at once, so we can hide the columns we aren't interested in. To do this, press "f". The fields marked with an asterisk (*) are the ones currently displayed. To add or remove a field, type the corresponding letter.
We can also save the modified view for the next time we use interactive mode. To do this, press "W" (make sure it is uppercase) to save the settings.

What we can look for with ESXTOP:
  • CPU
    • %RDY - Percentage of time a VM was ready to run but could not be scheduled because not enough CPU resources were available. Possible causes: too many vCPUs, vSMP VMs, or a CPU limit enforced on the VM. Threshold: higher than 10.
    • %SWPWT - Percentage of time a VM has to wait for the host to swap memory. This could be a sign of overcommitted memory. Threshold: higher than 5.
    • %MLMTD - Percentage of time a VM or world was not scheduled because of a limit setting. Unless a limit on a resource pool or VM was configured on purpose, this field should not be higher than 0. Threshold: greater than 0.
    • %CSTP - Percentage of time a VM spends in a ready, co-descheduled state. This field only really applies to virtual machines using vSMP and indicates that one vCPU is being used much more than the other vCPUs allocated to the VM. Threshold: higher than 3.
  • Memory
    • MCTLSZ (MB) - Amount of physical memory the ESXi host is reclaiming through the balloon driver. Could be a sign of overcommitted memory. Threshold: greater than 0.
    • ZIP/s (MB/s) - Amount of memory compressed per second on the host. If the host is compressing memory pages, there is memory contention, usually due to overcommitted memory. Threshold: greater than 0.
    • UNZIP/s (MB/s) - Amount of memory decompressed per second on the host. Can be a sign of overcommitted memory. Threshold: greater than 0.
    • SWCUR (MB) - Memory that is currently swapped out by the VM or resource pool. Points to overcommitted memory. Threshold: greater than 0.
    • CACHEUSD (MB) - Amount of compressed memory held in the compression cache by the ESXi host. Could be an indication of overcommitted memory. Threshold: greater than 0.
    • SWR/s and SWW/s - Rate at which the ESXi host reads swapped memory back from disk or writes memory out to swap on disk. A possible cause is overcommitted memory. Threshold: greater than 0.
  • Network
    • %DRPTX - Percentage of transmitted packets dropped. Values higher than 0 could be a sign of high network utilization. Threshold: greater than 0.
    • %DRPRX - Percentage of received packets dropped. Values higher than 0 could be a sign of high network utilization. Threshold: greater than 0.
    • USED-BY and TEAM-PNIC - These two fields are very useful for identifying which physical NIC a VM is using.
  • Disk
    • DAVG/cmd - Average device latency per command (ms) at the device driver level. High values point to storage performance issues. Threshold: over 25.
    • ABRTS/s - Commands aborted per second. Aborts are issued by the guest OS when storage stops responding. Windows has a default timeout of 60 seconds. A possible cause is an issue with the storage fabric or array. Threshold: anything over 0.
    • KAVG/cmd - Average VMkernel latency per command (ms). A high value indicates that I/O is being throttled between the guest OS and storage. The best bet is to check with the vendor for performance tuning options or an updated firmware release. Threshold: over 2.
    • GAVG/cmd - Average latency per command (ms) as seen by the guest operating system. This value is the sum of DAVG and KAVG. Threshold: over 25.
    • Resets/s - Command resets per second. A reset command is issued when an operation fails to reach the target. Threshold: anything over 0.
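
The thresholds above lend themselves to a simple automated check. The sketch below stores the rule-of-thumb values from this post in a table and flags any reading that exceeds its threshold. It also derives GAVG as DAVG + KAVG, as described above. The function names and sample readings are invented for this example. Treat the thresholds as starting points, not hard limits.

```python
# Thresholds as listed in the post; a value above the threshold is flagged.
THRESHOLDS = {
    "%RDY": 10, "%SWPWT": 5, "%MLMTD": 0, "%CSTP": 3,
    "MCTLSZ": 0, "ZIP/s": 0, "UNZIP/s": 0, "SWCUR": 0, "CACHEUSD": 0,
    "SWR/s": 0, "SWW/s": 0,
    "%DRPTX": 0, "%DRPRX": 0,
    "DAVG/cmd": 25, "KAVG/cmd": 2, "GAVG/cmd": 25, "ABRTS/s": 0, "Resets/s": 0,
}


def flag_counters(sample):
    """Return the counters in sample that exceed their threshold, sorted by name."""
    return sorted(
        name for name, value in sample.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    )


def gavg(davg, kavg):
    """Guest latency as the sum of device and kernel latency."""
    return davg + kavg


# Hypothetical readings typed in from an esxtop screen.
sample = {"%RDY": 12.4, "%CSTP": 0.5, "MCTLSZ": 0, "DAVG/cmd": 24.0, "KAVG/cmd": 3.0}
sample["GAVG/cmd"] = gavg(sample["DAVG/cmd"], sample["KAVG/cmd"])
print(flag_counters(sample))  # ['%RDY', 'GAVG/cmd', 'KAVG/cmd']
```

Note that DAVG alone (24 ms) stays under its threshold here, but once the 3 ms of kernel latency is added the guest sees 27 ms, so GAVG is flagged.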
 
That's all for now. See you next time!
