Wednesday, August 24, 2016

VMware ESXi Storage I/O Path

For a long time the I/O flow in ESXi was a bit of a myth to me. Sure, I knew what a Path Selection Policy (PSP) and a Storage Array Type Plugin (SATP) were, and I had heard about vSCSI stats, but I was not really able to explain the I/O flow in depth. It was more or less: I/O somehow gets from the VM into the kernel. Then you could monitor certain performance values and stats with e.g. ESXTOP, or change the IOPS setting at the PSP level if a vendor recommended it to improve performance. And yes, there were raw devices, different storage protocols and queues. But how does all of this work together?


This blog article explains the storage I/O path with a certain level of detail. The good thing about working closely with Engineering is that you get input on a lot of things day by day that you would hardly find out on your own. That input made me research the topic more deeply, which finally ended in this blog article. Since ESXi 6.0 U1 there is now the VAIO framework (vSphere APIs for IO Filtering), but I will get into this in another blog article later. VAIO helps vendors intercept I/O in a more VMware-controlled way, via a filter driver running in a user world context. There are not that many vendors at this point who even sell a GA VAIO product, so VAIO is still pretty new. Why am I saying that? Because PernixData's (now Nutanix) flagship product FVP and the storage analytics software Architect integrate into the Pluggable Storage Architecture (PSA) of ESXi. There are also VVols, which are a big thing and make storage management much easier, but most environments I see every day in my role as Technical Support Engineer are only starting to think about VVols, or will think about them when the next infrastructure comes into play. VMware already did a very good job with policy-driven storage migrations before VVols, so many customers I see are absolutely happy and will eventually think about VVols later.


Every I/O from a virtual machine starts in a user world context, using different threads (worlds) for different things like vCPU, VMX, MKS and SVGA. To understand the difference between user and kernel mode, you should read my first ever article: Kernel vs. UserMode. The manager of the VM itself is the VMM (Virtual Machine Monitor).

The I/O then basically flows through the VMkernel, taking different junctions depending on what action happened or which storage implementation is in use (a short command sketch follows the list below). This could be:

  • Block Storage:
    • Fibre Channel (FC)
    • Internet Small Computer Systems Interface (iSCSI)
    • Fibre Channel over Ethernet (FCoE)

  • File Storage:
    • Network File System (NFS)
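
If you want to see which of these storage implementations are actually in use on a host, a quick look from the command line helps. A minimal sketch using esxcli (nothing array-specific assumed):

  # List the storage adapters (HBAs) and the drivers behind them (FC, iSCSI, FCoE, local)
  esxcli storage core adapter list

  # List the VMkernel interfaces used for IP-based storage (iSCSI, NFS)
  esxcli network ip interface list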

As every I/O starts at the VMM level, we will skip this part and explain the I/O flow from the vSCSI frontend onwards. There are interesting details about how the I/O gets from the user world into the vSCSI frontend, and how things work differently with paravirtual implementations like VMXNET or PVSCSI, which you can find here.

ESXi I/O Flow

  • vSCSI frontend:
    • The I/O gets into either the Flat backend (VMDKs) or the RDM backend.
    • The I/O then travels through the FSS (File System Switch) and from there reaches either the DevFS (snapshot-related I/O), the VMFS layer or NFS.
    • In the case of NFS it goes into the SUNRPC implementation of ESXi and uses the TCP/IP stack as well as a VMkernel interface (vmk) to proceed with the I/O.
    • In the case of VMFS the path proceeds through the FDS (File Device Switch), which then splits again into Disk (normal non-snapshot I/O), the Snapshot Driver or the LVM (Logical Volume Manager).
      • Disk: Normal VMDK I/O.
      • Snapshot Driver: when the VM triggers a snapshot or is running on a snapshot, every I/O goes through the FSS again, as the chain of snapshots is mounted through the DevFS implementation.
      • LVM: any time a disk gets created, extended or spanned.
    • Now the I/O gets into the PSA (Pluggable Storage Architecture).
      • The Device Queue contains two areas:
        • The SCSI Device
        • The SCSI Sched Queue
      • The I/O gets to a point where, in most cases, it flows into the NMP (Native Multipathing Plugin) or into the only other existing complete plugin implementation, PowerPath from EMC. We will focus here on the NMP.
        • SATP (Storage Array Type Plugin) - the control plane, which handles the coordination with the storage system.
        • PSP (Path Selection Policy) - the data plane implementation, which handles the I/O to the storage itself.
      • Finally the I/O gets into the SCSI midlayer, which manages the physical HBAs, queuing and SCSI error handling. For further information regarding the SCSI midlayer in ESXi, please follow this link: VMware SAN System Design and Deployment Guide, page 58. A quick way to inspect these layers from the command line is sketched below.
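
As referenced above, the layers from the PSA downwards can be inspected with esxcli. A minimal sketch; the naa.* identifier is a placeholder for one of your own devices:

  # Show the SATP and PSP the NMP has selected per device
  esxcli storage nmp device list

  # Show every physical path (adapter -> target -> LUN) behind those devices
  esxcli storage core path list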

The following figure shows the I/O flow in detail and how the different parts are connected to each other.

Figure 1: ESXi Storage Implementation

VMM:

Virtual Machine Monitor (Emulates the x86 CPU, SCSI controller etc.)

Shared Ring:

Shared queue between the VMM and the VMkernel. Both can access the shared ring without performance penalties; at any point in time either the VMM or the VMkernel accesses the shared ring. The size of the shared ring determines the number of outstanding I/Os. As soon as the VMkernel sees an I/O in the shared ring, it dequeues it to the vSCSI frontend. For further information please refer to the following KB.
Source: VMware KB 2053145.
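
Since the size of the shared ring determines the number of outstanding I/Os, guests using the PVSCSI adapter can raise their ring and queue sizes. A hedged sketch for a Linux guest (vmw_pvscsi module parameters; the values are examples only, check the KB for what your version supports):

  # Linux guest: raise the PVSCSI per-LUN queue depth and ring size
  echo "options vmw_pvscsi cmd_per_lun=254 ring_pages=32" > /etc/modprobe.d/pvscsi.conf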

vSCSI:

SCSI frontend of the VMkernel; it takes SCSI requests and virtualizes these I/O requests into file I/O requests.
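
Because the vSCSI frontend is the layer where guest SCSI requests enter the VMkernel, it is also the layer at which vscsiStats collects its histograms. A quick sketch (the world group ID is a placeholder, take it from the -l output):

  # List running VMs with their world group IDs and virtual disk handles
  vscsiStats -l

  # Start collecting for one VM, print the latency histogram, then stop
  vscsiStats -s -w 12345
  vscsiStats -p latency -w 12345
  vscsiStats -x -w 12345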

Flat Backend:

Emulates a file.

RDM Backend:

Pointers/symbolic links to raw LUNs. The device identifier is embedded in the link.
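
The RDM pointer file itself is created with vmkfstools; a hedged example with placeholder device identifier and paths:

  # -z creates a physical compatibility RDM, -r a virtual compatibility RDM
  vmkfstools -z /vmfs/devices/disks/naa.600000000000000000000000000000001 \
             /vmfs/volumes/datastore1/testvm/testvm_rdm.vmdk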

FSS (File System Switch):

Can be used to write a file system driver. APIs are not public.

DevFS:

Very similar to the device file system of any UNIX-based OS. It emulates real and non-real devices.

VMFS:

Single module that implements both VMFS 3 & 5.

NFS:

Implements the NFS protocol.
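
To see which of these file system implementations back your datastores, the mounted volumes can be listed with their type and version. The datastore name below is a placeholder:

  # Lists VMFS volumes (including the VMFS version) and NFS mounts
  esxcli storage filesystem list

  # Detailed VMFS and extent information for a single datastore
  vmkfstools -Ph /vmfs/volumes/datastore1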

FDS (File Device Switch - Block Storage):

To switch between Disk, SnapShot Driver and LVM.

Disk:

Emulates all real disk based block devices. Talks directly to the storage stack.

SnapShot Driver:

Implemented as a filter driver. Because of the filter driver, the I/O goes back through the FSS and the disk chain is mounted into DevFS.

LVM (Logical Volume Manager):

Any time a volume is created, removed or spanned, the LVM is involved. The LVM can concatenate volumes and present them as a single logical volume to the upper layer. Formatting a VMFS datastore works in the following way:

  • First a logical volume is created with the LVM.
  • Then VMFS is formatted on top of this volume.

The reason for this is the ability to extend/span a datastore across different LUNs. The LVM is also used for resignaturing: e.g. a snapshot copy of a LUN has the same UUID as the original, so the LVM has the possibility to resignature it.
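
The extents the LVM has concatenated, and the resignaturing of a snapshot copy, can both be handled from the command line. A hedged sketch; the volume label is a placeholder:

  # Show the LUN extents behind each VMFS volume
  esxcli storage vmfs extent list

  # List unresolved VMFS copies (e.g. a snapshotted LUN with the same UUID)
  esxcli storage vmfs snapshot list

  # Resignature such a copy so it can be mounted alongside the original
  esxcli storage vmfs snapshot resignature -l datastore1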

PSA (Pluggable Storage Architecture):

SCSI Device:

Every local or remote device has a SCSI device structure (no more than 256 LUNs are possible per target).

SCSI Schedule Queue:

The storage stack implements a proportional share scheduler (shares). SIOC (Storage I/O Control) does this on a cluster basis, but it also uses the SCSI scheduler queues. The number of shares is relative to the number of all SCSI disks.
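
Both structures can be looked at per device. A minimal sketch with a placeholder device identifier (the -O option requires ESXi 5.5 or later):

  # Per-device details, including "No of outstanding IOs with competing worlds"
  esxcli storage core device list -d naa.600000000000000000000000000000001

  # Adjust the scheduler limit for outstanding I/Os of that device (example value)
  esxcli storage core device set -d naa.600000000000000000000000000000001 -O 64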

SATP (Storage Array Type Plugin - Control Plane):

Contains the vendor-specific functionality, e.g. trespass: on an active-passive array a given SP (Service Processor, the controller of a storage system) is in charge of the LUN. The SATP also understands what happens when I/O is delivered via a different SP and then communicates with the other SP if it wants to take over that I/O. The SATP also handles the communication with the array in the case of a failure.
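
Which SATP is responsible for which array is controlled by claim rules. A quick way to look at this:

  # List all loaded SATPs and their default PSPs
  esxcli storage nmp satp list

  # Show the claim rules that map array vendors/models to a SATP
  esxcli storage nmp satp rule list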

PSP (Path Selection Policy - Data Plane):

The PSP is responsible for how reads and writes are handled towards the storage backend. There are three flavours of PSP: Most Recently Used (MRU), Fixed and Round Robin (RR); a short command sketch for checking and changing the PSP follows the list below.


  • Most Recently Used (MRU):  The VMW_PSP_MRU policy selects the first working path, discovered at system boot time. If this path becomes unavailable, the ESXi/ESX host switches to an alternative path and continues to use the new path while it is available. This is the default policy for Logical Unit Numbers (LUNs) presented from an Active/Passive array. ESXi/ESX does not return to the previous path if, or when, it returns; it remains on the working path until it, for any reason, fails.
  • Fixed (Fixed): The VMW_PSP_FIXED policy uses the designated preferred path flag, if it is configured. Otherwise, it uses the first working path discovered at system boot time. If the ESXi/ESX host cannot use the preferred path or it becomes unavailable, the ESXi/ESX host selects an alternative available path. The host automatically returns to the previously defined preferred path as soon as it becomes available again. This is the default policy for LUNs presented from an Active/Active storage array.
  • Round Robin (RR): The VMW_PSP_RR policy uses an automatic path selection, rotating through all available paths, enabling the distribution of load across the configured paths.

    • For Active/Passive storage arrays, only the paths to the active controller will be used in the Round Robin policy.
    • For Active/Active storage arrays, all paths will be used in the Round Robin policy.

    Source: VMware KB 1011340
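
As mentioned above, the PSP can be checked and changed per device or per SATP. A hedged sketch; the device identifier is a placeholder, and changing the SATP default only affects devices claimed afterwards:

  # Show the PSP currently used by a device
  esxcli storage nmp device list -d naa.600000000000000000000000000000001

  # Switch that device to Round Robin
  esxcli storage nmp device set -d naa.600000000000000000000000000000001 -P VMW_PSP_RR

  # Make Round Robin the default for a whole SATP (newly claimed devices only)
  esxcli storage nmp satp set -s VMW_SATP_DEFAULT_AA -P VMW_PSP_RR

  # Optionally tune the Round Robin IOPS limit (only if the vendor recommends it)
  esxcli storage nmp psp roundrobin deviceconfig set -d naa.600000000000000000000000000000001 --iops 1 --type iops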

I hope this blog article was helpful for you and, as always, if you have any questions please get in contact with me.
