
VMware ESX Memory Resource Management: Swap


Ishan Banerjee
VMware, Inc.
ishan@vmware.com
Philip Moltmann
VMware, Inc.
moltmann@vmware.com
Kiran Tati
VMware, Inc.
ktati@vmware.com
Rajesh Venkatasubramanian
VMware, Inc.
vrajesh@vmware.com

Abstract

As virtualization matures into a mainstream data center component, there is an increasing demand for improving consolidation ratios. Consolidation ratio measures the amount of virtual hardware placed on physical hardware. A higher consolidation ratio typically implies greater operational efficiency in the data center.

VMware® ESX® is a reliable and efficient hypervisor, enabling high consolidation ratios, reliable operation, and efficient utilization of hardware resources. The memory resource management (RM) system of ESX distributes hardware memory resources to virtual machines (VMs) in a fair and efficient manner. It provides certain resource management guarantees while ensuring the stability and reliability of ESX and powered-on VMs.

To enable high consolidation ratios in a reliable manner, the memory RM system enables memory overcommitment of ESX. When VMs whose total configured memory size exceeds the memory available to ESX are powered on, ESX is considered to be memory-overcommitted. The memory RM system balances memory distribution between powered-on VMs dynamically when ESX is memory-overcommitted as well as undercommitted. It utilizes the hypervisor-level swap system to reclaim unused memory from VMs.

The swap system is the backbone for enabling reliable memory overcommitment. It is an integral part of the memory RM system. It is integrated with VMware vMotion technology to enable seamless and efficient migration of VMs from one host to another. This article describes the swap system. It shows how swap spaces are created, configured, and managed on ESX.

General Terms:
memory management, memory reclamation, memory swapping
Keywords: ESX, memory resource management

1. Introduction

VMware ESX enables reliable, efficient operation of virtual machines (VMs) in a data center. The key enablers for achieving this are the VMware overcommitment technology and vMotion technology. ESX enables memory overcommitment and CPU overcommitment.

The VMware memory-overcommitment technology enables a user to power on VMs whose total configured memory size (virtual RAM, or vRAM) exceeds the total physical memory (pRAM) available to ESX. Memory overcommitment does not mean that ESX will magically provide powered-on VMs with more memory than that available to ESX. It provides memory to those VMs that need it most and reclaims memory from those VMs that are not actively using memory.

Memory reclamation is therefore a part of memory overcommitment. Memory reclamation reclaims memory from VMs that are not actively using it. The reclaimed memory is then given to a VM that needs it to perform better. At all times, the total memory given to all powered-on VMs remains less than or equal to the memory available to ESX.

In order to enable VMs to perform at their best, ESX must carefully determine which VM to reclaim memory from and which VM to give memory to. This decision is made by the memory management policy-maker of ESX, termed MemSched. Each powered-on VM of a configured memory size (memsize) has certain memory attributes attached to it: memory reservation, memory limit, and memory shares. These parameters, termed RLS parameters, are used by MemSched along with certain other dynamic attributes of the powered-on VM to decide on the memory that needs to be reclaimed from (or given to) the VM. A detailed description of the memory management policies implemented in MemSched is beyond the scope of this article.

MemSched decides on the memory that must be reclaimed from a VM. It then instructs the reclamation components of ESX to reclaim the targeted memory from the VM. The reclamation components that reclaim memory from a VM are: the transparent page sharing component, balloon component, compression component, and hypervisor-level swap component [7]. The reclamation components reclaim memory from VMs by using their respective methods and release the reclaimed memory to ESX. ESX can then allocate this memory to VMs that need it most.

The page sharing component attempts to reclaim memory from a VM when the contents of a memory page fully match the contents of another memory page in the same or a different powered-on VM on that ESX server. The two pages are collapsed into one page. Both VMs contain a reference to the single page and access it in a read-only manner. Thus, when two pages are shared into one page, one memory page is considered reclaimed. Similarly, many memory pages with the same content can refer to one read-only page with the same content. None of the VMs is aware that it is actually sharing a memory page with another VM. Sharing is completely transparent to the VMs.

ESX shares memory pages at a granularity of 4KB only. The page sharing component is active at all times. It continuously scans VMs for shareable pages and shares them whenever possible. A VM with many shared pages consumes fewer memory pages than the total amount of its configured memory space that has been mapped by the guest OS. The actual amount of savings depends on the number of physical pages backing the virtual memory pages and can be determined at runtime only. A VM with many shared pages exerts less memory pressure on ESX than one that does not have many shared pages, because it consumes less memory.
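
As a concrete illustration of this mechanism, the following Python sketch tracks 4KB pages by content hash and collapses identical ones into a single canonical copy. The hash table, function name, and collision handling are illustrative assumptions, not ESX internals.

    import hashlib

    PAGE_SIZE = 4096
    shared_pages = {}            # content digest -> canonical read-only page

    def try_share(page: bytes) -> bytes:
        """Return the canonical page if an identical one already exists;
        otherwise register this page as the canonical copy."""
        assert len(page) == PAGE_SIZE
        digest = hashlib.sha1(page).digest()
        canonical = shared_pages.get(digest)
        if canonical is not None and canonical == page:
            # full comparison guards against hash collisions; the page is
            # collapsed: one physical copy, multiple read-only references
            return canonical
        shared_pages[digest] = page
        return page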

The balloon component of ESX interacts with the ESX balloon driver inside the guest OS. The balloon component reclaims memory from a VM by instructing the balloon driver to allocate guest memory pages and pin them in guest memory. These pages are known to ESX as belonging to the balloon driver. ESX can then allocate those pages to other VMs. This approach effectively utilizes the guest OS’s memory reclamation logic to swap out cold memory pages inside the guest OS into the guest OS’s swap space.

The compression and swap stages operate in sequence. The compression component compresses memory pages before they are swapped out by the hypervisor-level swap component of ESX. If the compression ratio is good, then the page is retained as a compressed page. If the compression ratio is not good, then the page is swapped out to the hypervisor-level swap space of ESX. Compressing a page yields less than one page’s worth of effective reclamation. Swapping a page yields one reclaimed page.

These four memory reclamation techniques are employed to reclaim guest memory from a VM when required. This article describes the hypervisor-level swap component of ESX in vSphere 5.5. The rest of this article is organized as follows. Section 2 describes the memory reclamation mechanism of ESX, Sections 3 and 4 describe the swap component of ESX in the context of VMs, Section 5 describes the swap component in the context of user-worlds (UWs), Section 6 revisits the VM swap-space layout, and Section 7 completes this article with a closing discussion.

2. Background and Related Work

Efficient use of main memory continues to be a priority for OSs and hypervisors even though the cost of main memory continues to decrease. In this section, recent work related to memory reclamation in hypervisors is described. The necessity of memory reclamation in ESX is also shown.

2.1 Related Work

Contemporary OSs and hypervisors employ one or more mechanisms to reclaim memory from their memory consumers. Content-based page sharing is used in Linux/KVM and ESX; memory paging is used by most OSs and hypervisors (ESX, KVM, Hyper-V, Xen); and memory ballooning is used by ESX and Xen. The research community has also demonstrated memory reclamation techniques in Singleton [6], Satori [4], Difference Engine [2], KSM [1], CMM [5], and Ginkgo [3].

2.2 Background

MemSched is the ESX component that determines the memory to be given to each powered-on VM. It employs the four reclamation techniques—transparent page sharing, ballooning, compression, and hypervisor-level swapping—to reclaim memory from powered-on VMs when required.
When a powered-on VM allocates memory, ESX does not prevent it from doing so. However, if a VM consumes more memory than its fair share, then ESX (via MemSched) reclaims the amount in excess of its fair share.

ESX has four operational memory states: high, soft, hard, and low. These states are determined by the free memory available for allocation at a given time compared to the total memory of ESX. Table 1 shows a schematic representation of the memory states of ESX. ESX attempts to stay in the high state at all times. Whenever it dips into any of the other states, ESX reclaims memory from VMs until it enters the high state again. The values of the state thresholds depend on the total ESX memory. Each threshold consists of two sub-thresholds not exposed to the user. This prevents oscillation of the ESX memory state near each threshold.

Table 1. Free memory state transition threshold in ESX. (a) User visible. (b) Internal threshold to avoid oscillation.
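
The following Python sketch shows how two sub-thresholds per state can prevent oscillation: a state is entered below its lower sub-threshold and is not left until free memory rises past its upper sub-threshold. The fractions used here are illustrative placeholders; the actual ESX values are internal and depend on ESX memory.

    # (enter_below, exit_above) pairs of free-memory fractions (assumed values)
    THRESHOLDS = {
        "low":  (0.02, 0.022),
        "hard": (0.04, 0.044),
        "soft": (0.06, 0.064),
    }

    def next_state(current: str, free_fraction: float) -> str:
        for state in ("low", "hard", "soft"):       # most critical first
            enter_below, exit_above = THRESHOLDS[state]
            if free_fraction < enter_below:
                return state                         # entered via lower sub-threshold
            if current == state and free_fraction < exit_above:
                return state                         # hysteresis: stay until past exit
        return "high"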

The memory allocation and reclamation behavior of ESX depends on the state of ESX. Table 2 lists the actions ESX initiates at different states.

Table 2. Actions performed by ESX in different memory states.

In the high memory state, ESX attempts the page sharing reclamation technique only. Page sharing works on small (4KB) pages only. If a VM contains large (2MB) pages, then the page sharing technique ignores them for the time being. This is a passive reclamation mode whereby ESX opportunistically attempts to reclaim memory from VMs without affecting their performance.

In the soft memory state, ESX attempts to actively reclaim memory from VMs. It computes and sets a balloon target for VMs that might have consumed more than their share of memory. The balloon drivers in those VMs expand, releasing memory to ESX. This process continues until ESX enters the high state. ESX is dependent on the balloon driver and the guest OS to reclaim memory using the ballooning technique. If the balloon driver is not running, or the guest OS is not responsive, or the guest OS is unable to allocate memory to the balloon driver, then no memory can be reclaimed using this technique. Ballooning takes place in addition to page sharing.

If memory reclamation using ballooning is not fast enough and VMs continue to allocate memory, then ESX can enter the hard memory state. In this state, ESX attempts to actively and aggressively reclaim memory from VMs. ESX computes and sets a swap target for VMs that are consuming memory in excess of their fair share. The swap component of ESX attempts to swap out the targeted memory pages from those VMs.

Figure 1. Steps performed by ESX when swapping out a guest memory page.

For each guest memory page selected for swapping out, ESX opportunistically attempts to share it. If the page is shareable, it is shared instead of being swapped out. If the page is not shareable, ESX attempts to compress it. If it compresses with a good compression ratio, the page does not need to be swapped out. If the page is not compressible, then ESX swaps the page out to its swap space. If a VM contains a large (2MB) page, ESX converts it into small (4KB) pages before attempting to share or compress it. Therefore, if a VM contains many large pages that were not shared earlier, they undergo an attempt at sharing at this time.
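
The per-page flow of Figure 1 can be sketched as follows, assuming zlib as a stand-in compressor, an assumed good-ratio threshold of half a page, and caller-supplied hooks for sharing and swap writes; none of these names are ESX internals.

    import zlib

    SMALL, LARGE = 4096, 2 * 1024 * 1024

    def split_large_page(page: bytes):
        """Break a 2MB large page into 4KB small pages."""
        return [page[i:i + SMALL] for i in range(0, len(page), SMALL)]

    def swap_out(page: bytes, is_shareable, write_to_swap):
        pages = split_large_page(page) if len(page) == LARGE else [page]
        for p in pages:
            if is_shareable(p):                      # step 1: share instead of swap
                continue
            if len(zlib.compress(p)) <= SMALL // 2:  # step 2: assumed "good" ratio
                continue                             # kept as a compressed page
            write_to_swap(p)                         # step 3: swap to the swap space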

If VMs allocate memory at a rate faster than can be swapped out, ESX can enter the low memory state. In this state, memory is critically low and ESX makes every attempt to prevent memory exhaustion. This is typically done by preventing VMs from allocating memory while swapping memory away from VMs. A VM attempting to allocate memory will be blocked from allocation until enough memory has been reclaimed from it.

Among the four memory reclamation techniques—page sharing, ballooning, compression, and hypervisor-level swapping—the first three are opportunistic. They depend on factors outside the control of ESX for reclaiming memory from a VM. Page sharing and compression depend on the contents of a guest memory page. Ballooning depends on the balloon driver and the stability of the guest OS. The fourth method, hypervisor-level swap, is the only guaranteed method for reclaiming memory from a VM. In the event that the first three methods are not effective or fast enough, ESX swaps out memory from a VM in order to maintain its stability. The ESX swap system is therefore designed to guarantee reclamation of memory from VMs under all circumstances. The swap space is created such that it is not exhausted. In addition, memory management policies and mechanisms of ESX ensure that memory can always be successfully reclaimed from VMs when required.

A swap-out operation is defined as reclamation of guest memory pages by ESX, using hypervisor-level swapping, into a suitable swap space attached to ESX. A swap-in operation is defined as reading a swapped-out memory page from the swap space when it is accessed by the VM. Swap-out operations are typically asynchronous to the execution of the VM and guest OS. Swap-in operations are always synchronous with the access made by the VM or the guest OS. If a swap-in operation fails owing to storage device failure, then the corresponding VM is terminated immediately.

In the next section, a detailed description of the swap component is presented.

3. Swap Without vMotion
ESX is designed to guarantee memory reclamation by swapping out guest memory pages under all conditions. This ensures that ESX never runs out of memory when overcommitted and under memory pressure from guest workloads. ESX can reclaim memory from VMs (see section 4) as well as native processes, known as user-worlds (UWs) (see section 5), by hypervisor-level swapping. This section describes memory reclamation, by swapping, from VMs, in the absence of vMotion.

3.1 Simple VM
ESX creates a swap space for each powered-on VM. Powered-off VMs do not require a swap space. The swap space for a powered-on VM is created in the form of a fully allocated (also known as thick) per-VM swap file. There is one swap file for each powered-on VM. All storage blocks of the file are allocated from the underlying storage space before the VM successfully powers on. The location of this swap space defaults to the same location as the .vmx configuration file for the VM. This swap file is used for reclaiming memory from the corresponding VM only.

Figure 2(a) shows the schematic representation of an ESX with one powered-on VM. The VM is linked to its per-VM swap space. This swap space is distinct from the swap space created by the guest OS inside the VM. ESX does not have any control over the swap space created by the guest OS.

Memory reclaimed from this VM by means of swapping is placed in this swap space. Figure 2 (b) shows the swap space created for a single powered-on VM.

Figure 2. (a) Schematic Representation of a VM Linked to Its Per-VM Swap Space. (b) Swap Space Consumed by a Powered-on VM. The VM has a configured memory size of 1GB and a configured memory reservation of 256MB.

The size of the swap space for this VM, at power-on, is given by

     swapsz = memsize − memory reservation     (1)

where memsize is the configured guest memory size for the VM and memory reservation is the corresponding setting for the VM. For the example in Figure 2, a swap space of size 768MB is created. ESX guarantees that memory up to the memory reservation will always be present to back the virtual memory of the VM. Virtual memory in excess of this amount can be reclaimed from the VM by ESX, if required. The swap space is sized such that there are always enough storage blocks to back the reclaimed memory pages. These storage blocks are allocated in the file system before declaring the VM as powered-on. The actual space consumed from the swap space can be less than its size; it depends on the amount of memory reclaimed from the VM by swapping.

When the memory reservation of a powered-on VM is reduced, the VM’s swap space is recalculated and extended if required, using Equation 1. However, when a powered-on VM’s memory reservation is raised, the swap space is not reduced. This can result in a swap space that is larger than required. A subsequent reduction in memory reservation might or might not result in a change in the swap space.
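
A small worked sketch of Equation 1 and the grow-only resize rule, with sizes in MB; the class is an illustration, not an ESX interface.

    class VMSwapFile:
        def __init__(self, memsize_mb: int, reservation_mb: int):
            self.memsize_mb = memsize_mb
            self.size_mb = memsize_mb - reservation_mb       # Equation 1

        def on_reservation_change(self, new_reservation_mb: int):
            required = self.memsize_mb - new_reservation_mb  # Equation 1 again
            if required > self.size_mb:
                self.size_mb = required    # extend when the reservation drops
            # a raised reservation never shrinks the file

    swap = VMSwapFile(memsize_mb=1024, reservation_mb=256)
    print(swap.size_mb)                    # 768, as in Figure 2
    swap.on_reservation_change(0)          # reservation reduced to zero
    print(swap.size_mb)                    # 1024: the file is extended
    swap.on_reservation_change(512)        # reservation raised
    print(swap.size_mb)                    # still 1024: never reduced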

Typically, when memory is reclaimed by means of hypervisor-level swapping, the execution of the guest OS is not stalled. This means that the guest OS and workload execution are not blocked by the write operation of memory pages to the swap space. However, when the guest OS accesses a memory page that has been swapped out by ESX, ESX handles the access by reading in the swapped page from the swap space. The guest vCPU generating the access waits during this time. This waiting time can adversely affect guest performance.

Figure 3 shows a schematic representation of a VM whose memory pages have been reclaimed using different techniques. Figure 3(a) shows the state of the VM before pages were reclaimed by hypervisor-level swapping. In this stage, some memory pages are already shared. The memory saved by sharing is shown as P. Some memory has also been reclaimed by ballooning, shown as B. In this stage, the memory pages being consumed by the VM exceed its entitlement. Also, ESX is in the low state. This triggers memory reclamation by swapping. In Figure 3(b), memory has been reclaimed by swapping using the steps in Figure 1. During the swapping process, some pages were found to be compressible and hence compressed, shown as C; some pages were found sharable and hence shared.

Pages that could not be compressed or shared were swapped to the per-VM swap space. This is shown as S.

Figure 3. State of a VM Before and After Reclaiming Pages by Swapping. (a) State of the VM in the low ESX state before memory reclamation by swapping. (b) Memory reclaimed by swapping; some more pages have been shared or compressed. (c) In the presence of global SSD swap space, swapped pages are written to the SSD. (d) In the presence of global SSD swap space, pages are written to the SSD and overflow into the per-VM swap space.

The scenario described in this subsection shows one VM powered on in ESX. When ESX needs to reclaim memory from this VM, it will follow the steps shown in Figure 1.

3.2 VM with Solid State Device
The scenario described in this subsection is similar to the one in subsection 3.1. In addition, the ESX server on which this VM is powered on has an attached solid state device (SSD). ESX takes advantage of an SSD to enable faster swap operations. Read operations from an SSD are several orders of magnitude faster than read operations from a spinning disk. Hence, the swap-in operation of a guest memory page takes place much faster when reclaimed memory is placed on an SSD.

The SSD is a global resource to ESX. ESX can use a part of the SSD to reclaim memory pages from VMs. ESX can be configured to create a global swap space on the SSD. All VMs share this swap space. The amount of storage space that ESX can consume from the SSD as a global swap space is configured by the user.

The size of the swap space created on the SSD is user-defined. It is not related to any attribute of ESX or VMs.

Figure 4 shows a schematic representation of the swap space when ESX uses an SSD. The SSD is used in addition to the per-VM swap space. For a given VM, the SSD is a global shared resource, whereas its per-VM swap space is a private resource. The global shared swap space on the SSD is available for reclaiming memory from a VM on a first-come-first-served basis.

Figure 4. Schematic representation of an ESX with attached SSD and two powered-on VMs.

Figure 5 shows the steps when ESX swaps out a memory page from a VM in the presence of swap space on an SSD. This figure expands on the swap step from Figure 1. ESX first attempts to swap out the guest memory page to the global swap space on the SSD. If the swap space on the SSD is full, then ESX swaps out the page to the per-VM swap space.

Figure 5. Steps performed by ESX when swapping out a guest memory page when swap space on an SSD is configured.

ESX does not guarantee swap space on the SSD to a VM, because its size is configured by the user and space from the SSD is allocated to VMs on a first-come-first-served basis. However, swap space is always available in the per-VM swap space. Figure 3(c) shows the swapped pages being written to the global SSD swap space. In this case there is enough space on the SSD to accommodate all swapped pages for the VM. In Figure 3(d) the global SSD space has been consumed by other VMs. There is not enough space for the VM in this figure. Hence excess pages overflow into the per-VM swap space. If the swap space on an SSD becomes full, then swapped memory from a VM is written directly into the VM's per-VM swap space. This ensures reliable operation of the VM.
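
The SSD-first placement of Figure 5 reduces to a simple fallback rule. In this hypothetical sketch, a fixed-capacity pool stands in for the global SSD swap space and callbacks stand in for the actual writes.

    class SsdSwapPool:
        """Stand-in for the global, user-sized SSD swap space."""
        def __init__(self, capacity_pages: int):
            self.capacity = capacity_pages
            self.used = 0                  # shared, first-come-first-served

        def try_reserve(self) -> bool:
            if self.used < self.capacity:
                self.used += 1
                return True
            return False                   # pool is full

    def place_swapped_page(pool: SsdSwapPool, write_ssd, write_pervm):
        if pool.try_reserve():
            write_ssd()                    # fast path: global SSD swap space
        else:
            write_pervm()                  # guaranteed path: per-VM swap file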

However, it can be noted that memory pages residing on the swap space on the SSD for a long time are probably cold memory pages and might not be required by the VM in the near future. The incoming pages to the SSD, which are being swapped out, are likely hotter than those already on the SSD.

ESX recognizes this effect. Therefore, when the free space on the SSD swap space falls below a watermark, ESX actively transfers swapped memory pages from the SSD swap space into the per-VM swap space of the corresponding VM. As a result, pages residing on the SSD swap space are typically hotter than those residing on the per-VM swap space. Therefore, if the hot pages are subsequently accessed by the VM, they are more likely to be found on the SSD swap space than on the per-VM swap space. This results in a faster swap-in operation. Figure 6 illustrates the transfer of swapped guest memory pages from the SSD swap space to the per-VM swap space when the SSD swap space is almost full. This step is an optimization. If ESX is not quick enough to transfer pages, then the incoming pages overflow to the per-VM swap space.
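
A sketch of the watermark-driven transfer, assuming SSD-resident pages are kept in least-recently-swapped order so the oldest (likely coldest) pages are moved first; the watermark value and data structures are illustrative.

    from collections import deque

    def rebalance_ssd(ssd_lru: deque, capacity: int, move_to_pervm,
                      watermark: float = 0.1):
        """ssd_lru holds (vm, page) entries, oldest (coldest) at the left.
        Move pages out until free SSD space rises above the watermark."""
        free = capacity - len(ssd_lru)
        while free / capacity < watermark and ssd_lru:
            vm, page = ssd_lru.popleft()   # oldest pages are likely coldest
            move_to_pervm(vm, page)        # background copy, off the swap-in path
            free += 1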

Figure 6. (a) The SSD swap space is almost full. This triggers a transfer of swapped memory pages to the per-VM swap space. (b) Pages have been transferred to the per-VM swap space such that more space is available on the SSD swap space.

The scheme described in Figure 3(c) is desirable in the presence of memory reclamation by hypervisor-level swapping. This is because an SSD is typically several orders of magnitude faster than magnetic disk storage. Hence swap read and write operations on an SSD are fast. This results in reduced latency when a swapped page is being read in from the SSD swap space, compared to reading it in from the per-VM swap space.

This section described swapping in the absence of vMotion. The next section describes swapping in the presence of vMotion.

4. Swap with vMotion

vMotion is VMware technology that migrates a powered-on VM from one ESX server to another with no downtime. ESX memory resource management technology is integrated with its vMotion technology and provides seamless performance during migration. ESX supports two types of migration: shared-swap vMotion1 and unshared-swap vMotion. The swap space is treated differently in these two types of vMotion. They are described in the following subsections. The size of the swap space is given by Equation 1.

4.1 Shared-Swap vMotion
Shared-swap vMotion takes place when a powered-on VM migrates from one ESX server to another such that the per-VM swap file of the source and the destination VM is the same. The per-VM swap file is stored on a shared storage device. It is accessible from the source VM before and during the vMotion. It is subsequently accessed by the destination VM. The presence of a shared storage device between the source and destination ESX server does not by itself make a vMotion a shared-swap vMotion. A vMotion is termed shared-swap vMotion if and only if the swap file being used by the source VM is the same as the one to be subsequently used by the destination VM.
Figure 7 shows the steps during shared-swap vMotion. In these steps, a VM is migrated from a source ESX (left) to a destination ESX (right). The two ESX servers share a common storage device. The swap file of the VM is located on this shared device.

In Figure 7(a), which is the initial state, the source VM on the source ESX server is linked to its per-VM swap file (termed regular swap file). Because the VM is actively writing to this file, it is difficult for another VM on another ESX server to write to this file without complex synchronization schemes. Hence, for simplicity, ESX allows only one VM to access the per-VM swap file.

1 The term ‘vMotion’ refers to “VMware vSphere® vMotion®” technology.

In Figure 7(b), the source ESX server has initiated migration of the VM to the destination ESX server. The destination ESX server responds by creating a new VM with the same configuration (.vmx) as the source VM. This VM is not running yet. ESX has simply set up its state in preparation for transferring the state of the source VM.

Figure 7. Schematic Representation of Migrating a VM from a Source (Left) to a Destination (Right) ESX Server. The regular swap space is located in a swap file on a shared storage device.

The new destination VM cannot open the shared swap file on the shared storage device, because the source VM is actively using it. Instead, the destination VM creates a temporary migration swap file. If the destination ESX needs to reclaim memory from the destination VM by means of hypervisor-level swapping, then the swapped memory is written into the migration swap file.

In Figure 7(c), the source ESX server transfers the memory, CPU, and other states of the source VM to the destination ESX server. The destination ESX server sets up the state for the destination VM. During this step, the guest memory content of the source VM is transferred to the destination ESX server. This consumes memory pages on the destination. As a result, the destination ESX server can experience memory pressure and might need to reclaim memory from the destination VM. At this stage, the reclamation takes place by means of swapping only. The swapped memory is written into the migration swap file. During this step, the source VM is executing the guest OS. During the transfer of state from the source ESX server to the destination ESX server, guest memory that was swapped out to the swap space from the source VM is not transferred explicitly. It remains on the swap space. Subsequently, when the destination VM starts execution, it gains access to the swapped pages on this swap space.

In Figure 7(d), the source VM's state has been completely transferred to the destination and is ready to be deleted on the source ESX server. The destination VM, after receiving complete state information from the source, has started executing the guest OS. In addition, the source VM has released access to the shared swap file. The destination VM has opened the shared swap file as its own swap file. From Figure 7(c), the destination VM might have swapped out guest memory pages to the migration swap file. The destination VM thus has two swap files, potentially with swapped pages in both of them. Because the regular swap file has enough space to back the unreserved memory of the VM, the temporary migration swap file is no longer required. In this step, ESX reads in all the swapped pages from the migration swap file and can swap them out to the regular swap file.

In Figure 7(e), the contents of the migration swap file have been completely read into the VM. It is no longer required and is hence deleted. The VM has been completely migrated from the source ESX server to the destination ESX server. It is using the swap file on the shared device as its swap space.
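
The drain step of Figures 7(d) and 7(e) can be sketched as follows, with a plain dictionary standing in for the temporary migration swap file; this illustrates the described behavior rather than ESX code.

    def drain_migration_swap(migration_swap: dict, vm_memory: dict):
        """Read every page back into VM memory; pages may later be swapped
        out again, this time to the regular swap file."""
        for page_num, contents in list(migration_swap.items()):
            vm_memory[page_num] = contents
            del migration_swap[page_num]
        assert not migration_swap      # empty: the temporary file can be deleted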

Shared-swap vMotion is a mechanism for migrating a powered-on VM from a source ESX server to a destination ESX server when the swap space of the source and destination VM is the same and is located on a shared storage device.

4.2. Unshared-Swap vMotion
Unshared-swap vMotion takes place when a powered-on VM is migrated from a source ESX server to a destination ESX server such that the per-VM swap space of the source VM is not the same as the swap space of the destination VM. It is possible for the two swap spaces to reside on shared storage devices, or even on the same shared storage device. However, during and after the vMotion process the destination VM creates and uses its own swap space. The size of the swap space is given by Equation 1.

Figure 8 shows the steps during unshared-swap vMotion. In these steps, a VM is migrated from a source ESX server (left) to a destination ESX server (right). The per-VM swap file of the VM on each of the source and destination ESX servers is accessible to the respective VM only.

Figure 8. Schematic representation of migrating a VM from a source (left) to a destination (right) ESX. The swap space of the source and destination VM are not on a shared storage device.

 

In Figure 8(a), which is the initial state, the source VM on the source ESX server is linked to its per-VM swap file (termed regular swap file). In Figure 8(b), the source ESX server has initiated a migration of the VM to the destination ESX server. As with shared-swap vMotion, the destination ESX server creates a new VM and prepares to receive the source VM's state from the source ESX server. The destination VM is linked to its own swap space, which is the permanent swap space for the destination VM. It opens and has full access to this swap space in this step.

In Figure 8(c), the source ESX server transfers the VM’s state to the destination ESX server. Swapped pages from the source VM are explicitly transferred to the destination ESX server. A page that was swapped on the source VM enters the destination VM as an in-memory page. This can subsequently be swapped to the destination swap file, if the destination ESX server decides to reclaim this page from the destination VM. During this step, the source VM is executing the guest OS.

In Figure 8(d), the vMotion is complete. All state is transferred to the destination ESX server. The source VM is deleted. The destination VM continues to use the swap space as its permanent swap space.

Unshared-swap vMotion is a mechanism for migrating a powered-on VM from a source ESX server to a destination ESX server such that the source and destination VMs do not have the same swap file.

5. User-World Swap

Native processes in ESX are termed user-worlds (UWs). A UW is very similar to a Linux process. ESX can execute applications as a UW. Typical examples for UWs are ESX agents such as hostd and vpxd, third-party monitoring agents, and the vmx that encapsulates the execution of a VM in ESX.

A UW can be swapped out by ESX when ESX is under memory pressure. When created, a UW is tagged with one of three types: non-swappable, swappable to a private swap file, or swappable to the system swap file (a sketch of these modes follows the list):

  • No swap – A UW started without linking to a swap space becomes non-swappable. All of its memory pages are considered pinned.
  • Private swap – A UW can be started with its own private swap file. In this mode, the UW is exclusively linked to a file marked as its private swap file.
  • System swap – ESX can be configured to create a global swap file for UWs. This swap file, termed the system swap file, can be created after ESX has booted. Its location can be specified by the user, and its default size is 1GB. Only one system swap file for UWs can be created on an ESX server. When a UW is started without explicitly specifying a swap space, the UW is automatically linked to the system UW swap file.
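
The three modes can be summarized in a short sketch; the system swap path and the linking rule below are hypothetical stand-ins for the behavior described above.

    from dataclasses import dataclass
    from typing import Optional

    # hypothetical system swap configuration; None when not configured
    SYSTEM_SWAP = {"path": "/scratch/uwswap", "size_gb": 1}

    @dataclass
    class UserWorld:
        name: str
        private_swap: Optional[str] = None     # path to a private swap file

        def swap_target(self):
            if self.private_swap:
                return {"path": self.private_swap}   # private swap mode
            if SYSTEM_SWAP is not None:
                return SYSTEM_SWAP                   # default: the system swap file
            return None                              # no swap: all pages are pinned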

 

Figure 9. An ESX with UWs in different swap modes.

Figure 9 shows an ESX server with UWs in different swap modes. In this example, the UW smallapp does not have a swap file; it is non-swappable. The UW bigapp has its own private swap file. The ls UWs are attached to the system swap file.

6. Revisiting VM Swap

For simplicity of discussion, the previous sections have omitted aspects of the swap space layout of a VM. In this section, these aspects are discussed in greater detail.

6.1 VM Swap Layout
A VM is always linked to its private swap space. In addition, memory pages from a VM can be swapped out to a global swap space created on a SSD. A UW, on the other hand, has three modes: no swap, private swap, and system swap.
A VM in ESX consists of two parts: the vmx process, which is a UW, and the virtual machine monitor with its corresponding guest address space. The former, being a UW, can potentially possess one of the three swap modes for a UW. The latter follows the swapping mechanism described in the preceding sections for VMs.

ESX is configured to assign the private swap file mode to the vmx process. Hence, when a VM is powered on, two swap files are created: a private swap file for the vmx process and a private swap file for the VM’s guest address space. Figure 10 shows the complete swap space that a VM is linked to. In Figure 10, two VMs are powered on. Each VM has its own private swap space for its vmx process. Each VM also has a private swap space for the guest address space. In addition, ESX is configured with a SSD. The SSD is available for storing swapped pages from both VMs.

Figure 10. Schematic Representation of an ESX Server with Two Powered-on VMs. Each VM has its per-VM swap space. The vmx process for each VM also has its private swap file. The ESX is configured with an SSD.


6.2 Host-Local Swap Space

Typically, a VM has its own private swap space. This swap space is located in the same directory as the VM’s virtual disk, its configuration file, and other meta-information.

ESX allows the VM's swap file to be located elsewhere. ESX provides a global setting, the host-local swap directory, in which each VM on that ESX server can store its private swap file. The host-local setting is not set by default. When it is set, all subsequent power-on operations use the host-local swap directory to create the private swap file for the VM. Figure 11 shows a schematic representation of ESX configured with a host-local directory. VMs powered on after this location is set have their swap space created in the host-local directory. This option does not affect the location of the SSD swap space or the UW system swap space.

Figure 11. Schematic representation of the host-local swap directory setting. When set, a VM's private swap file is created in this directory. This setting does not affect already powered-on VMs, the SSD swap space, or the UW system swap space.

6.3 VM Checkpoint Operation
When a VM is checkpointed, using the suspend or snapshot operation, memory pages of the guest OS are written into a checkpoint file, corresponding to the suspend or snapshot operation. During this process, ESX reads the swapped memory pages from the VM’s swap space before writing them to the checkpoint file. Reading the swapped pages can impact the performance of the checkpoint operation.

7. Conclusion

This article describes guest memory reclamation in ESX using hypervisor-level swapping. Swapping memory from a VM enables reliable overcommitment of ESX. ESX uses per-VM swap files for each powered-on VM. When an SSD is attached to ESX, ESX can carve out a global swap space from it to enable fast swap operations.

Acknowledgments

Memory overcommitment in ESX was designed and implemented by Carl Waldspurger [7]. The initial swap system in ESX was implemented by Mahesh Patil. Thanks to Mukheem Ahmed for reviewing this article.

References

  1. A. Arcangeli, I. Eidus, and C. Wright. Increasing memory density by using KSM. In Proceedings of the Linux Symposium, pp. 313–328, 2009.
  2. D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference engine: harnessing memory redundancy in virtual machines. Commun. ACM, 53(10):85–93, Oct. 2010.
  3. M. Hines, A. Gordon, M. Silva, D. Da Silva, K. D. Ryu, and M. Ben-Yehuda. Applications Know Best: Performance-Driven Memory Overcommit with Ginkgo. In Cloud Computing Technology and Science (CloudCom), 2011 IEEE Third International Conference on, pp. 130–137, Dec. 2011.
  4. G. Milos, D. Murray, S. Hand, and M. Fetterman. Satori: Enlightened page sharing. In Proceedings of the 2009 USENIX Annual Technical Conference. USENIX Association, 2009.
  5. M. Schwidefsky, H. Franke, R. Mansell, H. Raj, D. Osisek, and J. Choi. Collaborative Memory Management in Hosted Linux Environments. In Proceedings of the Linux Symposium, 2006.
  6. P. Sharma and P. Kulkarni. Singleton: system-wide page deduplication in virtual environments. In Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, HPDC’12, pp. 15–26, New York, NY, 2012. ACM.
  7. C. A. Waldspurger. Memory resource management in VMware ESX Server. SIGOPS Oper. Syst. Rev., 36(SI):181–194, Dec. 2002.

 


Statistical Normalcy Determination Based on Data Categorization


Mazda A. Marvasti
VMware, Inc.
mazda@vmware.com

Arnak V. Poghosyan
VMware, Inc.
apoghosyan@vmware.com

Ashot N. Harutyunyan
VMware, Inc.
aharutyunyani@vmware.com

Naira M. Grigoryan
VMware, Inc.
ngrigoryan@vmware.com

Abstract

We introduce a statistical learning Normalcy Determination System (NDS) for data-agnostic management of monitoring flows. NDS performs data categorization with analysis tools that identify category-specific normalcy bounds in terms of dynamic thresholds. This information can be applied further for anomaly detection, prediction, capacity planning, and root-cause analysis.

Keywords: monitoring, time series data, statistical process control, normalcy determination, dynamic thresholding, data categorization, parametric and non-parametric statistics.

1. Introduction

Today’s business and IT management face the problem of “big infrastructures” with millions of monitored metrics that need to be efficiently analyzed to gain valuable insights for underlying system control. The concept of control for ensuring the quality of services of different systems originates from the ideas of statistical process control charts, which provide a comprehensive tool to determine whether a process is in a normalcy state or not. The foundation of these concepts was established by Shewhart [1], who developed methods to improve quality and lower costs. Fluctuations and deviations from standards are present everywhere, and the problem of constructing a relevant chart lies in understanding which variations are normal and which are caused by a problem. In the classical theory of control, the underlying processes have a bell-shaped distribution, and in those cases control charts are based on the strong foundation of parametric statistics. Problems arise when the classical theory of control is applied to processes of other types.

The problem of constructing a relevant control tool lies in identifying the normalcy bounds of the processes. Since Shewhart's developments, an explosion in controlling techniques has occurred [2]–[4]. Different processes require different measures to be controlled, and each one leads to a new control chart with the corresponding normalcy states defined by thresholds [5]–[10]. Modern businesses and infrastructures are dynamic and, as a consequence, measured metrics are dynamic without any a priori known behavior. This calls for extending the classical ideas to encompass the notion of dynamic normalcy behavior. In some applications a notion of normalcy bound in terms of a dynamic threshold (DT) arises naturally [11]–[16]. An essentially different approach is determination of the normalcy state in terms of correlated events, which leads to a directed graph revealing the fundamental structure of a system behind the sources and processes [17] [18]. In this paper, we introduce a fully data-agnostic system [19] [20] for determining normalcy bounds in terms of DTs of monitoring time series data without presumed behavior. The system performs data categorization based on parametric and non-parametric models and applies category-specific procedures for optimized normalcy determination via historical simulation. Although experimental results are obtained on IT data, the approach is applicable to wider domains, because for different applications the data categories can be appropriately defined.

Determined DTs can be further applied for anomaly detection by construction of anomaly events. As soon as the DTs are historically constructed, they can be projected into the future as prediction for time-based normalcy ranges. Any data point appearing above or below those thresholds is an abnormal event. An approach described in [21]–[24] employs a directed virtual graph showing relationships between event pairs. An information-theoretic processing of this graph enables reliable prediction of root causes of problems, bottlenecks, and black swan events in IT systems.

The NDS described here is realized in VMware® vCenter™ Operations Manager™ [25] analytics, and the last section presents some results for real customer data.

2. General Description of the System

In this section, we present general principles of the NDS, which performs fully data-agnostic normalcy determination based on historical simulation. Flowchart 1 illustrates the general concept. The system sequentially utilizes different Data Quality Assurance (DQA) and Data Categorization (DC) routines that enable choosing the right procedure (right category of qualified data) for determination of data normalcy bounds in terms of DTs.

Flowchart 1. General Concept of the NDS.

DQA filters data by checking different statistical characteristics defined for data qualification, and then passes it through DC. DC performs identification of data into categories (e.g., trendy, cyclical, etc.). We repeat this cycle for each time series, performing category checking and identification in a hierarchical/priority order until the data is identified as belonging to some category or as corrupted. The categorization order, or hierarchy, is important because different orders of iterative checking and identification will lead to different final categorizations with differently specified normalcy states.

Flowchart 2 shows a specific realization of the general concept. Here, NDS consists of three DQA modules (Data Quality Detector, Data Density Detector, and Stability Detector) and two DC modules (Parametric Category Detector and Variability Detector).

Flowchart 2. A Specific Realization of NDS.

As a final result, the initial data is interpreted as Parametric, Sparse, Low-Variability, and High-Variability. In each of those cases the normalcy determination method is different. For instance, Parametric Data can be of different categories (Transient, Multinomial, Semiconstant, and Trendy) with a specific normalcy analysis algorithm in each case.
The functional meanings of the abovementioned detectors are as follows:

Data Quality Detector performs a check of sufficient statistics. This block classifies data as Qualified when the available data points and length of data are sufficient for further analysis; otherwise data is classified as Corrupted.

Parametric Category Detector performs data categorization by verifying data against a selected statistical parametric model. If categorization is possible, data is named as Parametric Data; otherwise it is called Regular Data.

Data Density Detector filters Regular Data against gaps. Data with an extremely high percentage of gaps is Corrupted Data. Data with a low percentage of gaps is Dense Data. Data with a high percentage of gaps that are uniformly distributed in time is Sparse Data. Data with a high percentage of gaps that are localized in time is further processed through a gap filter whose output is Dense or Corrupted Data.

Stability Detector analyzes Dense Data in terms of statistical stability. If data is piecewise stable and the latest stable region is long enough for further processing, this block performs a data selection; otherwise the data is Corrupted. Stable Data is then passed through the Variability Detector.
Variability Detector calculates variability indicators and classifies data into High-Variability or Low-Variability.

In all categorization scenarios the data additionally is verified against periodicity for efficient construction of its normalcy bounds (see Flowchart 3).

Flowchart 3. Categorization in Terms of Periodicity.

3. Period Detector

The period determination procedure (see Period Detector in Flowchart 3) in NDS seeks similar patterns in the historical behavior of time series for accurate setting of its normalcy bounds based on the information on cycles.

Some classical techniques known in the literature include seasonality analysis and adjustment [26]–[30], spectral analysis, Fourier transform, discrete Fourier transform [31]–[35], data decomposition into cyclical and non-cyclical components, and the Prony method [36], [37]. Our procedure is closer to the approach described in [38], which is based on clustering principles.

The main steps of the period determination procedure are presented in Flowchart 4.

Flowchart 4. Main Steps in Period Determination.

Data preprocessing performs data smoothing and outlier removal by various procedures. We refer to standard classical algorithms [39]–[42]. The purpose of this step is two-fold: eliminating extremely high or low outliers that can degrade the range information, and smoothing local fluctuations in the data for more robust pattern recognition.
Data Quantization performs construction of the Footprint of the historical data for further cyclical pattern recognition. This is a two-step procedure:

1. Frame construction – The range of the (smoothed) data is divided into non-uniform parts by quantiles q_k with k=k_1,…,k_m, 0≤k_1<…<k_m≤1, where the parameter m and the values of k_j are predefined. Evidently, the grid lines are dense where the data is dense. For division of data into parts along the time axis, two parameters, time_unit and time_unit_parts, are used. Time_unit is a basic parameter that defines the minimal length of possible cycles that can be found. Moreover, any detected cycle length can only be a multiple of time_unit. The usual setting is time_unit=1 day. The time_unit_parts parameter shows the number of subintervals into which each time_unit must be divided. This parameter is effectively the measure of resolution: the bigger the value of time_unit_parts, the more sensitive the footprint of the historical data. Figure 1 shows an example of a frame. Gridlines are equidistant along the time axis and non-uniform along the range.

Figure 1. Example of a Frame.

2. Percentage calculation – For the given frame, we calculate the percentage of data in every grid cell and obtain the corresponding column for the given time interval. Collecting all columns, we construct the matrix of percentages for that particular frame. The final matrix is a two-dimensional classical histogram of the historical data. Then, for every column we calculate the corresponding cumulative sums, getting a cumulative distribution of data in each column. We call this matrix the Footprint of the historical data. (A sketch of this construction follows.)
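
A condensed sketch of the frame and Footprint construction, assuming NumPy arrays for values and timestamps (timestamps in the same unit as time_unit) and predefined quantile levels; edge handling is simplified relative to the full procedure.

    import numpy as np

    def footprint(values, times, quantile_levels, time_unit, time_unit_parts):
        """Return the cumulative-percentage matrix: rows are range cells,
        columns are time subintervals of length time_unit/time_unit_parts."""
        edges = np.quantile(values, quantile_levels)   # dense where data is dense
        cell_len = time_unit / time_unit_parts
        col_idx = ((times - times.min()) // cell_len).astype(int)
        hist = np.zeros((len(edges) - 1, col_idx.max() + 1))
        for v, c in zip(values, col_idx):
            r = np.searchsorted(edges, v, side="right") - 1
            r = min(max(r, 0), len(edges) - 2)         # clamp to the frame
            hist[r, c] += 1
        pct = 100 * hist / np.maximum(hist.sum(axis=0), 1)  # percentages per column
        return np.cumsum(pct, axis=0)                  # cumulative sums per column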

The Pattern Recognition procedure is described as follows. Let T=N×time_unit, N=1,2,…. We collect the columns of the footprint matrix into subgroups with L=N×time_unit×time_unit_parts columns in every subgroup. The overall number of subgroups equals M=length(footprint)/L (the footprint matrix can be extended by zero columns if needed). For each k (k=1,…,L), we check the similarity of the k-th columns across the subgroups by the well-known relative L2-norm:

     d(A,B) = ‖A − B‖₂ / max(‖A‖₂, ‖B‖₂)

If d(A,B) ≤ closeness for some user-defined parameter closeness, then the two columns are assumed to be similar. Let the user define the parameters closeness (=0.2) and quality (=75%). Figure 2 shows an example in which data is divided into T-cycles. For this particular example, we assumed that

     d(A,E) > closeness, d(A,I) > closeness, d(A,M) > closeness

Hence column A is not similar to columns E, I, and M, and we put zero under it. Now, we try column E. We assumed that

     d(E,I)≤closeness,d(E,M)≤closeness

Hence, column E is assumed to be similar to I and M, and we put ones under these columns. If the percentage of ones is not less than the value of the quality parameter, then we declare that the corresponding column of the T-cycle is periodic; otherwise it is non-periodic. In our example, taking into account that three columns out of four compose 75% ≥ quality, we declare that the first column of the T-cycle is periodic and put one in the corresponding column (see Figure 3). We repeat the procedure for all columns (see the particular example in Figure 2) and check the periodicity of each column (see Figure 3).
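
Putting the pieces together, here is a sketch of the T-cycle similarity calculation under the closeness/quality rules above; trying successive reference columns mirrors the A-then-E retry in the example.

    import numpy as np

    def d(a, b):
        """Relative L2-norm between two footprint columns."""
        return np.linalg.norm(a - b) / max(np.linalg.norm(a), np.linalg.norm(b), 1e-12)

    def t_cycle_similarity(fp, L, closeness=0.2, quality=0.75):
        """Percentage of periodic columns for a candidate cycle of L columns
        of the footprint matrix fp."""
        n = fp.shape[1] // L                   # subgroups (A, E, I, M in Figure 2)
        periodic = 0
        for k in range(L):
            cols = [fp[:, g * L + k] for g in range(n)]
            for ref in range(n):               # try A as reference, then E, ...
                similar = sum(d(cols[ref], c) <= closeness
                              for i, c in enumerate(cols) if i != ref)
                if (similar + 1) / n >= quality:   # the reference counts itself
                    periodic += 1
                    break
        return 100 * periodic / L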

Figure 2. T-Cycle Checking Procedure.

Figure 3 shows that in this particular example, two columns are periodic (A,E,I,M and D,H,L,P) and two columns are non-periodic. Because the periodic columns compose 50% of all columns, the T-cycle has 50% similarity of columns.

Figure 3. T-Cycle Similarity Calculation.

Repeating this procedure for all possible T-cycles, we compose a Cyclochart of data that shows the percentage of similarities against T. Figure 4 shows an example of a Cyclochart.

Figure 4. Example of a Cyclochart.

The period-determination procedure is based on the Cyclochart analysis and consists of four steps. Let us consider the particular example of Figure 4:

1. Finding the local maxima in the Cyclochart with their corresponding similarities (see Table 1).

Table 1. Local maxima in the Cyclochart with their corresponding similarities.

2. Construction of the periods that correspond to every local maximum. Data with a T-cycle also has a kT-cycle for every natural k. Hence, for the first row in Table 1, together with the 2-day period, we also expect 4-day, 6-day, … periods. So a 2-day period creates the following period series:

     2 → 2, 4, 6, 8, 10, …

The peak 4 creates another series

     4 → 4, 8, 12, 16, …

and so on.

3. Calculation of the series characteristics for every period series. The following characteristics are assumed to be important: the positive factor of a period series is the number of peaks in that series, and the negative factor is the number of members in that series that are not peaks. Then, we set

    Strength = Positive factor – Negative factor.

Table 2 shows these characteristics for Table 1.

Table 2. Positive Factors, Negative Factors, Strengths and the Corresponding Similarities.

4. Period determination based on the defined characteristics by the following procedure. We select the periods with maximum strength. From that list, we choose the periods with minimum negative factor, then with maximum similarity. Then, we pick the period with minimum length and check its similarity measure. The similarity of the determined final period must be greater than 20%; otherwise, data is claimed to be non-periodic. This procedure applied to the results in Table 2 leads to the 7-day period, because it has the maximum Strength = 4.
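
A sketch of steps 2 through 4, assuming peaks maps each local maximum's period (in days) to its similarity, as in Table 1; the tie-breaking order follows the procedure above, and the sample data is hypothetical.

    def select_period(peaks, max_period=30, min_similarity=20):
        """peaks: {period_in_days: similarity_percent} of Cyclochart maxima.
        Returns the chosen period, or None for non-periodic data."""
        candidates = []
        for p, sim in peaks.items():
            series = range(p, max_period + 1, p)             # p, 2p, 3p, ...
            pos = sum(1 for q in series if q in peaks)       # positive factor
            neg = sum(1 for q in series if q not in peaks)   # negative factor
            # order: max strength, min negative factor, max similarity, min length
            candidates.append((pos - neg, -neg, sim, -p, p))
        if not candidates:
            return None
        _, _, sim, _, period = max(candidates)
        return period if sim > min_similarity else None

    # hypothetical peaks: select_period({2: 32, 4: 27, 7: 50, 14: 45,
    #                                    21: 40, 28: 38}) returns 7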

Figure 5. Normalcy Bounds for Periodic Data.

We will now discuss the general procedure of how the normalcy bounds can be determined based on periodicity, taking into account that for specific categories it can be modified appropriately. In the case of non-periodic data, normalcy bounds can be determined by the well-known whiskers method. If data is claimed to be periodic, then normalcy bounds are calculated based on cycle information (see Figure 5 for a specific example). More specifically, consider the case of cyclical data and four columns A, B, C, D from the Footprint, shifted from one another by the period of the data.

If

     d(A,B) ≤ closeness, d(A,C) ≤ closeness, and d(A,D) ≤ closeness,

then all four columns make a cyclical subgroup and we calculate the bounds based on the four data columns. Then, if

     d(A,B) ≤ closeness, d(A,C) ≤ closeness

but

     d(A,D) > closeness,

then only the columns A, B, C make a cyclical subgroup. Taking into account that quality = 75% and three columns out of four compose 75%, we assume that column D is corrupted and consider only the three columns. If

     d(A,D) ≤ closeness

but

     d(A,B) > closeness,

then we discard column A. If less than 75% of these four columns are similar, then we have a non-cyclical subgroup and take into consideration all columns A, B, C, D. Then, for each group of columns, the normalcy bounds can be taken as the (min, max) values of the data (see Figure 6), taking into account the preliminary data smoothing.
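
For one set of period-shifted columns, the grouping and bound extraction can be sketched as follows, assuming equal-length NumPy arrays of the raw data behind each column.

    import numpy as np

    def normalcy_bounds(columns, closeness=0.2, quality=0.75):
        """columns: equal-length 1-D arrays behind period-shifted footprint
        columns. Returns (lower, upper) normalcy bounds."""
        def d(a, b):
            return np.linalg.norm(a - b) / max(np.linalg.norm(a),
                                               np.linalg.norm(b), 1e-12)
        ref = columns[0]
        similar = [c for c in columns if d(ref, c) <= closeness]  # includes ref
        if len(similar) / len(columns) >= quality:
            group = similar          # cyclical subgroup: corrupted columns dropped
        else:
            group = columns          # non-cyclical subgroup: keep all columns
        stacked = np.concatenate(group)
        return stacked.min(), stacked.max()    # (min, max) as in Figure 6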

Figure 6. Normalcy Bounds Construction Procedure from Footprint Taking into Account the Information on Cycles.

4. Data Quality and Parametric Category Detectors

Data Quality Detector performs a check for sufficient statistics. This block classifies data as Qualified when available data points and length of data are enough (for example, more than 20 points and longer than 1 week) for further analysis; otherwise data is classified as Corrupted.

Flowchart 5 shows the principal scheme of the Parametric Category Detector. It specifies data as either Parametric or Regular. Parametric Data can belong to different categories: Multinomial, Transient, Semi-Constant, Trendy, or any other user-defined category.

Flowchart 5. Parametric Category Detector.

Multinomial Data – Flowchart 6 describes the process of Multinomial Data (MD) categorization. The Checking Parameters module for MD calculates statistical parameters for comparison with the predefined measures. If the check is positive, then data is classified as MD; otherwise the Performing De-noising module performs data cleaning with sequential checking of predefined parameters. We consider the following predefined statistical measures. Let p_j be the frequency of occurrences of the integer n_j:

pj = 100 · (number of occurrences of nj) / N, j = 1, …, m,

where N is the total number of integer values and m is the number of different integer values.

Flowchart 6. Categorization of Multinomial Data.

Data is multinomial if it takes fewer than m different integer values and at least s of them have frequencies greater than the parameter H1. Two different de-noising procedures can be performed. The first procedure is filtering against non-integer values whose percentage is smaller than H2 (H1 < H2). If this condition is satisfied, then the non-integer numbers are discarded from the remaining analysis. The second procedure is filtering against integer values with a small cumulative percentage. Sorting the percentages pj in descending order, we define the cumulative sum cj as:

c1=100, cj=pj+…+pm, cm=pm

Now, if ck < H3 and ck−1 ≥ H3, then the integer values nk, nk+1, …, nm can be discarded from further analysis.
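A small sketch of this second de-noising procedure; the threshold H3 and the example values are illustrative.

import numpy as np

def denoise_multinomial(values, H3=5.0):
    """Drop integer values whose cumulative tail percentage c_k falls below H3."""
    ints, counts = np.unique(values, return_counts=True)
    p = 100.0 * counts / counts.sum()
    order = np.argsort(p)[::-1]               # percentages in descending order
    ints, p = ints[order], p[order]
    c = np.cumsum(p[::-1])[::-1]              # c_j = p_j + ... + p_m, c_1 = 100
    return ints[c >= H3]                      # discard the tail n_k, ..., n_m

print(denoise_multinomial([1] * 60 + [2] * 30 + [3] * 8 + [4] * 2))  # -> [1 2 3]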

Determination of normalcy bounds starts with periodicity analysis. Here, while constructing the Footprint, instead of the percentages of data in every cell we take the values of ck in every column. Then, if data is claimed as periodic, points in similar columns are collected together and new values of the numbers ck are calculated. If ck−1 < H and ck ≥ H, then the values n1, n2, …, nk constitute the most probable set (normalcy set) of the similar columns. If data is determined to be non-periodic, then the numbers ck are calculated for all data points and the normalcy set is determined similarly.

Transient Data is categorized by multimodality, modal inertia, and randomness of modes appearing along the time axis. Transient Data must have at least two modes. In this context, modal inertia means that data points in each mode must have some inertia. (They can’t oscillate from one mode to the other too quickly.) Actually, the inertia can be associated with the time duration that data points remain in the selected mode.

Detection of inertial modes is based on calculation of transition probabilities. In the first step we look for a region/interval of sparse data values, with data exhibiting some inertia concentrated in the upper and lower regions of this interval. We take two numbers a, b such that

x_min≤a<b≤x_max,

where x_min, x_max are the minimum and maximum values of the data, respectively. These numbers divide the interval [x_min, x_max] into three regions A ≝ [x_min, a], B ≝ (a, b), and C ≝ [b, x_max]. We calculate the following transition probabilities:

pA→A = NA→A / NA, pB→B = NB→B / NB, pC→C = NC→C / NC,

where NA is the number of points in [x_min, a), NB is the number of points in [a, b], NC is the number of data points in (b, x_max], NA→A is the number of points with the property x(ti) ∈ A and x(ti+1) ∈ A, NB→B is the number of points with the property x(ti) ∈ B and x(ti+1) ∈ B, and NC→C is the number of points with the property x(ti) ∈ C and x(ti+1) ∈ C.

Starting from the highest possible position and shifting the region B to the lowest possible one, we calculate these three transition probabilities and stop the procedure when the following condition is fulfilled: pA→A > H, pC→C > H, pB→B < h, and NA, NC ≫ 1, where the numbers H and h are predefined parameters. In our experiments below we set H = 0.75 and h = 0.25. If this procedure ends without finding the needed interval, we narrow the region B and repeat the procedure.
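A sketch of this scan; the grid of candidate positions, the fixed window width, and the minimum-population check are illustrative choices, and the example data are synthetic.

import numpy as np

def region_probs(x, a, b):
    """Self-transition probabilities for regions A=[min,a), B=[a,b], C=(b,max]."""
    lab = np.where(x < a, 0, np.where(x <= b, 1, 2))   # 0: A, 1: B, 2: C
    n = np.bincount(lab, minlength=3).astype(float)
    stay = np.bincount(lab[:-1][lab[:-1] == lab[1:]], minlength=3)
    return np.divide(stay, n, out=np.zeros(3), where=n > 0), n

def find_sparse_interval(x, width, H=0.75, h=0.25, min_pts=10):
    """Slide region B from the highest to the lowest possible position."""
    for b in np.linspace(x.max(), x.min() + width, 25):
        p, n = region_probs(x, b - width, b)
        if p[0] > H and p[2] > H and p[1] < h and min(n[0], n[2]) >= min_pts:
            return b - width, b
    return None       # not found: the caller narrows B and repeats the scan

rng = np.random.default_rng(0)
x = np.r_[rng.normal(0, 1, 300), rng.normal(8, 1, 300)]   # two inertial modes
print(find_sparse_interval(x, width=3.0))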

In our experiments we divide the interval [x_min, x_max] into N + 1 equal parts x_min < x1 < x2 < ⋯ < xN < x_max and take the candidate boundaries a, b from these points. Time intervals between consecutive data points longer than cΔt are holes (gaps), where Δt is the monitoring interval and c is a predefined parameter for the hole determination. It is assumed that for transient data the holes must be “uniformly” distributed along the time axis. This can be checked by transition probabilities. Let Tk be the duration (in milliseconds, seconds, minutes, etc., but in the same measures as the monitoring time) of the k-th gapless data portion. For data without holes we have only one such portion and T1 = tn − t1. The sum

T = T1 + T2 + ⋯ + TNT

is the duration of gapless data, where NT is the number of gapless data portions. Let Gk be the duration (in the same measures as Tk) of the k-th hole. The sum

G = G1 + G2 + ⋯ + GNG

is the duration of all holes in the data, where NG is the number of hole portions. Obviously, G + T = tn − t1.

By ρ we denote the percentage of holes in the data:

ρ = 100 · G / (G + T).

Calculation of Probabilities – By p11, p10, p00, p01 we denote the probabilities of data-to-data, data-to-gap, gap-to-gap, and gap-to-data transitions, respectively:

p11 = N1→1 / N1, p10 = N1→0 / N1,

and

p00 = N0→0 / N0, p01 = N0→1 / N0,

where N1 and N0 are the numbers of data and gap points, and Ni→j is the number of the corresponding i-to-j transitions.

We look for an inertial mode for which

ρ > P, p10 > ε, p01 > ε,

where P and ε are user-defined parameters. If at least two inertial modes satisfy these conditions, then data is transient.
Normalcy determination is performed separately for each mode.
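The hole bookkeeping above can be sketched as follows. Treating each inter-sample interval as one step of the 0/1 (data/gap) sequence, and taking Δt as the median sampling interval, are simplifying assumptions of this sketch.

import numpy as np

def gap_statistics(t, c=3.0):
    """Percentage of holes and the gap/data transition probabilities."""
    gaps = np.diff(t)
    dt = np.median(gaps)                     # assumed monitoring interval
    is_data = gaps <= c * dt                 # intervals longer than c*dt are holes
    G = float(gaps[~is_data].sum())          # G = G1 + ... + G_NG
    T = float(t[-1] - t[0]) - G              # T = T1 + ... + T_NT
    rho = 100.0 * G / (G + T)
    s = is_data.astype(int)                  # 1 = data interval, 0 = hole
    def p(i, j):
        src = np.flatnonzero(s[:-1] == i)
        return float(np.mean(s[src + 1] == j)) if src.size else 0.0
    return rho, {"p11": p(1, 1), "p10": p(1, 0), "p00": p(0, 0), "p01": p(0, 1)}

print(gap_statistics(np.array([0, 1, 2, 3, 10, 11, 12, 20, 21.0])))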
Semi-Constant Data – Data is Semi-Constant if

iqr(x1, x2, …, xN) = 0,

where N corresponds to data length and iqr stands for interquartile range of xk=x(tk).

If data is not semi-constant but its latest long-enough time period satisfies the condition, then we select that portion for further normalcy bounds determination.

Normalcy determination for Semi-Constant Data can be performed as follows. For Semi-Constant Data, every data point greater than q0.75 (the 75% quantile) or less than q0.25 is an outlier. If the percentage of outliers is greater than p% (p = 15%), then we check for periodicity in the outlier data by the procedure described above. For that, data points equal to the median are excluded from the analysis. In the case of non-periodic data, the normalcy bounds are calculated by the whisker's method. In the case of periodic data, the same procedure is applied to each periodic column of the Footprint of the data.
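A minimal sketch of the non-periodic branch, using the classic 1.5·iqr whiskers; the thresholds and example values are illustrative.

import numpy as np

def semi_constant_bounds(x, p=15.0, whisker=1.5):
    """Whiskers bounds when the outlier percentage stays below p."""
    q25, q75 = np.percentile(x, [25, 75])
    outliers = (x < q25) | (x > q75)
    if 100.0 * outliers.mean() <= p:
        iqr = q75 - q25
        return q25 - whisker * iqr, q75 + whisker * iqr
    return None   # too many outliers: check them for periodicity first

x = np.array([5.0] * 90 + [7.0] * 5 + [3.0] * 5)
print(semi_constant_bounds(x))   # -> (5.0, 5.0): only the constant level is normal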

Trendy Data – Different classical methods are known for trend determination [43]–[48]. In our analysis, Trendy Data recognition and the related determination of its normalcy bounds consist of three main steps (see Flowchart 7).

1. Trend identification by the Trend Detector, which separates Qualified Data into Trendy and Non-Trendy Data.
2. Trend recognition by the Trend Recognition module, which classifies the trend into linear, log-linear, and non-linear categories. The main purpose of this step is decomposition of the original time series f0(t), consisting of N points, into a sum of a non-trendy time series f(t) and a trend component trend(t), f0(t) = f(t) + trend(t), which allows more accurate normalcy analysis based on f(t).
3. Specific normalcy bounds calculation for each category.

Flowchart 7. Trend Determination and Normalcy Analysis.

Trend Detector performs different classical tests for trend detection. The Mann-Kendall (MK) test is appropriate for our purposes, although other known tests can also be applied. The MK statistic (S0) can be computed by the formula

S0 = ∑1≤k<j≤n sign(xj − xk)

In general, the procedure consists of the following steps: data smoothing, then calculation of the MK statistic S0 for the smoothed data. If S0 > 0, then the trend can be increasing; otherwise (if S0 < 0) decreasing. The trend measure is then calculated as

p = 100 · |S0| / S0,max,

where

S0,max = n(n − 1)/2 is the maximum possible value of |S0|.

Data is trendy if, for example, p > 40%.
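The MK statistic itself is standard; the normalization by S0,max = n(n − 1)/2 in this sketch reflects our reading of the trend-measure formula above and should be treated as an assumption.

import numpy as np

def trend_measure(x):
    """Mann-Kendall statistic S0 and the trend measure p (in percent)."""
    n = len(x)
    s0 = sum(np.sign(x[j] - x[k]) for k in range(n - 1) for j in range(k + 1, n))
    return s0, 100.0 * abs(s0) / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(0.2, 1.0, 200))          # noisy increasing series
s0, p = trend_measure(x)
print(s0 > 0, p > 40)                              # increasing, trendy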

Trend Recognition reveals the nature (linear, log-linear, or non-linear) of the trend. We check linear and log-linear trends by linear regression analysis. Goodness of fit is checked by the following formula

R = 1 − Rregression / R0,

where Rregression is the sum of squares of the vertical distances of the points from the regression line and R0 is the similar quantity for the line with zero slope passing through the mean of the data (the null hypothesis).

If R is, for example, greater than 0.6, then the trend is assumed to be linear; otherwise log-linearity is checked by the same procedure for f(e^ct), where c is some constant. If the corresponding goodness of fit is greater than 0.6, then data is assumed to be log-linear; otherwise, data is non-linear trendy.
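A sketch of both checks; the form R = 1 − Rregression/R0 follows the reconstruction above, and the choice of the constant c in f(e^ct) is an assumption of this sketch.

import numpy as np

def goodness_of_fit(t, y):
    """R = 1 - R_regression / R0 for the least-squares line over (t, y)."""
    k, b = np.polyfit(t, y, 1)
    r_reg = np.sum((y - (k * t + b)) ** 2)    # distances from the regression line
    r0 = np.sum((y - y.mean()) ** 2)          # zero-slope line through the mean
    return 1.0 - r_reg / r0

def classify_trend(t, y, threshold=0.6, c=1.0):
    if goodness_of_fit(t, y) > threshold:
        return "linear"
    if goodness_of_fit(np.exp(c * t / t.max()), y) > threshold:  # f(e^ct)
        return "log-linear"
    return "non-linear"

t = np.arange(100.0)
y = 3 * t + np.random.default_rng(1).normal(0, 5, 100)
print(classify_trend(t, y))   # -> linear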

Normalcy determination is performed as follows:

Data with Linear Trend. We decompose original data f0 (t) into the form

f_0 (t)=f(t)+linear_trend(t)

where

linear_trend(t)=kt+b

with the coefficients k and b determined by linear regression analysis, and we perform periodicity analysis for f(t) as described above. If f(t) is non-periodic, then the normalcy bounds of f0(t) are straight lines (upper and lower dynamic thresholds) that we set up by maximizing an objective function. As the objective function we consider the following expression

g = g(P, S/Smax; a),

where S is the area of the region bounded by tmin, tmax, and some lower and upper lines (see Figure 7),

Smax=h(tmax-tmin)

and P is the fraction of data within the upper and lower lines, and a is a user-defined parameter. Then we calculate the variability (standard deviation) of f(t), σ = std(f(t)), and consider the following set of lower and upper lines:

[kt + b − zjσ, kt + b + zjσ], j = 1, 2, …

calculating each time the corresponding value gj of the objective function. The lines that correspond to max(gj) are taken as the normalcy bounds.

Figure 7. Auxiliary Drawing for Definition of the Objective Function.

In our experiments we use the following values for zj:
z1=1, z2=1.5, z3=2, z4=3, z5=4
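As an objective function, the sketch below assumes one simple combination, g = P − a·S/Smax, which rewards the covered fraction P and penalizes the normalized band area S/Smax; any expression with these two ingredients matches the description above, so this particular form is an assumption.

import numpy as np

def pick_z(t, f, a=0.5, zs=(1, 1.5, 2, 3, 4)):
    """Pick the z_j whose band [trend - z*sigma, trend + z*sigma] maximizes g."""
    sigma = np.std(f)                                    # variability of f(t)
    s_max = (f.max() - f.min()) * (t.max() - t.min())    # S_max = h*(t_max - t_min)
    def g(z):
        P = np.mean(np.abs(f) <= z * sigma)              # fraction inside the band
        S = 2 * z * sigma * (t.max() - t.min())          # band area
        return P - a * S / s_max                         # assumed objective
    return max(zs, key=g)

rng = np.random.default_rng(2)
t, f = np.arange(200.0), rng.normal(0, 1, 200)           # detrended residual f(t)
print(pick_z(t, f))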

If f(t) is periodic, then the procedure described above can be performed for each set of similar columns by calculating the variability (σm) of the m-th set and considering the following normalcy bounds:

[kt+b-zj σm, kt+b+zj σm ], j=1,2,…

Then the maximum of the objective function gives the normalcy bounds of the m-th set.

Data with Log-Linear Trend. Taking into account that f(e^ct) is data with a linear trend, the procedure described above is valid for this case as well.

Data with Non-Linear Trend. In this case we select the last reasonable portion of data and calculate the normalcy bounds according to the procedure described above for the non-periodic case.

Figure 8 shows an example of trendy periodic data with the corresponding normalcy bounds.

Figure 8. Normalcy Bounds for Trendy Periodic Data (Red curve is upper threshold, green curve is lower threshold, and blue curve is the original data.)

5. Data Density Detector

Data density recognition is based on a probability calculation that reveals the distribution of gaps. According to our analysis we differentiate the following categories: Dense Data (relative to the estimated monitoring time); Sparse Data (relative to the estimated monitoring time); data with a technical gap (a localized gap due to malfunction of a device), which after data selection will belong to the Dense Data cluster; and, finally, Corrupted Data.

The principal scheme of the density recognition and recovering (data selection) procedures is presented in Flowchart 8. For categorization purposes we deal with the following measures that characterize the nature of gap presence in data:

  • Percentage of gaps
  • Probabilities for gap-to-gap, data-to-data, gap-to-data, and data-to-gap transitions

If the total percentage of gaps is acceptable, then data is categorized as Dense Data.

If the total percentage of gaps is higher than some limit and they have a non-uniform distribution in time (which means that the gaps have some localization in time), then the gap cleanup (data selection) procedure will yield Dense Data. If gaps have a uniform distribution in time, then data belongs to the Sparse Data cluster. If gaps have such an extremely high percentage that further analysis is impossible, then the data belongs to the Corrupted Data cluster.

We omit the technical details because calculation of the transition probabilities can be performed as for Transient Data.

Flowchart 8. Data Density Detector.

For normalcy determination, data is first checked for periodicity. In the case of Sparse Data, it is reasonable to take the duration of gaps into account.

6. Stability Detector

The problem of change detection in time series [49]–[55] is a well-known statistical problem. The Stability Detector (see Flowchart 9) performs data processing for statistical stability recognition. If data is stable or its stable portion can be selected, then the data (or the selected portion) is defined as Stable Data; otherwise it is Corrupted.

Stability identification is accomplished by constructing a Stabilochart that shows the stability intervals of the time series and allows selection of the recent long-enough data region for further analysis. For every given m we calculate a quantity sm that shows the relative change (left-right) attached to the point xm in terms of the iqr measure, where n is a window parameter defined in terms of the length T of the data.
If sm < S, then we set sm = 0, indicating stability at the point xm with the given sensitivity S (= 70%); otherwise we set sm = 1, indicating instability at the point xm.

The graph of the sm values obtained as xm moves (by a preset number of data points) is the Stabilochart of the data. The Stabilochart shows whether data is stable, whether the latest stable portion can be selected, or whether the data is corrupted.
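A sketch of the Stabilochart computation; the particular definition of sm used here (the shift between the medians of the left and right windows, relative to the local iqr, in percent) is an assumption of this sketch.

import numpy as np
from scipy.stats import iqr

def stabilochart(x, n, S=70.0, step=5):
    """0/1 instability flags along the series (assumed s_m definition)."""
    flags = []
    for m in range(n, len(x) - n, step):
        left, right = x[m - n:m], x[m:m + n]
        scale = iqr(x[m - n:m + n]) or 1.0
        s_m = 100.0 * abs(np.median(right) - np.median(left)) / scale
        flags.append(0 if s_m < S else 1)                 # 0 = stable at x_m
    return np.array(flags)

rng = np.random.default_rng(3)
x = np.r_[rng.normal(0, 1, 300), rng.normal(10, 1, 300)]  # level shift at t=300
print(np.flatnonzero(stabilochart(x, n=50)))              # flags near the shift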

Flowchart 9. Stability Detector.

7. Variability Detector

The Variability Detector performs data processing for variability recognition. Two different categories can be recognized: Low-Variability and High-Variability. A measure R of variability is computed from the absolute jumps x′k = |xk+1 − xk| of the data points.


Data clustering is performed by the following comparison with the parameter V (= 20%): if R ≤ V, then data is from the Low-Variability cluster; otherwise it is from the High-Variability cluster. Normalcy determination for the two categories is performed with different setups of the preliminary parameters, less sensitive for High-Variability Data.
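A sketch of the detector; the particular definition of R used here (mean absolute jump relative to the interquartile range, in percent) is an assumption of this sketch.

import numpy as np
from scipy.stats import iqr

def variability_cluster(x, V=20.0):
    """Split into Low/High-Variability using an assumed jump-based measure R."""
    R = 100.0 * np.abs(np.diff(x)).mean() / (iqr(x) or 1.0)
    return "Low-Variability" if R <= V else "High-Variability"

smooth = np.sin(np.linspace(0, 10, 500))
noisy = np.random.default_rng(5).normal(size=500)
print(variability_cluster(smooth), variability_cluster(noisy))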

8. Experimental Results and Discussion

We present some results of experiments on an actual customer data set. First we performed experiments for short-term data of almost one month's duration. The NDS was applied to 3,215 time-series metrics. Table 3 shows the distribution along different data categories. Table 4 shows the count of periodic and non-periodic data.

Table 3. Distribution of Metrics Along Data Categories (Short-Term Data).

Table 4. Counts of Periodic and Non-Periodic Metrics (Short-Term Data).

We also examined the distribution of periodic data in some categories. For the 532 Semi-Constant Data metrics (see Table 3), 267 have a percentage of outliers of less than 15%, and they are claimed as non-periodic without any further checking. The remaining 235 metrics are investigated for periodic structure, and in 212 of them periods are found. In the case of the High-Variability and Low-Variability categories, periods are found for 378 and 165 metrics, respectively.

Second, we performed experiments for long-term data of almost three months' length. We obtained 3,956 metrics, and Table 5 shows the distribution along different categories. Table 6 shows the distribution of metrics along periodicity. For the 586 Semi-Constant metrics, 324 have outliers of less than 15% and are categorized as non-periodic; the remaining 262 are checked for periodicity, and for 221 of them periods are found. For the High-Variability and Low-Variability categories, periods are found for 457 and 165 metrics, respectively.

It is worth noting that results obtained for a specific customer cannot be generalized to other cases in any manner. The results can vary widely from one customer to another without any intersection.

Table 5. Distribution of Metrics Along Data Categories (Long-Term Data).

Table 6. Distribution of Metrics Along Periodicity (Long-Term Data).

However, results obtained for a specific customer can provide useful information about the customer’s environment. In terms of our approach, this can also lead to some optimizations by excluding procedures for a specific category that is not common for the customer. For example, in Table 5, we see that the Sparse and Transient categories cover only 5.5% of the overall data set, and the system can be applied without specifying them.

Another important insight is that data category is not an invariant property. Change in length of data in general changes the category. Moreover, the data selection module picks up the last stable portion, and categorization is performed only on this portion. So, visually, data can be corrupted, but its latest stable portion belongs to some of the predefined categories.

Figures 10, 11, and 12 present reliably predicted normalcy bounds obtained by the NDS.

Figures 10-12. Normalcy Bounds Predicted by the NDS.

References

1. W.A. Shewhart (1931), “Economic Control of Quality of Manufactured Product,” New York: D. Van Nostrand Company. Republished in 1980 by the American Society for Quality Control.
2. D.J. Wheeler and D.S. Chambers (1986), “Understanding statistical process control,” Knoxville, TN: SPC Press.
3. D.J. Wheeler (1991), “Shewhart’s chart: myths, facts, and competitors,” 45th Annual Quality Congress Transactions.
4. F. Alt (1985), “Multivariate Quality Control,” Encyclopedia of Statistical Sciences, Volume 6.
5. C.A. Parris, “Hard disk drive infant mortality tests,” Patent application number: US 09/387,677. Filing date: Aug 31, 1999. Publication number: US6408406B1. Publication date: Jun 18, 2002. Publication type: grant.
6. F.L. Paulson, “System and method for determining optimal sweep threshold parameters for demand deposit accounts,” Patent application number: US 08/825,012. Filing date: Mar 26, 1997. Publication number: US5893078 A. Publication date: Apr 6, 1999. Publication type: grant.
7. J.D. Luan, “Threshold alarms for processing errors in a multiplex communications system,” Patent application number: EP19880112793. Filing date: Aug 5, 1988. Publication number: EP0310781B1. Publication date: Mar 10, 1993. Publication type: grant.
8. P. Celka, “A method and system for determining the state of a person,” Patent application number: PCT/EP2013/064678. Filing date: Jul 11, 2013. Publication number: WO2014012839 A1. Publication date: Jan 23, 2014.
9. B.M. Jakobson, “Systems and methods for authenticating a user and device,” Patent application number: US 13/523,425. Filing date: Jun 14, 2012. Publication number: US20130340052 A1. Publication date: Dec 19, 2013.
10. T. Blumensath and M.E. Davies (2009), “Iterative hard thresholding for compressed sensing,” Applied and Computational Harmonic Analysis, Vol. 27, no. 3, pp. 265-274.
11. H. Huang, H. Al-Azzawi, and H. Brani (2014), “Network traffic anomaly detection,” ArXiv:1402.0856v1.
12. D. Dang, A. Lefaive, J. Scarpelli, and S. Sodem, “Automatic determination of dynamic threshold for accurate detection of abnormalities,” Patent application number: US 12/847,391. Filing date: Jul 30, 2010. Publication number: US20110238376 A1. Publication date: Sep 29, 2011.
13. J. C. Jubin, V. Rajasimman, and N. Thadasina, “Hard handoff dynamic threshold determination,” Patent application number: US 1/459,317. Filing date: Jun 30, 2009. Publication number: US20100056149 A1. Publication date: Mar 4, 2010.

 

Automatic Discovery of Configuration Policies


Lalit P. Jain
Cloud Management Intern, VMware, Inc.
ljain2@uccs.edu

Greg Frascadore
VMware, Inc.
gfrascadore@vmware.com

Abstract

Cloud computing data centers contain thousands of host servers and millions of virtual machines, each with its own configuration. Automation can enforce standards and keep configurations synchronized, but defining the desired state (the policies) is still a manual process. We describe a method that automatically discovers configuration policies by monitoring configuration changes and clustering resource properties into policies based on their correlation, measured using mutual information.
This work is a step toward the automatic discovery and generation of configuration assessment rules.
General Terms: algorithms, management, measurement
Keywords: clustering, configuration management, configuration assessment

1. Introduction

Cloud computing data centers contain physical and virtual forms of computing servers, network switches, data stores, storage arrays, application servers, and numerous other resources that have software-defined configuration. In a typical data center there can be thousands of physical devices and millions of virtual resources. Automation is the key to managing the complexity. Automation can enforce configuration standards, detect configuration drift, and measure the degree to which a data center is in compliance with its desired state. However, although automation can enforce configuration standards, defining the desired state is still a manual process.
The identification of configuration patterns and automation are the key to managing the various devices in the data center.

Configuration patterns stem from configuration goals for resources. These goals are policies. Just as there is more than one desired state in the data center, there are also many policies. A single resource can be subject to multiple policies, and a single policy can apply to multiple resources. Configuration policies arise from practicalities, from best practices, or by fiat.

2. Overview

In this paper we describe a process for automatically discovering policies for resources and for discovering the asset classes that such policies induce. Our approach is to monitor the configuration changes made by an administrator over time. We leverage several assumptions:

  • In the modern data center, the trend toward the software-defined data center and DevOps practices is making configuration changes monitorable and trackable, because they take place in virtual resources or through management applications that create logs.
  • The purpose of a series of configuration changes is to place the target resource into a desired state.
  • Relevant properties are ones that influence a key performance indicator (KPI). These are the subjects of configuration change. Unchanging properties are irrelevant.

3. Background

A configuration policy is a rule or guideline that constrains the state of a resource by limiting, or disciplining, certain property values of the resource. For example, the VMware vSphere® Hardening Guide [1] is a policy that includes constraints for properties like those shown in Table 1.

Table 1. Example Property Constraints from the vSphere Hardening Guide.

Although some systems—such as SCAP/OVAL [2] [3] and Puppet [4] [5] [6]—support the automatic testing and enforcement of configuration policies, most policies are uncodified or intangible. Some policies arise as local best practices that administrators implement manually with ad-hoc changes. Other policies arrive in the form of Security Technical Implementation Guides (STIGs) from the federal government [7] or as regulations from the Payment Card Industry (PCI) [8]. Policies from these sources are nonoperational blueprints and checklists that the administrator interprets and implements manually. This situation is undesirable because intangible policies are hard to evolve, change, and understand.

Complicating the issue is that the policies are not the only intangibles. An asset class is a subset of resources related by mission, location, or affiliation. Examples are production servers, accounting desktops, and West Coast hosts. Historically, asset-class definitions also arise informally. An administrator takes the inventory of known resources and categorizes them using local expertise. The admin then interprets and applies policies differently to the resources according to their asset class. For example, resources that process payments might be disciplined more frequently than documentation portals. The entire process is error-prone and nonreproducible.

Policies induce asset classes. The domain of every policy partitions the resources of the data center into one group that is subject to the policy, and another that is not. The former is an asset class. The intersections and unions of the directly induced asset classes create even more subdivisions. Asset classes enable us to think of resources as groups differentiated by the policies that apply. For this partitioning to be practical, the policies from which the classes arise must be codified.

The DevOps philosophy for data center operations is gaining momentum [9]. DevOps calls for creating reproducible configuration change driven by codified policies (i.e., tangible desired state). The idea is to treat infrastructure like code. Unfortunately, writing executable policies resembles software development. Policy interpreters require the use of languages such as OVAL or XQuery. Property tests written in these languages require domain expertise regarding the guideline being codified, as well as technical expertise in programming the rule language of the assessment system. The limiting factor of the current approaches is this necessity for manual declaration. The effect is that many configuration policies never become codified.

4. Details

We propose to remove the necessity for manual declaration of policy rules. We can use the logs of configuration changes made by users, along with semisupervised clustering techniques, to discover configuration policy. Effectively, we are codifying policies by observing the changes that users perform while enforcing intangible policies. This becomes possible because the modern data center is software-defined, making configuration change trackable. Configuration changes are made for a purpose. An administrator makes a change in order to bring a resource into a desired state. The goal state can be informal and uncodified, or it can be motivated by regulations the administrator is trying to follow. Configuration change can also be the result of tailoring of externally imported guidelines. No matter the cause, the effect is that the data center is being made to conform to the desired state of the administrator’s organization. The changes are an expression of policy and desired state.

In the software-defined data center, the configuration of a resource is trackable because the physical data center components are commoditized and customization takes place on the virtual replacements: virtual compute, network, and storage. Administrators provision and control resources and implement policy through management and workflow applications instead of adjusting physical components.

Configuration change is a software-controlled state change made through applications that create logs. Management applications such as the VMware® vCenter™ inventory service report state changes in a change-event log or Atom feed. By retrieving the new state of a resource and comparing it with the previous state, we create a change log for every resource and property. For every resource and property, the log records a row entry like those in Table 2.

Table 2. Example Rows from the Configuration Change Log.

There are many resource types: virtual machines, hosts, storage, and networks. Applications such as Web servers and database systems are resources, as are instances of application stacks (e.g., LAMP stacks). Each resource type supports a collection of mutable properties. Guidelines and policies dictate the desired state and constraints on a mutable property.

We associate a baseline value with every resource property. The initial value is usually the baseline when a property has a predictable initial value. The specific value of the baseline is unimportant as long as a comparison with it indicates when a property is set to a nonbaseline condition. Examples of properties and respective baseline values are shown in Table 3.

Table 3. Example Properties and Their Baseline Values.

In combination with the configuration change log of Table 2, the knowledge of baseline property values provides an insight necessary for discovering policies. When a property is consistently set to a nonbaseline condition, some policy is disciplining that property. If the policy is unknown, we can discover it by correlating the changes. To do this, we look back over an attention window—a past region of the change log. From the log entries we create two tables: a change indicator table and a final-value table.

4.1 Change Indicator Table

The change indicator table is an array of binary-valued feature vectors. Each row represents the final state of a resource at the end of the attention window. During the attention window, properties of the resource might have changed value. The bits within a row indicate whether the respective property landed in a nonbaseline value at the end of the attention window. In other words, for each resource ri a vector ri = {Xij} indicates whether the jth property Pj was set to a nonbaseline value. The value Xij = 0 if the jth property of ri has its baseline value by the end of the attention window (i.e., was never set to a nonbaseline value, or was set but eventually reset to the baseline value). Otherwise, Xij = 1. Figure 1 shows an example row of an indicator table.
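As a sketch, such a table can be derived from the change log of Table 2 in a few lines; here the log rows carry (resource, property, old value, new value) in time order, timestamps are omitted, and all names and values are illustrative.

from collections import defaultdict

def indicator_table(change_log, baselines):
    """Bit per (resource, property): 1 iff the property landed off-baseline."""
    final = defaultdict(dict)
    for resource, prop, old, new in change_log:   # later entries win
        final[resource][prop] = new
    props = sorted({p for vals in final.values() for p in vals})
    return props, {r: [int(v.get(p, baselines[p]) != baselines[p]) for p in props]
                   for r, v in final.items()}

log = [("vm01", "usb.present", "true", "false"),
       ("vm01", "config-ntp", "", "utcnist.colorado.edu"),
       ("vm02", "usb.present", "true", "false"),
       ("vm02", "usb.present", "false", "true")]   # changed, then reset: bit is 0
base = {"usb.present": "true", "config-ntp": ""}
print(indicator_table(log, base))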

Figure 1. For each resource like vm01 we associate a vector of bits that indicate whether the respective property of the resource has changed value. At the end of the attention window, if the property value lands in a nonbaseline condition, the corresponding bit is set to 1. Otherwise, the property was unchanged or reset to the baseline, and the bit value is 0.

We must pad the indicator table for certain resource and property combinations. Virtual machines and VMware® ESX™ hosts share some properties, such as IP address and config-ntp. In these cases the rows for virtual-machine resources and hosts might all have the bit set in the column for the shared property (e.g., IP address). However, some resource types have unique properties. Virtual machines have usb.present, but hosts do not. When a property is not applicable for a particular resource type, the respective row and column of the indicator table will contain 0, as if the nonexistent property is fixed in its baseline condition. With this padding, the indicator table for an attention window resembles Figure 2.

The change indicator table is not necessarily large. A row appears only if the respective resource has undergone at least one change during the attention window. A property column appears only if at least one resource has had that property value land in a nonbaseline condition. If a property does not change for any resource during the attention window, no column appears. Other omissions are possible. Some properties are immutable or read-only. Other properties, such as MAC addresses, are a priori irrelevant to policies even if they do change value. By omitting columns for unchanging and irrelevant properties, we further reduce the number of properties P1 .. Pk and the size of the indicator table. (Relevant properties are ones that affect a key performance indicator (KPI) or are tested by an extant policy such as an imported PCI benchmark, hardening guide, or STIG.)

4.2 Final-Value Table

The final-value table contains the final property value corresponding to each change (each ‘1’) in the change indicator table. See Figure 2b. The indicator and final-value tables provide two more insights necessary for discovering policies. Resources subject to a policy have 1s in the columns corresponding to properties disciplined by the policy. Meanwhile, the respective values in the final-value table are the property’s desired state. In this way, the indicator and final-value tables contain the information needed to discover the policies that were driving changes made during the attention window.

Figure 2a. Change Indicators for Resources. A value 1 in row i column j indicates that resource i’s Pj landed in a nonbaseline value at the end of the attention window. A 0 designates that the value did not change, changed back to the baseline by the end of the window, or was inapplicable for the resource type. Properties such as usb.present are applicable only to virtual machines and must be zero (MBZ) for resources of other types. Figure 2b. For each resource property, the final-value table records where that property value landed at the end of the attention window.

In theory we should be able to group resources with identical indicator vectors. Each group represents a policy, each bit in the indicator vector represents a policy condition, and the respective value in the final-value table is the desired state. Finally, the resources in the group constitute an asset class. In practice, however, things are more complicated. Grouping identical indicator vectors won’t work because of many issues:

  • Noise is present in the form of ad-hoc changes unrelated to policy.
  • People are inconsistent. (A change is made, then undone.)
  • Human actions can be incomplete or lack oversight. (A policy-prescribed change is never made.)
  • Earlier assumptions that identify relevant attributes can be imperfect.
  • Distinct but overlapping policies will discipline some common properties.
  • The number of policies being sought is unknown.

To deal with these difficulties, we use a more sophisticated type of grouping based on clustering techniques from machine learning. Clustering discovers relationships between subjects that are characterized by similar but not identical features. In a straightforward application we would use k-means clustering, take resource rows as subjects, and use the respective indicator vector as a feature vector. We would define a difference metric d(v,w) that measures the distance between two indicator vectors and run k-means over the indicator table. For example, in Figure 2, consider the three rows vm01, vm02, and vm03. Vectors vm01 and vm03 are similar, and the common properties are P1 and Pb. K-means would propose a policy {P1 = 2, Pb = utcnist.colorado.edu} covering the asset class {vm01, vm03}.

Unfortunately, the straightforward application of k-means isn’t appropriate for this problem. The usual difference metrics, such as Euclidean distance, don’t work well for measuring the difference between binary indicator vectors, and real-valued feature vectors are unavailable because resource properties (e.g., in Tables 1–3) usually have Boolean and other non-numeric values. Instead of clustering indicator vectors, we cluster properties. We use the indicator table (i.e., the indicator vectors) as a parameter to create a metric of property correlation. The reciprocal of correlation becomes the clustering difference d(Px,Py). For the property correlation, we use the mutual information I(Px;Py) [10] computed from how often two properties Px and Py change from their baseline values according to the indicator table:

I(Px; Py) = ∑x∈{0,1} ∑y∈{0,1} P(Px = x, Py = y) log [ P(Px = x, Py = y) / (P(Px = x) P(Py = y)) ]   (1)

Here P is probability: P(Px = 1) is the number of rows with Px = 1 in the Px column of the indicator table, divided by the total number of rows. The distance d(Px,Py) between two properties varies inversely with their mutual information.

Now we apply agglomerative clustering [11]. Initially the process places each property such as P1 into a singleton cluster {P1}. In each iteration, the closest two existing clusters are grouped until only one grouping remains. The difference between clusters is calculated as the distance between the closest members of the clusters (i.e., single linkage clustering). Agglomerative clustering returns the final grouping containing all the intermediate groupings it created during the process. This enables us to see a dendrogram tree of policy proposals.
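The following sketch clusters properties this way; the prototype described in Section 6 used C++, Python, and the Weka toolkit, so the SciPy/scikit-learn calls here are stand-ins. The distance is the reciprocal of the mutual information from formula (1), and single linkage implements the closest-member rule described above; the example data are synthetic.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.metrics import mutual_info_score

def property_linkage(ind):
    """Single-linkage clustering of properties; ind is resources x properties."""
    k = ind.shape[1]
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            mi = mutual_info_score(ind[:, i], ind[:, j])
            d[i, j] = d[j, i] = 1.0 / max(mi, 1e-9)   # reciprocal of correlation
    return linkage(squareform(d), method="single")

rng = np.random.default_rng(6)
policy = rng.integers(0, 2, (40, 1)) * np.ones((1, 3), dtype=int)  # co-changing P1-P3
noise = rng.integers(0, 2, (40, 2))                                # ad-hoc changes
Z = property_linkage(np.hstack([policy, noise]))
print(Z[:2])   # the first merges pair the correlated policy properties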

4.3 Asset Classes and Varieties
An asset class is a set of resources disciplined by a common policy (i.e., having the same change indications). Within an asset class there are varieties. These are subclasses of resources with identical change indications that also have the same values in the final-value table. SCAP terminology calls a variety a tailored benchmark [8]. Puppet calls a variety a set of resources sharing a desired state [13]. The vSphere Hardening Guide calls them risk profile levels [1]. Figure 3 shows one asset class having three varieties. (Real asset classes and varieties would have more members than this example.) Resources vm01–4 share the same change indications to P1 and P3. This makes vm01–4 an asset class, and {P1, P3} would be the policy discovered by clustering. Within the asset class, the subset {vm01, vm02} is a variety, distinguished by the desired state {P1 = 2, P3 = 10.0.1.11}. Resource vm03 forms another variety {vm03}. It also exhibits changes to P1 and P3, but now the desired state is {P1 = 1, P3 = 10.0.1.33}. This is a simplified version of an actual risk profile in the vSphere Hardening Guide, where P1 = 2 is from the virtual-machine hardening policy at risk profile level 3 and P1 = 1 is the same policy at risk profile level 1.

Figure 3. This final-value table shows that changes to properties P1 and P3 are highly correlated, making them a candidate policy. The respective asset class includes resources vm01–4, but not vm05. The value 0 indicates the baseline value. Non-zeros indicate the specific change. The changes produce three varieties: {vm01, vm02}, {vm03}, {vm04} distinguished by the respective property changes {P1=2, P3=10.0.1.11}, {P1=2, P3=10.0.1.33} and {P1=1, P3=10.0.1.33}.

5. Summary

To discover policies, we begin with a change log (see Table 2) and knowledge of the baseline value of every property (see Table 3). We scan the changes from a bounded attention window within the log. Each log entry there records the change to a property value on a specific resource such as vm01 (see Figure 1). Collecting these changes, we create two tables: an indicator table (see Figure 2) that records for each resource which properties changed to nonbaseline values during the attention window, and a final-value table (see Figure 3) that records the final value of each resource property. Formula (1) defines a difference metric from correlation of two properties by using the mutual information in their indicator vectors. Using the difference metric, we apply agglomerative clustering of properties P1, P2, and so on. The discovered clusters (see Figure 4) are candidate policies. The properties in each cluster are the subjects of the policy’s condition tests. The respective values (see Figures 2 and 3) in the final-value table are the desired state. Any resource with an indicator vector matching properties disciplined by a discovered policy is classified as a member of the policy’s asset class. Within the asset class, resources with properties having the same desired state make up a variety. Varieties correspond to tailorings for SCAP and desired-state manifests for Puppet.

6. Results

We created a prototype to discover configuration guidelines from a log of vCenter changes. We verified that we could generate the change log by using an existing program, pyReplay, which listens to the vCenter inventory service Atom feed. When the feed reports a change to a resource, pyReplay retrieves the XML description of the resource; compares the XML with the previous state; generates a tuple that includes the resource ID, timestamp, property ID, old value, and new value; and stores the XML if there were changes. Using pyReplay output, we created the change log of Table 2. From the change log we wrote separate C and Python programs that generate the indicator and final-value tables. To perform k-means and agglomerative clustering we used C++, Python, and the Weka toolkit [12].
To conduct experiments, we also created a C++ program that peppers the indicator table of Figure 2 with varying amounts of random noise. The noise takes the form of a probability e. Each bit of the table is reversed (i.e., 0→1 or 1→0) with probability e. By increasing e we can study the negative effects of noise on policy discovery. When e = 0.5 the table is entirely noise.
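Our noise injector was written in C++ and is not shown; an equivalent sketch of the bit-reversal step is:

import numpy as np

def pepper(table, e, seed=0):
    """Reverse each bit of the indicator table with probability e."""
    flips = np.random.default_rng(seed).random(table.shape) < e
    return np.where(flips, 1 - table, table)

print(pepper(np.eye(4, dtype=int), e=0.3))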

For generating the change log, we generated changes to 1,000 virtual-machine resources, each having 26 properties tested by the vSphere Hardening Guide. We generated the property changes by enforcing policies that ranged in size from 2 to 13 of the 26 properties. Each trial tested one guideline that was repeatedly applied to a decreasing percentage (decreasing asset class size) of the 1,000 virtual machines. In addition, we injected noise in the form of increasing values of e.

Because each policy consists of condition tests to an unknown subset of the 26 properties, we calculated the policy discovery error to be the difference between the number of properties of the target policy and the number of properties in the smallest discovered cluster that contains all the properties of the target. Figure 4 illustrates this calculation.

Figure 4. The policy discovery error is the number of properties misclustered into the smallest subtree containing all the properties of the unknown target policy that generates the test problem. Here the hidden policy is P1–P7, and the error is 8.
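Given a linkage matrix Z such as the one produced by property_linkage above, the error can be computed by replaying the merges; this sketch assumes SciPy's linkage encoding, where merge k creates cluster n + k, and the names are illustrative.

import numpy as np

def discovery_error(Z, n_props, target):
    """Size of the smallest cluster containing all target properties, minus |target|."""
    members = {i: {i} for i in range(n_props)}       # leaf clusters
    for k, row in enumerate(Z):
        a, b = int(row[0]), int(row[1])
        members[n_props + k] = members[a] | members[b]
    smallest = min((m for m in members.values() if target <= m), key=len)
    return len(smallest) - len(target)

# e.g., with Z from property_linkage above: discovery_error(Z, 5, {0, 1, 2})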

Figure 5 plots policy discovery performance versus noise and asset class size. The target (unknown) policy disciplines 7 of 26 properties of three asset classes of size 10, 20, and 40 chosen from 1,000 resources. Recall that the asset class size is the number of resources disciplined by the hidden policy. Along the x-axis the noise is increasing from 0 to 0.3. At 0.3, 30% of the indicator vector bits are randomly reversed to represent non–policy-induced changes in the event log. The y-axis shows the policy discovery performance in terms of the error.

Figure 5. Performance at Detecting a Policy That Is Disciplining 7 of 26 Properties and 10, 20, and 40 of 1,000 Resources.

The results in Figure 5 show that our process accurately discovers an unknown policy disciplining as little as 1% of the resource inventory. The discovery error begins to increase with increasing amounts of noise but is counteracted (delayed) by increasing asset class size. A policy with a modest asset class size of 4% of inventory (i.e., 40/1,000) is discovered accurately despite 15% noise.

7. Related Work

The National Institute of Standards and Technology (NIST) maintains a large collection of configuration guidelines that assess configuration compliance as part of the National Checklist Program [7] at the National Vulnerability Database. These are examples of manually created policies; many are not codified for automatic assessment. Our work aims to automate the discovery and encoding of similar guidelines that are timely and suitable for automatic assessment.

SCAP and OVAL [2] [3] are configuration assessment frameworks. These systems apply policies like the ones we discover. Currently, SCAP and OVAL guidelines (called benchmarks) are manually developed and codified using text editors and tools such as the Benchmark Editor and the Recommendation Tracker [13].

Enterprises sometimes require that their business processes, as well as configuration state, align with rules and policies. In the literature there is work related to compliance management for processes. In healthcare, for example, medical guidelines and clinical policies should be followed during patient treatments [16]. In control-flow compliance checking, Petri nets capture compliance rules in the form of patterns subsequently used to check the alignment of process behavior recorded in event logs [14]. Activity-oriented clustering determines the dependencies between process models and compliance rules with respect to a large number of business processes [15]. A data-driven approach has coordinated the behavior of business policies and their interactions [16]. Compliance Rule Graphs (CRG) detect the occurrence of business policies [17]. Our work differs from these process-oriented approaches in that our focus is on finding configuration policies expressed by users within logs of user activity. The Facter project gathers resource, property-value, and system information similar to the vCenter-plus-pyReplay process that we described above [5]. Facter returns a snapshot of a system’s current state. By differencing the state between Facter gatherings, pyReplay-style, we could generate the change log of Table 2 from any inventory of Facter-supported resources. This would extend the domain of our guideline discovery process to sources beyond vCenter.

Policies are models of desired state. The idea of using such models for data center automation is implemented by systems like Puppet [4] and CFEngine [6]. These systems require the administrator to first write the manifests and profiles manually. A future direction for our work is to discover and propose such manifests automatically.

8. Conclusions and Future Work

We have described a process for discovering configuration policies from a log of changes kept by vCenter during the operation of a data center. Our process does not require the manual entry of test conditions by an administrator. Instead, we discover policies and desired values by observing the resources that change, along with the affected property, new value, and knowledge of the baseline value. We cluster together properties by agglomerating ones that correlate. Two properties correlate if there is mutual information in the co-occurrence of their change to nonbaseline values. Clustered properties become the discovered policy. Resources disciplined by the policy form an asset class. We subdivide the policy into varieties by grouping resources within the asset class having the same desired values for the properties tested by the policy.
Automatic discovery of configuration policies is a time-saving tool. Instead of the user editing test conditions, the user's actions over time propose new policies and asset classes. This not only exploits the software-defined nature of the modern data center, but it also saves time and supports a DevOps-style assessment of data center operations. Policy discovery also has the potential to detect new guidelines that were not realized or articulated by the administrator. Our next step is to apply the policy discovery process to a live feed of vCenter changes and automatically propose desired states to assessment and remediation engines such as OVAL and Puppet.

Acknowledgments

We would like to thank Rob Helander for writing the pyReplay code that generates the configuration change events and resource states from the vCenter inventory service Atom feed. Thanks also to Rick Frantz for proposing and sponsoring the topic of automatic policy discovery as a research project.

References

1. vSphere 5.5 Hardening Guide, VMware Inc. 2013.
2. The Security Content Automation Protocol (SCAP), National Institute of Standards. Jan 2014.
3. The Open Vulnerability Assessment Language (OVAL), The MITRE Corporation. Mar 2014.
4. Puppet Enterprise. Puppet Labs. Mar 2014.
5. Facter 1.7. Puppet Labs. Mar 2014.
6. CFEngine. CFEngine AS. Mar 2014.
7. National Vulnerability Database, National Institute of Standards (NIST). 2014.
8. PCI SSC Data Security Standard. Payment Card Industry Security Standards Council. 2014.
9. Huttermann, M., DevOps for Developers. Apress. 2013.
10. Cover, T. M. and Thomas, J. A., Elements of Information Theory, John Wiley & Sons, Inc. 1991.
11. Witten, I. H., Frank, E., and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd Ed., Morgan Kaufmann. 2011.
12. Weka 3: Data Mining Software in Java, University of Waikato. Mar 2014.

VProbes: Deep Observability Into the ESXi Hypervisor


Martim Carbone
VMware, Hypervisor Team
mcarbone@vmware.com

Alok Kataria
VMware, Hypervisor Team
akataria@vmware.com

Radu Rugina
VMware, Hypervisor Team
rrugina@vmware.com

Vivek Thampi
VMware, Hypervisor Team
vithampi@vmware.com

Abstract

This paper presents VProbes, a flexible dynamic instrumentation system aimed at providing deep observability into the VMware® ESXi™ hypervisor and virtual machines (VMs) running on it. The system uses a high-level scripting language with which users define system events (called probes), and the associated actions at those events, typically to collect specific pieces of data. VProbes users can quickly collect customized pieces of data at arbitrary points in the system, using just a handful of lines of VProbes scripting. VProbes is a safe system, has no overhead when not in use, is scalable, allows instrumentation of ESXi on the fly, and can be used in all ESXi build types. The VProbes system is specifically designed to observe the main layers of the VMware software stack: the guest OS, the Virtual Machine Monitor (VMM), and the ESXi kernel. The tool makes it easy to collect various pieces of data at each layer, correlate events across these layers, or trace events from the guest all the way down to the hardware devices.

VProbes has been successfully used internally at VMware by code developers, the performance team, and by our customer support organizations. In certain cases, the support teams estimated that they saved several weeks of staff hours when troubleshooting difficult issues. VProbes has also been used as the underlying data collection and system instrumentation mechanism in the Dynamic Driver Verifier (DDV) tool that the VMware ecosystem uses for developing ESXi drivers.

1. Introduction

As software systems get increasingly complex, debugging, monitoring, or understanding system behavior becomes increasingly difficult. The problem is even more challenging in a virtualization environment, where the hypervisor adds more layers to the software stack. Understanding the interactions among the different layers and identifying the component that causes a certain issue—such as a VM hang, high disk latency, or slow network—requires tools that quickly gather low-level data from a running system.

VProbes is an internal tool developed at VMware to help this process by collecting system data that provides detailed observability into all the layers of the ESXi software stack, from the hypervisor to the guest running inside the VM. VProbes uses dynamic instrumentation and is similar in spirit to other tools developed for instrumenting operating system kernels and applications, such as DTrace [4] or SystemTap [12]. VProbes generalizes probing of a single operating system to probing across the multiple layers of a virtualization stack. To use VProbes, users write simple scripts in a C-like language. The script defines a set of events to be observed and the actions corresponding to those events. To troubleshoot an issue, users can start asking high-level questions about the system, then iteratively refine their queries using more detailed VProbes scripts, forming hypotheses and collecting specific data to verify their conjectures. VProbes can be used to help answer questions such as why an application’s performance degrades when running in a VM compared to running on bare metal; which VMs or services are using the most I/O bandwidth, CPU cycles, or memory; or how application or guest OS behavior affects the hypervisor and the virtual infrastructure. VProbes has a number of features that make it appealing to a wide range of users:

  • Dynamic – It uses dynamic instrumentation to probe a running system on the fly, without having to recompile or reboot ESXi. This is a key feature that distinguishes it from common practices of adding print statements in the source code. Avoiding rebuilds and reboots can significantly improve developer productivity and speed up the turnaround for VMware support teams when they troubleshoot customer issues.
  • Flexible – It exposes a scripting-language interface to facilitate the process of asking arbitrary queries and collecting customized pieces of data. Together with the dynamic aspect, VProbes enables developers and support engineers to iterate through different hypotheses, gradually zooming in on the answer they are looking for.
  • Available everywhere – It is available in all build types, including release builds (but currently available only for internal use, e.g., by the VMware support teams). Because release builds trade off certain pieces of debug information, such as logging or code assertions, in favor of performance, VProbes can provide invaluable visibility into systems running release builds of ESXi.
  • End-to-end – It allows inspection of the whole virtualization stack, from the guest to the hypervisor, tracking events or making correlations across layers.
  • Safe – It ensures that a script will not cause the system to hang or crash due to an infinite loop or a bad pointer.
  • Free – When not in use, VProbes causes no overhead or penalty in the running system. When a VProbes script is unloaded, the system goes back to its original state, with zero overhead.

The rest of the paper is organized as follows. Section 2 describes the VProbes programming system. Section 3 illustrates a number of common VProbes usage patterns. Section 4 discusses real-world uses of VProbes. Section 5 presents an experimental evaluation. Section 6 discusses related work. Section 7 concludes the paper and discusses future directions.

2. VProbes Programming System

The VProbes programming system is a framework expressly designed for observing ESXi in a production environment. It consists of a high-level language, a supporting runtime, and a collection of instrumentation engines.

The VProbes language, called Emmett, has been tailored from the ground up for programmatically observing and collecting system information, and for processing and reporting that information on the fly. Users write scripts consisting of one or more probes describing events of interest, and blocks of code that are executed when they happen.

There are three instrumentation domains in VProbes, which represent major components of ESXi: VMK (VMkernel, the ESXi operating system kernel), VMM (the Virtual Machine Monitor, which creates a virtual execution environment with one or more virtual CPUs, for running guest operating systems), and VMX (the Virtual Machine Extension, a user-level companion process to the VMM).

Figure 1 shows a graphical representation of these domains. Each of these components has a VProbes engine built into it, providing low-level binary instrumentation capabilities and runtime support for executing scripts. User scripts are first compiled into a safe, type-checked intermediate form and dispatched to intended target domains.

Figure 1. VProbes Instrumentation Domains in ESXi.

In addition to the three domains described above, VProbes also has a GUEST instrumentation domain representing the guest operating system running in the VM. Instrumentation of guest software is implemented in the VMM and VMX.

2.1 Concurrency
ESXi components are multithreaded software systems. Events of interest can occur on any thread of execution, and potentially simultaneously as they are executed on different physical CPUs. Consequently, probes can fire concurrently.

VProbes has been designed to deal with this inherent concurrency in the system as efficiently as possible. The language runtime is innately thread-aware and safe. Basic integer and string types provide per-thread semantics. Integer variables can also be explicitly declared as shared between threads with guarantees of atomicity and sequential consistency. More-complex shared data structures—such as bags, which are used to store integer key-value pairs, and aggregates, which are used to build histograms—are designed not only to be thread-safe, but also to provide wait-free guarantees.

Output generated by probes, such as the output of a printf statement, is queued in per-thread or per-CPU buffers. It is eventually serialized in global time order and made available to the user, when the system has the resources to do so.

2.2 Safety
Safety is central to the design of VProbes. Users cannot inadvertently impede the normal functioning of a system, modify system state, or perform illegal operations that can corrupt or crash the system. VProbes scripts are compiled to an intermediate form with static type checking. This intermediate form is further translated into machine code with strict runtime bounds checking.

Execution of probes is time bounded, which ensures that probes cannot cause the system to stall, guarding against inadvertent uses of looping constructs that continue forever. VProbes also does accounting of overhead incurred due to probe execution and performs rate-limiting if necessary.

Low-level systems such as ESXi have transitional phases during which the state of the CPU might not be reliable—for example, the small window of time as a system call transitions from the VMM to the VMkernel, or during world switch, as CPU state is being saved and restored. VProbes has been hardened for these scenarios, either by ensuring that we make reliable assumptions about the system, or by disallowing probe execution in these small windows. To increase confidence in VProbes safety, our verification team designed a random test generator [3] that injects random, machine-generated scripts into ESXi. This tool has uncovered a number of corner cases in critical parts of the kernel and enabled us to harden the VProbes system to safely handle those cases.

2.3 The Emmett Language
Emmett is the high-level, domain-specific language (DSL) for interacting with the VProbes instrumentation framework. Its syntax and semantics are heavily borrowed from the C programming language, which lowers the barrier to entry for users, whom we expect to be familiar with low-level system software written in C, such as ESXi. In the following subsections, we provide an overview of the language.

2.3.1 Probes
A probe in the Emmett language consists of a name that resolves to a system event or instrumentation point, called a probe point. An associated block of code is executed when the system “hits” that probe point, an event also known as a probe fire. A probe name is a series of colon-separated tokens that resolve to a probe point, starting with the domain (one of GUEST, VMM, VMX, or VMK), followed by the type of the probe (ENTER, EXIT, PROFILE, etc.), and a location identifier (function name, raw linear address, etc.). A probe can also receive arguments. In the VMK domain, for instance, an ENTER probe receives the arguments to the invoked function. Probes can be broadly categorized into

  • Dynamic probes, which fire at arbitrarily selected control points identified by function entry, exit, or instruction offsets, or at instructions that perform data reads and writes to specific linear addresses in the target system. Dynamic probes provide arguments relevant to the probe type, such as the arguments to the function call in entry probes, value being returned by the function in exit probes, and exception frames in offset probes.
  • Static probes, which fire at predefined points in the system. They are placed by developers using special markers in the source code. These probe points usually represent a system’s architectural points of interest, making them observable without details of underlying implementation. Static probes also provide contextually relevant arguments as predetermined by the system designer.
  • Periodic probes, which fire at a time-periodic rate. The period can be fixed or programmable. They are useful for profiling and performing periodic reporting of data.

2.3.2 Data Types
Emmett provides basic integer and string types for variables, and a standard set of arithmetic operators and built-in functions for manipulating them. Variable values are persistent across probe fires and, by default, instantiated on a per-thread basis. For dynamic storage and lookup of integer key-value pairs, the language also provides bags, which are highly efficient, shared, lockless data structures with fast insertion, lookup, and removal times. Emmett also provides native support for building histograms in the form of aggregates. Aggregate variables can be used to collect integer value samples distributed into buckets identified by a combination of integer and string keys. A built-in function called printa can customize and print the aggregate. Adding samples to an aggregate variable is fast and wait-free.

2.3.3 Compound Statements
The Emmett language includes the following compound statements:

  • if-else branching
  • for, while, and do-while generic loops
  • foreach loops for iterating over bags
  • try-catch construct for handling exceptions

Unlike C, Emmett has no unstructured control-flow statements (goto).

2.3.4 Target Inspection
VProbes has built-in support for accessing target system memory, which is often crucial in analyzing the state of the system. One primitive way of accessing raw system memory is to use the built-in getvmw() function, which takes a linear address and returns the 8-byte value at that location. For more-structured access to memory, Emmett also provides special support for pointers and composite types (such as structures, unions, and arrays), which can only be used for target memory inspection. Emmett natively supports pointers to standard C types (int*, char*, etc.), which can be assigned linear addresses in the target domain and dereferenced using the * operator. For access to more complex objects in memory, users can define and use composite types. For probes in the VMkernel domain only, Emmett also supports a special $module.typename notation, which imports data types from the target system, obviating the need to define them in user scripts. This greatly improves usability by making access to complex objects in memory seamless.

In addition to accessing memory, the Emmett language includes a variety of built-in variables and functions that provide access to the current call stack, virtual and physical CPU register state, the time stamp counter (TSC), and so on.

3. Observability Patterns

The generality of VProbes as a multilayer observability tool makes it useful in a wide variety of use cases, including debugging, macro or micro performance analysis, and general system introspection. Using real-world examples, this section builds upon the basics described in section 2 and illustrates some of the most common and useful VProbes observability patterns that are being used at VMware.

3.1 Tracing
One of the simplest and most common observability patterns is tracing the execution of a function. This usually involves printing a message to the screen or a log file at every invocation of the function, or a subset of those. This pattern is commonly used to generate a chronological log of a certain type of event, which can then be used for performance analysis or debugging purposes. The following script is an example of tracing with VProbes:

VMK:ENTER:user.Linux32_Write(int fd, int buf, uint32 size) {
  if ((string)curworld()->name == "sshd") {
    printf("%s: size = %u bytes\n", PROBENAME, size);
  }
}

Placement of a dynamic probe at the start of ESXi kernel function user.Linux32_Write() causes the probe body to be executed every time the function is called. The example shows how Emmett can be used to conditionally print trace entries, helping reduce the output volume. In this case, an if statement is used so that only invocations associated with the sshd world are printed:


VMK:ENTER:user.Linux32_Write: size = 132 bytes
VMK:ENTER:user.Linux32_Write: size = 148 bytes
VMK:ENTER:user.Linux32_Write: size = 100 bytes
VMK:ENTER:user.Linux32_Write: size = 100 bytes
...

Contextual information about the call, such as the size argument, is also printed. This pattern can be applied to any function at any layer of the software stack. The following example shows how a GUEST probe can be used to trace the system calls being invoked inside a VM:

GUEST:ENTER:system_call {
  printf("System call %x (CR3 %x)", RAX, CR3);
}

This script is intended for a Linux guest OS, where the system call number is passed in the rax register. When loading this script, the user must provide the Linux kernel symbol map on the command line. This enables the vprobe command to resolve the system_call symbol to its address and then instrument that address. The example generates as output a trace of system calls identified by their corresponding numbers, along with the current contents of the CR3 register, which can be used to identify the invoking process.

3.2 Counting
For certain use cases and types of events, the level of detail and volume of data generated by tracing can be excessive and cumbersome. Often, a statistical summary of events is more useful in finding answers. Using integer data types, events can be counted and printed periodically or at script unload. Beyond simple counting, histograms are a powerful statistical tool for summarizing data.
VProbes provides built-in support for generating histograms in the form of aggregates. The following script shows a typical use of aggregates:

aggr exits[1];
VMM:HV_Exit {
  exits[VMCS_EXITCODE]++;
}
VMM:VMMUnload {
  printa(exits);
}

This example instruments HV_Exit, a static probe in the VMM domain that fires every time that a VM running in hardware-virtualization mode exits to the VMM. The exits variable is used to aggregate samples and build a histogram representing the number and frequency of VM exits, using the exit type (VMCS_EXITCODE) as key. VMMUnload is a probe that fires when a script is unloaded from the VMM. In this case, it is used to print the histogram of VM exits:

intKey0  count  pct%
   0x10    222  0.2%
    0x0    222  0.2%
    0x1   3187  3.5%
    0x7   3302  3.7%
   0x1e  13003 14.6%
    0xc  20131 22.6%
   0x30  48623 54.8%

The intKey0 column represents the VM exit code, and the count and pct% columns are self-explanatory. The output above tells us that more than half of the VM exits for the probed VM are related to EPT violations (exit reason 0x30). This information can be useful, for instance, for characterizing a workload running inside a VM.

3.3 Latency Measurements
A common observability pattern in performance analysis is to measure the latency of events, by calculating the time elapsed between the start and the end of the event being measured. This pattern translates into a VProbes script consisting of two probes representing the start and end points of the event, and using a global monotonic timestamp counter (TSC) to measure the time elapsed between those two probe fires.

The following script implements this pattern for measuring the latencies of hardware-virtualization exits:

int exittsc;
aggr lats[1];
VMM:HV_Exit {
  exittsc = TSC;
}
VMM:HV_Resume {
  if (exittsc > 0) {
    int lat;
    lat = TSC - exittsc;
    lats[VMCS_EXIT_REASON] <- lat; /* add latency sample, keyed by exit reason */
  }
}
VMM:VMMUnload {
  printa(lats);
}

Static probes HV_Exit and HV_Resume fire at the start and end of a VM exit, respectively. TSC is a built-in global that returns the current value of the physical CPU’s TSC. The difference between the two measurements is the number of CPU cycles elapsed between the start and end of the VM exit. The latency sample is put in an aggregate, which is later printed as a histogram showing a distribution of VM exits, by the type of exit:

intKey0      avg   count     min       max     pct%
   0x10     4053     183    1494     28314     0.0%
    0x0    26797     215    3576   1765422     0.0%
    0x7     2914    2357    1632     38532     0.0%
    0x1    21297     866    1620    487578     0.0%
   0x1e    12397   10718    2574   2804826     0.6%
   0x30    25954   49394    1656   8600862     5.9%
   0xc    15625   16469   11574  25295526     93.2%

These results tell us that 93.2% of the VMM exit handling time is spent on a HLT instruction executed by the guest operating system (exit reason 0xc).

3.4 Profiling
VProbes makes it easy to build a system-wide profiler through periodic sampling of system state. Periodic probes are especially useful in this case, given their ability to periodically execute a body of code and collect information from the system.

This example illustrates VProbes’ multi-domain probing capabilities by jointly profiling different layers of the ESXi stack:

perhost aggr ticks[0][1];
VMM:TIMER:1msec {
  string backtrace;
  gueststack(backtrace, 3);
  ticks[backtrace]++;
}
VMK:PROFILE:1msec {
  string backtrace;
  vmwstack(backtrace, 3);
  ticks[backtrace]++;
}
VMK:VMKUnload {
  logaggr(ticks, 1);
}

This script combines one periodic probe in the VMM layer with another in the VMK layer, both with the same period. At each probe fire, a new sample containing the current GUEST/VMK stack trace is added to global aggregate ticks.

When the script is loaded against the VMkernel and all VMs running on the system, it produces a histogram that gives a global view of the host’s execution profile, both across different layers and within individual layers.
Below is a sample output generated by profiling an ESXi system with one VM over a period of 10 seconds. It shows that 73.7% of the samples came from a single code location in the guest:

...
vmkernel.Power_HaltPCPU+0x285
vmkernel.CpuSchedIdleLoopInt+0x61b 484 3.1%
vmkernel.CpuSchedTryBusyWait+0x2c6
[0x828547d3]
[0x8281a187]          1090 73.7%
[0x82b5e099]


3.5 Inspection of System State

The examples so far show simple cases of system state introspection. The first example accesses an argument passed to the Linux32_Write() function. The second and third read global hardware state: the VMCS exit code and the TSC. But sometimes the information required is not as easy to retrieve and requires parsing through data structure hierarchies in memory.

VProbes supports special classes of types—such as pointers, structs, and unions—to inspect target memory. The following script places a dynamic probe on the entry point of the VMkernel function responsible for processing a list of incoming network packets, and traverses that linked list at each invocation. At each node, it aggregates the source and destination IP addresses of the packet:

aggr rxconn[2][0];
VMK:ENTER:Net_AcceptRxList(
    void *dev,
    $vmkernel.PktList *list) {
  $vmkernel.PktHandle *pkt;
  pkt = list->csList.slist.head;
  while (pkt != 0) {
    $vmkernel.vmk_EthHdr *eh;
    $vmkernel.vmk_IPv4Hdr *ih;
    eh = pkt->frameVA;
    if (eh->type == 8) { // IPv4
      ih = &pkt->frameVA[14];
      rxconn[ih->saddr, ih->daddr]++;
    }
    pkt = pkt->pktLinks.next;
  }
}
VMK:VMKUnload {
  printa(rxconn);
}

The example illustrates the C-like syntax for dereferencing pointers and accessing struct fields, as well as a special $module.typename syntax used to reference types defined in the target domain. These types are automatically imported and do not need to be declared. VProbes guarantees that all operations involving target memory are safe by performing access checks beforehand.

4. Use Cases

In this section we look at a real-world use case in which VProbes was successfully used to debug a customer issue. We also describe several tools developed at VMware using VProbes.

4.1 Debugging Lock Contention Issues

A VMware support team received an initial report of a customer seeing sporadic ESXi hangs while performing file system operations. Initial analysis showed that the problem was caused by lock contention involving global semaphores in the file system layer. Existing tools did not provide enough information about the locks to pinpoint the source of contention.

The support team wrote a VProbes script to gather statistics about these global semaphores, which are acquired when file system operations are performed. When writing the VProbes script, they first identified the probe points of interest:

  • Lock request – Fires when semaphore acquire is requested
  • Lock acquire – Fires when semaphore is acquired
  • Lock release – Fires when semaphore release is requested

After the probe points were identified, the team used the TSC built-in and information about the current world (the equivalent of a thread in the ESXi kernel) at each point to build histograms of the average wait time and hold time for the semaphore, categorized per world. Here is the VProbes script that was used:

perhost uint64 fsLockAddr;
bag   semRequestTime;
bag   semAcquireTime;
aggr  resultWait[1][1];
aggr  resultHold[1][1];
/* Get the address of the global lock */
VMK:VMKLoad {
  fsLockAddr =
    sym2addr("vmkernel.fsLock");
}
/* Lock acquire request */
VMK:ENTER:SemaphoreLockInt(int sem) {
  if (sem == fsLockAddr) {
    int wid;
    wid = curworld()->id;
    semRequestTime[wid] = TSC;
  }
}
/* Lock acquired; per the original fragment, the semaphore
   address is read from the RBX register at function exit */
VMK:EXIT:SemaphoreLockInt {
  if (RBX == fsLockAddr) {
    string wname;
    int waitTime, wid;
    wid = curworld()->id;
    /* convert TSC cycles to microseconds */
    waitTime =
      ((TSC - semRequestTime[wid])
      * 1000000) / TSC_HZ;
    semAcquireTime[wid] = TSC;
    wname = (string)curworld()->name;
    resultWait[wname, wid] <- waitTime;
  }
}
/* Lock release (release-path symbol name assumed) */
VMK:ENTER:SemaphoreUnlockInt(int sem) {
  if (sem == fsLockAddr) {
    string wname;
    int heldTime, wid;
    wid = curworld()->id;
    heldTime =
      ((TSC - semAcquireTime[wid])
      * 1000000) / TSC_HZ;
    wname = (string)curworld()->name;
    resultHold[wname, wid] <- heldTime;
  }
}

Sample output of the script:

Lock wait statistics:
fsLock   helper50-3   1000014344    1 7 7 7
fsLock   helper50-4   1000014345    1 8 8 8
fsLock   FS3ResMgr    1000014340 2 10 10 10
Lock hold statistics:
fsLock helper50-3    1000014344     1 8 8 8
fsLock helper50-4    1000014345     1 8 8 8
fsLock FS3ResMgr     1000014340  2 21 28 24

This information helped isolate the world that was heavily using the fsLock semaphore. After the world was identified, the support team refined the script to collect call chain information using the vmwstack built-in, to figure out the code that needed to be optimized.

4.2 Dynamic Driver Verifier
The Dynamic Driver Verifier (DDV) [5] is a tool intended to help VMware developers and partners working on ESXi device drivers uncover bugs and accelerate debugging after they are uncovered. It closes coverage gaps by intercepting function calls from a Driver-under-Test (DuT) to the VMkernel and checking them for erroneous patterns such as use-before-initialization, double-free, boundary violations, and resource leaks. DDV also modifies the VMkernel’s responses to the calls, to artificially induce memory allocation failures.

DDV uses VProbes to dynamically intercept function calls related to the DuT internals, core kernel internal calls, and even guest OS calls, providing additional contextual information to analyze a bug. Internally, DDV has been successfully used to track down more than 35 driver bugs. These included incorrect error handling paths, missing error checks, memory leaks, and failures to release resources.

4.3 Packet Tracing Tool

PktTrace [14] is an internal network packet tracing tool aimed at examining the networking behavior in a software-defined data center. PktTrace is primarily meant as a verification tool for NetX, a network extensibility solution that enables VMware partners to implement services such as intrusion detection or data compression using service VMs. NetX enables the VMs to implement those services by intercepting and manipulating network traffic according to a set of partner-defined filter rules. The goal of PktTrace is to verify that the system implementation obeys the network filter rules.

PktTrace uses VProbes along with pktCapture, a specialized packet tracing tool, to gain detailed visibility into the system and track individual packets through the data path, from the ESXi kernel, through the service VM, and to the destination VM, reporting violations of the network rules. The flexibility of VProbes enables PktTrace to intercept any point along this path. In addition, PktTrace can use the installed probes to compute and report latencies for different segments of the path.

4.4 Dynamic Data Tracker

The Dynamic Data Tracker (DDT) [9] is an internal debugging tool for tracking dynamically allocated data structures in ESXi. Unlike hardware watchpoints, which are limited in number and in object size, DDT provides a software watchpoint solution that enables tracking of unbounded numbers of large kernel structures.

The tool consists of two phases. In the first phase, DDT identifies the instructions that access the structure in question by allocating it on a protected page. Accesses to the data structure are intercepted via page faults, and the fault handler records the instruction that accesses the data structure. The second phase of DDT uses VProbes to instrument all the instructions identified in the first phase. DDT uses a default VProbes script that invokes the printf built-in to trace each access to the data structure. The flexibility of VProbes scripting enables users to further augment this script and perform customized actions at each memory access, such as printing backtraces or inspecting other parts of the system state.

5. Experiments

This section provides an evaluation of the runtime impact of VProbes. Our experience, illustrated by the numbers in this section, is that VProbes imposes no performance overhead at low probe frequencies, typically up to a few thousand probes per second. At high probe frequencies, the overhead is highly dependent on

  • The probe rate
  • The probe type (static or dynamic)
  • The amount of work done in each probe

Our experiments were conducted on a two-socket, quad-core Intel Xeon server with 12GB RAM and one NUMA node per socket, running a recent build of ESXi. We used a four-vCPU, 4GB, 64-bit Windows Server 2008 VM running an IOmeter workload. The IOmeter application had two workers and a single outstanding I/O per disk target. To measure the impact of VProbes, we installed probes on the critical I/O path of the storage stack and observed the throughput in MB per second as reported by IOmeter. The script computes latencies via aggregation, using an approach similar to the example in section 3.3.

5.1 Low-Frequency Probes
In the first experiment we ran IOmeter with sequential writes and different block sizes of 4KB, 16KB, and 32KB. This workload generated about 115 I/Os per second in all these configurations. By installing multiple probes on each I/O, we were able to control the probe fire frequency in these experiments. We used both static and dynamic probes, and varied the probe frequency from 100 probes to 1,000 probes per second. In all cases, no performance degradation was observed in the throughput reported by IOmeter.

5.2 High-Frequency Probes
In a second experiment, we ran IOmeter with sequential reads and a block size of 4KB. This workload generates about 18,000 I/Os per second.

We installed 1, 2, 3, or 4 probes in the I/O path, which resulted in probe fire frequencies ranging from 18,000 to 72,000 probes per second. We first ran the experiment with static probes, and then repeated it with dynamic probes. Figure 2 shows the results. For static probes, the decrease in IOmeter throughput ranged from 3.08% to 3.8% relative to the case in which no probes were installed. For dynamic probes, the decrease ranged from 3% to 4.5%. We also enabled all probes in the script, 4 static and 4 dynamic probes per I/O, for a total of 144,000 probes per second. The performance drop in that case was 6.2%.

Figure 2. VProbes Overhead When Running at High Probe Frequencies, with Overhead Measured as I/O Throughput Degradation.

These experiments demonstrate that in this latency measurement use case, VProbes has no runtime impact at low event frequencies, and relatively small impact as the probe fire frequency is increased.

6. Related Work

This section discusses several related performance monitoring and troubleshooting tools, as well as several generic dynamic instrumentation frameworks.

6.1 Scripting-Based Tools
DTrace [4] for Solaris and MacOS and SystemTap [12] for Linux are popular dynamic instrumentation systems for debugging, troubleshooting, and performance analysis. Both of these systems provide scripting languages to control how the system is instrumented and what data needs to be collected. They support dynamic function-boundary probes, as well as static probes tailored to their particular operating systems. VProbes is similar in spirit to DTrace and SystemTap but is specifically designed for the virtualization environment of ESXi. It is designed to observe the three main layers of the ESXi virtualization stack—the guest, the VMM, and the ESXi kernel—enabling it to track events or make correlations across these layers.

6.2 Specialized Tools
Several other tools have been developed for troubleshooting, debugging, and monitoring. Lttng [6] for Linux and ETW [15] for Windows are general-purpose tracing tools designed for fast collection of system traces with low runtime overhead. They both include a number of associated tools for manipulating the collected traces. Profiling tools such as perf [20], oprofile [19], and VTune [22] use hardware event sampling to gather performance data. Other profiling tools have been designed to intercept more-specific events: strace [21] collects and reports system call information; ltrace [2] monitors dynamic library calls; and GCC gprof [17] provides call graph profiling.
All of the above are specialized tools designed to extract specific pieces of information, namely traces and profiles. In contrast, VProbes is an open-ended tool that is not limited to particular collection points, pieces of data, or data formats.

6.3 Dynamic Instrumentation
Enabling dynamic instrumentation via binary translation (BT) is a common technique used in systems such as Dyninst [10], Pin [8], Valgrind [11], and Dynamo [1]. These provide frameworks and APIs for instrumenting user-space applications. They have been typically used for building program analysis tools. More recently, binary instrumentation has been used for kernel space as well [7]. Kprobes [18], KernInst [13], DTrace [4], and GDB [16] provide APIs or implement instrumentation frameworks that dynamically patch code with instructions that redirect the control to instrumentation routines. The dynamic instrumentation in VProbes is similar—we patch the code with instructions that trigger debug exceptions.

7. Future Work and Conclusions

This paper described VProbes, an internal troubleshooting and monitoring tool that uses dynamic instrumentation for the ESXi hypervisor and VMs running on it. VProbes exposes a scripting language interface that makes it easy to query the system for specific pieces of data, including event traces, histograms, latencies, backtraces, and many others. One of the unique features of VProbes is the ability to correlate events among the guest, the VMM, and the ESXi kernel. The tool has been successfully used by the VMware support teams, as well as in the Dynamic Driver Verifier partner development tool.

One possible direction of future work is the integration of various hardware debugging capabilities with VProbes. For example, VProbes could be extended to expose the rich set of Performance Monitoring Counter (PMC) events as new probes in the system. New hardware debugging capabilities such as Intel’s Precise Event Based Sampling (PEBS) or Processor Trace (PT) could also be exposed in VProbes, using the scripting language as a vehicle for easy setup and custom programming of the start and end points of the data collection.
Another possible direction of work is the extension of VProbes with support for user-space probing. Currently, the VMX process is the only user process that can be probed. Extending VProbes to all user processes could provide better observability into other components of the vSphere infrastructure, for example into the host agent (hostd) process.

Currently VProbes targets a single ESXi host. Future work could extend the system to allow the writing of scripts that probe multiple hosts in a cluster. Challenges include aggregating data from different hosts, time-ordering events from different hosts, and probing VMs while they are migrated from one host to another. Finally, VProbes can be used as a building block for other tools, such as specialized command-line debugging tools, or higher-level UI tools. More generally, VProbes can provide a flexible data source to new or existing analytics tools, such as VMware® vCenter™ Operations Manager™.

Acknowledgments

We would like to thank all past members of the VProbes and VMM teams at VMware who contributed to this work with code, discussions, or code reviews. In particular, we’d like to thank the past members of the VProbes team, Keith Adams, Eli Collins, Robert Benson, Alex Mirgorodskiy, Matthias Hausner, and Ricardo Gonzalez. We also want to thank the current and past members of the VProbes and monitor verification teams, including Janet Bridgewater, Hemant Joshi, Tung Vo, Jon DuSaint, and Lauren Gao. Finally, we’d like to thank Bo Chen and the DDV team, as well as Chinmay Albal and the CPD organization for their feedback and feature requests that greatly improved VProbes.

References

1. Bala, V., Duesterwald, E., and Banerjia, S. Dynamo: a transparent dynamic optimization system. In Proceedings of the 2000 ACM SIGPLAN Conference on Programming Language Design and Implementation (2000), pp. 1–12.
2. Branco, R. R. Ltrace internals. In Ottawa Linux Symposium (2007).
3. Bridgewater, J., and Leisy, P. Improving software robustness using pseudorandom test generation. In VMware Technical Journal, Winter 2013 Edition (2013).
4. Cantrill, B., Shapiro, M. W., and Leventhal, A. H. Dynamic instrumentation of production systems. In USENIX Annual Technical Conference, General Track (2004), pp. 15–28.
5. Chen, B. A runtime driver verification system using VProbes. In VMware Technical Journal, Summer 2014 Edition (2014).
6. Desnoyers, M. and Dagenais, M. The lttng tracer: A low impact performance and behavior monitor for GNU/Linux. In Ottawa Linux Symposium (2006).
7. Feiner, P., Brown, A. D., and Goel, A. Comprehensive kernel instrumentation via dynamic binary translation. In Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (2012), pp. 135–146.
8. Luk, C.-K., Cohn, R. S., Muth, R., Patil, H., Klauser, A., Lowney, P. G., Wallace, S., Reddi, V. J., and Hazelwood, K. M. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (2005), pp. 190–200.
9. Ma, M.-K., and Ravenna, N. The dynamic data tracker. In VMware Technical Journal, Summer 2014 Edition (2014).
10. Miller, B. P. and Bernat, A. R. Anywhere, any time binary instrumentation. In ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (Szeged, Hungary, 2011).
11. Nethercote, N., and Seward, J. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (2007), pp. 89–100.
12. Prasad, V., Cohen, W., Eigler, F., Hunt, M., Keniston, J., and Chen, B. Locating system problems using dynamic instrumentation. In Ottawa Linux Symposium (2005), pp. 49–64.
13. Tamches, A. and Miller, B. P. Fine-grained dynamic instrumentation of commodity operating system kernels. In Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (1999), pp. 117–130.
14. Zou, H., Mahajan, A. and Pandya, S. PktTrace: A packet lifecycle tracking tool for network services in a software-defined datacenter. In VMware Technical Journal, Summer 2014 Edition (2014).
15. Etw. http://msdn.microsoft.com/en-us/library/ms751538
16. Gdb. http://www.gnu.org/software/gdb
17. Gprof. https://sourceware.org/binutils/docs/gprof
18. Kprobes. https://www.kernel.org/doc/Documentation/kprobes.txt
19. Oprofile. http://oprofile.sourceforge.net/news
20. Perf. https://perf.wiki.kernel.org
21. Strace. http://sourceforge.net/projects/strace
22. Vtune. http://software.intel.com/en-us/intel-vtune-amplifier-xe

VSRT: An SDN-Based Secure and Fast Multicast Routing Technique


Ajay Kumar
VMware Inc.
ajayk@vmware.com

Abstract

Traditionally, concerns about reliability, scalability, and security have resulted in poor adoption of IP multicast in the Internet. However, data center networks, with their structured topologies and tighter control, present an opportunity to address these concerns. In this paper, I present VSRT—a software-defined networking (SDN) based system that enables multicast in commodity switches used in data centers. As part of VSRT, I develop a new multicast routing algorithm called Very Secure Reduced Tree (VSRT). VSRT attempts to minimize the size of the routing tree it creates for any given multicast group. In typical data center topologies such as Tree and FatTree, VSRT reduces to an optimal routing algorithm that solves the Steiner Tree problem. VSRT leverages SDN to take advantage of the rich path diversity commonly available in data center networks and thereby achieves highly efficient bandwidth utilization. I implement VSRT as an OpenFlow controller module. My emulation of VSRT with Mininet Hi-Fi shows that it improves application data rate by up to 12% and lowers packet loss by 51%, on average, compared to IP multicast. I also build a simulator to evaluate VSRT at scale. For the PortLand FatTree topology, VSRT results in at least a 35% reduction, compared to IP multicast, in the number of links that are less than 5% utilized, when the number of multicast groups exceeds 1,000. My results confirm that VSRT produces smaller trees than traditional IP multicast routing.

1. Introduction

Group communication is extensively used in modern data centers. Some examples include Apache Hadoop [1], which uses data replication for higher availability; clustered application servers [2], which require state synchronization; and cloud environments, which require OS and application image installation on a group of virtual machines (VMs). Multicast lends itself naturally to these communication patterns. IP multicast, which has been in existence for several years, is the most common multicast implementation for traditional networks. It is prudent to carefully consider the adoption of IP multicast in data centers.

Traditional IP multicast has remained largely undeployed in the Internet owing to concerns about reliability, scalability, and security. Network protocols such as TCP, which is the de facto standard for reliable unicast, incur significant latencies when applied to multicast. Ymir et al. [5] have studied application throughput for data centers that use TCP for reliable multicast. Address aggregatability, a mechanism for reducing unicast forwarding state in switches, is not feasible with IP multicast addresses. This leads to switch state explosion as the number of multicast groups scales up. IP multicast allows any host to subscribe to a multicast group and start receiving group traffic. Security, therefore, becomes very difficult to enforce. Data center networks with their structured topologies and tighter control present an opportunity to address these concerns. As a result, there has been renewed interest in multicast with specific focus on data centers. In [6], the authors propose reliable data center multicast. They leverage the rich path diversity available in data center networks to build backup overlays. In [7], the authors use multiclass bloom filters to compress multicast forwarding state. Security, though, remains a concern. There are also some additional concerns specific to the adoption of IP multicast in data center networks. IP multicast is not designed to take advantage of path diversity, which, unlike in traditional IP networks, is an integral part of data center networks.

This is likely to result in poor bandwidth utilization. Additionally, IP multicast routing algorithms, of which Protocol Independent Multicast – Sparse Mode (PIM-SM) is the most common, are not designed to build optimal routing trees. PIM-SM builds trees rooted either at the source of the multicast group, or at a predetermined rendezvous point (RP) for the group. Optimal tree building is equivalent to solving the Steiner Tree [8] problem. For arbitrary graphs, which is how traditional IP networks are modeled, the Steiner Tree problem is known to be NP-complete. In structured graphs like those found in data center networks, however, it is possible to build Steiner Trees in polynomial time for some topologies. Subsequently, it is possible to build optimal or near-optimal routing trees.

The rapid emergence of SDN, which has strong industry backing, provides the perfect opportunity for innovating multicast in data centers to address the aforementioned concerns. The SDN architecture uses a centralized control plane that enables centralized admission control and policy enforcement, thereby alleviating security concerns. It also provides global visibility, as opposed to localized switch-level visibility in traditional IP networks, thereby enabling greater intelligence in network algorithms. Multicast routing algorithms can thus leverage topology information to build optimal routing trees, and can leverage link utilization state to efficiently exploit path diversity typically available in data centers. Lastly, it is important to note that not all commodity switches used in data center networks have IP multicast support. SDN can be leveraged to enable multicast in such commodity switches.

In this context, I present VSRT—an SDN-based system that enables multicast in commodity switches used in data centers. VSRT leverages the centralized visibility and control of SDN to realize secure, bandwidth-efficient multicast in switches that do not have any inbuilt IP multicast support. VSRT, like IP multicast, supports dynamic joins of multicast members. However, unlike IP multicast, it is also able to deny admission to a member based on predefined policies. As part of this system, I develop VSRT, a new multicast routing algorithm. VSRT attempts to minimize the size of the routing tree created for each group. Whenever a new member joins a group, VSRT attempts to attach it to the existing tree at the nearest attachment point. For typical data center topologies such as Tree and FatTree, VSRT reduces to an optimal multicast routing (Steiner Tree building) algorithm that can be executed in polynomial time. In this paper, I make the following contributions:

  • Detailed design of VSRT
  • Implementation of VSRT as an OpenFlow controller module
  • Emulation of VSRT for application performance benchmarking
  • Design and implementation of a simulator to evaluate VSRT at scale

The rest of this paper is organized as follows. Section 2 discusses the motivation behind developing SDN-based multicast for data centers. Section 3 reviews related work and background. In section 4, I present a detailed design and implementation of VSRT. Section 5 describes my experiments and presents both emulation and simulation results, and I end with a conclusion. For the interested reader, an appendix provides my proof showing that the Steiner Tree can be computed in polynomial time for Tree and FatTree topologies.

2. Motivation

Multicast can greatly benefit modern data centers by saving network bandwidth and improving application throughput for group communications. Because IP multicast has been in existence for several years, it is logical to consider the adoption of IP multicast in data centers. However, as outlined in section 1, there are still unresolved concerns about security, path diversity utilization, and routing tree formation that make the adoption of IP multicast prohibitive. In this work, I identify SDN as the architecture that is capable of addressing these concerns, as detailed below.

2.1 Security
SDN uses a centralized control plane. In an SDN network, when a new member sends a request to join a multicast group, the request is forwarded to the control plane. The control plane can either admit this new member and appropriately modify forwarding rules in switches, or deny admission to the member based on predefined policies. In this manner, SDN-based multicast can enable centralized admission control and policy enforcement, thereby alleviating security concerns.

2.2 Path Diversity Utilization
In data center topologies with path diversity, there are multiple, often equal-length, paths between any given hosts. Ideally, for efficient bandwidth utilization, different multicast trees should be spread out across different paths. Traditional IP networks build multicast trees based on localized switch-level views stored in the form of address-based routing table entries, as explained in section 2.3 (Routing Tree Formation). This results in many of the same links being used for different trees, while at the same time leaving many links unused. SDN, on the other hand, can leverage global visibility to take advantage of path diversity and make different multicast groups use different routing trees. This leads to more even distribution of traffic across all links and avoids congestion or oversubscription of links.

2.3 Routing Tree Formation

PIM-SM, the most common IP multicast routing protocol, builds a multicast routing tree by choosing a node in the network as the RP for each group and connecting all group members to this RP. PIM-SM relies on IP unicast routing tables, which are based on localized switch-level views, to find a path from the member to the RP. This results in RP-rooted trees that can be nonoptimal. PIM-SM, for high data rates, provides the option for each member to directly connect to the source. In such a case, instead of RP-rooted trees, there is a combination of source-rooted and RP-rooted trees. This is still likely to be nonoptimal, because each member is routed to a specific node (RP or source) on the existing tree, as opposed to being routed to the nearest intersection on the existing tree. SDN’s global visibility, on the other hand, can be leveraged to build near-optimal routing trees. Whenever a new member joins a group, instead of finding a path from it to the source or the RP, SDN can find its nearest attachment point to the existing tree. This results in trees that use fewer hops, and in the case of topologies such as Tree and FatTree, reduces to optimal trees.

Figure 1. Motivating Example

This is explained with the help of the example in Figure 1 (a) and (b). These figures show an irregular data center topology—increasingly common in unplanned data centers—that is a combination of Tree and Jellyfish [10] topologies. It shows two multicast groups. The first group comprises Tenant 1, whose VMs reside on hosts {H1, H5, H11, H15}. The second group comprises Tenant 2, whose VMs reside on hosts {H4, H10, H14}. H8 has a suspicious tenant that wants to hack into Tenant 2. Figure 1(a) shows the outcome of using IP multicast. For each of the two multicast groups, I assume that PIM-SM chooses the core switch C as the RP. This is reasonable because C is equidistant from all members in either group. The routing trees built for the two groups are shown by the dashed and solid lines, respectively. As can be seen, reliance on unicast routing tables to connect each node to the RP leads to the same links being used for each tree.
Note that in the above discussion, I have taken for granted that the switches in question have support for IP multicast. Many commodity switches used in data centers have no off-the-shelf IP multicast support. Most importantly, VSRT enables multicast in such switches in the first place.

3. Related Work and Background

Today, the majority of Internet applications rely on point-to-point transmission. Utilization of point-to-multipoint transmission has traditionally been limited to LAN applications. Over the past few years the Internet has seen a rise in the number of new applications that rely on multicast transmission.
Mininet is a network emulator. It runs a collection of end hosts, switches, routers, and links on a single Linux kernel. It uses lightweight virtualization to make a single system look like a complete network, running the same kernel, system, and user code. A Mininet host behaves just like a real machine [8]. Cisco Packet Tracer [3] is a powerful network-simulation program that enables experimentation with network behavior. Packet Tracer provides simulation, visualization, authoring, assessment, and collaboration capabilities to facilitate the teaching and learning of complex technology concepts. I used it to perform tests with routers and create networks.

3.1 Reducing Network Load
Assume that a stock-ticker application is required to transmit packets to 100 stations within an organization’s network [2]. Unicast transmission to the group of stations will require the periodic transmission of 100 packets, and many packets might be required to traverse the same link(s). Multicast transmission is the ideal solution for this type of application, because it requires only a single packet transmission by the source, which is then replicated at forks in the multicast delivery tree. Broadcast transmission is not an effective solution for this type of application, because it affects the CPU performance of every end station that sees the packet, and it wastes bandwidth [6].

3.2 Resource Discovery
Some applications implement multicast group addresses instead of broadcasts to transmit packets to group members residing on the same network. However, there is no reason to limit the extent of a multicast transmission to a single LAN. The time-to-live (TTL) field in the IP header can be used to limit the range (or “scope”) of a multicast transmission [9].

3.3 Multicast Forwarding Algorithms
A multicast routing protocol is responsible for the construction of multicast packet-delivery trees and for performing multicast packet forwarding. Cisco Packet Tracer [3] can be used to test any such algorithm; I tried a few existing algorithms before creating the new one. This section explores a number of different algorithms that can potentially be employed by multicast routing protocols:

  • Flooding
  • Spanning trees
  • Reverse Path Broadcasting (RPB)
  • Reverse Path Multicasting (RPM)

These are a few algorithms that are implemented in the most prevalent multicast routing protocols in the Internet today:

  • Distance Vector Multicast Routing Protocol (DVMRP)
  • Multicast Open Shortest Path First (MOSPF)
  • Protocol-Independent Multicast (PIM)

3.4 RPM
RPM is an enhancement to RPB and Truncated RPB. RPM creates a delivery tree that spans only:

  • Subnetworks with group members
  • Routers and subnetworks along the shortest path to subnetworks with group members

RPM allows the source-rooted spanning tree to be pruned so that datagrams are only forwarded along branches that lead to members of the destination group [7].

3.5 Network Simulator (Ns-2)
Ns-2 is a discrete event simulator targeted at networking research [11]. Ns-2 provides substantial support for simulation of TCP, routing, and multicast protocols over wired and wireless (local and satellite) networks. Nam is a Tcl/Tk-based animation tool for viewing network simulation traces and real-world packet traces. It is mainly intended as a companion animator to the Ns simulator. REAL is a network simulator originally intended for studying the dynamic behavior of flow and congestion. Any of these tools can be used to test the VSRT algorithm.

4. Design and Implementation

VSRT is designed to achieve the following goals:

  • Efficiently utilize path diversity
  • Enforce admission control
  • Build near-optimal multicast trees
  • Enable multicast support in commodity SDN switches
  • Be easily deployable

4.1 VSRT Algorithm
VSRT is a polynomial-time algorithm that builds a routing tree by attempting to attach each new group member to the existing tree at the nearest intersection. Instead of trying to find the shortest path from this member to a specific node, as PIM-SM does, VSRT tries to find the shortest path to the existing tree. This could be trivially accomplished, in theory, by computing the shortest path from the new member to each node on the existing tree; however, that is computationally prohibitive. VSRT instead performs this attachment using a unique method that completes in polynomial time. Although in theory VSRT might not always be able to find the best attachment point for all topologies, in practice it does so with high probability for most topologies. Specifically, for Tree and FatTree topologies, it does so with probability 1. If VSRT is unable to find the optimal path, it still finds a path that is at least as short as that found by PIM-SM.

VSRT first assigns a level to all nodes in the network. This level classifies the node’s distance, in number of hops, from a physical server. Thus, all physical servers are assigned level 0, all top-of-rack switches (ToRs) are assigned level 1, and so on.

While creating the routing tree for a group, VSRT iterates through the group members one by one and attaches them to the tree. In this regard, the tree created is a function of the order in which members appear. Regardless of the ordering, though, the tree created is near-optimal and at least as small as that created by PIM-SM. Optionally, after the group reaches a steady state in terms of number of subscribers, a steady-state tree can be reconstructed. The steady-state tree can be chosen as the smallest tree obtained from all possible orderings. In my system, I have not implemented steady-state tree reconstruction, because the trees created in the first attempt efficiently satisfy all design goals.
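The level assignment just described amounts to a multi-source breadth-first search from the physical servers. The following Python sketch is purely illustrative—the topo.adjacent() accessor is a hypothetical topology interface, not part of the actual implementation:

from collections import deque

def assign_levels(topo, servers):
    """Label each node with its hop distance from the nearest
    physical server: servers are level 0, ToRs level 1, and so on."""
    level = {s: 0 for s in servers}
    queue = deque(servers)
    while queue:
        node = queue.popleft()
        for nbr in topo.adjacent(node):  # hypothetical topology API
            if nbr not in level:
                level[nbr] = level[node] + 1
                queue.append(nbr)
    return level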

Tree building begins when there are at least two members in the group. To connect the first two members, the algorithm chooses the shortest path between them. Subsequently, whenever a new member appears, the algorithm tries to find its nearest intersection with the existing tree. To do so, it first checks whether any of the member’s adjacent nodes reside on the existing tree. Thus, when a new member, which would by definition be a level 0 node, appears, all its adjacent (level 1) nodes are checked. If any of these nodes already resides on the existing tree, the new member is simply attached to the tree at this point. If none of these adjacent nodes lies on the tree, the algorithm then looks at all neighbors (level 0, level 1, and level 2) of the adjacent nodes. If any of these neighboring nodes lies on the existing tree, the algorithm attaches the new member to the tree at this point. If neither this new member’s adjacent nodes nor their neighbors lie on the existing tree, then one of the member’s adjacent nodes at the next-higher level is randomly chosen.

Note that in this case, the new member has not yet been attached to the tree, so the algorithm continues. Next, this chosen adjacent node is set as the current node. Now, its adjacent nodes (some of which would have already been examined in the previous iteration) and their neighbors are examined to see if any falls on the existing tree. If any of them does, the new member is connected to the tree at this node by tracing the path chosen from the new member onward.

Figure 2. VSRT: Routing Tree Formation

If, on the other hand, neither of them lies on the existing tree, the algorithm continues by randomly choosing one of the current node’s adjacent nodes at the next-higher level. This chosen node is now set as the current node. In this manner, the algorithm continues either until the new member is connected to the existing tree or until the level of the current node reaches the highest level in the topology. If the algorithm has already reached the highest level and has still been unable to attach the new member to the tree, then it resorts to a breadth-first search (BFS) with a stop condition that terminates as soon as an attachment point to the tree is found.

For typical data center topologies, which are characterized by rich path diversity and large numbers of edges at higher levels, it is unlikely that the algorithm would reach a highest-level node in the topology without attaching the new member to the tree. At every iteration for which the algorithm is unable to attach the member, it randomly moves to a higher level, thereby increasing its chances of finding an attachment point (owing to the larger number of edges at the higher level) in the next iteration. If the algorithm randomly selects a node that is headed in a direction away from the tree, in the next iteration a random selection once again is likely to head it back toward the tree. This approach of randomly selecting higher-level nodes, from the perspective of building routing trees for different multicast groups, also contributes to better utilization of the available path diversity, leading to more-balanced link utilizations.

If the algorithm does indeed reach the highest level without converging, which as mentioned above is unlikely, the overhead incurred from this unsuccessful search is very small, because typical data center topologies are only three to four levels high. This is computationally far less expensive than using BFS for each new member. Additionally, as demonstrated in section 5, this approach still results in smaller routing trees than PIM-SM. Specifically, for Tree and FatTree topologies, this algorithm always finds the optimal attachment point for each new member without needing to resort to BFS.
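The search procedure described above can be summarized in the following Python sketch. It is an illustrative reconstruction under stated assumptions—the topology interface and the levels map from the earlier sketch are placeholders, and flow-rule installation is omitted—not the production module itself:

import random
from collections import deque

def find_attachment_path(topo, levels, member, tree_nodes, max_level):
    """Return a path from a new member to its nearest intersection
    with the existing tree, following the VSRT search order."""
    path = [member]
    current = member
    while levels[current] < max_level:
        # Step 1: is any node adjacent to the current node on the tree?
        for adj in topo.adjacent(current):
            if adj in tree_nodes:
                return path + [adj]
        # Step 2: is any neighbor of those adjacent nodes on the tree?
        for adj in topo.adjacent(current):
            for nbr in topo.adjacent(adj):
                if nbr in tree_nodes:
                    return path + [adj, nbr]
        # Step 3: randomly move to an adjacent node one level up and retry.
        uplinks = [a for a in topo.adjacent(current)
                   if levels[a] == levels[current] + 1]
        if not uplinks:
            break
        current = random.choice(uplinks)
        path.append(current)
    # Fallback: breadth-first search that stops at the first tree node.
    parent = {member: None}
    queue = deque([member])
    while queue:
        node = queue.popleft()
        if node in tree_nodes:
            hops = []
            while node is not None:
                hops.append(node)
                node = parent[node]
            return hops[::-1]
        for nbr in topo.adjacent(node):
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    return None  # no route from the member to the existing tree

For the first two members of a group, the tree is simply the shortest path between them; every later member is attached along the path returned by this search. In Tree and FatTree topologies, the random upward walk of step 3 always converges before the fallback is needed, which is why the algorithm finds the optimal attachment point there.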

I explain VSRT with the help of an example in Figure 2. The example demonstrates how a tree is constructed as new members join a multicast group. Initially, there is one sender S and one receiver R1. The tree is constructed by choosing the shortest path from R1 to S. Subsequently, a receiver R2 also subscribes to the multicast group. None of this receiver’s adjacent nodes are on the tree, nor are the neighbors of these adjacent nodes. It has only one adjacent node, a level 1 node, which is therefore chosen by default as the node that will lead this member to the tree. Next, setting this level 1 node as the current node, VSRT looks at all its adjacent nodes as well as at their neighbors. Again, this level 1 node has only one neighboring level 2 node, so it is chosen by default. Now this level 2 node becomes the current node, and VSRT looks at its adjacent nodes. None of its adjacent nodes are on the tree. However, at least one of the neighbors of one of these adjacent nodes is on the tree. This adjacent node is the level 2 node located horizontally to the left of the current (level 2) node. Thus, VSRT selects this adjacent node. Finally, VSRT attaches the new member to the existing tree at this adjacent node’s neighbor (the level 3 node marked by a *). Lastly, a third receiver R3 arrives. In the first iteration for R3, VSRT chooses the level 1 node immediately adjacent to it because there is no other choice. In the next iteration, with this level 1 node set as the current node, VSRT first looks at its adjacent nodes. As it turns out, one of its adjacent nodes (the level 2 node marked by a *) is on the tree, so R3 attaches to the tree at this node.

4.2 VSRT System Implementation

I implement VSRT as an OpenFlow controller module, using the OpenDaylight SDN platform, as outlined in Figure 3. The VSRT module listens for subscription requests and topology changes from the network, and dynamically updates the appropriate multicast routing trees. It registers with the IListenDataPacket service of OpenDaylight to indicate that it wants to receive all data packets sent to the IDataPacketService of the controller.

In my implementation, I adopt the IP multicast addressing scheme, so hosts still send subscription requests through Internet Group Management Protocol (IGMP) packets. VSRT implements IGMP snooping to learn about hosts that want to subscribe to or unsubscribe from a multicast group. On receiving an IGMP packet, VSRT finds the address of the host as well as the multicast group it wants to join or leave. Subsequently, it examines security policies to ensure that this member can be admitted and, if so, updates the multicast routing tree using the VSRT algorithm. When a multicast group sender that has not yet subscribed to the group (because senders do not send IGMP packets) starts sending multicast traffic, the controller is notified. Subsequently, VSRT automatically adds the sender to the multicast tree, once again assuming policies permit this.

VSRT also appropriately modifies routing trees whenever a topology change is registered from the ITopologyManager in OpenDaylight. Any time VSRT needs to update the routing tree, it effectively must add, delete, or modify routing rules in appropriate switches. This is done through the IForwardingRulesManager. The OpenDaylight controller’s service abstraction layer uses a southbound plug-in to communicate with network elements. Currently, OpenDaylight has only one southbound plug-in, which supports the OpenFlow v1.0 protocol. VSRT can be completely implemented using the features provided by OpenFlow v1.0, and it can work with higher versions of OpenFlow as well.

Figure 3. Architecture of the VSRT OpenDaylight Module


5. Results

5.1 Emulator
To validate and evaluate my implementation of VSRT, I used Mininet Hi-Fi [11], an OpenFlow v1.0 enabled network emulation platform. I created a Mininet network topology and connected it to the OpenDaylight controller. For this emulation, I chose a FatTree topology, as shown in Figure 4. FatTree is a common data center topology that has sufficient path diversity to highlight the benefits of VSRT. My topology comprises 24 hosts, 6 ToR switches, 6 aggregation switches, and 2 core switches. The link capacity for each link in this network is set to 10Mbps. For performance benchmarking, I use Iperf [12].


Throughout this section, host hx refers to the host with IP address 10.0.0.x in Figure 4. First, I seek to validate the functionality of VSRT by ensuring that it is able to enable multicast in a Mininet network. The OVS switches used by Mininet do not have any out-of-the-box multicast support, as demonstrated in Figure 5. A multicast group is created with hosts h1, h4, and h7. An Iperf client (sender) is started on h1, while Iperf servers (receivers) are started on hosts h4 and h7. As is evident, the servers do not receive any multicast traffic. Next, I include VSRT in the OpenDaylight controller and restart the Iperf client and servers. As shown in Figure 5, multicast has now been enabled. Next, I seek to evaluate VSRT by comparing its Iperf performance with that of IP multicast. Once again, I point out that I adopt the addressing scheme of IP multicast: the Iperf application is written to use the IP addressing scheme, and I merely leverage that for ease of implementation. Using this addressing scheme for VSRT has no bearing on the results; the VSRT system is completely independent of and separate from IP multicast. Because IP multicast is not supported out of the box with OVS switches, I also implement an adaptation of IP multicast for my software-defined environment. Although I still leverage the OpenDaylight controller to create IP multicast routing trees by installing appropriate forwarding rules in switches, there are two important differences in the implementation of IP multicast compared to VSRT, which ensure that my implementation mirrors traditional IP multicast:

First, IP multicast does not leverage the central visibility available to the OpenDaylight controller; it relies on localized switch-level views. Second, IP multicast uses PIM-SM for routing. I tried to incorporate XORP, a routing engine that implements PIM-SM, along with RouteFlow, a service that facilitates communication between the controller and the routing engine. However, the current implementation of RouteFlow is incapable of converting the multicast routing tree generated by XORP into corresponding OpenFlow rules; hence, I implemented PIM-SM myself in my system. For the Iperf performance comparison, I create six random multicast groups of sizes varying from three to six multicast members. The sender in each group uses Iperf in client mode to send multicast traffic at a rate chosen randomly from {2, 4, 6, 8} Mbps. The remaining members of the group use Iperf in UDP server mode to bind to the multicast group. The packet loss percentages associated with VSRT and IP multicast are shown in Figure 7, and data rates are shown in Figure 8. The results from my emulation show that, compared to IP multicast, VSRT increases throughput by up to 12% and reduces packet loss by 51% on average.

Figure 4. Mininet Topology Used for Emulating VSRT


5.2 Simulator
To evaluate the performance of VSRT, I built a multicast simulator that comprises the following modules:

  • Topology Generation
  • VM Placement
  • Multicast Group Generation
  • VSRT
  • IP Multicast
Figure 5. Average Packet Loss Percent


Figure 6. Average Transfer Rate


The simulator first creates a network topology based on user input specifying the total number of servers, the number of servers per rack, and the type of topology. The topology generation module is currently capable of generating two types of topologies: Tree and FatTree. For either type, the topology generator determines the number of switches and arranges them appropriately by creating the necessary host-switch and switch-switch edges. Alternatively, a topology other than Tree or FatTree can be supplied as a file to the simulator, bypassing the topology generation module. Next, the simulator prompts the user to specify the number of VMs running in the data center. Each VM is mapped randomly onto one of the servers. All communication is assumed to be between VMs. Next, the simulator asks the user to specify the number of multicast groups that need to be routed. The simulator also allows the user to supply the member VMs for each multicast group, along with the group’s associated data rate. If member VMs are not supplied, the simulator randomly chooses between 3 and 20 VMs as the members for each group. It also randomly assigns a data rate (in hundreds of kbps) to each group’s traffic. The simulator implements both VSRT and IP multicast. For VSRT, the simulator assumes an SDN environment with centralized visibility into the network and uses VSRT to build multicast trees. For IP multicast, the simulator assumes localized views derived from routing tables stored in switches and uses PIM-SM to build multicast trees. In my simulations, I create both Tree and FatTree topologies. For each topology, I specify 11,520 servers assembled into 40 servers per rack, resulting in 288 racks. This distribution of servers is adopted from PortLand [13]. The simulator determines the number of switches required as 288 top-of-rack (ToR) switches, 24 aggregation switches, and 1 (for Tree) or 8 (for FatTree) core switches. I specify the number of VMs as 100,000, and the simulator randomly places each VM on one of the 11,520 servers. Because there is no real data trace available for data center multicast, I let the simulator generate the multicast groups automatically, applying the methodology described in [17]. Additionally, the simulator assigns each multicast group a data rate chosen randomly from the range 100kbps to 10Mbps. Here, I present results from my simulation runs on the FatTree topology only; the results from the Tree topology were similar. For the first simulation run, I set the capacity of each link to 1Gbps and create 1,000 multicast groups. Figure 5 shows the CDF of link utilizations in the network. The following observations are made from this plot:

  • The percentage of unutilized links is 0% in VSRT, while in IP multicast it is 16%.

  • The percentage of links that have less than 5% utilization in VSRT is 49%, while in IP multicast it is 65%.
  • The maximum link utilization in VSRT is 73%, while that in IP multicast is 301%; IP multicast has 1.5% of links with utilization greater than 100%.

The above observations establish that VSRT takes better advantage of the available bandwidth in the network. As the number of multicast groups increases, the inability of traditional IP multicast to efficiently utilize path diversity is magnified even further. For the rest of the simulation runs, I vary the number of multicast groups from 100 to 10,000 and increase link capacity from 1Gbps to 10Gbps to accommodate this large number of multicast groups.

6. Conclusion

Reliability, scalability, and security concerns have resulted in IP multicast’s poor adoption in traditional networks. Data center networks, with their structured topologies and tighter control, present an opportunity to address these concerns. However, they also introduce new design challenges, such as path diversity utilization and optimal tree formation, that are not critical in traditional networks like the Internet. In this paper, I presented VSRT, an SDN-based system for enabling multicast in data centers. VSRT leverages the global visibility and centralized control of SDN to create secure and bandwidth-efficient multicast. VSRT implements its own routing algorithm, which creates optimal routing trees for common data center topologies. My implementation of VSRT as an OpenFlow controller module validates its deployability, and my simulation establishes its scalability. I am currently working on incorporating reliability into VSRT and exploring the adoption of reliability protocols such as PGM [18]. I am also working on porting common group communication applications, such as OS image synchronization and Hadoop, to use VSRT.

Appendix: Steiner Tree in Polynomial Time for FatTree

Figure 7. Steiner Tree Building


Steiner Tree Problem

Given a connected undirected graph G = (V, E) and a set of vertices N, find a subtree T of G such that each vertex of N is on T and the total length of T is as small as possible. The Steiner Tree problem for an arbitrary graph is NP-complete. In this section, I prove that the Steiner Tree problem for FatTree topologies can be solved in polynomial time. Figure 7 shows a FatTree graph with the set of vertices that need to be connected, N, indicated by the black tiles. My proof strategy consists of two steps:

1. Build a tree (in polynomial time) that connects all vertices in N.
2. Show that the tree thus constructed is the Steiner tree.

Owing to symmetry in FatTree graphs, a given cluster of X nodes at level L connects identically to a given cluster of Y nodes at level (L+1). In other words, there is a mapping X(L) ↔ Y(L+1). Specifically, in the graph shown in Figure 7, clusters of 10 nodes (hosts) at level 0 connect to 1 node (ToR) at level 1, clusters of 8 nodes (ToRs) at level 1 connect to clusters of 2 nodes (aggregations) at level 2, and clusters of 4 nodes (aggregations) at level 2 connect to clusters of 2 nodes (cores) at level 3. For each level L cluster X, one specific node from its corresponding level (L+1) cluster Y is chosen as the designated node for all group traffic to or from X. The relative orientation of the designated node in Y(L+1) with respect to cluster X(L) is kept identical across every mapping X(L) ↔ Y(L+1). This is shown in Figure 7 with the help of red dots. For every vertex in N, which is a level 0 node, the only choice of designated node is its level 1 neighbor. The designated level 1 nodes for those level 0 node clusters that have at least one multicast group member are shown in red. Next, for each level 1 cluster (there are 4 clusters of level 1 nodes with 8 nodes in each), the first (left) of the two level 2 nodes is chosen as the designated node. Finally, for each level 2 node thus chosen, the second (right) level 3 node is chosen as the designated node. The choice of designated node doesn’t matter as long as the relative orientation of each level (L+1) designated node with respect to its child level L cluster is the same throughout the topology. It can be seen that the tree created by joining all designated nodes connects all group members to one another and is thus a multicast routing tree. This tree can be constructed in polynomial time, because each new member can be connected to the existing tree in bounded time (it is just a matter of following designated nodes from it until the tree is reached) and the number of members itself is bounded. Also, it can be seen that the tree thus created connects any given pair of nodes in N by the shortest path between them. Therefore, it is the Steiner tree. Hence, a Steiner tree can be created in polynomial time for FatTree topologies. Because Tree is a special case of FatTree, Steiner trees can be created in polynomial time for Tree topologies by corollary.
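
The designated-node construction above is mechanical enough to sketch in a few lines of code. The following Python fragment is a minimal illustration of the proof’s first step, not production code: designated_parent stands in for the fixed relative orientation of designated nodes, and the tiny topology in the usage example is hypothetical.

# Build the multicast tree by following designated parents upward from
# each member until the existing tree (or the top level) is reached.
def build_tree(designated_parent, members):
    """designated_parent: dict mapping a node to its level-(L+1) designated
    node, or None at the top level. Returns the set of tree edges."""
    tree_nodes, edges = set(), set()
    for m in members:
        node = m
        # Each step is bounded, so the whole construction is polynomial.
        while node not in tree_nodes and designated_parent.get(node):
            up = designated_parent[node]
            edges.add((node, up))
            tree_nodes.add(node)
            node = up
        tree_nodes.add(node)
    return edges

# Usage: two hosts under different ToRs sharing one aggregation switch.
parents = {'h1': 'tor1', 'h2': 'tor2', 'tor1': 'agg1', 'tor2': 'agg1',
           'agg1': None}
print(sorted(build_tree(parents, ['h1', 'h2'])))
# [('h1', 'tor1'), ('h2', 'tor2'), ('tor1', 'agg1'), ('tor2', 'agg1')]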

References

1. Hadoop. http://hadoop.apache.org/
2. R. Minnich and D. Farber, “Reducing Host Load, Network Load and Latency in a Distributed Shared Memory.”
3. Cisco Packet Tracer. https://www.netacad.com/web/about-us/cisco-packet-tracer
4. Microsoft Azure. http://www.windowsazure.com/en-us/
5. D. Basin, K. Birman, I. Keidar, and Y. Vigfusson, “Sources of instability in data center multicast,” in Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware. ACM, 2010, pp. 32–37.
6. D. Li, M. Xu, M.-c. Zhao, C. Guo, Y. Zhang, and M.-y. Wu, “RDCM: Reliable data center multicast,” in Proc. IEEE INFOCOM. IEEE, 2011, pp. 56–60.
7. D. Li, H. Cui, Y. Hu, Y. Xia, and X. Wang, “Scalable data center multicast using multi-class Bloom Filter,” in Proc. IEEE International Conference on Network Protocols (ICNP). IEEE, 2011, pp. 266–275.
8. Mininet Hi-Fi. https://github.com/mininet/Mininet
9. Resource Discovery. http://www.jisc.ac.uk/whatwedo/topics/resourcediscovery.aspx
10. A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey, “Jellyfish: Networking data centers randomly,” in Proc. of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI). USENIX Association, 2012, pp. 17–17.
11. The Network Simulator – ns-2. http://www.isi.edu/nsnam/ns/
12. Iperf. http://iperf.fr/
13. R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: a scalable fault-tolerant layer 2 data center network fabric,” in SIGCOMM Computer Communication Review, vol. 39, no. 4. ACM, 2009, pp. 39–50.

Scaling of Cloud Applications Using Machine Learning


Pradeep Padala
Distributed Resource
Management Team
VMware Inc.
padalap@vmware.com

Aashish Parikh
Distributed Resource
Management Team
VMware Inc.
aashishp@vmware.com

Anne Holler
Distributed Resource
Management Team
VMware Inc.
anne@vmware.com

Madhuri Yechuri
Distributed Resource
Management Team
VMware Inc.
myechuri@vmware.com

Lei Lu
Distributed Resource
Management Team
VMware Inc.
llei@vmware.com

Xiaoyun Zhu
Distributed Resource
Management Team
VMware Inc.
xzhu@vmware.com

Abstract

Today’s Internet applications are required to be highly scalable and available in the face of rapidly changing, unpredictable workloads. Multi-tier architecture is commonly used to build Internet applications, with different tiers providing load balancing, application logic, and persistence. The advent of cloud computing has given rise to rapid horizontal scaling of applications hosted in virtual machines (VMs) in each of the tiers. Currently, this scaling is done by monitoring system-level metrics (e.g., CPU utilization) and determining whether to scale out or in based on a threshold. These threshold-based algorithms, however, do not capture the complex interaction among multiple tiers, and determining the right set of thresholds for multiple resources to achieve a particular service level objective (SLO) is difficult.

In this paper, we present vScale, a horizontal scaling system that can automatically scale the number of VMs in a tier to meet end-to-end application SLOs. vScale uses reinforcement learning (RL) to learn the behavior of the multi-tier application while automatically adapting to changes. We provide an RL formulation of the autoscaling problem and design a solution based on Q-learning. Our learning algorithm is also augmented with heuristics to improve responsiveness and guide the learning. A vScale prototype is implemented in Java and is evaluated on a VMware vSphere® test bed. We tested vScale by replaying traces from the 1998 FIFA World Cup (World Cup ’98) to simulate production workloads. Our experiments indicate that vScale learns quickly, adapts to changing workloads, and outperforms the RightScale auto-scaling algorithm.

1. Introduction

VMware customers have complex applications with multiple tiers, and meeting end-to-end SLOs is of paramount importance. These applications are required to be highly scalable and available in the face of dynamic workloads. To support the requirements of these applications, our customers often use a multi-tier architecture in which individual tiers can be scaled independently. Although vSphere and VMware cloud products allow these tiers to be hosted in VMs that can be scaled, automation is missing. Currently, cloud providers such as Amazon EC2 offer auto-scaling services for a group of VMs, which are scaled based on specific conditions (e.g., average CPU usage > 70%) set by the user. However, auto-scaling based on individual VM resource usage is insufficient for scaling multi-tier and stateful applications.

Challenges include

  • Choosing the right thresholds – Choosing thresholds for multiple resource metrics used in mechanisms such as EC2 CloudWatch [1] is not easy. Often, cloud administrators set thresholds based on ad-hoc measurements and past experience, wasting resources.
  • Meeting end-to-end application SLOs – Internet applications have strict SLOs that must be met for a good user experience, which might directly affect company revenues. Converting end-to-end application SLOs to the right number of VMs in each tier is not trivial.
  • Interaction among multiple tiers – Applications that involve multiple tiers often have complicated dependencies that cannot be easily captured. Setting thresholds or scaling parameters for a specific tier independently can be counterproductive [12]. It is also possible for the bottlenecks to oscillate among the tiers due to incorrect scaling.
  • Efficient usage of resources – Finding the right number of resources to achieve a particular SLO is essential because of the costs involved in running VMs in a cloud. Another challenge is optimizing resource usage across multiple resources.
  • Stateful applications – Applications with persistent state often exhibit data access spikes due to the increase in the intensity of requests hitting the persistence layer and changes in the object popularity [8]. Scaling persistence layers is challenging because of the latency involved in data copying and distribution.

RightScale provides a mechanism [4] for automated scaling by allowing each node to vote for scaling up or down based on thresholds set on system metrics. The votes are aggregated and another threshold is used to determine the final scaling action. Setting the right thresholds still remains a problem in this approach. Academic solutions [12] have been proposed to address automated scaling, but we found that the assumptions made in previous work (e.g., good predictions about workloads, deep application instrumentation, and full view of the system) do not hold true in real systems.

To address these challenges, we propose a scaling system called vScale, built on top of vSphere, that automatically scales out and in (horizontal scaling) VMs in individual tiers to meet end-to-end application goals. We formulate the auto-scaling problem as an RL problem and solve the problem using the Q-learning technique [13]. A vScale prototype is implemented in Java and is evaluated using a vSphere test bed comparing it to the RightScale algorithm. We replayed World Cup ’98 traces to create different scenarios including workload variation, shift in bottlenecks, and cyclical workloads.

2. System Architecture

Figure 1. Typical Multi-Tier Application Architecture


Today’s Internet applications are hosted in a multi-tier architecture, as shown in Figure 1. The architecture includes a DNS server, load balancers, application servers, and database servers. Typically, clients connect to a URL, which is resolved to multiple load balancers in a round-robin fashion. The load balancers (also known as front-end or Web servers) serve static pages and send requests to the application tier. The application tier contains the business logic for the application. Persistence for the application is provided by the database tier. The database tier consists of a database proxy that routes requests to multiple database nodes.

We assume that each of the tiers can be independently scaled. When a new VM is added to or removed from a tier, the previous tier is updated to reflect the state of the system. The vScale system monitors and manages multi-tier applications that are represented by multiple distinct VMs. Figure 2 shows the architectural block diagram of the overall system. The client program interacts with the vScale system through a well-defined vScale API. This API can be used by products such as AppDirector and Cloud Foundry to register the topology and the applications’ VMs with vScale. The API is also used to define the SLO and, optionally, cost budgets (or resource limits) to cap the amount of auto-scaling. SLOs are specified as a goal on the expected performance (e.g., 99th-percentile latency < 500ms).

Each VM has a VMware vCenter™ Hyperic® [5] agent installed, and these agents are used to monitor the applications running in the VM. Hyperic agents can collect system-level statistics (e.g., CPU and memory utilization) and application-level metrics (e.g., latency, throughput). The collected statistics are aggregated in the Hyperic server, which vScale periodically polls to gather the data. Based on the measured values of the performance metrics and the SLO, vScale uses an RL algorithm to compute recommendations. Each recommendation is a tuple <tier; nvms; up|down>, where tier is the tier to be scaled, nvms is the number of VMs to scale, and up|down specifies whether to start more VMs or stop existing ones. The recommendations are converted into provisioning operations that can be executed by the vCenter instance.
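
For illustration, a recommendation could be represented as follows. This is a hypothetical sketch of the <tier; nvms; up|down> shape described above, not the actual vScale API.

# Illustrative shape of a vScale recommendation; names are assumptions.
from dataclasses import dataclass

@dataclass
class Recommendation:
    tier: str        # which tier to scale, e.g. "app" or "db"
    nvms: int        # how many VMs to add or remove
    direction: str   # "up" to start more VMs, "down" to stop existing VMs

rec = Recommendation(tier="app", nvms=2, direction="up")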

3. Design

We present the design of the vScale algorithm in this section.

3.1 Brief Primer on Reinforcement Learning
An RL formulation contains an intelligent agent that automatically learns from an environment. The agent interacts with the environment by applying an action and learning from the reward (positive or negative) awarded by the environment. At each time interval t, the environment provides the state st to the agent. The agent applies an action at and receives a reward rt+1, and the environment moves to state st+1. The agent chooses the action based on a “policy.” The objective of the agent is to learn the optimal policy to achieve maximum reward in the long run. For more-detailed background on RL, see [11].

3.2 Auto-Scaling Problem
The primary objective in auto-scaling is to achieve the SLO by automatically scaling VMs in and out while utilizing the fewest resources. For the multi-tier application, we define ut as the resource configuration at time t, which is a vector of total number of VMs (nvmt) and resource utilizations (vt). vt is a vector of CPU, memory, storage, and network resource utilization. For each resource, utilization is calculated as a ratio of total consumed and total configured size for all VMs.

We also define a limit nvmlimit, which determines the maximum total number of VMs the multi-tier application is allowed to consume. The application performance is represented by yt, which specifies the end-to-end performance of the multi-tier application. yt is a vector of the individual tier performances (ytiert) (e.g., MySQL tier latency). yref is the SLO for the application. We represent the state in our RL problem as a combination of the current resource configuration and the application performance: st = (ut; yt). We do not include the input workload in our formulation, because it cannot be directly observed; note, however, that the workload is indirectly represented by the application performance (yt). The actions in our RL problem are scaling VMs either in or out in a particular tier, represented by at = (tier; howmany; up|down), where howmany specifies the number of VMs to be scaled. We define the total expected return Rt as a function of the individual rewards at each future time interval, with a discount factor β. Intuitively, the discount factor β allows rewards from future intervals to be counted toward the overall return, because our goal is to maximize overall return rather than immediate reward.

Rt = rt+1 + β rt+2 + β² rt+3 + … = Σ (k = 0 to ∞) β^k rt+k+1

The reward gained by taking an action is a function of the SLO, application performance, and resource usage. If the application meets the SLO (e.g., latency < 200ms, throughput > 1000 reqs/sec), the environment awards a positive reward. However, we don’t want the application to consume too many resources (e.g., latencies far below the SLO). To penalize excessive usage of resources, the environment provides a negative reward (or penalty) when the application exceeds the SLO by a wide margin. If the application does not meet the SLO, a negative reward is provided to discourage the action taken. We use two concave functions to calculate the rewards, as explained in Algorithm 1. We compute the score by combining the application performance and the resource configuration ut. The resource configuration contains the number of VMs and resource utilizations; we use the most-constrained item in ut when computing the score.

Algorithm 1. Reward computation

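
Because Algorithm 1 appears here only as a figure, the following Python sketch is a guess at its shape from the description above: a concave positive reward while the SLO is met, a penalty for missing the SLO, and a penalty for overshooting it by a wide margin. All constants, thresholds, and names are illustrative assumptions, not vScale’s actual reward function.

import math

def reward(latency, slo, utilization):
    """latency: measured latency; slo: target latency;
    utilization: most-constrained resource utilization in [0, 1]."""
    if latency > slo:
        return -math.log1p(latency / slo)   # SLO violated: penalty
    headroom = 1.0 - latency / slo          # 0 = right at the SLO, 1 = idle
    if headroom > 0.8 and utilization < 0.2:
        return -0.5                         # far below SLO: wasted resources
    # Concave positive reward: meeting the SLO with high utilization is best.
    return math.sqrt(utilization * (1.0 - headroom))

print(reward(latency=150, slo=200, utilization=0.7))   # healthy: positive
print(reward(latency=400, slo=200, utilization=0.9))   # violated: negative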

3.3 Solution
We solve the auto-scaling problem by using Q-learning. First, we define the value of taking an action a in state s under a policy as Q(s, a), which denotes the expected return from taking action a in state s.

Q(s, a) = E[Rt | st = s, at = a]

We learn the optimal value of the action-value function Q by using Q-learning. Algorithm 2 describes our core algorithm. We first initialize the Q value for the starting state to zero. For each interval, we iteratively update the Q values per the Q-learning technique [13]. The action is chosen from a policy (described below). α is the learning factor, which can be changed to favor exploration or exploitation: larger values of α favor exploitation by weighting the Q value that has been learned so far, while smaller values weight the update toward what has just been learned by applying the action. β is the discount factor applied to rewards from later intervals.

Algorithm 2. Q-learning algorithm for the auto-scaling problem

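
The core update in Algorithm 2 can be sketched as follows, using the paper’s convention that larger α keeps more of the learned Q value and β discounts future rewards. The dict-based Q table and all names are assumptions, not vScale internals.

from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> expected return

def q_update(s, a, r, s_next, actions, alpha=0.7, beta=0.9):
    best_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
    # Keep α of what was already learned; blend in (1 - α) of the new sample.
    Q[(s, a)] = alpha * Q[(s, a)] + (1 - alpha) * (r + beta * best_next)

# One interval: in state s0, scaling the app tier up by one VM earned r = 1.0.
actions = [("app", 1, "up"), ("app", 1, "down")]
q_update("s0", ("app", 1, "up"), 1.0, "s1", actions)
print(Q[("s0", ("app", 1, "up"))])  # ≈ 0.3 after one update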

3.4 Challenges in Applying Q-Learning
Before we discuss the policy for choosing the action in Q-learning, there are practical difficulties to consider in applying the Q-learning algorithm. In typical Q-learning algorithms, an ε-greedy policy is used to determine the next action: the action with the best Q value is chosen with probability 1 – ε, and a random action is chosen with probability ε. The ε value can be increased or decreased to give preference to exploration vs. exploitation. However, the auto-scaling problem has a large state space, making it difficult to find an optimal action. Another challenge is the time taken to provision a new VM, which can be on the order of minutes. As a result, if we apply Algorithm 2 directly, it will take a long time to converge and we will not be able to adapt to changes quickly.

To avoid these problems, we define a few heuristics to bootstrap Q-learning and speed up learning. We specifically list the heuristics for SLOs specified as latency < SLO, but similar heuristics are applicable to throughput as well:

  • No-change policy – Stay unchanged when the latency for all requests is between (1 – γ) * SLO and γ * SLO. This policy ensures that we do not aggressively scale up or down for slight variations from the SLO. γ can be changed to control how aggressively we stay close to the SLO.
  • Scale-up policy – Scale up when the average latency for the top one percentile of requests is above γ * SLO. Choose the tier with:
    • The most increasing errors within the history window W (e.g., the last 32 actions)
    • The greatest latency increase during the history window W
    • The highest per-tier maximum latency
  • Scale-down policy – Scale down when the average latency for the 99th percentile of requests is below (1 – γ) * SLO. Choose the tier with:
    • The greatest latency decrease during the history window W
    • The lowest per-tier maximum latency

Finally, we define our policy as a combination of learning and heuristics as follows:

ε = (# of actions explored so far) / (# of possible actions from s);
With probability 1 - ε: use heuristics;
With probability ε: use the Q table to find action a;

When we start the vScale system or when the system behavior changes (due to workload changes), heuristics dominate the vScale operation. As vScale learns from the actions taken due to heuristics, the number of explored actions increases and learning becomes more predominant.
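
A minimal sketch of this combined policy follows; heuristic_action and explored are placeholders for vScale internals, not actual interfaces.

import random

def choose_action(s, possible_actions, explored, Q, heuristic_action):
    # ε grows as more actions are explored, so heuristics dominate early
    # and the learned Q table takes over later.
    epsilon = len(explored) / max(len(possible_actions), 1)
    if random.random() >= epsilon:
        return heuristic_action(s)                   # probability 1 - ε
    return max(possible_actions,                     # probability ε
               key=lambda a: Q.get((s, a), 0.0))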

4. Experimental Test Bed

Our test bed consists of multiple enterprise-class servers, each connected to multiple datastores backed by an enterprise-class SAN. Each node is an HP ProLiant BL465c G7 with two 12-core 2.099GHz processors, 128GB RAM, and two Gigabit Ethernet cards. The SAN has 10 non-SSD datastores of 1TB each. The nodes run VMware ESXi™ 5.1.0 managed by vCenter 5.1.0, and we used Fedora 14 Linux images as guest VMs. The guest VM image contained all the necessary applications for running multi-tier applications. We used named as our DNS server, nginx as the Web server and reverse proxy, JBoss as the application tier, and Percona MySQL Server as the database tier. The guest image also contained Hyperic agents to collect application-level monitoring information. We used RUBiS [6], an online auction site benchmark, as our primary benchmark application, and a trace to replay the behavior of a production system. RUBiS allows a specified number of client threads to simulate multiple concurrent connections; we modified it to vary the number of client threads so that the traces could be replayed following the observed workload pattern.

5. Evaluation Results

In this section, we present experimental results validating vScale behavior. We designed experiments to specifically test the following aspects of vScale:

  • Automatically detect the bottlenecked tier and scale the VMs to meet the SLO.
  • Quickly achieve the SLO.
  • Use minimum resources to achieve the SLO.

5.1 World Cup Trace
In this experiment, we used RUBiS as the multi-tier application and the World Cup ’98 trace. A detailed characterization of the trace is available at [7]. The World Cup trace contains 92 days, and we mapped each day to 5 minutes. To match the workload pattern, we mapped the number of daily access requests to the number of concurrent client threads. The length of the resulting workload is approximately 18000s, with workload intensity varying over time. We ran vScale with the SLO “latency of 99th-percentile requests < 5 secs.” Figure 3(a) shows the performance of the application over a period of time. The primary y-axis shows the total number of requests and the requests below 200ms. The secondary y-axis on the right side shows the number of requests above the SLO. The number of requests below 5 seconds is 99.94%, thereby achieving the SLO. Figure 3(b) shows the scaling of VMs in each tier. The scaling matches the spikes in the workload, and scaling of different tiers is performed according to the intensity of the workload at each tier.

Figure 3. Performance of vScale Under Single Workload



5.2 Learning with Repetitive Workloads

To evaluate vScale’s learning, we created a repetitive workload pattern that contains four repetitions of the World Cup trace. Figure 4(a) shows the performance of the application. We can see that by learning through the reward/punishment of historical autoscaling actions, vScale gradually evolves its policy throughout the four repetitive workloads. The benefit of such evolution is twofold: First, by learning from history, vScale avoids potentially inappropriate decisions, so that the right number of resources is allocated to the application quickly; second, by avoiding unnecessary scaling actions, vScale saves resources. To further demonstrate the benefit of learning, we reran the experiment after disabling the heuristics. Figure 4(b) shows the results. vScale can still generate a satisfactory scaling policy without heuristics, albeit more slowly.

Figure 4. vScale Performance Under Repetitive Workloads


We compared vScale’s performance with the RightScale algorithm [4]. We set the thresholds for each tier based on the resource utilization observed when latency is above the SLO. Results are shown in Figure 4(c). As can be seen from the figure, the RightScale algorithm has more bad requests than vScale both with and without heuristics. The algorithm also does not learn from the workload pattern and produces the same scaling behavior for all repetitions of the workload. Table 1 shows the comparison in more detail. The SLO time specifies the total number of minutes during which the application meets the SLO for all requests in that minute; vScale performs 77% better than RightScale in this regard. We also measured resource usage (# of VMs * # of minutes, similar to Amazon EC2’s cost policy) and found that vScale uses 20% more resources, but in exchange for the much larger 77% improvement in SLO time. Because RUBiS is a closed-loop workload, vScale also achieves much higher throughput than RightScale, serving 34% more requests.

Table 1. Comparison of vScale and RightScale

6. Conclusion

Solving the problem of automated scaling of multi-tier applications is important for achieving efficient resource usage and low costs while meeting SLOs. Current state-of-the-art approaches use threshold-based mechanisms, whereby scaling is performed based on a threshold set for the system metrics. These approaches are insufficient for achieving end-to-end SLOs, due to the difficulty involved in setting the thresholds correctly. In this paper, we formulated the auto-scaling problem as an RL problem and designed a solution based on Q-learning. A Java prototype is implemented and is evaluated using a local vSphere test bed. Our experiments with replaying the World Cup ’98 trace to reproduce various dynamic workloads indicate that vScale outperforms the RightScale algorithm.

7. Acknowledgments

We would like to acknowledge Jeff Jiang, Venkatanathan Varadarajan, Dimitris Skourtis, and Abhishek Samanta, who worked on this project during their internships at VMware.

References

1. Amazon CloudWatch.
2. Decision Tree Learning.
3. Galera Cluster for MySQL.
4. Set up autoscaling using voting tags.
5. VMware vCenter Hyperic. http://www.vmware.com/products/vcenter-hyperic
6. C. Amza, A. Chanda, A. L. Cox, S. Elnikety, R. Gil, K. Rajamani, E. Cecchet, and J. Marguerite. Specification and implementation of dynamic Web site benchmarks. In Proc. of IEEE Workshop on Workload Characterization, Oct. 2002.
7. M. Arlitt and T. Jin. Workload characterization of the 1998 World Cup Web site. Technical Report HPL-1999-35R1, Hewlett-Packard Laboratories, Feb. 1, 1999.
8. P. Bodik, A. Fox, M. J. Franklin, M. I. Jordan, and D. A. Patterson. Characterizing, modeling, and generating workload spikes for stateful services. In SoCC, pp. 241–252, 2010.
9. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
10. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
11. B. Urgaonkar, G. Pacifici, P. J. Shenoy, M. Spreitzer, and A. N. Tantawi. An analytical model for multi-tier internet services and its applications. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems, pp. 291–302, June 2005.
12. C. Watkins. Learning from delayed rewards. PhD thesis, Cambridge University, Psychology Dept., 1989.

Scaling View Management Operations: Analysis and Optimizations


Oswald Chen
VMware Inc.
ochen@vmware.com

Michael Pate
VMware Inc.
mpate@vmware.com

Banit Agrawal
VMware Inc.
banit@vmware.com

Michael Spradlin
VMware Inc.
mspradlin@vmware.com

Soumya Mishra
VMware Inc.
soumyamishra@vmware.com

Kenny To*
VMware Inc.

Dhiraj Parashar
VMware Inc.
dparashar@vmware.com

Abstract

Scaling VMware View is not an easy task because it involves many parts and dependencies. These include many View Clients connecting over a LAN or WAN to a View Security Server to set up a secure tunnel with the View Connection Server that manages desktop virtual machines with the View Agent installed. All of these components, most importantly, the desktop virtual machines, run on VMware vSphere infrastructure which in turn depends on hardware, storage, and network capabilities. Any of these components could be a bottleneck when scaling to thousands of users.

This paper describes the methodical process we used to learn about the system and make improvements. We designed test cases with metrics relevant to the customer by looking at user stories, added features to existing test tools for more realistic simulations, implemented profiling and reporting infrastructure to aid analysis, ran tests on hardware representative of real world deployments, analyzed the resulting data, and finally made improvements in both code and documentation as part of the VMware Horizon View 5.2 release.

1. Introduction

VMware View is an enterprise solution for replacing traditional physical desktops with virtual Windows desktops. The virtual desktops are deployed in the data center and managed as a service, which enables companies to increase security, ease desktop management, and decrease operational costs while end users are able to access their desktops using tablets, smartphones, thin clients and personal computers from the office, at home, or while traveling.

There are two categories of people dealing with View: the administrators managing View and the end users interacting with their desktops. Administrators use the View Connection Server, a connection and management server, to provision desktops as virtual machines. The View Connection Server itself uses VMware vSphere infrastructure to provision these virtual machines from a master image. End users authenticate to the View Connection server to get access details for their virtual desktop and interact with it over a remote display protocol. Over time, administrators also use the View Connection Server to perform management operations such as recomposing desktop images to apply system updates, provisioning additional desktops to accommodate new users, and migrating desktops between data stores.

As customers adopt View, they are expecting to support higher and higher numbers of seats. To achieve this, the entire solution must perform well and maintain stability, reliability, and serviceability while scaling the deployment size. End users expect to log on and access their desktops quickly, and use applications on them with responsiveness similar to their experience on physical desktops. Administrators need to manage the desktop virtual machines and infrastructure backing them such as desktop images, resource pools, and data stores, all while staying within a certain maintenance window and easily troubleshooting issues when they occur. In this paper, we show how we ensure these needs are met by running actual and simulated View deployments at scale and making product or documentation improvements where necessary.

A deployment of the View solution consists of several View components, vSphere components, storage, networking, and other infrastructure components required for the desktops. Hence, making the solution scale and verifying the success is a challenge, as any one component can fail or be a bottleneck. For example, when the end user tries to connect to their desktop, the View Client first goes through a LAN, WAN, or mobile network to the View Security Server, which resides in a DMZ and establishes secure, authenticated channels with the rest of the solution. Then the View Client creates such a channel to the View Connection Server to request desktop access details. To satisfy a desktop connection request, a powered-off desktop virtual machine must first be started on demand, an operation that is sensitive to current storage I/O load. As the desktop starts, it acquires an IP address through DHCP, synchronizes with Active Directory, and starts the View Agent, a service running on the desktop to assist with management. Once the View Agent contacts the View Connection Server, it starts a Windows session. Then the View Client can finally connect, over another secure channel through the View Security Server, to the desktop.
* Kenny was a member of the View Large Scale group when he was with VMware, Inc.

We approached this scalability challenge methodically by going through repeated iterations of measurement, analysis, and improvement. We needed to find the appropriate measurements in large scale deployments to ensure that they’re applicable to our customers in the real world. Therefore we used user stories to drive test case selection and test on real hardware. For example, we determined that at the start of a work day or shift, users typically all log on at nearly the same time except for a small percentage who are early or late. This is modeled well as a normal distribution over a time window, so we built tools to simulate the varying frequency of log on attempts when driving test clients.

VMware View Planner [1] is a View capacity planning tool including a test harness, which we adapted for our tests. To enable automatic collection of measurements, we started a profiling framework for logging the amount of time spent in sensitive code paths. Then we added tools to View Planner for collecting logs and other artifacts generated during a test run. After collecting everything, we fed it into the reporting infrastructure we built, which consumes profiling data from the logs to compute basic aggregate statistics, generate plots, and present them as reports through a web interface. The reports contain the data we need to identify potential product, deployment, or configuration issues that cause performance degradation or decreased reliability. Once we identified the issues, we started making improvements.
Through our efforts, we have accomplished the following:

  • Improved provisioning speed by a factor of 2.
  • Improved data store rebalancing speed by a factor of 4-10.
  • Added product support for larger desktop pools by allowing them to span multiple networks.
  • Added a product feature to perform mass maintenance operations in a rolling fashion, allowing users to continue using their desktops.
  • Added reusable features to View Planner for automatic log collection and simulating normal distributions.
  • Started an extensible profiling framework for View and infrastructure to generate reports from its output.

In this paper we will describe the test cases we chose, the tools and test beds we created, our findings from testing, the improvements we have made to View, and future work.

2. Analysis and Scope Optimization

2.1 Resource Dedication and Team Empowerment
One of the challenges that View faced in scaling up was securing dedicated resources, from both human and hardware perspectives. Traditionally, View test beds and engineers were dedicated to a small set of specific test cases during the product release cycle; when the target release shipped, these resources moved on to other areas. The View Scale team was founded to answer that challenge. It consisted of dedicated human and hardware resources that spanned beyond regular release cadences and were responsible for scalability challenges in general.

The View Scale team had dedicated storage/compute/network resources capable of scaling up to 10K View desktops. Its members included a product manager, a View architect, a system architect, and software engineers from the View R&D, View System Test, and Performance teams. This composition gave the team a wide range of domain expertise and authority, which in effect empowered it to make decisions effectively and with a high level of agility.

2.2 User-Story-Driven Test Cases and Scalability Targets

The team’s first responsibility was to work with product management to develop a set of real-world user stories that were highly critical for our customers. These user stories targeted both View administrators and end users, including:

  • Management operations for View Administrators: Provisioning, Refresh, Recompose, and Rebalance at scale.
  • End-user activities: log-on and log-off storms, as well as workload runs after log on and power/maintenance policies after log off.

As part of user story definition, the team also defined a set of real-world criteria against which the scale tests were measured, including:

  • An 8-hour maintenance window for management operations.
  • A maximum acceptable error rate in conjunction with automated error recovery.
  • Log-on/off storms that modeled a normal distribution in a 60-minute window for various desktop power policies.
  • A maximum acceptable end-user log-on time during log-on storms.
  • Guest OS workload definitions for power, knowledge, and task workers.

2.3 Test Tools Development
Once the test cases and their performance and scale criteria were defined, the team needed to establish ways to drive the test execution, quantify the test output, and enable effective troubleshooting and analysis. The test frameworks chosen to fulfill those requirements were View Load Simulator and VMware View Planner.
View Load Simulator (VLS) was used for simulated log-on tests. VLS consisted of a set of JMeter clients (which simulated View Clients) capable of modeling normally distributed logons, and a set of simulated ESX hosts able to mimic View Agent behaviors. For VLS-driven logon tests, the View components under test were the View Security Servers and Connection Servers. Use of VLS allowed execution of a scaled logon storm at relatively small hardware cost. To conduct real end-to-end tests with real hardware and virtual machines, View Planner was used. View Planner 2.1 [1] already had the ability to work with View deployments to perform uniformly distributed log-on storms and guest OS workload generation. The enhancements added to View and View Planner to enable View scale tests included:

  • View Profile logging framework to enable extractable profile log entries.
  • Automated and centralized DCT collection process.
  • Normally distributed log-on and log-off storm simulation.
  • Centralized profile entry parsing and uploading to the report database.
  • Graphical JSON/OpenFlash-based reporting framework.

With these tooling frameworks and their additional features for scalability testing, the team was able to execute real-world test scenarios and quantify the results, as well as perform very granular troubleshooting and analysis to zero in on potential performance issues at scale.

2.4 Incremental Scale Up, Exploration, Findings, Improvement, and Validation
We adopted an incremental approach to scaling up to 10K. Initial tests were done at lower scale so that we could iron out test bed design and tooling issues and establish a performance baseline. The same tests were later rerun at higher scale levels (in 2K-desktop increments). At each scale level, the following aspects were examined:

  • Exploring and tuning system configurations (for example, increasing VirtualCenter and View Composer concurrency limits during VM provisioning) to see whether they yield better performance.
  • Generating and analyzing profiling reports to identify scalability and performance bottlenecks in the system.
  • Root-caused scalability issues were fixed and reviewed, then validated in the next iteration, and the resulting performance improvement was quantified and documented.
  • Wider-scope product improvements were also evaluated, implemented, and validated at each scalability level. Product improvements that yielded substantial benefit were communicated to the release management team for inclusion in an upcoming release.

The ultimate goal of the View Scale effort was to provide our customers with best practices and reference architectures for scaling View, based on the team’s findings and recommendations. These were documented through Knowledge Base articles, the View Architecture Planning Guide, Tech Marketing reference architectures, and technical blogs.

3. Experimental Setup

In this section, we discuss the experimental setup and configurations used for scaling to 10k real desktop VMs using one instance of vCenter. We also describe various large-scale design aspects that were taken into consideration while designing the test bed. We followed the best practice of keeping the replica on an SSD, which absorbs most of the read IOPS and avoids high IOPS requirements on the spinning drives.
The experimental setup for virtual desktops and infrastructure VMs is shown in the figure below:

Figure 1. Experimental setup for virtual desktops and Infrastructure VMs.


As shown in the test bed diagram in Figure 1, we use five clusters with several hosts (6 to 12) in each cluster to scale to 10k desktops. In each cluster, we deploy a pool of 2k desktops and the replica is kept on the SSD drive as shown in Figure 2.

Figure 2. Pictorial representation of the replica kept on SSD and used to make the linked clones.


To provision one such 2k-desktop pool, we used 360 15K-RPM hard drives of 300GB each. Since one hard drive provides about 200 IOPS, 360 hard drives supply about 72,000 IOPS, which translates to about 36 IOPS per desktop VM, sufficient even for a power user. We also used the same storage array to host the client VMs (used to connect to desktop VMs with the PCoIP display protocol); however, the amount of hardware/storage required for clients was significantly less, because we needed only 250 clients per 2k desktops, i.e., one client was used to connect to 8 desktops. On the infrastructure side, we used five View Connection Servers, five View Security Servers for tunneled-connection tests, and one vCenter Server instance hosting all 10k desktops.

3.1 View Infrastructure and Configuration
Now that we have described the hardware test bed, we present the configurations used in the software layers for the 10k desktop VM tests. We optimized the desktop image per the Windows 7 optimizations and best practices guide [8], disabling some group policy settings, some services, and so on. The configuration used for the View Connection Server is shown in the table below:

Table 1. Configuration used for the View Connection Server

3.2 Logon/Logoff storm
To support the logon/logoff scenario, new features were added to View Planner, such as normal-distribution simulation, which mimics a real-world scenario in which users log on to their desktops at varied times following a bell-curve pattern. A few of these changes follow:

  • Normal distribution: To follow a normal distribution during log-on/off tests, rather than the usual uniform distribution, we devised a mechanism in which each user sleeps for a random time drawn from the normal distribution (see the sketch after this list). The log-on/off period is defined as a +/- 3 standard deviation window; for example, a 60-minute log-on window has a 10-minute SD. Translating this to a log-on/off rate:
    • 95% of log-on/off attempts happen within a +/- 2 standard deviation window.
    • 68% of log-on/off attempts happen within a +/- 1 standard deviation window.
    • Peak log-on/off rate = 0.4 * (X / T), where X is the total log-on/off attempts and T is the time for a 1 SD window.
  • View client logoff: We run the log-off test much as an end user would in real life, closing the View Client window. In View Planner, we added the functionality to perform logoffs after the workload is completed. First, clients ignore random sleeps and logon attempt limits, which ensures all desktops are connected. Then desktops wait until all are connected, sleep randomly depending on the ramp-up time, and complete their workloads as normal. After a desktop completes, the harness informs the client of the corresponding user, and the client closes the View Client window to trigger a View logoff.
  • Powered-off and suspended desktop interoperability: We mixed a certain percentage of powered-off and suspended desktops into our log-on storm tests to mimic real-world scenarios.
  • Limiting View reconnection attempts: A maximum number of View Client connection attempts can be set, after which the client gives up.
  • Logging connection time information: At the end of the test, the results file is appended with information about how many attempts each user made and how long these took. This occurs for both View Planner remote and passive modes.
  • Adding an upper bound to non-randomness-based ramp-up time sleeps: We add an upper bound on the ramp-up time to make the total run cycle faster.
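
As a concrete illustration of the normal-distribution storm referenced in the first bullet above, the following Python sketch assumes the +/- 3 SD window convention and the peak-rate approximation from the list; it is illustrative, not View Planner code.

import random

def logon_offsets(num_users, window_minutes=60.0):
    """Random log-on start times (in minutes) inside the storm window."""
    sd = window_minutes / 6.0       # +/- 3 SD spans the whole window
    mean = window_minutes / 2.0
    # Clamp the tails so every user starts inside the window.
    return sorted(min(max(random.gauss(mean, sd), 0.0), window_minutes)
                  for _ in range(num_users))

def peak_rate(num_attempts, window_minutes=60.0):
    sd = window_minutes / 6.0
    return 0.4 * num_attempts / sd  # peak attempts per minute

print(peak_rate(2000))              # ~80 log-ons per minute at the peak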

3.3 View Profiling framework and Reporting infrastructure
The View profiling framework uses well-defined prefixes and syntax (operation name, start/end time, and object ID) to create log entries in View Connection Server logs for highly performance-critical operations. These log entries are included in the DCT collection process and can be parsed and uploaded to the MySQL database for performance analysis.
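
As an illustration of how such entries can be consumed, the following sketch parses a hypothetical profile line; the PROFILE prefix and field layout are assumptions, not View’s actual log syntax.

import re

ENTRY = re.compile(
    r"PROFILE op=(?P<op>\S+) start=(?P<start>\d+) end=(?P<end>\d+) id=(?P<id>\S+)")

def parse_profile_entries(lines):
    """Yield (operation, object_id, elapsed) for each profile entry."""
    for line in lines:
        m = ENTRY.search(line)
        if m:
            yield m["op"], m["id"], int(m["end"]) - int(m["start"])

log = ["2013-01-07 ... PROFILE op=provision start=1000 end=4200 id=vm-42"]
print(list(parse_profile_entries(log)))  # [('provision', 'vm-42', 3200)]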

The reporting infrastructure was an integral part of the entire process, as it provided great insight into the details of all the operations, including Provisioning, Recompose, Refresh, Rebalance, and Logon/off. The reporting mechanism consisted of three steps: gathering logs, parsing them to extract the required information into a relational MySQL database, and using that information to generate meaningful reports. It was ensured that log collection for every test was driven through View Planner. The log collection step gathered ESXTOP data, View Planner logs, and product logs, which included those from the View Connection Server, vCenter, View Composer server, Security Server, Agent desktops, and client machines.

ESXTOP logs contained information about the hosts used for a particular test: CPU usage, memory usage, IOPS, I/O latency, network throughput, and guest-level stats. View Planner logs hold the data regarding client registration with the View Planner harness, connection to the desktop, initiation and termination of the workload, and logoff or disconnect. The View Planner logs and the Agent/Client logs were useful for logon/off test report generation. For collection of Agent/Client logs, parameters were added to View Planner to control the percentage of DCTs (Agent/Client logs) to collect out of the entire setup and the delay between triggering the collection command on each desktop.

All of the above logs were collected into a separate harness built for log storage. From there, the logs were consumed by scripts that parsed them and extracted the useful information into database tables. Report generation could be triggered from both the View Planner harness UI and the console.

Figure 3. Screenshot of a sample report used for analysis of the management operation under test.


Figure 3 gives an idea of what a report looks like, with the parameters it measures, such as clone creation, linked-clone creation/deletion, and refitting operations. Around 96 such parameters are displayed in the report for the standard broker and around 35 for replica brokers. The “chart”, “histogram”, and “concurrency” links open different plots of the data; we show some significant examples of them later in this paper.

4. Findings

4.1 Certificate verification and log on time
Initial results from the end-to-end log-on tests of 2000 users showed that each user took about 30 seconds to complete their log on. Most of the remote XML API calls to the connection server completed in less than 1 second, with the exception of a pair of combined calls and the desktop-connection call. Since the desktop-connection call performs the primary work in setting up the tunnel and brokering the desktop connection, it was not surprising that it would take some time. However, the two combined calls are trivial operations, and profiling data from the connection server confirmed that they were executing in under 1 second. Yet, profiling data from the clients indicated a total time of over 16 seconds. Armed with this knowledge, we did some debugging and found that before actually executing the calls, in the process of verifying the connection server’s certificate, the client was attempting to contact the Windows Update server [2]. Since our test bed was in an isolated lab with no external Internet access, this attempt was timing out and adding an additional 15 seconds to the operation time. While this deployment scenario is not typical, some customers with special security considerations may still face the same issue, so we published a KB article describing the problem and suggested solutions [3].

4.2 I/O workloads and power operations
While the speeds of the remaining XML API calls were satisfactory, the desktop-connection call was still the largest contributor to logon time, so we investigated it further to determine whether anything could be improved. We found that the bulk of the call was spent waiting for the desktop agent to finish processing a “start session” command, which is responsible for allocating a Windows and display-protocol session for the user. Since much of this is handled by third-party code, there wasn’t anything we could change directly. However, we observed a dramatic difference in the “start session” times depending on whether the user ran a desktop workload after logging on. With no workload, “start session” consistently completed in 2-3 seconds:

Figure 4. Graph showing the “start session” completion in a short duration of time for NO workload.

With a medium-sized workload, “start session” operations completed in 2-3 seconds at first, but towards the peak of the logon window they started varying dramatically in length, anywhere between 3 seconds and the 60-second timeout:

Figure 5. Graph showing variation in the “start session” completion time for medium-sized workloads.

We analyzed the ESXTOP data from the test run and found a correlation between the longer “start session” times and the times of peak IOPS activity. This suggests, as we have long suspected, that IOPS capacity of data stores is a significant factor in View desktop performance [4].

As we moved on to larger-scale tests and different parameters, we uncovered some bugs and interesting behaviors at scale. One that stood out was the impact of the power-operations concurrency limit [5] on logon storms in which the connection server powers on some VMs on demand. The primary symptom was that at scales above 4000 users, some logon attempts would fail, and the connection server logs would indicate that it had exceeded operation limits. After collecting a few data points, we realized that this was the concurrency-limits feature of the product working as intended, but with a default setting that was too low for the given scale. We worked out exact equations describing the relationships between total scale, logon window, and the amount of time it takes to fully power on a VM. These equations turned out to be relatively complicated because they involve shifts of the cumulative distribution function of the normal distribution. By focusing only on the peak power-on rate, which is the worst case in the equations, we arrived at a much simpler, though approximate, equation:

concurrentPowerOperations = desktopPowerOnTime × peakPowerOnRate

We published this equation and its application to a common set of parameters in a KB article [6]. The default power-operations concurrency limit of 50 should support a peak desktop power-on rate of about 16 desktops per minute, which corresponds to 2000 users logging on over 60 minutes with 20% of their desktops powered off.
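As a back-of-the-envelope check of those numbers, the sketch below plugs the KB example into the equation; the roughly three-minute desktop power-on time is our own illustrative assumption, not a figure from the article:

// Back-of-the-envelope check of the concurrency equation above.
public class PowerOpsSizing {
    public static void main(String[] args) {
        double desktopPowerOnTimeMin = 3.0;   // assumed time to fully power on one desktop
        double peakPowerOnRatePerMin = 16.0;  // peak rate from the 2000-user example [6]
        double concurrentPowerOperations = desktopPowerOnTimeMin * peakPowerOnRatePerMin;
        // Prints 48, which fits within the default limit of 50.
        System.out.printf("Required concurrency limit: %.0f%n", concurrentPowerOperations);
    }
}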

4.3 Provisioning operation
For pool-management operations such as provisioning desktops, we focused on the end-to-end time for completing all operations, as that ultimately determines the maintenance window. We were especially interested in experimenting with different concurrency limits, because we suspected that the default limit was overly conservative and that we could easily decrease the end-to-end time by increasing it. The following table shows the results, sorted by the time per 512 desktops provisioned.

[Table: provisioning results at various concurrency-limit settings, sorted by time per 512 desktops]

This was quite unexpected. Even with dramatically higher concurrency-limit settings, the overall throughput was virtually the same as at the conservative default setting of 8, and increasing the setting to 100 noticeably reduced performance compared to all previous settings. We drilled down into more detailed profiling information from all the runs and found that the provisioning time for many desktops was dominated not by the clone operation itself, but by the preparation step of creating a cloning specification. Our instrumented profiling wasn’t yet detailed enough to pinpoint the root cause, but through code inspection we found suspicious usage of a lock. In particular, the lock used a strict LIFO policy, which led to a distinct pattern of earlier operations taking longer to finish.

Figure 6. Graph showing Provision operations taking longer to finish due to the LIFO policy used for scheduling the cloning operations.

The operations starved the most were taking up to 3500 seconds to complete. With a trivial change to use a fairer queue policy on the lock, operation length became much more consistent, at 800-1000 seconds each:

Figure 7. Improvements in the Provision time due to usage of a fairer queue policy on the lock for scheduling the cloning operations.
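The product code is not reproduced here, but the effect of lock fairness is easy to illustrate with a minimal sketch built on Java's ReentrantLock: its fairness flag switches between barging acquisition, which can starve long waiters (similar in effect to the LIFO behavior we observed), and FIFO ordering. The class and method names below are hypothetical.

import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch (not the product code). With fair = false, newly arriving
// threads may "barge" ahead of long-time waiters, so early operations can be
// starved. With fair = true, waiters acquire the lock in FIFO order, which
// evens out operation times.
public class CloneSpecLock {
    private final ReentrantLock lock = new ReentrantLock(true); // true => FIFO hand-off

    void prepareCloneSpec(Runnable preparation) {
        lock.lock();
        try {
            preparation.run(); // the serialized preparation step
        } finally {
            lock.unlock();
        }
    }
}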

This was noticeably better, but 1000 seconds is still a tremendous overhead compared to the primary work of cloning the VM, which typically took only 60-120 seconds. That was unacceptable, so we instrumented more profiling to confirm that lock contention was the root cause, then designed and developed an improved preparation step that quickly releases the lock. The improvement has since shipped in Horizon View 5.2 and is described in more detail further down in this paper.

After making that improvement, we discovered a couple more bottlenecks:

  • Provisioning speed decreased with increasing pool size (Figure 8). This was caused by inefficiencies in allocating an appropriate name for the next VM to be created, which stemmed from the fact that the VMs in a pool are ordered lexicographically while the code expected numerical ordering (see the sketch after this list).
Figure 8. Graph showing the decreasing provisioning concurrency with increasing pool size.

  • At very large scale (8K → 10K), provisioning speed decreased significantly. This was primarily caused by decrypting all VMs’ private keys to determine whether anything had changed. We optimized this by checking just the public keys; the before-and-after charts are shown in Figure 9:
Figure 9. Difference in the concurrency fluctuation of 2K provisioning at 8K scale before and after code change.
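To make the name-allocation bottleneck concrete, the following sketch (with hypothetical VM names) shows the ordering mismatch: lexicographic order places “vm-10” before “vm-9”, so code that expects numeric order must do extra scanning to find the next free name.

import java.util.Comparator;
import java.util.List;

public class VmNameOrdering {
    public static void main(String[] args) {
        List<String> names = List.of("vm-1", "vm-10", "vm-2", "vm-9");
        // Lexicographic order (what the inventory actually returns):
        System.out.println(names.stream().sorted().toList());
        // -> [vm-1, vm-10, vm-2, vm-9]
        // Numeric order (what the allocation code expected):
        System.out.println(names.stream()
                .sorted(Comparator.comparingInt(
                        n -> Integer.parseInt(n.substring(n.indexOf('-') + 1))))
                .toList());
        // -> [vm-1, vm-2, vm-9, vm-10]
    }
}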


4.4 Maintenance operations

The end-to-end time for recompose operations did show some improvement when we increased the concurrency limit from its default setting of 12:


[Table: recompose results at various concurrency-limit settings]

The throughput improved rapidly, but we started getting diminishing returns beyond a setting of 40. Additionally, we found that even higher concurrency settings put us at risk of overloading and eventually crashing the vCenter server. With a few more test runs on different test beds, we determined that the ideal concurrency setting for maximizing throughput without risking instability is highly dependent on the capacity of the infrastructure. However, deriving an equation to express this relationship is intractable, since it involves so many variables: the capacities of storage, network, CPU, and memory; vCenter and ESX versions; and the configurations of vSphere clusters and concurrency limits, to name a few. Instead of documenting an equation, we need the system to automatically discover its capabilities and tune itself accordingly.

5. Improvements

5.1 Provisioning and rebalance throughput / datastore selection improvement
5.1.1 Problem
Provisioning and rebalance operations require selecting an appropriate datastore for placement of each individual VM. This datastore selection happens while holding a lock, and the work done under that lock takes a nontrivial amount of time. This causes the following two problems.

  • Slowness: Each cloning/rebalance operation spends a high percentage of its time waiting to acquire the lock. As a result, no throughput improvement is observed at higher concurrency levels.
  • Correctness: Despite synchronizing the datastore selection, the VMs do not get distributed correctly across the various datastores. This stems from the fact that after selecting a datastore we do not reserve any space for the allocation we just made, while the clone operation takes some time to actually create the disks.

5.1.2 Key Concepts
The following strategies were used to solve this problem.

  • Caching: One of the key reasons each selection is slow is the time spent making a VC call to get the latest datastore and VM information. This is easily approximated using cached values, and VCCache is leveraged as much as possible.
  • Reservation: Using the latest value of a datastore’s freeSpace hurts correctness, because VC is not yet aware of the space allocations we have made. Caching this value and adjusting it by the reserved space improves both the correctness and the speed of the solution.

5.1.3 Throwaway Cache Reservation Algorithm
A simple caching-and-reservation strategy is a little problematic: in the short term our internal cached data has a more accurate view of the datastores than VCCache (or VC itself), but as time progresses the internal maps diverge from the real disk usage. This can be caused by growth in disk sizes, power-ons, power-offs, VM deletions, or external operations.

Throwaway: To avoid this, we follow a strategy of throwing away our internal cache (on cache expiry) and recalculating it from VCCache. This does mean that we probably throw away the reservation information of at least some in-flight provisioning operations (which are not yet reflected in VCCache), and the decisions we make right after a cache refresh might be suboptimal. However, the following points help the overall approach stay on track; a simplified sketch of the resulting selection logic follows the list.

  • Assuming that we were distributing VMs across the datastores before the refresh, the VCCache data shouldn’t be skewed for or against a particular DS by too much.
  • Inaccuracies that might creep in after DSCache is recalculated from VCCache should cancel out over multiple cycles of refresh.
  • Since linked clones start out very small, if we were to begin with datastores of widely varying utilization, we would end up over-packing many more VMs onto the datastore with the lowest density. To overcome this, a penalty factor is used to discourage consecutive or frequent selection of the same datastore. The penalty is calculated as (penalty factor × steady-state size of the VM).
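The sketch below summarizes the selection logic under the assumptions above; all names (DatastoreCache, the VCCache input map, PENALTY_FACTOR, the one-minute expiry) are hypothetical illustrations, not the product implementation.

import java.util.HashMap;
import java.util.Map;

public class DatastoreCache {
    private static final double PENALTY_FACTOR = 0.5;  // assumed penalty factor
    private static final long CACHE_TTL_MS = 60_000;   // assumed cache expiry

    private final Map<String, Long> freeSpace = new HashMap<>(); // datastore -> bytes free
    private final Map<String, Integer> recentPicks = new HashMap<>();
    private long lastRefresh;

    // Throwaway: on expiry, discard local reservations and rebuild from VCCache.
    private void maybeRefresh(Map<String, Long> vcCacheFreeSpace) {
        long now = System.currentTimeMillis();
        if (now - lastRefresh > CACHE_TTL_MS) {
            freeSpace.clear();
            freeSpace.putAll(vcCacheFreeSpace); // in-flight reservations are intentionally lost
            recentPicks.clear();
            lastRefresh = now;
        }
    }

    // Pick the datastore with the most effective free space, then reserve space on it.
    public String select(Map<String, Long> vcCacheFreeSpace, long steadyStateVmSize) {
        maybeRefresh(vcCacheFreeSpace);
        String best = null;
        double bestScore = -Double.MAX_VALUE;
        for (Map.Entry<String, Long> e : freeSpace.entrySet()) {
            // Penalize recently picked datastores so small linked clones do not
            // all pile onto the emptiest one.
            double penalty = recentPicks.getOrDefault(e.getKey(), 0)
                    * PENALTY_FACTOR * steadyStateVmSize;
            double score = e.getValue() - penalty;
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        // Reservation: charge the expected steady-state size immediately, even
        // though the clone will create its disks only later.
        freeSpace.merge(best, -steadyStateVmSize, Long::sum);
        recentPicks.merge(best, 1, Integer::sum);
        return best;
    }
}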

5.2 Large Pool Support Problem
We identified two artificial, system-imposed constraints as impediments to reaching 10K scale. Both concerned how large an individual pool of desktop machines View could support. Though View allows an arbitrary number of desktop pools, it is not reasonable to expect customers at 10K scale to create many small pools just to reach 10K total machines. This matters because customers normally organize pools around desktops with similar business purposes, not around characteristics that work around limitations of the View system itself. As such, we found it desirable to remove any constraints that would prevent a desktop pool from reaching its recommended maximum size of 2000 machines. With this allowance in place, our testing could be performed with as few as 5 desktop pools.

5.2.1 Eight Host Limit
View defines a desktop pool on a single cluster within Virtual Center, so the cluster capacity represents an upper bound on the number of machines that can be provisioned within a desktop pool. In previous versions of Virtual Center, because of concurrency constraints when utilizing shared disks, VMFS supported only up to 8 hosts per cluster. To protect against this, View internally prevented desktop pools created on clusters using VMFS from containing more than 8 hosts (this constraint was not present for NFS disks). However, starting with vSphere 5.1 and VMFS version 5, the limit was increased to 32 hosts; View had lagged behind in adopting this change. Once we removed the 8-host limit in supported environments, 32 hosts were more than sufficient to support up to 2000 desktops in a single pool.

5.2.2 Automatic Network Label Assignment
Conventional wisdom and industry best practice peg the recommended size of a VLAN at a /24 subnet, or about 254 hosts, especially in virtual environments. When provisioning desktops within a pool, newly created virtual machines take on the network-label characteristics of the pool’s single parent virtual machine. This network label defines the VLAN tag for the machine, which in turn defines the subnet size and, more indirectly, the DHCP address range available to that machine. Therefore, when an administrator creates desktop pools with a parent virtual machine that follows the recommended standards, the newly provisioned machines all share the same DHCP address range. This means that any more than about 254 machines using this configuration in a desktop pool will oversubscribe the number of IP addresses available to them, placing an artificial limit on the size of the pool, absent awkward workarounds to later reassign the child machines to new network labels and VLANs.

To address this problem, we implemented a new feature to deal with the fact that newly provisioned desktop machines always inherit their pool’s parent machine’s network label. Administrators are now able to specify a set of available network labels (and, indirectly, VLANs and DHCP IP address ranges) that newly provisioned machines can be assigned. An additional step was added to View’s provisioning to automatically assign the next non-exhausted network label to new machines in the pool instead of the parent’s label. In this way, as long as the administrator supplies enough network-label capacity, desktop pools of large sizes can be created while still following the best practice of individual /24 VLAN subnets. As an additional challenge, we found that re-provisioning operations that refresh or recompose existing machines to some old or new parent state would wipe away these network-label assignments, so we added logic to prevent this case. With these features, IP range constraints are no longer an impediment to overall pool size in View.
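A minimal sketch of the label-assignment step described above; the class name and the per-label capacity bookkeeping are illustrative assumptions.

import java.util.LinkedHashMap;
import java.util.Map;

public class NetworkLabelAssigner {
    // label -> remaining DHCP addresses (roughly 254 for a /24 subnet)
    private final Map<String, Integer> remaining = new LinkedHashMap<>();

    public NetworkLabelAssigner(Map<String, Integer> labelCapacities) {
        remaining.putAll(labelCapacities);
    }

    // Return the next non-exhausted label, decrementing its remaining capacity.
    public String assignNext() {
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() > 0) {
                e.setValue(e.getValue() - 1);
                return e.getKey();
            }
        }
        throw new IllegalStateException("all network labels are exhausted");
    }
}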

5.3 Rolling maintenance support
While we improve the scale and throughput of management operations, it is also important to be mindful of different variants of scale use cases. For customers with mission-critical View Composer pools, maintaining desktop availability during refit operations is as important as the operation throughput itself. This is particularly true for health-care customers, where physicians need to be able to access critical patient data even during a maintenance window. Rolling maintenance support was achieved by introducing a new configuration parameter, RollingRefitMinReadyVM: the minimum number of desktops that must remain available for logon during recompose/refresh/rebalance operations on a View Composer pool.

The algorithm run during a View Composer pool maintenance operation is as follows.

  • Look up the RollingRefitMinReadyVM setting for the pool.
  • Query the number of ready to use desktops in this pool.
  • Check the maximum concurrent operations allowed for refit operations.
  • Effectively, the maximum number of concurrent refit operations for this pool = min(NumReadyVMsInPool − RollingRefitMinReadyVM, MaxConcurrentRefitOperationsAllowed).

With rolling maintenance support, maintenance operations and system downtime are no longer synonymous, and maintenance-operation throughput can be pushed higher without sacrificing critical user accessibility.
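A minimal sketch of this throttle, with hypothetical method names and illustrative numbers:

public class RollingRefitThrottle {
    public static int maxConcurrentRefitOps(int numReadyVMsInPool,
                                            int rollingRefitMinReadyVM,
                                            int maxConcurrentRefitAllowed) {
        // Never dip below the configured floor of ready desktops.
        int headroom = Math.max(0, numReadyVMsInPool - rollingRefitMinReadyVM);
        return Math.min(headroom, maxConcurrentRefitAllowed);
    }

    public static void main(String[] args) {
        // For example: 1800 ready desktops, a floor of 1780, and a global cap of 40.
        System.out.println(maxConcurrentRefitOps(1800, 1780, 40)); // prints 20
    }
}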

5.4 Auto-tuning concurrency limits
Our findings above showed that choosing optimal concurrency-limit settings is a tricky and expensive task even for a dedicated scalability team with internal engineering knowledge, so it would be completely unreasonable to expect customers to attempt it. We therefore started pursuing an auto-tuning feature for concurrency limits. Instead of taking manually configured static settings, the product should automatically detect the current capacity of its underlying infrastructure and dynamically adjust the concurrency limit up and down as appropriate, increasing management-operation throughput without risking instability.

We started prototyping this idea as an external tool that hooked into the connection server’s logs to detect when adjustments would be appropriate and then made the adjustments automatically through View’s PowerCLI administration interface. Our strategy was inspired by TCP’s congestion-control system and its use of feedback as a signal to back off: we gradually ramp up the concurrency setting until we observe either dropping throughput or error conditions, at which point we reverse course and lower the setting. Since the system capacity may continue to change, we keep this strategy active, essentially implementing a hill-climbing algorithm that targets throughput.
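A minimal sketch of the hill-climbing loop; the starting point, step size, and method names are illustrative assumptions rather than the prototype's actual interface.

public class ConcurrencyTuner {
    private int limit = 8;             // assumed conservative starting setting
    private double lastThroughput = 0;
    private int direction = +1;        // +1 = ramp up, -1 = back off

    // Called periodically with the throughput and error signals mined from the logs.
    public int adjust(double observedThroughput, boolean sawErrors) {
        if (sawErrors || observedThroughput < lastThroughput) {
            direction = -direction;    // feedback says back off: reverse course
        }
        limit = Math.max(1, limit + direction * 4); // assumed step size of 4
        lastThroughput = observedThroughput;
        return limit;                  // the prototype applied this via PowerCLI
    }
}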

We tested the prototype on the same test beds we had been using, and the results were encouraging. The recompose operation throughput noticeably increased compared to test runs using the default concurrency settings, and the peak that the prototype found was close to what we had found in our manually driven static tests.

6. Related Works

For adaptively tuning the concurrency limit of management operations, the work by Bhatt et al. [7] is closest to our approach. Their vATM system [7] implements a feedback-based throughput maximizer for vSphere operations and can potentially serve multiple applications built on vSphere rather than just one. Since this solution is more general and more developed, there is a possibility of leveraging some of its functionality in the future. There have also been studies on understanding management overhead and building scalable datacenters [9][10]; however, some of these techniques are not directly applicable to the specific needs of View scale requirements.

Regarding visualization and profiling tools for logs, several tools are available, such as VMware vCenter Log Insight [11]. However, to get custom and meaningful insights, we added some simple enhancements to the View Planner tool to produce the precise data and charts that helped us find the bottlenecks.

7. Conclusion

Our View Scale testing and engineering effort helped solve the challenging View scalability problem. It uncovered several aspects of the product that became vivid only during scale testing and provided concrete information about how to reproduce such scenarios. The following table gives a high-level comparison of the time taken by the management operations before and after our View Scale effort. These new comparison tests were performed on a different configuration of underlying hardware, but they still showed an enormous reduction in the end-to-end time required for each operation.
[Table: before/after comparison of end-to-end times for the management operations]

We published many useful findings from this initiative in the View Connection server’s help guide.

8. Future Work

Potential future projects for the View Scale team include: (1) as previously mentioned, looking for integration opportunities with vATM to maximize concurrent operation limits; (2) designing and implementing a job framework capable of supporting scheduling and workflow for provisioning and maintenance operations; and (3) integrating View Planner, VLS, and other test frameworks with the VMODL-based View API to enable complete end-to-end automation of scale tests, including even higher-scale use cases such as Multi-Datacenter View.

References

1. Banit Agrawal et al. VMware View® Planner: Measuring True Virtual Desktop Experience at Scale. VMware Technical Journal, December 2012.
2. Windows Update Server.
3. View users inside the firewall might experience a 15-second delay when connecting to View Connection Server, while Windows attempts to reach Windows Update Server (2020988). http://kb.vmware.com/kb/2020988
4. B. Agrawal et al. VMware® Horizon View™ 5.2 Performance and Best Practices. VMware White Paper, March 2013.
5. VMware Horizon View 5.2 Administration Guide.
6. Setting a concurrent power operations rate to support View desktop logon storms. http://kb.vmware.com/kb/2015557
7. Chirag Bhatt et al. vATM: VMware vSphere Adaptive Task Management. VMware Technical Journal, Summer 2013.
8. VMware® Horizon View™ Optimization Guide for Windows 7 and Windows 8. White Paper.
9. Vijayaraghavan Soundararajan and Kinshuk Govil. Challenges in building scalable virtualized datacenter management. ACM SIGOPS Operating Systems Review, 44(4), December 2010.
10. Vijayaraghavan Soundararajan and Jennifer M. Anderson. The impact of management operations on the virtualized datacenter. ISCA 2010: 326-337.
11. VMware vCenter Log Insight. www.vmware.com/products/vcenter-log-insight

Introduction


Welcome to Volume 3, Number 1 of the VMware Technical Journal (VMTJ). The main theme of this issue is the work of the VMware Ecosystems Engineering organization. We have five excellent papers from this team, which are introduced by our guest co-editor, T. Sridhar.

In addition, we are pleased to include a number of papers on a variety of topics related to VMware technology. These papers are indicative of the breadth and depth of the work being done at VMware. “Scaling VMware View Management Operations: Analysis and Optimizations” by Agrawal and Mishra describes how the VMware View® R&D team developed support for a very large number (thousands) of View clients. “Scaling of Cloud Applications Using Machine Learning” by Padala et al. presents vScale, a technology that enables horizontal scaling of virtual machines for multitier applications to meet application-level service level objectives. “VSRT: An SDN-Based Secure and Fast Multicast Routing Technique” by Ajay Kumar describes a new approach to multicast routing that the author refers to as a Very Secure Reduced Tree (VSRT) routing algorithm, which addresses a number of problems associated with standard IP multicast routing. “VProbes: Deep Observability into the VMware ESXi Hypervisor” by Carbone et al. presents an instrumentation system for observing the main layers of the VMware software stack: the guest OS, the Virtual Machine Monitor, and the VMware® ESXi™ kernel. “Automatic Discovery of Configuration Policies” by Jain and Frascadore describes a technique for automatically discovering configuration rules (“policies”) and the associated resource groups that the rules apply to in very large-scale data center environments. “Statistical Normalcy Determination Based on Data Categorization” by Marvasti et al. introduces a technique for data categorization with analysis tools that identify normalcy bounds, which is useful for automating anomaly detection, prediction, capacity planning, and root-cause analysis. These techniques are employed in our VMware® vCenter™ Operations Manager™ product. Finally, “VMware ESX Memory Resource Management: Swap” by Banerjee et al. describes the design of the VMware® ESX® swap subsystem, an integral part of ESX memory resource management.

I hope that you enjoy this issue of the VMTJ. As always, we welcome your comments and feedback.

Sincerely


Curt Kolovson
Sr. Staff Research Scientist
VMware Academic Program


VMware OS Optimization Tool


The VMware OS Optimization Tool helps optimize Windows 7/8/2008/2012 systems for use with VMware Horizon View. The optimization tool includes customizable templates to enable or disable Windows system services and features, per VMware recommendations and best practices, across multiple systems. Since most Windows system services are enabled by default, the optimization tool can be used to easily disable unnecessary services and features to improve performance.

You can perform the following actions using the VMware OS Optimization Tool:

  • Local Analyze/Optimize
  • Remote Analyze/Optimize
  • Optimization History
  • Managing Templates

New for version 2014!

  • Updated templates for Windows 7/8 – based on VMware’s OS Optimization Guide
  • New templates for Windows 2008/2012 RDSH servers for use as a desktop
  • Single portal EXE design for ease of deployment and distribution
  • Combination of Remote and Local tools into one tool
  • Better template management, with built in and user-definable templates
  • Results report export feature.

Various bug fixes, usability enhancements, and GUI layout updates.

Tap Tap vCloud Client


Tap Tap vCloud Client is an Android application for managing and monitoring cloud organizations in VMware vCloud Director and in VMware vCloud Hybrid Service.

Tap Tap supports the following operations and features:

  • Deploy vApps from vApp templates
  • Power operations on vApps and VMs
  • Clone vApps
  • Suspend vApps
  • List VMs, vApps, and vApp templates
  • Search VMs and vApps
  • Snapshots of vApps
  • Tasks and events to monitor vCloud Director

 


ViewDbChk


The ViewDbChk tool allows administrators to scan for, and fix, provisioning errors that cannot be addressed using View Administrator. Provisioning errors occur when there are inconsistencies between the LDAP, vCenter, and View Composer databases. These can be caused by direct editing of the vCenter inventory, restoring a backup, or a long-term network problem.

This tool allows VMware View administrators to scan for machines that cannot be provisioned and to remove all database entries for them. The View Connection Server will then be able to re-provision the machines without any errors.

 


I/O Analyzer


VMware I/O Analyzer is an integrated framework designed to measure storage performance in a virtual environment and to help diagnose storage performance concerns. I/O Analyzer, supplied as an easy-to-deploy virtual appliance, automates storage performance analysis through a unified interface that can be used to configure and deploy storage tests and view graphical results for those tests.

I/O Analyzer can use Iometer to generate synthetic I/O loads or a trace replay tool to deploy real application workloads. It uses the VMware VI SDK to remotely collect storage performance statistics from VMware ESX/ESXi hosts. Standardizing load generation and statistics collection allows users and VMware engineers to have a high level of confidence in the data collected.

Please post comments and questions regarding this fling to the I/O Analyzer Community.

Features

  • Integrated framework for storage performance testing
  • Readily deployable virtual appliance
  • Easy configuration and launch of storage I/O tests on one or more hosts
  • Integrated performance results at both guest and host levels
  • Storage I/O trace replay as an additional workload generator
  • Ability to upload storage I/O traces for automatic extraction of vital metrics
  • Graphical visualization of workload metrics and performance results

New in version 1.6.1

  • Changed the guest I/O scheduler to NOOP and disabled I/O coalescing at the I/O scheduler level.
  • Downgraded the VM version to 7 for compatibility with ESX/ESXi 4.0.
  • Back-end improvements to workload generator synchronization to support 240+ workers.
  • Bug fixes.

Latency Sensitivity Troubleshooting Tool


The Latency Sensitivity Troubleshooting Tool provides scripts and examples to troubleshoot configuration and performance problems with the Latency Sensitivity feature in VMware vSphere 5.5.

Features

  • A Python script that runs on ESXi to check virtual machine and physical NIC (PNIC) configuration and to monitor host, virtual machine, and PNIC performance.
  • A Python program that processes traces from pktcap-uw for a ping workload and prints the time spent in ESXi on the receive path, the time spent in the virtual machine, and the time spent in ESXi on the transmit path.
  • A simple C program demonstrating the trace format generated by pktcap-uw. The C program was tested on an x86_64 Linux virtual machine.
  • Example SystemTap scripts to break down ping and netperf TCP_RR latencies inside a Red Hat Linux guest. These scripts were tested on a Red Hat Enterprise Linux 6.2 virtual machine.


Services Virtual File System


Building on top of the FUSE API (Linux-specific) and the llfuse library, the Services Virtual File System (SVFS) aims to provide an easy way to create a custom file system that integrates seamlessly with the native Linux one. Users can execute almost all operations on it, including but not limited to creating and removing files and directories, as well as searching and reading from/writing to files. This allows the user to map just about any structure onto a Linux file system, using the abstractions provided by SVFS.

By providing a coherent mapping of REST API operations to file system commands, one can map a RESTful web service to the file system. For instance, HTTP GET can populate a directory when a read (ls) is triggered, or HTTP POST can be executed on file open/execution. A tree command can traverse the whole REST tree and map everything from the API locally. This enables automation through scripts that execute on the file system instead of crawlers that follow URLs.


BatchV2V


BatchV2V is a command-line utility that helps simplify batch V2V operations using VMware Converter. On first run, BatchV2V outputs a configuration file that can be reused to batch-copy one virtual machine to many other targets. Many administrators do this manually on a regularly scheduled basis with VMware Converter.

BatchV2V batches the copy job so you don’t have to specify the parameters manually on an ongoing basis.



ESXi Mac Learning dvFilter


MAC learning solves performance problems for use cases like nested ESX. This ESX extension adds support for MAC learning on vswitch ports. For most ESX use cases, MAC learning is not required, because ESX knows exactly which MAC address will be used by a VM. However, for applications like nested ESX, i.e., ESX running as a guest VM on ESX, the situation is different: because an ESX VM may emit packets for a multitude of different MAC addresses, it currently requires the vswitch port to be put in “promiscuous mode”. That, however, leads to too many packets being delivered into the ESX VM, since every packet on the vswitch is then seen by every ESX VM. When running several ESX VMs, this can cause very significant CPU overhead and noticeable degradation in network throughput. Combining MAC learning with “promiscuous mode” solves this problem.

The MAC learning functionality is delivered as a high speed VMkernel extension that can be enabled on a per-port basis. It works on legacy standard switches as well as Virtual Distributed Switches.

The MAC learning module has a few noteworthy limitations:

  • Once learned, a MAC address is never aged out. For very long-running ESX VMs with high churn in the MAC addresses used (e.g., via nested guest VMs), this may be a problem: if the MAC table of a particular port is full, MAC learning can no longer improve performance.
  • MAC learning is not applied to multicast traffic, so multicast traffic will see no performance improvement.

For more information, read this blog.

The download is on the Instructions tab.

WebCommander


****** WebCommander is now open source! To get the download or source code, go to the WebCommander project page. ******

New features of 4.0 (available on the WebCommander project page):

  • Filter commands via categories
  • Allow static commands without parameters
  • Allow using the output of earlier commands as the input of later commands in a workflow
  • View source code directly
  • Allow using Windows Authentication to authenticate users
  • Allow using HTTPS
  • More commands (including one to install WebCommander)
  • UI enhancements
  • Provide setup.ps1 to install, configure, and upgrade WebCommander (it cannot install PowerCLI for external users)

For more detailed information, also see WebCommander goes Open Source!

Have you ever wanted to give your users access to certain virtual infrastructure tasks instead of the entire vCenter Client?

WebCommander is a way to do this! WebCommander was designed as a framework to wrap your PowerShell and PowerCLI scripts into an easy-to-access web service. Now you can hand off the tasks your users need by simply adding a new script and giving them access to WebCommander.


View Auditing Portal


View Auditing Portal is a new web portal that serves as an extension to Horizon View Administrator. It has three functions:

  • Show parent virtual machines of linked-clone desktop pools and their descendant snapshots in a tree view. Snapshots not in use by linked-clone pools are marked in grey, so that View administrators can remove them.
  • Show statistics on the operating systems and versions of View clients in different view styles.
  • Show virtual desktop connections by desktop pool, and virtual application connections by RDS (Remote Desktop Service) farm.


XenApp2Horizon


The XenApp2Horizon Fling helps you migrate published applications and desktops from XenApp to Horizon View. One XenApp farm is migrated to one or more Horizon View farms.

The GUI wizard-based tool helps you:

  • Validate the View agent status on RDS hosts (from the View Connection Server and the XenApp server)
  • Create farms
  • Validate application availability on RDS hosts
  • Migrate applications/desktops to one or multiple farms (new or existing)
  • Migrate entitlements to new or existing applications/desktops; combinations of application entitlements are supported
  • Check the environment
  • Identify incompatible features and configuration


PowerActions for vSphere Web Client


PowerActions integrates the vSphere Web Client and PowerCLI to provide complex automation solutions from within the standard vSphere management client.

PowerActions is deployed as a plugin for the vSphere Web Client and allows you to execute PowerCLI commands and scripts in a PowerShell console integrated into the vSphere Web Client.

Furthermore, administrators can enhance the native Web Client capabilities with actions and reports backed by PowerCLI scripts persisted in the vSphere Web Client. Have you ever wanted to “right-click” an object in the Web Client and run a PowerCLI script? Now you can!

For example, as an administrator, I can define a new action for the VM objects presented in the Web Client, back this action with a PowerCLI script, save it in a script repository within the Web Client, and later re-use the newly defined action straight from the VM object's context (right-click) menu.

Or, as an administrator, I can create a PowerCLI script that reports all VMs within a datacenter that have snapshots over 30 days old, save it in a script repository within the Web Client, and later execute this report straight from the Datacenter object's context menu.

Or better yet, why not share your pre-written scripts with the rest of the vSphere admins in your environment by simply adjusting them to the correct format and adding them to the shared script folder.

For additional information see the video in the Video tab, or read this article.


