John Nelson
University of Tennessee
john.i.nelson@gmail.com
Abstract
Performance Application Programming Interface (PAPI) aims to provide a consistent interface for measuring performance events using the performance counter hardware available on the CPU as well as available software performance events and off-chip hardware. Without PAPI, a user may be forced to search through specific processor documentation to discover the name of processor performance events. These names can change from model to model and vendor to vendor. PAPI simplifies this process by providing a consistent interface and a set of processor-agnostic preset events. Software engineers can use data collected through source-code instrumentation using the PAPI interface to examine the relation between software performance and performance events. PAPI can also be used within many high-level performance-monitoring utilities such as TAU, Vampir, and Score-P.
VMware® ESXi™ and KVM have both added support within the last year for virtualizing performance counters. This article compares performance measurements of five real-world applications from the Mantevo benchmark suite running in a VMware virtual machine, in a KVM virtual machine, and on bare metal. These results show that PAPI provides accurate performance counts in a virtual-machine environment.
1. Introduction
Over the last ten years, virtualization has become far more popular, driven by fast, inexpensive processors. Virtualization provides many benefits that make it an appealing test environment for high-performance computing. The ability to encapsulate a configuration is a major motivation for performance testing on virtual machines: virtual machines are portable across heterogeneous systems while providing an identical configuration within the guest operating system.
One consequence of virtualization is an additional hardware abstraction layer, which prevents PAPI from reading the hardware Performance Monitoring Unit (PMU) directly. However, both VMware and KVM have added support for a virtual PMU within the last year. In a guest operating system running on a hypervisor that provides a virtual PMU, PAPI can be built without any special build procedure and accesses the virtual PMU through the same system calls as a hardware PMU.
To verify that performance event counts measured using PAPI are accurate, a series of tests was run to compare counts on bare metal to counts on ESXi and KVM. A suite of real-world applications was chosen to better represent typical PAPI use cases. In general, most users rely on PAPI preset events to measure particular performance events of interest. PAPI preset events are a collection of events that can be mapped to native events or derived from a few native events; they are meant to provide a consistent set of events across all types of processors.
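As an illustration of how a preset event is resolved at run time, the minimal sketch below (not taken from the tested benchmarks; the event choice and error handling are ours) queries whether a preset such as PAPI_TLB_IM can be mapped onto the native events of the current PMU, physical or virtual, using only standard PAPI calls.

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    char name[PAPI_MAX_STR_LEN];
    int event = PAPI_TLB_IM;   /* preset event: instruction TLB misses */

    /* Initialize the PAPI library. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return 1;
    }

    /* A preset is usable only if it can be mapped onto (or derived from)
     * native events of the underlying PMU -- hardware or virtual. */
    if (PAPI_query_event(event) == PAPI_OK &&
        PAPI_event_code_to_name(event, name) == PAPI_OK)
        printf("%s is available on this PMU\n", name);
    else
        printf("PAPI_TLB_IM is not available on this PMU\n");

    return 0;
}
```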
2. Testing Setup
2.1 Benchmark Suite
The tests are taken from the Mantevo suite of miniapplications [4]. The purpose of the Mantevo suite is to provide miniapplications that mimic the performance characteristics of real-world, large-scale applications. The applications listed below were used for testing:
- CloverLeaf – Hydrodynamics algorithm using a two-dimensional Eulerian formulation
- CoMD – An extensible molecular dynamics proxy application suite featuring the Lennard-Jones potential and the Embedded Atom Method potential
- HPCCG – Approximation for implicit finite element method
- MiniGHOST – Executes halo exchange pattern typical of explicit structured partial differential equations
- MiniXyce – Simple linear circuit simulator
2.2 Testing Environment
Bare Metal
- 16-core Intel Xeon CPU @ 2.9GHz
- 64GB memory
- Ubuntu Server 12.04
- Linux kernel 3.6

KVM
- QEMU version 1.2.0
- Guest VM – Ubuntu Server 12.04
- Guest VM – Linux kernel 3.6
- Guest VM – 16GB RAM

VMware
- ESXi 5.1
- Guest VM – Ubuntu Server 12.04
- Guest VM – Linux kernel 3.6
- Guest VM – 16GB RAM
2.3 Testing Procedure
For each platform (bare metal, KVM, VMware), each application was run while measuring one PAPI preset event. The source code of each application was modified to place measurements around its main computational work, following the pattern sketched below. Ten runs were performed for each event. Tests were run on a “quiet” system: no other users were logged on, and only minimal OS services were running at the same time. Tests were performed in succession with no reboots between runs.
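The measurement harness was conceptually similar to the following sketch. This is not the actual modified benchmark code: the kernel is a hypothetical stand-in loop, and PAPI_L1_ICM is only an example of the single preset event measured per run. All PAPI calls are standard low-level API functions.

```c
#include <stdio.h>
#include <papi.h>

/* Hypothetical stand-in for the main computational work of a
 * Mantevo miniapplication. */
static double compute_kernel(void)
{
    double sum = 0.0;
    for (long i = 0; i < 100000000L; i++)
        sum += (double)i * 1e-9;
    return sum;
}

int main(void)
{
    int event_set = PAPI_NULL;
    long long count = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_event(event_set, PAPI_L1_ICM) != PAPI_OK)
        return 1;

    PAPI_start(event_set);              /* start counting just before the kernel */
    double result = compute_kernel();
    PAPI_stop(event_set, &count);       /* stop and read the counter afterwards */

    printf("kernel result %g, PAPI_L1_ICM = %lld\n", result, count);
    return 0;
}
```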
3. Results
Results are presented as bar charts in Figures 1–5. Along the x-axis is a PAPI preset event; each preset event maps to one or more native events. Along the y-axis is the relative difference from bare-metal event counts: the mean count over 10 runs on the virtual machine minus the mean count over 10 runs on bare metal, divided by the bare-metal mean. A value of 0 therefore corresponds to an identical number of event counts on the virtual machine and on bare metal, and a value of 1 corresponds to a 100% difference between the virtual machine and bare metal.
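Written explicitly, the plotted quantity for each event is the relative difference (notation ours):

\[
\Delta = \frac{\overline{C}_{\mathrm{VM}} - \overline{C}_{\mathrm{bare}}}{\overline{C}_{\mathrm{bare}}},
\]

where \(\overline{C}_{\mathrm{VM}}\) and \(\overline{C}_{\mathrm{bare}}\) are the mean event counts over the 10 virtual-machine runs and the 10 bare-metal runs, respectively. \(\Delta = 0\) means identical counts, and \(\Delta = 1\) means 100% more events were counted on the virtual machine.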
4. Discussion
On inspection of the results, two main classes of events exhibit significant differences between performance on virtual machines and performance on bare metal. These two classes include instruction cache events such as PAPI_L1_ICM, and translation lookaside buffer (TLB) events such as PAPI_TLB_IM. These two classes will be examined more closely below. Another anomaly that has yet to be explained is KVM reporting nonzero counts of PAPI_VEC_DP and PAPI_VEC_SP (both related to vector operations), whereas both the bare metal tests and the VMware tests report 0 for each application tested.
4.1 Instruction Cache
From the results, we can see that instruction cache events are much more frequent on the virtual machines than on bare metal. These events include PAPI_L1_ICM, PAPI_L2_ICM, PAPI_L2_ICA, PAPI_L3_ICA, PAPI_L2_ICR, and PAPI_L3_ICR. The L2 events are directly related to PAPI_L1_ICM, and likewise the L3 events to PAPI_L2_ICM: the level 2 cache is only accessed in the event of a level 1 cache miss. The miss rate compounds total accesses at each level, and as a result the L3 instruction cache events differ the most from bare metal, with a relative difference of more than 2. It is therefore most pertinent to examine the PAPI_L1_ICM results, because all other instruction cache events are directly related to them. Figure 6 shows the results of the HPCCG tests on bare metal, KVM, and ESXi side by side. Both KVM and ESXi incur more misses than bare metal, but ESXi fares considerably better, with only 20% more misses than bare metal, whereas KVM exhibits nearly 40% more. Both have a larger standard deviation than bare metal, though not by a large margin.
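To make the compounding explicit (notation ours): since the L2 instruction cache is accessed only on an L1 instruction miss, and the L3 only on an L2 miss,

\[
A_{L2} \approx M_{L1}, \qquad A_{L3} \approx M_{L2} = A_{L2}\, r_{L2},
\]

so

\[
\frac{A_{L3}^{\mathrm{VM}}}{A_{L3}^{\mathrm{bare}}} \approx
\frac{M_{L1}^{\mathrm{VM}}}{M_{L1}^{\mathrm{bare}}} \cdot
\frac{r_{L2}^{\mathrm{VM}}}{r_{L2}^{\mathrm{bare}}},
\]

where \(A\) denotes instruction cache accesses, \(M\) misses, and \(r_{L2}\) the L2 instruction miss rate. Any inflation in L1 instruction misses is multiplied by any additional inflation in the L2 miss rate, which is why the largest differences appear at the L3 level.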

Figure 6. L1 Instruction Cache Miss Counts on ESXi and KVM Compared to Bare Metal for the HPCCG Benchmark
4.2 TLB
Data TLB misses are a huge issue for both KVM and ESXi. Both exhibit around 33 times more misses than runs on bare metal. There is little difference between the two virtualization platforms for this event.
Instruction TLB misses, shown in Figure 7, are also significantly more frequent on both virtualization platforms than on bare metal. However, ESXi seems to perform much better in this regard. Not only does ESXi incur 50% of the misses seen on KVM, but ESXi also has a much smaller standard deviation (even smaller than that of bare metal) compared to KVM’s unpredictable results.
4.3 MiniXyce
The results for MiniXyce warrant special consideration. The behavior of the application itself appears much less deterministic than that of the other applications in the suite. MiniXyce is a simple linear circuit simulator that attempts to emulate the vital computation and communication of XYCE, a circuit simulator designed to solve extremely large circuit problems on high-performance computers for weapons design [4].
Even so, the behavior of the application may have exposed shortcomings in the virtualization platforms. Not only are instruction cache misses and TLB misses much more frequent on ESXi and KVM, but data cache misses and branch mispredictions are as well. MiniXyce is the only application tested that displayed significantly different data cache behavior on the virtual machines than on bare metal. This may suggest that a class of applications with similar characteristics causes ESXi and KVM to perform worse than bare metal with regard to the data cache, which may be explained by the well-known overhead of virtualizing the Memory Management Unit (MMU) [2].
4.4 Possible Confounding Variable
VMware and KVM do not measure guest-level performance events in the same way. One of the challenges of counting events in a guest virtual machine is determining which events to attribute to the guest. This is particularly problematic when the hypervisor emulates privileged instructions. The two main counting techniques are referred to as domain switch and CPU switch. Domain switch is the more inclusive of the two, including all events that the hypervisor contributes when emulating guest I/O. VMware uses the term host mode for domain switch and guest mode for CPU switch.
Figure 8 shows an example timeline for virtual-machine scheduling events, taken from [1]. CPU switch does not include events contributed by the hypervisor: on VM exit, when the hypervisor must process an interrupt (such as a trapped instruction), the performance counter state is saved, and it is restored on VM entry. Descheduling causes the same save and subsequent restore under CPU switch. In domain switch, by contrast, the performance counter state is saved only when the guest is descheduled and restored when it is rescheduled, so events that occur between VM exit and VM entry are counted. Domain switching therefore gives a more accurate view of the effects the hypervisor has on execution, whereas CPU switching hides this effect. However, domain switching may also count events that occur due to another virtual machine.
KVM chooses either domain switching or CPU switching, both of which have the downsides discussed above. ESXi, by default, uses a hybrid approach. Events are separated into two groups: speculative and nonspeculative. Speculative events are those affected by run-to-run variation. For example, cache misses and branch mispredictions are both nondeterministic and may depend on previously executing code that affected the branch predictor or data cache. Both are affected by the virtual machine monitor as well as by every other virtual machine that executes. Any specific virtual machine's cache performance, or other speculative event counts, will therefore be affected by any other virtual machine that is scheduled: a virtual machine fills the data cache while it executes, and further loads to that data hit in the cache, but when execution returns to the virtual machine monitor or to another virtual machine, the cached data may be evicted by data loaded by the virtual machine monitor or the other virtual machine. Nonspeculative events, on the other hand, are consistent from run to run, such as total instructions retired. ESXi counts speculative events between VM exit and VM entry but does not count nonspeculative events there. This gives a more accurate representation of the effects of the hypervisor on speculative events while providing accurate results for nonspeculative events that should not be affected by the hypervisor [1].
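The three attribution policies can be summarized with the sketch below. This is purely illustrative pseudo-C, not KVM or ESXi source; all type and function names are hypothetical, and the real implementations manage hardware counter registers rather than calling stubs like these.

```c
/* Illustrative sketch of the three counting policies described above.
 * All names are hypothetical; no real hypervisor code is reproduced. */
enum policy { CPU_SWITCH, DOMAIN_SWITCH, HYBRID };

struct vcpu { int id; /* guest virtual CPU; real PMU state omitted */ };

static enum policy counting_policy = HYBRID;

/* Stubs standing in for saving/restoring guest counter state. */
static void pause_all(struct vcpu *v)             { (void)v; }
static void resume_all(struct vcpu *v)            { (void)v; }
static void pause_nonspeculative(struct vcpu *v)  { (void)v; }
static void resume_nonspeculative(struct vcpu *v) { (void)v; }

/* VM exit: the guest traps into the hypervisor, e.g. for I/O emulation. */
void on_vm_exit(struct vcpu *v)
{
    switch (counting_policy) {
    case CPU_SWITCH:
        pause_all(v);             /* hypervisor work is never attributed to the guest */
        break;
    case DOMAIN_SWITCH:
        break;                    /* keep counting: emulation on the guest's behalf is attributed to it */
    case HYBRID:
        pause_nonspeculative(v);  /* speculative events (cache, branch) keep counting */
        break;
    }
}

/* VM entry: execution returns to the same guest. */
void on_vm_entry(struct vcpu *v)
{
    if (counting_policy == CPU_SWITCH)
        resume_all(v);
    else if (counting_policy == HYBRID)
        resume_nonspeculative(v);
}

/* Descheduling: another virtual machine (or the host) gets the physical CPU.
 * Under every policy, counting stops here and resumes on rescheduling. */
void on_deschedule(struct vcpu *v)  { pause_all(v); }
void on_reschedule(struct vcpu *v)  { resume_all(v); }
```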
The difference in how KVM and VMware choose to measure events could potentially skew the results for certain events. It is safe to assume that for the majority of the events tested, the event counts can be trusted, as there is a less than 1% difference between either virtualization platform and bare metal. It has also been shown in [2] that instruction cache and TLB performance overhead is a known problem, so it is likely that PAPI is in fact reporting accurate counts for related events.
5. Related Work
PAPI relies on PMU virtualization at the guest level to provide performance measurements. [3] discusses guest-wide and system-wide profiling implementations for KVM as well as domain versus CPU switch. [1] expands on the domain-versus-CPU-switch distinction by providing an alternative hybrid approach that counts certain events as in domain switch and other events as in CPU switch.
A performance study that examined VMware vSphere® performance using virtualized performance counters as one of its data-collection tools is presented in [2], which also discusses TLB overhead, a result likewise observed in the present study. [5] provides a framework for virtualizing performance counters in Xen, another popular virtualization platform.
6. Conclusions
PAPI can be used within a KVM or VMware virtual machine to reliably measure guest-wide performance events. However, a few events, when measured, either reveal poor performance of the virtualization platform or are possibly over-attributed to the virtual machine. The significantly different counts for those few events were expected [2]. For the large majority of events, one should expect nearly identical results on a virtual machine and on bare metal when the system is lightly loaded. Future work examining event counts when a system is overcommitted with many concurrently running virtual machines is needed to determine whether the results provided by PAPI remain accurate in that situation.
Acknowledgments
This material is based on work supported by the National Science Foundation under Grant No. CCF-1117058 and by FutureGrid under Grant No. OCI-0910812. Additional support for this work was provided through a sponsored Academic Research Award from VMware, Inc.
References
[1] Serebrin, B. and Hecht, D. 2011. Virtualizing performance counters. In Proceedings of the 2011 International Conference on Parallel Processing (Euro-Par '11), Michael Alexander et al. (Eds.). Springer-Verlag, Berlin, Heidelberg, 223–233. http://dx.doi.org/10.1007/978-3-642-29737-3_26
[2] Buell, J., Hecht, D., Heo, J., Saladi, K., and Taheri, H. R. 2013. Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications. VMware Technical Journal, Summer 2013. http://labs.vmware.com/vmtj/methodology-for-performance-analysis-of-vmware-vsphere-under-tier-1-applications
[3] Du, J., Sehrawat, N., and Zwaenepoel, W. 2010. Performance Profiling in a Virtualized Environment. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010), Boston, MA. https://www.usenix.org/legacy/events/hotcloud10/tech/full_papers/Du.pdf
[4] Mantevo Project. http://www.mantevo.org
[5] Nikolaev, R. and Back, G. 2011. Perfctr-Xen: A Framework for Performance Counter Virtualization. In Virtual Execution Environments (VEE 2011), Newport Beach, CA. http://www.cse.iitb.ac.in/~puru/courses/spring12/cs695/downloads/perfctr.pdf
[6] PAPI. http://icl.cs.utk.edu/papi/