John Nelson
University of Tennessee
john.i.nelson@gmail.com
Abstract
Performance Application Programming Interface (PAPI) aims to provide a consistent interface for measuring performance events using the performance counter hardware available on the CPU as well as available software performance events and off-chip hardware. Without PAPI, a user may be forced to search through specific processor documentation to discover the name of processor performance events. These names can change from model to model and vendor to vendor. PAPI simplifies this process by providing a consistent interface and a set of processor-agnostic preset events. Software engineers can use data collected through source-code instrumentation using the PAPI interface to examine the relation between software performance and performance events. PAPI can also be used within many high-level performance-monitoring utilities such as TAU, Vampir, and Score-P.
VMware® ESXi™ and KVM have both added support within the last year for virtualizing performance counters. This article compares performance measurements of five real-world applications from the Mantevo benchmark suite running in a VMware virtual machine, in a KVM virtual machine, and on bare metal. These results show that PAPI provides accurate performance counts in a virtual-machine environment.
1. Introduction
Over the last ten years, virtualization has become far more popular, driven by fast, inexpensive processors. Virtualization provides many benefits that make it an appealing test environment for high-performance computing. The ability to encapsulate a configuration is a major motivation for performance testing on virtual machines: virtual machines are portable across heterogeneous systems while providing an identical configuration within the guest operating system.
One consequence of virtualization is an additional hardware abstraction layer, which prevents PAPI from reading the hardware Performance Monitoring Unit (PMU) directly. However, both VMware and KVM have added support for a virtual PMU within the last year. In a guest operating system running on a hypervisor that provides a virtual PMU, PAPI can be built without any special build procedure and accesses the virtual PMU through the same system calls as a hardware PMU.
To verify that performance event counts measured using PAPI are accurate, a series of tests was run to compare counts on bare metal to counts on ESXi and KVM. A suite of real-world applications was chosen to better represent typical PAPI use cases. In general, most users rely on PAPI preset events to measure particular performance events of interest. PAPI preset events are a collection of events that can be mapped to native events or derived from a few native events; they are meant to provide a consistent set of events across all types of processors.
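As an illustration of how a preset event is resolved at run time, the minimal sketch below (not taken from the tested benchmarks; the event choice and error handling are ours) queries whether a preset such as PAPI_TLB_IM can be mapped onto the native events of the current PMU, physical or virtual, using only standard PAPI calls.

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    char name[PAPI_MAX_STR_LEN];
    int event = PAPI_TLB_IM;   /* preset event: instruction TLB misses */

    /* Initialize the PAPI library. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return 1;
    }

    /* A preset is usable only if it can be mapped onto (or derived from)
     * native events of the underlying PMU -- hardware or virtual. */
    if (PAPI_query_event(event) == PAPI_OK &&
        PAPI_event_code_to_name(event, name) == PAPI_OK)
        printf("%s is available on this PMU\n", name);
    else
        printf("PAPI_TLB_IM is not available on this PMU\n");

    return 0;
}
```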
2. Testing Setup
2.1 Benchmark Suite
The tests are taken from the Mantevo suite of miniapplications [4]. The purpose of the Mantevo suite is to provide miniapplications that mimic the performance characteristics of real-world, large-scale applications. The applications listed below were used for testing:
- CloverLeaf – Hydrodynamics algorithm using a two-dimensional Eulerian formulation
- CoMD – An extensible molecular dynamics proxy application suite featuring the Lennard-Jones potential and the Embedded Atom Method potential
- HPCCG – Approximation for implicit finite element method
- MiniGHOST – Executes halo exchange pattern typical of explicit structured partial differential equations
- MiniXyce – Simple linear circuit simulator
2.2 Testing Environment
Bare Metal
- 16-core Intel Xeon CPU @ 2.9GHz
- 64GB memory
- Ubuntu Server 12.04
- Linux kernel 3.6

KVM
- QEMU version 1.2.0
- Guest VM – Ubuntu Server 12.04
- Guest VM – Linux kernel 3.6
- Guest VM – 16GB RAM

VMware
- ESXi 5.1
- Guest VM – Ubuntu Server 12.04
- Guest VM – Linux kernel 3.6
- Guest VM – 16GB RAM
2.3 Testing Procedure
For each platform (bare metal, KVM, VMware), each application was run while measuring one PAPI preset event. The source code of each application was modified to place measurements around its main computational work, following the pattern sketched below. Ten runs were performed for each event. Tests were run on a “quiet” system: no other users were logged on, and only minimal OS services were running at the same time. Tests were performed in succession with no reboots between runs.
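The measurement harness was conceptually similar to the following sketch. This is not the actual modified benchmark code: the kernel is a hypothetical stand-in loop, and PAPI_L1_ICM is only an example of the single preset event measured per run. All PAPI calls are standard low-level API functions.

```c
#include <stdio.h>
#include <papi.h>

/* Hypothetical stand-in for the main computational work of a
 * Mantevo miniapplication. */
static double compute_kernel(void)
{
    double sum = 0.0;
    for (long i = 0; i < 100000000L; i++)
        sum += (double)i * 1e-9;
    return sum;
}

int main(void)
{
    int event_set = PAPI_NULL;
    long long count = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    if (PAPI_create_eventset(&event_set) != PAPI_OK ||
        PAPI_add_event(event_set, PAPI_L1_ICM) != PAPI_OK)
        return 1;

    PAPI_start(event_set);              /* start counting just before the kernel */
    double result = compute_kernel();
    PAPI_stop(event_set, &count);       /* stop and read the counter afterwards */

    printf("kernel result %g, PAPI_L1_ICM = %lld\n", result, count);
    return 0;
}
```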
3. Results
Results are presented as bar charts in Figures 1–5. Along the x-axis is a PAPI preset event; each preset event maps to one or more native events. Along the y-axis is the relative difference from bare-metal event counts: the mean count over 10 runs on the virtual machine minus the mean count over 10 runs on bare metal, divided by the bare-metal mean. A value of 0 therefore corresponds to an identical number of event counts on the virtual machine and on bare metal, and a value of 1 corresponds to a 100% difference between the virtual machine and bare metal.
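Written explicitly, the plotted quantity for each event is the relative difference (notation ours):

\[
\Delta = \frac{\overline{C}_{\mathrm{VM}} - \overline{C}_{\mathrm{bare}}}{\overline{C}_{\mathrm{bare}}},
\]

where \(\overline{C}_{\mathrm{VM}}\) and \(\overline{C}_{\mathrm{bare}}\) are the mean event counts over the 10 virtual-machine runs and the 10 bare-metal runs, respectively. \(\Delta = 0\) means identical counts, and \(\Delta = 1\) means 100% more events were counted on the virtual machine.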
4. Discussion
On inspection of the results, two main classes of events exhibit significant differences between performance on virtual machines and performance on bare metal. These two classes include instruction cache events such as PAPI_L1_ICM, and translation lookaside buffer (TLB) events such as PAPI_TLB_IM. These two classes will be examined more closely below. Another anomaly that has yet to be explained is KVM reporting nonzero counts of PAPI_VEC_DP and PAPI_VEC_SP (both related to vector operations), whereas both the bare metal tests and the VMware tests report 0 for each application tested.
4.1 Instruction Cache
From the results, we can see that instruction cache events are much more frequent on the virtual machines than on bare metal. These events include PAPI_L1_ICM, PAPI_L2_ICM, PAPI_L2_ICA, PAPI_L3_ICA, PAPI_L2_ICR, and PAPI_L3_ICR. The L2 events are directly related to PAPI_L1_ICM, and likewise the L3 events to PAPI_L2_ICM: the level 2 cache is only accessed in the event of a level 1 cache miss. The miss rate compounds total accesses at each level, and as a result the L3 instruction cache events differ the most from bare metal, with a relative difference of more than 2. It is therefore most pertinent to examine the PAPI_L1_ICM results, because all other instruction cache events are directly related to them. Figure 6 shows the results of the HPCCG tests on bare metal, KVM, and ESXi side by side. Both KVM and ESXi incur more misses than bare metal, but ESXi fares considerably better, with only 20% more misses than bare metal, whereas KVM exhibits nearly 40% more. Both have a larger standard deviation than bare metal, though not by a large margin.
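To make the compounding explicit (notation ours): since the L2 instruction cache is accessed only on an L1 instruction miss, and the L3 only on an L2 miss,

\[
A_{L2} \approx M_{L1}, \qquad A_{L3} \approx M_{L2} = A_{L2}\, r_{L2},
\]

so

\[
\frac{A_{L3}^{\mathrm{VM}}}{A_{L3}^{\mathrm{bare}}} \approx
\frac{M_{L1}^{\mathrm{VM}}}{M_{L1}^{\mathrm{bare}}} \cdot
\frac{r_{L2}^{\mathrm{VM}}}{r_{L2}^{\mathrm{bare}}},
\]

where \(A\) denotes instruction cache accesses, \(M\) misses, and \(r_{L2}\) the L2 instruction miss rate. Any inflation in L1 instruction misses is multiplied by any additional inflation in the L2 miss rate, which is why the largest differences appear at the L3 level.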

Figure 6. L1 Instruction Cache Miss Counts on ESXi and KVM Compared to Bare Metal for the HPCCG Benchmark
4.2 TLB
Data TLB misses are a huge issue for both KVM and ESXi. Both exhibit around 33 times more misses than runs on bare metal. There is little difference between the two virtualization platforms for this event.
Instruction TLB misses, shown in Figure 7, are also significantly more frequent on both virtualization platforms than on bare metal. However, ESXi seems to perform much better in this regard. Not only does ESXi incur 50% of the misses seen on KVM, but ESXi also has a much smaller standard deviation (even smaller than that of bare metal) compared to KVM’s unpredictable results.
4.3 MiniXyce
The results for MiniXyce warrant special consideration. The behavior of the application itself appears much less deterministic than that of the other applications in the suite. MiniXyce is a simple linear circuit simulator that attempts to emulate the vital computation and communication of XYCE, a circuit simulator designed to solve extremely large circuit problems on high-performance computers for weapons design [4].
Even so, the behavior of the application may have exposed shortcomings in the virtualization platforms. Not only are instruction cache misses and TLB misses much more frequent on ESXi and KVM, but data cache misses and branch mispredictions are as well. MiniXyce is the only application tested that displayed significantly different data cache behavior on the virtual machines than on bare metal. This may suggest that a class of applications with similar characteristics causes ESXi and KVM to perform worse than bare metal with regard to the data cache, which may be explained by the well-known overhead of virtualizing the Memory Management Unit (MMU) [2].
4.4 Possible Confounding Variable
VMware and KVM do not measure guest-level performance events in the same way. One of the challenges of counting events in a guest virtual machine is determining which events to attribute to the guest. This is particularly problematic when the hypervisor emulates privileged instructions. The two main counting techniques are referred to as domain switch and CPU switch. Domain switch is the more inclusive of the two, including all events that the hypervisor contributes when emulating guest I/O. VMware uses the term host mode for domain switch and guest mode for CPU switch.
Figure 8 shows an example timeline for virtual-machine scheduling events, taken from [1]. CPU switch does not include events contributed by the hypervisor: on VM exit, when the hypervisor must process an interrupt (such as a trapped instruction), the performance counter state is saved, and it is restored on VM entry. Descheduling causes the same save and subsequent restore under CPU switch. In domain switch, by contrast, the performance counter state is saved only when the guest is descheduled and restored when it is rescheduled, so events that occur between VM exit and VM entry are counted. Domain switching therefore gives a more accurate view of the effects the hypervisor has on execution, whereas CPU switching hides this effect. However, domain switching may also count events that occur due to another virtual machine.
KVM chooses either domain switching or CPU switching, both of which have the downsides discussed above. ESXi, by default, uses a hybrid approach. Events are separated into two groups: speculative and nonspeculative. Speculative events are those affected by run-to-run variation. For example, cache misses and branch mispredictions are both nondeterministic and may depend on previously executing code that affected the branch predictor or data cache. Both are affected by the virtual machine monitor as well as by every other virtual machine that executes. Any specific virtual machine's cache performance, or other speculative event counts, will therefore be affected by any other virtual machine that is scheduled: a virtual machine fills the data cache while it executes, and further loads to that data hit in the cache, but when execution returns to the virtual machine monitor or to another virtual machine, the cached data may be evicted by data loaded by the virtual machine monitor or the other virtual machine. Nonspeculative events, on the other hand, are consistent from run to run, such as total instructions retired. ESXi counts speculative events between VM exit and VM entry but does not count nonspeculative events there. This gives a more accurate representation of the effects of the hypervisor on speculative events while providing accurate results for nonspeculative events that should not be affected by the hypervisor [1].
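The three attribution policies can be summarized with the sketch below. This is purely illustrative pseudo-C, not KVM or ESXi source; all type and function names are hypothetical, and the real implementations manage hardware counter registers rather than calling stubs like these.

```c
/* Illustrative sketch of the three counting policies described above.
 * All names are hypothetical; no real hypervisor code is reproduced. */
enum policy { CPU_SWITCH, DOMAIN_SWITCH, HYBRID };

struct vcpu { int id; /* guest virtual CPU; real PMU state omitted */ };

static enum policy counting_policy = HYBRID;

/* Stubs standing in for saving/restoring guest counter state. */
static void pause_all(struct vcpu *v)             { (void)v; }
static void resume_all(struct vcpu *v)            { (void)v; }
static void pause_nonspeculative(struct vcpu *v)  { (void)v; }
static void resume_nonspeculative(struct vcpu *v) { (void)v; }

/* VM exit: the guest traps into the hypervisor, e.g. for I/O emulation. */
void on_vm_exit(struct vcpu *v)
{
    switch (counting_policy) {
    case CPU_SWITCH:
        pause_all(v);             /* hypervisor work is never attributed to the guest */
        break;
    case DOMAIN_SWITCH:
        break;                    /* keep counting: emulation on the guest's behalf is attributed to it */
    case HYBRID:
        pause_nonspeculative(v);  /* speculative events (cache, branch) keep counting */
        break;
    }
}

/* VM entry: execution returns to the same guest. */
void on_vm_entry(struct vcpu *v)
{
    if (counting_policy == CPU_SWITCH)
        resume_all(v);
    else if (counting_policy == HYBRID)
        resume_nonspeculative(v);
}

/* Descheduling: another virtual machine (or the host) gets the physical CPU.
 * Under every policy, counting stops here and resumes on rescheduling. */
void on_deschedule(struct vcpu *v)  { pause_all(v); }
void on_reschedule(struct vcpu *v)  { resume_all(v); }
```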
The difference in how KVM and VMware choose to measure events could potentially skew the results for certain events. It is safe to assume that for the majority of the events tested, the event counts can be trusted, as there is a less than 1% difference between either virtualization platform and bare metal. It has also been shown in [2] that instruction cache and TLB performance overhead is a known problem, so it is likely that PAPI is in fact reporting accurate counts for related events.
5. Related Work
PAPI relies on PMU virtualization at the guest level to provide performance measurements. [3] discusses guest-wide and system-wide profiling implementations for KVM as well as domain versus CPU switch. [1] expands on the domain-versus-CPU-switch distinction by providing an alternative hybrid approach that counts certain events as in domain switch and other events as in CPU switch.
A performance study that examined VMware vSphere® performance using virtualized performance counters as one of its data-collection tools is presented in [2], which also discusses TLB overhead, a result likewise observed in the present study. [5] provides a framework for virtualizing performance counters in Xen, another popular virtualization platform.
6. Conclusions
PAPI can be used within a KVM or VMware virtual machine to reliably measure guest-wide performance events. However, a few events, when measured, either reveal poor performance of the virtualization platform or are possibly over-attributed to the virtual machine. The significantly different counts for those few events were expected [2]. For the large majority of events, one should expect nearly identical results on a virtual machine and on bare metal when the system is lightly loaded. Future work examining event counts when a system is overcommitted with many concurrently running virtual machines is needed to determine whether the results provided by PAPI remain accurate in that situation.
Acknowledgments
This material is based on work supported by the National Science Foundation under Grant No. CCF-1117058 and by FutureGrid under Grant No. OCI-0910812. Additional support for this work was provided through a sponsored Academic Research Award from VMware, Inc.
References
[1] Serebrin, B. and Hecht, D. 2011. Virtualizing performance counters. In Proceedings of the 2011 International Conference on Parallel Processing (Euro-Par '11), Michael Alexander et al. (Eds.). Springer-Verlag, Berlin, Heidelberg, 223–233. http://dx.doi.org/10.1007/978-3-642-29737-3_26
[2] Buell, J., Hecht, D., Heo, J., Saladi, K., and Taheri, H. R. 2013. Methodology for Performance Analysis of VMware vSphere under Tier-1 Applications. VMware Technical Journal, Summer 2013. http://labs.vmware.com/vmtj/methodology-for-performance-analysis-of-vmware-vsphere-under-tier-1-applications
[3] Du, J., Sehrawat, N., and Zwaenepoel, W. 2010. Performance Profiling in a Virtualized Environment. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010), Boston, MA. https://www.usenix.org/legacy/events/hotcloud10/tech/full_papers/Du.pdf
[4] Mantevo Project. http://www.mantevo.org
[5] Nikolaev, R. and Back, G. 2011. Perfctr-Xen: A Framework for Performance Counter Virtualization. In Virtual Execution Environments (VEE 2011), Newport Beach, CA. http://www.cse.iitb.ac.in/~puru/courses/spring12/cs695/downloads/perfctr.pdf
[6] PAPI. http://icl.cs.utk.edu/papi/