Publications

External links:

2024

Accelerating Shared Library Execution in a DBT

In Proceedings of the 25th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems

User-mode Dynamic Binary Translation (DBT) has recently received renewed interest, not least due to Apple’s transition towards the Arm ISA, supported by a DBT compatibility layer for x86 legacy applications. While receiving praise for its performance, execution of legacy applications through Apple’s Rosetta 2 technology still incurs a performance penalty when compared to direct host execution. A particular limitation of Rosetta 2 is that code is either executed exclusively as native Arm code, or as translated Arm code. In particular, mixed mode execution of native Arm code and translated code is not possible. This is a missed opportunity, especially in the case of shared libraries where both optimized x86 and Arm versions of the same library are available. In this paper, we develop mixed mode execution capabilities for shared libraries in a DBT system, eliminating the need to translate code where a highly optimised native version already exists. Our novel execution model intercepts calls to shared library functions in the DBT system and automatically redirects them to their faster host counterparts, making better use of the underlying host ISA. To ease the burden for the developer, we make use of an Interface Description Language (IDL) to capture library function signatures, from which relevant stubs and data marshalling code are generated automatically. We have implemented our novel mixed mode execution approach in the open-source QEMU DBT system, and demonstrate both ease of use and performance benefits for three popular libraries (standard C Math library, SQLite, and OpenSSL). Our evaluation confirms that with minimal developer effort, accelerated host execution of shared library functionality results in speedups between 2.7x and 6.3x on average, and up to 28x for x86 legacy applications on an Arm host system. ...

2023

Risotto: A Dynamic Binary Translator for Weak Memory Model Architectures

In proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Dynamic Binary Translation (DBT) is a powerful approach to support cross-architecture emulation of unmodified binaries. However, DBT systems face correctness and performance challenges, when emulating concurrent binaries from strong to weak memory consistency architectures. As a matter of fact, we report several translation errors in Qemu, when emulating x86 binaries on Arm hosts. To address these challenges, we propose an end-to-end approach that provides correct and efficient emulation for weak memory model architectures. Our contributions are twofold: First, we formalize Qemu’s intermediate representation’s memory model, and use it to propose formally verified mapping schemes to bridge the strong-on-weak memory consistency mismatch. Second, we implement these verified mappings in Risotto, a Qemu-based DBT system that optimizes memory fence placement while ensuring correctness. Risotto further improves performance via cross-architecture dynamic linking of native shared libraries and faster yet correct translation of compare-and-swap operations. ...

2022

Leaps and bounds: Analysing WebAssembly’s performance with a focus on bounds checking

In Proceedings of the 2022 IEEE International Symposium on Workload Characterization

WebAssembly is gaining more and more popularity, finding applications beyond the Web browser – for which it was initially designed for. However, its performance, which developers aimed at being comparable to native, requires further tuning and hasn’t being studied extensively to pinpoint the cause of overheads. This paper identifies that WebAssembly unique safety mechanisms, one of the major ones being bounds-checked memory accesses, may introduce up to 650% overhead. Therefore, we evaluate four popular WebAssembly runtimes against native compiled code. These runtimes have been enriched with modern bounds checking mechanisms and run on three different ISA CPUs, including x86-64, Armv8 and RISC-V RV64GC. We show that performance-oriented runtimes are able to achieve performance within 20% of the native performance on x86_64 platforms, 35% for Armv8. On RISC-V the V8 runtime can achieve a 17% overhead over native code for simple numeric kernels. For simple numerical kernels, we have shown that there is no significant difference in the WebAssembly performance compared to native code across different ISAs. ...

Lasagne: A Static Binary Translator for Weak Memory Model Architectures

In proceedings of the 43rd ACM SIGPLAN Conference on Programming Language Design and Implementation

The emergence of new architectures create a recurring challenge to ensure that existing programs still work on them. Manually porting legacy code is often impractical. Static binary translation (SBT) is a process where a program’s binary is automatically translated from one architecture to another, while preserving their original semantics. However, these SBT tools have limited support to various advanced architectural features. Importantly, they are currently unable to translate concurrent binaries. The main challenge arises from the mismatches of the memory consistency model specified by the different architectures, especially when porting existing binaries to a weak memory model architecture. ...

2020

Fast and correct load-link/store-conditional instruction handling in DBT systems

In proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Dynamic Binary Translation (DBT) requires the implementation of load-link/store-conditional (LL/SC) primitives for guest systems that rely on this form of synchronization. When targeting e.g. x86 host systems, LL/SC guest instructions are typically emulated using atomic Compare-and-Swap (CAS) instructions on the host. Whilst this direct mapping is efficient, this approach is problematic due to subtle differences between LL/SC and CAS semantics. In this paper, we demonstrate that this is a real problem, and we provide code examples that fail to execute correctly on QEMU and a commercial DBT system, which both use the CAS approach to LL/SC emulation. We then develop two novel and provably correct LL/SC emulation schemes: (1) A purely software based scheme, which uses the DBT system’s page translation cache for correctly selecting between fast, but unsynchronized, and slow, but fully synchronized memory accesses, and (2) a hardware accelerated scheme that leverages hardware transactional memory (HTM) provided by the host. We have implemented these two schemes in the Synopsys DesignWare® ARC® nSIM DBT system, and we evaluate our implementations against full applications, and targeted micro-benchmarks. We demonstrate that our novel schemes are not only correct, but also deliver competitive performance on-par or better than the widely used, but broken CAS scheme. ...

A retargetable system-level DBT hypervisor

In ACM Transactions on Computer Systems

System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different to that of the host machine. Due to their performance-critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new, or extending an existing architecture are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-in-time ( JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de-facto standard Qemu DBT system. Our system delivers an average speedup of 2.21× over Qemu across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance. ...

2019

A retargetable system-level DBT hypervisor Best Paper

In Proceedings of the 2019 USENIX Annual Technical Conference

System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different to that of the host machine. Due to their performance critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new, or extending an existing architecture are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de-facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance. ...

Uncovering security vulnerabilities in the Belkin WeMo home automation ecosystem

In proceedings of the IEEE International Conference on Pervasive Computing and Communications Workshops

The popularity of smart home devices is growing as consumers begin to recognize their potential to improve the quality of domestic life. At the same time, serious vulnerabilities have been revealed over recent years, which threaten user privacy and can cause financial losses. The lack of appropriate security protections in these devices is thus of increasing concern for the Internet of Things (IoT) industry, yet manufacturers’ ongoing efforts remain superficial. Hence, users continue to be exposed to serious weaknesses. In this paper, we demonstrate that this is also the case of home automation applications, as we uncover a set of previously undocumented security issues in the Belkin WeMo ecosystems. In particular, we first reverse engineer both the mobile app that enables users to control smart appliances and the communication logic implemented by WeMo devices. This enables us to compromise the passphrase guarding the communication over the local wireless network, opening the possibility of eavesdropping on user traffic. We further reveal how an attacker can present a fake device to a WeMo user, through which cross-site scripting can be exploited in order to mislead the user into disclosing private information. Lastly, we provide a set of security guidelines that can be followed to remedy the vulnerabilities identified. ...

Full-system simulation of mobile CPU/GPU platforms

In proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software

Graphics Processing Units (GPUs) critically rely on a complex system software stack comprising kernel- and userspace drivers and Just-in-time (JIT) compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU/GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. In this paper we develop a full-system system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali-G71 GPU powered device. We validate our simulator against a hardware implementation and Arm’s stand-alone GPU simulator, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework by optimizing an advanced Computer Vision application using simulated statistics unavailable with other simulation approaches or physical GPU implementations. We demonstrate that performance optimizations for desktop GPUs trigger bottlenecks on mobile GPUs, and show the importance of efficient memory use. ...

Mitigating JIT compilation latency in Virtual Execution Environments

In proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments

Many Virtual Execution Environments (VEEs) rely on Justin-time (JIT) compilation technology for code generation at runtime, e.g. in Dynamic Binary Translation (DBT) systems or language Virtual Machines (VMs). While JIT compilation improves native execution performance as opposed to e.g. interpretive execution, the JIT compilation process itself introduces latency. In fact, for highly optimizing JIT compilers or compilers not specifically designed for JIT compilation, e.g. LLVM, this latency can cause a substantial overhead. While existing work has introduced asynchronously decoupled JIT compilation task farms to hide this JIT compilation latency, we show that this on its own is not sufficient to mitigate the impact of JIT compilation latency on overall performance. In this paper, we introduce a novel JIT compilation scheduling policy, which performs continuous low-cost profiling of code regions already dispatched for JIT compilation, right up to the point where compilation commences. We have integrated our novel JIT compilation scheduling approach into a commercial LLVM-based DBT system and demonstrate speedups of 1.32× on average, and up to 2.31×, over its state-of-the-art concurrent task-farm based JIT compilation scheme across the SPEC CPU2006 and BioPerf benchmark suites. ...

Low-cost deterministic C++ exceptions for embedded systems

In proceedings of the 28th International Conference on Compiler Construction (CC ’19)

The C++ programming language offers a strong exception mechanism for error handling at the language level, improving code readability, safety, and maintainability. However, current C++ implementations are targeted at general-purpose systems, often sacrificing code size, memory usage, and resource determinism for the sake of performance. This makes C++ exceptions a particularly undesirable choice for embedded applications where code size and resource determinism are often paramount. Consequently, embedded coding guidelines either forbid the use of C++ exceptions, or embedded C++ tool chains omit exception handling altogether. In this paper, we develop a novel implementation of C++ exceptions that eliminates these issues, and enables their use for embedded systems. We combine existing stack unwinding techniques with a new approach to memory management and run-time type information (RTTI). In doing so we create a compliant C++ exception handling implementation, providing bounded runtime and memory usage, while reducing code size requirements by up to 82%, and incurring only a minimal runtime overhead for the common case of no exceptions. ...

2018

Navigating the landscape for real-time localization and mapping for robotics and virtual and augmented reality

In proceedings of the IEEE

Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context. ...

Anatomy of a vulnerable fitness tracking system: dissecting the fitbit cloud, app, and firmware

In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies

Fitbit fitness trackers record sensitive personal information, including daily step counts, heart rate profiles, and locations visited. By design, these devices gather and upload activity data to a cloud service, which provides aggregate statistics to mobile app users. The same principles govern numerous other Internet-of-Things (IoT) services that target different applications. As a market leader, Fitbit has developed perhaps the most secure wearables architecture that guards communication with end-to-end encryption. In this paper, we analyze the complete Fitbit ecosystem and, despite the brand’s continuous efforts to harden its products, we demonstrate a series of vulnerabilities with potentially severe implications to user privacy and device security. We employ a repertoire of techniques encompassing protocol analysis, software decompiling, and both static and dynamic embedded code analysis, to reverse engineer previously undocumented communication semantics, the official smartphone app, and the tracker firmware. Through this interplay and in-depth analysis, we reveal how attackers can exploit the Fitbit protocol to extract private information from victims without leaving a trace, and wirelessly flash malware without user consent. We demonstrate that users can tamper with both the app and firmware to selfishly manipulate records or circumvent Fitbit’s walled garden business model, making the case for an independent, user-controlled, and more secure ecosystem. Finally, based on the insights gained, we make specific design recommendations that not only can mitigate the identified vulnerabilities, but are also broadly applicable to securing future wearable system architectures. ...

2017

Breaking fitness records without moving: reverse engineering and spoofing Fitbit

In proceedings of the 20th International Symposium on Research in Attacks, Intrusions and Defenses

Tens of millions of wearable fitness trackers are shipped yearly to consumers who routinely collect information about their exercising patterns. Smartphones push this health-related data to vendors’ cloud platforms, enabling users to analyze summary statistics on-line and adjust their habits. Third-parties including health insurance providers now offer discounts and financial rewards in exchange for such private information and evidence of healthy lifestyles. Given the associated monetary value, the authenticity and correctness of the activity data collected becomes imperative. In this paper, we provide an in-depth security analysis of the operation of fitness trackers commercialized by Fitbit, the wearables market leader. We reveal an intricate security through obscurity approach implemented by the user activity synchronization protocol running on these devices. Although non-trivial to interpret, we reverse engineer the message semantics, demonstrate how falsified user activity reports can be injected, and argue that based on our discoveries, such attacks can be performed at scale to obtain financial gains. We further document a hardware attack vector that enables circumvention of the end-to-end protocol encryption present in the latest Fitbit firmware, leading to the spoofing of valid encrypted fitness data. Finally, we give guidelines for avoiding similar vulnerabilities in future system designs. ...

SimBench: a portable benchmarking methodology for full-system simulators

In proceedings of the 2017 IEEE International Symposium on Performance Analysis of Systems and Software

Full-system simulators are increasingly finding their way into the consumer space for the purposes of backwards compatibility and hardware emulation (e.g. for games consoles). For such compute-intensive applications simulation performance is paramount. In this paper we argue that existing benchmarksuites such as SPEC CPU2006, originally designed for architecture and compiler performance evaluation, are not well suited for the identification of performance bottlenecks in full-system simulators. While their large, complex workloads provide an indication as to the performance of the simulator on ‘real-world’ workloads, this does not give any indication of why a particular simulator might run an application faster or slower than another. In this paper we present SimBench, an extensive suite of targeted micro-benchmarks designed to run bare-metal on a fullsystem simulator. SimBench exercises dynamic binary translation (DBT) performance, interrupt and exception handling, memoryaccess performance, I/O and other performance-sensitive areas. SimBench is cross-platform benchmarking framework and can be retargeted to new architectures with minimal effort. For several simulators, including QEMU, Gem5 and SimIt-ARM, and targeting ARM and Intel x86 architectures, we demonstrate that SimBench is capable of accurately pinpointing and explaining real-world performance anomalies, which are largely obfuscated by existing application-oriented benchmarks. ...

2016

Hardware accelerated cross-architecture full-system virtualization

In ACM Transactions on Architecture and Code Optimization (TACO)

Hardware virtualization solutions provide users with benefits ranging from application isolation through server consolidation to improved disaster recovery and faster server provisioning. While hardware assistance for virtualization is supported by all major processor architectures, including Intel, ARM, PowerPC & MIPS, these extensions are targeted at virtualization of the same architecture, e.g. an x86 guest on an x86 host system. Existing techniques for cross-architecture virtualization, e.g. an ARM guest on an x86 host, still incur a substantial overhead for CPU, memory and I/O virtualization due to the necessity for software emulation of these mismatched system components. In this article we present a new hardware accelerated hypervisor called CAPTIVE, employing a range of novel techniques, which exploit existing hardware virtualization extensions for improving the performance of full-system cross-platform virtualization. We illustrate how (1) guest MMU events and operations can be mapped onto host memory virtualization extensions, eliminating the need for costly software MMU emulation, (2) a block-based DBT engine inside the virtual machine can improve CPU virtualization performance, (3) memory mapped guest I/O can be efficiently translated to fast I/O specific calls to emulated devices, and (4) the cost for asynchronous guest interrupts can be reduced. For an ARM-based Linux guest system running on an x86 host with Intel VT support we demonstrate application performance levels, based on SPEC CPU2006 benchmarks, of up to 5.88x over state-of-the-art QEMU and 2.5x on average, achieving a guest dynamic instruction throughput of up to 1280 MIPS and 915.52 MIPS, on average. ...

Efficient Cross-architecture Hardware Virtualisation

PhD - University of Edinburgh

Hardware virtualisation is the provision of an isolated virtual environment that represents real physical hardware. It enables operating systems, or other system-level software (the guest), to run unmodified in a “container” (the virtual machine) that is isolated from the real machine (the host). There are many use-cases for hardware virtualisation that span a wide-range of end-users. For example, home-users wanting to run multiple operating systems side-by-side (such as running a Windows® operating system inside an OS X environment) will use virtualisation to accomplish this. In research and development environments, developers building experimental software and hardware want to prototype their designs quickly, and so will virtualise the platform they are targeting to isolate it from their development workstation. Large-scale computing environments employ virtualisation to consolidate hardware, enforce application isolation, migrate existing servers or provision new servers. ...

Efficient asynchronous interrupt handling in a full-system instruction set simulator

In proceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems

Instruction set simulators (ISS) have many uses in embedded software and hardware development and are typically based on dynamic binary translation (DBT), where frequently executed regions of guest instructions are compiled into host instructions using a just-in-time (JIT) compiler. Full-system simulation, which necessitates handling of asynchronous interrupts from e.g. timers and I/O devices, complicates matters as control flow is interrupted unpredictably and diverted from the current region of code. In this paper we present a novel scheme for handling of asynchronous interrupts, which integrates seamlessly into a region-based dynamic binary translator. We first show that our scheme is correct, i.e. interrupt handling is not deferred indefinitely, even in the presence of code regions comprising control flow loops. We demonstrate that our new interrupt handling scheme is efficient as we minimise the number of inserted checks. Interrupt handlers are also presented to the JIT compiler and compiled to native code, further enhancing the performance of our system. We have evaluated our scheme in an ARM simulator using a region-based JIT compilation strategy. We demonstrate that our solution reduces the number of dynamic interrupt checks by 73%, reduces interrupt service latency by 26% and improves throughput of an I/O bound workload by 7%, over traditional per-block schemes. ...

2015

Efficient Dual-ISA Support in a Retargetable, Asynchronous Dynamic Binary Translator

In proceedings of the 2015 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)

Dynamic Binary Translation (DBT) allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Some modern DBT systems decouple their main execution loop from the built-inJust-In-Time (JIT) compiler, i.e. the JIT compiler can operate asynchronously in a different thread without blocking program execution. However, this creates a problem for target architectures with dual-ISA support such as ARM/THUMB, where the ISA of the currently executed instruction stream may be different to the one processed by the JIT compiler due to their decoupled operation and dynamic mode changes. In this paper we present a new approach for dual-ISA support in such an asynchronous DBT system, which integrates ISA mode tracking and hot-swapping of software instruction decoders. We demonstrate how this can be achieved in a retargetable DBT system, where the target ISA is not hard-coded, but a processor-specific module is generated from a high-level architecture description. We have implemented ARM V5T support in our DBT and demonstrate execution rates of up to 1148 MIPS for the SPEC CPU 2006 benchmarks compiled for ARM/THUMB, achieving on average 192%, and up to 323%, of the speed of QEMU, which has been subject to intensive manual performance tuning and requires significant low-level effort for retargeting. ...

2014

Automated ISA branch coverage analysis and test case generation for retargetable instruction set simulators

In proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems

Processor design tools integrate in their workflows generators for instruction set simulators (ISS) from architecture descriptions. However, it is difficult to validate the correctness of these simu-lators. ISA coverage analysis is insufficient to isolate modelling faults, which might only be exposed in corner cases. We present a novel ISA branch coverage analysis, which considers every possible execution path within an instruction and, on demand, generates new test cases to cover the missing paths. We have applied this analysis to industry standard EEMBC and SPEC CPU2006 benchmarks and show that for an ARM V5T model neither of these benchmark suites provides a sufficient ISA coverage to exercise every path through each instruction of the whole instruction set. ...

Efficient Code Generation in a Region-based Dynamic Binary Translator

In proceedings of the 2014 SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems

Region-based JIT compilation operates on translation units comprising multiple basic blocks and, possibly cyclic or conditional, control flow between these. It promises to reconcile aggressive code optimisation and low compilation latency in performance-critical dynamic binary translators. Whilst various region selection schemes and isolated code optimisation techniques have been investigated it remains unclear how to best exploit such regions for efficient code generation. Complex interactions with indirect branch tables and translation caches can have adverse effects on performance if not considered carefully. In this paper we present a complete code generation strategy for a region-based dynamic binary translator, which exploits branch type and control flow profiling information to improve code quality for the common case. We demonstrate that using our code generation strategy a competitive region-based dynamic compiler can be built on top of the LLVM JIT compilation framework. For the ARM-V5T target ISA and SPEC CPU 2006 benchmarks we achieve execution rates of, on average, 867 MIPS and up to 1323 MIPS on a standard X86 host machine, outperforming state-of-the-art QEMU-ARM by delivering a speedup of 264%. ...