Abstract
The wide adoption of smart devices and Internet-of-Things (IoT) sensors has led to massive growth in data generation at the edge of the Internet over the past decade. Intelligent real-time analysis of such a high volume of data, particularly leveraging highly accurate Deep Learning (DL) models, often requires the data to be processed as close as possible to the data sources (i.e., at the edge of the Internet) to minimize network and processing latency. The advent of specialized, low-cost, and power-efficient edge devices has greatly facilitated DL inference tasks at the edge. However, limited research has been done on improving inference throughput (i.e., the number of inferences per second) by exploiting various system techniques. This study investigates system techniques that enhance the overall inference throughput on edge devices running DL models for image classification tasks. We present various approaches, such as batched inferencing and multi-tenancy, to utilize edge devices' system resources (CPUs and GPUs) and AI accelerators (e.g., Tensor Processing Units, TPUs). The evaluation results show that batched inferencing yields up to 4× more inferences per second on devices equipped with high-performance GPUs like the Jetson Xavier NX. Moreover, with multi-tenancy approaches, e.g., concurrent model executions and dynamic model placements, a throughput of nearly 340 inferences per second can be achieved, which is 6× higher than the maximum throughput when running the models in a single-tenant setting. Furthermore, a detailed analysis of the hardware and software factors that affect system throughput is presented, thereby shedding light on areas that could be further improved to achieve high-performance DL inference at the edge.
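The batched-inferencing idea summarized above can be illustrated with a minimal sketch. This is not the paper's implementation: the "model" below is a hypothetical single dense layer (NumPy matrix multiply standing in for a DL forward pass), and all names (`infer_one`, `infer_batch`, `weights`) are illustrative. The point is that one batched call amortizes per-call overhead over many inputs while producing the same outputs as per-sample calls, which is the mechanism by which batching raises inferences per second.

```python
import time
import numpy as np

# Hypothetical stand-in for a DL model: one dense layer (illustrative only,
# not the paper's model). 1024-d input features, 10 output classes.
rng = np.random.default_rng(0)
weights = rng.standard_normal((1024, 10))

def infer_one(x):
    """Single-sample inference: one forward pass per input."""
    return x @ weights

def infer_batch(xs):
    """Batched inference: one forward pass for the whole batch."""
    return xs @ weights

inputs = rng.standard_normal((256, 1024))

# Per-sample inference loop.
t0 = time.perf_counter()
single = np.stack([infer_one(x) for x in inputs])
t_single = time.perf_counter() - t0

# One batched call over the same inputs.
t0 = time.perf_counter()
batched = infer_batch(inputs)
t_batch = time.perf_counter() - t0

# Outputs are identical; only the throughput (inferences/second) differs.
assert np.allclose(single, batched)
print(f"per-sample: {len(inputs) / t_single:.0f} inf/s, "
      f"batched: {len(inputs) / t_batch:.0f} inf/s")
```

On accelerator-backed devices the gap is typically much larger than in this CPU toy, since a GPU or TPU can process all samples in a batch in parallel, which is consistent with the up-to-4× gains reported above.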