Real-time systems, such as autonomous driving systems, must meet deadlines and resolve task dependencies at the same time. Such scheduling requires building a directed acyclic graph (DAG) of task dependencies and scheduling the graph with priorities taken into account. However, when executing a large and complex system with data-flow task sets, priority-based scheduling makes it difficult to meet the deadlines of important tasks. The author proposes a job-grouped DAG task model for real-time systems, together with a real-time scheduling framework for it. The proposed framework focuses on the deadline of each group rather than that of each task, which enables it to meet those deadlines. By modifying ROSCH, a real-time scheduling framework for the Robot Operating System, the author demonstrates that the proposed group-optimized scheduling framework reduces the number of deadline misses.
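The idea of ordering a dependency graph by group deadlines rather than per-task priorities can be illustrated with a minimal sketch. This is not the author's framework or the ROSCH implementation. It assumes a single processor, non-preemptive execution, and a simple rule (among ready tasks, run the one whose group has the earliest deadline). The function name, task names, and all numbers are hypothetical.

```python
import heapq


def schedule_by_group_deadline(tasks, deps, group_of, group_deadline):
    """Order a DAG of tasks, preferring tasks whose group deadline is earliest.

    tasks: {task_name: execution_time}
    deps: list of (predecessor, successor) edges
    group_of: {task_name: group_name}
    group_deadline: {group_name: deadline}
    Returns (execution order, finish time per group, groups that missed).
    """
    indegree = {t: 0 for t in tasks}
    successors = {t: [] for t in tasks}
    for pred, succ in deps:
        successors[pred].append(succ)
        indegree[succ] += 1

    # Ready queue keyed by the deadline of the task's group (earliest first).
    ready = [(group_deadline[group_of[t]], t) for t in tasks if indegree[t] == 0]
    heapq.heapify(ready)

    clock = 0
    order = []
    group_finish = {}
    while ready:
        _, task = heapq.heappop(ready)
        clock += tasks[task]  # run the task to completion (non-preemptive)
        order.append(task)
        # The clock only increases, so the last write is the group's finish time.
        group_finish[group_of[task]] = clock
        for succ in successors[task]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                heapq.heappush(ready, (group_deadline[group_of[succ]], succ))

    if len(order) != len(tasks):
        raise ValueError("dependency graph contains a cycle")

    missed = sorted(g for g, t in group_finish.items() if t > group_deadline[g])
    return order, group_finish, missed


# Hypothetical task set: a control pipeline plus a background logging task.
tasks = {"sense": 2, "localize": 3, "plan": 4, "log": 5}
deps = [("sense", "localize"), ("localize", "plan"), ("sense", "log")]
group_of = {"sense": "control", "localize": "control",
            "plan": "control", "log": "background"}
group_deadline = {"control": 10, "background": 20}

order, group_finish, missed = schedule_by_group_deadline(
    tasks, deps, group_of, group_deadline)
print(order, group_finish, missed)
```

In this toy example the control group finishes before the background task is started, because its group deadline is earlier, even though both branches become ready at the same moment.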
Single-ISA heterogeneous multi-core architectures, which combine high-performance large cores with power-efficient small ones, are attractive for systems with diverse workloads. Previous work has enhanced whole-system throughput on these architectures by improving thread scheduling algorithms. In this approach, however, the overhead caused by scheduling and task migration becomes a problem. This paper presents an operating system (OS) design incorporating the multikernel model. A large core and other resources are isolated and then allocated exclusively to a specified process. Thus the overhead caused by scheduling and task migration, as well as mutual exclusion when accessing shared resources, is avoided, leading to improved throughput. The author has implemented and evaluated the proposed OS design on real ARM big.LITTLE platforms to show that the design is promising. The evaluation indicates improved throughput for processes that access shared resources frequently.
We design and implement operating systems for many-core processors. In particular, we investigate the scalability and performance improvement in operating systems and device drivers on Intel Xeon Phi Knights Landing many-core processors.
3D scan registration is an important method for localization on mobile devices. The 3D normal distribution transform (3D-NDT) is an efficient algorithm for 3D registration compared to the iterative closest point (ICP) algorithm. Input point data are captured by 3D laser range finders, whose resolutions are continuously increasing. Localization for fast-moving objects such as automobiles requires a short turnaround time on the order of milliseconds. At the same time, embedded systems are sensitive to power consumption. CPU and GPU implementations of the algorithm exist, but their lack of flexibility makes it difficult to balance computational capability against usability in mobile systems. To satisfy these requirements, this dissertation presents a hardware implementation of the algorithm on an FPGA with an appropriate data type. The results show a new option for 3D registration, which is also expected to be implemented as an ASIC in future work.
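For readers unfamiliar with 3D-NDT, its core scoring step can be sketched as follows: the reference scan is divided into voxels, a Gaussian (mean and covariance) is fitted to the points in each voxel, and a candidate pose is scored by summing the Gaussian likelihood of each transformed input point. The sketch below is a simplified pure-Python illustration, not the dissertation's FPGA design. It assumes a translation-only pose (a full implementation optimizes a 6-DoF pose, typically with Newton's method), and the function names, voxel size, and regularization constant are hypothetical choices.

```python
import math
from collections import defaultdict


def voxel_key(p, size):
    """Index of the cubic voxel of edge length `size` that contains point p."""
    return tuple(math.floor(c / size) for c in p)


def invert_3x3(m):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def build_ndt_cells(points, size, min_points=4, reg=1e-3):
    """Fit one Gaussian per voxel of the reference scan."""
    buckets = defaultdict(list)
    for p in points:
        buckets[voxel_key(p, size)].append(p)
    cells = {}
    for key, pts in buckets.items():
        n = len(pts)
        if n < min_points:
            continue  # too few points for a meaningful covariance
        mean = [sum(p[i] for p in pts) / n for i in range(3)]
        cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in pts) / (n - 1)
                for j in range(3)] for i in range(3)]
        for i in range(3):
            cov[i][i] += reg  # assumed regularization so flat voxels stay invertible
        cells[key] = (mean, invert_3x3(cov))
    return cells


def ndt_score(scan, cells, size, translation=(0.0, 0.0, 0.0)):
    """Sum of per-point Gaussian likelihoods; higher means better alignment."""
    score = 0.0
    for p in scan:
        q = [p[i] + translation[i] for i in range(3)]
        cell = cells.get(voxel_key(q, size))
        if cell is None:
            continue  # point fell into a voxel with no fitted Gaussian
        mean, inv = cell
        d = [q[i] - mean[i] for i in range(3)]
        maha = sum(d[i] * inv[i][j] * d[j] for i in range(3) for j in range(3))
        score += math.exp(-0.5 * maha)
    return score


# Hypothetical reference scan: a 3x3x3 grid of points inside one unit voxel.
grid = (0.25, 0.5, 0.75)
reference = [(x, y, z) for x in grid for y in grid for z in grid]
cells = build_ndt_cells(reference, size=1.0)
aligned = ndt_score(reference, cells, 1.0)
shifted = ndt_score(reference, cells, 1.0, translation=(0.2, 0.0, 0.0))
print(aligned > shifted)
```

A registration loop would search for the pose that maximizes this score. The per-point work is a small, fixed sequence of multiply-accumulate operations, which is the kind of computation where the choice of numeric data type matters for a hardware implementation.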
Parallelization of Convolutional Neural Networks (CNNs) has been studied extensively in recent years. A case study of parallelized CNNs using general-purpose computing on GPUs (GPGPU) and the Message Passing Interface (MPI) has been published. On the other hand, little effort has been devoted to studying the scalability of parallelized CNNs on multi-core CPUs. We explore the performance of the CNN training process as the number of computing cores and threads increases. Detailed experiments were conducted on state-of-the-art multi-core processors using the OpenMP and MPI frameworks to demonstrate that Caffe-based CNNs are successfully accelerated by well-designed multi-threaded programs. We also discuss a better way to bring out the performance of multi-threaded CNNs by comparing three different implementations.
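Scalability studies of this kind are commonly summarized with two standard metrics: speedup, S(n) = T(1) / T(n), and parallel efficiency, E(n) = S(n) / n, where T(n) is the training time with n threads. The sketch below shows one way to compute them. The function name is hypothetical, and the timings are placeholder values chosen for easy arithmetic, not measurements from this work.

```python
def scaling_table(seconds_by_threads):
    """Return (threads, speedup, efficiency) rows relative to the 1-thread time."""
    if 1 not in seconds_by_threads:
        raise ValueError("a single-thread baseline timing is required")
    baseline = seconds_by_threads[1]
    rows = []
    for n in sorted(seconds_by_threads):
        speedup = baseline / seconds_by_threads[n]
        rows.append((n, speedup, speedup / n))
    return rows


# Placeholder timings in seconds (illustrative only, not measured results).
timings = {1: 100.0, 2: 50.0, 4: 40.0}
for threads, speedup, efficiency in scaling_table(timings):
    print(f"{threads} threads: speedup {speedup:.2f}, efficiency {efficiency:.3f}")
```

An efficiency that stays close to 1 as the thread count grows indicates good scaling. A sharp drop suggests synchronization, memory bandwidth, or serial sections are limiting the multi-threaded implementation.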