TeraCache: Efficient Caching over Fast Storage Devices

Abstract

This talk will introduce TeraCache, a new scalable cache for Spark that avoids both garbage collection (GC) and serialization overheads. Existing Spark caching options incur either significant GC overheads, for large managed heaps over persistent memory, or significant serialization overheads, when placing objects off-heap on large storage devices. Our analysis shows that: (1) serialization increases execution time by up to 30% and (2) caching on the managed heap increases GC time by 20%. In addition, these overheads grow worse as datasets grow.
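To illustrate the serialization cost the abstract refers to, the sketch below (a hypothetical micro-example, not the talk's benchmark; the `SerCost` class and its fields are invented for illustration) shows the serialize/deserialize round-trip that serialized or off-heap Spark storage levels pay: objects are converted to bytes when cached and must be rebuilt on every access.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: the serialization round-trip paid by serialized
// (off-heap) caching. Caching writes objects out as bytes; every later
// access must deserialize them back into heap objects.
public class SerCost implements Serializable {
    final int id;
    final double[] features;

    SerCost(int id, double[] features) { this.id = id; this.features = features; }

    // Serialize a cached partition to a byte array (paid once, at caching time).
    static byte[] serialize(List<SerCost> partition) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new ArrayList<>(partition));
        }
        return bos.toByteArray();
    }

    // Deserialize the partition (paid again on every access to the cached data).
    @SuppressWarnings("unchecked")
    static List<SerCost> deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (List<SerCost>) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        List<SerCost> partition = new ArrayList<>();
        for (int i = 0; i < 1000; i++) partition.add(new SerCost(i, new double[]{i, i * 2.0}));
        byte[] cached = serialize(partition);     // cost at caching time
        List<SerCost> back = deserialize(cached); // cost at every access
        System.out.println(back.size());          // prints 1000
    }
}
```

Keeping cached objects in object form, as TeraCache does, removes this round-trip entirely.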

TeraCache eliminates serialization and GC overhead for cached objects. To achieve this, TeraCache extends the HotSpot JVM’s heap with a managed heap that resides on a memory-mapped fast storage device and is used exclusively for cached data. To avoid GC over TeraCache, we extend the Java runtime to use semantic hints from Spark when allocating and freeing cached data objects. We modify the garbage collector to exclude cached objects from collection while maintaining safety. Preliminary results show that TeraCache can speed up ML workloads by up to 37% compared to the supported RDD storage levels.
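The key enabling mechanism is a heap region backed by a memory-mapped file on a fast storage device. The sketch below (a minimal standalone illustration, not TeraCache's actual JVM code; the temp file stands in for an NVMe device file) shows the underlying idea using Java's `FileChannel.map`: data written to the mapped region lives outside the GC-scanned portion of memory and is backed by the device.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Hypothetical sketch of a cache region backed by a memory-mapped file on
// a fast storage device: reads and writes go through the page cache, and
// the mapped bytes are not objects the garbage collector has to trace.
public class MappedCacheRegion {
    public static void main(String[] args) throws IOException {
        // A temp file stands in for a file on an NVMe device.
        Path backing = Files.createTempFile("teracache-region", ".bin");
        try (FileChannel ch = FileChannel.open(backing,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // Map a 1 MiB region of the file into the address space.
            MappedByteBuffer region = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
            region.putLong(0, 42L);                // place a cached value in the region
            System.out.println(region.getLong(0)); // read it back; prints 42
        } finally {
            Files.deleteIfExists(backing);
        }
    }
}
```

TeraCache goes further than this sketch: it places actual Java objects in such a region and teaches the collector, via Spark's allocation and free hints, not to trace them.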

Date
Nov 19, 2020
Event
Data and AI Summit Europe'20
Location
Virtual
Iacovos Kolokasis
Graduate Research Assistant
