Proteus is a database engine designed for today's heterogeneous environments. Proteus adapts to variable data, hardware and workloads through a combination of GPU acceleration, data virtualization, and adaptive scheduling.

Sampling-Based AQP in Modern Analytical Engines

DaMoN 2022. V. Sanca, A. Ailamaki

Abstract

As data volumes grow, reducing query execution time remains an elusive goal. While approximate query processing (AQP) techniques offer a principled way to trade accuracy for faster analytical queries, sample creation is often treated as a second-class citizen. Modern analytical engines optimized for high-bandwidth media and multi-core architectures only exacerbate existing inefficiencies, resulting in prohibitively expensive query-time online sampling and longer preprocessing times in offline AQP systems.

We demonstrate that sampling operators can be practical in modern scale-up analytical systems. First, we evaluate three common sampling methods, identify algorithmic bottlenecks, and propose hardware-conscious optimizations. Second, we reduce the performance penalties of the added processing and sample materialization through system-aware operator design, and we compare sample creation time to that of the matching relational operators of an in-memory JIT-compiled engine. The cost of data reduction with materialization is up to 2.5x that of the equivalent group-by in the case of stratified sampling, and virtually free (∼1x) for reasonable sample sizes with the other strategies. As query processing starts to dominate the execution time, the gap between online and offline AQP methods diminishes.
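
To illustrate why stratified sampling is naturally compared against a group-by, here is a minimal sketch that keeps a fixed-size reservoir per group in a single pass. It uses textbook reservoir sampling (Algorithm R) and is not the operator design evaluated in the paper; the function name `stratified_reservoir_sample` and its parameters `rows`, `key`, `k`, and `seed` are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_reservoir_sample(rows, key, k, seed=None):
    """Keep up to k uniformly chosen rows per stratum in one pass.

    Illustrative only: a per-group Algorithm R reservoir, not the
    hardware-conscious operator described in the paper.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # stratum -> sampled rows
    seen = defaultdict(int)         # stratum -> rows observed so far

    for row in rows:
        group = key(row)            # same grouping step a group-by performs
        seen[group] += 1
        reservoir = reservoirs[group]
        if len(reservoir) < k:
            # Fill phase: the first k rows of a stratum are always kept.
            reservoir.append(row)
        else:
            # Replacement phase: keep the new row with probability k / seen.
            j = rng.randrange(seen[group])
            if j < k:
                reservoir[j] = row

    return dict(reservoirs), dict(seen)


if __name__ == "__main__":
    data = [(i % 3, i) for i in range(1000)]
    sample, counts = stratified_reservoir_sample(
        data, key=lambda r: r[0], k=5, seed=42
    )
    for group in sorted(sample):
        print(group, counts[group], sample[group])
```

The per-stratum counts returned alongside the sample are what a downstream estimator would need to scale results back up, for example weighting each sampled row of a stratum by `counts[group] / len(sample[group])`.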

@inproceedings{DBLP:conf/damon/SancaA22,
  author    = {Viktor Sanca and
               Anastasia Ailamaki},
  editor    = {Spyros Blanas and
               Norman May},
  title     = {Sampling-Based {AQP} in Modern Analytical Engines},
  booktitle = {International Conference on Management of Data, DaMoN 2022, Philadelphia,
               PA, USA, 13 June 2022},
  pages     = {4:1--4:8},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3533737.3535095},
  doi       = {10.1145/3533737.3535095},
  timestamp = {Wed, 15 Jun 2022 13:56:38 +0200},
  biburl    = {https://dblp.org/rec/conf/damon/SancaA22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}