Proteus is a database engine designed for today's heterogeneous environments. Proteus adapts to variable data, hardware and workloads through a combination of GPU acceleration, data virtualization, and adaptive scheduling.
Sampling-Based AQP in Modern Analytical Engines
DaMoN 2022
Abstract
As the data volume grows, reducing the query execution times remains an elusive goal. While approximate query processing (AQP) techniques present a principled method to trade off accuracy for faster queries in analytics, the sample creation is often considered a second-class citizen. Modern analytical engines optimized for high-bandwidth media and multi-core architectures only exacerbate existing inefficiencies, resulting in prohibitive query-time online sampling and longer preprocessing times in offline AQP systems.
We demonstrate that the sampling operators can be practical in modern scale-up analytical systems. First, we evaluate three common sampling methods, identify algorithmic bottlenecks, and propose hardware-conscious optimizations. Second, we reduce the performance penalties of the added processing and sample materialization through system-aware operator design and compare the sample creation time to the matching relational operators of an in-memory JIT-compiled engine. The cost of data reduction with materialization is up to 2.5x of the equivalent group-by in the case of stratified sampling and virtually free (∼1x) for reasonable sample sizes of other strategies. As query processing starts to dominate the execution time, the gap between online and offline AQP methods diminishes.
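To make the comparison between stratified sampling and a group-by more concrete, here is a minimal Python sketch of one common way to build a stratified sample: keep a fixed-size reservoir per stratum, using the classic reservoir sampling algorithm (Algorithm R). This is an illustration only, not the paper's JIT-compiled operator; the function name `stratified_reservoir_sample` and the parameters `key`, `k`, and `seed` are hypothetical. Like a hash-based group-by, it maintains per-group state keyed by the stratum, which hints at why the two operators have comparable costs.

```python
import random
from collections import defaultdict


def stratified_reservoir_sample(rows, key, k, seed=0):
    """Keep a uniform sample of at most k rows per stratum in one pass.

    rows: iterable of records
    key:  function mapping a record to its stratum (like a group-by key)
    k:    reservoir capacity per stratum
    Returns (sample, counts): sampled rows and total rows seen, per stratum.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # stratum -> sampled rows
    seen = defaultdict(int)         # stratum -> number of rows seen so far

    for row in rows:
        stratum = key(row)
        seen[stratum] += 1
        reservoir = reservoirs[stratum]
        if len(reservoir) < k:
            # Reservoir not full yet: always keep the row.
            reservoir.append(row)
        else:
            # Algorithm R: the i-th row replaces a random slot
            # with probability k / i.
            j = rng.randrange(seen[stratum])
            if j < k:
                reservoir[j] = row

    return dict(reservoirs), dict(seen)


if __name__ == "__main__":
    # Toy table of (group, value) pairs with three strata.
    table = [(i % 3, float(i)) for i in range(1000)]
    sample, counts = stratified_reservoir_sample(table, key=lambda r: r[0], k=50)

    # Approximate SUM(value) per group by scaling each stratum's sample.
    for group in sorted(sample):
        scale = counts[group] / len(sample[group])
        estimate = scale * sum(value for _, value in sample[group])
        exact = sum(value for g, value in table if g == group)
        print(group, round(estimate, 1), exact)
```

The per-stratum counts are returned alongside the sample because an aggregate computed on the sample must be scaled by the ratio of rows seen to rows kept in each stratum.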
Links
@inproceedings{DBLP:conf/damon/SancaA22,
  author    = {Viktor Sanca and Anastasia Ailamaki},
  editor    = {Spyros Blanas and Norman May},
  title     = {Sampling-Based {AQP} in Modern Analytical Engines},
  booktitle = {International Conference on Management of Data, DaMoN 2022, Philadelphia, PA, USA, 13 June 2022},
  pages     = {4:1--4:8},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3533737.3535095},
  doi       = {10.1145/3533737.3535095},
  timestamp = {Wed, 15 Jun 2022 13:56:38 +0200},
  biburl    = {https://dblp.org/rec/conf/damon/SancaA22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}