As Uber continues to grow, its big data systems must also grow in scalability, reliability, and performance to help Uber make business decisions, give user recommendations, and analyze experiments across all data sources. Zhenxiao Luo shares his experience running columnar storage in production at Uber and discusses query optimization techniques in SQL engines.
Uber’s Hadoop warehouse uses columnar storage with Parquet as the default file format, Presto as its interactive query engine, and Hive and Spark as the batch engines. Zhenxiao explains how Uber developed performance optimizations for columnar storage in all of these query engines, including nested column pruning, predicate pushdown, dictionary pushdown, columnar reads, and lazy reads, delivering a more than 5x performance improvement across all of them.
Zhenxiao Luo is a software engineer at Uber working on Presto and Parquet. Previously, he led the development and operations of Presto at Netflix and worked on big data and Hadoop-related projects at Facebook, Cloudera, and Vertica. He holds a master’s degree from the University of Wisconsin-Madison and a bachelor’s degree from Fudan University.
©2017, O'Reilly Media, Inc.