Recent advances in distributed processing engines, from Spark and Impala to Spark Streaming and Storm, have been exciting. However, if your design focuses only on the processing layer for speed and power, you may be missing half the story, leaving a significant amount of optimization untapped.
Ted Malaska looks down the stack and describes a set of storage design patterns and schemas implemented on Cassandra, HBase, Kudu, Kafka, Solr, Elasticsearch, HDFS, and S3. By carefully tailoring how data is stored for each use case, processing and access times can be reduced by two to three orders of magnitude.
While the strategies and principles you’ll learn in this class can be applied in many software environments, examples will be shown using HDFS, HBase, Cassandra, Kudu, Kafka, Elasticsearch, and S3.
Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.
©2017, O'Reilly Media, Inc.