Matei Zaharia

Matei Zaharia is a computer scientist known for starting the Apache Spark project during his PhD at the University of California, Berkeley. According to his homepage, Spark has become one of the most widely used frameworks for distributed data processing, and he also co-initiated related systems including Apache Mesos and Spark Streaming.

Zaharia was the lead author of the 2012 USENIX NSDI paper “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” which introduced the resilient distributed dataset (RDD) abstraction at the heart of Spark. An RDD records the lineage of transformations that produced it, so partitions lost in a failure can be recomputed rather than restored from replicas, which makes in-memory cluster computing both fast and fault tolerant.
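The lineage idea can be illustrated with a small toy in plain Python. This is only a sketch of the concept, not Spark's implementation or API: the class `ToyRDD` and its methods `map`, `collect`, and `lose_cache` are hypothetical names invented for this example, and everything runs in a single process with no partitions or cluster.

```python
class ToyRDD:
    """Hypothetical toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source   # stable input data, set only on a base dataset
        self._parent = parent   # dataset this one was derived from
        self._fn = fn           # transformation applied to each parent element
        self._cache = None      # in-memory result; may be lost at any time

    def map(self, fn):
        # Record the transformation instead of computing it right away.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        # If the in-memory copy is missing, rebuild it by replaying the lineage.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = [self._fn(x) for x in self._parent.collect()]
        return list(self._cache)

    def lose_cache(self):
        # Simulate a failure that wipes the in-memory data.
        self._cache = None


base = ToyRDD(source=[1, 2, 3])
squares = base.map(lambda x: x * x)

first = squares.collect()       # computed and cached in memory

squares.lose_cache()            # simulated failure of the derived data
base.lose_cache()               # and of the cached input it came from

recomputed = squares.collect()  # rebuilt from the source via the recorded lineage
print(first, recomputed)
```

The point of the sketch is that no copy of `squares` is kept anywhere else: after the simulated failure, the same values come back only because the dataset remembers its parent and the function that produced it.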

He co-founded Databricks, the company built around Spark, where he served as Chief Technology Officer. His homepage notes that at Databricks he helped develop MLflow, a tool for managing the machine learning lifecycle, and Delta Lake, a system for ACID table storage. His research centers on computer systems for large-scale workloads such as AI, data analytics, and cloud computing.

His homepage lists his role as an Associate Professor of Electrical Engineering and Computer Sciences at UC Berkeley, along with recognitions including the ACM Prize in Computing and the SIGMOD Systems Award. (His earlier Stanford homepage now redirects to this Berkeley page.)
