Dean and Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters

Fri Feb 05 2021

tags: draft programming computer science self study notes public 6.824 MIT distributed systems

Introduction

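This paper introduces MapReduce, a programming model and an associated implementation for processing and generating large data sets on clusters of machines.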

I read this paper as part of my self-study of 6.824 with NUS SoC folks. It may or may not be part of my goal to finish a Stanford CS degree in a year.

Overview of the paper

Interesting things about the paper

MapReduce gets map jobs

For some reason, the paper has Map workers send the list of generated intermediate file locations back to the master, but doesn't do the same for Reduce workers. I get it--there's not really a need for Reduce workers to report back, since Reduce output files are saved on a shared file system while Map intermediate files are saved on the individual workers' local storage. The drawback is additional work for the master, which has to constantly check for the presence of output files. I compared my implementation to

I think this could be an attempt to avoid unnecessary network I/O.
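
To make the asymmetry concrete, here is a minimal sketch of the bookkeeping as I understand it, not the paper's actual code. The class name `Master`, the method names, and the `mr-out-<r>` output naming are all my own assumptions for illustration. Map workers push their intermediate file locations to the master, while Reduce completion is inferred by checking the shared file system.

```python
import os


class Master:
    """Hypothetical master bookkeeping; names are illustrative only."""

    def __init__(self, n_reduce, output_dir):
        self.n_reduce = n_reduce
        # Assumption: output_dir stands in for the shared file system.
        self.output_dir = output_dir
        # For each reduce partition, the intermediate file locations
        # reported by Map workers.
        self.intermediate = {r: [] for r in range(n_reduce)}

    def map_task_done(self, locations):
        # A Map worker reports where its intermediate files live,
        # as a dict of reduce partition -> path on that worker.
        for r, path in locations.items():
            self.intermediate[r].append(path)

    def reduce_task_done(self, r):
        # No message from the Reduce worker: the master just checks
        # whether the final output file exists on shared storage.
        return os.path.exists(os.path.join(self.output_dir, f"mr-out-{r}"))
```

The cost of the second method is the polling: the master has to call `reduce_task_done` repeatedly instead of being told once.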

How MapReduce avoids inconsistency

The problem is

With

Suppose you have a m

The paper prevents the master from observing files that have been partially written by using temporary output files plus an atomic rename operation.
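
Here is a minimal sketch of the temporary-file-plus-rename pattern, assuming a local POSIX-like file system rather than the distributed file system used in the paper. The function name `write_output_atomically` is my own. The temporary file is created in the same directory as the final file so that the rename stays within one file system.

```python
import os
import tempfile


def write_output_atomically(final_path, data):
    # Create the temporary file next to the final file so the rename
    # does not cross file systems.
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the contents to disk before renaming
        # os.replace renames the file, overwriting any existing target;
        # on POSIX this is atomic, so readers see either the old file
        # or the complete new one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial temporary file on failure.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Anyone checking for `final_path` never sees a half-written file: until the rename happens, the data only exists under the temporary name.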