Saturday, November 29, 2014

Map, Filter, Reduce & Lambda with Python 2.7

You may have heard the term MapReduce come up in conversations around big data, specifically Apache Hadoop, or Google, which originally developed it as proprietary technology. There's a lot of buzz these days about Google's Dataflow as an alternative for handling streams of big data, but map, filter and reduce are still functions you need to be familiar with. Let's start by going over each one in detail.

Prerequisites: Stop by my article about lists and list comprehensions.

The Essential Functions


The map function accepts a function as its first argument and a list as its second. It applies that function to each element of the list and returns a new list of the results.

# map
l = [4, 5, 1, 8]

def double(x):
    return x * 2

map(double, l) # [8, 10, 2, 16]
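One extra trick worth knowing: map can also take more than one list, in which case the function receives one element from each list per call. A quick sketch (wrapped in list() so it prints the same way under Python 3, where map returns an iterator):

```python
a = [1, 2, 3]
b = [10, 20, 30]

def add(x, y):
    return x + y

# map pulls one element from each list and passes the pair to add
print(list(map(add, a, b)))  # [11, 22, 33]
```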


filter also accepts a function and a list. The function you pass in is expected to return True or False. As filter iterates through the list, it passes each value to the function; if the function returns True, that item is kept. The end result is a new list containing only the passing values.

# filter
l = [2, 6, 5, 8, 7]
def is_even(x):
    return (x % 2) == 0

filter(is_even, l) # [2, 6, 8]


reduce follows the same pattern as the preceding functions. It accepts a function and a list, but collapses the list into a single value by repeatedly applying the function to a running result and the next element.

# reduce
l = [1, 2, 4, 9, 5]
def add(a, b):
    return a + b

def sub(a, b):
    return a - b

reduce(add, l) # 21
reduce(sub, l) # -19

# You can also provide an initial value
reduce(add, l, 4) # Initial value of 4 results in 25

Note: in Python 3.x, reduce is no longer a built-in. Import it from the functools module as functools.reduce(), or use an explicit loop, which is often more readable.
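For reference, here are the same reduce examples written so they run under Python 3 as well, by importing reduce from functools (available since Python 2.6):

```python
from functools import reduce  # works in Python 2.6+ and all of Python 3

l = [1, 2, 4, 9, 5]

def add(a, b):
    return a + b

print(reduce(add, l))     # 21
print(reduce(add, l, 4))  # 25, starting from an initial value of 4
```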


Lambda

lambda is the Python construct for creating anonymous functions, that is, small unnamed functions defined at runtime.

# defining a function normally
def sqr(x):
    return x*x

print(sqr(3)) # 9

# defining a lambda function
l = lambda x: x*x
print(l(3)) # 9

# lambda in list comprehensions
print([(lambda x: x*x)(i) for i in range(10)])
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# a cleaner solution
print([i*i for i in range(10)])
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
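Lambdas really shine when combined with the three functions above, since you can skip defining a named helper entirely. Here are the earlier examples rewritten with lambdas (list() added so the map and filter results print the same way under Python 3):

```python
from functools import reduce  # only needed on Python 3

l = [2, 6, 5, 8, 7]

print(list(map(lambda x: x * 2, l)))          # [4, 12, 10, 16, 14]
print(list(filter(lambda x: x % 2 == 0, l)))  # [2, 6, 8]
print(reduce(lambda a, b: a + b, l))          # 28
```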

MapReduce and Big Data

So you can see that all of these functions are driving towards the same basic idea. The major efficiencies come from distributing work across many machines rather than processing everything in one place, and the MapReduce model follows the same principle for processing and generating large data sets. There are two phases: map and reduce. The map phase accepts a set of inputs and emits a new, possibly different, number of outputs, determined by the function the programmer provides. The reducer's job is to combine the data from the mapper into something usable. In between, a shuffle and sort phase groups the data to make it easier to manage, but we won't dive into that given the scope of this article.
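As a toy illustration only, not real Hadoop code, the classic word-count example can be modeled in plain Python with a map phase, a shuffle-like grouping step, and a reduce phase:

```python
from functools import reduce
from itertools import groupby

docs = ["the cat", "the dog", "the cat"]

# map phase: each document emits (word, 1) pairs
pairs = []
for doc in docs:
    pairs.extend(map(lambda w: (w, 1), doc.split()))

# shuffle/sort phase: group the pairs by word
pairs.sort(key=lambda p: p[0])
grouped = groupby(pairs, key=lambda p: p[0])

# reduce phase: sum the counts for each word
counts = {word: reduce(lambda total, p: total + p[1], group, 0)
          for word, group in grouped}

print(counts)  # {'cat': 2, 'dog': 1, 'the': 3}
```

In a real cluster, the map and reduce phases would each run in parallel across many machines, with the shuffle moving pairs between them.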

So now it's easy to see how MapReduce was influenced by the map and reduce functions. MapReduce's main contributions lie in distribution, parallelism, redundancy and fault tolerance, and, most importantly, scalability.
