Some ways to deal with big(-ish) data: 'why do we get an OutOfMemory error in FPGrowth?'

Wed Mar 31 2021

tags: debugging programming imda

(Written for work)

Introduction

Xin Zheng mentioned that she was running out of memory when running the Apriori and FP-Growth algorithms for the recommender systems project.

After some research, I found the reason for the error, and here I present a few solutions which might help other data scientists too.

Thanks also to Siyang for being a sounding board for my ideas.

The problem

Xin Zheng was running out of memory when running the Apriori and FP-Growth algorithms. These algorithms are used to generate frequent itemsets (items that are frequently bought together by customers) so that we can recommend similar items to purchase.

This is the section of her code that was throwing the error:

"""
FP growth algo for product association rules calculation
* input item order-product name matrix
* output rules
"""

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules
import time

class FPGrowthRules:
def run(self, df_matrix=None, min_support=0.5, min_confidence=0.25):
frequent_itemsets = fpgrowth(df_matrix, min_support=min_support, use_colnames=True)
# ... stuff elided ...
return rules

In particular, it was this invocation of fpgrowth from the mlxtend package:

frequent_itemsets = fpgrowth(df_matrix, min_support=min_support, use_colnames=True)

Here fpgrowth takes in a DataFrame and returns a DataFrame. Specifically, it computes frequent itemsets from a one-hot encoded DataFrame. A one-hot DataFrame looks like this:

        Apple  Bananas   Beer  Chicken   Milk   Rice
    0   True    False   True     True  False   True
    1   True    False   True    False  False   True
    2   True    False   True    False  False  False
    3   True     True  False    False  False  False
    4  False    False   True     True   True   True
    5  False    False   True    False   True   True
    6  False    False   True    False   True  False
    7   True     True  False    False  False  False

fpgrowth returns a pandas DataFrame with columns ['support', 'itemsets'] listing all itemsets with support >= min_support and length < max_len. You can find usage examples here.
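To make the input and output concrete, here is a small, self-contained sketch of the call in question. The transactions below are made up to match the toy table above, and mlxtend's TransactionEncoder is used to build the one-hot DataFrame:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Toy baskets corresponding to the one-hot table above
transactions = [
    ["Apple", "Beer", "Chicken", "Rice"],
    ["Apple", "Beer", "Rice"],
    ["Apple", "Beer"],
    ["Apple", "Bananas"],
    ["Beer", "Chicken", "Milk", "Rice"],
    ["Beer", "Milk", "Rice"],
    ["Beer", "Milk"],
    ["Apple", "Bananas"],
]

# TransactionEncoder turns the list of baskets into a one-hot boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Returns a DataFrame with columns ['support', 'itemsets']
frequent_itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(frequent_itemsets)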

After looking at her code and the source code of the mlxtend library, I found the error and came up with three plausible solutions, each with its own advantages and disadvantages.

Solution 1: use a bigger machine (vertical scaling)

The simplest way to solve the out-of-memory issue is to use a larger machine. XZ's machine (27GB of RAM and 10GB of disk) runs out of memory, but 27GB is not that much memory in the grand scheme of things; we can get instances with much more memory.

The pro of this solution is that it requires no developer effort. The cons are that it incurs additional cost, and the TP we pass it to may not be able to spin up larger machines. And of course, this solution is not very scalable because there's a hard cap on the amount of memory we can provision, so if the out-of-memory issue is caused by O(n^2) scaling or worse, this would not be a good solution.

Solution 2: use Apache Spark (horizontal scaling)

A more involved approach is to use Apache Spark to scale out. This involves rewriting Xin Zheng's code as a Spark job and using a parallel, Spark-friendly version of fpgrowth. We would then find some way to spin up a Spark cluster and run the Spark job.

The pro of this solution is that it can scale to much larger datasets. The cons are that it introduces a lot of technical complexity, both in the code and in the deployment, and it makes testing more difficult. Additionally, TPs may not have the technical chops to provision their own Spark cluster.
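For a sense of what the rewrite might look like, here is a rough sketch using Spark's built-in FP-Growth (pyspark.ml.fpm.FPGrowth). Note that Spark takes transactions as lists of items rather than a one-hot matrix; the toy data and app name below are illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-rules").getOrCreate()

# Each row is (transaction id, list of items in the basket)
transactions = spark.createDataFrame(
    [(0, ["Apple", "Beer", "Chicken", "Rice"]),
     (1, ["Apple", "Beer", "Rice"]),
     (2, ["Apple", "Beer"])],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.25)
model = fp.fit(transactions)

model.freqItemsets.show()       # frequent itemsets and their counts
model.associationRules.show()   # antecedent, consequent, confidence, ...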

Solution 3: replace mlxtend's source code (software solution)

The last solution requires us to actually understand the problem. Let's figure out exactly what is causing the OOM error.

Recall that fpgrowth takes in a one-hot DataFrame that looks like this:

        Apple  Bananas   Beer  Chicken   Milk   Rice
    0   True    False   True     True  False   True
    1   True    False   True    False  False   True
    2   True    False   True    False  False  False
    3   True     True  False    False  False  False
    4  False    False   True     True   True   True
    5  False    False   True    False   True   True
    6  False    False   True    False   True  False
    7   True     True  False    False  False  False

and returns a pandas DataFrame with columns ['support', 'itemsets'] listing all itemsets with support >= min_support and length < max_len. It's clear that the smaller min_support is, the more itemsets will be returned. If min_support is small enough we will get an OutOfMemory error.
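You can see this directly by reusing the toy onehot DataFrame from the earlier sketch and sweeping min_support downwards (the thresholds below are arbitrary):

from mlxtend.frequent_patterns import fpgrowth

# The smaller min_support is, the more itemsets fpgrowth returns
for min_support in (0.5, 0.25, 0.125):
    n = len(fpgrowth(onehot, min_support=min_support, use_colnames=True))
    print(f"min_support={min_support}: {n} frequent itemsets")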

But how are these itemsets generated? Can we pin down the exact line that is causing the error? How do we fix the OutOfMemory error?

Let's dive into the fpgrowth source code here.

We can see that the fpgrowth function calls fpc.setup_fptree on line 84 to build an FPTree, then uses this tree to run fpg_step recursively.

I did some additional digging and found the most likely cause: line 113 inside fpg_step:

# fpg_step
if max_len:
    size_remain = max_len - len(tree.cond_items) + 1
for i in range(1, size_remain):
    for itemset in itertools.combinations(items, i):  # <-- this line!
        count += 1
        support = min([tree.nodes[i][0].count for i in itemset])
        yield support, tree.cond_items + list(itemset)

Here itertools.combinations(items, i) is almost certainly the source of the OOM error. The number of combinations, items choose i, is enormous when items and i are large, on the order of Θ(items^i) (see the Appendix for an explanation). So the number of itemsets grows non-polynomially with the size of the DataFrame. But this by itself is not the problem, because itertools.combinations returns a generator, which will not OOM. Rather, it's the function that consumes this generator, fpc.generate_itemsets, that is the issue:

# fpgrowth.py
# ...
def fpgrowth(...):
    # ...
    generator = fpg_step(tree, minsup, colname_map, max_len, verbose)
    return fpc.generate_itemsets(generator, len(df.index), colname_map)  # <- this line


# fpcommon.py
def generate_itemsets(generator, num_itemsets, colname_map):
    itemsets = []
    supports = []
    for sup, iset in generator:
        itemsets.append(frozenset(iset))
        supports.append(sup / num_itemsets)

    res_df = pd.DataFrame({'support': supports, 'itemsets': itemsets})

    if colname_map is not None:
        res_df['itemsets'] = res_df['itemsets'] \
            .apply(lambda x: frozenset([colname_map[i] for i in x]))

    return res_df

Looking at the generate_itemsets function we can see that itemsets and supports are appended to for each itemset. And we know that there are a lot of itemsets. This causes the OOM error.
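To get a feel for the scale, here is a quick back-of-the-envelope calculation (the item counts are made up for illustration):

from math import comb

# Even modest numbers of frequent items blow up quickly
print(comb(100, 5))    # 75,287,520 itemsets of size 5 from 100 frequent items
print(comb(1000, 5))   # ~8.25e12 -- far too many to materialise in a list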

So what can we do? There are several possibilities. The first is simply to use a higher min_support value so that we don't run out of memory, but that is less a solution than it is giving up. I think one real solution is to generate the DataFrame in chunks and persist it to disk. This way we don't have to append every itemset to itemsets before building the DataFrame.
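Here is a minimal sketch of what a chunked replacement for generate_itemsets might look like. This is not mlxtend's API: the function name, chunk_size, and the Parquet output path are all illustrative choices, and the generator argument is the same fpg_step generator as above.

import itertools

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def generate_itemsets_chunked(generator, num_itemsets, colname_map,
                              chunk_size=100_000, out_path="itemsets.parquet"):
    """Drain the fpg_step generator in fixed-size chunks and append each
    chunk to a Parquet file, so we never hold all itemsets in memory."""
    writer = None
    try:
        while True:
            chunk = list(itertools.islice(generator, chunk_size))
            if not chunk:
                break
            supports = [sup / num_itemsets for sup, iset in chunk]
            itemsets = [list(iset) for sup, iset in chunk]
            df = pd.DataFrame({"support": supports, "itemsets": itemsets})
            if colname_map is not None:
                # Map column indices back to product names, as mlxtend does
                df["itemsets"] = df["itemsets"].apply(
                    lambda x: [colname_map[i] for i in x])
            table = pa.Table.from_pandas(df)
            if writer is None:
                writer = pq.ParquetWriter(out_path, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
    return out_path

The caller can then read the Parquet file back lazily (or in chunks) instead of receiving one giant in-memory DataFrame.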

What are the pros and cons of this solution? It needs no additional hardware and incurs no additional cost: it is a pure software solution. It would also be the easiest to hand over to the TP, as it introduces no additional technical complexity.

The con is that this would require us to make a pull request to the mlxtend repo, which may or may not be accepted. We must also find a way to make the returned chunked DataFrame compatible with a regular, non-chunked DataFrame. But if accepted this would be a public service to others using the repo too. I know Siyang has made PRs to an open-source library in the past that have been well-received by the community.

Conclusions

I think one key takeaway here is to understand the problem domain well, and not to be afraid to dig into the source code. Knowing what we now know about the problem domain, it's clear that solution 1 would not be sufficient as the output size explodes with an increase in input size. Solution 2 does work but involves significant technical complexity. In my opinion solution 3 is probably best.

We could also sidestep the problem entirely: reduce size_remain or items by setting a larger min_support threshold. I know this is what XZ has been working on. This does come at the expense of accuracy/not having enough recommendations, however.

Acknowledgements

Thanks to Siyang and XZ for telling me about this problem.

Appendix

items and i grow as min_support shrinks.

We can see this by looking at the source code for setup_fptree (here). The line is items = np.nonzero(item_support >= min_support)[0].
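A toy version of that line shows the effect; the support values below are made up:

import numpy as np

item_support = np.array([0.9, 0.6, 0.4, 0.2, 0.1])      # per-item supports
print(np.nonzero(item_support >= 0.1)[0])  # [0 1 2 3 4] -> 5 frequent items
print(np.nonzero(item_support >= 0.5)[0])  # [0 1]       -> 2 frequent items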