Some ways to deal with big(-ish) data: 'why do we get an OutOfMemory error in FPGrowth?'
Wed Mar 31 2021
tags: debugging, programming, imda
(Written for work)
Introduction
Xin Zheng mentioned that she was running out of memory when running the Apriori and FP-Growth algorithms for the recommender systems project.
After some research, I found the reason for the error, and here I present a few solutions which might help other data scientists too.
Thanks also to Siyang for being a sounding board for my ideas.
The problem
Xin Zheng mentioned that she was running out of memory when running the Apriori and FP-Growth algorithms. These algorithms are used to generate frequent itemsets (items that are frequently bought together by customers) so that we can recommend similar items to purchase.
This is the section of her code that was throwing the error:
"""
FP growth algo for product association rules calculation
* input item order-product name matrix
* output rules
"""
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules
import time
class FPGrowthRules:
def run(self, df_matrix=None, min_support=0.5, min_confidence=0.25):
frequent_itemsets = fpgrowth(df_matrix, min_support=min_support, use_colnames=True)
# ... stuff elided ...
return rules
In particular, it was this invocation of fpgrowth from the mlxtend package:
frequent_itemsets = fpgrowth(df_matrix, min_support=min_support, use_colnames=True)
Here fpgrowth takes in a DataFrame and returns a DataFrame. Specifically, it computes frequent itemsets from a one-hot DataFrame. A one-hot DataFrame looks like this:
Apple Bananas Beer Chicken Milk Rice
0 True False True True False True
1 True False True False False True
2 True False True False False False
3 True True False False False False
4 False False True True True True
5 False False True False True True
6 False False True False True False
7 True True False False False False
fpgrowth returns a pandas DataFrame with columns ['support', 'itemsets'], containing all itemsets whose support is >= min_support and whose length is < max_len. You can find usage examples here.
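To make this concrete, here is a small, self-contained example that builds the one-hot DataFrame shown above and runs fpgrowth on it (the min_support of 0.6 is just an illustrative choice):

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth

# One-hot basket data matching the table above
df = pd.DataFrame({
    'Apple':   [True, True, True, True, False, False, False, True],
    'Bananas': [False, False, False, True, False, False, False, True],
    'Beer':    [True, True, True, False, True, True, True, False],
    'Chicken': [True, False, False, False, True, False, False, False],
    'Milk':    [False, False, False, False, True, True, True, False],
    'Rice':    [True, True, False, False, True, True, False, False],
})

# Returns a DataFrame with 'support' and 'itemsets' columns,
# e.g. {Beer} with support 0.75 and {Apple} with support 0.625
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)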
After looking at her code and the source code of the mlxtend library, I found the cause of the error and came up with three plausible solutions, each with its own set of advantages and disadvantages.
Solution 1: use a bigger machine (vertical scaling)
The simplest way to solve the out-of-memory issue is to use a larger machine. XZ's machine (27GB of RAM and 10GB of disk) runs out of memory, but 27GB is not that much memory in the grand scheme of things; we can get instances with much more memory.
The pro of this solution is that it requires no developer effort. The cons are that it will incur additional cost, and the TP we pass it to may not be able to spin up larger machines. And of course, this solution is not very scalable, because there is a hard cap on how much memory we can provision. So if the out-of-memory issue is caused by memory usage that scales poorly with the input size, or worse, this would not be a good solution.
Solution 2: use Apache Spark (horizontal scaling)
A more complicated approach is to use Apache Spark to scale out. This involves rewriting Xin Zheng's code as a Spark job, using a parallel, Spark-friendly version of fpgrowth, and then finding some way to spin up a Spark cluster and run the job.
The pro of this solution is that it can scale to much larger datasets. The cons are that it introduces a lot of technical complexity, both in the code and in the deployment, and it will also make testing more difficult. Additionally, TPs may not have the technical chops to provision their own Spark cluster.
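To give a sense of what the rewrite would involve, here is a minimal sketch using Spark's built-in FP-Growth implementation (pyspark.ml.fpm.FPGrowth). Note that it takes each transaction as an array of items rather than a one-hot matrix; the column names and the tiny inline dataset are purely illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("fpgrowth-demo").getOrCreate()

# One row per order, with the purchased items as an array column
orders = spark.createDataFrame([
    (0, ["Apple", "Beer", "Chicken", "Rice"]),
    (1, ["Apple", "Beer", "Rice"]),
    (2, ["Apple", "Beer"]),
    (3, ["Apple", "Bananas"]),
], ["order_id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.25)
model = fp.fit(orders)

model.freqItemsets.show()      # frequent itemsets and their counts
model.associationRules.show()  # rules with confidence and lift

Because the work is distributed across the cluster's executors, memory scales with the number of machines rather than being capped by a single box.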
Solution 3: replace mlxtend's source code (software solution)
The last solution requires us to actually understand the problem. Let's figure out exactly what is causing the OOM error.
Recall that fpgrowth takes in a one-hot DataFrame that looks like this:
Apple Bananas Beer Chicken Milk Rice
0 True False True True False True
1 True False True False False True
2 True False True False False False
3 True True False False False False
4 False False True True True True
5 False False True False True True
6 False False True False True False
7 True True False False False False
and returns a pandas DataFrame with columns ['support', 'itemsets'], containing all itemsets whose support is >= min_support and whose length is < max_len.
It's clear that the smaller min_support is, the more itemsets will be returned. If min_support is small enough, we will get an OutOfMemory error. But how are these itemsets generated? Can we pin down the exact line that is causing the error? How do we fix the OutOfMemory error?
Let's dive into the fpgrowth source code here. We can see that the fpgrowth function calls fpc.setup_fptree on line 84 to build an FPTree object, then runs fpg_step recursively on that tree.
I did some additional digging and found the most likely cause: line 113, inside fpg_step:
# fpg_step
if max_len:
    size_remain = max_len - len(tree.cond_items) + 1
for i in range(1, size_remain):
    for itemset in itertools.combinations(items, i):  # <-- this line!
        count += 1
        support = min([tree.nodes[i][0].count for i in itemset])
        yield support, tree.cond_items + list(itemset)
Here itertools.combinations(items, i) is almost certainly behind the OOM error: when items and i are large, the number of combinations is enormous, on the order of C(n, i), where n is the number of frequent items (see the Appendix for an explanation). So the number of itemsets grows non-polynomially with the size of the DataFrame. But this line by itself is not the problem, because itertools.combinations returns a lazy generator, which will not OOM on its own.
Rather, the issue is the line immediately after the generator is created, where fpc.generate_itemsets consumes it:
# fpgrowth.py
# ...
def fpgrowth(...):
    # ....
    generator = fpg_step(tree, minsup, colname_map, max_len, verbose)
    return fpc.generate_itemsets(generator, len(df.index), colname_map)  # <- this line

# fpcommon.py
def generate_itemsets(generator, num_itemsets, colname_map):
    itemsets = []
    supports = []
    for sup, iset in generator:
        itemsets.append(frozenset(iset))
        supports.append(sup / num_itemsets)

    res_df = pd.DataFrame({'support': supports, 'itemsets': itemsets})

    if colname_map is not None:
        res_df['itemsets'] = res_df['itemsets'] \
            .apply(lambda x: frozenset([colname_map[i] for i in x]))

    return res_df
Looking at the generate_itemsets function, we can see that itemsets and supports are appended to for each itemset the generator yields, and we know that there are a lot of itemsets. These two ever-growing lists are what cause the OOM error.
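To see why the generator itself is cheap but materializing it is not, consider this small illustration (the sizes are arbitrary):

import itertools

# Creating the generator is instant and uses almost no memory, even though
# it represents roughly 1e29 combinations of 50 elements out of 100...
gen = itertools.combinations(range(100), 50)

# ...but materializing every element (which is effectively what
# generate_itemsets does by appending to a list) is what blows up:
# all_combos = list(gen)  # would exhaust memory; don't run this
print(next(gen))  # pulling combinations lazily, one at a time, is cheap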
So what can we do? There are several possible solutions. The first is simply to use a higher min_support value so that we don't run out of memory, but that is not so much a solution as it is giving up. I think one real solution is to generate the DataFrame in chunks and persist it to disk. This way we don't have to append every itemset to itemsets before generating the DataFrame.
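As a rough illustration, here is a minimal sketch of what a chunked generate_itemsets could look like. This is not mlxtend's actual API: the function name, the CSV output, and the chunk size are assumptions for illustration, and it assumes the generator yields (support count, itemset) pairs the way fpg_step does.

import itertools
import pandas as pd

def generate_itemsets_chunked(generator, num_itemsets, colname_map,
                              out_path, chunk_size=100_000):
    # Hypothetical chunked variant of fpc.generate_itemsets: instead of
    # accumulating every (support, itemset) pair in two in-memory lists,
    # consume the generator chunk_size items at a time and append each
    # chunk to a file on disk.
    first = True
    while True:
        chunk = list(itertools.islice(generator, chunk_size))
        if not chunk:
            break
        supports = [sup / num_itemsets for sup, _ in chunk]
        itemsets = [frozenset(iset) for _, iset in chunk]
        res_df = pd.DataFrame({'support': supports, 'itemsets': itemsets})
        if colname_map is not None:
            res_df['itemsets'] = res_df['itemsets'].apply(
                lambda x: frozenset([colname_map[i] for i in x]))
        res_df.to_csv(out_path, mode='w' if first else 'a',
                      header=first, index=False)
        first = False
    return out_path

Callers would then read the results back with pd.read_csv(out_path, chunksize=...) rather than receiving a single in-memory DataFrame, which is exactly the compatibility question raised below.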
What are the pros and cons of this solution? It needs no additional hardware and incurs no additional cost: it is a pure software solution. It would also be the easiest to hand over to the TP, as it introduces no additional technical complexity. The cons are that it would require us to make a pull request to the mlxtend repo, which may or may not be accepted, and we must also find a way to make the returned chunked DataFrame compatible with a regular, non-chunked DataFrame. But if accepted, this would be a public service to others using the library too. I know Siyang has made PRs to an open-source library in the past that have been well-received by the community.
Conclusions
I think one key takeaway here is to understand the problem domain well, and not to be afraid to dig into the source code. Knowing what we now know about the problem domain, it's clear that solution 1 would not be sufficient as the output size explodes with an increase in input size. Solution 2 does work but involves significant technical complexity. In my opinion solution 3 is probably best.
We could also sidestep the problem entirely: reduce size_remain or items by setting a larger min_support threshold. I know this is what XZ has been working on. This does come at the expense of accuracy and of not having enough recommendations, however.
Acknowledgements
Thanks to Siyang and XZ for telling me about this problem.
Appendix
items and i grow as min_support shrinks. We can see this by looking at the source code for setup_fptree (here). The relevant line is items = np.nonzero(item_support >= min_support)[0].
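For the combinatorial blow-up itself: the number of length-i itemsets drawn from n frequent items is the binomial coefficient C(n, i) = n! / (i! (n - i)!), and fpg_step sums over all lengths i up to size_remain. A quick back-of-the-envelope calculation (50 items and a maximum length of 10 are arbitrary choices) shows how fast this explodes:

from math import comb

# With n frequent items, there are C(n, i) candidate itemsets of length i.
n = 50
total = sum(comb(n, i) for i in range(1, 11))
print(f"{total:,}")  # 13,432,735,555 candidate itemsets of length <= 10

So even a modest number of frequent items can produce far more itemsets than generate_itemsets can hold in memory.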