I have a large collection of items, and a list of independent filters (boolean f
ID: 649864 • Letter: I
Question
I have a large collection of items, and a list of independent filters (boolean functions). I want to find the collection of items that pass all of my filters as quickly as possible. This must involve looping over every item in the collection and applying each filter to each item. An item will be rejected early if any one of the filters fails during this process.
Each filter has a runtime that varies significantly. And, we know a-priori what percent of the collection each filter will filter out. Given this, how do I find the ordering I should apply my filters to each item with the fasted expected run-time overall?
For example, filter A runs in 5 ms and filters out 50% of the collection on average. Filter B runs in 1 ms and filters out 25% of the collection on average. From this, we know that ordering A,B gives 5 + 0.5 * 1 = 5.5 ms average runtime, and B,A gives 1 + 0.75 * 5 = 4.75 ms average runtime. So B,A is our fastest ordering.
I think this admits a dynamic programming solution (since fastest ordering is runtime of first filter + (1 - filter fraction) * (optimal ordering of the rest), but I was wondering if this problem has a name and has been studied before? Could somebody point me to any papers?
EDIT: also, I forgot that the probability of filtering and the run-times will be correlated between filters, so my analysis above is wrong and the problem is harder. I think I can solve this with enough time but I want to catch up on prior-work before I start, if there is any (it seems like it could be a common problem). I haven't had any luck finding anything so far though.
Explanation / Answer
Independent filters
If the filters are independent, a greedy algorithm can be used to find the optimal order. Sort the filters by increasing value of t/p, where t is the time to run the filter and p is the fraction of items that are filtered out by the filter. This gives the optimal order.
To give some intuition for why this might be a plausible answer: ideally, we'd prefer filters that take less time; and ideally, we'd prefer filters that filter out a higher fraction of items. Intuitively, you can see how t/p somehow incorporates both of those.
A formal proof of optimality involves an exchange argument. I won't detail the proof here, as it can be derived easily using standard techniques.
Correlated filters
If there can be correlations between the filters (i.e., they are not independent), the optimal order cannot be determined from the information provided without knowledge of the correlations.
Here's an example, based on making a small modification to your example. Suppose we have two filters, filter A and filter B. Suppose filter A runs in 4.5ms and filters out 50% on average; filter B runs in 1ms and filters out 25% on average. Also suppose that anything filtered out by filter B is also filtered out by filter A. Then the optimal order is to just do filter A. Your formulas give A,B: 4.5+0.5
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.