Fix ScalableBloomFilter and tests on top of #5#6
Conversation
Empirical measurement of the false positive rate of BloomFilter is substantially lower that the requested error rates. For example, a BloomFilter with capacity 20,000 and an error rate of .01, gives actual error rate .0003. This is because the bit size allocated to the filter is unnecessarily large, and does not match the theoretically calculated required size.
|
The bad news is that ScalableBloomFilter is definitely implemented incorrectly, if you do the math it converges to a much higher error rate than it's supposed to, because the first filter starts off with the target error rate so it can only get worse from there since checking each filter is an independent event. It appears that this bug has also propagated to other implementations inspired by this one [1]. Should probably tell them once it's fixed. Per the paper and other implementations [2] the error rate for each bloom filter in the SBF has to be calculated much differently. Specifically the sum([err * (1-ratio) * pow(ratio, n) for n in [0..inf]]) should converge to err. The implementation in [2] calculates the error rate of the next filter by just doing a multiply of the previous error rate, which is probably what we should do too. (err is 0.001 in this case, with ratio of 0.9) [1] http://code.google.com/p/scalable-bloom-filter/source/browse/trunk/trunk/scalable-bloom-filter/src/main/java/com/elaunira/sbf/ScalableBloomFilter.java |
Fix ScalableBloomFilter and tests on top of #5
This takes the branch from pr #5 that fixes BF implementation, fixes SBF, and makes the tests pass again.