<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Bloom Filters in Scala (and all the fun that they bring)</title>
	<atom:link href="http://www.codecommit.com/blog/scala/bloom-filters-in-scala/feed" rel="self" type="application/rss+xml" />
	<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala</link>
	<description>(permanently in beta)</description>
	<lastBuildDate>Sun, 29 Aug 2010 20:01:44 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Eric Bowman</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4913</link>
		<dc:creator>Eric Bowman</dc:creator>
		<pubDate>Sun, 06 Dec 2009 09:07:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4913</guid>
		<description>We solved the &quot;what language is this phrase&quot; problem using a naive Bayesian classifier.  Works great; it doesn&#039;t take long to train if you pick the training set well, and it uses much, much less memory.</description>
		<content:encoded><![CDATA[<p>We solved the &#8220;what language is this phrase&#8221; problem using a naive Bayesian classifier.  Works great; it doesn&#8217;t take long to train if you pick the training set well, and it uses much, much less memory.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jesse Farmer</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4140</link>
		<dc:creator>Jesse Farmer</dc:creator>
		<pubDate>Tue, 14 Oct 2008 04:50:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4140</guid>
		<description>I made this same point on the original WhatLanguage blog post, but Bloom Filters are not the way to do this.  Anything that operates on the word level will have trouble with morphological artifacts, e.g., pluralization, verb tenses, etc.

It gets much more complicated in agglutinative languages like Hungarian and Finnish, and even in languages like German, which can basically synthesize arbitrary words and still remain intelligible German.

A better way to determine a language is to use an n-gram model.  The distribution of n-grams across a language is a great fingerprint and you can identify most languages within a word or two.

Cheers,
Jesse</description>
		<content:encoded><![CDATA[<p>I made this same point on the original WhatLanguage blog post, but Bloom Filters are not the way to do this.  Anything that operates on the word level will have trouble with morphological artifacts, e.g., pluralization, verb tenses, etc.</p>
<p>It gets much more complicated in agglutinative languages like Hungarian and Finnish, and even in languages like German, which can basically synthesize arbitrary words and still remain intelligible German.</p>
<p>A better way to determine a language is to use an n-gram model.  The distribution of n-grams across a language is a great fingerprint and you can identify most languages within a word or two.</p>
<p>Cheers,<br />
Jesse</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Cooper</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4139</link>
		<dc:creator>Peter Cooper</dc:creator>
		<pubDate>Tue, 14 Oct 2008 02:01:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4139</guid>
		<description>Awesome writeup! Very interesting. Glad you liked the WhatLanguage work. It turns out that there are other statistical ways to analyze and determine the language that might be significantly more efficient (not just in speed but memory too) - many of these are linked in the WhatLanguage post on Ruby Inside.

That aside, this is a great introduction to the topic and it was interesting to see the Scala code. I own scalainside.com, but unfortunately never got into the language enough to go full steam into it.</description>
		<content:encoded><![CDATA[<p>Awesome writeup! Very interesting. Glad you liked the WhatLanguage work. It turns out that there are other statistical ways to analyze and determine the language that might be significantly more efficient (not just in speed but memory too) &#8211; many of these are linked in the WhatLanguage post on Ruby Inside.</p>
<p>That aside, this is a great introduction to the topic and it was interesting to see the Scala code. I own scalainside.com, but unfortunately never got into the language enough to go full steam into it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Spiewak</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4134</link>
		<dc:creator>Daniel Spiewak</dc:creator>
		<pubDate>Mon, 13 Oct 2008 20:21:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4134</guid>
		<description>@Ian

Well, not *quite* blind loyalty.  :-)  I really, really like the benefits that immutable data structures carry with them.  I mean, think about it, how many times have you done something like this by accident?

public class Bean {
private String[] data;

public String[] getData() {
return data;
}
}

There&#039;s no setter for data, but it doesn&#039;t really matter; anyone can call getData() and then change the internals.  It&#039;s even worse with more complex data structures like List and Map.  And don&#039;t even get me started on thread synchronization issues...  :-)

In terms of performance, immutable data structures can be pretty darn fast.  My benchmarking of Rich Hickey&#039;s persistent vector shows it to be within 5x-10x of the speed of an ArrayList on &quot;writes&quot; (remember, a new Vector each time), and between 1x and 2x the speed on reads.  It&#039;s hard to argue that these factors constitute a significant performance hit, especially when we remember that lookups in an ArrayList are on the order of 0.5-1ms.

It should be possible to implement a bitset immutably using much the same technique as is used in the persistent vector.  However, rather than the lowest level of the trie being composed of a 32-element array, this level would be a single 32-bit integer.  Because of the speed of integer copying on modern processors (constant), this sort of data structure would actually be phenomenally faster than a general-purpose Vector.  Granted, this is to be expected since it is limited to bit values, but it&#039;s still a cool conclusion to trot out.  :-)

With this sort of implementation, I don&#039;t think you would really have to discard any of the benefits of a bloom filter.  Granted, it would certainly have a higher &quot;write&quot; overhead than a mutable bloom filter, but its reads would be very nearly (if not exactly) as fast.  The &quot;write&quot; overhead wouldn&#039;t be too much of a concern because bloom filters are usually used immutably anyway.  In short, I think the benefits of a fully immutable bloom filter would far outweigh the minimal performance penalty imposed by such a structure.

P.S. Oh, I&#039;m not really claiming that immutable data structures are computationally as efficient as their mutable counterparts.  Even Clojure&#039;s persistent vector is really only log32(n) for reads, which is close to constant time but still not quite.  However, in most scenarios that I work on, immutable data structures lend considerably to the correctness and logical &quot;reasonability&quot; of the code.  Add in their inherently stellar throughput when dealing with concurrency, and I think the overall benefits are more than sufficiently compelling (at least, they are to me).</description>
		<content:encoded><![CDATA[<p>@Ian</p>
<p>Well, not *quite* blind loyalty.  <img src='http://www.codecommit.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   I really, really like the benefits that immutable data structures carry with them.  I mean, think about it, how many times have you done something like this by accident?</p>
<p>public class Bean {<br />
private String[] data;</p>
<p>public String[] getData() {<br />
return data;<br />
}<br />
}</p>
<p>There&#8217;s no setter for data, but it doesn&#8217;t really matter; anyone can call getData() and then change the internals.  It&#8217;s even worse with more complex data structures like List and Map.  And don&#8217;t even get me started on thread synchronization issues&#8230;  <img src='http://www.codecommit.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>In terms of performance, immutable data structures can be pretty darn fast.  My benchmarking of Rich Hickey&#8217;s persistent vector shows it to be within 5x-10x of the speed of an ArrayList on &#8220;writes&#8221; (remember, a new Vector each time), and between 1x and 2x the speed on reads.  It&#8217;s hard to argue that these factors constitute a significant performance hit, especially when we remember that lookups in an ArrayList are on the order of 0.5-1ms.</p>
<p>It should be possible to implement a bitset immutably using much the same technique as is used in the persistent vector.  However, rather than the lowest level of the trie being composed of a 32-element array, this level would be a single 32-bit integer.  Because of the speed of integer copying on modern processors (constant), this sort of data structure would actually be phenomenally faster than a general-purpose Vector.  Granted, this is to be expected since it is limited to bit values, but it&#8217;s still a cool conclusion to trot out.  <img src='http://www.codecommit.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>With this sort of implementation, I don&#8217;t think you would really have to discard any of the benefits of a bloom filter.  Granted, it would certainly have a higher &#8220;write&#8221; overhead than a mutable bloom filter, but its reads would be very nearly (if not exactly) as fast.  The &#8220;write&#8221; overhead wouldn&#8217;t be too much of a concern because bloom filters are usually used immutably anyway.  In short, I think the benefits of a fully immutable bloom filter would far outweigh the minimal performance penalty imposed by such a structure.</p>
<p>P.S. Oh, I&#8217;m not really claiming that immutable data structures are computationally as efficient as their mutable counterparts.  Even Clojure&#8217;s persistent vector is really only log32(n) for reads, which is close to constant time but still not quite.  However, in most scenarios that I work on, immutable data structures lend considerably to the correctness and logical &#8220;reasonability&#8221; of the code.  Add in their inherently stellar throughput when dealing with concurrency, and I think the overall benefits are more than sufficiently compelling (at least, they are to me).</p>
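<p>A minimal sketch of the bitset described above, assuming a fixed two-level trie (32 nodes of 32 words, 32,768 bits in total) rather than the variable-depth structure of the persistent vector. The class name PersistentBits is hypothetical, and Ruby is used only to keep the sketch short:</p>
```ruby
# Hypothetical sketch: an immutable bitset whose leaves are 32-bit words.
# Fixed depth of two levels: 32 nodes x 32 words x 32 bits = 32,768 bits.
class PersistentBits
  BITS = 32
  CAPACITY = BITS * BITS * BITS
  EMPTY_NODE = Array.new(BITS, 0).freeze

  def initialize(root = Array.new(BITS, EMPTY_NODE).freeze)
    @root = root
  end

  # True if bit i is set.
  def get(i)
    node, word, bit = locate(i)
    @root[node][word][bit] == 1
  end

  # Returns a new bitset with bit i set; the receiver is left unchanged.
  # Only the root and one node are copied; the other 31 nodes are shared.
  def set(i)
    node, word, bit = locate(i)
    new_node = @root[node].dup
    new_node[word] |= (1 << bit)
    new_root = @root.dup
    new_root[node] = new_node.freeze
    PersistentBits.new(new_root.freeze)
  end

  private

  # Splits a bit index into (node, word, bit) coordinates.
  def locate(i)
    raise IndexError, "bit #{i} out of range" unless (0...CAPACITY).cover?(i)
    [i / (BITS * BITS), (i / BITS) % BITS, i % BITS]
  end
end

empty = PersistentBits.new
one = empty.set(1000)
p empty.get(1000) # false
p one.get(1000)   # true
```
<p>Each write copies two 32-slot arrays regardless of which bit changes, while a read is two array lookups plus a bit test, which is the trade-off the comment describes.</p>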
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian Clarke</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4131</link>
		<dc:creator>Ian Clarke</dc:creator>
		<pubDate>Mon, 13 Oct 2008 19:53:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4131</guid>
		<description>@Daniel, using an immutable datastructure for a BloomFilter is a good example of where blind loyalty to immutability is a bad idea :-)  

I&#039;m not terribly familiar with how a bitset would be implemented in an immutable manner, but my guess is that you would throw away much of the performance benefit of using a BloomFilter.

Immutability is beneficial in many, but not every circumstance.</description>
		<content:encoded><![CDATA[<p>@Daniel, using an immutable datastructure for a BloomFilter is a good example of where blind loyalty to immutability is a bad idea <img src='http://www.codecommit.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   </p>
<p>I&#8217;m not terribly familiar with how a bitset would be implemented in an immutable manner, but my guess is that you would throw away much of the performance benefit of using a BloomFilter.</p>
<p>Immutability is beneficial in many, but not every circumstance.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Spiewak</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4130</link>
		<dc:creator>Daniel Spiewak</dc:creator>
		<pubDate>Mon, 13 Oct 2008 17:56:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4130</guid>
		<description>As a followup, I experimented a bit with the floating width.  These are the results from before its implementation:

wl.processText(&quot;This is a very long string which contains a lot of different words.  The goal is to try to confuse the language processor and draw out some more contrasting results in terms of language scoring&quot;)

=&gt; Map(Portuguese -&gt; 10, English -&gt; 31, Swedish -&gt; 5, Farsi -&gt; 3, German -&gt; 1, Spanish -&gt; 7, French -&gt; 13, Dutch -&gt; 15, Pinyin -&gt; 1)

And afterward:

=&gt; Map(Portuguese -&gt; 9, English -&gt; 30, Swedish -&gt; 7, Farsi -&gt; 2, German -&gt; 1, Spanish -&gt; 3, French -&gt; 13, Dutch -&gt; 17, Pinyin -&gt; 1)

So oddly enough, it seems that the floating width has a slightly deleterious effect on accuracy, at least with this particular string.  Overall accuracy should have been improved, particularly with strings containing Russian words.  The only unfortunate consequence is that the memory requirements have increased, because the average width is now higher than 2,000,000.  Still, an interesting experiment to be sure.</description>
		<content:encoded><![CDATA[<p>As a followup, I experimented a bit with the floating width.  These are the results from before its implementation:</p>
<p>wl.processText(&#8220;This is a very long string which contains a lot of different words.  The goal is to try to confuse the language processor and draw out some more contrasting results in terms of language scoring&#8221;)</p>
<p>=> Map(Portuguese -> 10, English -> 31, Swedish -> 5, Farsi -> 3, German -> 1, Spanish -> 7, French -> 13, Dutch -> 15, Pinyin -> 1)</p>
<p>And afterward:</p>
<p>=> Map(Portuguese -> 9, English -> 30, Swedish -> 7, Farsi -> 2, German -> 1, Spanish -> 3, French -> 13, Dutch -> 17, Pinyin -> 1)</p>
<p>So oddly enough, it seems that the floating width has a slightly deleterious effect on accuracy, at least with this particular string.  Overall accuracy should have been improved, particularly with strings containing Russian words.  The only unfortunate consequence is that the memory requirements have increased, because the average width is now higher than 2,000,000.  Still, an interesting experiment to be sure.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Spiewak</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4129</link>
		<dc:creator>Daniel Spiewak</dc:creator>
		<pubDate>Mon, 13 Oct 2008 17:06:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4129</guid>
		<description>@Charles

Your note on performance actually makes a lot of sense.  I was running this with JRuby 1.1.2.  For the record, I wasn&#039;t really expecting (or demanding) performance approaching Scala&#039;s in JRuby, I was just somewhat surprised to see how divergent it really was.

The String conversion was a beast as well.  I actually had to force the conversion by explicitly creating a new instance of java.lang.String, which makes things all the worse.

@Asd and @Ian

I&#039;m a little surprised that the JVM doesn&#039;t optimize boolean[].  Considering all the other complicated optimizations performed by the JIT, this would seem to me to be low hanging fruit.

In any case though, you&#039;re right that the JVM isn&#039;t optimizing BloomSet down to one bit per element.  Even if it could optimize boolean[], Vector isn&#039;t using a typed array under the surface.  It&#039;s actually even worse than a straight array of primitive booleans due to the fact that every boolean has to be stored in its boxed form.  Watching the memory usage of an example application using WhatLanguage shows that life isn&#039;t *too* terrible, but it&#039;s still a lot more than I would have liked.

Thinking about it, the best way to solve this would be (as you say) to create a new persistent data structure based on the Vector concept which stores bits in Integer masks.  Conveniently, this would still preserve all of the exact properties of Clojure&#039;s persistent vector since integers contain 32 bits.  This sort of data structure would not be possible just as a &quot;specialization&quot; of Vector, but the same ideas could be cross-applied through (mostly) copy-paste coding.

Java&#039;s BitSet isn&#039;t really an option because it is mutable.  I really strongly believe in using immutable data structures except where they carry a severe performance hit.  Immutable data structures can be used mutably if you really need to, but they also carry all of the benefits of air-tight thread safety and dirt cheap versioning (not to mention correctness assurance in cases where a return value may expose implementation details).

@Asd (again)

It would seem that you&#039;re right about hash.  I guess that&#039;s what happens when I try to re-tune algorithms at the last minute...  :-S  Originally, I had a very different version of hash based on Random (@Tim).  As I was actually writing the article, I attempted to pare down the implementation just a bit so as to improve the readability.  Seems I went just a little too far...

Considering that there is exactly one integer value (out of 2^32) that reproduces this issue, I think I&#039;m pretty safe.  :-)  Do you know of another function which does better?  I suppose I could just do an unsigned right-shift of everything by one place, which always yields a non-negative result with the more natural top end: -1 &gt;&gt;&gt; 1 == Integer.MAX_VALUE.

@Tim

Character repetition frequency is an interesting way of detecting language.  This would be a lot more memory-efficient than assembling bloom filters.  However, I&#039;m not entirely sure that it would be as accurate.  The nice thing about the dictionary approach is that it is fairly easy to analyze its properties and discover why it is wrong in the cases where it is.  This contrasts pretty heavily with a more statistical approach, which would be very difficult to debug.  I&#039;m sure there are pathological English sentences that look a lot like French, German or even Dutch (considering the similarities in the languages).

@Thomas

I thought about letting the width float, but apparently you&#039;re more ambitious than I am in terms of calculating optimizations.  :-)  The BloomSet API is capable of accepting a configurable width, so this would only require changes in the GenLangs.scala script.  Thanks for the pointer!

@Paolo

When I initially looked at the algorithm, it seemed closer to O(n^2), but somehow my second glance told me it was O(n!).  I&#039;ll correct the article.  Thanks for catching my error!</description>
		<content:encoded><![CDATA[<p>@Charles</p>
<p>Your note on performance actually makes a lot of sense.  I was running this with JRuby 1.1.2.  For the record, I wasn&#8217;t really expecting (or demanding) performance approaching Scala&#8217;s in JRuby, I was just somewhat surprised to see how divergent it really was.</p>
<p>The String conversion was a beast as well.  I actually had to force the conversion by explicitly creating a new instance of java.lang.String, which makes things all the worse.</p>
<p>@Asd and @Ian</p>
<p>I&#8217;m a little surprised that the JVM doesn&#8217;t optimize boolean[].  Considering all the other complicated optimizations performed by the JIT, this would seem to me to be low hanging fruit.</p>
<p>In any case though, you&#8217;re right that the JVM isn&#8217;t optimizing BloomSet down to one bit per element.  Even if it could optimize boolean[], Vector isn&#8217;t using a typed array under the surface.  It&#8217;s actually even worse than a straight array of primitive booleans due to the fact that every boolean has to be stored in its boxed form.  Watching the memory usage of an example application using WhatLanguage shows that life isn&#8217;t *too* terrible, but it&#8217;s still a lot more than I would have liked.</p>
<p>Thinking about it, the best way to solve this would be (as you say) to create a new persistent data structure based on the Vector concept which stores bits in Integer masks.  Conveniently, this would still preserve all of the exact properties of Clojure&#8217;s persistent vector since integers contain 32 bits.  This sort of data structure would not be possible just as a &#8220;specialization&#8221; of Vector, but the same ideas could be cross-applied through (mostly) copy-paste coding.</p>
<p>Java&#8217;s BitSet isn&#8217;t really an option because it is mutable.  I really strongly believe in using immutable data structures except where they carry a severe performance hit.  Immutable data structures can be used mutably if you really need to, but they also carry all of the benefits of air-tight thread safety and dirt cheap versioning (not to mention correctness assurance in cases where a return value may expose implementation details).</p>
<p>@Asd (again)</p>
<p>It would seem that you&#8217;re right about hash.  I guess that&#8217;s what happens when I try to re-tune algorithms at the last minute&#8230;  :-S  Originally, I had a very different version of hash based on Random (@Tim).  As I was actually writing the article, I attempted to pare down the implementation just a bit so as to improve the readability.  Seems I went just a little too far&#8230;</p>
<p>Considering that there is exactly one integer value (out of 2^32) that reproduces this issue, I think I&#8217;m pretty safe.  <img src='http://www.codecommit.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   Do you know of another function which does better?  I suppose I could just do an unsigned right-shift of everything by one place, which always yields a non-negative result with the more natural top end: -1 >>> 1 == Integer.MAX_VALUE.</p>
<p>@Tim</p>
<p>Character repetition frequency is an interesting way of detecting language.  This would be a lot more memory-efficient than assembling bloom filters.  However, I&#8217;m not entirely sure that it would be as accurate.  The nice thing about the dictionary approach is that it is fairly easy to analyze its properties and discover why it is wrong in the cases where it is.  This contrasts pretty heavily with a more statistical approach, which would be very difficult to debug.  I&#8217;m sure there are pathological English sentences that look a lot like French, German or even Dutch (considering the similarities in the languages).</p>
<p>@Thomas</p>
<p>I thought about letting the width float, but apparently you&#8217;re more ambitious than I am in terms of calculating optimizations.  <img src='http://www.codecommit.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   The BloomSet API is capable of accepting a configurable width, so this would only require changes in the GenLangs.scala script.  Thanks for the pointer!</p>
<p>@Paolo</p>
<p>When I initially looked at the algorithm, it seemed closer to O(n^2), but somehow my second glance told me it was O(n!).  I&#8217;ll correct the article.  Thanks for catching my error!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Paolo Bonzini</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4126</link>
		<dc:creator>Paolo Bonzini</dc:creator>
		<pubDate>Mon, 13 Oct 2008 14:51:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4126</guid>
		<description>It&#039;s not the factorial, it&#039;s n+(n-1)+(n-2)+... = n(n+1)/2.

If it were the factorial, even 30 hash functions would be waaaaay too much.</description>
		<content:encoded><![CDATA[<p>It&#8217;s not the factorial, it&#8217;s n+(n-1)+(n-2)+&#8230; = n(n+1)/2.</p>
<p>If it were the factorial, even 30 hash functions would be waaaaay too much.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ian Clarke</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4125</link>
		<dc:creator>Ian Clarke</dc:creator>
		<pubDate>Mon, 13 Oct 2008 13:50:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4125</guid>
		<description>Just want to second Asd&#039;s first point - you are wasting 7 bits for every bit you store by using a Vector, especially unfortunate given that Bloom Filters are supposed to save space.  Use a BitSet.</description>
		<content:encoded><![CDATA[<p>Just want to second Asd&#8217;s first point &#8211; you are wasting 7 bits for every bit you store by using a Vector, especially unfortunate given that Bloom Filters are supposed to save space.  Use a BitSet.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas Hurst</title>
		<link>http://www.codecommit.com/blog/scala/bloom-filters-in-scala/comment-page-1#comment-4124</link>
		<dc:creator>Thomas Hurst</dc:creator>
		<pubDate>Mon, 13 Oct 2008 12:17:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.codecommit.com/blog/scala/bloom-filters-in-scala#comment-4124</guid>
		<description>Why fix the size of the bloomfilter?  You get much more efficient structures if you scale k and m, and a much neater API; you just tell it how many items you have and what false positive rate you can put up with, and get back m and k.  You also don&#039;t get insanity like &quot;Optimal K: 3470&quot;.  This was in C, but I converted it to Ruby for you:

# --
def size_filter(capacity, error = 1 / 1000000.0)
  m = ((capacity * Math.log(error)) / Math.log(1.0 / (2.0 ** Math.log(2.0)))).ceil
  k = (Math.log(2.0) * m / capacity).ceil
  {:k =&gt; k, :m =&gt; m}
end

p size_filter(234936, 0.03) # =&gt; {:m=&gt;1714667, :k=&gt;6}
p size_filter(399, 0.03) # =&gt; {:m=&gt;2913, :k=&gt;6}
# --

Also, after finding how poorly &quot;fast&quot; hashes perform on bloomfilters, I ended up just using SHA512; you get excellent distribution, and can make a lot of k&#039;s out of all those bits.</description>
		<content:encoded><![CDATA[<p>Why fix the size of the bloomfilter?  You get much more efficient structures if you scale k and m, and a much neater API; you just tell it how many items you have and what false positive rate you can put up with, and get back m and k.  You also don&#8217;t get insanity like &#8220;Optimal K: 3470&#8221;.  This was in C, but I converted it to Ruby for you:</p>
<p># --<br />
def size_filter(capacity, error = 1 / 1000000.0)<br />
  m = ((capacity * Math.log(error)) / Math.log(1.0 / (2.0 ** Math.log(2.0)))).ceil<br />
  k = (Math.log(2.0) * m / capacity).ceil<br />
  {:k =&gt; k, :m =&gt; m}<br />
end</p>
<p>p size_filter(234936, 0.03) # =&gt; {:m=&gt;1714667, :k=&gt;6}<br />
p size_filter(399, 0.03) # =&gt; {:m=&gt;2913, :k=&gt;6}<br />
# --</p>
<p>Also, after finding how poorly &#8220;fast&#8221; hashes perform on bloomfilters, I ended up just using SHA512; you get excellent distribution, and can make a lot of k&#8217;s out of all those bits.</p>
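<p>A minimal sketch of the SHA512 idea above, assuming the 64-byte digest is cut into sixteen 32-bit words and each word is reduced modulo m. The helper name bloom_indexes and that slicing scheme are illustrative assumptions, not the code from the original C version:</p>
```ruby
require 'digest/sha2'

# Hypothetical helper: derive k bit positions in [0, m) from one SHA-512 digest.
# The digest is 64 bytes, so this scheme yields at most sixteen 32-bit words.
def bloom_indexes(item, k, m)
  raise ArgumentError, 'k must be between 1 and 16' unless (1..16).cover?(k)
  # 'N*' unpacks the digest as big-endian unsigned 32-bit integers.
  words = Digest::SHA512.digest(item).unpack('N*')
  words.first(k).map { |w| w % m }
end

# Six positions for one word, using the m computed by size_filter above.
p bloom_indexes('hello', 6, 1_714_667)
```
<p>One digest per item covers every k up to 16 under this scheme; a larger k would need a second digest (for example, of the item plus a counter), which is left out of the sketch.</p>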
]]></content:encoded>
	</item>
</channel>
</rss>
