Searching with Bloom Filters

RomaJS • June 21, 2023

Luís Rodrigues

Staff Engineer, NewStore

Jamstack

Everything Is Local

Client-Side Search

😊 Self-contained

😊 Works offline

😊 Privacy

😇 Requires JS

😬 Constrained index size

Use Case:

Hundreds of small documents

My Options

Inverted Indices

Lunr, Elasticlunr

Context Indices

FlexSearch 😢

Radix Tree

MiniSearch

Bloom Filters

Bloom Search (this!)

Bloom Filters

Is this thing in that set?

‘Probably yes,’ or

‘Definitely not.’

An array of m bits plus k independent hash functions.

Signature:

Bloom filter containing all the terms in a document.

Improving

Error Rate

How big a filter?

m=nlnε(ln2)2{\displaystyle m=-{\frac {n\ln \varepsilon }{(\ln 2)^{2}}}}

How many hash functions?

k=mnln2{\displaystyle k={\frac {m}{n}}\ln 2}

How to find kk independent hash functions?

Gi ⁣:xH1(x)+iH2(x)+i2modn{G_i \colon x \rightarrow H_1(x) + iH_2(x) + i^2\bmod{n}}

Improving

Relevance

1 document → 1 counting Bloom filter

Required and Excluded Terms

tf—idf

Term Frequency — Inverse Document Frequency

Field Boosting

n-gram search

Improving

Performance

Memoized Hash Functions

Improving

Index Size

Classic Bloom Filters

nn filters by term frequency

Group by approximate term frequency

Stopwords

MessagePack

Comparison

Lunr + Elasticlunr

  • Inverted indices
  • Indexing is O(n)O(n)
  • Searching is O(1)O(1)
  • Powerful fuzzy matching
  • Larger footprint

MiniSearch

  • Radix tree
  • Small footprint
  • Indexing is O(n)O(n)
  • Searching is O(1)O(1)
  • Limited partial (prefix) and fuzzy matching

Bloom Search

  • Bloom filters
  • Smallest footprint
  • Indexing and searching are O(k×n)O(k \times n)
  • Very limited fuzzy matching
  • False positive rate can be lowered but never eliminated
Time to Index
Elasticlunr 🥇 1,453 ms
Lunr 🥈 3,349 ms
MiniSearch 🥉 6,046 ms
Bloom Search* 11,543 ms
Index Size
Bloom Search* 🥇 356.63 kb
MiniSearch 🥈 901.04 kb
Elasticlunr 🥉 3667.58 kb
Lunr 4216.16 kb

Thank you.

That is all.