Faster significant prefix comparison in binary representation

Question

An "integer prefix" here is the first non-zero digits of a representation... Suppose first two digits:

Same decimal prefix 45: $(4512)_{10}$ and $(0045)_{10}$
Same binary prefix 101: $(000101)_{2}$ and $(010101)_{2}$

So, for the second example, suppose a function p that checks same prefix of size 3: $p(000101,010101,3)$.

... A function p(int binary,int bynary,size) that returns true when numbers have the same prefix. Suppose $p(x,y,L)$ for $x,y$ as positive 64 bits integers and 0<L<64.

I have all usual integer arithmetics (*,+,-,etc.) and bitwise (AND, OR, shift left, etc.) basic operations... How to express this p function?
Supposing that there are many solutions, I need the "faster" (less computational time).

NOTES

There are some clues here but I not see how to use it.

There are also a good example of application, for Geohash 64-bit representation. See this pg_geohash issue. Analog thing can be used with Metaphone, SOUNDEX, pHash and other "similarity hashes" (mathematical Locality-sensitive hashing).

I think the clues you are referring to are quite straightforward, you just need to apply them. You can find another idea here and this related page although it seems to apply only to the case $L = 1$. In my opinion your question is more of a stackoverflow question. — Fabius Wiesner, Aug 23 '18 at 11:51
@mbjoe thanks, more one clue. Seems that is not a trivial problem (!)... I have the option to cache a second representation as a "rotated to the first non-zero" (any solution using x << 1 loop to drop zeros) and them comparing by a AND-mask of 1110000 with L ones. — Peter Krauss, Aug 23 '18 at 12:27
Yes, shifting to the left till the most significant 1 is at bit 63, AND with a mask with $L$ ones on the left, then compare, is the simpler solution, but not the fastest one. — Fabius Wiesner, Aug 23 '18 at 12:41
Thanks @mbjoe, you understand the problem, show good clues and show that it is not trivial (!). Seems there are no "analitic solution"... Can you check my proposed "niche solution"? — Peter Krauss, Aug 23 '18 at 13:24

Peter Krauss · Answer 1 · 2018-08-25T14:44:16.643

Summary

Perhaps there are an answer, and it is

$$p(y,x,L) := [[(f(y) \& f(x)) >> (64-L)] = 0]$$

but is difficult to proof that it is the "faster way". The only algorithm that actually looks "the faster", with performance gain when compared to "common" ones, is one that uses the cache strategy (pre-processing y values). So, enhancing the above formula with the replacing of $f(y)$ by a constant $c$.

Discussion and details

This is not a generic solution, is valid only for use in situations where only one parameter need "real time" speed.

The problem explaned in the question seems not offers a trivial solution, so using a brute force solution: to cache a pre-processed number... Is an economic cache, and valid as solution of my problem in a database.

Cache-phase: some examples, supposing 8 bits.

original=$00010100$ and cached=$10100000$.
original=$01010100$ and cached=$10101000$.
original=$11010101$ and cached=$11010101$.

Real-time checking: comparing the cached number c with x.

Suppose comparision by $L=3$, with $x=01011000$. Need to transform x into $10110000$ them preserve only first 3 (L) non-zero bits, resulting $y=10100000$. Examples:

for $c:=10100000$, compare $c\&y$ (is $10100000$) with y, that is true.
for $c:=10101000$, compare $c\&y$ (is $10100000$) with y, that is true.
for $c:=11010101$, compare $c\&y$ (is $11000000$) with y, that is false.

Conclusion

Cache-phase is a "while non-zero" function that does only x << 1 to obtain the cached 64 bits representation. Let label it $f(x)$, a function sometimes named "BitScanForward".
The function $p(c,x,L)$ that checks prefixes of size L, need also a function $d(L)$ that do "padding left ones" ($d(2)=11000000$, $d(3)=11100000$, etc.). Is an integer comparison with a bitwise AND after after f transformation: $$p(c,x,L) := [x = [c \& f(x) \& d(L)]]$$

or, using the faster Ben Voigt's solution:

$$p(c,x,L) := [[(c \& f(x)) >> (64-L)] = 0]$$

There are many performance gains:

the replace of the &d(L) part by >>(64-L), it removes the function-call and ensure a good implementation for this part of the function;
the replace of the x= part, a "comparison with variable" operation, by a comparison with a constant. To use a constant instead a variable is the main performance gain.
As this constant is zero, modern compilers will use something as test eax,eax instead of cmp eax,0, that is some marginally better performing instruction.

PS: the function f(x) can be implemented as "left align" using a function g(x,L) that tells you how far to shift, so $f(x)=x<<g(x,L)$. As Ben Voigt explained, is the faster way to do it, and the g usual name in the C-lang's libraries, is BitScanReverse64().

Comparing the first N bits: ((x ^ y) >> (64-N)) == 0 Left align: BitScanReverse64(&shift, x); return x << shift;` — Ben Voigt, Aug 23 '18 at 13:48
Some more clues for alternative implementations, by Morton Numbers: https://www.factual.com/blog/how-geohashes-work/ — Peter Krauss, Sep 16 '18 at 15:24

Faster significant prefix comparison in binary representation

1 Answers1

Summary

Discussion and details