Page 1 of 1

From the idea to practical usability

Posted: Mon Jan 27, 2025 4:38 am
by Mitu9900
If only a single hash function is used, the result will of course be extremely influenced by chance, so the LogLog algorithm suggests using several (e.g. 64) hash functions and an estimated value averaged from them. In order to avoid having to keep and calculate 64 different hash functions in practice, there is a very elegant solution: the first 6 bits of each binary hash value are cut off and interpreted as a number between 0 and 63. This results in 64 "artificially" different hash functions, for albania telegram screening each of which the number of the longest chain of zeros from the remaining bits of the hash value is stored in a so-called bucket (simply a number storage device).

For example, the hash value above for Mannheim begins with the bold bits 001010 (= ten in decimal notation), so that for the two following zeros, a two would be stored in the bucket with the number 10, assuming that a larger number had not already been stored there before.

To estimate the number of values ​​seen so far, the geometric mean over all 64 buckets must be calculated using the following formula.

The HyperLogLog algorithm, which was later derived from it, works according to exactly the same basic principle, only it uses the harmonic mean shown in the following formula to estimate its counter reading.

Fig. 2: Formula for approximation in the HyperLogLog algorithm (harmonic mean).
Fig. 2: Formula for approximation in the HyperLogLog algorithm (harmonic mean).
When using 1024 buckets, HyperLogLog achieves an average error rate of around 3.25 percent - and this when using a 256-bit hash function with a constant memory consumption of just 1024 bytes. It is also easy to combine approximate values ​​from several server nodes running in parallel; the average value of the values ​​from different nodes can easily be calculated. For improved accuracy, the maximum values ​​from the respective buckets of all nodes could also easily be collected and used directly on the coordinating server to calculate the estimated value.

It almost goes without saying that the open source community has already developed numerous implementations of HyperLogLog in a variety of languages. To get a feel for its use, I would like to briefly illustrate the use of the well-known Java HLL implementation of Aggregate Knowledge [3] .