tanh implementations for FPGA neural nets

Question

In trying to put a neural network on my FPGA, I am running into the problem of describing my activation function. On my computer, I typically use the well-known tanh, so I am inclined to stick to my choice on the board too. However, I can't seem to find the best way to calculate tanh on an FPGA, so my question is:

Given a signed 16-bit word (with 8 bits of fraction length), what's the easiest way to implement the tanh function with reasonable accuracy?

Keep in mind that the target is an FPGA, so things like multiplications are OK (as long as I can prevent the word-length from growing too fast), but divisions are tricky and I would like to avoid them as much as possible. Also, the output word length can be optimized (I can devote all but two bits from the word to the fractional part, since the range is (-1, 1)). And by reasonable accuracy, I mean at least 5 decimals worth.

The options I have researched already are:

1) Look-up tables: These need no explanation, I am sure.

2) CORDIC: I was able to write a tanh CORDIC implementation using details from this paper from Walther, though I do not fully understand the 'convergence' of this algorithm, or how I can test it (meaning my implementation returns the right answer for radian values > 1.13, so where's the problem?)

3) Minimax approximation: This requires a division, and more importantly, the input argument needs to be raised to up to the nth power (n being the degree of the polynomial), which I think will cause problems with a fixed-point format like mine.

Are there other computationally cheap ways of calculating this function? There's plenty of literature out there on this very subject, using everything from stochastic machines to DCT interpolation filters, but none have clear details that I could understand.

Thanks in advance!

"At least 5 decimals worth" in the result would require $\lceil 5 \log_{2}(10)\rceil = 17$ fraction bits. I have a hard time reconciling that much output precision with the fact that the input is only in s7.8 fixed-point. How much table storage can you afford? Have you looked into bipartite tables, or more generally multipartite tables? A 15-bits in, 16-bits out bipartite table would probably consume around 3 KB; I haven't looked at the details, though. — njuffa, Apr 09 '19 at 23:43
@njuffa: You are right, I realized that only after I posted the question; maybe I should reconsider the precision I need. Table storage isn't an issue at the moment, since there is little going on the board other than the net. The board itself has 36kbits (not bytes) of BRAM which I can use as storage. As for your second question, I am not very knowledgeable about multipartite tables, so I haven't checked that out yet - I will research that a bit more and get back to you. — Swat Kat, Apr 09 '19 at 23:49
You may also want to have a second look at the number of fraction bits in the input. The input domain seems kind of large at $x \in (-128, 128)$, with $\mathrm{tanh}(x) = 1$ for much of that domain. Increasing the number of fractional bits in the input, and restricting input domain to $(-32, 32)$, so s5.10 fixed point, might be more suitable? — njuffa, Apr 10 '19 at 00:31
@njuffa: Thanks for the suggestion; I am definitely trying to trim down my input data type (it's used in other places in the network, so changing it requires a bit of effort). — Swat Kat, Apr 10 '19 at 02:37

Mark Fischler · Answer 1 · 2019-04-09T23:10:04.320

The best way turns out to break into a small number of ranges and use Pade approximants in each range. This method requires one division.

For example, when $|z| < 2$ use $$ \tanh(z) \approx z\frac{945+105 z^2 + z^4}{15(63+28z^2+z^4)} $$ while when $2 \leq z < 4.5$ use $$ \tanh(z) \approx \frac{26.648 + 22.955(z-3) + 7.5759 (z-3)^2 + 1.0202(z-3)^3-0.002523(z-3)^4}{26.781+22.804(z-3)+7.6516(z-3)^2 + (z-3)^3}\\= \frac{0.995055 + 0.857143(z-3) +0.282885(z-3)^2+0.038095(z-3)^3-0.000094(z-3)^4}{1+0.85149(z-3)+0.285714(z-3)^2+0.03734(z-3)^3} $$ and when $4.5 \leq z$ use $$ \tanh(z) \approx -\frac{15 + 15.0002(z-6) + 6 (z-6)^2 + 1.0001(z-6)^3}{15.0002+15(z-6)+6.00007(z-60)^2 + (z-6)^3} $$ For $z > 7$ I would just say $\tanh(z) = 1$, although the expression shown works well all the way out to about $z = 200$.

Of course, to get good accuracy in that last case, you should figure out the tiny deviations from integers separately.

You would have to adjust where your fixed point lies depending on which range you are in, to get good enough accuracy in your constants; or you can normalize to put your constants in your preferred range.

This is a great solution; I had never heard of Pade approximants until today! I can see that the greatest power I have to worry about is 4, which isn't that bad (especially for smaller values). For z > 4.5 the (z - 60) square is troublesome, since I will have to increase the word length for the integer part, but I think that's about the hardest part. I am not quite sure I understand the last part of your answer, about adjusting the fixed point - would you be able to elaborate on that a little more? — Swat Kat, Apr 10 '19 at 00:07
yes, since the numbers involved are not all in the same size range, it may not be convenient to mix, for example, 15 with 1.0001 -- and that last decimal point actually is needed to get $10^{-5}$ accuracy. In fact, short of doing a table based on cubic splines, it may not be possible to rapidly get that accuracy using 16-bit fixed point. — Mark Fischler, Apr 11 '19 at 05:12

njuffa · Answer 2 · 2019-04-13T17:47:12.493

Asker mentioned in comments that the FPGA has 36 kbits of RAM available for table storage. If the output format is S1.16 fixed-point, it is only necessary to store 1598 16-bit table entries to cover those results that are not $1$ to within the precision of the output format. With this simple straight lookup a maximum error of $7.63 \cdot 10^{-6}$ is achieved, roughly in line with the asker's expectations. The required table storage is 25568 bits, well within the hardware limitations.

I coded a quick prototype of this approach in a simple ISO-C program:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

typedef int32_t int18_t; /* C doesn't provide an 18-bit type, fake it here */

uint16_t tab[0x63e]; // 1598 * 16 bits = 25568 bits

/* receives data in s7.8 fixed point, returns result in s1.16 fixed-point */
int18_t fxtanh (int16_t x)
{
    int18_t r;
    uint16_t absx = abs (x);

    r = (absx > 0x63d) ? (0x10000) : ((int18_t) tab [abs (x)]);
    r = (x < 0) ? (-r) : (r);
    return r;
}

int main (void)
{
    int i;
    int18_t res;
    double resf, ref, err, maxerr = 0;
    for (i = 0; i < 0x63e; i++) {
        tab[i] = (int)((tanh (i / 256.0) * 65536.0 + 0.5));
    }

    for (i = -0x8000; i <= 0x7fff; i++) {
        res = fxtanh ((int16_t)i);
        resf = (double)res / 65536.0;
        ref = tanh (i / 256.0);
        err = fabs (resf - ref);
        if (err > maxerr) maxerr = err;
    }
    printf ("maxerr = %15.8e\n", maxerr);
    return EXIT_SUCCESS;
}

Two other alternatives I examined, which compute $\mathrm{tanh}(x) = 1 - \frac{2}{\exp(2x) +1}$, do not achieve the accuracy desired by the asker. The first uses quadratic interpolation to compute $2^{x}$ from a table of 130 32-bit entries, returning a 16.16 fixed-point result. The maximum absolute error in $\mathrm{tanh}$ with this method is $3.2 \cdot 10^{-5}$. The second computes $\exp$ as a 16.16 fixed-point number using Meggitt's pseudo-multiplication process, requiring 20 table entries. The maximum absolute error in $\mathrm{tanh}$ is $3.51 \cdot 10^{-5}$ with this method.

I was able to implement this with great success, many thanks! The only negative here is that when I have to accommodate more stuff on the board, resizing this table will be a pain. It might then be necessary to add some sort of interpolation to this, which isn't my area at all, so I would be stuck again. — Swat Kat, Apr 10 '19 at 16:28

tanh implementations for FPGA neural nets

2 Answers2