This core is a low-latency divider that works by caching reciprocal values and then using a multiply to perform the divide, rather than the usual divide operation. The first time a divide operation is encountered, the reciprocal of the divisor is calculated; this takes the same amount of time as a normal divide. The next time a divide by the same divisor is encountered, the pre-calculated reciprocal is used. Reciprocals are stored in a small cache similar to a processor data cache.
The approach relies on the identity a/b = a * (1/b).
In many cases the divisor 'b' remains the same within a loop. 1/b is then effectively a constant that needs to be calculated only once; after that, each divide requires only a multiply operation. As in the example, divides are performed using only three clock cycles when the reciprocal can be found in the cache.
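
The following is a minimal software sketch of the idea, not the core's actual implementation. It models a small direct-mapped cache of fixed-point reciprocals and performs each divide as a multiply and a shift. The names ReciprocalCache, WIDTH, SHIFT and CACHE_LINES, the 32-bit unsigned operand width, the eight-entry cache size and the direct-mapped indexing are all assumptions made for illustration; the core's real parameters and cache organization may differ. Clock-cycle timing is not modeled.

```python
WIDTH = 32            # assumed operand width in bits (hypothetical)
SHIFT = 2 * WIDTH     # fixed-point scale used for the stored reciprocal
CACHE_LINES = 8       # assumed number of cache entries (hypothetical)


class ReciprocalCache:
    """Direct-mapped cache of fixed-point reciprocals, keyed by divisor."""

    def __init__(self):
        self.tags = [None] * CACHE_LINES   # divisor held in each line
        self.recips = [0] * CACHE_LINES    # ceil(2**SHIFT / divisor)
        self.hits = 0
        self.misses = 0

    def divide(self, a, b):
        """Return a // b for unsigned WIDTH-bit a and non-zero b."""
        if not (0 <= a < (1 << WIDTH) and 0 < b < (1 << WIDTH)):
            raise ValueError("operands out of range")
        index = b % CACHE_LINES
        if self.tags[index] != b:
            # Miss: calculate the reciprocal. This is the slow step that
            # costs about as much as an ordinary divide.
            self.misses += 1
            self.tags[index] = b
            self.recips[index] = -(-(1 << SHIFT) // b)  # ceiling division
        else:
            self.hits += 1
        # a / b == a * (1 / b): one multiply, then a shift to drop the
        # fractional bits of the fixed-point product.
        return (a * self.recips[index]) >> SHIFT


# A loop with a constant divisor: only the first divide misses the cache.
cache = ReciprocalCache()
results = [cache.divide(a, 7) for a in range(100, 110)]
```

Rounding the reciprocal up and scaling it by twice the operand width keeps the truncated product equal to the exact integer quotient for every in-range operand pair, so the sketch needs no correction step. In the loop above, the first divide by 7 fills the cache and the remaining nine reuse the stored reciprocal.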