GetFixedPointMultiplierShift
Source: tvm/src/relay/qnn/utils.cc
/*
* \brief Convert FP32 representation into fixed point representation.
* \param double_multiplier The input FP32 number.
* \return The pair of multiplier and shift for fixed point representation.
* \note Converts a floating point number so that it can be represented by
* integers. The representation is
* float_number = (significand) * 2^(exponent)
*
* The significand is a number between 0.5 and 1. This is represented by
* an integer number. For example, if it is int32, then the decimal point
* exists between bit 31 and 30 from LSB (or between first and second bit
* from the left).
*
* Some examples are
* 0.25 = (0.5) * 2^(-1)
* 0.125 = (0.5) * 2^(-2)
*
* Credit to TFLite reference implementation.
*/
std::pair<int32_t, int32_t> GetFixedPointMultiplierShift(double double_multiplier) {
  int32_t significand, exponent;
  if (double_multiplier == 0.) {
    significand = 0;
    exponent = 0;
    return std::make_pair(significand, exponent);
  }
  // Get the significand and exponent.
  double significand_d = std::frexp(double_multiplier, &exponent);
  // Convert the double significand to int significand, i.e., convert into an
  // integer where the decimal point is between bit 31 and 30. This is done by
  // multiplying the double value with 2^31 and then casting to int.
  significand_d = std::round(significand_d * (1ll << 31));
  auto significand_int64 = static_cast<int64_t>(significand_d);
  ICHECK_LE(significand_int64, (1ll << 31));
  if (significand_int64 == (1ll << 31)) {
    significand_int64 /= 2;
    ++exponent;
  }
  ICHECK_LE(significand_int64, std::numeric_limits<int32_t>::max());
  significand = static_cast<int32_t>(significand_int64);
  return std::make_pair(significand, exponent);
}
The function GetFixedPointMultiplierShift takes a double-precision floating-point number, double_multiplier, and returns a pair of two integers.
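It converts the input into a fixed-point representation made of an integer significand and a power-of-two exponent. Specifically, it splits the input into a significand and an exponent, then scales the significand to an integer whose binary point lies between bit 31 and bit 30. The integer significand and the exponent are returned together.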
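To make the representation concrete, the following minimal sketch maps a (significand, exponent) pair back to the real value it approximates. ReconstructValue is a hypothetical helper introduced only for illustration; it is not part of TVM. It assumes the binary point sits 31 bits above the LSB, as the source comment describes.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not part of TVM): map a (significand, exponent) pair
// back to the real value it approximates.
double ReconstructValue(int32_t significand, int32_t exponent) {
  // A non-zero significand is normalized: its magnitude lies in [2^30, 2^31),
  // which corresponds to the real range [0.5, 1).
  const int64_t magnitude = std::llabs(static_cast<int64_t>(significand));
  assert(significand == 0 || magnitude >= (int64_t{1} << 30));
  // The binary point sits 31 bits above the LSB, hence the "- 31".
  return std::ldexp(static_cast<double>(significand), exponent - 31);
}
```

With the pair (1 << 30, -1) this returns 0.25, which matches the example 0.25 = (0.5) * 2^(-1) in the source comment, since 0.5 is stored as 2^30.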
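Inside the function, a zero input is handled first and returns the pair (0, 0). Otherwise, std::frexp splits the input into a significand and an exponent. The significand is multiplied by \(2^{31}\) and rounded to the nearest integer. If rounding yields exactly \(2^{31}\), the integer is halved and the exponent is incremented by 1, which leaves the represented value unchanged while bringing the significand back into range. Finally, the function checks that the integer does not exceed the int32_t maximum, casts it to int32_t, and returns it together with the exponent.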
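The steps above, including the rounding edge case, can be sketched as a standalone function. This is an illustrative restatement under stated assumptions, not TVM's code: FixedPointSketch is a hypothetical name, and the standard assert macro stands in for TVM's ICHECK_LE.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>

// Standalone sketch of the steps described above (hypothetical name);
// assert replaces TVM's ICHECK_LE.
std::pair<int32_t, int32_t> FixedPointSketch(double multiplier) {
  if (multiplier == 0.0) return {0, 0};
  int exponent = 0;
  // For a non-zero finite input, |fraction| lies in [0.5, 1).
  const double fraction = std::frexp(multiplier, &exponent);
  // Scale by 2^31 and round to the nearest integer.
  const double scaled = std::round(fraction * static_cast<double>(int64_t{1} << 31));
  auto q31 = static_cast<int64_t>(scaled);
  assert(q31 <= (int64_t{1} << 31));
  // A fraction just below 1 can round up to exactly 2^31, which does not fit
  // in int32_t. Halving it and bumping the exponent keeps the value the same.
  if (q31 == (int64_t{1} << 31)) {
    q31 /= 2;
    ++exponent;
  }
  assert(q31 <= std::numeric_limits<int32_t>::max());
  return {static_cast<int32_t>(q31), static_cast<int32_t>(exponent)};
}
```

For example, the input 1.0 - 2^(-40) has a fraction so close to 1 that scaling and rounding gives exactly 2^31; the sketch then returns (1 << 30, 1), the same pair it returns for 1.0.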
std::frexp is a C++ standard library function that decomposes a floating-point number into a significand and an exponent. Its prototype is:
double frexp(double x, int* exp);
Parameters:
x: the floating-point number to decompose.
exp: pointer to an integer that receives the exponent.
Return value:
The significand (mantissa) of x.
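As a quick check of this contract, the sketch below decomposes a value with std::frexp and rebuilds it with std::ldexp, its standard-library inverse. FrexpRoundTrips is a hypothetical helper written for illustration, and it only considers finite inputs.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical check (illustration only): decompose x with std::frexp and
// confirm that std::ldexp rebuilds it exactly.
bool FrexpRoundTrips(double x) {
  assert(std::isfinite(x));  // this sketch only considers finite inputs
  int exponent = 0;
  const double fraction = std::frexp(x, &exponent);
  const double magnitude = std::fabs(fraction);
  // Zero decomposes to (0, 0); any other finite value has a significand
  // whose magnitude lies in [0.5, 1).
  const bool normalized = (x == 0.0) ? (fraction == 0.0 && exponent == 0)
                                     : (magnitude >= 0.5 && magnitude < 1.0);
  return normalized && std::ldexp(fraction, exponent) == x;
}
```

The round trip is exact because both functions only scale by a power of two, so no significand bits are lost for values such as 16.4 or -3.0.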
#include <limits>
#include <string>
#include <utility>
#include <vector>
#include <cmath>
#include <iostream>
double x, y;
int n;
x = 16.4;
y = std::frexp(x, &n);
std::cout << "std::frexp(x, &n) => "
<< x << " = " << y << " * "
<< "(1 << " << n << ")";
std::frexp(x, &n) => 16.4 = 0.5125 * (1 << 5)
0.5125 * (1 << 5)
16.400000