GetFixedPointMultiplierShift

Source: tvm/src/relay/qnn/utils.cc

/*
 * \brief Convert FP32 representation into fixed point representation.
 * \param double_multiplier The input FP32 number.
 * \return The pair of multiplier and shift for fixed point representation.
 * \note Converts a floating point number so that it can be represented by
 *       integers. The representation is
 *             float_number = (significand) * 2^(exponent)
 *
 *       The significand is a number between 0.5 and 1. This is represented by
 *       an integer number. For example, if it is int32, then the decimal point
 *       exists between bit 31 and 30 from LSB (or between first and second bit
 *       from the left).
 *
 *       Some examples are
 *           0.25 = (0.5) * 2^(-1)
 *           0.125 = (0.5) * 2^(-2)
 *
 *       Credit to TFLite reference implementation.
 */
std::pair<int32_t, int32_t> GetFixedPointMultiplierShift(double double_multiplier) {
  int32_t significand, exponent;
  if (double_multiplier == 0.) {
    significand = 0;
    exponent = 0;
    return std::make_pair(significand, exponent);
  }

  // Get the significand and exponent.
  double significand_d = std::frexp(double_multiplier, &exponent);

  // Convert the double significand to int significand, i.e., convert into an
  // integer where the decimal point is between bit 31 and 30. This is done by
  // multiplying the double value with 2^31 and then casting to int.
  significand_d = std::round(significand_d * (1ll << 31));
  auto significand_int64 = static_cast<int64_t>(significand_d);
  ICHECK_LE(significand_int64, (1ll << 31));
  if (significand_int64 == (1ll << 31)) {
    significand_int64 /= 2;
    ++exponent;
  }
  ICHECK_LE(significand_int64, std::numeric_limits<int32_t>::max());
  significand = static_cast<int32_t>(significand_int64);
  return std::make_pair(significand, exponent);
}

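The function GetFixedPointMultiplierShift takes a double-precision floating-point number, double_multiplier, and returns a std::pair of two int32_t values: the significand and the exponent.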

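The function converts the input double into a fixed-point representation, that is, an integer significand paired with a power-of-two exponent. It splits the input into a significand and an exponent, then scales the significand to an integer whose binary point sits between bit 31 and bit 30 (counting from the LSB). The integer and the exponent are returned together.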
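As a worked check of this layout, the returned pair can be read back as a real number. The symbols \(x\) (the input), \(q\) (the returned integer significand) and \(e\) (the returned exponent) are shorthand introduced here, not names from the source:

\[
x \approx q \cdot 2^{\,e-31}, \qquad 2^{30} \le |q| \le 2^{31}-1 \quad (x \neq 0)
\]

For the example \(0.25 = 0.5 \cdot 2^{-1}\) from the source comment, \(q = 0.5 \cdot 2^{31} = 2^{30}\) and \(e = -1\), so

\[
q \cdot 2^{\,e-31} = 2^{30} \cdot 2^{-32} = 2^{-2} = 0.25 .
\]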

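Inside the function, a zero input is handled first and returns the pair (0, 0). Otherwise, std::frexp yields the significand and the exponent. The significand is multiplied by \(2^{31}\) and rounded to the nearest integer. If rounding produces exactly \(2^{31}\), the integer is halved and the exponent is incremented by 1, which leaves the represented value unchanged. Finally, the function checks that the integer does not exceed the int32_t maximum, casts it to int32_t, and returns it together with the exponent.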
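The steps above can be tried outside TVM with a small standalone sketch. It is not the TVM implementation: the names FixedPointMultiplierShiftSketch and ReconstructSketch are hypothetical, plain assert stands in for ICHECK_LE, and the reconstruction helper is an addition for checking results only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <utility>

// Sketch of the steps described above (assumption: assert replaces TVM's ICHECK_LE).
std::pair<int32_t, int32_t> FixedPointMultiplierShiftSketch(double multiplier) {
  if (multiplier == 0.0) return {0, 0};
  int exponent = 0;
  // frexp returns a significand with magnitude in [0.5, 1).
  double significand_d = std::frexp(multiplier, &exponent);
  // Scale by 2^31 and round to the nearest integer.
  auto q = static_cast<int64_t>(std::round(significand_d * (1ll << 31)));
  assert(q <= (1ll << 31));
  if (q == (1ll << 31)) {  // rounding carried up to 2^31: renormalize
    q /= 2;
    ++exponent;
  }
  assert(q <= std::numeric_limits<int32_t>::max());
  return {static_cast<int32_t>(q), static_cast<int32_t>(exponent)};
}

// Hypothetical helper: read the pair back as significand * 2^(exponent - 31).
double ReconstructSketch(std::pair<int32_t, int32_t> p) {
  return std::ldexp(static_cast<double>(p.first), p.second - 31);
}
```

Calling FixedPointMultiplierShiftSketch(0.25) should give the pair (1 << 30, -1), matching the 0.25 = 0.5 * 2^(-1) example in the source comment. An input just below 1.0 exercises the renormalization branch, because its significand rounds up to 2^31.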

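std::frexp is a C++ standard library function that decomposes a floating-point number into a significand and a power-of-two exponent. Its prototype is: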

double frexp(double x, int* exp);

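Parameters:

x: the floating-point number to decompose.
exp: pointer to an int that receives the exponent.

Return value:

Returns the significand (mantissa). For a nonzero finite x its magnitude lies in [0.5, 1), so that x = significand * 2^(*exp).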

#include <cmath>
#include <iostream>

int main() {
  double x = 16.4;
  int n = 0;
  // Split x into a significand y with |y| in [0.5, 1) and an exponent n.
  double y = std::frexp(x, &n);
  std::cout << "std::frexp(x, &n) => "
            << x << " = " << y << " * "
            << "(1 << " << n << ")" << '\n';
  return 0;
}

std::frexp(x, &n) => 16.4 = 0.5125 * (1 << 5)

Evaluating the right-hand side recovers the original value:

0.5125 * (1 << 5)

16.400000
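The same round trip can be checked programmatically with std::ldexp, the standard inverse of std::frexp. The following is a minimal sketch; the helper name FrexpRoundTripsSketch is hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true if frexp's output, fed back through ldexp, gives x again
// and the significand lies in the documented range.
bool FrexpRoundTripsSketch(double x) {
  assert(std::isfinite(x));  // for inf/NaN the stored exponent is unspecified
  int n = 0;
  double y = std::frexp(x, &n);
  // Nonzero finite x: |y| is in [0.5, 1). Zero: y == 0 and n == 0.
  bool in_range = (x == 0.0) ? (y == 0.0 && n == 0)
                             : (std::fabs(y) >= 0.5 && std::fabs(y) < 1.0);
  // ldexp(y, n) computes y * 2^n, which undoes the decomposition.
  return in_range && std::ldexp(y, n) == x;
}
```

The comparison can be exact here because scaling by a power of two does not change the significand bits of a normal double.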