GetFixedPointMultiplierShift


Source: tvm/src/relay/qnn/utils.cc

/*
 * \brief Convert FP32 representation into fixed point representation.
 * \param double_multiplier The input FP32 number.
 * \return The pair of multiplier and shift for fixed point representation.
 * \note Converts a floating point number so that it can be represented by
 *       integers. The representation is
 *             float_number = (significand) * 2^(exponent)
 *
 *       The significand is a number between 0.5 and 1. This is represented by
 *       an integer number. For example, if it is int32, then the decimal point
 *       exists between bit 31 and 30 from LSB (or between first and second bit
 *       from the left).
 *
 *       Some examples are
 *           0.25 = (0.5) * 2^(-1)
 *           0.125 = (0.5) * 2^(-2)
 *
 *       Credit to TFLite reference implementation.
 */
std::pair<int32_t, int32_t> GetFixedPointMultiplierShift(double double_multiplier) {
  int32_t significand, exponent;
  if (double_multiplier == 0.) {
    significand = 0;
    exponent = 0;
    return std::make_pair(significand, exponent);
  }

  // Get the significand and exponent.
  double significand_d = std::frexp(double_multiplier, &exponent);

  // Convert the double significand to int significand, i.e., convert into an
  // integer where the decimal point is between bit 31 and 30. This is done by
  // multiplying the double value with 2^31 and then casting to int.
  significand_d = std::round(significand_d * (1ll << 31));
  auto significand_int64 = static_cast<int64_t>(significand_d);
  ICHECK_LE(significand_int64, (1ll << 31));
  if (significand_int64 == (1ll << 31)) {
    significand_int64 /= 2;
    ++exponent;
  }
  ICHECK_LE(significand_int64, std::numeric_limits<int32_t>::max());
  significand = static_cast<int32_t>(significand_int64);
  return std::make_pair(significand, exponent);
}

GetFixedPointMultiplierShift takes a double-precision floating-point number, double_multiplier, and returns a std::pair of two int32_t values: the significand and the exponent.

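The function converts the input double into a fixed-point representation. It decomposes the input into a significand and an exponent, then scales the significand to a 32-bit integer whose binary point lies between bit 31 and bit 30, which leaves 31 fractional bits. The integer significand and the exponent are returned together, so that double_multiplier ≈ significand * 2^(exponent - 31).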

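Inside the function, the input is first compared with zero; if it is zero, the pair (0, 0) is returned immediately. Otherwise, std::frexp splits the input into a significand and an exponent. The significand is then multiplied by \(2^{31}\) and rounded to the nearest integer. If the result equals \(2^{31}\), which happens when rounding pushes the significand up to 1.0, it is halved and the exponent is incremented by 1. Finally, the function checks that the integer does not exceed the int32_t maximum, casts it to int32_t, and returns it together with the exponent.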

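std::frexp is a C++ standard library function, declared in <cmath>, that decomposes a floating-point number into a significand and an exponent. Its prototype is: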

double frexp(double x, int* exp);

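Parameters:

  • x: the floating-point number to decompose.

  • exp: pointer to an integer that receives the exponent.

Return value:

  • The significand (mantissa). For a nonzero finite x, its magnitude lies in [0.5, 1).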

#include <cmath>
#include <iostream>

int main() {
  double x = 16.4;
  int n = 0;
  // frexp returns y with 0.5 <= |y| < 1 such that x == y * 2^n.
  double y = std::frexp(x, &n);
  std::cout << "std::frexp(x, &n) => "
            << x << " = " << y << " * "
            << "(1 << " << n << ")\n";
  return 0;
}
Output:

std::frexp(x, &n) => 16.4 = 0.5125 * (1 << 5)

Evaluating 0.5125 * (1 << 5) gives 16.400000, which confirms the decomposition.