{ "cells": [
{ "cell_type": "markdown", "metadata": {}, "source": [ "# Reading and Writing the Apache Parquet Format\n", "\n", "The [Apache Parquet](http://parquet.apache.org/) project provides a standardized open-source columnar storage format for use in data analysis systems. It was originally created for [Apache Hadoop](http://hadoop.apache.org/) and was later adopted by systems such as [Apache Drill](http://drill.apache.org/), [Apache Hive](http://hive.apache.org/), [Apache Impala](http://impala.apache.org/), and [Apache Spark](http://spark.apache.org/) as a shared standard for high-performance data IO." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Apache Arrow is an ideal in-memory transport layer for data that is being read from or written to Parquet files. The [C++ implementation](https://github.com/apache/arrow/tree/main/cpp/tools/parquet) of Apache Parquet has been developed in parallel and includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which also makes it possible to read and write Parquet files with `pandas`." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Obtaining pyarrow with Parquet Support\n", "If you installed `pyarrow` with `pip` or `conda`, it should already be built with Parquet support:" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pyarrow.parquet as pq" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Reading and Writing Single Files\n", "The functions {func}`~pyarrow.parquet.read_table` and {func}`~pyarrow.parquet.write_table` read and write a {class}`pyarrow.Table` object, respectively.\n", "\n", "Let's look at a simple table:" ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "import pandas as pd\n", "\n", "import pyarrow as pa\n", "\n", "df = pd.DataFrame({'one': [-1, np.nan, 2.5],\n", "                   'two': ['foo', 'bar', 'baz'],\n", "                   'three': [True, False, True]},\n", "                  index=list('abc'))\n", "\n", "table = pa.Table.from_pandas(df)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Write this table to Parquet format with the `write_table` function:" ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pyarrow.parquet as pq\n", "\n", "pq.write_table(table, 'example.parquet')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "This creates a single Parquet file. In practice, a Parquet dataset may consist of many files spread across many directories. We can read a single file back with the `read_table` function:" ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>one</th><th>two</th><th>three</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>a</th><td>-1.0</td><td>foo</td><td>True</td></tr>\n",
"    <tr><th>b</th><td>NaN</td><td>bar</td><td>False</td></tr>\n",
"    <tr><th>c</th><td>2.5</td><td>baz</td><td>True</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "   one  two  three\n", "a -1.0  foo   True\n", "b  NaN  bar  False\n", "c  2.5  baz   True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table2 = pq.read_table('example.parquet')\n", "\n", "table2.to_pandas()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):" ] },
{ "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>one</th><th>three</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>-1.0</td><td>True</td></tr>\n",
"    <tr><th>1</th><td>NaN</td><td>False</td></tr>\n",
"    <tr><th>2</th><td>2.5</td><td>True</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "   one  three\n", "0 -1.0   True\n", "1  NaN  False\n", "2  2.5   True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pq.read_table('example.parquet', columns=['one', 'three']).to_pandas()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "When reading a subset of columns from a file that was written from a pandas DataFrame, we use `read_pandas` to preserve any additional index column data:" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>two</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>a</th><td>foo</td></tr>\n",
"    <tr><th>b</th><td>bar</td></tr>\n",
"    <tr><th>c</th><td>baz</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ "   two\n", "a  foo\n", "b  bar\n", "c  baz" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pq.read_pandas('example.parquet', columns=['two']).to_pandas()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The source of the file does not have to be given as a string. It can be any of the following:\n", "\n", "- A file path as a string\n", "- A [`NativeFile`](https://arrow.apache.org/docs/python/memory.html#io-native-file) from PyArrow\n", "- A Python file object" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In general, a Python file object has the worst read performance, while a string file path or a {class}`~pyarrow.NativeFile` instance (especially a memory map) performs best." ] } ], "metadata": { "kernelspec": { "display_name": "xin", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }