{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading and Writing the Apache Parquet Format\n", "\n", "The [Apache Parquet](http://parquet.apache.org/) project provides a standardized, open-source columnar storage format for use in data analysis systems. It was originally created for [Apache Hadoop](http://hadoop.apache.org/) and was later adopted by systems such as [Apache Drill](http://drill.apache.org/), [Apache Hive](http://hive.apache.org/), [Apache Impala](http://impala.apache.org/), and [Apache Spark](http://spark.apache.org/) as a shared standard for high-performance data IO." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apache Arrow is an ideal in-memory transport layer for data that is being read from or written to Parquet files. The [C++ implementation](https://github.com/apache/arrow/tree/main/cpp/tools/parquet) of Apache Parquet has been developed concurrently; it includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow includes Python bindings to this code, which also makes it possible to read and write Parquet files with `pandas`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Obtaining pyarrow with Parquet Support\n", "If you installed `pyarrow` with `pip` or `conda`, it should already include Parquet support:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pyarrow.parquet as pq" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading and Writing Single Files\n", "The functions {func}`~pyarrow.parquet.read_table` and {func}`~pyarrow.parquet.write_table` read and write a {class}`pyarrow.Table` object, respectively.\n", "\n", "Let's look at a simple table:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "import pandas as pd\n", "\n", "import pyarrow as pa\n", "\n", "df = pd.DataFrame({'one': [-1, np.nan, 2.5],\n", "                   'two': ['foo', 'bar', 'baz'],\n", "                   'three': [True, False, True]},\n", "                  index=list('abc'))\n", "\n", "table = pa.Table.from_pandas(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We write this table to Parquet format with `write_table`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pyarrow.parquet as pq\n", "\n", "pq.write_table(table, 'example.parquet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This creates a single Parquet file. In practice, a Parquet dataset may consist of many files spread across many directories. We can read a single file back with `read_table`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one two three\n", "a -1.0 foo True\n", "b NaN bar False\n", "c 2.5 baz True" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table2 = pq.read_table('example.parquet')\n", "\n", "table2.to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can pass a subset of columns to read, which can be much faster than reading the whole file because of the columnar layout:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one three\n", "0 -1.0 True\n", "1 NaN False\n", "2 2.5 True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pq.read_table('example.parquet', columns=['one', 'three']).to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When reading a subset of columns from a file that was written from a pandas DataFrame, we use `read_pandas` to preserve any additional index column data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " two\n", "a foo\n", "b bar\n", "c baz" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pq.read_pandas('example.parquet', columns=['two']).to_pandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The source of the file does not have to be given as a string. It can be any of the following:\n", "\n", "- A file path as a string\n", "- A [`NativeFile`](https://arrow.apache.org/docs/python/memory.html#io-native-file) from PyArrow\n", "- A Python file object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, a Python file object has the worst read performance, while a string file path or a {class}`~pyarrow.NativeFile` instance (especially a memory map) performs best." ] } ], "metadata": { "kernelspec": { "display_name": "xin", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 2 }