{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
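"# Reading and Writing the Apache Parquet Format\n",
"\n",
"The [Apache Parquet](http://parquet.apache.org/) project provides a standardized, open-source columnar storage format for data analysis systems. It was originally created for [Apache Hadoop](http://hadoop.apache.org/) and was later adopted by systems such as [Apache Drill](http://drill.apache.org/), [Apache Hive](http://hive.apache.org/), [Apache Impala](http://impala.apache.org/), and [Apache Spark](http://spark.apache.org/) as a shared standard for high-performance data IO."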
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
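"Apache Arrow is an ideal in-memory transport layer for data that is read from or written to Parquet files. A [C++ implementation](https://github.com/apache/arrow/tree/main/cpp/tools/parquet) of Apache Parquet has been developed in parallel; it includes a native, multithreaded C++ adapter to and from in-memory Arrow data. PyArrow provides Python bindings to this code, which also makes it possible to read and write Parquet files with `pandas`."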
]
},
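{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the pandas integration mentioned above, the sketch below writes a small DataFrame to Parquet and reads it back through the `pyarrow` engine. The scratch path and the names `df_demo` and `df_back` are hypothetical and exist only for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical scratch file in a temporary directory\n",
"path = os.path.join(tempfile.mkdtemp(), 'pandas_demo.parquet')\n",
"\n",
"df_demo = pd.DataFrame({'x': [1, 2, 3]})\n",
"\n",
"# pandas delegates the Parquet IO to PyArrow when engine='pyarrow'\n",
"df_demo.to_parquet(path, engine='pyarrow')\n",
"df_back = pd.read_parquet(path, engine='pyarrow')\n",
"\n",
"# Check that the round trip preserved the data\n",
"print(df_back.equals(df_demo))"
]
},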
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Obtaining pyarrow with Parquet Support\n",
"If you installed `pyarrow` with `pip` or `conda`, it should already include Parquet support:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow.parquet as pq"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading and Writing Single Files\n",
"The functions {func}`~pyarrow.parquet.read_table` and {func}`~pyarrow.parquet.write_table` read and write {class}`pyarrow.Table` objects, respectively.\n",
"\n",
"Let's look at a simple table:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"import pandas as pd\n",
"\n",
"import pyarrow as pa\n",
"\n",
"df = pd.DataFrame({'one': [-1, np.nan, 2.5],\n",
"                   'two': ['foo', 'bar', 'baz'],\n",
"                   'three': [True, False, True]},\n",
"                  index=list('abc'))\n",
"\n",
"table = pa.Table.from_pandas(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We write this table to Parquet format with the `write_table` function:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow.parquet as pq\n",
"\n",
"pq.write_table(table, 'example.parquet')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
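"This creates a single Parquet file. In practice, a Parquet dataset may consist of many files spread across many directories. We can read a single file back with the `read_table` function:"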
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
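"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>one</th>\n",
"      <th>two</th>\n",
"      <th>three</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>a</th>\n",
"      <td>-1.0</td>\n",
"      <td>foo</td>\n",
"      <td>True</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>b</th>\n",
"      <td>NaN</td>\n",
"      <td>bar</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>c</th>\n",
"      <td>2.5</td>\n",
"      <td>baz</td>\n",
"      <td>True</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"    one  two  three\n",
"a  -1.0  foo   True\n",
"b   NaN  bar  False\n",
"c   2.5  baz   True"
]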
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table2 = pq.read_table('example.parquet')\n",
"\n",
"table2.to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass a subset of columns to read. Because of the columnar layout, this can be much faster than reading the whole file:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
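"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>one</th>\n",
"      <th>three</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>-1.0</td>\n",
"      <td>True</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>NaN</td>\n",
"      <td>False</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>2.5</td>\n",
"      <td>True</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"    one  three\n",
"0  -1.0   True\n",
"1   NaN  False\n",
"2   2.5   True"
]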
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pq.read_table('example.parquet', columns=['one', 'three']).to_pandas()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When reading a subset of columns from a file that was written from a pandas DataFrame, we use `read_pandas` to preserve any additional index column data:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
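"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>two</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>a</th>\n",
"      <td>foo</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>b</th>\n",
"      <td>bar</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>c</th>\n",
"      <td>baz</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"   two\n",
"a  foo\n",
"b  bar\n",
"c  baz"
]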
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pq.read_pandas('example.parquet', columns=['two']).to_pandas()"
]
},
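{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where the index data comes from, the sketch below inspects a table built with `Table.from_pandas`. It assumes the default behavior, in which a non-trivial pandas index is stored as an extra column and described in the `pandas` entry of the schema metadata. The names `df_demo` and `table_demo` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pyarrow as pa\n",
"\n",
"# A small frame with a string index, used only for illustration\n",
"df_demo = pd.DataFrame({'x': [1, 2]}, index=list('ab'))\n",
"table_demo = pa.Table.from_pandas(df_demo)\n",
"\n",
"# The index is carried as an additional column of the Arrow table\n",
"print(table_demo.column_names)\n",
"\n",
"# The pandas metadata records which columns hold index data\n",
"print(table_demo.schema.pandas_metadata['index_columns'])"
]
},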
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The source of the file does not have to be a string. It can be any of the following:\n",
"\n",
"- A file path as a string\n",
"- A [`NativeFile`](https://arrow.apache.org/docs/python/memory.html#io-native-file) from PyArrow\n",
"- A Python file object"
]
},
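{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three kinds of source listed above can be sketched as follows. This is a minimal illustration rather than a benchmark; the scratch path and the table contents are hypothetical, and a memory map stands in for the `NativeFile` case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"\n",
"import pyarrow as pa\n",
"import pyarrow.parquet as pq\n",
"\n",
"# Hypothetical scratch file in a temporary directory\n",
"path = os.path.join(tempfile.mkdtemp(), 'sources_demo.parquet')\n",
"pq.write_table(pa.table({'x': [1, 2, 3]}), path)\n",
"\n",
"# 1. A file path given as a string\n",
"t_path = pq.read_table(path)\n",
"\n",
"# 2. A PyArrow NativeFile, here a read-only memory map\n",
"with pa.memory_map(path, 'r') as source:\n",
"    t_native = pq.read_table(source)\n",
"\n",
"# 3. A Python file object opened in binary mode\n",
"with open(path, 'rb') as f:\n",
"    t_pyfile = pq.read_table(f)\n",
"\n",
"# Compare the tables read from the three sources\n",
"print(t_path.equals(t_native) and t_path.equals(t_pyfile))"
]
},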
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, a Python file object has the worst read performance, while a string file path or a {class}`~pyarrow.NativeFile` instance (especially a memory map) performs best."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "xin",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}