
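Data Types

PyMongoArrow supports most BSON types. Because Arrow and Polars provide first-class support for lists and structs, this includes embedded arrays and documents.

Support for additional types will be added in subsequent releases.

Tip

For more information about BSON types, see the BSON specification.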

BSON Type	Type Identifiers
String	`py.str`, an instance of `pyarrow.string`
Embedded document	`py.dict`, an instance of `pyarrow.struct`
Embedded array	An instance of `pyarrow.list_`
ObjectId	`py.bytes`, `bson.ObjectId`, an instance of `pymongoarrow.types.ObjectIdType`, an instance of `pymongoarrow.pandas_types.PandasObjectId`
Decimal128	`bson.Decimal128`, an instance of `pymongoarrow.types.Decimal128Type`, an instance of `pymongoarrow.pandas_types.PandasDecimal128`, an instance of `pyarrow.Decimal128`
Boolean	`py.bool`, an instance of `pyarrow.bool_`
64-bit binary floating point	`py.float`, an instance of `pyarrow.float64`
32-bit integer	An instance of `pyarrow.int32`
64-bit integer	`py.int`, `bson.int64.Int64`, an instance of `pyarrow.int64`
UTC datetime	`py.datetime.datetime`, an instance of `pyarrow.timestamp` with `ms` resolution
Binary data	`bson.Binary`, an instance of `pymongoarrow.types.BinaryType`, an instance of `pymongoarrow.pandas_types.PandasBinary`
JavaScript code	`bson.Code`, an instance of `pymongoarrow.types.CodeType`, an instance of `pymongoarrow.pandas_types.PandasCode`
null	An instance of `pyarrow.null`

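Note

PyMongoArrow supports Decimal128 only on little-endian systems. On big-endian systems, it uses null instead.

Use type identifiers to specify that a field is of a certain type during pymongoarrow.api.Schema declaration. For example, if your data has fields f1 and f2 holding 32-bit integers and UTC datetimes, and an _id that is an ObjectId, you can define your schema as follows: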

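from bson import ObjectId
import pyarrow
from pymongoarrow.api import Schema

# Map each field name to one of the type identifiers listed in the table.
schema = Schema({
  '_id': ObjectId,
  'f1': pyarrow.int32(),
  'f2': pyarrow.timestamp('ms')
})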

Unsupported data types in a schema cause a ValueError that identifies the field and its data type.

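Embedded Array Considerations

The schema used for an embedded array must use the pyarrow.list_() type to specify the type of the array elements. For example: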

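from bson import ObjectId
from pyarrow import list_, float64
from pymongoarrow.api import Schema

# 'coordinates' is an embedded array of 64-bit floats nested inside 'location'.
schema = Schema({'_id': ObjectId,
  'location': {'coordinates': list_(float64())}
})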

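Extension Types

PyMongoArrow implements the ObjectId, Decimal128, Binary data, and JavaScript code types as extension types for PyArrow and Pandas. For Arrow tables, values of these types have the appropriate pymongoarrow extension type, such as pymongoarrow.types.ObjectIdType. You can obtain the appropriate bson Python object by using the .as_py() method, or by calling .to_pylist() on the table.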

>>> from pymongo import MongoClient
>>> from bson import ObjectId
>>> from pymongoarrow.api import find_arrow_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> table = find_arrow_all(coll, {})
>>> table
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
foo: int32
----
_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
foo: [[100,200]]
>>> table["_id"][0]
<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
>>> table["_id"][0].as_py()
ObjectId('64408b0d5ac9e208af220142')
>>> table.to_pylist()
[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
 {'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]

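When converting to pandas, the extension type columns have an appropriate pymongoarrow extension type, such as pymongoarrow.pandas_types.PandasDecimal128. The value of each element in the DataFrame is the appropriate bson type.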

>>> from pymongo import MongoClient
>>> from bson import Decimal128
>>> from pymongoarrow.api import find_pandas_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> df = find_pandas_all(coll, {})
>>> df
                       _id  foo
0  64408bf65ac9e208af220144  0.1
1  64408bf65ac9e208af220145  0.1
>>> df["foo"].dtype
<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
>>> df["foo"][0]
Decimal128('0.1')
>>> df["_id"][0]
ObjectId('64408bf65ac9e208af220144')

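Polars does not support extension types.

Null Values and Conversion to Pandas DataFrames

In Arrow and Polars, all arrays are nullable. Pandas has experimental nullable data types, such as Int64. You can instruct Arrow to create a pandas DataFrame using nullable dtypes with the following code, adapted from the Apache Arrow documentation.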

>>> import pandas as pd
>>> import pyarrow as pa
>>> dtype_mapping = {
...     pa.int8(): pd.Int8Dtype(),
...     pa.int16(): pd.Int16Dtype(),
...     pa.int32(): pd.Int32Dtype(),
...     pa.int64(): pd.Int64Dtype(),
...     pa.uint8(): pd.UInt8Dtype(),
...     pa.uint16(): pd.UInt16Dtype(),
...     pa.uint32(): pd.UInt32Dtype(),
...     pa.uint64(): pd.UInt64Dtype(),
...     pa.bool_(): pd.BooleanDtype(),
...     pa.float32(): pd.Float32Dtype(),
...     pa.float64(): pd.Float64Dtype(),
...     pa.string(): pd.StringDtype(),
... }
>>> df = arrow_table.to_pandas(
...     types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
... )
>>> del arrow_table

Defining the conversion for pa.string() also converts Arrow strings to NumPy strings rather than objects.

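Nested Extension Types

Pending ARROW-179, extension types such as ObjectId that appear in nested documents are not converted to the corresponding PyMongoArrow extension type. Instead, they have the raw Arrow type FixedSizeBinaryType(fixed_size_binary[12]).

These values can be used as is, or converted individually to the desired extension type, for example _id = out['nested'][0]['_id'].cast(ObjectIdType()).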

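Strict Type Adherence

When you provide a data schema in PyMongoArrow v1.9 and later, the driver enforces strict type adherence. If the type of a field value does not match the schema type for that field, the driver raises a TypeError. NaN is a valid type for all fields.

To suppress TypeErrors and silently convert type mismatches to NaN, pass the allow_invalid=True argument to your pymongoarrow API call. The following example passes allow_invalid=True to find_arrow_all() so that the mismatched name and age values in the sample data are converted to NaN: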

>>> from pymongoarrow.monkey import patch_all
>>> from pymongoarrow.api import Schema, find_arrow_all
>>> from pyarrow import int32, string
>>> from pymongo import MongoClient
>>> patch_all()
>>> client = MongoClient()
>>> coll = client.test.test
>>> sample_data = [
...     {"name": "Alice", "age": 35, "city": "Chicago"},
...     {"name": {"first": "Bob", "last": "Smith"}, "age": 28, "city": "Boston"},
...     {"name": "Charlie", "age": "thirty-two", "city": "Seattle"}
... ]
>>> coll.insert_many(sample_data)
>>> schema = Schema({
...     "name": string(),
...     "age": int32(),
...     "city": string()
... })
>>> result = find_arrow_all(coll, {}, schema=schema, allow_invalid=True)
