Data Types
PyMongoArrow supports a majority of the BSON types. Because Arrow and Polars provide first-class support for Lists and Structs, this includes embedded arrays and documents.
Support for additional types will be added in subsequent releases.
Tip
For more information about BSON types, see the BSON specification.
BSON Type | Type Identifiers |
---|---|
String | py.str, an instance of pyarrow.string |
Embedded document | py.dict, an instance of pyarrow.struct |
Embedded array | An instance of pyarrow.list_ |
ObjectId | py.bytes, bson.ObjectId, an instance of pymongoarrow.types.ObjectIdType, an instance of pymongoarrow.pandas_types.PandasObjectId |
Decimal128 | bson.Decimal128, an instance of pymongoarrow.types.Decimal128Type, an instance of pymongoarrow.pandas_types.PandasDecimal128 |
Boolean | An instance of pyarrow.bool_, py.bool |
64-bit binary floating point | py.float, an instance of pyarrow.float64 |
32-bit integer | An instance of pyarrow.int32 |
64-bit integer | py.int, bson.int64.Int64, an instance of pyarrow.int64 |
UTC datetime | An instance of pyarrow.timestamp with ms resolution, py.datetime.datetime |
Binary data | bson.Binary, an instance of pymongoarrow.types.BinaryType, an instance of pymongoarrow.pandas_types.PandasBinary |
JavaScript code | bson.Code, an instance of pymongoarrow.types.CodeType, an instance of pymongoarrow.pandas_types.PandasCode |
Note
PyMongoArrow supports Decimal128 only on little-endian systems. On big-endian systems, it uses null instead.
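If you are unsure which case applies to your host, a quick check of the interpreter's byte order is enough. The following is a minimal sketch using only the standard library; the helper name `decimal128_supported` is hypothetical and is not part of PyMongoArrow.

```python
import sys


def decimal128_supported() -> bool:
    """Hypothetical helper: True on little-endian hosts.

    Per the note above, Decimal128 is supported only on little-endian
    systems; on big-endian systems null is used instead.
    """
    return sys.byteorder == "little"


if not decimal128_supported():
    print("Big-endian host: expect null in place of Decimal128 values.")
```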
Use type identifiers to specify that a field is of a certain type during pymongoarrow.api.Schema declaration. For example, if your data has fields f1 and f2 of types 32-bit integer and UTC datetime, and an _id that is an ObjectId, you can define your schema as follows:
    import pyarrow
    from bson import ObjectId
    from pymongoarrow.api import Schema

    schema = Schema({'_id': ObjectId, 'f1': pyarrow.int32(), 'f2': pyarrow.timestamp('ms')})
Unsupported data types in a schema cause a ValueError identifying the field and its data type.
Embedded Array Considerations
The schema used for an embedded array must use the pyarrow.list_() type to specify the type of the array elements. For example:
    from bson import ObjectId
    from pyarrow import list_, float64
    from pymongoarrow.api import Schema

    schema = Schema({'_id': ObjectId, 'location': {'coordinates': list_(float64())}})
Extension Types
PyMongoArrow implements the ObjectId, Decimal128, Binary data, and JavaScript code types as extension types for PyArrow and Pandas. For Arrow tables, values of these types have the appropriate pymongoarrow extension type, such as pymongoarrow.types.ObjectIdType. You can obtain the appropriate bson Python object by using the .as_py() method, or by calling .to_pylist() on the table.
    >>> from pymongo import MongoClient
    >>> from bson import ObjectId
    >>> from pymongoarrow.api import find_arrow_all
    >>> client = MongoClient()
    >>> coll = client.test.test
    >>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
    <pymongo.results.InsertManyResult at 0x1080a72b0>
    >>> table = find_arrow_all(coll, {})
    >>> table
    pyarrow.Table
    _id: extension<arrow.py_extension_type<ObjectIdType>>
    foo: int32
    ----
    _id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
    foo: [[100,200]]
    >>> table["_id"][0]
    <pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
    >>> table["_id"][0].as_py()
    ObjectId('64408b0d5ac9e208af220142')
    >>> table.to_pylist()
    [{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
     {'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]
When converting to pandas, the extension type columns have an appropriate pymongoarrow extension type, such as pymongoarrow.pandas_types.PandasDecimal128. The value of each element in the DataFrame is the appropriate bson type.
    >>> from pymongo import MongoClient
    >>> from bson import Decimal128
    >>> from pymongoarrow.api import find_pandas_all
    >>> client = MongoClient()
    >>> coll = client.test.test
    >>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
    <pymongo.results.InsertManyResult at 0x1080a72b0>
    >>> df = find_pandas_all(coll, {})
    >>> df
                            _id  foo
    0  64408bf65ac9e208af220144  0.1
    1  64408bf65ac9e208af220145  0.1
    >>> df["foo"].dtype
    <pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
    >>> df["foo"][0]
    Decimal128('0.1')
    >>> df["_id"][0]
    ObjectId('64408bf65ac9e208af220144')
Polars does not support Extension Types.
Null Values and Conversion to Pandas DataFrames
In Arrow and Polars, all Arrays are nullable. Pandas has experimental nullable data types, such as Int64. You can instruct Arrow to create a pandas DataFrame using nullable dtypes with the following code, taken from the Apache Arrow documentation.
    >>> import pandas as pd
    >>> import pyarrow as pa
    >>> dtype_mapping = {
    ...     pa.int8(): pd.Int8Dtype(),
    ...     pa.int16(): pd.Int16Dtype(),
    ...     pa.int32(): pd.Int32Dtype(),
    ...     pa.int64(): pd.Int64Dtype(),
    ...     pa.uint8(): pd.UInt8Dtype(),
    ...     pa.uint16(): pd.UInt16Dtype(),
    ...     pa.uint32(): pd.UInt32Dtype(),
    ...     pa.uint64(): pd.UInt64Dtype(),
    ...     pa.bool_(): pd.BooleanDtype(),
    ...     pa.float32(): pd.Float32Dtype(),
    ...     pa.float64(): pd.Float64Dtype(),
    ...     pa.string(): pd.StringDtype(),
    ... }
    >>> df = arrow_table.to_pandas(
    ...     types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
    ... )
    >>> del arrow_table
Defining a conversion for pa.string() also converts Arrow strings to NumPy strings, and not objects.
Nested Extension Types
Pending ARROW-179, extension types, such as ObjectId, that appear in nested documents are not converted to the corresponding PyMongoArrow extension type, but instead have the raw Arrow type, FixedSizeBinaryType(fixed_size_binary[12]). These values can be consumed as-is, or converted individually to the desired extension type, such as _id = out['nested'][0]['_id'].cast(ObjectIdType()).