
Data Types

On this page

  • Embedded Array Considerations
  • Extension Types
  • Null Values and Conversion to Pandas DataFrames
  • Nested Extension Types

PyMongoArrow supports a majority of the BSON types. Because Arrow and Polars provide first-class support for Lists and Structs, this includes embedded arrays and documents.

Support for additional types will be added in subsequent releases.

Tip

For more information about BSON types, see the BSON specification.

The following list shows each supported BSON type and its type identifiers:

  • String: str, an instance of pyarrow.string
  • Embedded document: dict, an instance of pyarrow.struct
  • Embedded array: an instance of pyarrow.list_
  • ObjectId: bytes, bson.ObjectId, an instance of pymongoarrow.types.ObjectIdType, an instance of pymongoarrow.pandas_types.PandasObjectId
  • Decimal128: bson.Decimal128, an instance of pymongoarrow.types.Decimal128Type, an instance of pymongoarrow.pandas_types.PandasDecimal128
  • Boolean: bool, an instance of pyarrow.bool_
  • 64-bit binary floating point: float, an instance of pyarrow.float64
  • 32-bit integer: an instance of pyarrow.int32
  • 64-bit integer: int, bson.int64.Int64, an instance of pyarrow.int64
  • UTC datetime: datetime.datetime, an instance of pyarrow.timestamp with ms resolution
  • Binary data: bson.Binary, an instance of pymongoarrow.types.BinaryType, an instance of pymongoarrow.pandas_types.PandasBinary
  • JavaScript code: bson.Code, an instance of pymongoarrow.types.CodeType, an instance of pymongoarrow.pandas_types.PandasCode

Note

PyMongoArrow supports Decimal128 only on little-endian systems. On big-endian systems, it uses null instead.

Use type identifiers to specify that a field is of a certain type during pymongoarrow.api.Schema declaration. For example, if your data has fields f1 and f2 of type 32-bit integer and UTC datetime respectively, and an _id that is an ObjectId, you can define your schema as follows:

import pyarrow
from bson import ObjectId
from pymongoarrow.api import Schema

schema = Schema({
    '_id': ObjectId,
    'f1': pyarrow.int32(),
    'f2': pyarrow.timestamp('ms'),
})

If a schema contains an unsupported data type, PyMongoArrow raises a ValueError that identifies the field and its data type.
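For example, declaring a field with a type that PyMongoArrow does not recognize fails. The following is a minimal sketch that assumes the error is raised when the Schema is declared; the set type is just an arbitrary unsupported identifier.

from pymongoarrow.api import Schema

try:
    # set is not a supported type identifier
    Schema({'tags': set})
except ValueError as exc:
    print(exc)  # the message names the offending field and its type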

Embedded Array Considerations

The schema used for an embedded array must use the pyarrow.list_() type to specify the type of the array elements. For example:

from pyarrow import list_, float64

schema = Schema({
    '_id': ObjectId,
    'location': {'coordinates': list_(float64())},
})
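You can declare embedded documents in a similar way, either with a nested Python dict as shown above or with an explicit pyarrow.struct type. The following is a minimal sketch; the location field with lat and lon sub-fields is a hypothetical document shape, not part of the example above.

from pyarrow import struct, float64
from bson import ObjectId
from pymongoarrow.api import Schema

schema = Schema({
    '_id': ObjectId,
    'location': struct([('lat', float64()), ('lon', float64())]),
})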

Extension Types

PyMongoArrow implements the ObjectId, Decimal128, Binary data, and JavaScript code types as extension types for PyArrow and Pandas. In Arrow tables, values of these types have the appropriate PyMongoArrow extension type, such as pymongoarrow.types.ObjectIdType. You can obtain the corresponding bson Python object by using the .as_py() method or by calling .to_pylist() on the table.

>>> from pymongo import MongoClient
>>> from bson import ObjectId
>>> from pymongoarrow.api import find_arrow_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"_id": ObjectId(), "foo": 100}, {"_id": ObjectId(), "foo": 200}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> table = find_arrow_all(coll, {})
>>> table
pyarrow.Table
_id: extension<arrow.py_extension_type<ObjectIdType>>
foo: int32
----
_id: [[64408B0D5AC9E208AF220142,64408B0D5AC9E208AF220143]]
foo: [[100,200]]
>>> table["_id"][0]
<pyarrow.ObjectIdScalar: ObjectId('64408b0d5ac9e208af220142')>
>>> table["_id"][0].as_py()
ObjectId('64408b0d5ac9e208af220142')
>>> table.to_pylist()
[{'_id': ObjectId('64408b0d5ac9e208af220142'), 'foo': 100},
{'_id': ObjectId('64408b0d5ac9e208af220143'), 'foo': 200}]

When you convert to pandas, the extension-type columns have the appropriate PyMongoArrow extension dtype, such as pymongoarrow.pandas_types.PandasDecimal128, and each element in the DataFrame is the corresponding bson type.

>>> from pymongo import MongoClient
>>> from bson import Decimal128
>>> from pymongoarrow.api import find_pandas_all
>>> client = MongoClient()
>>> coll = client.test.test
>>> coll.insert_many([{"foo": Decimal128("0.1")}, {"foo": Decimal128("0.1")}])
<pymongo.results.InsertManyResult at 0x1080a72b0>
>>> df = find_pandas_all(coll, {})
>>> df
                        _id  foo
0  64408bf65ac9e208af220144  0.1
1  64408bf65ac9e208af220145  0.1
>>> df["foo"].dtype
<pymongoarrow.pandas_types.PandasDecimal128 at 0x11fe0ae90>
>>> df["foo"][0]
Decimal128('0.1')
>>> df["_id"][0]
ObjectId('64408bf65ac9e208af220144')

Polars does not support Extension Types.
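If you need a Polars DataFrame, one option is to limit the schema to fields whose types Polars can represent natively. The following is a minimal sketch that reuses the coll collection and foo field from the first example above; adapt the schema to your own data.

from pymongoarrow.api import Schema, find_polars_all

# Restrict the schema to plain types; here "foo" holds 64-bit integers.
df = find_polars_all(coll, {}, schema=Schema({"foo": int}))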

Null Values and Conversion to Pandas DataFrames

In Arrow and Polars, all arrays are nullable. Pandas has experimental nullable data types, such as Int64. You can instruct Arrow to create a pandas DataFrame with nullable dtypes by using the following code, adapted from the Apache Arrow documentation:

>>> import pandas as pd
>>> import pyarrow as pa
>>> dtype_mapping = {
...     pa.int8(): pd.Int8Dtype(),
...     pa.int16(): pd.Int16Dtype(),
...     pa.int32(): pd.Int32Dtype(),
...     pa.int64(): pd.Int64Dtype(),
...     pa.uint8(): pd.UInt8Dtype(),
...     pa.uint16(): pd.UInt16Dtype(),
...     pa.uint32(): pd.UInt32Dtype(),
...     pa.uint64(): pd.UInt64Dtype(),
...     pa.bool_(): pd.BooleanDtype(),
...     pa.float32(): pd.Float32Dtype(),
...     pa.float64(): pd.Float64Dtype(),
...     pa.string(): pd.StringDtype(),
... }
>>> df = arrow_table.to_pandas(
...     types_mapper=dtype_mapping.get, split_blocks=True, self_destruct=True
... )
>>> del arrow_table

Defining a conversion for pa.string() also converts Arrow strings to NumPy strings rather than objects.

Nested Extension Types

Pending ARROW-179, extension types such as ObjectId that appear in nested documents are not converted to the corresponding PyMongoArrow extension type; instead they have the raw Arrow type, FixedSizeBinaryType(fixed_size_binary[12]).

These values can be consumed as-is, or converted individually to the desired extension type, such as _id = out['nested'][0]['_id'].cast(ObjectIdType()).
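For example, the following is a minimal sketch assuming a hypothetical collection coll whose documents contain a nested embedded document with an _id field; it casts the nested value to the PyMongoArrow extension type and then retrieves the bson object.

from pymongoarrow.api import find_arrow_all
from pymongoarrow.types import ObjectIdType

out = find_arrow_all(coll, {})
# The nested _id arrives as fixed_size_binary[12]; cast it to ObjectIdType
# so that as_py() returns a bson.ObjectId.
_id = out['nested'][0]['_id'].cast(ObjectIdType())
print(_id.as_py())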
