
Configuration Options

On this page

  • Specify Configuration
  • Input Configuration
  • Output Configuration
  • Cache Configuration

Various configuration options are available for the MongoDB Spark Connector.

Specify Configuration

You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file. The MongoDB Spark Connector uses the settings in SparkConf as the defaults.

Important

When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.
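For example, the prefixed settings can also be applied programmatically when constructing the SparkConf. A minimal sketch in Scala, assuming a SparkSession-based application and placeholder connection details:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Each option carries its section prefix when set through SparkConf:
// spark.mongodb.input. for reads, spark.mongodb.output. for writes.
val conf = new SparkConf()
  .setAppName("mongo-spark-config-example")  // hypothetical app name
  .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")

val spark = SparkSession.builder().config(conf).getOrCreate()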

Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.

For examples, see Using a ReadConfig and Using a WriteConfig. For more details, refer to the source for these methods.
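As a sketch of this override behavior (the collection name and read preference are placeholders, and sc is assumed to be an existing SparkContext), a ReadConfig built from the SparkConf defaults can replace individual settings for a single read:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// ReadConfig(sc) picks up the spark.mongodb.input.* defaults from SparkConf;
// the Map entries override only those settings for this load.
val readConfig = ReadConfig(
  Map("collection" -> "otherCollection", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))

val customRdd = MongoSpark.load(sc, readConfig)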

In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String].

You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
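For instance, a minimal sketch, assuming an active SparkSession named spark and placeholder connection details:

import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/",
  "database" -> "test",             // placeholder database
  "collection" -> "myCollection"))  // placeholder collection

// asOptions converts the ReadConfig into a Map[String, String] that the
// DataFrameReader accepts.
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .options(readConfig.asOptions)
  .load()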

The connector provides a cache for MongoClients that can only be configured via a system property. See Cache Configuration.

Input Configuration

The following options for reading from MongoDB are available:

Note

If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input..

Property name
Description
uri

Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.

The other remaining input options may be appended to the uri setting. See uri Configuration Setting.

database
Required. The database name from which to read data.
collection
Required. The collection name from which to read data.
batchSize
Size of the internal batches used within the cursor.
localThreshold

The threshold (in milliseconds) for choosing a server from multiple MongoDB servers.

Default: 15 ms

readPreference.name

The Read Preference to use.

Default: Primary

readPreference.tagSets
The ReadPreference TagSets to use.
readConcern.level
The Read Concern level to use.
sampleSize

The sample size to use when inferring the schema.

Default: 1000

samplePoolSize

The sample pool size, used to limit the results from which to sample data.

Default: 10000

partitioner

The class name of the partitioner to use to partition the data. The connector provides the following partitioners:

  • MongoDefaultPartitioner
    Default. Wraps the MongoSamplePartitioner and provides help for users of older versions of MongoDB.
  • MongoSamplePartitioner
    Requires MongoDB 3.2. A general purpose partitioner for all deployments. Uses the average document size and random sampling of the collection to determine suitable partitions for the collection. For configuration settings for the MongoSamplePartitioner, see MongoSamplePartitioner Configuration.
  • MongoShardedPartitioner
    A partitioner for sharded clusters. Partitions the collection based on the data chunks. Requires read access to the config database. For configuration settings for the MongoShardedPartitioner, see MongoShardedPartitioner Configuration.
  • MongoSplitVectorPartitioner
    A partitioner for standalone or replica sets. Uses the splitVector command on the standalone or the primary to determine the partitions of the database. Requires privileges to run splitVector command. For configuration settings for the MongoSplitVectorPartitioner, see MongoSplitVectorPartitioner Configuration.
  • MongoPaginateByCountPartitioner
    A slow, general purpose partitioner for all deployments. Creates a specific number of partitions. Requires a query for every partition. For configuration settings for the MongoPaginateByCountPartitioner, see MongoPaginateByCountPartitioner Configuration.
  • MongoPaginateBySizePartitioner
    A slow, general purpose partitioner for all deployments. Creates partitions based on data size. Requires a query for every partition. For configuration settings for the MongoPaginateBySizePartitioner, see MongoPaginateBySizePartitioner Configuration.

In addition to the provided partitioners, you can also specify a custom partitioner implementation. For custom implementations of the MongoPartitioner trait, provide the full class name. If no package names are provided, then the default com.mongodb.spark.rdd.partitioner package is used.

To configure options for the various partitioners, see Partitioner Configuration.

Default: MongoDefaultPartitioner

registerSQLHelperFunctions

Register helper methods for unsupported MongoDB data types.

Default: false

sql.inferschema.mapTypes.enabled

Enable MapType detection in the schema infer step.

Default: true

sql.inferschema.mapTypes.minimumKeys

The minimum number of keys a StructType needs to have to be inferred as MapType.

Default: 250

hint
The JSON representation of a hint document. Used when querying MongoDB.
collation
The JSON representation of a collation. Used when querying MongoDB.
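The read options above can also be supplied directly to the DataFrameReader, without the spark.mongodb.input. prefix. A minimal sketch, assuming an active SparkSession named spark and placeholder connection details:

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/")
  .option("database", "test")            // placeholder database
  .option("collection", "myCollection")  // placeholder collection
  .option("readPreference.name", "secondaryPreferred")
  .option("sampleSize", "5000")          // sample 5000 documents for schema inference
  .load()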

Partitioner Configuration

MongoSamplePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition.

Default: 64 MB

samplesPerPartition

The number of sample documents to take for each partition.

Default: 10

MongoShardedPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
shardkey

The field by which to split the collection data. The field should be indexed.

Default: _id

Important

This property is not compatible with hashed shard keys.

MongoSplitVectorPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition.

Default: 64 MB

MongoPaginateByCountPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

numberOfPartitions

The number of partitions to create.

Default: 64

MongoPaginateBySizePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition.

Default: 64 MB
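Tying the partitioner and partitioner options together, here is a minimal sketch of selecting the MongoSamplePartitioner through a ReadConfig (the URI and option values are placeholders, and sc is assumed to be an existing SparkContext); via SparkConf the same keys would carry the spark.mongodb.input. prefix:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/test.myCollection",  // placeholder URI
  "partitioner" -> "MongoSamplePartitioner",
  "partitionerOptions.partitionKey" -> "_id",
  "partitionerOptions.partitionSizeMB" -> "32"))     // smaller partitions than the 64 MB default

val rdd = MongoSpark.load(sc, readConfig)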

uri Configuration Setting

You can set all of the input configuration options via the input uri setting.

For example, the following sets the input uri setting via SparkConf:

Note

If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input..

spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred

The configuration corresponds to the following separate configuration settings:

spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred

If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:

spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar

Output Configuration

The following options for writing to MongoDB are available:

Note

If setting these connector output configurations via SparkConf, prefix these settings with: spark.mongodb.output..

Property name
Description
uri

Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.

Note

The other remaining options may be appended to the uri setting. See uri Configuration Setting.

database
Required. The database name to write data to.
collection
Required. The collection name to write data to.
extendedBsonTypes

Enables extended BSON types when writing data to MongoDB.

Default: true

localThreshold

The threshold (milliseconds) for choosing a server from multiple MongoDB servers.

Default: 15 ms

replaceDocument

Replace the whole document when saving Datasets that contain an _id field. If false, the connector updates only the fields in the document that match the fields in the Dataset.

Default: true

maxBatchSize

The maximum batch size for bulk operations when saving data.

Default: 512

writeConcern.w

The write concern w value.

Default: w: 1

writeConcern.journal
The write concern journal value.
writeConcern.wTimeoutMS
The write concern wTimeout value.
shardKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

forceInsert

Forces saves to use inserts, even if a Dataset contains _id.

Default: false

ordered

Sets the bulk operations ordered property.

Default: true
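As a sketch of overriding a few of these settings for a single save (the collection name and values are placeholders, documents is assumed to be an RDD of org.bson.Document, and sc an existing SparkContext):

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

// WriteConfig(sc) picks up the spark.mongodb.output.* defaults from SparkConf;
// the Map entries override only those settings for this save.
val writeConfig = WriteConfig(
  Map("collection" -> "otherCollection", "writeConcern.w" -> "majority", "replaceDocument" -> "false"),
  Some(WriteConfig(sc)))

MongoSpark.save(documents, writeConfig)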

uri Configuration Setting

You can set all of the output configuration options via the output uri setting.

For example, the following sets the output uri setting via SparkConf:

Note

If configuring the MongoDB Spark output settings via SparkConf, prefix the setting with spark.mongodb.output..

spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection

The configuration corresponds to the following separate configuration settings:

spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection

If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:

spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar

Cache Configuration

The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.

Important

As the cache is set up before the Spark configuration is available, the cache can only be configured via a system property.

System Property name
Description
mongodb.keep_alive_ms

The length of time (in milliseconds) to keep a MongoClient available for sharing.

Default: 5000
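A minimal sketch of setting the property in Scala, assuming it runs in the driver before any MongoDB Spark Connector classes are loaded; on executors the property would instead need to be passed as a JVM option (for example through spark.executor.extraJavaOptions):

// Keep pooled MongoClients alive for 10 seconds instead of the 5000 ms default.
System.setProperty("mongodb.keep_alive_ms", "10000")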
