
Configuration Options

On this page

  • Specify Configuration
  • Input Configuration
  • Output Configuration
  • Cache Configuration

Various configuration options are available for the MongoDB Spark Connector.

Specify Configuration

You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file. The MongoDB Spark Connector uses the settings in SparkConf as the defaults.

Important

When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.
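For example, the prefixed settings can also be applied programmatically when constructing the SparkConf. A minimal sketch in Scala, assuming a SparkSession-based application and placeholder connection details:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Each option carries its section prefix when set through SparkConf:
// spark.mongodb.input. for reads, spark.mongodb.output. for writes.
val conf = new SparkConf()
  .setAppName("mongo-spark-config-example")  // hypothetical app name
  .set("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .set("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")

val spark = SparkSession.builder().config(conf).getOrCreate()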

Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.

For examples, see Using a ReadConfig and Using a WriteConfig. For more details, refer to the source for these methods.
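As a sketch of this override behavior (the collection name and read preference are placeholders, and sc is assumed to be an existing SparkContext), a ReadConfig built from the SparkConf defaults can replace individual settings for a single read:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// ReadConfig(sc) picks up the spark.mongodb.input.* defaults from SparkConf;
// the Map entries override only those settings for this load.
val readConfig = ReadConfig(
  Map("collection" -> "otherCollection", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))

val customRdd = MongoSpark.load(sc, readConfig)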

In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String].

You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
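For instance, a minimal sketch, assuming an active SparkSession named spark and placeholder connection details:

import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/",
  "database" -> "test",             // placeholder database
  "collection" -> "myCollection"))  // placeholder collection

// asOptions converts the ReadConfig into a Map[String, String] that the
// DataFrameReader accepts.
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .options(readConfig.asOptions)
  .load()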

The connector provides a cache for MongoClients that can only be configured via a system property. See Cache Configuration.

Input Configuration

The following options for reading from MongoDB are available:

Note

If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input..

Property name
Description
uri

Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.

The other remaining input options may be appended to the uri setting. See uri Configuration Setting.

database
Required. The database name from which to read data.
collection
Required. The collection name from which to read data.
batchSize
Size of the internal batches used within the cursor.
localThreshold

The threshold (in milliseconds) for choosing a server from multiple MongoDB servers.

Default: 15 ms

readPreference.name

The Read Preference to use.

Default: Primary

readPreference.tagSets
The ReadPreference TagSets to use.
readConcern.level
The Read Concern level to use.
sampleSize

The sample size to use when inferring the schema.

Default: 1000

samplePoolSize

The sample pool size, used to limit the results from which to sample data.

Default: 10000

partitioner

The class name of the partitioner to use to partition the data. The connector provides the following partitioners:

  • MongoDefaultPartitioner
    Default. Wraps the MongoSamplePartitioner and provides help for users of older versions of MongoDB.
  • MongoSamplePartitioner
    Requires MongoDB 3.2. A general purpose partitioner for all deployments. Uses the average document size and random sampling of the collection to determine suitable partitions for the collection. For configuration settings for the MongoSamplePartitioner, see MongoSamplePartitioner Configuration.
  • MongoShardedPartitioner
    A partitioner for sharded clusters. Partitions the collection based on the data chunks. Requires read access to the config database. For configuration settings for the MongoShardedPartitioner, see MongoShardedPartitioner Configuration.
  • MongoSplitVectorPartitioner
    A partitioner for standalone or replica sets. Uses the splitVector command on the standalone or the primary to determine the partitions of the database. Requires privileges to run splitVector command. For configuration settings for the MongoSplitVectorPartitioner, see MongoSplitVectorPartitioner Configuration.
  • MongoPaginateByCountPartitioner
    A slow, general purpose partitioner for all deployments. Creates a specific number of partitions. Requires a query for every partition. For configuration settings for the MongoPaginateByCountPartitioner, see MongoPaginateByCountPartitioner Configuration.
  • MongoPaginateBySizePartitioner
    A slow, general purpose partitioner for all deployments. Creates partitions based on data size. Requires a query for every partition. For configuration settings for the MongoPaginateBySizePartitioner, see MongoPaginateBySizePartitioner Configuration.

In addition to the provided partitioners, you can also specify a custom partitioner implementation. For custom implementations of the MongoPartitioner trait, provide the full class name. If no package names are provided, then the default com.mongodb.spark.rdd.partitioner package is used.

To configure options for the various partitioners, see Partitioner Configuration.

Default: MongoDefaultPartitioner

registerSQLHelperFunctions

Register helper methods for unsupported MongoDB data types.

Default: false

sql.inferschema.mapTypes.enabled

Enable MapType detection in the schema infer step.

Default: true

sql.inferschema.mapTypes.minimumKeys

The minimum number of keys a StructType needs to have to be inferred as MapType.

Default: 250

hint
The JSON representation of a hint document. Used when querying MongoDB.
collation
The JSON representation of a collation. Used when querying MongoDB.
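The read options above can also be supplied directly to the DataFrameReader, without the spark.mongodb.input. prefix. A minimal sketch, assuming an active SparkSession named spark and placeholder connection details:

val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/")
  .option("database", "test")            // placeholder database
  .option("collection", "myCollection")  // placeholder collection
  .option("readPreference.name", "secondaryPreferred")
  .option("sampleSize", "5000")          // sample 5000 documents for schema inference
  .load()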

Partitioner Configuration

MongoSamplePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition.

Default: 64 MB

samplesPerPartition

The number of sample documents to take for each partition.

Default: 10

MongoShardedPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
shardkey

The field by which to split the collection data. The field should be indexed.

Default: _id

Important

This property is not compatible with hashed shard keys.

MongoSplitVectorPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition.

Default: 64 MB

MongoPaginateByCountPartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

numberOfPartitions

The number of partitions to create.

Default: 64

MongoPaginateBySizePartitioner Configuration

Note

If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..

Property name
Description
partitionKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

partitionSizeMB

The size (in MB) for each partition.

Default: 64 MB
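Tying the partitioner and partitioner options together, here is a minimal sketch of selecting the MongoSamplePartitioner through a ReadConfig (the URI and option values are placeholders, and sc is assumed to be an existing SparkContext); via SparkConf the same keys would carry the spark.mongodb.input. prefix:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/test.myCollection",  // placeholder URI
  "partitioner" -> "MongoSamplePartitioner",
  "partitionerOptions.partitionKey" -> "_id",
  "partitionerOptions.partitionSizeMB" -> "32"))     // smaller partitions than the 64 MB default

val rdd = MongoSpark.load(sc, readConfig)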

uri Configuration Setting

You can set all of the input configuration options via the input uri setting.

For example, the following sets the input uri setting via SparkConf:

Note

If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input..

spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred

The configuration corresponds to the following separate configuration settings:

spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred

If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:

spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar

Output Configuration

The following options for writing to MongoDB are available:

Note

If setting these connector output configurations via SparkConf, prefix these settings with: spark.mongodb.output..

Property name
Description
uri

Required. The connection string of the form mongodb://host:port/ where host can be a hostname, IP address, or UNIX domain socket. If :port is unspecified, the connection uses the default MongoDB port 27017.

Note

The other remaining options may be appended to the uri setting. See uri Configuration Setting.

database
Required. The database name to write data to.
collection
Required. The collection name to write data to.
extendedBsonTypes

Enables extended BSON types when writing data to MongoDB.

Default: true

localThreshold

The threshold (milliseconds) for choosing a server from multiple MongoDB servers.

Default: 15 ms

replaceDocument

Replace the whole document when saving Datasets that contain an _id field. If false, the connector updates only the fields in the document that match the fields in the Dataset.

Default: true

maxBatchSize

The maximum batch size for bulk operations when saving data.

Default: 512

writeConcern.w

The write concern w value.

Default: w: 1

writeConcern.journal
The write concern journal value.
writeConcern.wTimeoutMS
The write concern wTimeout value.
shardKey

The field by which to split the collection data. The field should be indexed and contain unique values.

Default: _id

forceInsert

Forces saves to use inserts, even if a Dataset contains _id.

Default: false

ordered

Sets the bulk operations ordered property.

Default: true
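As a sketch of overriding a few of these settings for a single save (the collection name and values are placeholders, documents is assumed to be an RDD of org.bson.Document, and sc an existing SparkContext):

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

// WriteConfig(sc) picks up the spark.mongodb.output.* defaults from SparkConf;
// the Map entries override only those settings for this save.
val writeConfig = WriteConfig(
  Map("collection" -> "otherCollection", "writeConcern.w" -> "majority", "replaceDocument" -> "false"),
  Some(WriteConfig(sc)))

MongoSpark.save(documents, writeConfig)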

uri Configuration Setting

You can set all of the output configuration options via the output uri setting.

For example, the following sets the output uri setting via SparkConf:

Note

If configuring the MongoDB Spark output settings via SparkConf, prefix the setting with spark.mongodb.output..

spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection

The configuration corresponds to the following separate configuration settings:

spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection

If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:

spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar

Cache Configuration

The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.

Important

As the cache is set up before the Spark configuration is available, the cache can only be configured via a system property.

System Property name
Description
mongodb.keep_alive_ms

The length of time (in milliseconds) to keep a MongoClient available for sharing.

Default: 5000
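A minimal sketch of setting the property in Scala, assuming it runs in the driver before any MongoDB Spark Connector classes are loaded; on executors the property would instead need to be passed as a JVM option (for example through spark.executor.extraJavaOptions):

// Keep pooled MongoClients alive for 10 seconds instead of the 5000 ms default.
System.setProperty("mongodb.keep_alive_ms", "10000")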
