Configuration Options
Various configuration options are available for the MongoDB Spark Connector.
Specify Configuration
Via SparkConf
You can specify these options via SparkConf using the --conf setting or the $SPARK_HOME/conf/spark-defaults.conf file, and the MongoDB Spark Connector will use the settings in SparkConf as the defaults.
Important
When setting configurations via SparkConf, you must prefix the configuration options. Refer to the configuration sections for the specific prefix.
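For example, a minimal sketch of setting prefixed input and output options when building a SparkSession (the connection strings and collection names are placeholders); the same keys can equally be passed with --conf or placed one per line in $SPARK_HOME/conf/spark-defaults.conf:
import org.apache.spark.sql.SparkSession

// Prefixed connector options become part of the SparkConf defaults
val spark = SparkSession.builder()
  .appName("MongoSparkConnectorExample")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/test.myCollection")
  .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/test.myCollection")
  .getOrCreate()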
Via ReadConfig and WriteConfig
Various methods in the MongoDB Connector API accept an optional ReadConfig or a WriteConfig object. ReadConfig and WriteConfig settings override any corresponding settings in SparkConf.
For examples, see Using a ReadConfig and Using a WriteConfig. For more details, refer to the source for these methods.
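As an illustrative sketch (assuming an existing SparkContext named sc; the collection name and option values are placeholders), a ReadConfig or WriteConfig built from an options Map and the existing defaults can be passed to the connector's load and save helpers to override the SparkConf settings for a single operation:
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.{ReadConfig, WriteConfig}

// Override the collection and read preference for this read only
val readConfig = ReadConfig(
  Map("collection" -> "spark", "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sc)))
val customRdd = MongoSpark.load(sc, readConfig)

// Override the collection and write concern for this save only
val writeConfig = WriteConfig(
  Map("collection" -> "spark", "writeConcern.w" -> "majority"),
  Some(WriteConfig(sc)))
MongoSpark.save(customRdd, writeConfig)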
Via Options Map
In the Spark API, some methods (e.g. DataFrameReader and DataFrameWriter) accept options in the form of a Map[String, String].
You can convert custom ReadConfig or WriteConfig settings into a Map via the asOptions() method.
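For instance, a short sketch (assuming an existing SparkSession named spark; the connection string, database, and collection names are placeholders) of converting a ReadConfig into an options Map for a DataFrameReader:
import com.mongodb.spark.config.ReadConfig

val readConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/",
  "database" -> "test",
  "collection" -> "myCollection"))

// asOptions() yields a Map[String, String] that the DataFrameReader accepts
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .options(readConfig.asOptions)
  .load()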
Via System Property
The connector provides a cache for MongoClients which can only be configured via a System Property. See Cache Configuration.
Input Configuration
The following options for reading from MongoDB are available:
Note
If setting these connector input configurations via SparkConf, prefix these settings with spark.mongodb.input..
Property name | Description |
---|---|
uri | Required. The connection string of the form mongodb://host:port/. If the connection string doesn't specify a port, the default MongoDB port 27017 is used. The other remaining input options may be appended to the uri setting. |
database | Required. The database name from which to read data. |
collection | Required. The collection name from which to read data. |
batchSize | Size of the internal batches used within the cursor. |
localThreshold | The threshold (in milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
readPreference.name | The Read Preference to use. Default: Primary |
readPreference.tagSets | The ReadPreference TagSets to use. |
readConcern.level | The Read Concern level to use. |
sampleSize | The sample size to use when inferring the schema. Default: 1000 |
samplePoolSize | The sample pool size, used to limit the results from which to sample data. Default: 10000 |
partitioner | The class name of the partitioner to use to partition the data. The connector provides the following partitioners: MongoDefaultPartitioner, MongoSamplePartitioner, MongoShardedPartitioner, MongoSplitVectorPartitioner, MongoPaginateByCountPartitioner, and MongoPaginateBySizePartitioner. In addition to the provided partitioners, you can also specify a custom partitioner implementation; for custom implementations of the MongoPartitioner trait, provide the full class name. To configure options for the various partitioners, see Partitioner Configuration. Default: MongoDefaultPartitioner |
registerSQLHelperFunctions | Register helper methods for unsupported MongoDB data types. Default: false |
sql.inferschema.mapTypes.enabled | Enable or disable the use of MapType when inferring the schema. Default: true |
sql.inferschema.mapTypes.minimumKeys | The minimum number of keys a StructType must contain before the connector infers a MapType. Default: 250 |
hint | The JSON representation of a hint document. |
collation | The JSON representation of a collation. Used when querying MongoDB. |
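As a brief sketch, several of these input options can also be passed directly to a DataFrameReader as unprefixed options (assuming an existing SparkSession named spark; the connection string and names are placeholders):
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/")
  .option("database", "test")
  .option("collection", "myCollection")
  .option("readPreference.name", "secondaryPreferred")
  .load()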
Partitioner Configuration
MongoSamplePartitioner Configuration
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
samplesPerPartition | The number of sample documents to take for each partition. Default: 10 |
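As an illustrative sketch, these partitioner options can also be supplied through a ReadConfig using the partitionerOptions. prefix (assuming the imports and SparkContext sc from the earlier examples; the values shown are placeholders). The same pattern applies to the other partitioners described below:
val samplePartitionedConfig = ReadConfig(Map(
  "uri" -> "mongodb://127.0.0.1/test.myCollection",
  "partitioner" -> "MongoSamplePartitioner",
  "partitionerOptions.partitionKey" -> "_id",
  "partitionerOptions.partitionSizeMB" -> "32"))
val rdd = MongoSpark.load(sc, samplePartitionedConfig)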
MongoShardedPartitioner Configuration
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..
Property name | Description |
---|---|
shardkey | The field by which to split the collection data. The field should be indexed. Default: _id. Important: This property is not compatible with hashed shard keys. |
MongoSplitVectorPartitioner Configuration
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
MongoPaginateByCountPartitioner Configuration
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
numberOfPartitions | The number of partitions to create. Default: 64 |
MongoPaginateBySizePartitioner Configuration
Note
If setting these connector configurations via SparkConf, prefix these configuration settings with spark.mongodb.input.partitionerOptions..
Property name | Description |
---|---|
partitionKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
partitionSizeMB | The size (in MB) for each partition. Default: 64 MB |
uri Configuration Setting
You can set all Input Configuration via the input uri setting.
For example, the following example sets the input uri setting via SparkConf:
Note
If configuring the MongoDB Spark input settings via SparkConf, prefix the setting with spark.mongodb.input..
spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
The configuration corresponds to the following separate configuration settings:
spark.mongodb.input.uri=mongodb://127.0.0.1/
spark.mongodb.input.database=databaseName
spark.mongodb.input.collection=collectionName
spark.mongodb.input.readPreference.name=primaryPreferred
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the input database for the connection is foobar:
spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
spark.mongodb.input.database=bar
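With these SparkConf settings in place, a short sketch of loading the configured collection (assuming an existing SparkSession named spark):
import com.mongodb.spark.MongoSpark

// Uses the spark.mongodb.input.* settings from SparkConf
val df = MongoSpark.load(spark)
df.printSchema()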
Output Configuration
The following options for writing to MongoDB are available:
Note
If setting these connector output configurations via SparkConf, prefix these settings with spark.mongodb.output..
Property name | Description |
---|---|
uri | Required. The connection string of the form mongodb://host:port/. If the connection string doesn't specify a port, the default MongoDB port 27017 is used. Note: The other remaining output options may be appended to the uri setting. |
database | Required. The database name to write data to. |
collection | Required. The collection name to write data to. |
extendedBsonTypes | Enables extended BSON types when writing data to MongoDB. Default: true |
localThreshold | The threshold (milliseconds) for choosing a server from multiple MongoDB servers. Default: 15 ms |
replaceDocument | Replace the whole document when saving Datasets that contain an _id field. If false, only the fields in the Dataset are updated. Default: true |
maxBatchSize | The maximum batch size for bulk operations when saving data. Default: 512 |
writeConcern.w | The write concern w value. Default: 1 |
writeConcern.journal | The write concern journal value. |
writeConcern.wTimeoutMS | The write concern wTimeout value. |
shardKey | The field by which to split the collection data. The field should be indexed and contain unique values. Default: _id |
forceInsert | Forces saves to use inserts, even if a Dataset contains an _id field. Default: false |
ordered | Sets the bulk operations ordered property. Default: true |
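As a brief sketch, several of these output options can also be passed directly to a DataFrameWriter as unprefixed options (assuming df is an existing DataFrame; the connection string and names are placeholders):
df.write
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/")
  .option("database", "test")
  .option("collection", "myCollection")
  .option("replaceDocument", "false")
  .mode("append")
  .save()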
uri Configuration Setting
You can set all Output Configuration via the output uri setting.
For example, the following example sets the output uri setting via SparkConf:
Note
If configuring the MongoDB Spark output settings via SparkConf, prefix the setting with spark.mongodb.output..
spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection
The configuration corresponds to the following separate configuration settings:
spark.mongodb.output.uri=mongodb://127.0.0.1/
spark.mongodb.output.database=test
spark.mongodb.output.collection=myCollection
If you specify a setting both in the uri and in a separate configuration, the uri setting overrides the separate setting. For example, given the following configuration, the output database for the connection is foobar:
spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
spark.mongodb.output.database=bar
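With these SparkConf settings in place, a short sketch of saving a DataFrame to the configured collection (assuming df is an existing DataFrame):
import com.mongodb.spark.MongoSpark

// Uses the spark.mongodb.output.* settings from SparkConf
MongoSpark.save(df.write.mode("append"))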
Cache Configuration
The MongoConnector includes a cache for MongoClients, so workers can share the MongoClient across threads.
Important
As the cache is set up before the Spark Configuration is available, the cache can only be configured via a System Property.
System Property name | Description |
---|---|
mongodb.keep_alive_ms | The length of time (in milliseconds) to keep a MongoClient available for sharing. Default: 5000 |
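Because this is a JVM system property rather than a Spark configuration, it is typically passed as a Java option to both the driver and the executors. A hedged sketch with spark-submit (the value shown is a placeholder):
./bin/spark-submit --driver-java-options "-Dmongodb.keep_alive_ms=10000" --conf "spark.executor.extraJavaOptions=-Dmongodb.keep_alive_ms=10000" ...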