FAQ

For any MongoDB deployment, the Spark Connector sets the preferred location for a DataFrame or Dataset to be where the data is:

  • For a nonsharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.

  • For a sharded system, it sets the preferred location to be the hostname(s) of the shards.

To promote data locality, colocate your Spark worker nodes with these preferred locations (your mongod or mongos hosts) and use a nearest read preference so that each task reads from the closest MongoDB node.
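
If your Spark workers run on the same hosts as your MongoDB nodes, a nearest read preference lets each task read from the closest member. The following Scala sketch sets the read preference through the connection string; it assumes the 10.x connector's "mongodb" data source and option names, and the hostnames, database, and collection are placeholders:

import org.apache.spark.sql.SparkSession

// Minimal sketch: read with a nearest read preference so that tasks scheduled on
// colocated Spark workers read from the local MongoDB member.
val spark = SparkSession.builder()
  .appName("data-locality-example")
  .getOrCreate()

val df = spark.read
  .format("mongodb") // 10.x connector data source name
  .option("connection.uri", "mongodb://host1:27017,host2:27017/?readPreference=nearest")
  .option("database", "test")             // placeholder database
  .option("collection", "myCollection")   // placeholder collection
  .load()

df.show()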

In MongoDB deployments with mixed versions of mongod, it is possible to get an Unrecognized pipeline stage name: '$sample' error. This can happen because the default partitioner and schema inference both use the $sample aggregation stage, which older versions of mongod don't support. To mitigate this situation, explicitly configure which partitioner to use and explicitly define the schema when working with DataFrames.
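
For example, the following Scala sketch sets the partitioner explicitly and supplies a schema so the connector does not have to sample the collection. It assumes the 10.x connector and its PaginateBySizePartitioner; with the 3.x connector the option values differ (for example, MongoShardedPartitioner), and the connection string, database, collection, and field names are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .appName("explicit-partitioner-example")
  .getOrCreate()

// Supplying the schema up front skips schema inference, which samples the collection.
val schema = StructType(Seq(
  StructField("_id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("qty", IntegerType, nullable = true)
))

val df = spark.read
  .format("mongodb")
  .option("connection.uri", "mongodb://host1:27017")
  .option("database", "test")
  .option("collection", "myCollection")
  // Choose a partitioner that does not rely on $sample; the default SamplePartitioner does.
  .option("partitioner", "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner")
  .schema(schema)
  .load()

df.show()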

To use mTLS, include the following options when you run spark-submit. Pass all of the system properties in a single --driver-java-options value and a single spark.executor.extraJavaOptions value; if you repeat either flag, only the last value takes effect:

--driver-java-options "-Djavax.net.ssl.trustStore=<path to your truststore.jks file> \
    -Djavax.net.ssl.trustStorePassword=<your truststore password> \
    -Djavax.net.ssl.keyStore=<path to your keystore.jks file> \
    -Djavax.net.ssl.keyStorePassword=<your keystore password>" \
--conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<path to your truststore.jks file> \
    -Djavax.net.ssl.trustStorePassword=<your truststore password> \
    -Djavax.net.ssl.keyStore=<path to your keystore.jks file> \
    -Djavax.net.ssl.keyStorePassword=<your keystore password>" \

The MongoConnector includes a cache that lets workers share a single MongoClient across threads. To specify the length of time to keep a MongoClient available, include the mongodb.keep_alive_ms option when you run spark-submit:

--driver-java-options -Dmongodb.keep_alive_ms=<number of milliseconds to keep MongoClient available>

By default, this property has a value of 5000.

Note

Because the cache is set up before the Spark Configuration is available, you must use a system property to configure it.
