FAQ
How can I achieve data locality?
For any MongoDB deployment, the Spark Connector sets the preferred location for a DataFrame or Dataset to be where the data is.
For a nonsharded system, it sets the preferred location to be the hostname(s) of the standalone or the replica set.
For a sharded system, it sets the preferred location to be the hostname(s) of the shards.
To promote data locality, we recommend taking the following actions:
Ensure there is a Spark Worker on one of the hosts for nonsharded system or one per shard for sharded systems.
Use a
nearest
read preference to read from the localmongod
.For a sharded cluster, have a
mongos
on the same nodes and use thelocalThreshold
configuration setting to connect to the nearestmongos
. To partition the data by shard use theShardedPartitioner
Configuration.
How do I resolve Unrecognized pipeline stage name
Error?
In MongoDB deployments with mixed versions of mongod
, it is
possible to get an Unrecognized pipeline stage name: '$sample'
error. To mitigate this situation, explicitly configure the partitioner
to use and define the schema when using DataFrames.
How can I use mTLS for authentication?
To use mTLS, include the following options when you run spark-submit
:
--driver-java-options -Djavax.net.ssl.trustStore=<path to your truststore.jks file> \ --driver-java-options -Djavax.net.ssl.trustStorePassword=<your truststore password> \ --driver-java-options -Djavax.net.ssl.keyStore=<path to your keystore.jks file> \ --driver-java-options -Djavax.net.ssl.keyStorePassword=<your keystore password> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=<path to your truststore.jks file> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStorePassword=<your truststore password> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.keyStore=<path to your keystore.jks file> \ --conf spark.executor.extraJavaOptions=-Djavax.net.ssl.keyStorePassword=<your keystore password> \
How can I share a MongoClient instance across threads?
The MongoConnector includes a cache that lets workers
share a single MongoClient
across threads. To specify the length of time to keep a
MongoClient
available, include the mongodb.keep_alive_ms
option when you run
spark-submit
:
--driver-java-options -Dmongodb.keep_alive_ms=<number of milliseconds to keep MongoClient available>
By default, this property has a value of 5000
.
Note
Because the cache is set up before the Spark Configuration is available, you must use a system property to configure it.