Filters
When you use filters with DataFrames or Datasets, the MongoDB Spark Connector constructs an aggregation pipeline that filters the data in MongoDB before sending it to Spark. This improves Spark performance because only the data you need is retrieved and processed.
The MongoDB Spark Connector turns the following filters into aggregation pipeline stages:
- And
- EqualNullSafe
- EqualTo
- GreaterThan
- GreaterThanOrEqual
- In
- IsNull
- LessThan
- LessThanOrEqual
- Not
- Or
- StringContains
- StringEndsWith
- StringStartsWith
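To make the idea of turning a filter into a pipeline stage more concrete, here is a minimal sketch in plain Python. It is not the connector's source code. The mapping table and the helper names `to_match_stage` and `and_stages` are hypothetical, and the assumption is that a comparison filter ends up as a `$match` stage that uses the corresponding MongoDB query operator. The field name `qty` is only an example.

```python
# Hypothetical mapping from Spark filter names to MongoDB query operators.
# Illustration only; the connector's real translation logic may differ.
COMPARISON_OPERATORS = {
    "EqualTo": "$eq",
    "GreaterThan": "$gt",
    "GreaterThanOrEqual": "$gte",
    "LessThan": "$lt",
    "LessThanOrEqual": "$lte",
    "In": "$in",
}


def to_match_stage(filter_name, field, value):
    """Build a $match stage for a single comparison filter."""
    operator = COMPARISON_OPERATORS[filter_name]
    return {"$match": {field: {operator: value}}}


def and_stages(*stages):
    """Combine several $match stages into one stage using $and."""
    return {"$match": {"$and": [stage["$match"] for stage in stages]}}


stage = to_match_stage("GreaterThanOrEqual", "qty", 10)
print(stage)  # {'$match': {'qty': {'$gte': 10}}}
```

Under this assumption, an And filter over two comparisons would simply wrap both conditions in a single `$and` array, which is what `and_stages` models.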
Use filter() to read a subset of data from your MongoDB collection.
Consider a collection named fruit that contains the following documents:
{ "_id" : 1, "type" : "apple", "qty" : 5 }
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "banana", "qty" : 15 }
First, set up a DataFrame to connect with your default MongoDB data source:
df = spark.read.format("mongodb").load()
The following example includes only records in which the qty field is greater than or equal to 10.
df.filter(df['qty'] >= 10).show()
The operation prints the following output:
+---+----+------+
|_id| qty|  type|
+---+----+------+
|2.0|10.0|orange|
|3.0|15.0|banana|
+---+----+------+
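The result can be cross-checked without a Spark session or a MongoDB deployment. The sketch below evaluates the same condition against the three sample documents in plain Python. The shape of `match_stage` is an assumption about what the pushed-down stage might look like, and `apply_gte_match` is a hypothetical helper that handles only a single-field `$gte` condition.

```python
# The sample documents from the fruit collection.
fruit = [
    {"_id": 1, "type": "apple", "qty": 5},
    {"_id": 2, "type": "orange", "qty": 10},
    {"_id": 3, "type": "banana", "qty": 15},
]

# Assumed shape of the stage generated for df['qty'] >= 10.
match_stage = {"$match": {"qty": {"$gte": 10}}}


def apply_gte_match(documents, stage):
    """Evaluate a single-field $gte match in memory (illustration only)."""
    field, condition = next(iter(stage["$match"].items()))
    threshold = condition["$gte"]
    return [doc for doc in documents if doc[field] >= threshold]


matched = apply_gte_match(fruit, match_stage)
print([doc["type"] for doc in matched])  # ['orange', 'banana']
```

Only the orange and banana documents satisfy the condition, which agrees with the two rows shown in the DataFrame output.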