MongoDB Connector for BI + Power BI: performance problems with very large collections

Hi all,

At my job, we’ve been evaluating the MongoDB Connector for BI for our visualization project. Unfortunately, we have run into some problems, especially in the interplay between the connector, Power BI, and the Power Query Editor. If we cannot resolve them, we will have to change our planned software stack and take the BI Connector out of the equation.

This concerns our largest collection:

  • ~34m documents and growing
  • ~34 GB uncompressed
  • ~1 kB average document size
  • 2 indexes: _id, and a “foreign key” (an ObjectId linking these documents (n-side) to a document (1-side) in a much smaller collection)
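
To illustrate the setup, here is a minimal sketch in Python (PyMongo); the database name, collection name, and the foreign-key field name “parent_id” are hypothetical placeholders for our real ones:

```python
from pymongo import MongoClient, ASCENDING

# Hypothetical names: "xyz_db", "xyz_collection", and "parent_id" stand in
# for our real database, collection, and foreign-key field.
client = MongoClient("mongodb://localhost:27017")
coll = client["xyz_db"]["xyz_collection"]

# _id is indexed automatically; the only additional index is the "foreign key"
# pointing at the much smaller superordinate collection (n:1).
coll.create_index([("parent_id", ASCENDING)], name="parent_id_1")
```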

After storing roughly 20m documents, performance degrades severely. Most of the time, a refresh of the Power BI reports still succeeds after 1 or 2 retries. We are planning to move to a higher Power BI tier anyway, where we would use incremental refresh partitions to avoid re-importing the whole database on every report refresh.

However, the really big problem is that the Power Query Editor stops working. Various error messages come up, but none give a real indication of what is wrong. Through the MySQL endpoint I usually get an error code very close to Int32.MinValue (-2147483648), if that means anything to anyone.

Things we tried:

  • Turning off all user-adjustable background tasks in Power BI
  • Creating a view that reduces the number of documents by a date field ($lookup from the superordinate entity, indexed field) >>> this was even slower/less stable for some reason; the logs say “cardinality altering” but give no real explanation, even though I do not actually alter the cardinality of the result (see the sketch after this list)
  • Making sure we always set the “supported row reduction clause” to LIMIT on our Power BI data source
  • Power Query column profiling based on the first 1000 rows (the default)
  • Power Query column profiling based on all rows (tried out of desperation)
  • Using the BI Connector’s MySQL endpoint instead of the MongoDB ODBC Driver
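
For context on the view mentioned in the second bullet, this is roughly how it is defined. The names “parent_collection”, “parent_id”, and “created_at” are hypothetical placeholders for the superordinate collection, the foreign-key field, and its indexed date field, and the cutoff date is just an example:

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["xyz_db"]

# Restrict the big collection by a date that lives on the superordinate
# (parent) document, hence the $lookup.
pipeline = [
    {"$lookup": {
        "from": "parent_collection",
        "localField": "parent_id",
        "foreignField": "_id",
        "as": "parent",
    }},
    # Every document has exactly one parent, so this should not change the
    # number of rows -- yet the BI Connector logs flag the pipeline as
    # "cardinality altering".
    {"$unwind": "$parent"},
    {"$match": {"parent.created_at": {"$gte": datetime(2022, 1, 1)}}},
    {"$project": {"parent": 0}},
]

# Register it as a read-only view on top of xyz_collection.
db.create_collection("xyz_recent_view", viewOn="xyz_collection", pipeline=pipeline)
```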

Things not yet tried:

  • Using a view based on a denormalized date field (just thought of it now; see the sketch after this list)
  • Throwing more CPU/RAM at the problem. The current host is admittedly under-specced, but one would hope that any machine can fetch 1000 rows for column profiling
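
And a minimal sketch of the not-yet-tried denormalized variant, under the same hypothetical names: copy the parent’s date onto each child document once, index it, and define the view as a plain $match so no $lookup is needed at query time:

```python
from datetime import datetime
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["xyz_db"]
coll = db["xyz_collection"]

# One-off backfill: copy the parent's date onto each child document.
for parent in db["parent_collection"].find({}, {"created_at": 1}):
    coll.update_many(
        {"parent_id": parent["_id"]},
        {"$set": {"parent_created_at": parent["created_at"]}},
    )

# Index the denormalized field and define the view as a plain $match,
# avoiding the $lookup entirely.
coll.create_index([("parent_created_at", ASCENDING)])
db.create_collection(
    "xyz_recent_view_denormalized",
    viewOn="xyz_collection",
    pipeline=[{"$match": {"parent_created_at": {"$gte": datetime(2022, 1, 1)}}}],
)
```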

So for now we persist an older subset of the data from xyz_collection into xyz_collection_archive and then delete those documents from the source collection. Hardly ideal, as we were hoping to visualize the data without any time limit.
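
For completeness, the current archiving workaround looks roughly like this; the cutoff and the use of the ObjectId’s embedded timestamp as the “older than” criterion are illustrative assumptions:

```python
from datetime import datetime, timezone
from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
coll = client["xyz_db"]["xyz_collection"]

# Illustrative cutoff: anything whose ObjectId was generated before 2021
# counts as "older" here.
cutoff_id = ObjectId.from_datetime(datetime(2021, 1, 1, tzinfo=timezone.utc))

# Copy the older documents into xyz_collection_archive...
coll.aggregate([
    {"$match": {"_id": {"$lt": cutoff_id}}},
    {"$merge": {"into": "xyz_collection_archive", "whenMatched": "keepExisting"}},
])

# ...and only then delete them from the source collection.
coll.delete_many({"_id": {"$lt": cutoff_id}})
```

Running the $merge before the delete at least keeps the archive a superset of whatever gets removed from the source collection.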