Escaping html chars in collection

Vinicius_Carvalho · November 26, 2022, 10:08pm

Hi folks, I have imported a large amount of data, and unfortunately it contains a ton of escaped chars such as &amp;amp; and &amp;lt;. The original dump produced these and I did not check before importing.

Is there any function that can be used to process all documents in the collection and unescape those back to regular characters?

Thanks

Jason_Tran · December 1, 2022, 1:51am

Hi @Vinicius_Carvalho,

I have only briefly tested this on a small collection of 3 sample documents. If you believe it may work for your use case / environment, I recommend testing it thoroughly on a test / duplicated copy of what you have imported to verify the results are correct.
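As a rough sketch of that kind of dry run, the snippet below builds a filter that matches documents still containing entity strings. The commented mongosh calls show how it might be used to copy the data and count affected documents. The field name `text` comes from the example in this thread. The names `entityPattern`, `affectedFilter` and the target collection `collection_copy` are hypothetical, and the list of entities is an assumption about what the dump contains.

```javascript
// Matches a handful of common HTML entities (assumed list; extend as needed).
const entityPattern = /&(amp|lt|gt|quot|#39);/;

// Query filter selecting documents whose `text` field still holds an entity.
const affectedFilter = { text: { $regex: entityPattern } };

// In mongosh, something along these lines:
//   db.collection.aggregate([{ $out: 'collection_copy' }])   // copy to a scratch collection
//   db.collection_copy.countDocuments(affectedFilter)        // how many documents need fixing
```

Running the count again after an update gives a quick check that nothing was missed.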

As an example, I have the following documents containing &amp;amp; and &amp;lt;:

DB> db.collection.find({},{_id:0,text:1})
[
  { text: '&amp;this is some&amp; text&lt;' },
  { text: 'text&lt;123' },
  { text: 'this &amp;text&lt; 123' }
]

Running the update below against this collection removes every &amp;amp; and &amp;lt; (each replacement is an empty string):

DB> db.collection.updateMany({},
[
  {
    '$set': {
      text: {
        '$replaceAll': { input: '$text', find: '&amp;', replacement: '' }
      }
    }
  },
  {
    '$set': {
      text: {
        '$replaceAll': { input: '$text', find: '&lt;', replacement: '' }
      }
    }
  }
])
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 3,
  modifiedCount: 3,
  upsertedCount: 0
}

Documents in the same collection after the above update:

DB> db.collection.find({},{_id:0,text:1})
[
  { text: 'this is some text' },
  { text: 'text123' },
  { text: 'this text 123' }
]
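
The update above strips the entity strings. If the goal is to unescape them back to the original characters, a variant along the following lines may be closer. This is a sketch, not something tested against your data: the helper name `buildUnescapePipeline` and the `entities` list are hypothetical, and `&gt;`, `&quot;` and `&#39;` are assumed extras that do not appear in the sample documents. `$replaceAll` requires MongoDB 4.4 or newer.

```javascript
// Entity-to-character pairs. '&amp;' is handled last so that text such as
// '&amp;lt;' becomes '&lt;' rather than being unescaped twice into '<'.
const entities = [
  ['&lt;', '<'],
  ['&gt;', '>'],
  ['&quot;', '"'],
  ['&#39;', "'"],
  ['&amp;', '&']
];

// Build one $set stage per entity for the given (top-level) field name.
function buildUnescapePipeline(field) {
  return entities.map(([find, replacement]) => ({
    $set: {
      [field]: {
        $replaceAll: { input: '$' + field, find: find, replacement: replacement }
      }
    }
  }));
}

const pipeline = buildUnescapePipeline('text');

// In mongosh, against a test copy of the collection:
//   db.collection.updateMany({ text: { $type: 'string' } }, pipeline)
```

The `{ text: { $type: 'string' } }` filter is there because `$replaceAll` returns null for a missing input and raises an error for a non-string one, so restricting the update to string values avoids both cases.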

If you find that this does not suit your use case, please provide the following:

- Sample documents
- MongoDB version in use
- What you have attempted so far
- Expected output

Hope this helps.

Regards,
Jason