Is this possible with Data Lake?

Hi everyone.
I need to understand whether Data Lake can be the right solution for my problem.
I have many unknown files in my cloud storage, usually unstructured text files. I need to import only the fields that have a specific characteristic.

For example:

aaaa@ccc.com:31423-54365-6775-road road-Name-Last name
fuvhdsfuhnvuids fodivbiued83495y834ythwnv svnfvnfdjndbjgd
and so on…

I am interested in aaaa@ccc.com, and then in the other fields on that line of the text file. Note that there isn’t a single delimiter.
But I don’t want to import the entire file into my MongoDB.
Is it possible, with Data Lake, to import only the “good lines”? That way I can have a lighter and faster DB.
Thank you.

Hey @Nicola_Ricci , unfortunately data lake would not be a good fit for this challenge. We only support the formats specified in the documentation. I know there are many tools that could help, but I think most of them would require that you create a custom parser where you are effectively defining your own format.

Welcome to the MongoDB Community @Nicola_Ricci !

Since you are referring to text files, I assume you are asking about using Atlas Data Federation (which was known as Atlas Data Lake prior to June 2022). Data Federation allows you to query supported data formats (JSON, CSV, Parquet, Avro, …) in cloud object storage (e.g. AWS S3) using the MongoDB Query Language.

Data Federation works with supported file formats directly and does not import that data into MongoDB. If you want to efficiently read a subset of data, it will be better to filter the source files before saving to cloud storage. You could also choose to partition your data to support your common query patterns. However, as @Benjamin_Flast mentioned, Data Federation would not be a good fit for your data format, which appears to require a custom parser.

If your goal is to import your data into a MongoDB deployment, you probably want to be looking at using mongoimport or a custom import script. A custom script would be more appropriate if you are using unsupported file formats or want to filter data during ingestion.
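As a rough illustration of the custom script approach, here is a minimal Python sketch. It assumes a "good line" is one that starts with an email address followed by a colon, as in the sample above; that rule, the field names `email` and `rest`, and the helper names `parse_good_lines` and `import_file` are all hypothetical and would need adjusting to the real data. The `collection` argument is assumed to be a PyMongo collection, whose `insert_many` method takes a list of documents.

```python
import re
from typing import Iterable, Iterator

# Assumption: a "good" line begins with an email address followed by ':'.
# Everything after the first ':' is kept as one unparsed string, because
# the sample data has no single consistent delimiter.
EMAIL_LINE = re.compile(r"^(?P<email>[^@\s:]+@[^@\s:]+\.[^@\s:]+):(?P<rest>.*)$")


def parse_good_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one document per line that matches the assumed pattern."""
    for line in lines:
        match = EMAIL_LINE.match(line.strip())
        if match:
            yield {"email": match.group("email"), "rest": match.group("rest")}


def import_file(path: str, collection) -> int:
    """Insert only the matching lines of a file; return how many were inserted.

    `collection` is assumed to be a PyMongo collection (hypothetical setup,
    e.g. MongoClient(uri)["mydb"]["contacts"]).
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        docs = list(parse_good_lines(handle))
    if docs:  # insert_many rejects an empty list, so skip it in that case
        collection.insert_many(docs)
    return len(docs)
```

Lines that do not match are skipped, so only the filtered documents reach the database. For very large files it would likely be better to insert in batches rather than building the whole list in memory.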

Regards,
Stennie