Handle non-UTF-8 characters in PyMongoArrow upon data fetch

Hi,

I am using the PyMongoArrow package to fetch data from MongoDB into pandas.

My source data collection contains a few values in another language (Hindi text, to be specific). Although I have no problem storing the data in MongoDB, I get pyarrow.lib.ArrowException: Unknown error: Wrapping PyArrow error when I fetch the data using PyMongoArrow’s aggregate_pandas_all method.

Below is the Python code I use to fetch data from MongoDB:

df: DataFrame = pymongoarrow.api.aggregate_pandas_all(**aggregate_params)

Below is the error I get when the above line executes:

error_message: Traceback (most recent call last):
  File "migration.py", line 250, in process_schemas
    df: DataFrame = pymongoarrow.api.aggregate_pandas_all(**aggregate_pandas_all_params)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pymongoarrow/api.py", line 201, in aggregate_pandas_all
    return _arrow_to_pandas(aggregate_arrow_all(collection, pipeline, schema=schema, **kwargs))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pymongoarrow/api.py", line 159, in _arrow_to_pandas
    return arrow_table.to_pandas(split_blocks=True, self_destruct=True)
  File "pyarrow/array.pxi", line 830, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3908, in pyarrow.lib.Table._to_pandas
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 820, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1170, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
  File "pyarrow/table.pxi", line 2594, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 138, in pyarrow.lib.check_status
pyarrow.lib.ArrowException: Unknown error: Wrapping ] w �� T/� � failed

Hi, thank you for raising this issue! Unfortunately, I am unable to replicate this error myself. Here is what I get when using aggregate_pandas_all with some Hindi Unicode characters:

                                    _id                                              hindi
0  b'd\x06q\x95\xd9j\xdf?\x87\x8c\x83O'  अआइईउऊऋएऐओऔव्यंजनकखगघङचछजझञाटठडढणतथदधनपफबभमयरल...
1  b'd\x06q\x95\xd9j\xdf?\x87\x8c\x83P'  अआइईउऊऋएऐओऔव्यंजनकखगघङचछजझञाटठडढणतथदधनपफबभमयरल...
2  b'd\x06q\x95\xd9j\xdf?\x87\x8c\x83Q'  अआइईउऊऋएऐओऔव्यंजनकखगघङचछजझञाटठडढणतथदधनपफबभमयरल...
...

Would it be possible for you to provide more details on exactly which Unicode characters are causing the failure, as well as exactly what you are passing as aggregate_params? I think that a malformed Unicode character may be causing this error. You can check by calling Python's str.encode method on the affected values. Furthermore, could you tell me which versions of PyMongo, Python, and PyMongoArrow you are using?
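As a rough illustration of the str.encode check, here is a minimal sketch. It assumes the documents are already available as plain Python dicts (for example, fetched with regular PyMongo rather than PyMongoArrow) and that only top-level string fields need checking. The helper name find_unencodable_strings and the sample documents are hypothetical, not part of any library.

```python
from typing import Iterable, List, Tuple


def find_unencodable_strings(docs: Iterable[dict]) -> List[Tuple[int, str, str]]:
    """Return (document index, field name, error text) for every top-level
    string value that cannot be encoded as UTF-8."""
    problems = []
    for index, doc in enumerate(docs):
        for field, value in doc.items():
            if not isinstance(value, str):
                continue  # only string values are checked in this sketch
            try:
                value.encode("utf-8")
            except UnicodeEncodeError as exc:
                problems.append((index, field, str(exc)))
    return problems


# Hypothetical sample data: the second value contains a lone surrogate,
# which is one kind of malformed character that UTF-8 cannot represent.
sample_docs = [
    {"hindi": "नमस्ते"},
    {"hindi": "bad \ud800 value"},
]

for index, field, error in find_unencodable_strings(sample_docs):
    print(f"document {index}, field {field!r}: {error}")
```

If this reports nothing for your data, the strings themselves are probably well-formed, and the aggregate_params and version details become more important for narrowing things down.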