As always, please run explain("executionStats") on the full aggregation and post the output here. Without seeing where the time is being spent, we would only be guessing where improvements could best be made.
Asya
P.S. If you are on 4.4 or later, the full explain output will show how much time is being spent in each stage of the aggregation.
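
For anyone unsure how to collect this, here is a minimal sketch rather than a definitive recipe. The collection name `orders`, the variable `pipeline`, and the helper `summarizeStages` are placeholders made up for illustration. The sketch assumes the 4.4+ output contains a top-level `stages` array whose entries carry `nReturned` and `executionTimeMillisEstimate`. That array can be missing, for example when the whole pipeline is pushed down into the query or when the output comes from a sharded cluster.

```javascript
// In the shell (placeholder collection and pipeline names):
//   const out = db.orders.explain("executionStats").aggregate(pipeline);
//   summarizeStages(out);

// Hypothetical helper: list each pipeline stage with the time the server reports for it.
function summarizeStages(explainOutput) {
  // Fall back to an empty list when there is no per-stage breakdown.
  const stages = explainOutput.stages || [];
  return stages.map((stage) => {
    // The stage name is the key that starts with "$", e.g. "$cursor" or "$group".
    const name = Object.keys(stage).find((key) => key.startsWith("$"));
    return {
      stage: name,
      // Number() also converts the shell's 64-bit integer wrappers to plain numbers.
      nReturned: Number(stage.nReturned),
      millis: Number(stage.executionTimeMillisEstimate),
    };
  });
}
```

I believe the per-stage estimates are cumulative, so a jump between one stage and the next points at the stage where the time is actually going.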