Suppose you have a variable that you know is constant within each group. What is the best way to preserve it during a collapse operation in Stata? You might think taking the first value (firstnm) must be the fastest, since in theory it requires only one step per group. If so, you are in for a surprise: Stata is actually faster at calculating the mean.
Here are the simulation results for 100 groups of 1000 randomly generated observations, averaged over 30 runs:
collapse mean 0.0443
collapse median 0.1062
collapse min 0.0844
collapse max 0.0657
collapse count 0.0456
collapse firstnm 0.0473
collapse lastnm 0.0464
The measurements are reported in seconds. The relative speeds are quite stable across variations in the number of groups and observations. Based on my analysis of the underlying algorithms collapse uses, the reason firstnm is slower than mean is that an order-preserving sort has to be performed on the data, and order-preserving sorts are slow relative to non-preserving ones. To check this, I ran the test with just one group of 100k observations, where sorting should matter little:
collapse mean 0.0614
collapse firstnm 0.0508
As expected, firstnm is now faster. The calculation of mean also slows down more than that of firstnm as the number of groups decreases.
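For anyone who wants to rerun the comparison, here is a minimal sketch of how such a timing loop could look. It is not the exact script behind the numbers above: the seed, the variable names (group, x) and the three locals are my own assumptions, and absolute timings will differ by machine and Stata version.

```stata
* Hypothetical benchmark sketch; run as a do-file so the locals stay in scope
local ngroups   100      // assumed: number of groups
local groupsize 1000     // assumed: observations per group
local runs      30       // assumed: repetitions per statistic

clear
set seed 12345
local nobs = `ngroups' * `groupsize'
set obs `nobs'
generate long group = ceil(_n / `groupsize')
generate double x = runiform()

foreach stat in mean median min max count firstnm lastnm {
    timer clear 1
    forvalues run = 1/`runs' {
        preserve                      // keep the full data for the next run
        timer on 1
        collapse (`stat') x, by(group)
        timer off 1                   // preserve/restore time is not counted
        restore
    }
    quietly timer list 1
    * r(t1) is the total time on timer 1, r(nt1) the number of on/off cycles
    display "collapse `stat'" _col(20) %6.4f r(t1)/r(nt1)
}
```

Setting ngroups to 1 and groupsize to 100000 gives the single-group setup, and changing the two locals is a quick way to probe other group counts and sizes.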
Based on my simulations, calculating the mean is faster with as few as 3 groups, so mean is the way to go in most cases.
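In practice, the pattern might look like the sketch below. The toy data and the names group, z and x are made up for illustration. The assert line is an optional safety check that z really is constant within each group; it adds a sort of its own, so leave it out when speed is all that matters. Note also that mean only works for numeric variables.

```stata
* Hypothetical toy data: z is constant within group, x varies
clear
set obs 6
generate group = ceil(_n / 2)
generate z = group * 10
generate x = _n

* Optional guard: stop with an error if z varies within any group
bysort group (z): assert z[1] == z[_N]

* Carry z through with mean instead of firstnm, alongside a real aggregate
collapse (mean) z (sum) x, by(group)
list
```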