RICI: Excel Musings – Sorting and Randomising
Welcome to another edition of R1C1 – Excel Musings. In this edition I want to share some thoughts on sorting lists of data, and its opposite, randomising the order of lists of data. Most people reading this are, I would expect, aware of the built in ‘Sort’ functionality in Excel, located in the Data tab of the ribbon. This is a useful stand-alone analytical tool, but it has two major shortcomings in my view. One is that it is a completely one-way operation. In using it, you overwrite the original ordering of your data, losing any record of it. If you begin at point A with a model including some raw data, wish to arrive at point B with a model that includes this data sorted and analysed, and need to keep an audit-trail of exactly how you moved from A to B for easy reproduction or independent review, then this discontinuity of overwriting your original data with a sorted list is not ideal. The second shortcoming is that if you need sorting functionality in a live model that is to be deployed to other end-users, then it is much more complicated for the end-users to use the model as a live tool if they need to manually call up the sorting process via the ribbon every time they change assumptions, compared with a model that could do it for them automatically. The second shortcoming can be mitigated via a macro that sorts, so all the user needs to do is click a button, but as it happens we can dispense with the macro, and sort lists exclusively using formulas. Doing so gets around both shortcomings of Excel’s built in sorting tool.
USING FORMULAS TO SORT LISTS OF NUMBERS
When sorting lists using formulas, we can use different approaches for lists of numbers and lists of text strings. Let’s start at looking at how to sort lists of numbers. This is actually fairly simple – all we need to do is make use of either the SMALL() function or the LARGE() function. Both of these functions take two arguments – the first argument is our list of data, and the second argument is a number, between 1 and n, where there are n elements in our list. The SMALL() function returns the nth smallest element in the list, and the LARGE function returns the nth largest element in the list. For example, let’s say cells A1:A10 each contain a number, and we wish to sort these in ascending order. We’ll use cells B1:B10 as some helper cells, and C1:C10 for our output. All we need do is put the numbers 1 through to 10 in B1:B10, and then in C1 type =SMALL($A$1:$A$10,$B1), and drag that formula down to cell C10. If we wish, we can even do it without the helper cells in column B, by using the ROWS() function as follows: =SMALL($A$1:$A$10,ROWS(A$1:A1)). Sorting in descending order can be done exactly the same way, using the LARGE function.
USING FORMULAS TO SORT LISTS OF TEXT
Unfortunately, the simplicity of using SMALL() or LARGE() won’t work if we wish to sort a list of text. As it turns out though, we can still build a formula based solution around the COUNTIF() function! This solution will however require the use of helper cells, unless we wish to use a multi-cell array formula. I will walk through the solution using the helper cells.
Let’s say cells A1:A10 contain our data, this time a list of 10 names that we wish to sort in alphabetical order. As per before, place the numbers 1 through 10 in cells B1:B10. In cell C1, write the formula =COUNTIF($A$1:$A$10,”<=”&$A1) and drag this down to C10. For each entry in our list, this tells us its relative rank in the list, by counting how many members of the list are less than or equal to the element at that row. This makes use of the fact that in Excel, the inequality operators < and > work not only on numbers but also on text, and base their less than or greater than assessments on alphabetical ordering. Quite an insight!
All that’s left now is to construct our final list of the sorted order. This involves an INDEX and a MATCH function. We can do this together in one column, but I will split it out to two columns here to better explain what is going on. In cell D1, we want to know which cell of our original list occurs earliest in the sort order. In cell D2, we want to know which cell occurs second in the sort order, and so on down to D10 where we want to know which cell of the original list is last in the sort order. To achieve this, in cell D1 write the formula =MATCH($B1,$C$1:$C$10,0) and drag it down to cell D10. What this is doing is telling us in cell D1 the relative position in the original list of the first ranking item – that is the item which caused the COUNTIF function to return 1. Finally, cells E1:E10 will contain our sorted list. In cell E1, write =INDEX($A$1:$A$10,$D1) and drag down to cell E10. That’s it!
As it happens, this approach will also work to sort a collection of numbers too, however the SMALL function is much, much simpler to use when sorting numbers. And if for some reason we are sorting numbers with this approach, then we can also use the RANK() function in place of the COUNTIF() function –eitherwill work.Butforsortingtext,only COUNTIF() will work.
BUT, BE WARNED! (OF COURSE IT COULDN’T BE THAT EASY, COULD IT?)
There are unfortunately a few caveats to be aware of with this COUNTIF/MATCH/ INDEX approach. It won’t work properly if (i) the data contains a mix of text cells and numeric cells; (ii) the data elements are not all unique (i.e. if there are any duplicates in the data); (iii) there are gaps of blank cells in the data. There are workarounds to each of these caveats, but explaining them is moving beyond the scope of this article.
WHAT ABOUT RANDOMISING?
Sometimes we may wish to go the other way, from a sorted list to a randomly ordered list. An example would be if we wanted to simulate a shuffled deck of cards in Excel for the purpose of some game analysis. Let’s say the numbers 101 through 152 represent our deck of cards, and our objective is to place the numbers 101 through 152 into a random order (the reason I am not using 1 through 52 is to make clearer the distinction between the elements we are sorting, and the list of 1 through 52 that we will use as helper cells). The basic approach is as follows:
- In cells A1:A52, place the numbers 1 through 52 (in order).
- In cells B1:B52, place 52 unique random numbers. (They can be any value, including decimals, with any upper and lower bound, and placed in any order. I suggest use =RAND() to first obtain them, and then copy/paste-as-values over the top of cells B1:B52 to end up with static random numbers.)
- In cells C1:C52, write a formula to return the numbers 1 through 52 as a randomly ordered list.
- In cells D1:D52, place our original list of the numbers 101 through 152.
- In cells E1:E52, write a formula to return our randomly shuffled deck of cards.
I’ll start with the formula for column C. What we want to do here is say “In row n of column C, what is the location of the nth smallest element of column B’s random numbers?” To achieve this, in C1 write =MATCH(SMALL($B$1 :$B$52,A1),$B$1:$B$52,0) and drag down to cell C52.
What this will do is place the numbers 1 through 52 in cells C1:C52, but in a random order. Let’s say for example that cell C1 returns the value 18, meaning the first- smallest element of column B is at cell B18. Now, in cell E1, write =INDEX($D$1:$D$52,$C1) and drag down to cell E52. Here, the formula at E1 is returning the 18th element from our sorted deck of cards at column D. Because all of the numbers 1 through 52 are contained in column C, but in a random order, every element from our deck of cards at column D will be returned in column E, also in a random order.
But what if we wanted to shuffle the deck again on cue – not every time we changed any cell, but frequently enough that we don’t want to rewrite the =RAND() formula in Column B and do the copy/paste again. Is there a way to achieve this? The answer is yes, depending on how relaxed we are about the definition of ‘random’.
ENTER THE LINEAR CONGRUENTIAL GENERATOR (LCG)
(Google it for more detailed information if you like.)
The idea here is that we choose a few different numbers for starting assumptions, and then use a simple arithmetic function to generate an ongoing sequence of numbers that behave like random numbers. A given set of starting assumptions and arithmetic function will always generate the same sequence of numbers. To get a new sequence, all we need to do is alter one of the starting assumption values. To recreate a previous sequence, simply go back to the earlier set of starting assumptions.
For example, pick three integer values and enter them at cells F1, F2 and F3. Let’s use F1 = 12345, F2 = 501 and F3 = 2^16. Using our earlier example, set cell B1 to be =MOD($F$2^3,$F$3). Then, in cell B2, write the formula =MOD($F$1*B1+$F$2,$F$3) and drag this down to cell B52. Bingo, pseudo- random numbers! How “random” these are depends on our choice of the 3 variables. As some general guidance, using 2^16 (or a larger power of 2) for F3, an odd number for F2 somewhere between 100 and 1000, and a number that is 1 more than a multiple of 4 for F1 will be a good start.
You can download a workbook containing examples of all the items mentioned in this article.
A quick note about the previous challenge on Defined Names: There is no book prize winner to announce. It seems that challenge wasn’t as popular as the previous challenge on array formulas! Finally, I’ll be back in 2015 continuing Excel Musings for the digital edition of the magazine. I’m putting the final touches on this article from New York, having just concluded co-organising the finals of the ModelOff Financial Modeling World Championships. My congratulations to institute members Peter Suen (Australia) and Michael Clarke (New Zealand) who shared 3rd place in the competition from over 3,000 competitors. Happy New Year to all!
CPD: Actuaries Institute Members can claim two CPD points for every hour of reading articles on Actuaries Digital.