Overview

minstrel is a mac80211 rate control algorithm ported over from MadWifi which supports multiple rate retries and claimed to be one of the best, if not the best, rate control algorithm.

History of minstrel

The minstrel rate control algorithm present on mac80211 was ported from MadWifi by Felix Fietkau. MadWifi's minstrel implementation was released on January 2005, originally designed and implemented by Derek Smithies Ph.D. Read minstrel's madwifi documentation.

Theory of operation

We defined the measure of successfulness (of packet transmission) as

                                  Mega bits transmitted
    Prob_success_transmission *  -----------------------
                                      elapsed time

This measure of successfulness will therefore adjust the transmission speed to get the maximum number of data bits through the radio interface. Further, it means that the 1 Mbps rate (which has a very high probability of successful transmission) will not be used in preference to the 11 Mbps rate.

We decided that the module should record the successfulness of all packets that are transmitted. From this data, the module has sufficient information to decide which packets are more successful than others. However, a variability element was required. We had to force the module to examine bit rates other than optimal. Consequently, some percent of the packets have to be sent at rates regarded as non optimal.

10 times a second (this frequency is alterable by changing the driver code) a timer fires, which evaluates the statistics table. EWMA calculations (described below) are used to process the success history of each rate. On completion of the calculation, a decision is made as to the rate which has the best throughput, second best throughput, and highest probability of success. This data is used for populating the retry chain during the next 100 ms.

As stated above, the minstrel algorithm collects statistics from all packet attempts. Minstrel spends a particular percentage of frames, doing "look around" i.e. randomly trying other rates, to gather statistics. The percentage of "look around" frames defaults to 10%. The distribution of lookaround frames is also randomized somewhat to avoid any potential "strobing" of lookaround between similar nodes.

TCP theory tells us that each packet sent must be delivered in under 26 ms. Any longer duration, and the TCP network layers will start to back off. A delay of 26 ms implies that there is congestion in the network, and that fewer packets should be injected to the device. Our conclusion was to adjust the retry chain of each packet so the retry chain was guaranteed to be finished in under 26 ms.

Retry Chain

Several devices have built in multirate retry chains. For example Atheros 802.11abg chipsets have four segments. Each segment is an advisement to the hardware to try to send the current packet at some rate, with a fixed number of retry attempts. Once the packet is successfully transmitted, the remainder of the retry chain is ignored. Selection of the number of retry attempts was based on the desire to get the packet out in under 26 ms, or fail.

There is some room for movement here - if the traffic is UDP then the limit of 26 ms for the retry chain length is "meaningless". However, one may argue that if the packet was not transmitted after some time period, it should fail. Further, one does expect UDP packets to fail in transmission. We leave it as an area for future improvement.

The (re)try segment chain is calculated in two possible manners. If this packet is a normal tranmission packet (90% of packets are this) then the retry count is best throughput, next best throughput, best probability, lowest baserate. If it is a sample packet (10% of packets are this), then the retry chain is random lookaround, best throughput, best probability, lowest base rate. In tabular format:

        Try | Lookaround rate    | Normal rate
        ------------------------------------------------
         1  | Random lookaround  | Best throughput
         2  | Best throughput    | Next best throughput
         3  | Best probability   | Best probability
         4  | Lowest Baserate    | Lowest Baserate

The retry count is adjusted so that the transmission time for that section of the retry chain is less than 26 ms.

We have adjusted the code so that the lowest rate is never used for the lookaround packet. Our view is that since this rate is used for management packets, this rate must be working. Alternatively, the link is set up with management packets, data packets are acknowledged with management packets. Should the lowest rate stop working, the link is going to die reasonably soon.

Analysis of information showed that the system was sampling too hard at some rates. For those rates that never work (54mb, 500m range) there is no point in sending 10 sample packets (< 6 ms time). Consequently, for the very very low probability rates, we sample at most twice.

The retry chain above does "work", but performance is suboptimal. The key problem being that when the link is good, too much time is spent sampling the slower rates. Thus, for two nodes adjacent to each other, the throughput between them was several Mbps below using a fixed rate. The view was that minstrel should not sample at the slower rates if the link is doing well. However, if the link deteriorates, minstrel should immediately sample at the lower rates.

Some time later, we realized that the only way to code this reliably was to use the retry chain as the method of determining if the slower rates are sampled. The retry chain was modified as:

Try |         Lookaround rate              | Normal rate
    | random < best    | random > best     |
--------------------------------------------------------------
 1  | Best throughput  | Random rate       | Best throughput
 2  | Random rate      | Best throughput   | Next best throughput
 3  | Best probability | Best probability  | Best probability
 4  | Lowest Baserate  | Lowest baserate   | Lowest baserate

With this retry chain, if the randomly selected rate is slower than the current best throughput, the randomly selected rate is placed second in the chain. If the link is not good, then there will be data collected at the randomly selected rate. Thus, if the best throughput rate is currently 54 Mbps, the only time slower rates are sampled is when a packet fails in transmission. Consequently, if the link is ideal, all packets will be sent at the full rate of 54 Mbps. Which is good.

EWMA

The EWMA calculation is carried out 10 times a second, and is run for each rate. This calculation has a smoothing effect, so that new results have a reasonable (but not large) influence on the selected rate. However, with time, a series of new results in some particular direction will predominate. Given this smoothing, we can use words like inertia to describe the EWMA.

By "new results", we mean the results collected in the just completed 100 ms interval. Old results are the EWMA scaling values from before the just completed 100 ms interval.

If no packets have been sent for a particular rate in a time interval, no calculation is carried out.

The appropriate update interval was selected on the basis of choosing a compromise between

collecting enough success/failure information to be meaningful
minimizing the amount of cpu time spent do the updates
providing a means to recover quickly enough from a bad rate selection.

The first two points are self explanatory. When there is a sudden change in the radio environment, an update interval of 100 ms will mean that the rates marked as optimal are very quickly marked as poor. Consequently, the sudden change in radio environment will mean that minstrel will very quickly switch to a better rate.

A sudden change in the transmission probabilities will happen when the node has not transmitted any data for a while, and during that time the environment has changed. On starting to transmit, the probability of success at each rate will be quite different. The driver must adapt as quickly as possible, so as to not upset the higher TCP network layers.