Partitioned Inflationary Rewards Distribution

Problem

With the increase of number of stake accounts, computing and redeeming the stake rewards at the start block of the epoch boundary becomes very expensive. Currently, with 550K stake accounts, the stake reward time has already taken more than 10 seconds. This prolonged computation slows down the network, and can cause large number of forks at the epoch boundary, which makes the matter even worse.

Proposed Solutions

Instead of computing and reward stake accounts at epoch boundary, we will decouple reward computation and reward credit into two phases.

A separate service, "EpochRewardCalculationService" will be created. The service will listen to a channel for any incoming rewards calculation requests, and perform the calculation for the rewards. For each block that cross the epoch boundary, the bank will send a request to the EpochRewardCalculationService. This marks the start of the reward computation phase.

N-1 -- N -- N+1
     \
      \
        N+2

In the above example, N is the start of the new epoch. Two rewards calculation requests will be sent out at slot N and slot N+2 because they both cross the epoch boundary and are on different forks. To avoid repeated computation with the same input, the signature of the computation requests, hash(epoch_number, hash(stake_accounts_data), hash(vote_accounts), hash(delegation_map)), are calculated. Duplicated computation requests will be discard. For the above example, if there are no stake/vote accounts changes between slot N and slot N+2, the 2nd computation request will be discarded.

When reaching block height N after the start of the reward computation phase, the bank starts the second phase - reward credit, in which, the bank first query the epoch calc service with the request signature to get the rewards result, which will be resented as a map from accounts_pubkey->rewards, then credit the rewards to the stake accounts for the next M blocks. If the rewards result is not available, the bank will wait until the results are available.

We call them:
(a) calculating interval: [epoch_start, epoch_start+N]
(b) credit interval: [epoch_start+N+1, epoch_start+N+M], respectively.
And the combined interval [epoch_start, epoch_start+N+M] is called rewarding interval.

For calculating interval, N is chosen to be sufficiently large so that the background computation should have completed and the result of the reward computation is available at the end of calculating interval. N can be fixed such as 100 (roughly equivalent to 50 seconds), or chosen as a function of the number of stake accounts, f(num_stake_accounts).

In credit interval, the bank will fetch the reward computation results from the background thread and start credit the rewards during the next M blocks. The idea is partition the accounts into M partitions. And each block, the bank credit 1/M accounts. The partition is required to be deterministic for the current epoch, but must also be random across different epochs. One way to achieve these properties is to hash the account's pubkey with some epoch dependent values, sort the results, and divide them into M bins. The epoch dependent value can be the epoch number, total rewards for the epoch, the leader pubkey for the epoch block, etc. M can be choses based on 50K account per block, which equal to ceil(num_stake_accounts/50,000).

num_stake_account is extracted from leader_schedule_epoch block, so we don't run into discrepancy where new transactions right before an epoch boundary creates one fork with X stake accounts and another fork with Y stake accounts.

In order to avoid putting extra burden of computing and credit the stake reward for blocks produced during the rewarding interval, we can reduce the compute budget limits on those blocks in rewarding interval, and reserve some computing and read/write capacity to perform stake rewarding.

Challenges

stake accounts reads/writes during the rewarding interval

epoch_start..epoch_start+N+M Because of the delayed credit of the rewards, Reads to those stake accounts will not return the value that the user are expecting (viz. not include the recent epoch stake rewards). Writes to those stake accounts will be lost once the reward are credited on block epoch_start+N+M. We will need to modify the runtime to restrict read/writes to stake accounts during the rewarding interval. Any transactions, which involves stake accounts, will result in a new execution error, i.e. "stake rewards pending, account access is restricted". However, normal rpc queries, such as 'getBalance', will return the current lamport of the account. The user can expect the rewards to be credit as some time point during the 'rewarding interval'.

voting during reward interval

During reward interval, vote transactions must be processed normally for achieving consensus and making progress for rooted blocks. However, those vote transactions may potentially change the vote accounts balance (i.e. pay for the voting transaction fee if vote_account and block reward recipient accounts are the same), before the epoch rewards are paid. When the epoch rewards are paid, those block rewards will be wiped out by the stale cached value. To prevent this, we will enforce that the vote_account and authorized_voter authority must be different.

snapshot taken during the rewarding interval

If a snapshot is taken during the rewarding interval, it would miss the rewards for the stake accounts. Any plain restart from those snapshots will be wrong, unless we reconstruct the rewards from the recent epoch boundary. This will add some complexity to validator restart. In the first implementation, we will force not taking any snapshot and not performing accounts hash calculation during the rewarding interval. Incremental snapshot request will be skipped. Full snapshot request will be re-queued be picked up later at the end of the reward interval.

In future, if needed, we can revisit to enable taking snapshots and perform hash calculation during reward interval.

account-db related action during the rewarding interval

Account-db related action such as flush, clean, squash, shrink etc. may touch and evict the stake accounts from account db's cache during the rewarding interval. This will slow down the credit in the future at bank epoch_start+N. We may need to exclude such accounts_db actions for stake_accounts during rewarding interval. This is going to be a performance tuning problem. In the first implementation, for simplicity, we will keep the account-db action as it is, and make the credit interval larger to accommodate the performance hit when writing back those accounts. In future, we can continue tuning account db actions during 'rewarding interval'.

view of total epoch capitalization change

The view of total epoch capitalization, instead of being available at every epoch boundary, is only available after the rewarding interval. Any third party application logic, which depends on total epoch capitalization, need to wait after rewarding interval.

getInflationReward JSONRPC API method call

Today, the getInflationReward JSONRPC API method call can simply grab the first block in the target epoch and lookup the target stake account's rewards entry. With these changes, the call will need updated to derive the target stake account's credit block, grab that block, then lookup rewards. Additionally we'll need to return more informative errors for queries made during the lockout period, so users can know that their rewards are pending for the target epoch. A new rpc API, i.e. getRewardInterval, will be added for querying the rewarding interval for the current epoch.

Problem​

Proposed Solutions​

Challenges​

Problem

Proposed Solutions

Challenges