Monotonic Estimation for Probability Distribution and Multivariate Risk Scales by Constrained Minimum Generalized Cross-Entropy

- Minimum cross-entropy estimation extends maximum likelihood estimation for multinomial probabilities. Given a probability distribution {r_i}_{i=1}^k, we show in this paper that the monotonic estimates {p_i}_{i=1}^k for the probability distribution by minimum cross-entropy are each given by the simple average of the given distribution values over some consecutive indices. The results extend to monotonic estimation for multivariate outcomes by generalized cross-entropy. These estimates are the exact solution to the corresponding constrained optimization problem and coincide with the monotonic estimates by least squares. A non-parametric algorithm for the exact solution is proposed. The algorithm is compared with the "pool adjacent violators" algorithm for the isotonic regression problem in the least squares case. Applications to monotonic estimation of migration matrices and of risk scales for multivariate outcomes are discussed.


I. INTRODUCTION
Utilizing prior knowledge is important for a learning process. A common prior is a monotone relationship between input and output. For example, we expect the loss on a loan to be lower when the collateral value and the quality of the collateral type are higher, and people tend to buy less of a product when its price increases. Examples of learning tasks where monotonic constraints are imposed include isotonic regression ([1], [2], [3], [4]), rating migration models ([5]), classification trees ([6]), rule learning ([7]), binning ([8]), and deep lattice networks ([9]).
The results in (1.8) are the exact solution to the constrained optimization problem corresponding to (1.6), and they are proved to be the least squares estimates subject only to (1.4) as well (see Proposition 3.4), which links to isotonic regression. That is, for the monotonic least squares estimates (1.7) is an implication, while for the monotonic generalized cross-entropy estimates it is a condition (i.e. a constraint).
A non-parametric algorithm (Algorithm 4.1) is proposed in section IV for finding the partition integers in (1.8), and hence the monotonic estimates. This algorithm is compared with the PAV algorithm in section V.
The key idea behind the proof of (1.8), and behind the algorithms proposed in this paper, is a reparameterization of the estimates under which (1.4) is automatically satisfied. Consequently, the constrained programming problem is transformed into a tractable unconstrained mathematical programming problem (see sections III and IV).
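To illustrate the kind of reparameterization meant here (a sketch only; the exact form used by the algorithms in section IV may differ), monotone estimates p_1 ≤ ⋯ ≤ p_k that sum to one can be generated from free parameters θ_1, …, θ_k by cumulative exponentials:

```latex
q_i = \sum_{j=1}^{i} e^{\theta_j},
\qquad
p_i = \frac{q_i}{\sum_{l=1}^{k} q_l},
\qquad i = 1, \dots, k .
```

Since each e^{θ_j} > 0, the q_i are strictly increasing, so the p_i are automatically positive, monotone, and sum to one; optimizing over the unconstrained θ_j then replaces the constrained programming in the p_i.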
The paper is organized as follows: partition integers are defined in section II, and equation (1.8) is proved in section III. We propose in section IV a non-parametric algorithm for finding these partition integers. In section V, we compare this non-parametric algorithm, in the least squares case, with the Pool Adjacent Violators algorithm for isotonic regression. Two examples are also provided in section V, where monotonic estimation for long-run rating migration matrices and loss rate time series is discussed.

That is, given k_{j−1}, the partition integer k_j is the largest index at which a(k_{j−1}+1, k) reaches its minimum at k = k_j within the remaining range. This contradicts the fact that k_1 is the largest index where a(1, k) reaches its minimum at k = k_1 for k ≥ k_{j−1}+1.

III. MONOTONIC ESTIMATION BY MINIMUM CROSS-ENTROPY
In this section, we prove equation (1.8), first for the minimum cross-entropy estimates subject to (1.4) and (1.5), then for the minimum generalized cross-entropy estimates subject to (1.4) and (1.7). At the end of the section, we show that these estimates are also the monotonic least squares estimates, in the absence of (1.7).

Proof. First, we show that the 1st statement implies the 2nd statement. The second statement in the lemma holds if r_i = p_i = 0 because, in this case, r'_i = 0 and p'_i = 0 for all such i. If p_i > 0, then the objective can be rewritten in terms of r'_i = r_i/s and p'_i = p_i/t, with H = −∑_{i=1}^k r_i log(p_i). By (1.7), {p'_i}_{i=1}^k sum to one. Since {r'_i}_{i=1}^k also sum to one, the statement follows from Gibbs' inequality: the cross-entropy −∑ r'_i log(p'_i) is minimized when p'_i = r'_i.

We now show the first statement. We consider the following three cases. Case (a). 0 < p_i < 1 for all 1 ≤ i ≤ k. Take the derivative of H with respect to p_i in the range 0 < p_i < 1 and set it to zero, using the relation p_k = 1 − (p_1 + p_2 + ⋯ + p_{k−1}). We obtain r_i/p_i − r_k/p_k = 0. This holds for all i. Thus the vector (p_1, p_2, …, p_k) is in proportion to (r_1, r_2, …, r_k). Because of (1.5), both sum to one, so we must have p_i = r_i.
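The computation in Case (a) can be written out as a standard Lagrange-multiplier argument (a sketch, assuming the constraint in (1.5) is the sum-to-one condition):

```latex
\mathcal{L}(p,\lambda)
  = -\sum_{i=1}^{k} r_i \log p_i
    + \lambda\Big(\sum_{i=1}^{k} p_i - 1\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p_i}
  = -\frac{r_i}{p_i} + \lambda = 0
  \;\Longrightarrow\; p_i = \frac{r_i}{\lambda}.
```

Summing over i and using ∑_i p_i = ∑_i r_i = 1 gives λ = 1, hence p_i = r_i: in the interior case, the cross-entropy is minimized by the given distribution itself.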

IV. ALGORITHMS FOR MONOTONIC ESTIMATION BY MINIMUM GENERALIZED CROSS-ENTROPY
In this section we propose algorithms for finding the monotonic estimates. First, we propose a non-parametric algorithm with time complexity O(n²) (in the number of values n) for finding the partition integers, and hence the exact solution for the monotonic estimates.

A. Isotonic Regression
Given real numbers {y_i}_{i=1}^n, the task of isotonic regression is to find {x_i}_{i=1}^n that minimize the weighted sum of squares ∑_{i=1}^n w_i (y_i − x_i)² subject to x_1 ≤ x_2 ≤ ⋯ ≤ x_n, where {w_i}_{i=1}^n are the given weights. When w_i is 1 and y_i takes value 0 or 1 for all i's, it is known ([14]) that the results for isotonic regression coincide with the maximum likelihood estimates subject to (1.4) for the Bernoulli log-likelihood ∑_{i=1}^n [y_i log(x_i) + (1 − y_i) log(1 − x_i)].
A unique exact solution to the isotonic regression problem exists and can be obtained by a non-parametric algorithm called Pool Adjacent Violators (PAV) ([1]). The basic idea, as described in [4], is the following: starting with y_1, we move to the right and stop at the first place where y_i > y_{i+1}. Since y_{i+1} violates the monotonic assumption, we pool y_i and y_{i+1}, replacing both with their weighted average y*_i = y*_{i+1} = (w_i y_i + w_{i+1} y_{i+1})/(w_i + w_{i+1}). We then move to the left to make sure that y_{i−1} ≤ y*_i; if not, we pool y_{i−1} with y*_i and y*_{i+1}, replacing all three with their weighted average. We continue to the left until the monotonic requirement is satisfied, then proceed again to the right (see [1], [2], [4], [8]). This algorithm finds the exact solution via forward and backward averaging. Another, parametric, algorithm, called the Active Set Method, approximates the solution using the Karush-Kuhn-Tucker (KKT, [15]) conditions for linearly constrained optimization ([2], [8]).
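The pooling steps described above can be sketched in Python (a minimal implementation for the non-decreasing case; the function name and block representation are our own):

```python
def pav(y, w=None):
    """Pool Adjacent Violators: least-squares fit of a non-decreasing
    sequence x to y, minimizing sum_i w_i * (y_i - x_i)**2."""
    if w is None:
        w = [1.0] * len(y)
    # Each block is [pooled value, total weight, number of points pooled].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Pool backwards while the newest block violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    # Expand the blocks back into a full-length fitted sequence.
    fit = []
    for v, _, n in blocks:
        fit.extend([v] * n)
    return fit
```

For example, `pav([3, 1, 2])` pools all three values into their average, returning `[2.0, 2.0, 2.0]`, while `pav([1, 3, 2])` pools only the last two values, returning `[1.0, 2.5, 2.5]`.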
For a given sample, the sum-squares-error in (3.11) can be rewritten as a sum of two terms, S = S_1 + S_2. Because S_1 does not depend on the parameters {p_i}_{i=1}^k, the estimates that minimize S subject to (3.12) are the same as the estimates that minimize S_2 subject to (3.12). Hence, the least squares estimates {p_i}_{i=1}^k of (3.11) subject to (3.12) are the solution to an isotonic regression problem with suitable weights. The PAV algorithm repeatedly searches both backward and forward for violators and takes an average whenever a violator is found. In contrast, Algorithm 4.1 determines explicitly the groups of consecutive indices by a forward search for partition integers; the average is then taken over each of these groups. For Algorithm 4.2, the constrained optimization is transformed into an unconstrained mathematical programming problem through a reparameterization. No KKT conditions or active set methods are used.
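The forward search can be sketched as follows (our own minimal rendering of the idea, not the paper's exact Algorithm 4.1; it relies on the characterization that each block ends at the largest index where the running average of the remaining values attains its minimum):

```python
def forward_partition(r):
    """Monotonic (non-decreasing) estimates by a forward search for
    partition integers: each block ends at the largest index at which
    the running average of the remaining values attains its minimum,
    and the fit on a block is the block average. O(n^2) worst case."""
    n, fit, start = len(r), [], 0
    while start < n:
        best_avg, best_end, total = float("inf"), start, 0.0
        for end in range(start, n):
            total += r[end]
            avg = total / (end - start + 1)
            if avg <= best_avg:  # '<=' keeps the LARGEST minimizing index
                best_avg, best_end = avg, end
        fit.extend([best_avg] * (best_end - start + 1))
        start = best_end + 1
    return fit
```

On `[3, 1, 2]` the first (and only) block is the whole sequence, giving `[2.0, 2.0, 2.0]`; on `[1, 3, 2]` the blocks are {1} and {2, 3}, giving `[1.0, 2.5, 2.5]` — the same fits PAV produces, found by a single forward pass per block.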

B. Monotonic Estimation of Risk Scales for Multivariate Outcomes
In this section, we show two examples of how the proposed algorithms can be used for monotonic estimation for a loss rate time series and a long-run migration matrix. Parametric methods for monotonic estimation of long-run migration matrices were discussed in [5].
In the first example, the loss on each loan in a portfolio is observed from the time the account is opened. The 1st row in Table 1 below shows the yearly (since account open date) loss rate for the portfolio of 25000 accounts (i.e. n = 25000). The rate is calculated as the ratio of the total loss amount in a year to the total initial balance at the open date for the portfolio. It is assumed that the loss rate decreases as loans survive through time.
The non-parametric algorithm (Algorithm 4.1, labelled "NPSM") is used, by reversing the time index, to obtain the monotonic least squares estimates for the 10 yearly rates. As a result, a simple average is taken over the cell groups {1, 2} and {8, 9, 10} respectively; for the other cells the rate is kept unchanged. Strictly monotonic least squares estimates are obtained by using Algorithm 4.2 (labelled "PSM"), where the parameter of the algorithm is chosen so that its exponential equals 1.05.
A parametric benchmark model in the time t since account opening is calibrated, with parameters estimated by least squares regression. This is a simplified version of the monotonic continuous yield curve model of Nelson and Siegel ([16], p. 483). We label this approach "NSSM".
As shown in the table, the non-parametric algorithm achieves the lowest sum of squared errors (labelled "SSE").
In the second example, the non-parametric algorithm is used to "smooth" the long-run average rating migration matrix for a portfolio with six non-default ratings. An entity is expected to be more likely to migrate to a closer non-default rating than to a faraway non-default rating, i.e. the following conditions are required for the i-th row of the long-run average migration matrix:

p_{i,i+1} ≥ p_{i,i+2} ≥ ⋯ ≥ p_{i,m},   (5.1)
p_{i,1} ≤ p_{i,2} ≤ ⋯ ≤ p_{i,i−1},   (5.2)

where p_{i,j} denotes the probability of migrating from non-default rating i to non-default rating j, conditional on migrating to a non-default rating, and m is the number of non-default ratings.
Smoothing a given migration matrix means modifying the matrix subject to (5.1) and (5.2) with minimum loss (cross-entropy). Table 2 below shows the sample long-run average rating migration matrix before smoothing, conditional on migrating to a non-default rating, calculated from a historical sample generated synthetically between 2007Q1 and 2017Q1 for a commercial portfolio. There are six non-default ratings. Three highlighted blocks violate (5.1) or (5.2).
Take, for example, the entries to the right of the diagonal in the first row of the matrix before smoothing: 0.0183, 0.0031, 0.0055, 0.0010, 0.0002. (5.3) We can treat these numbers as sample multinomial percentages by dividing each by the sum of the five numbers, applying Algorithm 4.1 to obtain the smoothed rates, and then multiplying back by that sum; equivalently, we can apply Algorithm 4.1 directly to (5.3) without normalization. The smoothed results are thus given by replacing the 2nd and 3rd numbers, 0.0031 and 0.0055, by their average, while keeping the others unchanged. Table 3 shows the migration matrix after smoothing.
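The row-smoothing step can be checked numerically. For a non-increasing fit, the same forward search applies with the minimum replaced by a maximum (equivalently, one can reverse the index order); a minimal sketch, applied directly to the unnormalized row (5.3):

```python
def smooth_decreasing(r):
    """Block-average a sequence into a non-increasing fit: each block
    ends at the largest index where the running average of the
    remaining values attains its maximum."""
    n, fit, start = len(r), [], 0
    while start < n:
        best_avg, best_end, total = float("-inf"), start, 0.0
        for end in range(start, n):
            total += r[end]
            avg = total / (end - start + 1)
            if avg >= best_avg:  # keep the largest maximizing index
                best_avg, best_end = avg, end
        fit.extend([best_avg] * (best_end - start + 1))
        start = best_end + 1
    return fit

row = [0.0183, 0.0031, 0.0055, 0.0010, 0.0002]  # the row (5.3)
smoothed = [round(v, 4) for v in smooth_decreasing(row)]
# Only the 2nd and 3rd entries are pooled, to (0.0031 + 0.0055) / 2 = 0.0043.
```

This reproduces the smoothing described above: `smoothed` equals `[0.0183, 0.0043, 0.0043, 0.001, 0.0002]`, with all other entries unchanged.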

VI. CONCLUSIONS AND FUTURE WORK
With the proposed non-parametric algorithm, the exact solution to the monotonic estimation of risk scales for multivariate outcomes becomes easier to obtain. No calculation of optimization gradients or Hessian matrices is required; only a data-driven machine learning process is needed.
One interesting subject for future research is the monotonic estimation of the survival probability of a loan over a risk-rated portfolio: a loan with a lower risk rating is expected to be more likely to survive. We will propose models and algorithms for the monotonic estimation of these survival probabilities.

ACKNOWLEDGMENT
The author thanks Carlos Lopez for his consistent input, insights, and support for this research.

Both the KL divergence and the cross-entropy can be viewed as functions of p, up to an additive constant (because r is fixed). Both take on their minimal values when p = r, which is 0 for the KL divergence and H(r) for the cross-entropy. Thus the cross-entropy measures the dissimilarity between the given distribution and the estimated distribution ([5], [9]).