利用者:Trunk5772/シェーンハーゲ-シュトラッセンアルゴリズム

シェーンハーゲ・シュトラッセン法は巨大な整数の乗算を漸近的に高速に実行するアルゴリズムである。 1971年にen:Arnold Schönhageとフォルカー・シュトラッセンによって開発された。 ^[1] 2つのn-桁の数の乗算に必要な実行時間のbit complexityはランダウの記号を用いると, $O{\Big (}n\cdot \log(n)\cdot \log {\big (}\log(n){\big )}{\Big )}$ となる。このアルゴリズムでは再帰的に 2ⁿ+1個の要素から成る環 (数学)についての高速フーリエ変換 (FFT;この場合Number theoretic transform, NNTともいう en:Discrete Fourier transform (general)#Number-theoretic transform)を用いる。

1971年の発見から2007年にいたるまで、知られている手法の中で漸近的に最も高速なものであった。 2007年により低い漸近的な複雑性を持つ、すなわち十分大きい整数に対してはより高速な手法である en:Fürer's algorithmが発見された。 ^[2] しかしながら、現状ではシェーンハーゲ・シュトラッセン法よりFürer's algorithmの方が高速になるのは天文学的な大きさの整数の場合であり、実用的には用いられていない。

実用的にシェーンハーゲ・シュトラッセン法がカラツバ法やen:Toom–Cook multiplication等の従来の手法を超える性能を発揮するのは当該の数が 2^2¹⁵あるいは2^2¹⁷(10進で10,000から40,000桁)程度より大きい場合である。^[3]^[4]^[5] ライブラリGNU Multi-Precision LibraryではCPUアーキテクチャーに応じて最小でも64-bitワードで1728から7808(10進で33,000から150,000桁)以上の大きさの数に対しシェーンハーゲ・シュトラッセン法を用いることにしている。^[6] あるJAVAによる実装ではシェーンハーゲ・シュトラッセン法を10進74,000桁より大きい数について用いることになっている。^[7]

シェーンハーゲ・シュトラッセン法の応用には数学的経験主義に基づくようなものXXX---例えばGIMPSや''π''の近似値の計算---もあれば, en:Kronecker substitutionのような実用的なものもある。 Kronecker substitutionとは整数係数の多項式の積を大きな整数の積を使って効率的に計算する方法であり、暗号分野で有用な整数の因数分解の方法XXXen:Lenstra elliptic curve factorizationの実装の一つであるGMP-ECMに用いられている。

詳細

本節ではシェーンハーゲ・シュトラッセン法の実際を詳細に解説する。説明は基本的にはCrandall and PomeranceによるPrime Numbers: A Computational Perspectiveの概説に基く。^[8] 説明するのは原論文の方法の変種であり、多少の改善が施されている。具体的にはthe discrete weighted transform to perform 負巡回畳み込み more efficiently. 他に詳細な情報が得られる文献としてはドナルド・クヌースのThe Art of Computer Programmingがある。^[9]

様々な畳み込み

2つの数、例えば123と456の乗算を考える。B進法を用いいわゆる筆算(long multiplication)を行なうが、一点、繰り上げ操作はしないことにしてみる。結果はこの様になる:

		1	2	3
	×	4	5	6

		6	12	18
	5	10	15
4	8	12

4	13	28	27	18

ここで最後にでてきた数の連なり(4, 13, 28, 27, 18)は元々の2つの数の連なり(1,2,3)と(4,5,6)の非巡回畳み込み(acyclic convolution)あるいは線型畳み込み(linear convolution) と呼ばれる。一度非巡回畳み込みが得られれば元々の数の積を計算するのは簡単で、繰り上げ計算をすればよい。例えば一番右の列については8を残して1を左隣の27に加えるのである。上の例について実際繰り上げをすれば当然ながら正しい積56088が得られる。

他にも役に立つ畳み込みが二種類ある。インプットの数の連なり(例では(1,2,3)と(4,5,6))が n個の数から成るとしよう(例ではnは3に等しい)。非巡回畳み込みはn+n−1個の数から成る。非巡回畳み込みの右端から数えてn個の要素と左端から数えてn−1個の要素を加えたものが巡回畳み込みである。例の場合だと巡回畳み込みは

	28	27	18
+		4	13

	28	31	31

このように(28,31,31)になる。

巡回畳み込みに繰り上げ計算をすると Bⁿ − 1を法として計算したい積に等しいものが得られる。例ではB=10, n=3であるのでBⁿ − 1 = 10³ − 1 = 999であり(28, 31, 31)に繰り上げ計算をすると3141が得られ、実際3141 ≡ 56088 (mod 999)となっている。

一方で非巡回畳み込みの右端から数えてn個の要素から左端から数えてn−1個の要素を 引き算することにより負巡回畳み込み(negacyclic convolution)が得られる。例の場合では

	28	27	18
−		4	13

	28	23	5

のように負巡回畳み込みは(28,23,5)となる。

負巡回畳み込みに繰り上げ計算をすると求めている積にBⁿ + 1を法として等しい結果になる。例の場合では10³ + 1 = 1001で、(28, 23, 5)に繰り上げをすると3035になり、実際3035 ≡ 56088 (mod 1001)が成り立っている。負巡回畳み込みは負の数を含むことがある。その場合は引き算の筆算(long subtraction)と同じ要領で繰り下げすればよい。

畳み込みの定理

他のFFTに基づく乗算のアルゴリズムと同じ様にシェーンハーゲ・シュトラッセン法もその基本は畳み込みの定理である。畳み込みの定理を使うことによって2つの数の連なりの巡回畳み込みを効率良く計算できるのである。すなわち定理によれば、 2つのベクトルの巡回畳み込み(cyclic convolution)は(1) それぞれのベクトル離散フーリエ変換 (Discrete Fourier Transform, DFT)を計算し(2) 得られた2つのベクトルを成分毎に乗じたベクトルを作り(3)それに逆離散フーリエ変換(Inverse Discrete Fourier Transform, IDFT)を施すことによって計算できる。記号的に表現すると

CyclicConvolution(X,Y)=IDFT(DFT(X) · DFT(Y))。

となる。 DFTとIDFTをFFT(高速フーリエ変換)を使って計算し, 変換されたベクトル DFT(X)とDFT(Y)の成分毎の積を今考えているアルゴリズムで再帰的に計算することにより巡回畳み込みを高速に計算するアルゴリズムが得られる。

この方法では負巡回畳み込みを計算するのがさらによいXXX。実は畳み込み定理を少し変えることによって負巡回畳み込みも計算できるようになるのである (参考: Discrete Weighted Transform, DWT, 離散荷重変換)。ベクトルXとYの成分の数をnとし, aをの1の2n乗根とする。(すなわちaを2n乗して初めて1となる数とする。) さらにウェイトベクトル(weight vector)と呼ばれるベクトル AとA^-1を

A =(a^j),0 ≤ j < n

A^-1 =(a^−j), 0 ≤ j < n

で定義する。これにより以下が成り立つ:

NegacyclicConvolution(X, Y)= A^-1 · IDFT(DFT(A · X) · DFT(A · Y))

すなわち、巡回畳み込みと殆ど同じであって、違いは初めにインプットとAを乗じまた最後の結果をA⁻¹と乗ずるのみである。

環の選び方

DFTはどの様な環についても考えられる抽象的な操作である。典型的には環として複素数を用いる。しかし、複素数の四則演算を使うと誤差もでやすく、また特に元々計算したい整数の積を正しく実行するために必要な精度を出そうとすると、計算時間がかかりすぎる。シェーンハーゲ・シュトラッセン法では複素数を使うかわりに、 NTTすなわちフーリエ変換をある整数Nを法とする整数上で行なう。

任意のnに対し複素平面上に1のn乗根(上で用いたa)が存在するのと同じように、任意のnに対してNを適当に選ぶことによりbをNを法とした整数の中での1のn次の原始羃根XXX above XXXとできる。XXX(すなわち bⁿ ≡ 1 (mod N),でありnより小さい指数mに対してはb^m≡ 1 (mod N) とならない。)

このアルゴリズムでは計算時間の殆どは再帰的により小さい数の積を取る作業に費される。素朴なバージョンのアルゴリズムではこのような積は様々な局面で実行される。

FFTアルゴリズムの中で, 原始羃根bの2乗、一般の羃乗、他の値との積が繰り返し計算される。
When taking powers of the primitive root of unity a to form the weight vector A and when multiplying A or A⁻¹ by other vectors.
When performing element-by-element multiplication of the transformed vectors.

The key insight to depends シェーンハーゲ・シュトラッセン法 is to choose N, the modulus, to be equal to 2ⁿ + 1 for some integer n that is a multiple of the number of pieces P=2^p being transformed. This has a number of benefits in standard systems that represent large integers in binary form:

Any value can be rapidly reduced modulo 2ⁿ + 1 using only shifts and adds, as explained in the next section.
All roots of unity in this ring can be written in the form 2^k; consequently we can multiply or divide any number by a root of unity using a shift, and power or square a root of unity by operating only on its exponent.
The element-by-element recursive multiplications of the transformed vectors can be performed using a 負巡回畳み込み, which is faster than an 非巡回畳み込み and already has "for free" the effect of reducing its result mod 2ⁿ + 1.

To make the recursive multiplications convenient, we will frame シェーンハーゲ・シュトラッセン法 as being a specialized multiplication アルゴリズム for computing not just the product of two numbers, but the product of two numbers mod 2ⁿ + 1 for some given n. This is not a loss of generality, since one can always choose n large enough so that the product mod 2ⁿ + 1 is simply the product.

シフト演算による最適化

In the course of the アルゴリズム, there are many cases in which multiplication or division by a power of two (including all roots of unity) can be profitably replaced by a small number of shifts and adds. This makes use of the observation that:

(2ⁿ)^k ≡ (-1)^k mod(2ⁿ +1)

Note that a k-digit number in base 2ⁿ written in positional notation can be expressed as $(d_{k-1},\dots ,d_{1},d_{0})$ . It represents the number $\sum _{i=0}^{k-1}d_{i}\cdot (2^{n})^{i}$ . Also note that for each $d_{i}$ , $0\leq d_{i}<2^{n}$ .

This makes it simple to reduce a number represented in binary mod 2ⁿ + 1: take the rightmost (least significant) n bits, subtract the next n bits, add the next n bits, and so on until the bits are exhausted. If the resulting value is still not between 0 and 2ⁿ, normalize it by adding or subtracting a multiple of the modulus 2ⁿ + 1. For example, if n=3 (and so the modulus is 2³+1 = 9) and the number being reduced is 656, we have:

656 = 1010010000₂ ≡ 000₂ − 010₂ + 010₂ − 1₂ = 0 − 2 + 2 − 1 = −1 ≡ 8 (mod 2³ + 1).

Moreover, it's possible to effect very large shifts without ever constructing the shifted result. Suppose we have a number A between 0 and 2ⁿ, and wish to multiply it by 2^k. Dividing k by n we find k = qn + r with r < n. It follows that:

A(2^k) = A(2^{qn + r}) = A[(2ⁿ)^q(2^r)] ≡ (−1)^q(A shift-left r) (mod 2ⁿ + 1).

Since A is ≤ 2ⁿ and r < n, A shift-left r has at most 2n−1 bits, and so only one shift and subtraction (followed by normalization) is needed.

Finally, to divide by 2^k, observe that squaring the first equivalence above yields:

2²ⁿ ≡ 1 (mod 2ⁿ + 1)

Hence,

A/2^k = A(2^−k) ≡ A(2^{2n − k}) = A shift-left (2n − k) (mod 2ⁿ + 1).

全体像

The アルゴリズム follows a split, evaluate (forward FFT), pointwise multiply, interpolate (inverse FFT), and combine phases similar to Karatsuba and Toom-Cook methods.

Given input numbers x and y, and an integer N, the following アルゴリズム computes the product xy mod 2^N + 1. Provided N is sufficiently large this is simply the product.

Split each input number into vectors X and Y of 2^k parts each, where 2^k divides N. (e.g. 12345678 -> (12, 34, 56, 78)).
In order to make progress, it's necessary to use a smaller N for recursive multiplications. For this purpose choose n as the smallest integer at least 2N/2^k + k and divisible by 2^k.
Compute the product of X and Y mod 2ⁿ + 1 using the 負巡回畳み込み:
1. Multiply X and Y each by the weight vector A using shifts (shift the jth entry left by jn/2^k).
2. Compute the DFT of X and Y using the number-theoretic FFT (perform all multiplications using shifts; for the 2^k-th root of unity, use 2^{2n/2^k}).
3. Recursively apply this アルゴリズム to multiply corresponding elements of the transformed X and Y.
4. Compute the IDFT of the resulting vector to get the result vector C (perform all multiplications using shifts). This corresponds to interpolation phase.
5. Multiply the result vector C by A⁻¹ using shifts.
6. Adjust signs: some elements of the result may be negative. We compute the largest possible positive value for the jth element of C, (j + 1)2^{2N/2^k}, and if it exceeds this we subtract the modulus 2ⁿ + 1.
Finally, perform carrying mod 2^N+1 to get the final result.

The optimal number of pieces to divide the input into is proportional to ${\sqrt {N}}$ , where N is the number of input bits, and this setting achieves the running time of O(N log N log log N), so the parameter k should be set accordingly. In practice, it is set empirically based on the input sizes and the architecture, typically to a value between 4 and 16.

In step 2, the observation is used that:

Each element of the input vectors has at most n/2^k bits;
The product of any two input vector elements has at most 2n/2^k bits;
Each element of the 畳み込み is the sum of at most 2^k such products, and so cannot exceed 2n/2^k + k bits.
n must be divisible by 2^k to ensure that in the recursive calls the condition "2^k divides N" holds in step 1.

最適化

This section explains a number of important practical optimizations that have been considered when implementing シェーンハーゲ・シュトラッセン法 real systems. It is based primarily on a 2007 work by Gaudry, Kruppa, and Zimmermann describing enhancements to the GNU Multi-Precision Library.^[10]

Below a certain cutoff point, it's more efficient to perform the recursive multiplications using other アルゴリズム, such as Toom–Cook multiplication. The results must be reduced mod 2ⁿ + 1, which can be done efficiently as explained above in Shift optimizations with shifts and adds/subtracts.

Computing the IDFT involves dividing each entry by the primitive root of unity 2^{2n/2^k}, an operation that is frequently combined with multiplying the vector by A⁻¹ afterwards, since both involve division by a power of two.

In a system where a large number is represented as an array of 2^w-bit words, it's useful to ensure that the vector size 2^k is also a multiple of the bits per word by choosing k ≥ w (e.g. choose k ≥ 5 on a 32-bit computer and k ≥ 6 on a 64-bit computer); this allows the inputs to be broken up into pieces without bit shifts, and provides a uniform representation for values mod 2ⁿ + 1 where the high word can only be zero or one.

Normalization involves adding or subtracting the modulus 2ⁿ+1; this value has only two bits set, which means this can be done in constant time on average with a specialized operation.

Iterative FFT アルゴリズム such as the Cooley–Tukey FFT アルゴリズム, although frequently used for FFTs on vectors of complex numbers, tend to exhibit very poor cache locality with the large vector entries used in シェーンハーゲ・シュトラッセン法 The straightforward recursive, not in-place implementation of FFT is more successful, with all operations fitting in the cache beyond a certain point in the call depth, but still makes suboptimal use of the cache in higher call depths. Gaudry, Kruppa, and Zimmerman used a technique combining Bailey's 4-step アルゴリズム with higher radix transforms that combine multiple recursive steps. They also mix phases, going as far into the アルゴリズム as possible on each element of the vector before moving on to the next one.

The "square root of 2 trick", first described by Schönhage, is to note that, provided k ≥ 2, 2^3n/4−2^n/4 is a square root of 2 mod 2ⁿ+1, and so a 4n-th root of unity (since 2²ⁿ ≡ 1). This allows the transform length to be extended from 2^k to 2^{k + 1}.

Finally, the authors are careful to choose the right value of k for different ranges of input numbers, noting that the optimal value of k may go back and forth between the same values several times as the input size increases.

参考文献

^ A. Schönhage and V. Strassen, "Schnelle Multiplikation großer Zahlen", Computing 7 (1971), pp. 281–292.
^ Martin Fürer, "Faster integer multiplication", STOC 2007 Proceedings, pp. 57–66.
^ Rodney Van Meter and Kohei M. Itoh, "Fast quantum modular exponentiation", Physical Review A, Vol. 71 (2005).
^ Overview of Magma V2.9 Features, arithmetic section: 実用的に様々なアルゴリズムの間で性能を比較しどの程度の大きさの数でどのアルゴリズムが有効になるかを議論している。
^ Luis Carlos Coronado García, "Can Schönhage multiplication speed up the RSA encryption or decryption?"
^ “MUL_FFT_THRESHOLD”. GMP developers' corner. 3 November 2011閲覧。
^ “An improved BigInteger class which uses efficient algorithms, including Schönhage–Strassen”. Oracle. 2014年1月10日閲覧。
^ R. Crandall & C. Pomerance.
^ Donald E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition), 1997.
^ Pierrick Gaudry, Alexander Kruppa, and Paul Zimmermann.

[schönhage-1] A. Schönhage and V. Strassen, "Schnelle Multiplikation großer Zahlen", Computing 7 (1971), pp. 281–292.

[2] Martin Fürer, "Faster integer multiplication", STOC 2007 Proceedings, pp. 57–66.

[3] Rodney Van Meter and Kohei M. Itoh, "Fast quantum modular exponentiation", Physical Review A, Vol. 71 (2005).

[4] Overview of Magma V2.9 Features, arithmetic section: 実用的に様々なアルゴリズムの間で性能を比較しどの程度の大きさの数でどのアルゴリズムが有効になるかを議論している。

[5] Luis Carlos Coronado García, "Can Schönhage multiplication speed up the RSA encryption or decryption?"

[6] “MUL_FFT_THRESHOLD”. GMP developers' corner. 3 November 2011閲覧。

[7] “An improved BigInteger class which uses efficient algorithms, including Schönhage–Strassen”. Oracle. 2014年1月10日閲覧。

[crandall-8] R. Crandall & C. Pomerance.

[9] Donald E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms (3rd Edition), 1997.

[Gaudry-10] Pierrick Gaudry, Alexander Kruppa, and Paul Zimmermann.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]