How to Collect CPU Benchmarks

These are my notes on ways to check CPU performance.

I split this out so that the next article would not get too long.

CPU Benchmark Charts

This site is the easiest way to look up CPU performance.

PassMark Software – CPU Benchmark Charts

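In my case, I browse "Single Thread" in the sidebar, or use the search box in the header to go to a specific model's page and check its Single Thread value.

Overall multi-core performance matters too, but I personally give more weight to single-thread performance because it directly affects things like response time.

CPU Information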

/proc/cpuinfo

On Linux, this shows most of the information you need.

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping        : 1
microcode       : 0xb000038
cpu MHz         : 2300.015
cache size      : 46080 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bogomips        : 4600.04
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

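The model name field shows the CPU model. On the newer Arm CPUs on AWS, however, it is not displayed.

physical id is the number of the physical CPU (socket),
core id is the core number within that CPU, and
processor, at the top, is the thread (logical CPU) number.

For example, if one server has two 4-core, multi-threaded CPUs, physical id is [0-1], core id within each CPU is [0-3], and processor is [0-15]. I forget whether the two threads sharing core id 0 are numbered as the pair 0,1 or the pair 0,4.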
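
To see how processor, physical id, and core id line up on a given machine, the fields can be pulled out of /proc/cpuinfo. The following is a minimal sketch, assuming the x86-style field order shown above (physical id before core id); the function name cpu_topology is made up for illustration. Lines that share the same physical id and core id are sibling threads of one core. On CPUs that do not report these fields it prints nothing.

```shell
#!/bin/sh
# Print one line per logical CPU: "processor physical_id core_id".
# Reads /proc/cpuinfo-formatted text from stdin.
# cpu_topology is a hypothetical helper name, not an existing command.
cpu_topology() {
    awk -F': *' '
        # Remove the tab/space padding that precedes the colon.
        { key = $1; gsub(/[ \t]+$/, "", key) }
        key == "processor"   { proc = $2 }
        key == "physical id" { phys = $2 }
        # core id comes after physical id, so print once it is seen.
        key == "core id"     { print proc, phys, $2 }
    '
}

# Usage on a Linux host:
#   cpu_topology < /proc/cpuinfo
```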

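In the cloud era I no longer worry about this much, but when I built a private cloud it was an important point, because I had to make sure the threads assigned to guests did not overlap.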

lscpu

This command shows a slightly different set of information.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                1
On-line CPU(s) list:   0
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2300.015
BogoMIPS:              4600.04
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0
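
The socket, core, and thread counts in this output multiply out to the number of logical CPUs. As a small sketch (the function name lscpu_logical_cpus is an assumption, and the labels are assumed to be the English ones shown above, so run lscpu with LC_ALL=C if your locale differs):

```shell
#!/bin/sh
# Multiply Socket(s) x Core(s) per socket x Thread(s) per core
# from lscpu output given on stdin.
# lscpu_logical_cpus is a hypothetical helper name.
lscpu_logical_cpus() {
    awk -F': *' '
        $1 == "Socket(s)"          { s = $2 }
        $1 == "Core(s) per socket" { c = $2 }
        $1 == "Thread(s) per core" { t = $2 }
        END { print s * c * t }
    '
}

# Usage on a Linux host:
#   LC_ALL=C lscpu | lscpu_logical_cpus
```

For a server with 2 sockets, 4 cores per socket, and 2 threads per core, this gives 16, matching processor numbers 0 through 15.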

top

By default, top shows CPU usage with the whole machine counted as 100%:

$ top
top - 15:32:05 up 225 days,  1:02,  1 user,  load average: 0.66, 0.51, 0.45
Tasks: 409 total,   1 running, 406 sleeping,   0 stopped,   2 zombie
%Cpu(s):  0.1 us,  0.0 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13185012+total, 14100336 free, 10678348+used, 10966308 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 24107176 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5589 mysql     20   0  0.116t 0.097t   7544 S  37.5 79.3  13355:15 mysqld
62309 saito_y+  20   0  168360   2488   1532 R   6.2  0.0   0:00.01 top

Pressing 1 while top is running switches to a per-thread display. This example is from a vCPU=40 machine, with the middle omitted.

$ top
top - 14:59:44 up 225 days, 29 min,  1 user,  load average: 0.47, 0.45, 0.43
Tasks: 409 total,   1 running, 406 sleeping,   0 stopped,   2 zombie
%Cpu0  :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
~snip~
%Cpu38 :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu39 :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13185012+total, 14243840 free, 10678503+used, 10821256 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 24106720 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5589 mysql     20   0  0.116t 0.097t   7544 S  37.6 79.3  13341:15 mysqld
43711 saito_y+  20   0  168328   2612   1620 R   2.0  0.0   0:00.10 top

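Looking at usage per thread lets you check things like whether the number of threads in use matches what you expected, whether the load is spread evenly, or whether only a master-like process is using a lot of CPU.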
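
If you want the same per-thread numbers in a script rather than on screen, one option is to compare two snapshots of /proc/stat. This is a rough sketch and not how top itself is implemented; the function name cpu_busy_pct is hypothetical, and "busy" is assumed here to mean everything except idle and iowait.

```shell
#!/bin/sh
# Print "cpuN busy%" for each logical CPU from two /proc/stat snapshots.
# $1: earlier snapshot file, $2: later snapshot file
# cpu_busy_pct is a hypothetical helper name.
cpu_busy_pct() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                         # idle + iowait
            if (FNR == NR) {                       # first file: remember values
                t0[$1] = total; i0[$1] = idle; next
            }
            dt = total - t0[$1]                    # elapsed ticks
            di = idle - i0[$1]                     # idle ticks in that window
            if (dt > 0) printf "%s %.1f\n", $1, 100 * (dt - di) / dt
        }
    ' "$1" "$2"
}

# Usage on a Linux host:
#   cat /proc/stat > /tmp/stat.a; sleep 5; cat /proc/stat > /tmp/stat.b
#   cpu_busy_pct /tmp/stat.a /tmp/stat.b
```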

Himeno Benchmark

The following is the setup on the Amazon Linux 2 AMI.

lha

The lha command does not seem to be available as a package, so build and install it.

yum -y install gcc automake autoconf

cd /usr/local/src
wget -O lha.zip https://github.com/jca02266/lha/archive/master.zip
unzip lha.zip
cd lha-master/

aclocal; autoheader; automake -a; autoconf
./configure
make
make install

type lha

himeno mpirun

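The download page is here:

Download (parallel versions using MPI and VPP) | RIKEN Information Systems Division

Install it. Preparing a separate binary for each degree of parallelism feels like I am doing something wrong, but here it is.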

cd /usr/local/src
wget https://i.riken.jp/wp-content/uploads/2015/07/cc_himenobmtxp_mpi.zip
unzip cc_himenobmtxp_mpi.zip

mkdir himeno
cd himeno/
lha e ../cc_himenobmtxp_mpi.lzh

yum -y install openmpi-devel
ln -s /usr/lib64/openmpi/bin/mpicc /usr/local/bin/
ln -s /usr/lib64/openmpi/bin/mpirun /usr/local/bin/

cp Makefile.sample Makefile
chmod +x paramset.sh

for i in 1 2 4 8
do
    ./paramset.sh M 1 1 $i
    make
    mv bmt p$i.bmt
    make clean
done

Running it looks like this. Note down the final MFLOPS measured value.

$ mpirun -np 4 ./p4.bmt
Sequential version array size
 mimax = 129 mjmax = 129 mkmax = 257
Parallel version array size
 mimax = 129 mjmax = 129 mkmax = 67
imax = 128 jmax = 128 kmax =65
I-decomp = 1 J-decomp = 1 K-decomp =4
 Start rehearsal measurement process.
 Measure the performance in 3 times.

 MFLOPS: 15840.417755 time(s): 0.025966 1.702138e-03

 Now, start the actual measurement process.
 The loop will be excuted in 6932 times
 This will take about one minute.
 Wait for a while

cpu : 57.226783 sec.
Loop executed for 6932 times
Gosa : 7.446413e-05
MFLOPS measured : 16607.831998
Score based on Pentium III 600MHz : 200.480830
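
When running several parallel counts in a row, picking out just the MFLOPS measured line saves some scrolling. A minimal sketch, assuming the output format shown above (the function name himeno_mflops is made up for illustration):

```shell
#!/bin/sh
# Print only the value of the "MFLOPS measured" line from Himeno output on stdin.
# The rehearsal line starts with "MFLOPS:" and is deliberately not matched.
# himeno_mflops is a hypothetical helper name.
himeno_mflops() {
    awk -F': *' '/MFLOPS measured/ { print $2 }'
}

# Usage with the binaries built above (p1.bmt, p2.bmt, p4.bmt, p8.bmt):
#   for i in 1 2 4 8; do
#       printf '%s ' "$i"
#       mpirun -np "$i" ./p"$i".bmt | himeno_mflops
#   done
```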

OpenSSL speed

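This is a very specialized workload, but it is easy to run. Using the openssl that is probably already installed, run it with a cipher and a number of parallel processes specified.

Compare using the last value on the final line, for example.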

$ openssl speed -elapsed -evp aes-128-gcm -multi 4

Forked child 0
Forked child 1
+DT:aes-128-gcm:3:16
Forked child 2
Forked child 3
~
Got: +H:16:64:256:1024:8192 from 0
Got: +F:22:aes-128-gcm:600754778.67:1387024320.00:2609117781.33:4128702464.00:5146301781.33 from 0
Got: +H:16:64:256:1024:8192 from 1
Got: +F:22:aes-128-gcm:602350976.00:1387460650.67:2610551722.67:4131929088.00:5154248021.33 from 1
Got: +H:16:64:256:1024:8192 from 2
Got: +F:22:aes-128-gcm:602043280.00:1387473728.00:2610055765.33:4130897578.67:5136031744.00 from 2
Got: +H:16:64:256:1024:8192 from 3
Got: +F:22:aes-128-gcm:601597354.67:1386964458.67:2610664362.67:4131077802.67:5134775637.33 from 3
OpenSSL 1.0.2k-fips  26 Jan 2017
built on: reproducible build, date unspecified
options:bn(64,64) md2(int) rc4(16x,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(idx)
compiler: gcc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DZLIB -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DKRB5_MIT -m64 -DL_ENDIAN -Wall -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches    -m64 -mtune=generic -Wa,--noexecstack -DPURIFY -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DRC4_ASM -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
evp            2406746.39k  5548923.16k 10440389.63k 16522606.93k 20571357.18k
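
To grab that last value for comparison across machines, the summary line can be filtered out of the output. This is a sketch under the assumption that the summary line starts with the label evp, as in the OpenSSL 1.0.2k output above; other OpenSSL versions may label the line differently, so the label is a parameter. The function name openssl_speed_last is hypothetical. The value is in thousands of bytes per second for the largest block size.

```shell
#!/bin/sh
# Print the last column of the openssl speed summary line, without the "k" suffix.
# $1: label at the start of the summary line (defaults to "evp")
# openssl_speed_last is a hypothetical helper name.
openssl_speed_last() {
    awk -v label="${1:-evp}" '
        $1 == label { v = $NF; sub(/k$/, "", v); print v }
    '
}

# Usage:
#   openssl speed -elapsed -evp aes-128-gcm -multi 4 2>/dev/null | openssl_speed_last
```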

PassMark PerformanceTest

This is the test tool from the company that publishes the CPU Benchmark Charts mentioned at the beginning.

PassMark PerformanceTest Linux – Linux System Benchmark Software

The page above covers the setup, but here it is with a few missing pieces filled in.

yum -y install ncurses-compat-libs

cd /usr/local/src
# arm
wget https://www.passmark.com/downloads/pt_linux_ARM_64.zip
unzip pt_linux_ARM_64.zip
# x86
wget https://www.passmark.com/downloads/pt_linux_x86_64.zip
unzip pt_linux_x86_64.zip

Running it looks like this. As shown at the bottom, pressing A starts all the tests, so wait for them to finish.

$ ./pt_linux_ARM_64
                          PassMark PerformanceTest Linux


 (aarch64)
4 cores @ 0 MHz  |  7.6 GiB RAM
Number of Processes: 4  |  Test Iterations: 3  |  Test Duration: Short

                                                       Iteration: 3/3
Tests                              Status              Result
Integer Math                       Complete            19026.55 MOps/s
Floating Point Math                Complete            12057.33 MOps/s
Prime Numbers                      Complete            36.07 Million Primes/s
Sorting                            Complete            10464.38 Thousand Strings/s
Encryption                         Complete            544.56 MB/s
Compression                        Complete            18.53 MB/s
CPU Single Threaded                Complete            1090.09 MOps/s
Physics                            Complete            725.57 Frames/s
Extended Instructions (NEON)       Complete            2329.41 Million Matrices/s
Cross-platform Mark                Complete            12945.05

CPU Mark                           Complete            2735.94

Results submission disabled
Unable to get CPU model, upload disabled

Use ESC or CTRL-C to exit
A: Run All Tests   U: Upload Test Results
1: Integer Test    2: Floating Point   3: Prime     4: Sorting  5: Encryption
6: Compression     7: Single Thread    8: Physics   9: SSE      0: Cross-platform

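CPU Mark is the overall result, but individual tests can differ quite a bit, so for a serious comparison it is better to compare each value.

Also, the test content apparently gets updated over time, so comparing against old numbers is not a good idea.

Load Testing with Middleware

Newer CPUs tend to be faster, but each has strengths and weaknesses depending on the workload and the degree of parallelism, so unless the generations are far apart, a newer CPU is not guaranteed to be faster.

So treat CPU tests as a reference only; what matters is testing with the middleware you will actually run.

Application Servers

For environments where load can be distributed in parallel, such as web servers, using a load-testing tool is fine. But if you already have a production environment and want to compare it against a new CPU, you can measure fairly easily.

Add a server with the new CPU but the same OS and configuration to the existing load-balanced group. If the balancing algorithm distributes evenly, such as round robin, you can compare the old and new numbers in that state, or record them in graphs, to find the performance ratio.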

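If the old CPU runs at 40% usage and the new CPU at 30%, the performance ratio is 40/30, or about 1.33x. You can then estimate that after switching over you could cut the server count to 30/40, or 75%, of the original and still land at roughly the original 40% average usage, which saves money.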
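
The same estimate can be written as a small calculation. This is only a sketch: it assumes CPU usage scales linearly with load, and the function name estimate_servers and the server counts in the usage lines are made up for illustration.

```shell
#!/bin/sh
# Estimate the performance ratio and the number of new-CPU servers needed
# to return to the old average usage level.
# $1: old CPU usage (%), $2: new CPU usage (%), $3: current server count
# estimate_servers is a hypothetical helper name.
estimate_servers() {
    awk -v old="$1" -v new="$2" -v n="$3" 'BEGIN {
        ratio = old / new              # new CPU performance relative to old
        need  = n * new / old          # servers needed at the old usage level
        count = int(need)
        if (count < need) count++      # round up to a whole server
        printf "ratio=%.2f servers=%d\n", ratio, count
    }'
}

# Usage:
#   estimate_servers 40 30 8     # prints: ratio=1.33 servers=6
#   estimate_servers 40 30 10    # prints: ratio=1.33 servers=8
```

Rounding up keeps the estimate on the safe side when the result is not a whole number.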

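Reusing the same image is the most reliable approach, but going from x86 to Arm requires rebuilding it. In that case, if your infrastructure build is codified, rebuild it quickly, launch a single instance first and verify that it works, just in case, and only then add it to the group. That keeps the risk low.

Database Servers

With a DB you cannot quietly mix a server in the way you can with application servers. So it is the usual routine in a test environment: turn caching off, randomize the values in the queries, prepare roughly twice as many client connections as vCPUs, and hammer it.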
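
As a rough illustration of the "random values" part, the sketch below generates a file of queries with random keys. The table name items, the column id, and the function name gen_random_queries are all hypothetical; substitute your own schema and feed the result to whichever load tool you use.

```shell
#!/bin/sh
# Emit $1 SELECT statements with random ids in the range 1..$2.
# "items" and "id" are placeholder names for illustration only.
# gen_random_queries is a hypothetical helper name.
gen_random_queries() {
    awk -v n="$1" -v max="$2" 'BEGIN {
        srand()                                    # seed from the current time
        for (i = 0; i < n; i++)
            printf "SELECT * FROM items WHERE id = %d;\n", int(rand() * max) + 1
    }'
}

# Usage:
#   gen_random_queries 100000 5000000 > queries.sql
#   conns=$(( $(nproc) * 2 ))    # about twice the number of vCPUs
```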

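If you want to skip creating test data for that environment, launch two instances from a production backup, one on the old CPU and one on the new, and test there. That gives quite realistic measurements.

The key point is that standalone CPU tests are only reference values; the numbers you can trust come from tests run through the middleware and software you actually use.

Back when we bought servers, the flow was to borrow a test machine beforehand, measure it, and buy only once satisfied. Now that you can change instance types in the cloud with a few clicks, there is no need to measure that strictly. As long as the application runs healthily in terms of cost and performance, I think that is good enough.

...and with that, on to the next article :-)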