Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small difference makes suspicion performance decreasing #8624

Open
hungryzzz opened this issue May 15, 2024 · 3 comments
Open

Small difference makes suspicion performance decreasing #8624

hungryzzz opened this issue May 15, 2024 · 3 comments

Comments

@hungryzzz
Copy link
Contributor

hungryzzz commented May 15, 2024

Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases(good.wasm&bad.wasm) in Wasmtime and WasmEdge(AOT), and collect their execution time respectively (measured by time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 1.17s
  • WasmEdge: 0.59s

For bad.wasm, the execution time in different runtimes are as follows:

  • Wasmtime: 6.01s
  • WasmEdge: 0.59s

The difference between the attached two cases is as follow, i.e., changing the operand of i32.add from i32.const 0 to local.get 1, which can bring 5x performance decreasing on Wasmtime but has no effect on WasmEdge.

➜  cases diff good.wat bad.wat
62c62
<                       i32.const 0
---
>                       local.get 1
;; good.wat
i32.const 0
i32.const 1
i32.add

;; bad.wat
local.get 1
i32.const 1
i32.add

At first I thought the performance decreasing was caused by the difference, because the good one uses a constant while the bad one uses a local variable which may need to fetch from memory. So I do a small experiment: repeatly calculate (2000000000 times) an addition operation whose operand are a constant or a local variable and measure the execution time respectively. And I find that they are almost the same, 1.09s vs. 1.1s. Therefore, I think the above performance decreasing is caused by other reasons.

;; experiment case
(module
  (type (;0;) (func))
  (type (;1;) (func (param i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 1)))
  (func (;1;) (type 0)
    (local i32 i32)
    loop
      ;; i32.const 0
      local.get 1
      i32.const 1
      i32.add
      local.get 0
      i32.add
      local.tee 0
      i32.const 2000000000
      i32.ne
      br_if 0
    end
    i32.const 0
    call 0
    unreachable)
  (memory (;0;) 8192 8192)
  (export "_start" (func 1))
  (export "memory" (memory 0)))

Profiling Information

I use Perf tool to profile the execution time and find that the hotspot is in the loop where the small difference happens, so I think the difference change some compilation strategy which may cause the performance decreasing.

Samples: 23K of event 'cycles', Event count (approx.): 21853956752
Overhead  Command          Shared Object      Symbol
  99.87%  wasmtime         jitted-93855-1.so  [.] wasm[0]::function[2]                                                                                           ▒
   0.02%  tokio-runtime-w  [kernel.kallsyms]  [k] __mod_memcg_lruvec_state                                                                                       ▒
   0.02%  wasmtime         ld-2.31.so         [.] _dl_relocate_object                                                                                            ▒
   0.02%  wasmtime         [kernel.kallsyms]  [k] __do_fault                                                                                                     ▒
   0.01%  wasmtime         [kernel.kallsyms]  [k] pmd_page_vaddr                                                                                                 ▒
# the commands to profile
perf record -k mono ~/wasmtime/target/release/wasmtime run --profile jitdump --allow-precompiled bad.cwasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

Versions and Environment

  • Wasmtime version or commit: d0cf46a
  • Operating system: Linux ringzzz-OptiPlex-7070 5.15.0-97-generic
  • Architecture: Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
@alexcrichton
Copy link
Member

Locally on Intel(R) Core(TM) i9-14900K I get 0.9s on bad.wasm and 0.74s on good.wasm

Profiling good.wasm the hot loop is, according to perf:

       │ d5:┌─→mov    (%rbx),%r10d                                                                                                                                                                                                                                                                                             ▒
       │    │  add    $0x1,%r13d                                                                                                                                                                                                                                                                                               ▒
       │    ├──test   %r13d,%r13d                                                                                                                                                                                                                                                                                              ▒
100.00 │    └──jne    d5                                                                                                                                                                                                                                                                                                       ▒

For bad.wasm it's

       │ c5:┌─→mov    (%rbx),%r10d                                                                                                                                                                                                                                                                                             ▒
       │    │  mov    (%rsp),%rcx                                                                                                                                                                                                                                                                                              ▒
  0.09 │    │  lea    0x1(%rcx),%r10d                                                                                                                                                                                                                                                                                          ▒
 55.32 │    │  test   %r10d,%r10d                                                                                                                                                                                                                                                                                              ▒
       │    │↓ je     e5                                                                                                                                                                                                                                                                                                       ▒
  0.06 │    │  mov    %r10,%rcx                                                                                                                                                                                                                                                                                                ▒
  0.03 │    │  mov    %rcx,(%rsp)                                                                                                                                                                                                                                                                                              ▒
 44.51 │    └──jmp    c5                                                                                                                                                                                                                                                                                                       ▒
       │ e5:   xor    %edi,%edi                                                                                                                                                                                                                                                                                                ▒

I believe WasmEdge is an LLVM-based backend so it probably unrolls this loop or something similar.

@hungryzzz
Copy link
Contributor Author

hungryzzz commented May 16, 2024

I find that there is not difference between the binary generated by WasmEdge before and after changes, so I think maybe the block which contains the small changes is optimized (eliminated) during code generation in LLVM.

So maybe the performance decreasing in Wasmtime is caused by the newly generated instructions in loop, but I am still surprised that only the small changes in machine code can make more than 5 times performance impact...

@alexcrichton
Copy link
Member

My guess is you're hitting micro-architectural limits or things like that. Basically various cliffs in the CPU in terms of performance where once you fall off the happy path it's both difficult to explain why and difficult to understand the effects. That's just my best guess though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants