Why does for-range behave differently depending on the size of the slice's struct element?

I was playing with this code:


main_var.go

package main

func main() {
    const size = 1000000

    slice := make([]SomeStruct, size)
    for _, s := range slice { // line 7
        _ = s
    }
}

type_small.go

package main

type SomeStruct struct {
    ID0 int64
    ID1 int64
    ID2 int64
    ID3 int64
    ID4 int64
    ID5 int64
    ID6 int64
    ID7 int64
    ID8 int64
}

I noticed that if I add one more int64 field, ID9, to the struct (10 * 8 bytes = 80 bytes in total), the for loop becomes slower.
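To double-check the size arithmetic, here is a minimal sketch using unsafe.Sizeof. The type names Small and Large and the two helper functions are made up for illustration; they just mirror SomeStruct with 9 and 10 int64 fields.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Small mirrors the 9-field SomeStruct from type_small.go (hypothetical name).
type Small struct {
	ID0, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8 int64
}

// Large is the same struct with the extra ID9 field (hypothetical name).
type Large struct {
	ID0, ID1, ID2, ID3, ID4, ID5, ID6, ID7, ID8, ID9 int64
}

// sizeSmall returns the size in bytes of one Small value.
func sizeSmall() uintptr { return unsafe.Sizeof(Small{}) }

// sizeLarge returns the size in bytes of one Large value.
func sizeLarge() uintptr { return unsafe.Sizeof(Large{}) }

func main() {
	fmt.Println(sizeSmall()) // 9 * 8 = 72
	fmt.Println(sizeLarge()) // 10 * 8 = 80
}
```

Since every field is an int64, there is no padding, so the sizes are exactly the field count times 8.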


When I compare the generated assembly, the larger version has extra instructions that copy the element:


// with 9 int64 (72 bytes)
    0x001d 00029 (main_var.go:6)    LEAQ    type."".SomeStruct(SB), AX
    0x0024 00036 (main_var.go:6)    MOVQ    AX, (SP)
    0x0028 00040 (main_var.go:6)    MOVQ    $1000000, 8(SP)
    0x0031 00049 (main_var.go:6)    MOVQ    $1000000, 16(SP)
    0x003a 00058 (main_var.go:6)    CALL    runtime.makeslice(SB)
    0x003f 00063 (main_var.go:6)    XORL    AX, AX
    0x0041 00065 (main_var.go:7)    INCQ    AX
    0x0044 00068 (main_var.go:7)    CMPQ    AX, $1000000
    0x004a 00074 (main_var.go:7)    JLT    65
    0x004c 00076 (main_var.go:7)    MOVQ    32(SP), BP
    0x0051 00081 (main_var.go:7)    ADDQ    $40, SP
    0x0055 00085 (main_var.go:7)    RET
    0x0056 00086 (main_var.go:7)    NOP
    0x0056 00086 (main_var.go:3)    CALL    runtime.morestack_noctxt(SB)
    0x005b 00091 (main_var.go:3)    JMP    0

// with 10 int64 (80 bytes), it added DUFFCOPY instruction
    0x001d 00029 (main_var.go:6)    LEAQ    type."".SomeStruct(SB), AX
    0x0024 00036 (main_var.go:6)    MOVQ    AX, (SP)
    0x0028 00040 (main_var.go:6)    MOVQ    $1000000, 8(SP)
    0x0031 00049 (main_var.go:6)    MOVQ    $1000000, 16(SP)
    0x003a 00058 (main_var.go:6)    CALL    runtime.makeslice(SB)
    0x003f 00063 (main_var.go:6)    MOVQ    24(SP), AX
    0x0044 00068 (main_var.go:6)    XORL    CX, CX
    0x0046 00070 (main_var.go:7)    JMP    76


I wonder why the larger struct (80 bytes) behaves differently, even though the slice elements are not used in either case.


湖上湖

慕雪6442864

I found that this is caused by the SSA optimization. It is clearest in the lower pass, which converts the intermediate representation into machine-specific instructions.

At writebarrier (one pass before lower), the instructions are still the same for both struct sizes.

    v22 (7) = Phi <*SomeStruct> v14 v45
    v28 (7) = Phi <int> v16 v37
    v23 (7) = Phi <mem> v12 v27
    v37 (+7) = Add64 <int> v28 v36
    v39 (7) = Less64 <bool> v37 v8
    v25 (7) = VarDef <mem> {.autotmp_7} v23
    v26 (7) = LocalAddr <*SomeStruct> {.autotmp_7} v2 v25
    v27 (+7) = Move <mem> {SomeStruct} [72] v26 v22 v25  # <-- copy operation

As you can see, there is a Move operation at v27. After the lower pass, however, the instructions differ.

With 9 int64 (72 bytes):

    v22 (7) = Phi <*SomeStruct> v14 v45
    v28 (7) = Phi <int> v16 v37
    v23 (7) = Phi <mem> v12 v27
    v37 (+7) = ADDQconst <int> [1] v28
    v25 (7) = VarDef <mem> {.autotmp_7} v23
    v26 (7) = LEAQ <*SomeStruct> {.autotmp_7} v2
    v44 (7) = CMPQconst <flags> [1000000] v37
    v32 (+7) = LEAQ <*SomeStruct> {.autotmp_7} [8] v2
    v31 (+7) = ADDQconst <*SomeStruct> [8] v22
    v29 (+7) = MOVQload <uint64> v22 v25
    v24 (+7) = LEAQ <*SomeStruct> {.autotmp_7} [40] v2
    v15 (+7) = ADDQconst <*SomeStruct> [40] v22
    v46 (+7) = LEAQ <*SomeStruct> {.autotmp_7} [56] v2
    v35 (+7) = ADDQconst <*SomeStruct> [56] v22
    v21 (+7) = LEAQ <*SomeStruct> {.autotmp_7} [24] v2
    v17 (+7) = ADDQconst <*SomeStruct> [24] v22
    v39 (7) = SETL <bool> v44
    v42 (7) = TESTB <flags> v39 v39
    v30 (+7) = MOVQstore <mem> {.autotmp_7} v2 v29 v25
    v41 (+7) = MOVOload <int128> [8] v22 v30
    v20 (+7) = MOVOstore <mem> {.autotmp_7} [8] v2 v41 v30
    v34 (+7) = MOVOload <int128> [24] v22 v20
    v19 (+7) = MOVOstore <mem> {.autotmp_7} [24] v2 v34 v20
    v33 (+7) = MOVOload <int128> [40] v22 v19
    v38 (+7) = MOVOstore <mem> {.autotmp_7} [40] v2 v33 v19
    v47 (+7) = MOVOload <int128> [56] v22 v38
    v27 (+7) = MOVOstore <mem> {.autotmp_7} [56] v2 v47 v38

With 10 int64 (80 bytes), the Move is optimized into a Duff's device copy (DUFFCOPY):

    v22 (7) = Phi <*SomeStruct> v14 v45
    v28 (7) = Phi <int> v16 v37
    v23 (7) = Phi <mem> v12 v27
    v37 (+7) = ADDQconst <int> [1] v28
    v25 (7) = VarDef <mem> {.autotmp_7} v23
    v26 (7) = LEAQ <*SomeStruct> {.autotmp_7} v2
    v44 (7) = CMPQconst <flags> [1000000] v37
    v32 (+7) = LEAQ <*SomeStruct> {.autotmp_7} [8] v2
    v31 (+7) = ADDQconst <*SomeStruct> [8] v22
    v29 (+7) = MOVQload <uint64> v22 v25
    v39 (7) = SETL <bool> v44
    v42 (7) = TESTB <flags> v39 v39
    v30 (+7) = MOVQstore <mem> {.autotmp_7} v2 v29 v25
    v27 (+7) = DUFFCOPY <mem> [826] v32 v31 v30 # <---

This optimization comes from the following rule in rewriteAMD64.go:

    match: (Move [s] dst src mem)
    cond: s > 64 && s <= 16*64 && s%16 == 0 && !config.noDuffDevice
    result: (DUFFCOPY [14*(64-s/16)] dst src mem)

In a later pass (elim unread autos), the SSA optimizer can detect that the temporary variable autotmp_7 is never read and remove it, along with the plain load/store instructions that fill it. That does not happen for the larger struct that uses DUFFCOPY, so its copy stays in the loop.

I wrote about this in a bit more detail here.
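To make the rule easier to check by hand, here is a small sketch that evaluates its size condition and its result expression for the two struct sizes. The function names usesDuffCopy and duffCopyAux are made up for this example and are not compiler APIs; the config.noDuffDevice flag is simply assumed to be false.

```go
package main

import "fmt"

// usesDuffCopy reports whether the size part of the quoted rule's condition
// holds for a Move of s bytes. Assumption: config.noDuffDevice is false.
func usesDuffCopy(s int64) bool {
	return s > 64 && s <= 16*64 && s%16 == 0
}

// duffCopyAux evaluates the rule's result expression 14*(64-s/16),
// which becomes the bracketed value on the DUFFCOPY instruction.
func duffCopyAux(s int64) int64 {
	return 14 * (64 - s/16)
}

func main() {
	for _, s := range []int64{72, 80} {
		fmt.Printf("size=%d duffcopy=%v\n", s, usesDuffCopy(s))
	}
	fmt.Println(duffCopyAux(80)) // 14 * (64 - 5) = 826
}
```

A 72-byte Move fails the s%16 == 0 check, while an 80-byte Move passes all three checks, and the computed value 826 is the same number that appears in brackets on the DUFFCOPY line of the dump.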