In technical circles, syncing Ethereum node data is a problem that has frustrated people for a long time. Knowing the questions on everyone's mind, I rented a server specifically to run a node sync test, and below I share the problems that came up during the test along with how I dealt with each of them.
Server configuration
I went with a 2-core Alibaba Cloud server running CentOS 7.4, with a 500 GB high-speed cloud disk attached. If your budget allows, upgrade the spec to something like 4 cores/8 GB or 8 cores/16 GB, which copes with the sync much better. A low-spec machine will hit the problems described later in this article, so choose the highest configuration you can to improve your chances of a successful sync.
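Before committing to a long sync it is worth confirming what the machine actually has; these are standard Linux commands, and /data is just this article's example mount point for the chain data:

nproc          # number of CPU cores
free -h        # total and available memory
df -h /data    # free space on the disk that will hold the chain data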
Node startup configuration
goroutine 10855 [IO wait]:
internal/poll.runtime_pollWait(0x7f4a6599ebb0, 0x72, 0x0)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc43863a198, 0x72, 0xffffffffffffff00, 0x184e740, 0x18475a0)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc43863a198, 0xc462457a00, 0x20, 0x20)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc43863a180, 0xc462457a40, 0x20, 0x20, 0x0, 0x0, 0x0)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/internal/poll/fd_unix.go:126 +0x18a
net.(*netFD).Read(0xc43863a180, 0xc462457a40, 0x20, 0x20, 0x0, 0xc42158dcc0, 0x302b35d6a3a0)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc421aac000, 0xc462457a40, 0x20, 0x20, 0x0, 0x0, 0x0)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/net/net.go:176 +0x6d
io.ReadAtLeast(0x7f4a603b02f8, 0xc421aac000, 0xc462457a40, 0x20, 0x20, 0x20, 0xf1e900, 0x464600, 0x7f4a603b02f8)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/io/io.go:309 +0x86
io.ReadFull(0x7f4a603b02f8, 0xc421aac000, 0xc462457a40, 0x20, 0x20, 0x20, 0x0, 0x6fc23a9b4)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/io/io.go:327 +0x58
github.com/ethereum/go-ethereum/p2p.(*rlpxFrameRW).ReadMsg(0xc43432b650, 0xbe9568cbea77ec48, 0x5186ea4942, 0x19d4c80, 0x0, 0x0, 0x19d4c80, 0x28, 0x11, 0x0)
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/rlpx.go:650 +0x100
github.com/ethereum/go-ethereum/p2p.(*rlpx).ReadMsg(0xc440545da0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/rlpx.go:95 +0x148
github.com/ethereum/go-ethereum/p2p.(*Peer).readLoop(0xc4315cc660, 0xc4315cd0e0)
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/peer.go:251 +0xad
created by github.com/ethereum/go-ethereum/p2p.(*Peer).run
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/peer.go:189 +0xf2
goroutine 14632 [select]:
net.(*netFD).connect.func2(0x18583c0, 0xc42f87c8a0, 0xc488fcaf00, 0xc4caba3da0, 0xc4caba3d40)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/net/fd_unix.go:129 +0xf2
created by net.(*netFD).connect
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/net/fd_unix.go:128 +0x2a3
goroutine 7089 [select]:
github.com/ethereum/go-ethereum/p2p.(*Peer).run(0xc427e1af60, 0xd80820, 0xc44e84bd80, 0x0)
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/peer.go:199 +0x2fe
github.com/ethereum/go-ethereum/p2p.(*Server).runPeer(0xc4201e2fc0, 0xc427e1af60)
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/server.go:790 +0x122
created by github.com/ethereum/go-ethereum/p2p.(*Server).run
/home/travis/gopath/src/github.com/ethereum/go-ethereum/p2p/server.go:570 +0x139c
goroutine 14620 [select]:
net.(*netFD).connect.func2(0x18583c0, 0xc47df23560, 0xc4289ccc80, 0xc4551937a0, 0xc455193740)
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/net/fd_unix.go:129 +0xf2
created by net.(*netFD).connect
/home/travis/.gimme/versions/go1.9.2.linux.amd64/src/net/fd_unix.go:128 +0x2a3
To start the node properly, follow the parameters from the official documentation; I set the cache parameter to 512, and you can raise it according to your server's spec to speed up data sync. This step trips up a lot of people: a crash here typically ends in a goroutine dump like the one above, and if the problems at this stage are not handled properly, the sync that follows will never get going. The WARN lines below are typical of the failures you will see.
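For reference, a minimal launch sketch assuming a geth 1.8-era build, with /data/ethereum as a hypothetical data directory (exact flag spelling varies between versions):

geth --datadir /data/ethereum --syncmode fast --cache 512 console

On a bigger machine, raising --cache (for example to 1024 or 2048) is the single easiest way to speed the sync up.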
WARN [02-03|12:54:57] Synchronisation failed, dropping peer peer=3616e2d0bcacf32f err="retrieved hash chain is invalid"
WARN [02-03|12:56:02] Ancestor below allowance peer=64e4dd3f53e5c01e number=4843643 hash=000000…000000 allowance=4843643
WARN [02-03|12:56:02] Synchronisation failed, dropping peer peer=64e4dd3f53e5c01e err="retrieved ancestor is invalid"
// and the following exceptions
WARN [02-03|12:58:55] Synchronisation failed, dropping peer peer=dbf24adb86cfa3e6 err="no peers available or all tried for download"
WARN [02-03|12:59:23] Synchronisation failed, retrying err="receipt download canceled (requested)"
WARN [02-03|13:00:17] Synchronisation failed, retrying err="peer is unknown or unhealthy"
WARN [02-03|13:03:06] Synchronisation failed, retrying err="block download canceled (requested)"
WARN [02-03|13:03:07] Synchronisation failed, retrying err="peer is unknown or unhealthy"
Sync exceptions
If the program throws an exception while syncing, the most likely cause is a shortage of memory or disk IO that has wedged the process. Usually a restart fixes it. Sometimes it is not that simple, though: the exception comes back no matter how many times you restart, and at that point you need to dig deeper.
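When restarts don't help, first confirm whether the machine itself is the bottleneck; a minimal sketch using standard Linux tools (iostat ships with the sysstat package):

free -h                           # how much memory is actually left
dmesg | grep -i 'out of memory'   # any OOM-killer activity?
iostat -x 2 5                     # disk utilisation; %util pinned near 100 means you are IO-bound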
> net.peerCount
0
Stuck logs
When the log stalls in one place, it means geth is not connected to any working peers; querying from the console shows the connected peer count is 0, as in the snippet above. You can choose to wait, but if nothing happens for a long time, restart the node so it goes looking for new peers, or add peers by hand; the node list published by the Spark Program (星火计划) is useful here, and a sketch of how to feed it to geth follows the list below.
[
"enode://6427b7e7446bb05f22fe7ce9ea175ec05858953d75a5a6e4f99a6aec0779a8bd6276f1959a42fe5948acbe14bcd0652082dc546d3b37ae8f2aea41eba4eca43b@121.201.14.181:30303",
"enode://91922b12115c067005c574844c6bbdb114eb262f90b6355cec89e13b483c3e4669c6d63ec66b6e3ca7a3a462d28edb3c659e9fa05ed4c7234524e582a8816743@120.27.164.92:13333",
"enode://3dde41a994b3b99f938f75ddf6d48318c78ddd869c70b48d00b922190bb434fc5474f6250c143723f4387273d123e02f6a38f07d0311f240d2915f6140e09850@207.226.141.212:30303",
"enode://7ab8fa90b204f2146c00939b8474549c544caa3598a0894fa639a5cdbd992cbc6135fd776f8bcf97ae95fdaa3afbfa2d107fea71549119afd7ea57356b899be5@121.201.24.236:30303",
"enode://db81152a8296089b04a21ad9bf347df3ff0450ffc8215d9f50c400ccf8d18963118010cacf03c4b71981cf9cac5394438cab3039e98db4d2aae5859ab7d1793e@139.198.1.244:30303",
"enode://68dd1360f0a4ac362b41124692e31652ffe26f6f06a284ca11f3b514b3968594ac1f4320d1aa1ca343b06327c18a2e40eded74edfb3086e1baaa27ca24226b21@113.106.85.172:30303",
"enode://58f6b6908286cefe43c166cfc4fed033c750caa1bc3f6e1e1e1507752c0b91248addb3122f8557c5f8912e702285a160ab3a10203ae1eff3807eda25d6ed6478@45.113.71.186:30303",
"enode://87190a01c02cafb97e7f49672b4c3be2937cf79c3969e0b8e7b35cac28cebfbda52a13d56fd2113c726a1dd359c9476ccf7e60651439cef56e3a71039f6a4f5e@119.29.207.90:30303",
"enode://d1fdd05a62fd9544eeb455e4f4d4bd8bb574138d82d8f909f3041d0792e3401f8695133d39ad0a3aa5d217e3c5bed0511b531505a67b03607a909ae9096720d2@120.26.129.121:30303",
"enode://a1e9cf99eca94590ae776c8dd5c6c043a8c1f0375e9e391c9fb55133385bf453ac3d3fb3ead8e63415b2ef99d54a19e2a7bc830cd1fdbbb283818e3bcb0ea31e@182.254.209.254:30303",
"enode://562796b19d43d79dfb6160abd2d7bb78a2f2efd9501a0a767c00677e0fb3a4407235f813c3003682c2a421a58709c52f595827bc15708cc5f534f55d0f8d03ad@121.40.199.54:30303",
"enode://fa2c17dcc83a6e2825668210abf7480452de4b13d8bdea8f301c3b603701918bc4dade9e68d119d7a8214e90e7ea10a2782041c98951385d97bee73358fb08f4@120.26.124.58:30303",
"enode://0b331b27e2976d797aed1d1464ac483a7f262860334cb5737a01a0188da08d79226a6973adc5f2a2c1a20192b399161eee23a0d56ecf472cbe4058d010ecc89f@47.89.49.61:30303",
"enode://0639f20fdb5af1fecd2f2bc0ddb648885483a5945686530e6b046678635d3435dd7b92fe34209f81ec6f003482aa78e407e5e6eb1b10be4773a2adbcf1fc1ba6@118.192.161.147:30303",
"enode://fd2a5d30e4f3917ee640876cc57d72a8bf5ecf049e9106c95e60cf306dd7a5dd68d1a295f3718af44a7083252686926d6e8a402f1abe6f805e10e7281967db28@121.201.29.82:30303",
"enode://0d1b9eed7afe2d5878d5d8a4c2066b600a3bcac2e5730586421af224e93a58cd03cac75bf0b2a62fd8049cd3692a085758cc1e407c8b2c94bb069814a5e8d0f0@209.9.106.245:30303",
"enode://ca087a651571d04953187753af969f7deb1582af2a06a3048b90adb3f87d4c41973aac4b5e80449efc97154dac769a5ea447b123c3aaf7a2c23825a1558804dc@182.150.37.23:30303",
"enode://9b53b9d41d964f71db60d2198cfa9013fc7808d707c5e0a32da1e22d3cacd6adbae46901df6506a752d9d4e3791df29171315fbb86f7b09331a25458158fe65b@182.150.37.24:30303"
]
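Two ways to use these entries, sketched on the assumption of a pre-1.9 geth directory layout (/data/ethereum is this article's example datadir): add peers live through the admin API in the attached console, or save the whole array as static-nodes.json so geth dials them at startup.

$ geth attach /data/ethereum/geth.ipc
> admin.addPeer("enode://6427b7e7446bb05f22fe7ce9ea175ec05858953d75a5a6e4f99a6aec0779a8bd6276f1959a42fe5948acbe14bcd0652082dc546d3b37ae8f2aea41eba4eca43b@121.201.14.181:30303")
true
> net.peerCount

# or persist the list (the file name and location are what old geth versions expect):
$ cp spark-nodes.json /data/ethereum/geth/static-nodes.json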
Node shutting down on its own
If geth shuts down by itself with no error in the log, the server has run out of memory and Linux's OOM killer has stepped in. Short of adding memory there is no good fix at the moment; you just have to keep an eye on the process and restart it whenever it dies. Setting up swap is a workable compromise, but it slows the sync down, so you have to weigh speed against stability.
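If you do take the swap route, the standard Linux recipe looks like this; the 4 GB size is an arbitrary example:

fallocate -l 4G /swapfile   # create the backing file
chmod 600 /swapfile         # swap files must not be world-readable
mkswap /swapfile            # format it as swap
swapon /swapfile            # enable it now; add a line to /etc/fstab to survive reboots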
Logs printing endlessly
If lines like the one below keep printing for a long time while the synced block height stays put, with no other activity in the log, and hours of waiting change nothing, cut your losses early. It means something in the underlying infrastructure, such as the network or the disk, is at fault; there is no telling how long that will take to resolve, and even a restart will likely land you back in the same hole, so don't waste the time and effort.
INFO [02-03|13:07:24] Imported new state entries count=1142 elapsed=5.888ms processed=84671 pending=1907 retry=0 duplicate=0 unexpected=170
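One way to tell whether the state download is still making progress is to poll eth.syncing from the console; fast-sync-era geth reports pulledStates and knownStates counters there (a sketch, and field availability depends on the version):

$ geth attach /data/ethereum/geth.ipc
> eth.syncing

If pulledStates keeps climbing between calls, the node is still working through the state trie and patience is warranted; if it is frozen along with the block height, the infrastructure diagnosis above is probably right.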
I had the server deployed and the sync running by 6 o'clock the day before. It was fast at first while transactions were sparse; past 2 a.m. the node froze, and a little after 7 I restarted it and began syncing again. Along the way it died, threw exceptions, and got OOM-killed several times. Once the sync got within roughly 200 blocks of the latest height, the structure-loading stage dragged on for a very long time; be patient there and don't rush to restart. What thorny problems have you run into while syncing Ethereum node data? If you found this article useful, give it a like and share it!