TiDB error messages


Wilfred Chau

May 13, 2018, 10:11:20 PM5/13/18
to TiDB user group
Why do I see the below error messages:

From TiKV:
2018/05/11 21:49:15.721 store.rs:3360: [ERROR] Send compaction finished event to raftstore failed: Discard("Failed to send CompactedEvent cf write due to full")
2018/05/11 21:49:35.419 kv.rs:1053: [ERROR] send raft msg to raft store fail: Transport(Discard("Failed to send Raft Message due to full"))
2018/05/11 21:49:35.443 snap.rs:1087: [ERROR] notify snapshot stats failed Discard("Failed to send Snapshot stats due to full")

Mainly from tikv0:
tikv0_1                | 2018/05/11 21:30:28.370 store.rs:3179: [ERROR] [store 3] handle raft message err: Other(StringError("[src/raftstore/store/store.rs:924]: [region 41443934] region not exist but not tombstone: region {id: 41443934 start_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\177\\377\\342\\000\\000\\000\\000\\000\\372\" end_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\224~\\022\\000\\000\\000\\000\\000\\372\" region_epoch {conf_ver: 1153 version: 998} peers {id: 41443935 store_id: 1} peers {id: 41443936 store_id: 3} peers {id: 41443937 store_id: 6}}"))
tikv0_1                | 2018/05/11 21:30:28.390 store.rs:3179: [ERROR] [store 3] handle raft message err: Other(StringError("[src/raftstore/store/store.rs:924]: [region 41444872] region not exist but not tombstone: region {id: 41444872 start_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\224~\\022\\000\\000\\000\\000\\000\\372\" end_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\251\\214\\241\\000\\000\\000\\000\\000\\372\" region_epoch {conf_ver: 1153 version: 999} peers {id: 41444873 store_id: 1} peers {id: 41444874 store_id: 3} peers {id: 41444875 store_id: 6}}"))
tikv0_1                | 2018/05/11 21:30:30.886 store.rs:3179: [ERROR] [store 3] handle raft message err: Other(StringError("[src/raftstore/store/store.rs:924]: [region 41443934] region not exist but not tombstone: region {id: 41443934 start_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\177\\377\\342\\000\\000\\000\\000\\000\\372\" end_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\224~\\022\\000\\000\\000\\000\\000\\372\" region_epoch {conf_ver: 1153 version: 998} peers {id: 41443935 store_id: 1} peers {id: 41443936 store_id: 3} peers {id: 41443937 store_id: 6}}"))
tikv0_1                | 2018/05/11 21:30:30.886 store.rs:3179: [ERROR] [store 3] handle raft message err: Other(StringError("[src/raftstore/store/store.rs:924]: [region 41444872] region not exist but not tombstone: region {id: 41444872 start_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\224~\\022\\000\\000\\000\\000\\000\\372\" end_key: \"t\\200\\000\\000\\000\\000\\000\\005\\377\\021_r\\200\\000\\000\\000\\001\\377\\251\\214\\241\\000\\000\\000\\000\\000\\372\" region_epoch {conf_ver: 1153 version: 999} peers {id: 41444873 store_id: 1} peers {id: 41444874 store_id: 3} peers {id: 41444875 store_id: 6}}"))


From pd:
2018/05/11 21:31:53.255 log.go:80: [error] rafthttp: [failed to read cca50c2cc3dee5e7 on stream MsgApp v2 (context canceled)]
2018/05/11 21:31:53.256 log.go:80: [error] rafthttp: [failed to read e6170d714ae2fcfb on stream Message (context canceled)]
2018/05/11 21:32:10.862 etcdutil.go:65: [error] could not retrieve cluster information from the given urls


tl

May 14, 2018, 1:08:30 AM5/14/18
to TiDB user group
It seems that your TiKV is very busy. What is the machine configuration for your cluster? How did you deploy the cluster?

Can you show me the iostat and top results for TiKV?
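
For example, something like the following captures that kind of snapshot (the exact flags here are only an illustration, not a requirement):

  iostat -x 2 5             # extended per-device statistics, 5 samples at 2-second intervals
  top -b -n 1 | head -n 40  # one batch-mode snapshot of the busiest processes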

On Monday, May 14, 2018 at 10:11:20 AM UTC+8, Wilfred Chau wrote:

Wilfred Chau

May 14, 2018, 10:44:38 PM5/14/18
to TiDB user group
What I have been trying to do is to load 8 very large tables.  To give you an idea of the total size, mydumper created 8,500 files of 64 MB each.

A few days ago, I restarted the process using the loader tool. This time, after approx. 3 hours, it errored out with this:
commit failed Error 1105: Information schema is out of date.

We used your docker-compose file to build the environment.  If I rerun loader, will it resume from where it stopped without any data corruption? Thanks
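
For reference, a rough sketch of the kind of mydumper/loader invocation involved (hosts, database names, paths, and flag values below are assumptions for illustration, not the exact commands used):

  # dump the source database into 64 MB chunks
  ./bin/mydumper -h <source-mysql-host> -u root -t 16 -F 64 -B <database> -o ./dumped_data
  # restore the dump into TiDB through the docker-compose endpoint
  ./bin/loader -d ./dumped_data -h 127.0.0.1 -P 4000 -u root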



Here is the iostat

Linux 4.13.0-21-generic (uwt-hwdb-shared-p01)   05/14/2018      _x86_64_        (48 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.25    0.00    0.28    0.18    0.00   95.30

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1          13.18       148.59      8922.71   49370087 2964656716
nvme1n1          22.19       341.31      9479.27  113403135 3149577900
sda               0.84         4.13        13.23    1370857    4394520
sdb              38.97       372.13      4677.54  123643515 1554158680

and here is the output from top
top - 10:07:05 up 3 days, 20:18,  2 users,  load average: 3.10, 2.22, 1.96
Tasks: 639 total,   1 running, 638 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.8 us,  0.3 sy,  0.0 ni, 93.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 52777676+total, 24062707+free, 72630160 used, 21451953+buff/cache
KiB Swap:  2097148 total,  2096880 free,      268 used. 45171718+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 88561 root      20   0 17.712g 4.163g  55672 S 106.6  0.8   1610:54 tikv-server
 88563 root      20   0 21.595g 0.012t  55504 S  95.0  2.5   1645:05 tikv-server
 88596 root      20   0 8572252 2.603g  55832 S  17.9  0.5   2859:08 tikv-server
 86715 root      20   0 10.263g 264784  50768 S  17.5  0.1 608:24.86 pd-server
 86712 root      20   0   66348  49628   5664 S  11.6  0.0 301:48.71 pushgateway
 88772 root      20   0 6442332 1.841g  55568 S   8.3  0.4 546:43.74 tikv-server
 88766 root      20   0 8303964 1.928g  55996 S   7.9  0.4 351:18.17 tikv-server
 88531 root      20   0 17.859g 0.011t  55588 S   7.0  2.3   1473:37 tikv-server
 88621 root      20   0 14.019g 8.118g  55872 S   7.0  1.6 318:18.62 tikv-server
 90499 root      20   0 4067772 337544  22084 S   5.6  0.1 113:06.58 tidb-server
 88720 root      20   0 30.861g 0.022t  55620 S   5.3  4.5 535:59.93 tikv-server
 86898 root      20   0 6848296 427148  27172 S   3.3  0.1  92:36.05 prometheus
 86929 root      20   0 10.201g 166184  50680 S   1.0  0.0  37:21.35 pd-server
 86980 root      20   0 10.198g 135852  50764 S   1.0  0.0  34:07.11 pd-server
 86677 root      20   0 10.200g 143544  50612 S   0.7  0.0  35:54.76 pd-server
 86961 root      20   0 10.201g 168532  50580 S   0.7  0.0  36:04.22 pd-server
 87340 root      20   0 10.196g 138876  50592 S   0.7  0.0  34:37.04 pd-server
 87462 root      20   0 10.193g 137612  50728 S   0.7  0.0  36:34.02 pd-server
 98803 root      20   0   46892   4544   3348 R   0.7  0.0   0:00.07 top
 55464 root      20   0 7864108  57880  27056 S   0.3  0.0   9:07.63 dockerd
 87624 root      20   0 10.200g 137060  50600 S   0.3  0.0  33:59.91 pd-server
     1 root      20   0  220764   9444   6668 S   0.0  0.0   0:12.05 systemd
     2 root      20   0       0      0      0 S   0.0  0.0   0:00.09 kthreadd
     4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
     7 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 mm_percpu_wq
     8 root      20   0       0      0      0 S   0.0  0.0   0:04.08 ksoftirqd/0
     9 root      20   0       0      0      0 S   0.0  0.0   3:13.50 rcu_sched
    10 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh
    11 root      rt   0       0      0      0 S   0.0  0.0   0:03.54 migration/0
    12 root      rt   0       0      0      0 S   0.0  0.0   0:00.48 watchdog/0
    13 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
    14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1


Wilfred Chau

May 15, 2018, 10:19:50 PM5/15/18
to TiDB user group
docker-compose down failed, but only for the PDs.

Stopping tidbdockercompose_tidb_1        ... done
Stopping tidbdockercompose_tikv1_1       ... done
Stopping tidbdockercompose_tikv0_1       ...
Stopping tidbdockercompose_tikv6_1       ... done
Stopping tidbdockercompose_tikv4_1       ... done
Stopping tidbdockercompose_tikv5_1       ... done
Stopping tidbdockercompose_tikv7_1       ... done
Stopping tidbdockercompose_tikv2_1       ... done
Stopping tidbdockercompose_tikv3_1       ... done
Stopping tidbdockercompose_pd1_1         ... error
Stopping tidbdockercompose_pd2_1         ... error
Stopping tidbdockercompose_pushgateway_1 ... done
Stopping tidbdockercompose_prometheus_1  ... done
Stopping tidbdockercompose_pd7_1         ... error
Stopping tidbdockercompose_pd4_1         ... error
Stopping tidbdockercompose_pd5_1         ... error
Stopping tidbdockercompose_pd0_1         ... error
Stopping tidbdockercompose_tidb-vision_1 ... done
Stopping tidbdockercompose_pd3_1         ... error
Stopping tidbdockercompose_pd6_1         ... error
Stopping tidbdockercompose_grafana_1     ... done

ERROR: for tidbdockercompose_tikv0_1  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
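
As the message suggests, one way to retry is with a larger timeout, for example (300 is just an arbitrary higher value):

  COMPOSE_HTTP_TIMEOUT=300 docker-compose down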


Wilfred Chau

May 15, 2018, 10:19:50 PM5/15/18
to TiDB user group
tikv0 errors while loading...  Giving up...

2018/05/15 20:38:51.612 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35260], error Request(message: "region is stale" stale_epoch {new_regions {id: 118273 start_key: "t\200\000\000\000\000\000\000\377[_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\000\006\003\200\377\000\000\000\000\000\000\002\003\377\200\000\000\000\000\000\000\026\377\003\200\000\000\000\000\000\001\377'\003\200\000\000\000\000\000\377\000\276\003\200\000\000\000\000\377\000\000\003\000\000\000\000\000\372" end_key: "t\200\000\000\000\000\000\000\377[_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\004\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\000\004\003\200\377\000\000\000\000\000\000\001\003\377\200\000\000\000\000\000\000\030\377\003\200\000\000\000\000\000\001\377B\003\200\000\000\000\000\000\377\000\273\003\200\000\000\000\000\377\000\000\002\000\000\000\000\000\372" region_epoch {conf_ver: 11 version: 42} peers {id: 118274 store_id: 8} peers {id: 118275 store_id: 6} peers {id: 118276 store_id: 9}} new_regions {id: 156295 start_key: "t\200\000\000\000\000\000\000\377[\000\000\000\000\000\000\000\370" end_key: "t\200\000\000\000\000\000\000\377[_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\000\006\003\200\377\000\000\000\000\000\000\002\003\377\200\000\000\000\000\000\000\026\377\003\200\000\000\000\000\000\001\377'\003\200\000\000\000\000\000\377\000\276\003\200\000\000\000\000\377\000\000\003\000\000\000\000\000\372" region_epoch {conf_ver: 11 version: 42} peers {id: 156296 store_id: 8} peers {id: 156297 store_id: 6} peers {id: 156298 store_id: 9}}})
2018/05/15 20:38:51.792 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35540], error Request(message: "peer is not leader" not_leader {region_id: 148161 leader {id: 148162 store_id: 5}})
2018/05/15 20:38:51.941 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35571], error Request(message: "peer is not leader" not_leader {region_id: 148161 leader {id: 148162 store_id: 5}})
2018/05/15 20:38:52.431 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35567], error Request(message: "peer is not leader" not_leader {region_id: 148161 leader {id: 148162 store_id: 5}})
2018/05/15 20:38:53.642 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35493], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:53.653 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35534], error Request(message: "peer is not leader" not_leader {region_id: 148161 leader {id: 148162 store_id: 5}})
2018/05/15 20:38:54.142 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35495], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:54.417 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35502], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:56.110 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[36002], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:56.269 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35492], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:56.270 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35498], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:56.270 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35552, 35520], error Request(message: "peer is not leader" not_leader {region_id: 148161 leader {id: 148162 store_id: 5}})
2018/05/15 20:38:56.610 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35320], error Request(message: "region is stale" stale_epoch {new_regions {id: 118273 start_key: "t\200\000\000\000\000\000\000\377[_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\000\006\003\200\377\000\000\000\000\000\000\002\003\377\200\000\000\000\000\000\000\026\377\003\200\000\000\000\000\000\001\377'\003\200\000\000\000\000\000\377\000\276\003\200\000\000\000\000\377\000\000\003\000\000\000\000\000\372" end_key: "t\200\000\000\000\000\000\000\377[_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\004\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\000\004\003\200\377\000\000\000\000\000\000\001\003\377\200\000\000\000\000\000\000\030\377\003\200\000\000\000\000\000\001\377B\003\200\000\000\000\000\000\377\000\273\003\200\000\000\000\000\377\000\000\002\000\000\000\000\000\372" region_epoch {conf_ver: 11 version: 42} peers {id: 118274 store_id: 8} peers {id: 118275 store_id: 6} peers {id: 118276 store_id: 9}} new_regions {id: 156295 start_key: "t\200\000\000\000\000\000\000\377[\000\000\000\000\000\000\000\370" end_key: "t\200\000\000\000\000\000\000\377[_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\000\006\003\200\377\000\000\000\000\000\000\002\003\377\200\000\000\000\000\000\000\026\377\003\200\000\000\000\000\000\001\377'\003\200\000\000\000\000\000\377\000\276\003\200\000\000\000\000\377\000\000\003\000\000\000\000\000\372" region_epoch {conf_ver: 11 version: 42} peers {id: 156296 store_id: 8} peers {id: 156297 store_id: 6} peers {id: 156298 store_id: 9}}})
2018/05/15 20:38:57.148 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35491], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:38:57.363 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[35501], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:39:02.986 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[36455, 36468], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})

2018/05/15 20:39:02.986 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[36485], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})
2018/05/15 20:39:03.045 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[36469], error Request(message: "peer is not leader" not_leader {region_id: 31727 leader {id: 48426 store_id: 5}})





On Sunday, May 13, 2018 at 7:11:20 PM UTC-7, Wilfred Chau wrote:

Wilfred Chau

May 15, 2018, 10:19:50 PM5/15/18
to TiDB user group
Here are a few transaction costs:
2018/05/15 13:32:40 db.go:95: [warning] transaction execution costs 34.346460 seconds
2018/05/15 13:32:40 db.go:95: [warning] transaction execution costs 10.723655 seconds
2018/05/15 13:32:40 db.go:95: [warning] transaction execution costs 10.888477 seconds
2018/05/15 13:32:41 db.go:95: [warning] transaction execution costs 29.332055 seconds
2018/05/15 13:32:46 db.go:95: [warning] transaction execution costs 26.324745 seconds
2018/05/15 13:32:46 db.go:95: [warning] transaction execution costs 20.335754 seconds
2018/05/15 13:32:46 db.go:95: [warning] transaction execution costs 25.545078 seconds
2018/05/15 13:32:46 db.go:95: [warning] transaction execution costs 58.994167 seconds
2018/05/15 13:32:46 db.go:95: [warning] transaction execution costs 19.322283 seconds
2018/05/15 13:32:51 db.go:95: [warning] transaction execution costs 48.830328 seconds
2018/05/15 13:32:55 db.go:95: [warning] transaction execution costs 14.718913 seconds
2018/05/15 13:32:56 db.go:95: [warning] transaction execution costs 16.553422 seconds
2018/05/15 13:32:56 db.go:95: [warning] transaction execution costs 19.461351 seconds
2018/05/15 13:32:56 db.go:95: [warning] transaction execution costs 70.581759 seconds
2018/05/15 13:32:59 db.go:95: [warning] transaction execution costs 151.895386 seconds
2018/05/15 13:33:00 db.go:95: [warning] transaction execution costs 18.982004 seconds
2018/05/15 13:33:01 db.go:95: [warning] transaction execution costs 118.188987 seconds
2018/05/15 13:33:07 db.go:95: [warning] transaction execution costs 11.568836 seconds
2018/05/15 13:33:09 db.go:95: [warning] transaction execution costs 143.582970 seconds
2018/05/15 13:33:10 db.go:95: [warning] transaction execution costs 39.078756 seconds
2018/05/15 13:33:10 db.go:95: [warning] transaction execution costs 29.676157 seconds
2018/05/15 13:33:10 db.go:95: [warning] transaction execution costs 19.458685 seconds
2018/05/15 13:33:11 db.go:95: [warning] transaction execution costs 137.748751 seconds
2018/05/15 13:33:12 db.go:95: [warning] transaction execution costs 50.951515 seconds
2018/05/15 13:33:14 db.go:95: [warning] transaction execution costs 41.998781 seconds
2018/05/15 13:33:14 db.go:95: [warning] transaction execution costs 17.896376 seconds
2018/05/15 13:33:14 db.go:95: [warning] transaction execution costs 28.082305 seconds
2018/05/15 13:33:16 db.go:95: [warning] transaction execution costs 9.119518 seconds
2018/05/15 13:33:16 db.go:95: [warning] transaction execution costs 16.307478 seconds
2018/05/15 13:33:17 db.go:95: [warning] transaction execution costs 30.642236 seconds
2018/05/15 13:33:19 db.go:95: [warning] transaction execution costs 32.840351 seconds
2018/05/15 13:33:20 db.go:95: [warning] transaction execution costs 138.293319 seconds
2018/05/15 13:33:20 db.go:95: [warning] transaction execution costs 34.087941 seconds
2018/05/15 13:33:24 db.go:95: [warning] transaction execution costs 27.349513 seconds
2018/05/15 13:33:27 db.go:95: [warning] transaction execution costs 17.936387 seconds
2018/05/15 13:33:27 db.go:95: [warning] transaction execution costs 17.522731 seconds
2018/05/15 13:33:27 db.go:95: [warning] transaction execution costs 26.740878 seconds
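
A quick way to rank these warnings by duration, assuming they sit in a log file such as tidb.log (the file name is an assumption), is something like:

  # field 8 of each line is the duration in seconds
  grep 'transaction execution costs' tidb.log | sort -rn -k8,8 | head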





On Sunday, May 13, 2018 at 7:11:20 PM UTC-7, Wilfred Chau wrote:

Wilfred Chau

May 15, 2018, 10:19:50 PM5/15/18
to TiDB user group
Pulled the latest code.  Rebuilt the entire cluster.  Reloading...

iostat
Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.00         0.00          0          0
nvme1n1           0.00         0.00         0.00          0          0
sda               0.00         0.00         0.00          0          0
sdb             516.00       512.00     53870.00       1024     107740

Device             tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme0n1           0.00         0.00         0.00          0          0
nvme1n1           0.00         0.00         0.00          0          0
sda               0.00         0.00         0.00          0          0
sdb             482.50       516.00     39492.00       1032      78984




On Sunday, May 13, 2018 at 7:11:20 PM UTC-7, Wilfred Chau wrote:

Liu Tang

May 16, 2018, 2:35:33 AM5/16/18
to Wilfred Chau, TiDB user group
Hi Wilfred

It seems that your Raftstore thread is blocked. Can you show me the metrics from 127.0.0.1:3000? In the TiKV metrics, give me the Raft IO board; I guess it has a big latency.


Wilfred Chau

May 17, 2018, 9:48:21 PM5/17/18
to TiDB user group
Attached more boards for your review.

tidb-cluster-tikv (6).png
tidb-cluster-tikv (7).png
tidb-cluster-tikv (2).png
tidb-cluster-tikv (3).png
tidb-cluster-tikv (4).png
tidb-cluster-tikv (5).png

Wilfred Chau

May 17, 2018, 9:48:21 PM5/17/18
to TiDB user group
Hi Liu,

I pulled the latest code and rebuilt the entire cluster.  It has been loading data for almost 2 days and is still going.

When I looked at the logs on tikv0, I still see lots of errors like the ones below. But loading seems to be fine so far:
2018/05/17 17:54:18.272 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[5556319], error Request(message: "region is stale" stale_epoch {new_regions {id: 148161 start_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\257W\003\377\200\000\000\000\000\000\000\001\377\003\200\000\000\000\000\000\000\377\037\003\200\000\000\000\000\000\377\001\206\003\200\000\000\000\000\377\000\000_\003\200\000\000\000\377\000\000\000\003\000\000\000\000\373" end_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\004\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\000\001\003\377\200\000\000\000\000\000\000\002\377\003\200\000\000\000\000\000\000\377\t\003\200\000\000\000\000\000\377\001\353\003\200\000\000\000\000\377\000\000l\003\200\000\000\000\377\000\000\000\001\000\000\000\000\373" region_epoch {conf_ver: 97 version: 163} peers {id: 12428697 store_id: 6} peers {id: 17107115 store_id: 5} peers {id: 17168430 store_id: 10}} new_regions {id: 19129480 start_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\257O\003\377\200\000\000\000\000\000\000\003\377\003\200\000\000\000\000\000\000\377\007\003\200\000\000\000\000\000\377\001\231\003\200\000\000\000\000\377\000\000\364\003\200\000\000\000\377\000\000\000\001\000\000\000\000\373" end_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\257W\003\377\200\000\000\000\000\000\000\001\377\003\200\000\000\000\000\000\000\377\037\003\200\000\000\000\000\000\377\001\206\003\200\000\000\000\000\377\000\000_\003\200\000\000\000\377\000\000\000\003\000\000\000\000\373" region_epoch {conf_ver: 97 version: 163} peers {id: 19129481 store_id: 6} peers {id: 19129482 store_id: 5} peers {id: 19129483 store_id: 10}}})
2018/05/17 17:54:18.306 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[5556330], error Request(message: "region is stale" stale_epoch {new_regions {id: 148161 start_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\257W\003\377\200\000\000\000\000\000\000\001\377\003\200\000\000\000\000\000\000\377\037\003\200\000\000\000\000\000\377\001\206\003\200\000\000\000\000\377\000\000_\003\200\000\000\000\377\000\000\000\003\000\000\000\000\373" end_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\004\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\000\001\003\377\200\000\000\000\000\000\000\002\377\003\200\000\000\000\000\000\000\377\t\003\200\000\000\000\000\000\377\001\353\003\200\000\000\000\000\377\000\000l\003\200\000\000\000\377\000\000\000\001\000\000\000\000\373" region_epoch {conf_ver: 97 version: 163} peers {id: 12428697 store_id: 6} peers {id: 17107115 store_id: 5} peers {id: 17168430 store_id: 10}} new_regions {id: 19129480 start_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\257O\003\377\200\000\000\000\000\000\000\003\377\003\200\000\000\000\000\000\000\377\007\003\200\000\000\000\000\000\377\001\231\003\200\000\000\000\000\377\000\000\364\003\200\000\000\000\377\000\000\000\001\000\000\000\000\373" end_key: "t\200\000\000\000\000\000\000\377g_i\200\000\000\000\000\377\000\000\001\003\200\000\000\000\377\000\000\000\003\003\200\000\000\377\000\000\000\007\306\003\200\000\377\000\000\000\000\007\325\003\200\377\000\000\000\000\000\257W\003\377\200\000\000\000\000\000\000\001\377\003\200\000\000\000\000\000\000\377\037\003\200\000\000\000\000\000\377\001\206\003\200\000\000\000\000\377\000\000_\003\200\000\000\000\377\000\000\000\003\000\000\000\000\373" region_epoch {conf_ver: 97 version: 163} peers {id: 19129481 store_id: 6} peers {id: 19129482 store_id: 5} peers {id: 19129483 store_id: 10}}})
2018/05/17 17:55:22.837 scheduler.rs:1278: [ERROR] get snapshot failed for cids=[5558053], error Request(message: "peer is not leader" not_leader {region_id: 71902 leader {id: 12253610 store_id: 1}})

Here are the Raft IO boards:




On Tuesday, May 15, 2018 at 11:35:33 PM UTC-7, tl wrote:
tidb-cluster-tikv (1).png
tidb-cluster-tikv.png

Peng Qu

May 18, 2018, 12:22:05 AM5/18/18
to TiDB user group
 Hi Chau,
  Are you still trying to load your data? Which version of Loader are you using? The latest Loader supports resuming the load from the breakpoint, and you can see the breakpoint in the system table tidb_loader.checkpoint.
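
  For example, something like this shows the checkpoint (the host/port assume the default docker-compose TiDB endpoint):

  mysql -h 127.0.0.1 -P 4000 -u root -e 'SELECT * FROM tidb_loader.checkpoint'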

  It seems you got many `get snapshot failed` error messages. This is because region leaders are being transferred to other peers. TiDB will retry, so you can ignore them.

  Finally, it seems your TiKV is very busy. Could you restart your cluster and the loader, and show me the metrics about TiKV Raft?
On Friday, May 18, 2018 at 9:48:21 AM UTC+8, Wilfred Chau wrote:

Wilfred Chau

May 19, 2018, 3:09:03 AM5/19/18
to TiDB user group
Hi Peng

   Yes, I rebuilt the cluster and it has been loading since Tuesday.  Still going.  It has not failed yet.

   Here is the loader version I am using:
   
loader: v1.0.0~rc2+git.a0e76c4
Git Commit Hash: a0e76c4aaf67ff1be138fd97dda0fb2b45be1873
UTC Build Time: 2018-04-02 08:16:18
Go Version: go1.10

   I am restoring approx. 520 GB of uncompressed mydumper SQL files.

   Last time it failed, it said it failed to commit to tidb_loader.checkpoint.  Does that mean I cannot resume from where it stopped?

Thanks

Peng Qu

May 19, 2018, 4:29:25 AM5/19/18
to TiDB user group
Hi Chau, Loader imports data and updates the checkpoint table in one transaction, so it's OK to kill or restart Loader at any time. It will resume from where it stopped.

On Saturday, May 19, 2018 at 3:09:03 PM UTC+8, Wilfred Chau wrote: