<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Breakdown on Jimmy's Blog</title><link>https://blog.jimersylee.com/tags/breakdown/</link><description>Recent content in Breakdown on Jimmy's Blog</description><generator>Hugo</generator><language>zh-cn</language><lastBuildDate>Mon, 31 Oct 2022 15:34:00 +0800</lastBuildDate><atom:link href="https://blog.jimersylee.com/tags/breakdown/index.xml" rel="self" type="application/rss+xml"/><item><title>Postmortem of a websocket Service Outage</title><link>https://blog.jimersylee.com/posts/%E4%B8%80%E6%AC%A1websocket%E6%9C%8D%E5%8A%A1%E6%95%85%E9%9A%9C%E5%A4%8D%E7%9B%98/</link><pubDate>Mon, 31 Oct 2022 15:34:00 +0800</pubDate><guid>https://blog.jimersylee.com/posts/%E4%B8%80%E6%AC%A1websocket%E6%9C%8D%E5%8A%A1%E6%95%85%E9%9A%9C%E5%A4%8D%E7%9B%98/</guid><description>&lt;h2 id="故障现象"&gt;Symptoms&lt;/h2&gt;
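&lt;p&gt;We first received an alert, and then the operations team reported the problem:&lt;/p&gt;
&lt;p&gt;more than 2k users had been disconnected from the websocket service.&lt;/p&gt;
&lt;h2 id="排查过程"&gt;Investigation&lt;/h2&gt;
&lt;p&gt;The overall cluster load was normal.&lt;/p&gt;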
&lt;p&gt;&lt;a href="https://arms.console.aliyun.com/apm?pid=fl0zuy81v2%40f9d1ca21c1bad6a&amp;amp;regionId=cn-shanghai#/callChain/ea1ae6018c16637637393566236d0001/1663763439356/1663764039356/fl0zuy81v2@f9d1ca21c1bad6a?&amp;amp;page=apps"&gt;https://arms.console.aliyun.com/apm?pid=fl0zuy81v2%40f9d1ca21c1bad6a&amp;amp;regionId=cn-shanghai#/callChain/ea1ae6018c16637637393566236d0001/1663763439356/1663764039356/fl0zuy81v2@f9d1ca21c1bad6a?&amp;amp;page=apps&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;9-21 21:09:39&lt;/p&gt;
&lt;p&gt;mongodb?&lt;/p&gt;
&lt;p&gt;9-21 21:10:20&lt;/p&gt;
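&lt;p&gt;The health check timeout is 3 seconds, but this request took 3.6 seconds. It looks like the MongoDB check was the slow part; let me look into what happened with mongo.&lt;/p&gt;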
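&lt;p&gt;To make the timing concrete, here is a minimal Python sketch of a health check that probes a dependency under a fixed deadline. The function names, the simulated probe, and the scaled-down durations are assumptions made for illustration, not the service's actual code; only the 3-second timeout and the 3.6-second request time come from the discussion above.&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

HEALTH_TIMEOUT_S = 3.0  # from the discussion: health check timeout
SLOW_PROBE_S = 3.6      # from the discussion: observed request time
SCALE = 0.1             # assumption: shrink durations so the demo runs quickly


def probe_dependency(duration_s):
    """Hypothetical stand-in for a MongoDB ping; it only sleeps."""
    time.sleep(duration_s * SCALE)
    return "ok"


def health_check(probe_duration_s, timeout_s=HEALTH_TIMEOUT_S):
    """Return 'healthy' if the probe finishes within the timeout, else 'timeout'."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(probe_dependency, probe_duration_s)
        try:
            future.result(timeout=timeout_s * SCALE)
            return "healthy"
        except FutureTimeout:
            return "timeout"


if __name__ == "__main__":
    print("slow probe:", health_check(SLOW_PROBE_S))
    print("fast probe:", health_check(0.5))
```

&lt;p&gt;With these assumed numbers, a probe that needs 3.6 seconds against a 3-second deadline is reported as a timeout even though the instance itself is still running, which matches what the trace showed.&lt;/p&gt;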
&lt;p&gt;9-21 21:14:29&lt;/p&gt;
&lt;p&gt;mongo CPU was at 100% at that time.&lt;/p&gt;
&lt;p&gt;I went through the fine-grained monitoring: between 20:35:30 and 20:35:40, in just 10 seconds, mongo CPU went from 40% to 100%. I don't know what happened.&lt;/p&gt;
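&lt;p&gt;These are the metrics that look abnormal.&lt;/p&gt;
&lt;p&gt;Judging by the order of the logs, users were not dropped because the health check failed; the disconnects came first and the health check failed afterwards.&lt;/p&gt;
&lt;p&gt;20:35:35 battle-websocket starts logging a large number of disconnects&lt;/p&gt;
&lt;p&gt;20:35:35-20:35:40 the mongo read/write queue jumps from its usual 0 to 117&lt;/p&gt;
&lt;p&gt;20:35:40 mongodb CPU reaches 100%&lt;/p&gt;
&lt;p&gt;20:35:42 the health check times out. The check was sent at around 20:35:38 and timed out at 20:35:42; because mongodb CPU was at 100%, the mongo check took 3.6 seconds.&lt;/p&gt;
&lt;p&gt;The websocket instance had no full GC during this period.&lt;/p&gt;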
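&lt;p&gt;The ordering argument can be checked mechanically. The following Python sketch sorts the events listed above and measures the gap between the first disconnect logs and the health check timeout. The timestamps are taken from the timeline; the event list layout and the helper names are assumptions for illustration.&lt;/p&gt;

```python
from datetime import datetime

# Events copied from the timeline above, deliberately listed out of order.
EVENTS = [
    ("20:35:42", "health check times out"),
    ("20:35:35", "battle-websocket starts logging mass disconnects"),
    ("20:35:40", "mongodb CPU reaches 100%"),
    ("20:35:38", "health check request sent (approximate)"),
]


def parse(hms):
    """Parse an HH:MM:SS stamp; the date is irrelevant for same-day gaps."""
    return datetime.strptime(hms, "%H:%M:%S")


def ordered(events):
    """Return the events sorted chronologically."""
    return sorted(events, key=lambda event: parse(event[0]))


def seconds_between(first, second):
    """Seconds elapsed from the first stamp to the second stamp."""
    return (parse(second) - parse(first)).total_seconds()


timeline = ordered(EVENTS)
gap = seconds_between("20:35:35", "20:35:42")

if __name__ == "__main__":
    for stamp, what in timeline:
        print(stamp, what)
    print("disconnects precede the health check timeout by", gap, "seconds")
```

&lt;p&gt;Sorted this way, the disconnect logs come first and the health check timeout comes last, so the failed health check cannot be what triggered the disconnects.&lt;/p&gt;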
&lt;p&gt;沈文俊 9-21 21:53:14&lt;/p&gt;
&lt;p&gt;So the mongo CPU spike was caused by the large number of reconnecting ws connections, right?&lt;/p&gt;
&lt;p&gt;9-21 21:55:10&lt;/p&gt;
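&lt;p&gt;Yes, that is what it looks like.&lt;/p&gt;
&lt;p&gt;So the question becomes why this websocket instance suddenly dropped so many users. I checked and there was no abnormal GC, and all of the disconnected users came from the same instance.&lt;/p&gt;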
&lt;p&gt;9-21 21:58:27&lt;/p&gt;
&lt;p&gt;There are no notable error logs.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;and &lt;strong&gt;topic&lt;/strong&gt;: battle-websocket and &lt;strong&gt;tag&lt;/strong&gt;:&lt;strong&gt;client_ip&lt;/strong&gt;: &amp;ldquo;106.14.32.124&amp;rdquo; and error&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Could there be a network problem on the node server that hosts that container?&lt;/p&gt;
&lt;p&gt;&lt;a href="https://cs.console.aliyun.com/?spm=5176.12818093.ProductAndResource--ali--widget-product-recent.dre0.3be916d09xleyZ#/next/clusters/c5ee2ef6132674f438dfca921ee9345b5/eventcenter?ns=kube-system"&gt;https://cs.console.aliyun.com/?spm=5176.12818093.ProductAndResource--ali--widget-product-recent.dre0.3be916d09xleyZ#/next/clusters/c5ee2ef6132674f438dfca921ee9345b5/eventcenter?ns=kube-system&lt;/a&gt;&lt;/p&gt;
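&lt;p&gt;The k8s events show that a reload happened, but the timing does not match.&lt;/p&gt;
&lt;p&gt;We confirmed with ops that no maintenance was performed last night, but the cluster did record an nginx reload. What we know so far is that a manually triggered nginx reload has little impact on HTTP services, whereas services that rely on long-lived TCP connections do get disconnected.&lt;/p&gt;</description></item></channel></rss>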