"Restoring" my online service ahead of schedule during an AWS network outage

Right, a disclaimer first: this method is not generally applicable; in fact, whether it works is almost entirely a matter of luck. But something this uncanny still felt worth writing down, hence this post.

22:08 Server A suddenly went offline; its assigned Elastic IP was unreachable.
22:15 Noticed that another AWS instance of mine in the same data center (B) was still online, so from B I tried A's AWS private IP: it responded!
22:20 Opened an ssh -D SOCKS tunnel through B and used tsocks to ssh into A via its AWS private IP (first sketch below the timeline). The server itself was in good shape; a packet capture confirmed packets were leaving via the Elastic IP normally but replies never came back. So the fault was on AWS's side.
22:23 Logged into Cloudflare and repointed the service's DNS record at B (second sketch below). (The "Automatic" TTL turns out to be only 30 seconds; big thumbs up.)
22:32 AWS announced on its status page: "We are investigating network connectivity issues for instances in the US-EAST-1 Region."
22:35 Set up nginx as a reverse proxy on B and copied my local backup of A's SSL certificate over to B (third sketch below; the full config is at the end of the post). Service back online.
23:02 A's original Elastic IP became reachable again.
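For the curious, the 22:20 tunnel looked roughly like this. A sketch from memory; the port, user names, and addresses are placeholders, and it assumes tsocks is pointed at the local SOCKS port:

# On my workstation: open a dynamic SOCKS proxy through B, which is still up.
ssh -D 1080 user@b.public.ip

# In another shell: reach A's private IP through that proxy
# (assumes /etc/tsocks.conf has server = 127.0.0.1, server_port = 1080).
tsocks ssh user@10.xx.xx.xx

# On A: watch the public-facing interface; packets go out via the
# Elastic IP, replies never come back.
tcpdump -ni eth0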
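The 22:23 DNS flip was done by hand in the Cloudflare dashboard, but the same record change can be scripted. A sketch against the current Cloudflare v4 API, where the zone ID, record ID, token, and B's address are all placeholders (a ttl of 1 means "automatic"):

# Point the A record at B; placeholders throughout.
curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data '{"type":"A","name":"my-service.com","content":"<B public IP>","ttl":1}'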
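And the 22:35 certificate copy, more or less; the file names match the config at the end of the post, everything else is a placeholder:

# Push the locally backed-up certificate bundle and key over to B,
# then sanity-check and load the new vhost there.
scp my-service.com.verified.bundle.crt my-service.com.verified.key \
    user@b.public.ip:/etc/nginx/
ssh user@b.public.ip 'nginx -t && nginx -s reload'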

Well, time to hack together some automation around this reverse-proxy trick; it can come in handy when you least expect it. This round my downtime was barely half of AWS's (roughly 27 minutes versus 54), and if I had stood the proxy up faster there was room to do even better; a rough failover sketch follows.
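As a first cut, the failover could be a dumb probe run from cron; everything below is hypothetical, including the flip-dns-to-b.sh helper, which would just wrap the Cloudflare call sketched earlier:

#!/bin/sh
# Probe A's Elastic IP; after two failed checks 10 seconds apart,
# repoint DNS at B, where the standby reverse proxy already runs.
PRIMARY="a.elastic.ip"    # placeholder
probe() { curl -fsS --max-time 5 "http://$PRIMARY/" > /dev/null 2>&1; }

if ! probe; then
    sleep 10
    probe || /usr/local/bin/flip-dns-to-b.sh    # hypothetical wrapper
fi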


————- Update (23:32) ————-
23:07 A's Elastic IP went offline again; AWS posted on its status page: "We can confirm network connectivity issues affecting instances in the US-EAST-1 Region. We are also experiencing increased error rates and latencies for the EBS APIs and increased error rates for EBS-backed instance launches."
23:24 “We are experiencing network connectivity issues affecting instances in a single Availability Zone in the US-EAST-1 Region. We are also experiencing increased error rates and latencies for the EBS APIs and increased error rates for EBS-backed instance launches.”
And as of this edit, A's Elastic IP is still down while the reverse proxy soldiers on~~

————- Update (9-14 8:50) ————-
0:40 “We identified the cause of the connectivity issues and remediated the issue. Network connectivity has recovered for the vast majority of instances which were impacted in the affected Availability Zone. We are continuing to validate that all instances are operating normally. We are also working to resolve the elevated EC2 API latencies.”
0:46 “Between 22:04 and 23:54 we experienced network connectivity issues affecting a portion of the instances in a single Availability Zone in the US-EAST-1 Region. Impacted instances were unreachable via public IP addresses, but were able to communicate to other instances in the same Availability Zone using private IP addresses. Impacted instances may have also had difficulty reaching other AWS services. We identified the root cause of the issue and full network connectivity had been restored. We are continuing to work to resolve elevated EC2 API latencies.”
1:04 “We have resolved the issue causing increased EC2 API error rates and latencies in the US-EAST-1 region. The issue has been resolved and the service is operating normally.”

Appendix: the reverse-proxy config (thrown-together-in-a-hurry edition):

server {
    listen 80;
    server_name my-service.com www.my-service.com;

    location / {
        # Forward everything to A over the AWS private network,
        # which stayed reachable throughout the outage.
        proxy_pass http://10.xx.xx.xx/;
        proxy_redirect default;
        proxy_set_header Host my-service.com;
        proxy_connect_timeout 300;
    }
}

server {
    # TLS terminates here on B with A's certificate; the hop to A
    # is plain HTTP over the private network.
    listen 443 ssl;
    server_name my-service.com www.my-service.com;

    ssl_certificate /etc/nginx/my-service.com.verified.bundle.crt;
    ssl_certificate_key /etc/nginx/my-service.com.verified.key;

    location / {
        proxy_pass http://10.xx.xx.xx/;
        proxy_redirect default;
        proxy_set_header Host my-service.com;
        proxy_connect_timeout 300;
    }
}
