I previously worked on nginx QoS analysis and on a QoS computation program for CDN logs, which involved QoS calculation over both real-time and offline nginx logs. Developers and operations engineers alike need to watch the QoS of the domains they own: it shows how the service is running, how much capacity is left and whether scaling out is needed, and whether an application change has affected user access.
The most common case is analysis of nginx status codes: per-domain request volume, availability (the share of 2xx+3xx responses), 4xx and 5xx ratios, average response time, average file size, backend response time (upstream time), and so on.
Sometimes per-URL analysis is also needed: the most-requested URLs, the slowest URLs, the URLs with the highest error rate, plus period-over-period comparisons, so that a problem can be pinned down quickly. On top of that you can add finer-grained data, such as the distribution of users and the QoS seen by users in each region.
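The per-domain metrics above boil down to a few counters over parsed log records. A minimal sketch in python; the record fields (`status`, `request_time`) and the sample data are assumptions, not a real log schema:

```python
# A minimal sketch of per-domain QoS metrics from parsed nginx records.
# Field names (status, request_time) are assumed, not a real schema.
def domain_qos(records):
    total = len(records)
    ok = sum(1 for r in records if r["status"][0] in "23")    # 2xx + 3xx
    err4 = sum(1 for r in records if r["status"][0] == "4")
    err5 = sum(1 for r in records if r["status"][0] == "5")
    avg_rt = sum(r["request_time"] for r in records) / total
    return {
        "availability": round(ok * 100.0 / total, 4),  # 2xx+3xx share, in %
        "4xx_ratio": round(err4 * 100.0 / total, 4),
        "5xx_ratio": round(err5 * 100.0 / total, 4),
        "avg_rt": round(avg_rt, 4),                    # seconds
    }

records = [
    {"status": "200", "request_time": 0.05},
    {"status": "302", "request_time": 0.01},
    {"status": "404", "request_time": 0.02},
    {"status": "500", "request_time": 1.20},
]
print(domain_qos(records))
```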
1. Offline QoS is computed by importing nginx logs into hadoop and running queries in hive, for example the QoS analysis of CDN logs and of the nginx origin sites.
Here are a few SQL statements we use frequently, for reference:
Response-time distribution per province:
select ip_province,
       round(SUM(IF(response<=10 AND STATUS='200',1,0))*100/COUNT(1),4) as less10ms_ratio,
       round(SUM(IF(response<=100 AND STATUS='200',1,0))*100/COUNT(1),4) as less100ms_ratio,
       round(SUM(IF(response<=1000 AND STATUS='200',1,0))*100/COUNT(1),4) as less1000ms_ratio,
       round(avg(response)/1000,4) as avg_rt,
       round(sum(size)*8/(1000*(sum(response)/1000)),2) as svg_speed
FROM %s
where dt = 'xxx' and domain = 'xxxx' and status='200' and hour in ('09','10','11')
group by ip_province
order by avg_rt desc
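Outside of hive, the same bucketing can be sketched in a few lines of python. The `(province, response_ms)` tuples below are made-up samples, and the thresholds mirror the query's 10/100/1000 ms cut-offs:

```python
from collections import defaultdict

# Per-province latency buckets, mirroring the Hive query above:
# share of requests under each millisecond threshold plus avg_rt in seconds.
def province_buckets(rows, thresholds=(10, 100, 1000)):
    by_prov = defaultdict(list)
    for prov, resp_ms in rows:
        by_prov[prov].append(resp_ms)
    out = {}
    for prov, resps in by_prov.items():
        n = len(resps)
        out[prov] = {
            "less%dms_ratio" % t: round(sum(1 for r in resps if r <= t) * 100.0 / n, 4)
            for t in thresholds
        }
        out[prov]["avg_rt"] = round(sum(resps) / (n * 1000.0), 4)  # ms -> s
    return out

rows = [("guangdong", 8), ("guangdong", 95), ("guangdong", 1500), ("beijing", 5)]
print(province_buckets(rows))
```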
URLs and referers with the most 404s:
SELECT regexp_extract(request,'(.*?) (.*?) (.*?)',2), http_referer, COUNT(1) AS COUNT_ALL
FROM viplog.dw_nginx_log
WHERE dt='xxx' and host='xxx' and status='404'
GROUP BY regexp_extract(request,'(.*?) (.*?) (.*?)',2), http_referer
ORDER BY COUNT_ALL DESC
limit 20
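The same top-N idea can be sketched with python's `Counter`. The request line is assumed to be the usual nginx `$request` format `"METHOD url HTTP/x.y"`, and the sample tuples are invented:

```python
import re
from collections import Counter

# Count (url, referer) pairs with the most 404s, same idea as the Hive query.
def top_404(entries, limit=20):
    counts = Counter()
    for request, referer, status in entries:
        if status != "404":
            continue
        m = re.match(r'\S+ (\S+) \S+', request)  # "GET /path HTTP/1.1" -> /path
        if m:
            counts[(m.group(1), referer)] += 1
    return counts.most_common(limit)

entries = [
    ("GET /a.jpg HTTP/1.1", "http://example.com/", "404"),
    ("GET /a.jpg HTTP/1.1", "http://example.com/", "404"),
    ("GET /b HTTP/1.1", "-", "404"),
    ("GET /ok HTTP/1.1", "-", "200"),
]
print(top_404(entries))
```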
Breakdown of each http code's share for a domain:
select a.status as code, a.count_all, round((a.count_all/b.total)*100,4) as ratio
from (select status, count(1) as count_all
      from viplog.dw_nginx_log
      where dt='xxx' and host='xxx'
      group by status) a
join (select count(1) as total
      from viplog.dw_nginx_log
      where host='%s' and dt='xxx') b
order by ratio desc
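The ratio computation that the self-join performs can be sketched without sql at all; `statuses` here stands for an assumed flat list of status codes for one host and day:

```python
from collections import Counter

# Share of each http code over all requests, mirroring the self-join above.
def code_ratio(statuses):
    total = len(statuses)
    counts = Counter(statuses)
    return sorted(
        ((code, n, round(n * 100.0 / total, 4)) for code, n in counts.items()),
        key=lambda x: x[2], reverse=True,
    )

print(code_ratio(["200", "200", "200", "404", "500"]))
```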
QoS analysis per CDN vendor:
select vendor, domain, count_all,
       round((2xx_3xx_count/count_all)*100,4) as availability,
       avg_response_time,
       round((more_than1s_count/count_all)*100,4) as more_than1s_per,
       round((404_count/count_all)*100,4) as 404_ratio
from cdn_qos
where date='%s' and count_all > 10000
order by count_all desc
The slowest URLs:
SELECT * FROM (
    SELECT concat(host,regexp_extract(request,'.+? +(.+?)(?:\\\?| )+.*',1)),
           COUNT(1) AS COUNT_ALL, round(avg(request_time),2) AS avg_rt
    FROM viplog.dw_nginx_log
    WHERE dt='%s' and host='%s'
      and (status rlike '^2.*' or status rlike '^3.*')
      and substr(time_local,14,5) >= '09:00' and substr(time_local,14,5) <= '21:00'
    GROUP BY concat(host,regexp_extract(request,'.+? +(.+?)(?:\\\?| )+.*',1))
) a
where a.COUNT_ALL > 10 and a.avg_rt > 0.01
ORDER BY a.avg_rt DESC
limit 10
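For intuition, the same ranking logic in python, keeping the query's guards (2xx/3xx only, minimum request count, minimum average response time to drop noise); the thresholds match the query and the sample rows are invented:

```python
from collections import defaultdict

# Slowest URLs by average request_time, with the query's guards kept:
# only 2xx/3xx requests, and minimum-count / minimum-average thresholds.
def slowest_urls(rows, min_count=10, min_avg=0.01, limit=10):
    stats = defaultdict(lambda: [0, 0.0])  # url -> [count, total_time]
    for url, status, rt in rows:
        if status[0] in "23":
            stats[url][0] += 1
            stats[url][1] += rt
    ranked = [
        (url, n, round(total / n, 2))
        for url, (n, total) in stats.items()
        if n > min_count and total / n > min_avg
    ]
    return sorted(ranked, key=lambda x: x[2], reverse=True)[:limit]

rows = ([("/slow", "200", 0.5)] * 11
        + [("/fast", "200", 0.02)] * 20
        + [("/err", "500", 9.9)] * 50)   # 5xx rows are ignored entirely
print(slowest_urls(rows))
```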
Per-API QoS analysis for one application:
select a.service, a.api, a.cnt, b.avaible, avg_rt, more1s_ratio, 404_ratio
from (
    select regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) as service,
           regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) as api,
           count(1) as cnt
    from viplog.dw_nginx_log
    where dt='%s' and host='xxxx'
      and regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) != ''
      and regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) != ''
    group by regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2),
             regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2)
) a
join (
    select service, api, cnt, round(2xx_3xx_count*100/cnt,4) as avaible, avg_rt,
           round(more_than1s_count*100/cnt,4) as more1s_ratio,
           round(404_count*100/cnt,4) as 404_ratio
    FROM (
        select regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) as service,
               regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) as api,
               count(1) as cnt,
               (SUM(IF(SUBSTR(STATUS,0,1)='2',1,0)) + SUM(IF(SUBSTR(STATUS,0,1)='3',1,0))) as 2xx_3xx_count,
               ROUND(AVG(request_time),4) AS avg_rt,
               SUM(IF(request_time>1,1,0)) as more_than1s_count,
               SUM(IF(STATUS='404',1,0)) AS 404_count
        FROM viplog.dw_nginx_log
        where dt='%s' and host='xxxx'
          and regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) != ''
          and regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) != ''
        group by regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2),
                 regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2)
        order by cnt desc
    ) a
) b on a.service=b.service and a.api=b.api and a.cnt > 10000
order by a.cnt desc
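Instead of repeated `regexp_extract` calls per row, the `service` and `api_key` parameters can be pulled out with a real query-string parser; a sketch with python's standard library, assuming the usual `"METHOD url HTTP/x.y"` request line (the sample URL and parameter values are invented):

```python
from urllib.parse import urlsplit, parse_qs

# Pull the service and api_key query parameters out of an nginx $request
# line, the same values the regexp_extract calls in the query target.
def service_api(request):
    url = request.split(" ")[1]              # "METHOD url HTTP/x.y" -> url
    qs = parse_qs(urlsplit(url).query)       # query string -> {key: [values]}
    return qs.get("service", [""])[0], qs.get("api_key", [""])[0]

print(service_api("GET /gw?service=order.list&api_key=detail&ver=1 HTTP/1.1"))
```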
2. Real-time computation currently uses a combination of storm and impala.
At the moment the impala release with stable enough performance for us is 1.1.1, which supports neither serde nor udf, so analyzing nginx logs with it directly is not very convenient.
storm can handle that kind of udf-style computation, but it is not particularly easy to use either; the two work best in combination.
Since I am not very familiar with storm, I will skip it here and simply paste the sql we use in impala to compute QoS:
(Because serde is not supported, the nginx log fields have to be parsed out of the raw line.)
select host, concat(substr(lt,1,18),'01') as st,
       parse_url(concat('http://',host,regexp_extract(request,'([^ ]*) ([^ ]*)',2)),'PATH') as url,
       sum(cast(case when status like '2%' or status like '3%' then body_bytes_sent else '0' end as int)) as 2xx3xx_body_size,
       sum(cast(case when status like '4%' then body_bytes_sent else '0' end as int)) as 4xx_body_size,
       sum(cast(case when status like '5%' then body_bytes_sent else '0' end as int)) as 5xx_body_size,
       sum(cast(body_bytes_sent as int)) as all_body_size,
       sum(cast(case when status like '2%' or status like '3%' then request_time else '0' end as int)) as 2xx3xx_response_time,
       sum(cast(case when status like '4%' then request_time else '0' end as int)) as 4xx_response_time,
       sum(cast(case when status like '5%' then request_time else '0' end as int)) as 5xx_response_time,
       sum(cast(request_time as int)) as all_response_time,
       sum(cast(case when status like '2%' or status like '3%' then 1 else 0 end as int)) as 2xx3xx_count,
       sum(cast(case when status like '4%' then 1 else 0 end as int)) as 4xx_count,
       sum(cast(case when status like '5%' then 1 else 0 end as int)) as 5xx_count,
       sum(1) as all_count
from (
    select regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 3) as lt,
           regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 4) as request,
           regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 11) as host,
           regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 6) as status,
           trim(regexp_replace(regexp_replace(regexp_replace(regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 5),'\\"',''),'\\.',''),'^0*','')) as request_time,
           regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 7) as body_bytes_sent
    from ods_nginx_log_5min_impala
) tmp
where host != '' and length(host) <= 40
  and (request != '-' or status != '400')
  and lower(request) not like '%\.jpg%' and lower(request) not like '%\.ico%'
  and lower(request) not like '%\.gif%' and lower(request) not like '%\.swf%'
  and lower(request) not like '%\.txt%' and lower(request) not like '%\.html%'
  and lower(request) not like '%\.js%' and lower(request) not like '%\.css%'
  and lower(request) not like '%\.png%'
group by host, st, parse_url(concat('http://',host,regexp_extract(request,'([^ ]*) ([^ ]*)',2)),'PATH');