之前做了部分nginx qos分析和cdn日志qos分析计算程序开发的工作,涉及到对实时的和离线的nginx 日志的qos计算。不管是研发也好,运维也好,都需要关心自己负责的domain的qos情况,以此来判断业务的运行情况,容量情况,是否需要扩容,以及验证应用的变更是否影响用户访问等等。

 最常见的比如对nginx code的分析,比如计算域名的访问量,可用性(2xx+3xx的比例),4xx,5xx的比例,平均响应时间,平均文件大小,后端响应时间(upstream time)等等。
 有时候还需要对应用的url调用情况进行分析,比如访问量最多的url,速度最慢的url,错误率最高的url,访问情况的同比和环比,这样就可以快速的定位到某一点的问题。还可以增加一些更加详细的数据,比如用户的分布,各个区域用户的qos情况,等等。

1.离线的qos计算是通过把nginx日志导入hadoop,然后通过hive进行计算,比如对cdn日志的qos分析,对nginx源站的分析等等。
这里列几个常用的sql,大家可以参考下:

比如对各个省份响应时间分布的
1
2
3
4
5
6
7
""
"select ip_province,round(SUM(IF(response<=10 AND STATUS='200',1,0))*100/COUNT(1),4) as less10ms_ratio,
round(SUM(IF(response<=100 AND STATUS='200',1,0))*100/COUNT(1),4) as
less100ms_ratio,
round(SUM(IF(response<=1000 AND STATUS='200',1,0))*100/COUNT(1),4) as less1000ms_ratio,
round(avg(response)/1000,4) as avg_rt,round(sum(size)*8/(1000*(sum(response)/1000)),2) as svg_speed
FROM %s where dt = 'xxx' and domain = 'xxxx' and status='200' and hour in ('09','10','11')
group by ip_province order by avg_rt desc "
""
1
 
分析404最多的url,refer
1
2
3
4
SELECT regexp_extract(request,'(.*?) (.*?) (.*?)',2),http_referer,COUNT(1)
AS COUNT_ALL FROM viplog.dw_nginx_log WHERE dt='xxx' and host='xxx' and
status='404' GROUP BY regexp_extract(request,'(.*?) (.*?) (.*?)',2),
http_referer ORDER BY COUNT_ALL DESC limit 20
对域名各个http code的比例的分析:
1
2
3
4
5
"""select a.status as code,a.count_all,round((a.count_all/b.total)*100,4)
as ratio from (select  status,count(1) as count_all from viplog.dw_nginx_log
where dt='xxx' and host='xxx' group by status ) a join
(select count(1) as total from viplog.dw_nginx_log where host='%s'
and dt='xxx') b order by ratio desc """
对cdn厂商qos分析:
1
2
3
4
5
"""select vendor,domain,count_all,round((2xx_3xx_count/count_all)*100,4)
as availability,avg_response_time,round((more_than1s_count/count_all)*100,4)
as more_than1s_per,round((404_count/count_all)*100,4) as 404_ratio from
cdn_qos  where date='%s' and count_all > 10000  order by count_all desc
   
"""
分析访问速度最慢的url:
1
2
3
4
5
6
7
8
9
10
"""SELECT *  FROM (
                                                                                                                     
SELECT concat(host,regexp_extract(request,'.+? +(.+?)(?:\\\?| )+.*',1)),
COUNT(1) AS COUNT_ALL,round(avg(request_time),2) AS avg_rt FROM
viplog.dw_nginx_log WHERE dt='%s'
and host='%s' and (status rlike '^2.*' or status rlike '^3.*')  and
substr(time_local,14,5) >= '09:00' and substr(time_local,14,5)<='21:00'
GROUP BY concat(host,regexp_extract(request,'.+? +(.+?)(?:\\\?| )+.*',1))
        
)a where  a.COUNT_ALL > 10 and a.avg_rt > 0.01  ORDER BY a.avg_rt DESC limit 10
    
"""
对某一个业务各个接口的qos分析:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
"""
select a.service,a.api,a.cnt,b.avaible,avg_rt,more1s_ratio,404_ratio from
(
select regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) as service,
regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) as api,
count(1)  as cnt from viplog.dw_nginx_log where dt='%s' and host='xxxx' and
regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) != '' and
regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) != ''
group by regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2),
regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2)
) a join
(select service,api,cnt,round(2xx_3xx_count*100/cnt,4) as avaible,avg_rt,
round(more_than1s_count*100/cnt,4) as more1s_ratio,round(404_count*100/cnt,4) as
404_ratio FROM
(select regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) as service,
regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) as api,count(1) as cnt,
(SUM(IF(SUBSTR(STATUS,0,1)='2',1,0)) + SUM(IF(SUBSTR(STATUS,0,1)='3',1,0))) as 2xx_3xx
_count,   
ROUND(AVG(request_time),4) AS avg_rt,SUM(IF(request_time>1,1,0)) as
more_than1s_count,SUM(IF(STATUS='404',1,0)) AS 404_count
FROM  viplog.dw_nginx_log where dt='%s' and host='xxxx'  and
regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2) != ''
and regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2) != ''
group by regexp_extract(request,'.+? +(.+?)service=(.+?)&(.+) .+?',2),
regexp_extract(request,'.+? +(.+?)api_key=(.+?)&(.+) .+?',2)  order by cnt desc)a)b
on a.service=b.service and a.api=b.api and a.cnt >10000 order by a.cnt desc
    
"""
2.实时计算目前使用的是storm+impala的组合。
因为现阶段,性能比较稳定的impala版本是1.1.1,暂时不支持serde和udf,因此对于nginx的分析不是很方便。
而storm可以做这种udf的计算。但是storm使用不是特别方便,两者配合使用才会有更好的效果。
因为对storm不是特别熟悉,这里就暂时不介绍了。简单的贴下我们用impala计算qos的sql:
(因为不支持serde,所以需要对nginx的日志字段进行解析)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
select host,concat(substr(lt,1,18),'01') as st,parse_url(concat('http://',host,regexp_extract(request,'([^ ]*) ([^ ]*)',2)),'PATH') as url,
sum(cast( case when status like '2%' or status like '3%' then body_bytes_sent else '0' end as int)) as  2xx3xx_body_size,
sum(cast( case when status like '4%' then body_bytes_sent else '0' end as int)) as  4xx_body_size,
sum(cast( case when status like '5%' then body_bytes_sent else '0' end as int)) as  5xx_body_size,
sum(cast(body_bytes_sent as int)) as all_body_size ,
sum(cast( case when status like '2%' or status like '3%' then request_time else '0' end as int)) as  2xx3xx_response_time,
sum(cast( case when status like '4%' then request_time else '0' end as int)) as  4xx_response_time,
sum(cast( case when status like '5%' then request_time else '0' end as int)) as  5xx_response_time,
sum(cast(request_time as int)) as all_response_time ,
sum(cast( case when status like '2%' or status like '3%' then 1 else 0 end as int)) as  2xx3xx_count,
sum(cast( case when status like '4%' then 1 else 0 end as int)) as  4xx_count,
sum(cast( case when status like '5%' then 1 else 0 end as int)) as  5xx_count,
sum(1) as all_count
from (
select regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 3) as lt,regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 4) as request,
regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 11) as    host,regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 6)   as status,trim(regexp_replace(regexp_replace(regexp_replace(regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 5),'\\"',''),'\\.',''),'^0*','')) as request_time,regexp_extract(line,'([^ ]*) - ([^ ]*) \\\[(.*)\\\] "([^"]*)" ("[^"]*" )?(-|[0-9]*) (-|[0-9]*) "([^"]*)" "([^"]*)" (.*) ([^ ]*com) ([^ ]*)($| ([^ ]*))', 7) as body_bytes_sent
from ods_nginx_log_5min_impala)tmp where host !='' and length(host)<=40 and (request !='-' or status !='400') and lower(request) not like '%\.jpg%' and lower(request) not like '%\.ico%' and lower(request) not like '%\.gif%' and lower(request) not like '%\.swf%' and lower(request) not like '%\.txt%' and lower(request) not like '%\.html%' and  lower(request) not like '%\.js%' and lower(request) not like '%\.css%' and lower(request) not like '%\.png%' group by host,st,parse_url(concat('http://',host,regexp_extract(request,'([^ ]*) ([^ ]*)',2)),'PATH');