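The following uses Pig to compute a very simple statistic:
for one nginx access log of a website, find which IPs accessed the site frequently within one day.
I previously did a similar count with awk; see the earlier article for details.
First, the nginx log format looks like this: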
121.42.0.88 - - [10/May/2016:03:23:04 +0800] "GET /index.html HTTP/1.1" 500 594 "http://img.zuobin.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;Alibaba.Security.Heimdall.1346402)"
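To see which piece of this line ends up in which column when it is split on single spaces (the delimiter passed to PigStorage later in this post), here is a minimal shell sketch. The function name `show_fields` and the variable `sample` are made up for illustration; `awk -F'[ ]'` is used so that every single space counts as a separator.

```shell
#!/bin/sh
# Sample line copied from the log excerpt above.
sample='121.42.0.88 - - [10/May/2016:03:23:04 +0800] "GET /index.html HTTP/1.1" 500 594 "http://img.zuobin.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;Alibaba.Security.Heimdall.1346402)"'

# show_fields LINE
# Print the first nine space-separated fields of LINE, one per line,
# prefixed with the field index. (Hypothetical helper, not part of Pig.)
show_fields() {
    printf '%s\n' "$1" | awk -F'[ ]' '{ for (i = 1; i <= 9 && i <= NF; i++) print i ": " $i }'
}

show_fields "$sample"
```

Field 1 is the client IP, which is the only column the statistic needs. The timestamp and the quoted request are cut at their inner spaces, so the remaining columns are only fragments.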
# Start HDFS and YARN first
./start-dfs.sh
./start-yarn.sh
# Run this command on the NameNode: mr-jobhistory-daemon.sh start historyserver
# It starts the JobHistoryServer service on the NameNode, so job runs can be inspected in the historyserver logs
./mr-jobhistory-daemon.sh start historyserver
Pig operations:
# First, copy the log to HDFS
grunt> copyFromLocal /home/hadoop/access.log .
grunt> ls
hdfs://172.16.22.251:9005/user/hadoop/pig/access.log<r 1> 1272021
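# Load the log split on spaces, keep only the ip column, group by ip, and count the rows in each group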
grunt> a = load 'access.log'
>> using PigStorage(' ')
>> AS (ip,a1,a2,a3,a4,a5,a6,a7,a8);
grunt> b = foreach a generate ip;
grunt> c = group b by ip;
grunt> d = foreach c generate group,COUNT($1);
grunt> dump d;
# Part of the log output is omitted
2016-06-14 11:52:41,081 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2016-06-14 11:52:41,081 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1464339368801_0004]
2016-06-14 11:53:08,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2016-06-14 11:53:08,418 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1464339368801_0004]
2016-06-14 11:53:20,492 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1464339368801_0004]
2016-06-14 11:53:21,499 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-06-14 11:53:21,531 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-06-14 11:53:22,864 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-06-14 11:53:22,870 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-06-14 11:53:22,945 [main] INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2016-06-14 11:53:22,953 [main] INFO org.apache.hadoop.mapred.ClientServiceDelegate - Application state is completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history server
2016-06-14 11:53:23,031 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2016-06-14 11:53:23,033 [main] INFO org.apache.pig.tools.pigstats.mapreduce.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.7.2 0.16.0 hadoop 2016-06-14 11:52:39 2016-06-14 11:53:23 GROUP_BY
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime Alias Feature Outputs
job_1464339368801_0004 1 1 16 16 16 16 9 9 9 9 a,b,c,d GROUP_BY,COMBINER hdfs://172.16.22.251:9005/tmp/temp-942287448/tmp1415989805,
Input(s):
Successfully read 5959 records (1272400 bytes) from: "hdfs://172.16.22.251:9005/user/hadoop/pig/access.log"
Output(s):
Successfully stored 335 records (7146 bytes) in: "hdfs://172.16.22.251:9005/tmp/temp-942287448/tmp1415989805"
Counters:
Total records written : 335
Total bytes written : 7146
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1464339368801_0004
(1.192.26.82,8)
(119.5.161.3,7)
(121.42.0.34,2)
(121.42.0.35,4)
(121.42.0.54,4)
(121.42.0.63,1)
(121.42.0.71,6)
(121.42.0.82,2)
(121.42.0.86,2)
(121.42.0.88,1186)
(212.14.50.9,7)
(36.33.5.181,7)
(5.178.86.75,3)
(5.178.86.76,1)
(80.82.78.38,1)
(1.214.197.21,8)
(106.41.97.65,23)
(111.13.65.57,123)
(112.86.137.2,21)
(114.113.31.2,5)
(115.60.76.48,10)
(117.78.38.44,8)
(117.78.38.50,11)
(117.78.39.92,8)
(117.78.41.30,6)
(117.78.41.31,9)
(117.78.42.39,12)
(117.78.44.26,9)
(119.29.62.87,7)
(121.43.57.44,8)
(122.96.252.7,8)
(124.133.28.7,73)
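The tuples printed by dump are not ordered by count, so the frequent IPs have to be picked out by eye. As a rough sketch, assuming the dump output has been saved to a local text file, the tuples can be sorted with standard shell tools. The function name `top_ips` is hypothetical, and this is a post-processing step outside Pig, not something the script above does.

```shell
#!/bin/sh
# top_ips N
# Read Pig tuples such as "(121.42.0.88,1186)" on stdin and print the N
# largest counts as "count ip", highest first. Lines that are not
# "(ip,count)" tuples (for example log lines) are ignored.
top_ips() {
    sed -n 's/^(\([^,]*\),\([0-9][0-9]*\))$/\2 \1/p' \
        | sort -k1,1nr -k2,2 \
        | head -n "$1"
}

# Example (dump_output.txt is an assumed file name):
#   top_ips 3 < dump_output.txt
```

Applied to the rows shown above, the three largest counts belong to 121.42.0.88 (1186), 111.13.65.57 (123), and 124.133.28.7 (73).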