The Chicago Crime example
Crime Data with HIVE and PIG Using the Chicago Crime data. Here I will answer a few simple questions to illustrate the use of some common big data tools.
The data set :
The data set contains a little over 90 plus records, perhaps not really on the scale of big data, however the tools and code used in this document (HIVE and PIG) will be unchanged if we were to handle this data set with tens of millions of records.
Questions to Answer:
1. The most frequently occurring primary type (i.e. theft, narcotics etc..)
2. Districts with the most reported incidents
3. Blocks with the most reported incidents
4. Blocks with the most reported incidents, grouped by primary type
5. A look at the date and time when the highest number of incidents where reported
6. Arrests by primary type
7. Arrests by district
8. A look at the date and time when the highest number of arrests took place.
In
each instance we will restrict the reporting in this document to 10 lines of
data, simply to preserve space.
The intention at a high level is to use historical data to assist law enforcement in answering, WHAT has been taking place (primary type i.e. narcotics, motor theft etc.), WHERE has it been taking place (district, block etc.), WHEN has it been taking place (month, day, hour). With this information law enforcement could operate in a more effective and efficient manner. In addition when combining this data with additional variables from other data sets/sources, law enforcement could possibly develop predictive models, further improving the effectiveness and efficiency of its operations.
1. The most frequently occurring primary type (i.e. theft, narcotics etc..)?
HIVE QUERY:
SELECT
primarytype,
COUNT(*)
AS cnt FROM crime GROUP BY primarytype
ORDER
BY cnt DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
(its path in which you
have store your Chicago crime “csv” file. Path could be change as per your
requirement)
crime_grp_type
= GROUP crime BY primarytype;
crime_grp_type_cntd
= FOREACH crime_grp_type GENERATE COUNT(crime) AS cnt;
srtd
= ORDER crime_grp_type_cntd BY cnt;
DUMP
srtd; RESULT:
2. Districts with the most reported incidents?
HIVE QUERY:
SELECT
district,
COUNT(*)
AS cntdistrict FROM crime GROUP BY district
ORDER
BY cntdistrict DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_dist
= GROUP crime BY district;
crime_grp_dist_cntd
= FOREACH crime_grp_dist GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_grp_dist_cntd BY cnt;
DUMP srtd;RESULT:
3. Blocks with the most reported incidents?
HIVE QUERY:
SELECT
block,
COUNT(*)
AS cntblock FROM crime
GROUP
BY block
ORDER
BY cntblock DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_block
= GROUP crime BY block;
crime_grp_block_cntd = FOREACH crime_grp_block
GENERATE COUNT(crime) AS cnt;
srtd = ORDER crime_grp_block_cntd BY cnt;
DUMP srtd; RESULT:
4. Blocks with the most reported incidents, grouped by primary type?
HIVE QUERY:
SELECT
block,
primarytype,
COUNT(*) AS cntblocktype FROM crime GROUP BY block,
primarytype
ORDER BY cntblocktype DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_cogrp_block_type = COGROUP crime BY (block, primarytype);
crime_
cogrp_block_type _cntd = FOREACH crime_ cogrp_block_type GENERATE COUNT(crime)
AS cnt;
srtd
= ORDER crime_ cogrp_block_type _cntd BY cnt;
DUMP
srtd;
RESULT:
5. A look at the date and time when the highest number of incidents
where reported?
HIVE QUERY:
SELECT
date,
COUNT(*)
AS cnt FROM crime
GROUP
BY date
ORDER
BY cnt DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_grp_date
= GROUP crime BY date;
crime_grp_date_cntd = FOREACH crime_grp_date
GENERATE COUNT(crime) AS cnt;
srtd
= ORDER crime_grp_date_cntd BY cnt;
DUMP srtd;
RESULT:
6. Arrests by primary type?
HIVE QUERY:
SELECT
primarytype,
COUNT(*)
AS cnt FROM crime WHERE arrest = True
GROUP
BY primarytype
ORDER
BY cnt DESC
PIG SCRIPT:
crime
= LOAD ''/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter
= FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );
crime_grp_type
= GROUP crime_filter BY primarytype;
crime_grp_type_cntd
= FOREACH crime_grp_type GENERATE COUNT(crime_filter) AS cnt;
srtd
= ORDER crime_grp_type_cntd BY cnt;
DUMP
srtd;
RESULT:
7. Arrests by district?
HIVE QUERY:
SELECT
district,
COUNT(*)
AS cntdistrictarrest FROM crime WHERE arrest = True
GROUP
BY district
ORDER
BY cntdistrictarrest DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter
= FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );
crime_grp_dist
= GROUP crime_filter BY district;
crime_grp_dist_cntd
= FOREACH crime_grp_dist GENERATE COUNT(crime_filter) AS cnt;
srtd
= ORDER crime_grp_dist_cntd BY cnt;
DUMP
srtd;
RESULT:
8. A look at the date and time when the highest number of arrests took
place?
HIVE QUERY:
SELECT
date,
COUNT(*)
AS cnt_arrest FROM crime WHERE arrest = True
GROUP
BY date
ORDER
BY cnt_arrest DESC
PIG SCRIPT:
crime
= LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’
crime_filter
= FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );
crime_grp_date
= GROUP crime_filter BY date;
crime_grp_date_cntd
= FOREACH crime_grp_date GENERATE COUNT(crime_filter) AS cnt;
srtd
= ORDER crime_grp_date_cntd BY cnt;
DUMP srtd
RESULT:
Downloads:
I hope this tutorial will surely help you. If you have any questions or problem let me know.
Happy Hadooping with Patrick..
No comments:
Post a Comment