HADOOP GYAAN: Apache Hadoop : Chicago Crime HIVE and PIG Case Study

The Chicago Crime example

Crime Data with HIVE and PIG Using the Chicago Crime data. Here I will answer a few simple questions to illustrate the use of some common big data tools.

The data set :

The data set contains a little over 90 plus records, perhaps not really on the scale of big data, however the tools and code used in this document (HIVE and PIG) will be unchanged if we were to handle this data set with tens of millions of records.

Questions to Answer:

1. The most frequently occurring primary type (i.e. theft, narcotics etc..)
2. Districts with the most reported incidents
3. Blocks with the most reported incidents
4. Blocks with the most reported incidents, grouped by primary type
5. A look at the date and time when the highest number of incidents where reported
6. Arrests by primary type
7. Arrests by district
8. A look at the date and time when the highest number of arrests took place.

In each instance we will restrict the reporting in this document to 10 lines of data, simply to preserve space.

The intention at a high level is to use historical data to assist law enforcement in answering, WHAT has been taking place (primary type i.e. narcotics, motor theft etc.), WHERE has it been taking place (district, block etc.), WHEN has it been taking place (month, day, hour). With this information law enforcement could operate in a more effective and efficient manner. In addition when combining this data with additional variables from other data sets/sources, law enforcement could possibly develop predictive models, further improving the effectiveness and efficiency of its operations.

1. The most frequently occurring primary type (i.e. theft, narcotics etc..)?

HIVE QUERY:

SELECT primarytype,

COUNT(*) AS cnt FROM crime GROUP BY primarytype

ORDER BY cnt DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

(its path in which you have store your Chicago crime “csv” file. Path could be change as per your requirement)

crime_grp_type = GROUP crime BY primarytype;

crime_grp_type_cntd = FOREACH crime_grp_type GENERATE COUNT(crime) AS cnt;

srtd = ORDER crime_grp_type_cntd BY cnt;

DUMP srtd;

RESULT:

2. Districts with the most reported incidents?

HIVE QUERY:

SELECT district,

COUNT(*) AS cntdistrict FROM crime GROUP BY district

ORDER BY cntdistrict DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_grp_dist = GROUP crime BY district;

crime_grp_dist_cntd = FOREACH crime_grp_dist GENERATE COUNT(crime) AS cnt;

srtd = ORDER crime_grp_dist_cntd BY cnt;

DUMP srtd;

RESULT:

3. Blocks with the most reported incidents?

HIVE QUERY:

SELECT block,

COUNT(*) AS cntblock FROM crime

GROUP BY block

ORDER BY cntblock DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_grp_block = GROUP crime BY block;

crime_grp_block_cntd = FOREACH crime_grp_block GENERATE COUNT(crime) AS cnt;

srtd = ORDER crime_grp_block_cntd BY cnt;

DUMP srtd;

RESULT:

4. Blocks with the most reported incidents, grouped by primary type?

HIVE QUERY:

SELECT block,

primarytype, COUNT(*) AS cntblocktype FROM crime GROUP BY block,

primarytype ORDER BY cntblocktype DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_cogrp_block_type = COGROUP crime BY (block, primarytype);

crime_ cogrp_block_type _cntd = FOREACH crime_ cogrp_block_type GENERATE COUNT(crime) AS cnt;

srtd = ORDER crime_ cogrp_block_type _cntd BY cnt;

DUMP srtd;

RESULT:

5. A look at the date and time when the highest number of incidents where reported?

HIVE QUERY:

SELECT date,

COUNT(*) AS cnt FROM crime

GROUP BY date

ORDER BY cnt DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_grp_date = GROUP crime BY date;

crime_grp_date_cntd = FOREACH crime_grp_date GENERATE COUNT(crime) AS cnt;

srtd = ORDER crime_grp_date_cntd BY cnt;

DUMP srtd;

RESULT:

6. Arrests by primary type?

HIVE QUERY:

SELECT primarytype,

COUNT(*) AS cnt FROM crime WHERE arrest = True

GROUP BY primarytype

ORDER BY cnt DESC

PIG SCRIPT:

crime = LOAD ''/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_filter = FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );

crime_grp_type = GROUP crime_filter BY primarytype;

crime_grp_type_cntd = FOREACH crime_grp_type GENERATE COUNT(crime_filter) AS cnt;

srtd = ORDER crime_grp_type_cntd BY cnt;

DUMP srtd;

RESULT:

7. Arrests by district?

HIVE QUERY:

SELECT district,

COUNT(*) AS cntdistrictarrest FROM crime WHERE arrest = True

GROUP BY district

ORDER BY cntdistrictarrest DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_filter = FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );

crime_grp_dist = GROUP crime_filter BY district;

crime_grp_dist_cntd = FOREACH crime_grp_dist GENERATE COUNT(crime_filter) AS cnt;

srtd = ORDER crime_grp_dist_cntd BY cnt;

DUMP srtd;

RESULT:

8. A look at the date and time when the highest number of arrests took place?

HIVE QUERY:

SELECT date,

COUNT(*) AS cnt_arrest FROM crime WHERE arrest = True

GROUP BY date

ORDER BY cnt_arrest DESC

PIG SCRIPT:

crime = LOAD '/home/cloudera/Downloads/ chicago _Crimes_2014.csv’

crime_filter = FILTER crime BY ( UPPER (arrest) matches '.*TRUE.*' );

crime_grp_date = GROUP crime_filter BY date;

crime_grp_date_cntd = FOREACH crime_grp_date GENERATE COUNT(crime_filter) AS cnt;

srtd = ORDER crime_grp_date_cntd BY cnt;

DUMP srtd

RESULT:

Downloads:

1.Input sample file (chicago _Crimes_2014.csv)

I hope this tutorial will surely help you. If you have any questions or problem let me know.

Happy Hadooping with Patrick..

Pages

Tuesday, 14 June 2016

Apache Hadoop : Chicago Crime HIVE and PIG Case Study

I hope this tutorial will surely help you. If you have any questions or problem let me know.

No comments:

Post a Comment