Reproducible Research Using R Markdown

While doing research and analysis, an important aspect which normally goes unnoticed is documentation. Suffice to say however, if your research can't be reproduced - people will find it hard to give them any serious thought.

Though documentation is a boring (yet crucial) task in anyone's research pipeline, the ability to document as you code is indeed helpful. In R, one can make use of R Markdown and the knitr package to do this. 

Below is a sample of my assignment (which deals with reproducible research) as part of John Hopkins Data Science Specialization course which I'm still currently going through. Had to pull an all-nighter last night since I've been procrastinating really bad as of late (which is rather evident in the kind of analysis that I did below) .

Weather Events And It’s Affect To Population

Hafidz Zulkifli

Synopsis

This report attempts to draw correlation between various weather events to consequences against the human population in the United States. Weather data used is based from NOAA Storm Database reading from 1950 to November 2011. The first goal of the report focuses upon are the affects of weather events against population health. The second goal of the study is to determine the effects of weather against the United States’ economy.

Data Processing

The data set for this study is taken from NOAA Storm Database. We will first load this into R.
data <- span=""> read.csv("D:/Self_Development/Coursera - JHU/5. Reproducible Research/2/repdata-data-StormData.csv.bz2" ) 
Since we are looking at two aspects of the events, it is helpful to separate the data set into 2 halves, each to answer it’s respective question.
To prepare the data set to answer the first question, we will be using the event types, amount of reported injuries and also fatalities. This would give us an idea of which weather events are most harmful - or deadly - to the population.
Our basic strategy in processing the data would be to sum up all the fatalities and injuries per weather event, and visualize the total amount in the Result section.
However due to the many types of events that has been reported, we shall concentrate our work on only the top 10 for each category.
weather_data <- span=""> data.frame(data$EVTYPE, data$FATALITIES, data$INJURIES)  
colnames(weather_data) <- span=""> c('event', 'fatality','injury')

library(plyr)
by_event1 <- span=""> ddply(weather_data,~event,summarise,sum=sum(fatality))
fatality <- span=""> by_event1[by_event1$sum %in% sort(by_event1$sum, decreasing=TRUE)[1:10],]

by_event2 <- span=""> ddply(weather_data,~event,summarise,sum=sum(injury))
injury <- span=""> by_event2[by_event2$sum %in% sort(by_event2$sum, decreasing=TRUE)[1:10],]   
For the second dataset, we will be using the event types, the amount of property damage and the crop damage (which are available in dollars).
Similar to our previous work, the basic strategy in processing the data would be to sum up all the damages on property and crop. We will then visualize the result in the Result section.
However due to the many types of events that has been reported, we shall concentrate our work on only the top 10 for each category.
economy_data <- span=""> data.frame(data$EVTYPE, data$PROPDMG, data$PROPDMGEXP, data$CROPDMG, data$CROPDMGEXP)
colnames(economy_data) <- span=""> c('event', 'propertydmg','propdmgexp','cropdmg','cropdmgexp')
summary(economy_data)
##                event         propertydmg        propdmgexp    
##  HAIL             :288661   Min.   :   0.00          :465934  
##  TSTM WIND        :219940   1st Qu.:   0.00   K      :424665  
##  THUNDERSTORM WIND: 82563   Median :   0.00   M      : 11330  
##  TORNADO          : 60652   Mean   :  12.06   0      :   216  
##  FLASH FLOOD      : 54277   3rd Qu.:   0.50   B      :    40  
##  FLOOD            : 25326   Max.   :5000.00   5      :    28  
##  (Other)          :170878                     (Other):    84  
##     cropdmg          cropdmgexp    
##  Min.   :  0.000          :618413  
##  1st Qu.:  0.000   K      :281832  
##  Median :  0.000   M      :  1994  
##  Mean   :  1.527   k      :    21  
##  3rd Qu.:  0.000   0      :    19  
##  Max.   :990.000   B      :     9  
##                    (Other):     9
Based on the summary, it can be seen that not all the damage estimates have been given the magnitudes. As such, the resultant value without any assigned magnitude may not be relevant enough to be included in the calculation (value is too small to be significant).
Thus, any estimates which does not have any magnitude assigned other than those that have been documented by NWS (ie ‘K’, ‘M’ and ‘B’) will be omitted from this study.
#to simplify typing
ed <- span=""> economy_data

# get the property damages
edp <- span=""> data.frame(ed$event, ed$propertydmg, ed$propdmgexp)
ed_M <- span=""> edp[edp$ed.propdmgexp == 'M',]
ed_K <- span=""> edp[edp$ed.propdmgexp == 'K',]
ed_B <- span=""> edp[edp$ed.propdmgexp == 'B',]

#calculate real value
ed_M1 <- span=""> data.frame(ed_M$ed.event, ed_M$ed.propertydmg*1000000)
colnames(ed_M1) <- span=""> c('event','prop_damage')
ed_K1 <- span=""> data.frame(ed_K$ed.event, ed_K$ed.propertydmg*1000)
colnames(ed_K1) <- span=""> c('event','prop_damage')
ed_B1 <- span=""> data.frame(ed_B$ed.event, ed_B$ed.propertydmg*1000000000)
colnames(ed_B1) <- span=""> c('event','prop_damage')

#recombine
edp <- span=""> rbind(ed_K1, ed_M1, ed_B1)

library(plyr)
by_event3 <- span=""> ddply(edp,~event,summarise,sum=sum(prop_damage))
property_damage <- span=""> by_event3[by_event3$sum %in% sort(by_event3$sum, decreasing=TRUE)[1:10],]


# now let's get the crop damages
edc <- span=""> data.frame(ed$event, ed$cropdmg, ed$cropdmgexp)
edc_M <- span=""> edc[edc$ed.cropdmgexp == 'M',]
edc_K <- span=""> edc[edc$ed.cropdmgexp == 'K',]
edc_B <- span=""> edc[edc$ed.cropdmgexp == 'B',]

#calculate real value
edc_M1 <- span=""> data.frame(edc_M$ed.event, edc_M$ed.cropdmg*1000000)
colnames(edc_M1) <- span=""> c('event','crop_damage')
edc_K1 <- span=""> data.frame(edc_K$ed.event, edc_K$ed.cropdmg*1000)
colnames(edc_K1) <- span=""> c('event','crop_damage')
edc_B1 <- span=""> data.frame(edc_B$ed.event, edc_B$ed.cropdmg*1000000000)
colnames(edc_B1) <- span=""> c('event','crop_damage')

#recombine
edc <- span=""> rbind(edc_K1, edc_M1, edc_B1)

by_event4 <- span=""> ddply(edc,~event,summarise,sum=sum(crop_damage))
crop_damage <- span=""> by_event4[by_event4$sum %in% sort(by_event4$sum, decreasing=TRUE)[1:10],]
Finally, we compute the cost incurred due to both property and crop damages.
#combine property and crop

prop_crop <- span=""> rbind(by_event3, by_event4)
prop_crop_damage <- span=""> prop_crop[prop_crop$sum %in% sort(prop_crop$sum, decreasing=TRUE)[1:10],]

Result

Weather Events and Health

#plotting two panel graph
par(mai=c(3,1,1,1))
par(mfrow=c(1,2))
barplot(fatality$sum,names.arg=fatality$event,xlab=mtext("Event Types", side=3), ylab="Total Fatalities", las=2)
barplot(injury$sum,names.arg=injury$event,xlab=mtext("Event Types", side=3),ylab="Total Injuries", las=2 )
title(outer=TRUE,main="Top 10 Most Harmful Weather Events", line= -2)
Based on the figure above, we can conclude visually that tornado seems to be the most harmful weather event (both in terms of death and injuries caused) that affects the wide population.
However, do also note that there are occurences where there are typos and multiple classification for the same type of weather event. What this means is that more analysis is needed to combine similar types of events together to get the correct picture of things.

Weather Events and Economy

par(mai=c(3,1,1,1))
par(mfrow=c(1,3))
barplot(property_damage$sum,names.arg=property_damage$event,xlab=mtext("Property Damage vs Event Types", side=3), ylab="", las=2)
barplot(crop_damage$sum,names.arg=crop_damage$event,xlab=mtext("Crop Damage vs Event Types", side=3), ylab="", las=2)
barplot(prop_crop_damage$sum,names.arg=prop_crop_damage$event,xlab=mtext("Total Damage vs Event Types", side=3), ylab="", las=2)

title(outer=TRUE,main="Top 10 Most Costly Weather Events", line= -2)
Meanwhile on the economy, we see that flood is the major contributor of economic loss. Drilling further down into a subsegment of this, we see that drought is major contributor to economic loss when it comes to crops, but doesn’t affect the property segment at all.

Comments

Popular posts from this blog

HIVE: Both Left and Right Aliases Encountered in Join

Assign select result to variable in Netezza stored procedure

Splitting value in Netezza using array_split