Showing posts from May, 2015

Reproducible Research Using R Markdown

While doing research and analysis, an important aspect that often goes unnoticed is documentation. Suffice it to say, however, that if your research can't be reproduced, people will find it hard to give it any serious thought.
Though documentation is a boring (yet crucial) task in anyone's research pipeline, the ability to document as you code is indeed helpful. In R, one can make use of R Markdown and the knitr package to do this. 
Below is a sample of my assignment (which deals with reproducible research) from the Johns Hopkins Data Science Specialization course, which I'm still going through. I had to pull an all-nighter last night since I've been procrastinating really badly of late (which is rather evident in the kind of analysis that I did below).

Weather Events And It’s Affect To Population
Hafidz Zulkifli

Synopsis
This report attempts to draw correlations between various weather events and their consequences for the human population in the United States. We…

Setting up IPython Notebook on Centos 6.6 64-bit

I've been using IPython Notebook for a while now for my development work in a Windows environment. It's a great tool, as it provides a confined environment in which I can write and test my code at the same time, not to mention the wealth of available libraries that one can use.
However, not all libraries can be successfully installed in Anaconda (Windows), specifically the igraph library. For some time I've been using a flavor of WinPython that has igraph pre-installed (google winpython-64bit- for more info), but it became quite cumbersome to keep track of which library is installed in which environment.
Hence, to maintain my sanity, I'm going back to IPython Notebook, in Linux this time. Note that the steps below are really meant for me, but hopefully others can make some use of them as well.

Setting up Anaconda
Download it from Anaconda. Choose "Linux 64-bit - Python 2.7".
Run bash Anaconda-2.2.0-Lin…

HIVE: Both Left and Right Aliases Encountered in Join

Recently I was stuck on this error while trying to do a JOIN in HIVE. Below is my SQL:
create table ma.internet_v2 as
select A.msisdn, A.imsi, A.dt,
A.start_time, A.end_time, A.url, A.ttl_connection_dur_ms,
A.ttl_upload_bytes, A.ttl_download_bytes, A.ttl_cdr_cnt,
coalesce(B.domain_desc,A.domain_desc) domain_desc,
coalesce(B.subdomain_desc, A.subdomain_desc) subdomain_desc
from db.internet A
left outer join
hfz_domain_name_mapping B
on instr(A.url, B.url_pattern) > 0;

HIVE then gave me an error message: "Both Left and Right Aliases Encountered in Join".

After going through some mail threads, it seems that HIVE supports only certain types of join.
Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job.

A poster from Stack Overflow suggested that one should use WHERE instead of ON. Duly following his advice, it seemed to do the tr…
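
As a rough illustration of the WHERE-instead-of-ON idea, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for Hive (an assumption made purely so the example can run anywhere; it is not the actual Hive query). The table names, columns, and sample rows are hypothetical and much smaller than the real ones in the query above.

```python
import sqlite3

# Assumption: SQLite stands in for Hive here, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table internet (msisdn text, url text, domain_desc text);
    create table domain_name_mapping (url_pattern text, domain_desc text);

    -- Hypothetical sample data
    insert into internet values
        ('001', 'http://mail.example.com/inbox',  'unknown'),
        ('002', 'http://video.example.org/watch', 'unknown'),
        ('003', 'http://unmapped.example.net/',   'unknown');
    insert into domain_name_mapping values
        ('mail.example.com',  'Mail'),
        ('video.example.org', 'Video');
""")

# The non-equality condition is moved out of ON and into WHERE,
# applied on top of a cross join of the two tables.
rows = conn.execute("""
    select A.msisdn,
           coalesce(B.domain_desc, A.domain_desc) as domain_desc
    from internet A
    cross join domain_name_mapping B
    where instr(A.url, B.url_pattern) > 0
    order by A.msisdn
""").fetchall()

# Map each matched msisdn to its resolved domain description.
matched = {msisdn: desc for msisdn, desc in rows}
print(rows)
```

One thing to watch: with the condition in WHERE, the query behaves like an inner join, so rows from A with no matching pattern (msisdn 003 in this toy data) drop out of the result, unlike with a left outer join.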