Posts

Showing posts from 2015

Natural Language Processing by Stanford

Below is a great course by Stanford University on Natural Language Processing. There haven't been any recent sessions on Coursera lately, but you can still access the archive at this link.

I'm currently working on my capstone for the Johns Hopkins Data Science Specialization, where we're asked to build a data product that predicts the next set of words based on what users type into a textbox - similar to Google Autocomplete or SwiftKey.

Pretty psyched about it - looking forward to the challenge! :)
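Just to sketch the core idea of next-word prediction (a toy bigram model over a made-up corpus - nothing like the actual capstone solution):

```r
# toy bigram next-word predictor (illustrative only; corpus is made up)
corpus <- c("i love data science", "i love coffee", "data science is fun")

# build (word, next word) pairs within each sentence
bigrams <- do.call(rbind, lapply(strsplit(corpus, " "), function(w)
  data.frame(w1 = head(w, -1), w2 = tail(w, -1))))

predict_next <- function(word) {
  followers <- bigrams$w2[bigrams$w1 == word]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))  # most frequent follower wins
}

predict_next("i")  # "love"
```

The real thing needs smoothing and backoff to handle unseen words, but the lookup-a-follower mechanic is the same.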

Behavioural Economics

Below are a few notes (mostly copy-pastes) of material I covered during Dilip Soman's Behavioural Economics online course on edX.

The topic is a recent interest of mine after spending some time earlier this year learning about social graphs and basic graph theory in general. As a natural extension of that, a question that comes to mind is: how do people make purchasing decisions?

The notes have mostly been compiled in Slack - I've somehow taken a liking to the way notes get formatted there. It's relatively easy too - perfect for lazy people like myself.

Below are the public links of my notes in Slack with regards to the topic:
Early General Notes
A Theory of Decision Points
Choice Overload
Glossary of Concepts
Consumption Vocabulary
Recent Nudge Experiments
Decision Aids
Disclosure

One thing that I really like about the course is that it also talks about how to conduct experiments should you have an idea that you'd like to test. Below are some notes:

Geospatial Display with Shiny

Image
One of the reasons I like joining these online courses is that they give you the chance to meet people from different backgrounds, industries, and countries.

In this particular post, I'm quite amazed by the dedication and thought put in by one of my classmates. The assignments required that we create our own data product using R - so that we become wholesome data science practitioners who acquire data, process it, model it, document it, and create data products for others to consume.

It's one thing to do assignments for the sake of completing the course; it's another to produce a beauty such as the above.

You may explore the Shiny app here at this link, and have a look at the forked source code here. I've forked it since I know I'll be making use of this in times to come.

Using R with Shiny

Lately I've been studying Shiny and how to use it in R.

It's a cool tool to have in your arsenal while using R, as you can very quickly develop a data product right from R itself. Of course, you could load the data up in tools like Tableau or Qlik Sense and have a much cooler/sexier visualisation - but that's not the point I'm trying to make here.

For a quick preview of what I managed to conjure up with Shiny, pop over to Social Network.

Couldn't help it - it just had to be a network graph - again :P. A real sucker for graphs I am.

Anyway, what it aims to demonstrate is how, from a social network graph like that, you can derive centralities (degree, closeness, etc.) and, from there, the roles of each node based on how they are connected to each other. I won't claim that it's correct in any way - it's just something I picked up from Drew Conway, based on his presentation on Socio-Terrorism.

For more details - check out the code in …
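For anyone wanting to reproduce the centrality part outside Shiny, here's a minimal sketch using the igraph package (the toy graph is my own, not the one in the app):

```r
# assumes the igraph package is installed; the graph itself is made up
library(igraph)
g <- graph_from_literal(A-B, A-C, A-D, B-C, D-E)
degree(g)       # number of direct connections per node
closeness(g)    # inverse of the total distance to every other node
betweenness(g)  # how often a node lies on shortest paths between others
```

Comparing the three per node is what lets you start guessing at roles: a high-degree, high-betweenness node behaves like a hub or broker, while a low-degree node with decent closeness is more of a well-placed follower.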

Notes on R Machine Learning Packages

The excerpt below is taken from this page. I'm copying it here for future reference in finding the right R packages for different types of analysis - God knows it's hard to find the right package in R. :)

Several add-on packages implement ideas and methods developed at the borderline between computer science and statistics - this field of research is usually referred to as machine learning. The packages can be roughly structured into the following topics:

Neural Networks: Single-hidden-layer neural networks are implemented in package nnet (shipped with base R). Package RSNNS offers an interface to the Stuttgart Neural Network Simulator (SNNS). An interface to the FCNN library allows user-extensible artificial neural networks in package FCNN4R.

Recursive Partitioning: Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book, are implemented in rpart (shipped with base R) and tree. Package rpart is recommended for computing CART-…
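Since rpart ships with standard R distributions, it's the quickest of these to try. A minimal sketch (the built-in iris data is my choice of example, not the Task View's):

```r
# a quick CART-style classification tree on the built-in iris data
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                              # complexity table, useful for pruning
predict(fit, iris[1, ], type = "class")   # class prediction for one observation
```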

Clearing up memory in R

There are basically two ways that I know of to clear up memory in R:
1. rm()
2. gc()
The first one removes a variable - a vector or data frame, say - from your workspace. But somehow the memory can sometimes (or in my case, all the time) still show as consumed by R in Task Manager. That's when garbage collection ( gc() ) comes in handy.
For more info on gc(), use ?gc
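A quick demonstration of the two together (the object name x is just for illustration):

```r
x <- rnorm(1e6)        # allocate a vector of roughly 8 MB
print(object.size(x))  # how much memory the object itself uses
rm(x)                  # drop the object from the workspace
invisible(gc())        # run garbage collection so R can return memory to the OS
```

Note that rm() alone only unbinds the name; the memory isn't necessarily handed back until a garbage collection actually runs.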

Fast load for Teradata

Below is a sample script that I've used to import data into a Teradata DWH using the FastLoad tool. Teradata is a real hassle in not having a simple bulk import facility like Oracle's. It took me hours to understand what was going on and to get the script to work.

In any case, sharing it here for others to refer to.
//test.csv
node,cluster_id,node_type
XXXX12710,1,msisdn
XXXX643124,2,msisdn
//

SESSIONS 5;
LOGON <tdpid>/<username>,<password>;
CREATE TABLE <tablename>, NO FALLBACK
   (
    NODE VARCHAR(50),
    CLUSTER_ID VARCHAR(10),
    NODE_TYPE VARCHAR(10)
   )
   PRIMARY INDEX(NODE);

BEGIN LOADING <tablename>
ERRORFILES <tablename>_err1, <tablename>_err2;
SET RECORD VARTEXT ",";
RECORD 2;
DEFINE NODE (VARCHAR(50)),
       CLUSTER_ID (VARCHAR(10)),
       NODE_TYPE (VARCHAR(10))
FILE=D:\test.csv;

INSERT INTO <tablename> VALUES
(
:NODE,
:CLUSTER_ID,
:NODE_TYPE
);
END LOADING;
LOGOFF;
QUIT;

Cross Validating in R

Excerpt from the article Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models by Dr. Jon Starkweather.

Cross validation is useful for overcoming the problem of over-fitting. Over-fitting is one aspect of the larger issue of what statisticians refer to as shrinkage (Harrell, Lee, & Mark, 1996). Over-fitting refers to when the model requires more information than the data can provide; for example, it can occur when a model is assessed for fit using the same data it was initially fit with. Much like exploratory and confirmatory analysis should not be done on the same sample of data, fitting a model and then assessing how well that model performs on the same data should be avoided.
For more info, refer to the actual article.
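To make the idea concrete, here's a minimal k-fold cross-validation sketch in base R (the toy data and variable names are my own, not the article's):

```r
set.seed(42)
# toy data: y depends linearly on x plus noise
n <- 100
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.3)
d <- data.frame(x, y)

k <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
mse <- numeric(k)
for (i in 1:k) {
  train <- d[folds != i, ]
  test  <- d[folds == i, ]
  fit  <- lm(y ~ x, data = train)          # fit on k-1 folds
  pred <- predict(fit, newdata = test)     # assess on the held-out fold
  mse[i] <- mean((test$y - pred)^2)
}
mean(mse)  # cross-validated estimate of prediction error
```

Because every fold is assessed by a model that never saw it during fitting, the averaged error avoids the fit-and-assess-on-the-same-data trap the excerpt describes.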

Hindsight: 8 months down the analytics road

Well, this is probably not my typical how-to or tutorial on an analytics tool or that sort of thing.

This time it's just some random thoughts that I thought I'd share, having walked the path for nearly a year now since the beginning of 2015.

I started this journey following a change in the direction of our company. We decided sometime late last year that big data and analytics had huge potential in the future of the telecommunications industry, and thus we should try to revamp our company and adopt a data-driven culture into our daily routine.

Personally, I thought it was fun. With all the hype around big data and its possibilities, not to mention data scientist being dubbed a sexy job, I was ecstatic (to say the least) to be part of the bandwagon.

Ahem.

Not to mention that we also had Hadoop in our IT ecosystem. Seemed like the best place to learn and practice big data analytics - I said to myself.

8 months down the road though, I feel nowhere as…

Learning R using Swirl

Swirl is an R package that turns your R console into an interactive learning environment. From the readme file of the swirl courses GitHub repository:

This is a collection of interactive courses for use with the swirl R package. You'll find instructions for installing courses further down on this page. Some courses are still in development and we'd love to hear any suggestions you have as you work through them.

The repository contains a few ready-made courses that you can use to learn R while also learning about other data science related subjects. The currently available courses in the repository are:

1. Exploratory Data Analysis
2. Getting and Cleaning Data
3. Mathematical Biostatistics
4. R Programming
5. Regression Models
6. Statistical Models
7. Overview of Statistics
8. Data Analysis

Most of them are based on the Johns Hopkins Data Science Specialization Coursera online course. So even if you have not registered for the course, you can still learn about those topics on your own.
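Getting started is just a few lines in the console (course names follow the list above; the downloads happen on first run):

```r
# one-time setup: install swirl from CRAN, then pull a course and start it
install.packages("swirl")
library(swirl)
install_course("R Programming")  # any course name from the swirl course repository
swirl()                          # launches the interactive session in the console
```

From there, swirl prompts you for answers directly at the R prompt and checks them as you go.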


Notes from FB: How people do postdoc research

Image

Notes from FB: You are not qualified to do experimental research if you ignore these 10 things.

You are not qualified to do experimental research if you ignore these 10 things... READ and MAKE IT VIRAL... It's REAL! Research...
Posted by Othman Talib on Thursday, 30 July 2015

Calculating P-values in R

Notes below are documented for future reference should I ever need to use them again someday.

A pharmaceutical company is interested in testing a potential blood-pressure-lowering medication. Their first examination considers only subjects that received the medication at baseline and then two weeks later. The data are as follows (SBP in mmHg):

Subject  Baseline  Week 2
1        140       132
2        138       135
3        150       151
4        148       146
5        135       130

Consider testing the hypothesis that there was a mean reduction in blood pressure. Give the P-value for the associated two-sided T test. (Hint: consider that the observations are paired.)

Solution:

baseline <- c(140,138,150,148,135)
week2 <- c(132,135,151,146,130)
round(t.test(baseline, week2, paired = TRUE)$p.value, 3)
## [1] 0.087

Researchers conducted a blind taste test of Coke versus Pepsi. Each of four people was asked which of two blinded drinks, given in random order, they preferred. The data was such that 3 of the 4 people chose Coke. Assu

Hypothesis testing and confidence interval in R

As usual, the notes below are documented for future reference should I ever need to use them again someday. Also to note, most of the exercises this time around require that you be able to generate a random data set in a precise manner - else you wouldn't get the same answer as what was provided in the exercises. You'd think that R already has a function for that, but sadly no. Due to the small sample sizes in most of the questions, rnorm wouldn't usually give you an accurate result. It'd instead give you something like this:

a <- rnorm(10, 5, 1)
mean(a)
## [1] 5.081355
sd(a)
## [1] 0.7624848

Lucky for me, I was able to find this nice function right here on Stack Overflow:

rnorm2 <- function(n, mean, sd) { mean + sd * scale(rnorm(n)) }

It's very similar to rnorm - only more precise, and it behaves exactly as you'd expect a random number generator to behave. Let's take it for a test run.

b <- rnorm2(10, 5, 1)
mean(b)
## [1] 5
sd(b