Showing posts from August, 2015

Clearing up memory in R

There are basically two ways that I know of to clear up memory in R:
1. rm()  2. gc()
The first removes a variable (a vector, data frame, and so on) from your workspace. But somehow the memory can sometimes (or in my case, all the time) still appear to be consumed by R according to Task Manager. That's when garbage collection ( gc() ) comes in handy.
For more info on gc(), use ?gc
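To make this concrete, here is a small illustration (the object name and size are arbitrary, just for demonstration):

```r
big <- rnorm(1e7)                      # allocate a vector of roughly 80 MB
print(object.size(big), units = "MB")  # check how much memory it occupies

rm(big)   # remove the object from the workspace
gc()      # garbage-collect, so R can return the freed memory to the OS
```

After gc() runs, the memory reported by Task Manager should drop as well.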

FastLoad for Teradata

Below is a sample script that I've used to import data into a Teradata DWH using the FastLoad tool. Teradata is a real hassle in that it lacks the straightforward bulk-import functionality of Oracle. It took me hours to understand what was going on and to get the script working.

In any case, I'm sharing it here for others to refer to.

    /* Replace the <...> placeholders with your own logon details and names */
    logon <tdpid>/<username>,<password>;
    create table <tablename> (
        NODE VARCHAR(50),
        CLUSTER_ID VARCHAR(10));
    set record vartext ",";  /* comma-delimited input */
    record 2;                /* start from record 2, skipping the header row */
    define
        NODE (VARCHAR(50)),
        CLUSTER_ID (VARCHAR(10))
    file = <datafile>;
    begin loading <tablename>
    errorfiles <tablename>_err1, <tablename>_err2;
    insert into <tablename> values
        (:NODE, :CLUSTER_ID);
    end loading;
    logoff;

Cross Validating in R

Excerpt from the article Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models by Dr. Jon Starkweather.

Cross validation is useful for overcoming the problem of over-fitting. Over-fitting is one aspect of the larger issue of what statisticians refer to as shrinkage (Harrell, Lee, & Mark, 1996). Over-fitting is a term which refers to when the model requires more information than the data can provide. For example, over-fitting can occur when a model is assessed with the same data that was used to fit it in the first place. Much like exploratory and confirmatory analysis should not be done on the same sample of data, fitting a model and then assessing how well that model performs on the same data should be avoided.
For more info, refer to the actual article.
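As a quick taste of the idea, here is a minimal k-fold cross-validation sketch in base R (the built-in mtcars data, the model formula, and k = 5 are arbitrary choices for illustration; the article itself covers dedicated packages and functions):

```r
set.seed(42)
k <- 5
# assign each row of mtcars to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

cv_mse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)   # fit on k-1 folds
  mean((test$mpg - predict(fit, test))^2)    # assess on the held-out fold
})
mean(cv_mse)   # cross-validated estimate of prediction error
```

Because each fold is held out of the fit that is later used to predict it, the averaged error is an honest estimate rather than one inflated by over-fitting.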

Hindsight: 8 months down the analytics road

Well, this is probably not my typical how-to or tutorial on an analytics tool or that sort of thing.

This time it's just some random thoughts that I thought I'd share with all, having walked the path for eight months now since the beginning of 2015.

I started this journey following a change in the direction of our company. We decided sometime late last year that big data and analytics had huge potential in the future of the telecommunications industry, and thus we should try to revamp our company and adopt a data-driven culture in our daily routine.

Personally I thought it was fun. With all the hype around big data and its possibilities, not to mention data scientist being called a sexy job, I was ecstatic (to say the least) to be part of the bandwagon.


Not to mention that we also had Hadoop in our IT ecosystem. It seemed like the best place to learn and practice big data analytics, I told myself.

8 months down the road though, I feel nowhere as…

Learning R using Swirl

Swirl is an R package that turns your R console into an interactive learning environment. From the readme file of the Swirl github:
This is a collection of interactive courses for use with the swirl R package. You'll find instructions for installing courses further down on this page. Some courses are still in development and we'd love to hear any suggestions you have as you work through them. The GitHub repository contains a few ready-made courses that you can use to learn R while also learning about other data science related subjects. Currently available courses in the repository are:

1. Exploratory Data Analysis
2. Getting and Cleaning Data
3. Mathematical Biostatistics
4. R Programming
5. Regression Models
6. Statistical Models
7. Overview of Statistics
8. Data Analysis

Most of them are based on the Johns Hopkins Data Science Specialization Coursera online course. So even if you have not registered for the course, you can still learn about those topics on your own.
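If you want to try it out, installing swirl and one of the courses above looks roughly like this (the course name passed to install_course() must match the spelling used in the swirl_courses repository):

```r
install.packages("swirl")        # install swirl from CRAN
library(swirl)

install_course("R Programming")  # fetch a course from the swirl_courses repo
swirl()                          # start the interactive lessons in the console
```

From there, swirl prompts you for answers directly in the R console and checks them as you go.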