The following are some notes taken while doing a course on statistics. Again, I'm using R Markdown to produce this, both as a way for me to practice using it and because I think it's an awesome tool for documenting as you code.
It’s really meant for my future reference, since there is quite a high probability that I’ll forget all of these neat functions a couple of months from now :P
1. The respiratory disturbance index (RDI), a measure of sleep disturbance, for a specific population has a mean of 15 (sleep events per hour) and a standard deviation of 10. They are not normally distributed. Give your best estimate of the probability that a sample mean RDI of 100 people is between 14 and 16 events per hour?
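Answer: Recall that variance and standard deviation are related by
Variance = (Standard deviation)^2
Standard deviation = sqrt(Variance)
Recall also the formula that derives the variance of the sample mean from the variance of the population:
Variance of the sample mean = (Population standard deviation)^2 / (Sample size)
Even though the RDI values themselves are not normally distributed, with a sample of 100 people the central limit theorem says the sample mean is approximately normal, which is why pnorm can be used here.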
# Calculate the standard deviation of the sample mean (the standard error).
# First, let's get the variance of the sample mean.
10^2/100
##  1
# Thus the standard error is
sqrt(1)
##  1
# Since the standard error is 1, 14 and 16 events per hour are 1 standard error below and above the mean of 15. We can thus calculate the probability in that area.
pnorm(1)-pnorm(-1)
##  0.6826895
Thus the answer is around 68%.
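As a cross-check outside R, here is a minimal sketch of the same calculation in Python. It assumes only the standard library (`statistics.NormalDist`); the variable names are mine, not from the course.

```python
from math import sqrt
from statistics import NormalDist

population_mean = 15.0  # sleep events per hour
population_sd = 10.0
n = 100                 # sample size

# Standard error of the sample mean: sigma / sqrt(n)
standard_error = population_sd / sqrt(n)

# By the central limit theorem the sample mean is approximately
# Normal(population_mean, standard_error), even though RDI itself is not normal.
sampling_dist = NormalDist(mu=population_mean, sigma=standard_error)
prob = sampling_dist.cdf(16) - sampling_dist.cdf(14)
print(round(prob, 4))
```

Working on the original scale (14 to 16 around a mean of 15) gives the same area as `pnorm(1)-pnorm(-1)` on the standardized scale.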
2. You flip a fair coin 5 times, about what’s the probability of getting 4 or 5 heads?
Answer: To solve this, we need to know the number of ways to get 4 heads and 1 tail when flipping the coin 5 times.
A great tutorial for this (which basically solves this whole problem, actually) is available at Khan Academy.
Anyway, here is an example of how to solve the problem using R.
# Get the number of combinations for 4 heads and 1 tail
factorial(5)/(factorial(4)*factorial(5-4))
##  5
# There is only 1 combination for 5 heads and 0 tails.
# The probability of heads and of tails is 50% each since it's a fair coin. This simplifies our calculation a lot.
5*0.5^5+1*0.5^5
##  0.1875
# If it's not a fair coin, say 70% heads and 30% tails, then the calculation becomes
5*0.7^4*0.3^1+1*0.7^5
##  0.52822
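The two calculations above are both the upper tail of a binomial distribution, so they can be wrapped in one small function. This is a rough Python sketch using only the standard library; `prob_at_least` is a helper name I made up for illustration.

```python
from math import comb

def prob_at_least(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p), summed term by term."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )

fair = prob_at_least(4, 5, 0.5)    # fair coin: 4 or 5 heads in 5 flips
biased = prob_at_least(4, 5, 0.7)  # 70% heads, 30% tails
print(fair, round(biased, 5))
```

`comb(5, 4)` plays the same role as the `factorial(5)/(factorial(4)*factorial(5-4))` expression above.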
3. Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35-44 year old has a DBP less than 70?
# We know that 70 is 1 standard deviation below the mean (i.e. a z-score of -1). Thus, use the pnorm function.
pnorm(-1)
##  0.1586553
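The z-score step is done in my head above, so here is a small Python sketch that spells it out. It assumes the standard library's `statistics.NormalDist`; the variable names are only for illustration.

```python
from statistics import NormalDist

dbp = NormalDist(mu=80, sigma=10)  # diastolic blood pressure, mm Hg

# z-score: how many standard deviations 70 lies from the mean
z = (70 - dbp.mean) / dbp.stdev

# Probability of a DBP below 70, computed on the original scale
prob_below_70 = dbp.cdf(70)

# The same probability on the standardized scale, like pnorm(-1)
prob_standardized = NormalDist().cdf(z)
print(z, round(prob_below_70, 7))
```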
4. Brain volume for adult women is normally distributed with a mean of about 1,100 cc for women with a standard deviation of 75 cc. What brain volume represents the 95th percentile?
# Just plug the numbers into the qnorm function
round(qnorm(.95, mean=1100, sd=75),3)
##  1223.364
5. Brain volume for adult women is about 1,100 cc for women with a standard deviation of 75 cc. Consider the sample mean of 100 random adult women from this population. What is the 95th percentile of the distribution of that sample mean?
# Get the variance of the sample mean
75^2/100
##  56.25
# Plug in the numbers, using the square root of that variance as the standard deviation
round(qnorm(.95, mean=1100, sd=sqrt(56.25)),3)
##  1112.336
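Questions 4 and 5 differ only in which distribution the percentile is taken from: a single woman's brain volume versus the mean of 100 women. A minimal Python sketch of that contrast, assuming the standard library's `statistics.NormalDist` (variable names are mine):

```python
from math import sqrt
from statistics import NormalDist

mean, sd, n = 1100.0, 75.0, 100  # cc, cc, sample size

# Distribution of one individual's brain volume
individual = NormalDist(mean, sd)

# Distribution of the sample mean: same center, sd shrunk by sqrt(n)
sample_mean = NormalDist(mean, sd / sqrt(n))

p95_individual = individual.inv_cdf(0.95)
p95_sample_mean = sample_mean.inv_cdf(0.95)
print(round(p95_individual, 3), round(p95_sample_mean, 3))
```

The 95th percentile of the sample mean sits much closer to 1,100 because averaging 100 observations cuts the spread by a factor of 10.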
6. Consider a standard uniform density. The mean for this density is .5 and the variance is 1 / 12. You sample 1,000 observations from this distribution and take the sample mean, what value would you expect it to be near?
# Get the standard deviation of the sample mean (the standard error)
sqrt(1/12/1000)
##  0.009128709
Thus we expect the sample mean to be very close to the population mean (i.e. 0.5), since its standard error is only about 0.009.
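A quick simulation makes this concrete. The sketch below is in Python with only the standard library; the seed of 42 is an arbitrary choice of mine so that the run is repeatable.

```python
import random
from math import sqrt
from statistics import fmean

random.seed(42)  # arbitrary seed, only for repeatability
n = 1000

# Draw n observations from the standard uniform distribution on [0, 1)
sample = [random.random() for _ in range(n)]
sample_mean = fmean(sample)

# Theoretical standard error of the mean: sqrt(variance / n) with variance 1/12
standard_error = sqrt(1 / 12 / n)
print(sample_mean, standard_error)
```

The simulated mean should land within a few standard errors of 0.5, which is what "near 0.5" means in practice.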
7. The number of people showing up at a bus stop is assumed to be Poisson with a mean of 5 people per hour. You watch the bus stop for 3 hours. About what’s the probability of viewing 10 or fewer people?
# Use the ppois function. lambda is the expected count over the whole observation window: 5 people per hour for 3 hours.
ppois(10, lambda=5*3)
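To see what ppois is adding up, here is a rough Python sketch that sums the Poisson probabilities directly. It uses only the standard library, and `poisson_cdf` is a helper name I made up for illustration.

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the probability mass function."""
    return sum(exp(-lam) * lam**i / factorial(i) for i in range(k + 1))

rate_per_hour = 5
hours = 3

# Expected count over the whole 3-hour window is 5 * 3 = 15
prob = poisson_cdf(10, rate_per_hour * hours)
print(round(prob, 4))
```

This comes out to roughly 0.118, so seeing 10 or fewer people in 3 hours happens about 12% of the time.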
I was recently stuck on this error while trying to do a JOIN in Hive. Below is my SQL:
create table ma.internet_v2 as
select
  A.msisdn,
  A.imsi,
  A.dt,
  A.start_time,
  A.end_time,
  A.url,
  A.ttl_connection_dur_ms,
  A.ttl_upload_bytes,
  A.ttl_download_bytes,
  A.ttl_cdr_cnt,
  coalesce(B.domain_desc, A.domain_desc) domain_desc,
  coalesce(B.subdomain_desc, A.subdomain_desc) subdomain_desc
from db.internet A
left outer join hfz_domain_name_mapping B
  on instr(A.url, B.url_pattern) > 0;
Hive then gave me the error message "Both Left and Right Aliases Encountered in Join".
After going through some mail threads, it seems that Hive supports only certain types of join: equality joins, outer joins, and left semi joins. Hive does not support join conditions that are not equality conditions, as it is very difficult to express such conditions as a map/reduce job.
A poster from Stack Overflow suggested that one should use WHERE instead of ON. Duly following his advice, it seemed to do the tr…
I was working on a stored procedure to calculate Dijkstra's shortest path when I ran into this problem (as stated above).
I looked through Netezza's Stored Procedure guide but couldn't find anything of use (perhaps I was not looking hard enough).
Even more unfortunately for me, most of the other SQL variants out there couldn't point me in the right direction either (not even PostgreSQL!).
After examining the error code in Aginity multiple times, I inferred that the INTO probably had to be put after the statement, since the error message was complaining about not being able to select a variable before doing an INTO.
So what if the variable was put after the INTO?
Maybe even after the whole statement itself.
DECLARE
  vID varchar;
  vESTIMATE integer;
...
...
select '5','8' --id , estimate,
from (
  select row_number() over (order by estimate) row_num, id, estimate
  from SNA_TEMP_PATH
  where done = 0
  order by estimat…
To split a column's value in Netezza you can use the array_split function.
For example, if column AB_MSISDN has a value like "01212345679|019234567679" and we'd like to split this into an A number and a B number, we could use the below command in Netezza:
Doing the above will split the values into arrays. However you wouldn't be able to access the value directly. To do this, you use the get_value_varchar function. Example below:
select
  ab_msisdn,
  get_value_varchar(array_split(ab_msisdn,'|'),1) source,
  get_value_varchar(array_split(ab_msisdn,'|'),2) target
from telco_edgelist;
Of course one could argue that there are other ways to do this such as using a substring or regex. This is just another option.
For more details on the functions above, do visit IBM's documentation.