Descriptive and Inferential Statistics

  • Descriptive statistics are functions of the data that are intrinsically interesting in describing some feature of the data. Classic descriptive statistics include mean, min, max, standard deviation, median, skew, kurtosis.
  • Inferential statistics are functions of the sample data that help you draw an inference regarding a hypothesis about a population parameter. Classic inferential statistics include z, t, χ², F-ratio, etc.

    There are two major divisions of inferential statistics:

    1. A confidence interval gives a range of values for an unknown parameter of the population by measuring a statistical sample. This is expressed in terms of an interval and the degree of confidence that the parameter is within the interval.
    2. A test of significance (hypothesis test) evaluates a claim about the population by analyzing a statistical sample. By design there is some uncertainty in this process, which is expressed as a level of significance.
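
To make the distinction above concrete, here is a minimal Python sketch. The ten scores are made-up illustrative values, and the interval uses a normal approximation, which is an assumption; with a sample this small, a t-based critical value would give a slightly wider interval.

```python
import math
import statistics

# Hypothetical sample of 10 test scores (illustrative values, not real data).
scores = [48, 52, 55, 47, 60, 51, 49, 58, 53, 57]

# Descriptive statistics: they summarize the sample itself.
n = len(scores)
mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
lowest, highest = min(scores), max(scores)

# Inferential statistic: an approximate 95% confidence interval
# for the unknown population mean (normal approximation assumed).
z = statistics.NormalDist().inv_cdf(0.975)  # roughly 1.96
margin = z * sd / math.sqrt(n)
ci = (mean - margin, mean + margin)

print(f"mean={mean}, median={median}, sd={sd:.2f}, min={lowest}, max={highest}")
print(f"approx. 95% CI for the population mean: {ci[0]:.2f} to {ci[1]:.2f}")
```

The first group of numbers describes only these ten scores. The interval is a statement about the population they were drawn from.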

Example-

Say we want to assess the effects of vitamin C on cognitive ability in adults. Rather than using the entire population of all adults in India, we select a random sample of 1000 adults. One half consume 500 mg of vitamin C daily for 4 weeks and the other half do not.

Say that the average cognitive ability for adults who do not consume vitamin C is M = 50 (higher numbers indicate better cognitive ability). The average cognitive ability for those adults who consumed vitamin C during the past 4 weeks is M = 65.

The data indicate a 15-point difference between the two samples.

There are two possible interpretations:

1)  There is no “real” difference between the two groups (suggesting the mean differences are simply due to chance factors — i.e., sampling error).

OR

2)  The sampling data reflect a “true” difference between the two groups.

 

The goal of inferential statistics is to help researchers decide between the two interpretations.

Inferential statistics begins with actual data (sample data) from the experiment above and ends with a probability statement (i.e., the probability of obtaining data like those above if there is no effect of vitamin C on cognitive ability in the population).

If this probability is very small (conventionally p < .05), chance alone is an unlikely explanation for the mean difference, and we conclude that vitamin C does affect cognitive ability. That is, the observed data are not what would be expected by chance alone.
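
One way to arrive at such a probability statement is a permutation test, sketched below in Python. This is only an illustration, not necessarily the test a real study would use. The scores are simulated, the standard deviation of 10 is an assumed value, and `permutation_p_value` is a hypothetical helper name. The idea: if vitamin C had no effect, the group labels would be arbitrary, so we shuffle them many times and count how often chance alone produces a difference at least as large as the observed one.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_permutations=1000, seed=0):
    """One-sided permutation test for mean(group_b) - mean(group_a).

    Returns the observed difference and the estimated probability of a
    difference at least that large when the group labels are arbitrary.
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labeling itself, so p is never exactly 0.
    return observed, (hits + 1) / (n_permutations + 1)

# Simulated scores (not real data): 500 adults per group, assumed SD of 10.
rng = random.Random(42)
control = [rng.gauss(50, 10) for _ in range(500)]
vitamin_c = [rng.gauss(65, 10) for _ in range(500)]

observed, p = permutation_p_value(control, vitamin_c)
print(f"observed difference: {observed:.1f} points, p = {p:.4f}")
```

With a simulated gap this large, hardly any shuffle reproduces the observed difference, so p falls well below .05 and the "chance only" interpretation is rejected.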

Understanding Statistics

“Lies, damned lies, and statistics” - a phrase popularized by Mark Twain

Before we delve into the advanced realms of analytics, it is important to understand basic statistical concepts, so we will start by exploring basic statistics.

Statistics is a field of study concerned with:

  • Summarizing data
  • Interpreting data
  • Making decisions based on data

Any dataset is basically one of two types:
Population - includes all the elements from a set of data.
Sample - consists of one or more observations from the population.

Sometimes it is not possible to collect information on the total population, so it is more cost-effective and time-saving to work with samples. Measurable characteristics of a population are known as parameters, and characteristics of a sample are known as statistics.
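
The relationship between a parameter and a statistic can be sketched in Python as follows. The population here is simulated, and its mean of 50 and standard deviation of 10 are assumed values chosen only for illustration. In practice the parameter is usually unknown, and the statistic is what we can actually compute.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: 100,000 simulated scores (mean 50, SD 10 assumed).
population = [rng.gauss(50, 10) for _ in range(100_000)]

# A simple random sample of 1,000 observations from that population.
sample = rng.sample(population, 1000)

parameter = statistics.fmean(population)  # population mean: the parameter
statistic = statistics.fmean(sample)      # sample mean: the statistic

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

The sample mean will not match the population mean exactly, but it should land close to it while requiring only 1% of the data.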

Types of statistical engagement we can have with our data:

Descriptive– Collecting, Summarizing and Describing data.
To compare datasets we need to understand their distributions (the spread of the data) and central tendency (the center of that spread), and how these differ across the datasets.

Inferential– Drawing conclusions and making decisions about a population based on sample data.
The methods of inferential statistics are (1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
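
On the descriptive side, the short Python sketch below shows why both central tendency and spread are needed to compare datasets. The two datasets are made-up values, and `describe` is a hypothetical helper name.

```python
import statistics

# Two hypothetical datasets with the same center but very different spread.
dataset_a = [48, 49, 50, 51, 52]
dataset_b = [30, 40, 50, 60, 70]

def describe(data):
    """Return a few classic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "stdev": statistics.stdev(data),  # sample standard deviation
        "range": max(data) - min(data),
    }

summary_a = describe(dataset_a)
summary_b = describe(dataset_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```

Both datasets share the same mean and median, so a measure of center alone would call them identical. The standard deviation and range show that B is far more spread out.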

We use various sampling methods to draw samples from a population.
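
The text does not say which sampling methods it has in mind. As an illustration only, here is a Python sketch of three common ones: simple random, systematic, and stratified sampling. The population, the "region" field, and all function names are hypothetical.

```python
import random

rng = random.Random(7)

# Hypothetical population of 100 people: 70 tagged "urban", 30 tagged "rural".
population = [
    {"id": i, "region": "urban" if i < 70 else "rural"} for i in range(100)
]

def simple_random_sample(units, k, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, k)

def systematic_sample(units, k, rng):
    """Pick a random start, then take every step-th unit."""
    step = len(units) // k
    start = rng.randrange(step)
    return units[start::step][:k]

def stratified_sample(units, k, key, rng):
    """Sample each stratum in proportion to its size.

    Rounding can make the total differ from k when the proportions
    are not whole numbers; this sketch does not correct for that.
    """
    strata = {}
    for unit in units:
        strata.setdefault(unit[key], []).append(unit)
    chosen = []
    for members in strata.values():
        share = round(k * len(members) / len(units))
        chosen.extend(rng.sample(members, share))
    return chosen

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, 10, "region", rng)

urban_in_stratified = sum(1 for unit in stratified if unit["region"] == "urban")
print(len(srs), len(systematic), len(stratified), urban_in_stratified)
```

Stratified sampling guarantees that the sample keeps the population's 70/30 urban to rural split, which a simple random sample only achieves on average.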

We will take up these topics one by one and dig as deep as possible to understand the basics.

What is R

“R is a language and environment for statistical computing and graphics.” This definition comes from the Holy Grail: www.r-project.org. Most of us are probably already familiar with R, but my aim is to get into the nitty-gritty of the explanations.

R is a GNU project. So, what do we mean by a GNU project?

History- Richard Stallman started the GNU Project in 1983. GNU is a recursive acronym for “GNU’s Not Unix!”. He wanted GNU to be a completely free, UNIX-compatible OS made up of free (open source) software.
By 1991, the GNU Project had finished many of the pieces of the GNU operating system, including the GNU C Compiler (gcc), the bash command-line shell, many shell utilities, the Emacs text editor, and more. For the graphical desktop, the free X Window System was used.
The kernel, the core part of the operating system, was seen by the GNU Project as “the last missing piece”: the GNU kernel was not complete.
In 1991, Linus Torvalds released the first version of the Linux kernel. Distributions then started to bundle the Linux kernel, GNU software, and the X Window System together.

When we say free software, there are guidelines that define which freedoms the users of a program must have:

  • The freedom to run the program as you wish, for any purpose
  • The freedom to study how the program works and change it, which requires access to the source code
  • The freedom to redistribute copies
  • The freedom to distribute copies of your modified versions to others.

RStudio is a free and open source integrated development environment (IDE) for R.

R can be extended (easily) via packages. There are about eight packages supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics. You can load objects into memory and work with them, and when you shut down your environment your data does not have to be lost: you can save the workspace (into the .RData file), and it is retained per project.

R-Mailing List- What do you do when you get stuck? Post your question to the relevant mailing list. There are five general mailing lists devoted to R. You can read more and subscribe to the mailing lists at https://www.r-project.org/mail.html.