Sunday, 3 April 2016

Journey towards Machine Learning using R - part 1

"What is Machine Learning?"


Initially, when I started googling and reading about Machine Learning, I felt as if rockets were bombarding me from every direction :-). Machine learning is a vast area, and covering all of its features is well beyond the scope of this post.

There are many definitions available on the web. In the simplest form we can say: "Machine learning refers to the techniques for recognizing and understanding vast amounts of data and making wise decisions based on that data by developing algorithms."

There are several ways to implement machine learning techniques.

Recommendation, for example, is a popular technique that provides close recommendations based on a user's previous purchases, clicks, and ratings.

Apache Mahout is a classic example. It is an open source project for producing machine learning algorithms.

There are many open source projects available for producing scalable machine learning algorithms. In this post I will concentrate on the basics of R programming.

Installing R on Machine

The easiest way to set up R is to download a copy of it from here, and the IDE RStudio from here, which makes R coding much easier and faster.

After a successful installation of R you can launch the GUI console. A sample snapshot is shown below for reference.





Understanding R

Snapshot from RStudio




As you can see from the snapshot, for variable assignment we can use <- or = or ->.
The # character is used for commenting.
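Since the snapshot may not be visible everywhere, here is a minimal sketch of the three assignment forms and a comment. The variable names first, second and third are only illustrative and are not taken from the snapshot.

```r
# Three equivalent ways to assign the value 10
first <- 10    # leftward arrow, the style most R code uses
second = 10    # equals sign
10 -> third    # rightward arrow, value on the left

# All three variables now hold the same value
print(c(first, second, third))
print(identical(first, second) && identical(second, third))
```

In practice <- is the form you will see most often in R code, so it is a reasonable default.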

Data structure

Selecting a data structure to hold data is an important task. In R, data sources can include text files, spreadsheets, statistical packages, databases and so on.

R contains a wide variety of structures for holding data, including scalars, vectors, arrays, data frames and lists. Unlike Java, variables do not need to be declared with a data type.
We can find out the data type of a value using the command below
> flag <- TRUE
> print(class(flag))
[1] "logical"
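To make the point about undeclared types concrete, here is a small sketch that checks class() for a few kinds of values. The variable names are my own choice for illustration.

```r
# No type declarations: the class comes from the value assigned
num  <- 42.5      # numeric (double)
int  <- 42L       # integer, note the L suffix
txt  <- "hello"   # character
flag <- TRUE      # logical

print(class(num))
print(class(int))
print(class(txt))
print(class(flag))

# The same name can later hold a value of a different type
flag <- "now a string"
print(class(flag))
```

The last two lines show the difference from Java most clearly: the variable flag simply takes on the class of whatever was assigned most recently.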

Vectors

Vectors are one-dimensional arrays. The combine function c() is used to form a vector.
> a<- c(11,21,31,41,51)
> print(a)
[1] 11 21 31 41 51
> a[3]
[1] 31
> a[2:4]
[1] 21 31 41
Note: Scalars are one-element vectors.
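A few more vector operations are worth trying alongside the indexing above. This is only a sketch using the same vector a; the scalar s is a name I introduced to illustrate the note about scalars.

```r
a <- c(11, 21, 31, 41, 51)

print(length(a))     # number of elements
print(a * 2)         # arithmetic is applied element by element
print(a[a > 30])     # logical indexing keeps elements where the test is TRUE
print(a[-1])         # a negative index drops that position

# A "scalar" is just a vector of length one
s <- 7
print(is.vector(s))
print(length(s))
```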

Matrices

A matrix is a two-dimensional array in which every element has the same type, such as numeric, character or logical.
> rownames<-c("Row1","Row2","Row3","Row4","Row5")
> colnames<-c("Column1","Column2","Column3","Column4")
> X<-matrix(1:20,nrow=5,ncol=4,byrow=TRUE,dimnames=list(rownames,colnames))
> x
Error: object 'x' not found
> print(x)
Error in print(x) : object 'x' not found

Please note that variable names are case sensitive: the matrix was stored as X, so asking for x causes the errors above (shown in red in the console).

> X
     Column1 Column2 Column3 Column4
Row1       1       2       3       4
Row2       5       6       7       8
Row3       9      10      11      12
Row4      13      14      15      16
Row5      17      18      19      20

dimnames is optional and is used to give the rows and columns their labels.
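Once a matrix has dimnames, elements can be picked either by position or by label. Here is a minimal sketch that rebuilds the same 5 x 4 matrix under the name m (my own name, chosen to avoid clashing with X above) and uses paste0() to generate the labels.

```r
# Same matrix as above: values 1..20 filled row by row, with labels
m <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE,
            dimnames = list(paste0("Row", 1:5), paste0("Column", 1:4)))

print(dim(m))                  # rows and columns
print(m[2, 3])                 # element by position
print(m["Row2", "Column3"])    # the same element by label
print(m[, "Column1"])          # a whole column, returned as a named vector
print(m["Row5", ])             # a whole row
```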


Arrays
Arrays are similar to matrices but can have more than two dimensions.
> X<-array(1:20,c(2,3,4))
> X
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19    1    3
[2,]   20    2    4
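One detail in the output above is easy to miss: a 2 x 3 x 4 array has 24 cells, but 1:20 supplies only 20 values, so R recycles the data from the start to fill the last slice. The sketch below checks this with three-part indexing; arr is simply my name for the same array.

```r
arr <- array(1:20, c(2, 3, 4))

print(dim(arr))        # the three dimensions
print(length(arr))     # total number of cells

# Indexing takes one position per dimension: row, column, slice
print(arr[2, 3, 1])    # last cell of the first slice
print(arr[1, 2, 4])    # a cell in the fourth slice, filled by recycling
print(arr[, , 2])      # a whole slice comes back as a 2 x 3 matrix
```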



Data frames

Data frames are the most commonly used data structure in R. A data frame can contain different modes of data, such as numeric and character, but remember that each column must hold only one mode.
> studentID<- c(101,102,103,104)
> age<-c(25,24,26,25)
> grade<-c("good","poor","improved","excellent")
> score<-c(70,45,60,90)
> studentDetails<-data.frame(studentID,age,grade,score)
> studentDetails
  studentID age     grade score
1       101  25      good    70
2       102  24      poor    45
3       103  26  improved    60
4       104  25 excellent    90
> studentDetails[1:3]
  studentID age     grade
1       101  25      good
2       102  24      poor
3       103  26  improved
4       104  25 excellent

> studentDetails$score
[1] 70 45 60 90
> studentDetails[c("studentID","score")]
  studentID score
1       101    70
2       102    45
3       103    60
4       104    90
> table(studentDetails$score,studentDetails$grade)
   
     excellent good improved poor
  45         0    0        0    1
  60         0    0        1    0
  70         0    1        0    0
  90         1    0        0    0
> max(studentDetails$score)
[1] 90

Now if we use plot(studentDetails$studentID,studentDetails$score), R draws a scatter plot of score against student ID.
Execute plot(studentDetails$studentID,studentDetails$score,type = "o") in R and see the result :-).
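Beyond selecting columns, data frames are often filtered and sorted by row. The following is a small sketch on the same student data. The names passed and ranked, and the pass mark of 60, are my own assumptions for illustration.

```r
studentDetails <- data.frame(
  studentID = c(101, 102, 103, 104),
  age       = c(25, 24, 26, 25),
  grade     = c("good", "poor", "improved", "excellent"),
  score     = c(70, 45, 60, 90)
)

# Keep only the rows where the condition is TRUE (note the trailing comma)
passed <- studentDetails[studentDetails$score >= 60, ]
print(passed)

# Sort rows by score, highest first
ranked <- studentDetails[order(-studentDetails$score), ]
print(ranked$studentID)

# Summary statistics work on a single column
print(mean(studentDetails$score))
```

The comma inside the square brackets matters: what comes before it selects rows, what comes after it selects columns, and leaving the column part empty keeps all columns.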


List

A list can gather any of the objects and structures we have seen so far.
listExample<- list(obj1,obj2,…)
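As a concrete sketch of that pattern, the list below mixes a number, a character vector and a matrix. The element names id, tags and scores are made up for this example.

```r
# A list can hold elements of different types and shapes
listExample <- list(
  id     = 101,
  tags   = c("good", "improved"),
  scores = matrix(1:4, nrow = 2)
)

print(names(listExample))

# $ and [[ ]] return the element itself
print(listExample$tags[2])
print(listExample[["scores"]][2, 1])

# Single brackets return a smaller list, not the element
print(class(listExample["id"]))
print(class(listExample[["id"]]))
```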


Importing Data into R

  • The edit() function can be used to take input from the user
It's important to store the data in a variable; otherwise all the entered data will be lost. See the image above.
  • Import data from text file

If you have RStudio installed, you can take advantage of its code completion suggestions, as shown below



To get and set the current working directory we can use the commands below
> getwd()
[1] "C:/Users/aniket/Documents"
> setwd("E:/tmp/data/")
> getwd()
[1] "E:/tmp/data"
 
It's important to set the current working directory to the location of the file you want to read.
To read a file in table format and create a data frame from it, we can use the options below; we can then manipulate the data in the same way as any other data frame

  • tableData<-read.table("sample_data4.txt",header=TRUE,sep=",")

  • tableData<-read.delim("sample_data3.txt",header=TRUE,sep=",")
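The sample files used above are not included in this post, so here is a self-contained sketch that first writes a small comma-separated file to a temporary location and then reads it back with read.table(). The file contents and the name tmpFile are assumptions made only for this example.

```r
# Create a small comma-separated text file in a temporary location
tmpFile <- tempfile(fileext = ".txt")
writeLines(c("studentID,age,score",
             "101,25,70",
             "102,24,45"), tmpFile)

# header = TRUE treats the first line as column names; sep sets the delimiter
tableData <- read.table(tmpFile, header = TRUE, sep = ",")

print(tableData)
print(class(tableData))    # the result is a data frame
print(names(tableData))

# Clean up the temporary file
unlink(tmpFile)
```

Using tempfile() also avoids depending on the working directory, which is handy when trying things out.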


Also we can use the option available in RStudio i.e. Tools->Import Dataset->From Local File…

Note: To get help for any command you can use help(), for example help("read.delim2")



We can read and manipulate data from csv and xlsx file formats as well. Sometimes we may have to install new packages for this kind of activity.
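For csv files no extra package is needed, because write.csv() and read.csv() ship with R. Here is a minimal round-trip sketch; the data and the names scores, csvFile and csvData are invented for the example. Reading xlsx files does need an add-on package (readxl is one option), which is not shown here.

```r
scores <- data.frame(studentID = c(101, 102), score = c(70, 45))

# Write the data frame to a temporary csv file without row numbers
csvFile <- tempfile(fileext = ".csv")
write.csv(scores, csvFile, row.names = FALSE)

# Read it back; read.csv assumes a header line and a comma separator
csvData <- read.csv(csvFile)
print(csvData)

unlink(csvFile)
```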



Working with R Packages

To see all the packages that are installed and available to load, you can use the library() function.

To install a new package we can use install.packages("Name of the package")




or we can use the option in RStudio i.e.


Tools->Install Packages…

To load an installed package you can simply use library("package name")
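The install and load steps can be combined so that a package is installed only when it is missing. The helper below is a sketch of that idea; load_or_install is a hypothetical name of my own, not a built-in function. It is demonstrated with stats, which ships with R, so nothing is actually downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name held in a variable
  library(pkg, character.only = TRUE)
}

load_or_install("stats")

# search() lists what is currently attached
print("package:stats" %in% search())
```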

  • One interesting package I came across is sqldf, which gives you the power to manipulate data frames using SQL as well.

> library("sqldf")
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: DBI
> studentID<- c(101,102,103,104)
> age<-c(25,24,26,25)
> grade<-c("good","poor","improved","excellent")
> score<-c(70,45,60,90)
> studentDetails<-data.frame(studentID,age,grade,score)
> QueryData<-sqldf("select * from studentDetails where studentId=101",row.names=TRUE)
Loading required package: tcltk
> QueryData
  studentID age grade score
1       101  25  good    70
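SQL becomes more interesting with grouping. As a sketch, the code below computes the average score per age, first in base R with aggregate() and then with sqldf if that package happens to be installed. The column alias avgScore and the variable names baseResult and sqlResult are my own choices.

```r
studentDetails <- data.frame(
  studentID = c(101, 102, 103, 104),
  age       = c(25, 24, 26, 25),
  score     = c(70, 45, 60, 90)
)

# Base R: mean score for each age
baseResult <- aggregate(score ~ age, data = studentDetails, FUN = mean)
print(baseResult)

# The same question asked in SQL, run only when sqldf is available
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  sqlResult <- sqldf(
    "select age, avg(score) as avgScore from studentDetails group by age"
  )
  print(sqlResult)
}
```

Both approaches should give one row per age; which one reads better is mostly a matter of whether you think in SQL or in R.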


R Code Sample

R syntax is different, but if you have a good grasp of a language like Java, it will not take long to get a grip on basic R syntax such as conditions, loops and functions. Below is a sample of R code that can be useful.

> new.function <- function(a) { # define a new function
+   if (a %in% 8:12) { # run only when a is between 8 and 12
+     for (i in 1:a) { # iterate from 1 up to the value of a
+       if (i == 3) {
+         next # skip this iteration, like continue in Java
+       } else {
+         b <- i^2
+         print(b)
+       }
+     }
+   }
+ }
> new.function(9) # call the new function
[1] 1
[1] 4
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
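Loops like this can often be replaced by a vectorized expression in R. The following is a sketch of an alternative that returns the squares as a vector instead of printing them one by one. The function name squaresSkipping and its skip parameter are hypothetical.

```r
# Hypothetical vectorized variant: returns the squares instead of printing them
squaresSkipping <- function(a, skip = 3) {
  # Same guard as above: do nothing unless a is between 8 and 12
  if (!(a %in% 8:12)) {
    return(NULL)
  }
  i <- setdiff(1:a, skip)   # all values from 1 to a except the skipped one
  i^2                       # squaring is applied to the whole vector at once
}

print(squaresSkipping(9))
print(squaresSkipping(5))   # outside 8 to 12, so NULL
```

Returning a vector rather than printing makes the result easy to reuse, for example in sum() or plot().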

Almost every sector, such as Retail, Healthcare & Life Sciences and Banking, can leverage the benefits of Machine Learning. But we need to identify and understand where exactly we can get the most benefit out of it.

In the ECM space we can use these techniques to give end users or auditors better insight into audit trail data.

Do share your valuable thoughts, and happy learning ;-).
