"What is Machine Learning?"
When I first started googling and reading about machine learning, I felt as if rockets were bombarding me from every side :-). Machine learning is a vast area, and covering all of it is well beyond the scope of this post.
There are many definitions available on the web. In the simplest form we can say: "Machine learning refers to techniques for recognizing and understanding patterns in large amounts of data, and for making wise decisions based on that data by developing algorithms."
There are several ways to implement machine learning techniques. Recommendation, for example, is a popular technique that provides close recommendations based on a user's previous purchases, clicks, and ratings.
Apache Mahout is a classic example. It is an open source project for producing scalable machine learning algorithms.
There are many such open source projects available. In this post I will concentrate on the basics of R programming.
Installing R on Your Machine
The easiest way to set up R is to download a copy of it from here and the RStudio IDE from here, which makes R coding much easier and faster.
After a successful installation of R you can launch the GUI console. For reference, here is a sample snapshot:
Snapshot from RStudio
As you can see from the snapshot, variable assignment can use <-, = or ->.
# is used for comments.
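Here is a minimal sketch of the three assignment forms together with a comment. The variable names (x, y, z, total) are made up purely for illustration.

```r
# Everything after '#' on a line is ignored by R
x <- 10        # leftward assignment, the most common style
y = 20         # '=' also assigns at the top level
30 -> z        # rightward assignment
total <- x + y + z
print(total)
```

All three forms create an ordinary variable. Most R code you will come across sticks to <-.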
Data Structures
Selecting a data structure to hold your data is an important task. In R, data can come from text files, spreadsheets, statistical packages, databases and so on.
R offers a wide variety of structures for holding data, including scalars, vectors, arrays, data frames and lists. Unlike Java, variables do not need to be declared with a data type.
We can find out the data type of a variable using the command below:
> flag <- TRUE
> print(class(flag))
[1] "logical"
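As a small sketch, the same class() call can be tried on a few other kinds of values. The literal values here are arbitrary examples.

```r
print(class(42))         # "numeric"
print(class(42L))        # "integer" (the L suffix marks an integer literal)
print(class("hello"))    # "character"
print(class(TRUE))       # "logical"
print(class(c(1, 2)))    # "numeric"; a vector reports the type of its elements
```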
Vectors
Vectors are one-dimensional arrays. The combine function c() is used to form a vector, and square brackets select elements by position (counting from 1).
> a<- c(11,21,31,41,51)
> print(a)
[1] 11 21 31 41 51
> a[3]
[1] 31
> a[2:4]
[1] 21 31 41
Note: Scalars are simply one-element vectors.
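A short sketch of that note, plus two things vectors make easy: element-wise arithmetic and filtering with a condition. The names s and doubled and the threshold 30 are only illustrative.

```r
s <- 5                        # a "scalar" is just a vector of length one
print(is.vector(s))
print(length(s))

a <- c(11, 21, 31, 41, 51)
doubled <- a * 2              # arithmetic is applied element by element
print(doubled)
print(a[a > 30])              # logical indexing keeps only matching elements
```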
Matrices
A matrix is a two-dimensional array in which every element has the same type, such as numeric, character or logical.
> rownames<-c("Row1","Row2","Row3","Row4","Row5")
> colnames<-c("Column1","Column2","Column3","Column4")
> X<-matrix(1:20,nrow=5,ncol=4,byrow=TRUE,dimnames=list(rownames,colnames))
> x
Error: object 'x' not found
> print(x)
Error in print(x) : object 'x' not found
Please note that variable names are case sensitive. The matrix was stored as X, so referring to x causes the errors above (shown in red in the console).
> X
     Column1 Column2 Column3 Column4
Row1       1       2       3       4
Row2       5       6       7       8
Row3       9      10      11      12
Row4      13      14      15      16
Row5      17      18      19      20
dimnames is optional and supplies the row and column labels.
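Once a matrix exists, individual cells, rows and columns can be picked out by position or by label. The sketch below rebuilds the same 5 x 4 matrix under the name m, which is my own choice so that it does not clash with X above.

```r
m <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE,
            dimnames = list(paste0("Row", 1:5), paste0("Column", 1:4)))
print(m[2, 3])                 # single element: row 2, column 3
print(m["Row4", "Column1"])    # the same idea using the labels
print(m[, 2])                  # the whole second column
print(dim(m))                  # number of rows and columns
```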
Arrays
Arrays are similar to matrices but can have more than two dimensions.
> X<-array(1:20,c(2,3,4))
> X
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

, , 3

     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18

, , 4

     [,1] [,2] [,3]
[1,]   19    1    3
[2,]   20    2    4
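You may have noticed that the fourth slice above ends with 1, 2, 3, 4. A 2 x 3 x 4 array has 24 cells, but only 20 values were supplied, so R starts again from the beginning of the data to fill the rest. A small sketch to confirm this (the names arr and short are mine):

```r
arr <- array(1:24, c(2, 3, 4))     # 24 values fill the array exactly once
print(dim(arr))
print(arr[2, 3, 4])                # row 2, column 3 of the fourth slice

short <- array(1:20, c(2, 3, 4))   # only 20 values, so R recycles from 1
print(short[2, 3, 4])
```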
Data frames
Data frames are the most commonly used data structure in R. A data frame can contain different modes of data, such as numeric and character, but remember that each column must hold only one mode.
> studentID<- c(101,102,103,104)
> age<-c(25,24,26,25)
> grade<-c("good","poor","improved","excellent")
> score<-c(70,45,60,90)
> studentDetails<-data.frame(studentID,age,grade,score)
> studentDetails
  studentID age     grade score
1       101  25      good    70
2       102  24      poor    45
3       103  26  improved    60
4       104  25 excellent    90
> studentDetails[1:3]
  studentID age     grade
1       101  25      good
2       102  24      poor
3       103  26  improved
4       104  25 excellent
> studentDetails$score
[1] 70 45 60 90
> studentDetails[c("studentID","score")]
  studentID score
1       101    70
2       102    45
3       103    60
4       104    90
> table(studentDetails$score,studentDetails$grade)
     excellent good improved poor
  45         0    0        0    1
  60         0    0        1    0
  70         0    1        0    0
  90         1    0        0    0
> max(studentDetails$score)
[1] 90
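Two more everyday operations on the same data are filtering rows by a condition and adding a derived column. This is only a sketch: the pass mark of 60 and the column name passed are assumptions of mine, not part of the data.

```r
studentDetails <- data.frame(
  studentID = c(101, 102, 103, 104),
  age       = c(25, 24, 26, 25),
  grade     = c("good", "poor", "improved", "excellent"),
  score     = c(70, 45, 60, 90)
)

# Keep only the rows where score is at least 60 (assumed pass mark)
passedRows <- studentDetails[studentDetails$score >= 60, ]
print(passedRows$studentID)

# Add a logical column derived from an existing one
studentDetails$passed <- studentDetails$score >= 60
print(mean(studentDetails$score))   # average score
str(studentDetails)                 # column names and their types
```

Note the comma inside the square brackets: the condition before it selects rows, and leaving the part after it empty keeps all columns.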
Now if we use plot(studentDetails$studentID,studentDetails$score), R draws a scatter plot of the scores against the student IDs.
Execute plot(studentDetails$studentID,studentDetails$score,type = "o") in R and see the result :-).
List
A list can gather any kind of object or structure we have seen so far.
listExample<- list(obj1,obj2,…)
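As a concrete sketch, here is a list that mixes a string, a vector and a matrix. The element names (title, values, grid) are hypothetical and chosen just for this example.

```r
listExample <- list(title  = "demo",
                    values = c(11, 21, 31),
                    grid   = matrix(1:4, nrow = 2))
print(names(listExample))
print(listExample$values[2])          # access by name, then index into the vector
print(listExample[["grid"]][2, 1])    # [[ ]] extracts the element itself
print(length(listExample))            # number of top-level elements
```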
Importing Data into R
- The edit() function can be used to take input from the user
- Import data from text file
If you have RStudio installed, you can take advantage of its autocomplete suggestions while typing commands.
To get and set the current working directory we can use the commands below:
> getwd()
[1] "C:/Users/aniket/Documents"
> setwd("E:/tmp/data/")
> getwd()
[1] "E:/tmp/data"
It's important to set the working directory to the folder that contains the file you want to read.
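A cautious pattern is to remember the old working directory and restore it afterwards. In this sketch tempdir() is used only as a stand-in for a folder that is guaranteed to exist. Replace it with your own data folder.

```r
old <- getwd()           # remember where we started
setwd(tempdir())         # stand-in folder; always exists in an R session
print(getwd())
setwd(old)               # restore the original working directory
print(identical(getwd(), old))
```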
To read a file in table format and create a data frame from it we can use the options below. After that we can manipulate the data in the same way as any other data frame.
- tableData<-read.table("sample_data4.txt",header=TRUE,sep=",")
- tableData<-read.delim("sample_data3.txt",header=TRUE,sep=",")
We can also use the option available in RStudio, i.e. Tools->Import Dataset->From Local File…
Note: To get help for any command you can use help(), for example help("read.delim2")
We can read and manipulate data from csv and xlsx file formats as well. Sometimes we may have to install new packages for this kind of activity.
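To try the reading functions without hunting for a file, the sketch below first writes a tiny comma-separated file to a temporary location and then reads it back. The file contents are made-up sample data, and read.csv() is the base R shortcut for comma-separated files.

```r
# Write a small comma-separated file to a temporary location (made-up data)
path <- tempfile(fileext = ".csv")
writeLines(c("studentID,score", "101,70", "102,45"), path)

tableData <- read.table(path, header = TRUE, sep = ",")
csvData   <- read.csv(path)     # defaults to header = TRUE and sep = ","
print(tableData)
print(csvData$score)

unlink(path)                    # remove the temporary file
```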
Working with R Packages
To see all the installed packages you can use the library() function.
To install a new package we can use install.packages("Name of the package")
or we can use the option in RStudio, i.e.
Tools->Install Packages…
To load an installed package you can simply use library("package name")
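A common pattern is to install a package only when it is missing and then load it. In this sketch the package name "stats" is just a placeholder: it ships with R, so the install step is never reached here.

```r
pkg <- "stats"    # placeholder name; swap in the package you need

if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)              # only reached when the package is missing
}

# character.only = TRUE is needed because the name is held in a variable
library(pkg, character.only = TRUE)
print(pkg %in% loadedNamespaces())
```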
- One interesting package I came across is sqldf, which gives you the power to manipulate data frames using SQL as well.
> library("sqldf")
Loading required package: gsubfn
Loading required package: proto
Loading required package: RSQLite
Loading required package: DBI
> studentID<- c(101,102,103,104)
> age<-c(25,24,26,25)
> grade<-c("good","poor","improved","excellent")
> score<-c(70,45,60,90)
> studentDetails<-data.frame(studentID,age,grade,score)
> QueryData<-sqldf("select * from studentDetails where studentId=101",row.names=TRUE)
Loading required package: tcltk
> QueryData
  studentID age grade score
1       101  25  good    70
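For comparison, the same filter can be written in base R without any extra package. This is not part of sqldf. It is simply an equivalent sketch using subset().

```r
studentDetails <- data.frame(
  studentID = c(101, 102, 103, 104),
  age       = c(25, 24, 26, 25),
  grade     = c("good", "poor", "improved", "excellent"),
  score     = c(70, 45, 60, 90)
)

# Equivalent of: select * from studentDetails where studentID = 101
QueryData <- subset(studentDetails, studentID == 101)
print(QueryData)
```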
R Code Sample
R syntax is different, but if you have a good grasp of a language like Java it will not take long to get a grip on basic R syntax such as conditions, loops and functions. Below is some sample R code that can be useful.
> new.function <- function(a) {   # define a new function
+   if (a %in% 8:12) {            # run only when a is one of 8, 9, 10, 11, 12
+     for (i in 1:a) {            # loop from 1 up to the value of a
+       if (i == 3) {
+         next                    # skip this iteration, same as continue in Java
+       }
+       else {
+         b <- i^2
+         print(b)
+       }
+     }
+   }
+ }
> new.function(9)                 # call the new function
[1] 1
[1] 4
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
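Since R works on whole vectors, the same result can often be produced without an explicit loop. The sketch below is an alternative I am adding for illustration: the function name squares.except.three is hypothetical, and it returns the values instead of printing them one by one.

```r
squares.except.three <- function(a) {
  if (!(a %in% 8:12)) {
    return(numeric(0))      # outside the allowed range: empty result
  }
  i <- setdiff(1:a, 3)      # drop 3, which the loop skipped with 'next'
  i^2                       # the last expression is the return value
}

print(squares.except.three(9))
print(length(squares.except.three(5)))
```

Returning a vector makes the result easy to reuse, for example in sum() or in a new data frame column.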
Almost every sector, such as Retail, Healthcare & Life Sciences and Banking, can leverage the benefits of machine learning. But we need to identify and understand where exactly we can get the most benefit out of it.
In the ECM space we can use these techniques to give end users or auditors better insight into audit trail data.
Do share your valuable thoughts, and happy learning ;-).