SAP with R
Something that piqued my curiosity lately was the developments with SAP HANA and R (good overview here). This is definitely a new and exciting direction for SAP, with creating a well structured, and organised 'Big Table' option for in memory computing, and then going the extra mile to embed a specialised Open Source Statistical Computing package (R) in it - making the fore front of the world of statistical analysis open to those that dare.
This is utterly brilliant, but the problem is that I can't access it as I don't have access to a SAP HANA instance (nor would most people). It is also heavily geared to 'Big Data', when there is still an awful lot to be gained from small, and mid-range data analysis arenas (resisting the temptation about size and clichés).
This has definitely touched on my hackers itch, and in response to this I've created one more Scripting Language Connector for R - RSAP.
The idea of this is to enable RFC calls (using the SAP NW RFC SDK) where any table contents are returned as data.frames (in R parlance).
Once you have this data in R, then the world is your oyster - it is up to your imagination as to what you do with it. To give an overview of how it works, and what you can do, I'm going to step through the process of installing and using RSAP.
Obtaining and Installing
Firstly you need to install R. I recommend using RStudio as it is a comfortable graphical user interface - you can get it from here.
Under debian (read Ubuntu) flavoured Linux you can install R first before downloading/installing RStudio using:
sudo apt-get install r-base-core r-base-dev r-base-html r-recommended
SAP NW RFCSDK
The SDK is available from the SAP Service Market Place SWDC - this is a forum discussion on getting it http://scn.sap.com/thread/950318
If you have (like me) installed the NPL SAP Test Drive instance, then the SAP NW RFC libs exist in the /usr/sap/NPL/SYS/exe/run directory, the only problem being that it does not contain the C header files (really - SAP should make this available on SDN).
RSAP
Download or clone the RSAP project source from https://github.com/piersharding/RSAP
Building
Ensure that the R library prerequisites are installed. To do this there is a helper script in the RSAP source code directory. cd to the source directory (downloaded above) - in my case /home/piers/git/public/RSAP - and run the following:
R --no-save < install_dependencies.R
This will prompt to install the packages yaml, reshape, plotrix, and RUnit.
To build and install the RSAP package, cd to the source directory (downloaded above) - in my case /home/piers/git/public/RSAP - run the following:
R CMD INSTALL --build --preclean --clean --configure-args='--with-nwrfcsdk-include=/home/piers/code/sap/nwrfcsdk/include --with-nwrfcsdk-lib=/home/piers/code/sap/nwrfcsdk/lib' .
You must change the values for --with-nwrfcsdk-include and --with-nwrfcsdk-lib to point to the directory locations that you have downloaded the SAP NW RFC SDK to.
Under Linux, it is also likely that you need to add the lib directory to the LD cache or set the LD_LIBRARY_PATH variable.
Setting the LD Cache:
as root, edit /etc/ld.so.conf and add the lib path from above to it on it's own line. Now regenrate the cache by executiong 'sudo ldconfig'.
Setting LD_LIBRARY_PATH
You must ensure that the following environment variable is set in all your shells:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/nwrfcsdk/lib
The easiest way to do this is to add the above line to your $HOME/.bashrc file so that it happens automatically for all future shells.
Does it work?
Once the build and install of the RSAP package is complete, now you should test to make sure it's all working.
Change to the package source code directory (you are probably still there from the above activities), and launch either R or RStudio.
From the R command line try the following:
> library(RSAP)
Loading required package: yaml
>
You should get the above confirmation message that the dependent yaml package has been loaded. Now we are ready to try some R wizardry.
How to work with RSAP
Lets work through the general process steps for interacting with SAP.
Connecting to SAP
Using RSAP we need to establish a connection to SAP. For this you need an account that has the appropriate access for RFC calls, and functionality access. Connections can be built in two ways - directly passing connection parameters:
> conn <- RSAPConnect(ashost="nplhost", sysnr="42",
client="001", user="developer",
passwd="developer", lang="EN")
>
Or using a YAML encoded file that contains the connection details:
> conn <- RSAPConnect("sap.yml")
>
The sap.yml file is structured like:
ashost: nplhost
sysnr: "42"
client: "001"
user: developer
passwd: developer
lang: EN
trace: 1
The above activates the trace functionality in NW RFC SDK. This will create trace files in the current working directory, and are invaluable for debugging connectivity problems.
Calling SAP
Now we have the connection object, we can get connection info with it:
info <- RSAPGetInfo(conn)
Query the system with:
res <- RSAPInvoke(conn, "<RFC Function Name", parms)
Or close the connection:
RSAPClose(conn)
RSAPInvoke() is what we are most interested in, and we need to pass the parameters as a series of nested named lists. The classic example is RFC_READ_TABLE:
parms <- list('DELIMITER' = '|',
'FIELDS' = list(FIELDNAME = list('CARRID', 'CONNID', 'PRICE',
'SEATSMAX', 'SEATSOCC')),
'OPTIONS' = list(TEXT = list("CARRID = 'AA' ", " AND CONNID = 0017 ")),
'QUERY_TABLE' = 'SFLIGHTS2')
res <- RSAPInvoke(conn, "RFC_READ_TABLE", parms)
The names must correspond directly to the parameter and structure (for tables) names, and use numeric and character types as appropriate.
The other thing that is really important to get your head around is that R data structures are column oriented, which means we have to think differently about tables that we get from SAP. Tables in SAP translate to lists of vectors where the outer list is a list of column names (a slightly loose analogy but it will do) and the vectors hang off these column names corresponding to all the values in that column down the rows.
Working through the examples in get_flights.R
In the source code package there is an example script - get_flights.R. It uses the standard demonstration data for the Flight Data system contained in table SFLIGHT2. Let's look at what this does.
Load libraries:
> library(RSAP)
Loading required package: yaml
> library(reshape)
Loading required package: plyr
Attaching package: ‘reshape’
The following object(s) are masked from ‘package:plyr’:
rename, round_any
> library(plotrix)
>
We now have all the necessary libraries for the rest of the examples.
conn <- RSAPConnect("sap.yml")
parms <- list('DELIMITER' = ';',
'QUERY_TABLE' = 'SFLIGHTS2')
res <- RSAPInvoke(conn, "RFC_READ_TABLE", parms)
RSAPClose(conn)
sflight = res$DATA
flds <- sub("\\s+$", "", res$FIELDS$FIELDNAME)
sflight <- data.frame(sflight, colsplit(sflight$WA, split = ";", names = flds))
This connects to SAP, calls RFC_READ_TABLE to get the contents of SFLIGHT2, and sets the column delimiter for that table as ';'. We close the connection and copy the table data from the return parameter res$DATA (see RFC_READ_TABLE in transaction SE37) into sflight. We also grab the field names returned in table FIELDS, and remove the whitespace at the end. Next - this is where the importance of the ';' delimiter is - using the colsplit() function from the reshape package, we split return DATA into columns named by the FIELDS that RFC_READ_TABLE provided us.
Now we have a data.frame that looks a lot like the table SFLIGHT2 when viewed in transaction SE16.
sflight <- cbind(sflight, FLIGHTNO = paste(sub("\\s+$", "",
sflight$CARRID),sflight$CONNID, sep=""))
sflight$SEGMENT <- paste(sflight$AIRPFROM, sflight$AIRPTO, sep=" - ")
sflight$CARRNAME <- sub("\\s+$", "", sflight$CARRNAME)
sflight$DISTANCE <- as.numeric(lapply(sflight$DISTANCE,
FUN=function (x) {sub("\\*","", x)}))
sflight$DISTANCE <- as.numeric(lapply(sflight$DISTANCE,
FUN=function (x) {if (x == 0) NA else x}))
sflight[sflight$CARRNAME == 'Qantas Airways','DISTANCE'] <- 10258
This next chunk created new vectors (columns) FLIGHTNO combined from CARRID and CONNID, SEGMENT from AIRPFROM and AIRPTO, and cleaned vectors CARRNAME, and DISTANCE.
Now create some aggregated views, to generate visualisations from:
airline_avgocc <- aggregate(data.frame(SEATSMAX=sflight$SEATSMAX,
SEATSOCC=sflight$SEATSOCC,
OCCUPANCY=sflight$SEATSOCC/sflight$SEATSMAX),
by=list(carrname=sflight$CARRNAME), FUN=mean, na.rm=TRUE)
airline_sumocc <- aggregate(data.frame(SEATSOCC=sflight$SEATSOCC),
by=list(carrname=sflight$CARRNAME), FUN=sum, na.rm=TRUE)
Show a pie chart - sum of airline occupancy as a share of market:
x11()
lbls <- paste(airline_sumocc$carrname, "\n", sprintf("%.2f%%",
(airline_sumocc$SEATSOCC/sum(airline_sumocc$SEATSOCC))*100), sep="")
pie3D(airline_sumocc$SEATSOCC, labels=lbls,
col=rainbow(length(airline_sumocc$carrname)),
main="Occupancy sum share for Airlines", explode=0.1)
Create a Stacked Bar Plot with Colors and Legend showing a summary of occupancy by segment and carrier - to do this we need to generate a summary (aggregate), and fill in the missing combinations of the grid, and then switch the orientation of rows for columns to present to the plotting funcitons:
d <- aggregate(SEATSOCC ~ CARRNAME:SEGMENT, data=sflight, FUN=sum, na.rm=FALSE)
d2 <- with(d, expand.grid(CARRNAME = unique(d$CARRNAME), SEGMENT = unique(d$SEGMENT)))
airline_sumsegocc <- merge(d, d2, all.y = TRUE)
airline_sumsegocc$SEATSOCC[is.na(airline_sumsegocc$SEATSOCC)] <- 0
# switch orientation to segment * carrier
counts <- data.frame(unique(airline_sumsegocc$CARRNAME))
for (a in unique(airline_sumsegocc$SEGMENT))
{counts <- cbind(counts,
airline_sumsegocc$SEATSOCC[which(airline_sumsegocc$SEGMENT == a)]);}
counts[,1] <- NULL
colnames(counts) <- unique(airline_sumsegocc$SEGMENT);
rownames(counts) <- unique(airline_sumsegocc$CARRNAME);
x11()
barplot(as.matrix(counts), main="Total Occupancy by Segment and Carrier",
ylab="Number of Seats",
col=rainbow(dim(counts)[1]),
ylim=c(0, 15000), legend = rownames(counts))
Lastly - we create a simple performance indicator using a time series comparison of different airlines:
# performance by airline over time - dollars per customer KM
sflight$FLDATEYYMM <- substr(sflight$FLDATE, start=1, stop=6)
d <- aggregate(data.frame(PAYMENTSUM=sflight$PAYMENTSUM,
SEATSOCC=sflight$SEATSOCC,
DISTANCE=sflight$DISTANCE,
PERFORMANCE=(sflight$PAYMENTSUM/(sflight$SEATSOCC *
sflight$DISTANCE))),
by=list(carrname=sflight$CARRNAME,
fldateyymm=sflight$FLDATEYYMM),
FUN=sum, na.rm=TRUE)
d2 <- with(d, expand.grid(carrname = unique(d$carrname),
fldateyymm = unique(d$fldateyymm)))
agg_perf <- merge(d, d2, all.y = TRUE)
agg_perf <- agg_perf[order(agg_perf$carrname, agg_perf$fldateyymm),]
agg_perf$PERFORMANCE[is.na(agg_perf$PERFORMANCE)] <- 0
# create time series and plot comparison
perf_series <- data.frame(1:length(unique(agg_perf$fldateyymm)))
for (a in unique(agg_perf$carrname))
{perf_series <- cbind(perf_series,
agg_perf$PERFORMANCE[which(agg_perf$carrname == a)]);}
perf_series[,1] <- NULL
colnames(perf_series) <- unique(agg_perf$carrname);
# convert all to time series
for (a in length(unique(agg_perf$carrname)))
{perf_series[[a]] <- ts(perf_series[,a], start=c(2011,5), frequency=12)}
# plot the first and line the rest
x11()
ts.plot(ts(perf_series, start=c(2011,5), frequency=12),
gpars=list(main="Performance: dollar per customer KM",
xlab="Months",
ylab="Dollars",
col=rainbow(dim(perf_series)[2]), xy.labels=TRUE))
legend(2012.05, 3.2, legend=colnames(perf_series),
col=rainbow(dim(perf_series)[2]), lty=1, seg.len=1)
Hopefully, I've shown that there is a lot that can be done with R - especially in the area of adHoc advanced business intelligence and data analysis. I have not really even scratched the surface in terms of what R can offer for advanced statistical analysis and modelling - that is where the true wizards live.
I would love to hear back from anyone who tries RSAP out - issues and user experiences alike.
References:
- Post on SAP HANA and R from Alvaro
Basic R Tutorials
- Quick R
- The R Cookbook
- The R Statistics Blog
- Producing simple Graphs with R
- The CRAN project Documentation
Posted by PiersHarding at June 13, 2012 2:37 PM