*This post from Sherry LaMonica is the first in a series from members of the Revolution Analytics Engineering team — ed.*

Do you know about the big data capabilities in the RevoScaleR package, included with every Revolution R Enterprise installation?

RevoScaleR provides a framework for fast and efficient multi-core processing of large data sets. You can visualize and model data sets with millions of records on your local machine using syntax like:

myLinMod <- rxLinMod(y ~ x + z, data=myData)

Some highlights of the RevoScaleR package include:

- The XDF file format, a binary file format with an R interface that optimizes row and column processing and analysis.
- Data transformation tools for exploring and preparing large data sets for analysis.
- Statistical algorithms optimized for large data sets.

Most users will want to proceed from data import to data analysis in a three-step process. Below are some of the frequently used RevoScaleR functions in each of these steps:

**Step 1: Import the data you want to analyze from an external file:**

rxTextToXdf() - Import data to .xdf format from a delimited text file.

rxImportToXdf() - Import data from a data source, such as fixed-format text or SAS data (use together with the RxTextData and RxSasData functions).

rxDataStepXdf() - Transform your data and select subsets of variables and/or rows for data exploration and analysis.

**Step 2: Explore and Transform the Data:**

rxSummary(), rxCube(), rxCrossTabs() - Obtain summary statistics and compute crosstabulations.

rxHistogram() - Plot a histogram of a variable in an .xdf file.

rxLinePlot() - Create a line plot or scatterplot from data in an .xdf file or from the results of rxCube().

**Step 3: Perform model fitting and additional statistical analysis on data:**

rxLinMod() - Fit a linear regression model to data in an .xdf file.

rxLogit() - Fit a logistic regression model to data in an .xdf file.

rxPredict() - Compute predictions and residuals from a linear or logistic regression fit.

rxCovCor() - Compute the covariance/correlation matrix for a linear or logistic regression model.
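To make the three steps concrete, here is a minimal sketch of how they might fit together. The simulated data, the variable names (`y`, `x`, `z`), and the temporary file names are assumptions made up for illustration; only the `rx*` function names come from the lists above. Because RevoScaleR ships with Revolution R Enterprise rather than open-source R, the RevoScaleR calls are guarded so the script still runs (and simply skips them) where the package is not installed.

```r
# Simulated data and hypothetical file names, for illustration only.
set.seed(42)
n <- 1000
myData <- data.frame(x = rnorm(n), z = rnorm(n))
myData$y <- 1 + 2 * myData$x - 0.5 * myData$z + rnorm(n)

# Write the data to a delimited text file so there is something to import.
csvFile <- file.path(tempdir(), "myData.csv")
xdfFile <- file.path(tempdir(), "myData.xdf")
write.csv(myData, csvFile, row.names = FALSE)

# The RevoScaleR steps run only if the package is available.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)

  # Step 1: import the delimited text file into .xdf format.
  rxTextToXdf(inFile = csvFile, outFile = xdfFile, overwrite = TRUE)

  # Step 2: explore the data with summary statistics and a histogram.
  print(rxSummary(~ y + x + z, data = xdfFile))
  rxHistogram(~ y, data = xdfFile)

  # Step 3: fit a linear regression directly against the .xdf file.
  myLinMod <- rxLinMod(y ~ x + z, data = xdfFile)
  print(summary(myLinMod))
}
```

With a data set this small there is no practical benefit over `lm()`; the point of the sketch is only that the same calls take a path to an .xdf file on disk instead of an in-memory data frame.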

The RevoScaleR 'Getting Started Guide' contains several examples of how to analyze your data with the RevoScaleR package. You can open the PDF document from within Revolution R Enterprise for Windows by going to the 'Help' menu and selecting the option 'R Manuals(PDF)'. This opens the PDF portfolio; the third document listed is 'RevoScaleRGetStart.pdf'.

XDF is just another proprietary data format?

Why would someone go through all this trouble to move from one proprietary format to another one?

Posted by: roach | March 10, 2011 at 17:59

Good question.

Today .sas, tomorrow .xdf.

Posted by: Yishua | March 10, 2011 at 18:59

Thanks for the post! Nice to hear that there is a lot of work going on regarding big data.

A couple of questions please:

1) When we talk of big data what do we mean by "big"?

2) What data capacity (measured in gigabytes, terabytes, or petabytes) can RevoScaleR handle? In other words, what is its limit?

3) Are the functions listed here all of its capabilities? What if one is interested in some nonlinear modeling process not mentioned? Is that possible, given that RevoScaleR has handled the importing process?

4) There is a lot of talk these days about the Hadoop approach to the same problem. How is RevoScaleR a better choice than Hadoop?

Thanks again for the post.. and thanks for handling my questions.

Posted by: iThink, iAct! | March 11, 2011 at 00:37

Thanks for the questions. On why you'd want to store data in XDF, and what's meant by "big data", check out this white paper on big data analysis with RevoScaleR.

Posted by: David Smith | March 11, 2011 at 13:54

Does RevoScaleR contain "big data" functions for multi-level regression? If not, is there any idea if and when this might be available? RAM limitations and slow performance of multi-level regression in R are by far my biggest analysis headache.

Posted by: Eric | April 08, 2011 at 09:47

Is XDF-based file processing an alternative solution to the RMR approach?

Posted by: sanjeev taran | June 03, 2013 at 18:36