Channel: Search Results for “register” – R-bloggers

Register now for RStudio Shiny Workshops in D.C., New York, Boston, L.A., San Francisco and Seattle


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

Great news for Shiny and R Markdown enthusiasts!

An Interactive Reporting Workshop with Shiny and R Markdown is coming to a city near you. Act fast as only 20 seats are available for each workshop.

You can find out more / register by clicking on the link for your city!

East Coast
  • March 2 – Washington, DC
  • March 4 – New York, NY
  • March 6 – Boston, MA

West Coast
  • April 15 – Los Angeles, CA
  • April 17 – San Francisco, CA
  • April 20 – Seattle, WA

You’ll want to take this workshop if…

You have some experience working with R already. You should have written a number of functions, and be comfortable with R’s basic data structures (vectors, matrices, arrays, lists, and data frames).

You will learn from…

The workshop is taught by Garrett Grolemund. Garrett is the Editor-in-Chief of shiny.rstudio.com, the development center for the Shiny R package. He is also the author of Hands-On Programming with R as well as Data Science with R, a forthcoming book by O’Reilly Media. Garrett works as a Data Scientist and Chief Instructor for RStudio, Inc.

To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.


Introducing stackr: An R package for querying the Stack Exchange API


(This article was first published on Variance Explained, and kindly contributed to R-bloggers)

There’s no end of interesting data analyses that can be performed with Stack Overflow and the Stack Exchange network of Q&A sites. Earlier this week I posted a Shiny app that visualizes the personalized prediction data from their machine learning system, Providence. I’ve also looked at whether high-reputation users were decreasing their answering activity over time, using data from the Stack Exchange Data Explorer.

One issue is that each of these approaches requires working outside of R to obtain the data (in the case of the Data Explorer, it also requires knowledge of SQL). I’ve thus created the stackr package, which can query the Stack Exchange API to obtain information on questions, answers, users, tags, etc., and converts the output into an R data frame that can easily be manipulated, analyzed, and visualized. (Hadley Wickham’s httr package, along with his terrific guide for writing an API package, helped a lot!) stackr provides the tools to perform analyses of a particular user, of recently asked questions, of a particular tag, or of other facets of the site.

The package is straightforward to use. Every function starts with stack_: stack_answers to query answers, stack_questions for questions, stack_users, stack_tags, and so on. Each output is a data frame, where each row represents one object (an answer, question, user, etc.). The package also provides options for sorting and filtering results in the API: almost all of the options available in the API itself. Since the API returns at most 100 results at a time, the package also handles pagination so you can get as many results as you need.
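For instance, the other endpoints follow the same pattern; here is a quick sketch (pagesize mirrors its use later in this post; treat the exact arguments as an assumption to confirm against the package documentation):

library(stackr)
# most recent questions and the site's tags, each returned as a data frame
questions <- stack_questions(pagesize = 100)
tags <- stack_tags(pagesize = 100)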

Example: answering activity

Here I’ll show an example of using the stackr package to analyze an individual user. We’ll pick one at random: eeny, meeny, miny… me. (OK, that might not have been random). Stack Overflow provides many summaries and analyses on that profile already, but the stackr package lets us bring the data seamlessly into R so we can analyze it however we want. Extracting all of my answers is done using the stack_users function with the extra argument "answers". We’ll take advantage of stackr’s pagination feature, and turn the result into a tbl_df from dplyr so that it prints more reasonably:

library(stackr)
library(dplyr)
answers <- stack_users(712603, "answers", num_pages = 10, pagesize = 100)
answers <- tbl_df(answers)
answers
## Source: local data frame [732 x 14]
## 
##    owner_reputation owner_user_id owner_user_type owner_accept_rate
## 1             34279        712603      registered               100
## 2             34279        712603      registered               100
## 3             34279        712603      registered               100
## 4             34279        712603      registered               100
## 5             34279        712603      registered               100
## 6             34279        712603      registered               100
## 7             34279        712603      registered               100
## 8             34279        712603      registered               100
## 9             34279        712603      registered               100
## 10            34279        712603      registered               100
## ..              ...           ...             ...               ...
## Variables not shown: owner_profile_image (chr), owner_display_name (chr),
##   owner_link (chr), is_accepted (lgl), score (int), last_activity_date
##   (time), last_edit_date (time), creation_date (time), answer_id (int),
##   question_id (int)

This lets me find out a lot about myself: for starters, that I’ve answered 732 questions. What percentage of my answers were accepted by the asker?

mean(answers$is_accepted)
## [1] 0.6297814

And what is the distribution of scores my answers have received?

library(ggplot2)
ggplot(answers, aes(score)) + geom_histogram(binwidth = 1)

[Histogram of answer scores]

How has my answering activity changed over time? To find this out, I can use dplyr to count the number of answers per month and graph it:

library(lubridate)

answers %>% mutate(month = round_date(creation_date, "month")) %>%
    count(month) %>%
    ggplot(aes(month, n)) + geom_line()

[Line chart: answers per month over time]

Well, it looks like my activity has been decreasing over time (though I already knew that). How about how my answering activity changes over the course of a day?

answers %>% mutate(hour = hour(creation_date)) %>%
    count(hour) %>%
    ggplot(aes(hour, n)) + geom_line()

[Line chart: answers by hour of day]

(Note that the times are in my own time zone, EST). Unsurprisingly, I answer more during the day than at night, but I’ve still done some answering even around 4-6 AM. You can also spot two conspicuous dips: one at 12 when I eat lunch, and one at 6 when I take the train home from work.

(If that’s not enough invasion of my privacy, you could look at my commenting activity with stack_users(712603, "comments", ...), but it generally shows the same trends).

Top tags

The API also makes it easy to extract the tags I’ve most answered, which is another handy way to extract and visualize information about my answering activity:

top_tags <- stack_users(712603, "top-answer-tags", pagesize = 100)
head(top_tags)
##   user_id answer_count answer_score question_count question_score
## 1  712603          463         1604              1              7
## 2  712603          234          812              6             32
## 3  712603           52          187              0              0
## 4  712603           26          127              1              7
## 5  712603           34          110              1              9
## 6  712603           26          104              0              0
##     tag_name
## 1     python
## 2          r
## 3       list
## 4 python-2.7
## 5     django
## 6     string
top_tags %>% mutate(tag_name = reorder(tag_name, -answer_score)) %>%
    head(20) %>%
    ggplot(aes(tag_name, answer_score)) + geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

[Bar chart: total answer score for the top 20 tags]

We could also view it using the wordcloud package:

library(wordcloud)
wordcloud(top_tags$tag_name, top_tags$answer_count)

[Word cloud of most-answered tags]

This is just scratching the surface of the information that the API can retrieve. Hopefully the stackr package will make possible other analyses, visualizations, and Shiny apps that help understand and interpret Stack Exchange data.

To leave a comment for the author, please follow the link and comment on his blog: Variance Explained.


Introducing drat: Lightweight R Repositories


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new package of mine just got to CRAN in its very first version 0.0.1: drat. Its name stands for drat R Archive Template, and an introduction is provided at the drat page, the GitHub repository, and below.

drat builds on a core strength of R: the ability to query multiple repositories. Just as one has always been able to query, say, CRAN, BioConductor and OmegaHat, one can now add the drats of one or more other developers with ease. drat also builds on a core strength of GitHub. Every user automagically has a corresponding github.io address, and by appending drat we get a standardized URL.

drat combines both strengths. So after an initial install.packages("drat") to get drat, you can just do either one of

library(drat)
addRepo("eddelbuettel")

or equally

drat:::add("eddelbuettel")

to register my drat. Now install.packages() will work using this new drat, as will update.packages(). The fact that the update mechanism works is a key strength: not only can you get a package, but you can also get its updates once its author places them in his drat.
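For example (the package name below is hypothetical; the repository URL follows the github.io convention described above):

library(drat)
addRepo("eddelbuettel")           # adds https://eddelbuettel.github.io/drat to your repos
install.packages("somePackage")   # hypothetical package served from that drat
update.packages()                 # later picks up newer versions from the same repository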

How does one do that? Easy! For a package foo_0.1.0.tar.gz we do

library(drat)
insertPackage("foo_0.1.0.tar.gz")

The local git repository defaults to ~/git/drat/ but this can be overridden either as a local default (via options()) or directly in the function call. Note that this also assumes that you a) have a gh-pages branch and b) have that branch checked out as the currently active branch. Automating this / testing for this is left for a subsequent release. Also available is an alternative unexported short-hand function:

drat:::insert("foo_0.1.0.tar.gz", "/opt/myWork/git")

shown here with the alternate use case of a local fileshare you can copy into and query from, something we do at work where we share packages only locally.

The easiest way to obtain the corresponding file system layout may be to just fork the drat repository.
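For reference, a source-package-only drat ends up with the standard R repository layout on the gh-pages branch, roughly (a sketch):

src/contrib/foo_0.1.0.tar.gz
src/contrib/PACKAGES
src/contrib/PACKAGES.gz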

So that is it. Two exported functions, and two unexported (potentially name-clobbering) shorthands. Now drat away!

Courtesy of CRANberries, there is also a copy of the DESCRIPTION file for this initial release. More detailed information is on the drat page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .


CFP: 13th Australasian Data Mining Conference (AusDM 2015)


(This article was first published on blog.RDataMining.com, and kindly contributed to R-bloggers)

The 13th Australasian Data Mining Conference (AusDM 2015)
Sydney, Australia, 8-9 August 2015
co-located with SIGKDD’15
URL: http://ausdm15.ausdm.org/
Join us on LinkedIn: http://www.linkedin.com/groups/AusDM-4907891

The Australasian Data Mining Conference has established itself as the premier Australasian meeting for both practitioners and researchers in data mining. It is devoted to the art and science of intelligent analysis of (usually big) data sets for meaningful (and previously unknown) insights. This conference will enable the sharing and learning of research and progress in the local context and new breakthroughs in data mining algorithms and their applications across all industries.

Publication and topics
We are calling for papers, both research and applications, from both academia and industry, for presentation at the conference. All papers will go through double-blind peer review by a panel of international experts. Accepted papers will be published in an upcoming volume (Data Mining and Analytics 2015) of the Conferences in Research and Practice in Information Technology (CRPIT) series by the Australian Computer Society, which is also held in full text on the ACM Digital Library, and will also be distributed at the conference. For more details on CRPIT please see http://www.crpit.com. Please note that we require at least one author of each accepted paper to register for the conference and present the work. AusDM invites contributions addressing current research in data mining and knowledge discovery as well as experiences, novel applications and future challenges.

Topics of interest include, but are not restricted to:
– Applications and Case Studies – Lessons and Experiences
– Big Data Analytics
– Biomedical and Health Data Mining
– Business Analytics
– Computational Aspects of Data Mining
– Data Integration, Matching and Linkage
– Data Mining Education
– Data Mining in Security and Surveillance
– Data Preparation, Cleaning and Preprocessing
– Data Stream Mining
– Evaluation of Results and their Communication
– Implementations of Data Mining in Industry
– Integrating Domain Knowledge
– Link, Tree, Graph, Network and Process Mining
– Multimedia Data Mining
– New Data Mining Algorithms
– Professional Challenges in Data Mining
– Privacy-preserving Data Mining
– Spatial and Temporal Data Mining
– Text Mining
– Visual Analytics
– Web and Social Network Mining

Submission of papers
We invite two types of submissions for AusDM 2015:

Academic submissions: Normal academic submissions reporting on research progress, with a paper length of between 8 and 12 pages in CRPIT style, as detailed below. Academic submissions will go through a double-blind review process, i.e. paper submissions must NOT include author names or affiliations (nor acknowledgements referring to funding bodies). Self-citing references should also be removed from the submitted papers (they can be added back after the review) for double-blind reviewing purposes.

Industry submissions: Submissions from governments and industry can report on specific data mining implementations and experiences. Submissions in this category can be between 4 and 8 pages in CRPIT style, as detailed below. These submissions do not need to be double-blinded. A special committee made of industry representatives will assess industry submissions.

Paper submissions are required to follow the general format specified for papers in the CRPIT series by the Australian Computer Society. Submission details are available from http://crpit.com/AuthorsSubmitting.html.

Important Dates
Submission of full papers: Monday 20 April 2015 (midnight PST)
Notification of authors: Sunday June 7 2015
Final version and author registration: Sunday June 28 2015
Conference: 8-9 August 2015


To leave a comment for the author, please follow the link and comment on his blog: blog.RDataMining.com.


Turning R into a GIS – Mapping the weather in Germany


(This article was first published on Big Data Doctor » R, and kindly contributed to R-bloggers)

Nothing has gotten more attention in the visualization world than map-based insights, or in other words, just plotting different KPIs on a map to allow for a playful discovery experience. I must admit, maps are cool, an awesome tool to “show off” and to visually gain some insights.

But let’s be also clear about the limitations of map based charts:

  • You can compare locations based on a KPI, but you cannot quantify the difference between them
  • Color is difficult to understand and often leads to misinterpretation (e.g. what’s the meaning of red? more sales? or worse results?).
  • Color gradients are also pretty challenging for the human eye.
  • Zooming in/out results in insights detailing down/aggregation, but it’s difficult to establish a quantification between different granularity levels.

Anyway, R can be really useful for creating high-quality maps… There are awesome packages like rMaps, where you have a set of controls available to make your maps interactive, RgoogleMaps, maptools, etc.

In this post I’m going to plot weather KPIs for over 8,000 different postal codes (Postleitzahl or PLZ) in Germany. I’m going to shade the different areas according to their values – as you would expect :)

We are going to follow these steps to visualize the temperature, the humidity and the snow fall for the entire German country:

  1. Preparation of the required assets (PLZ coordinates, polygon lines, weather API key, etc)
  2. Querying the weather API for each PLZ to retrieve the weather values
  3. Map creation and PLZ data frame merging with the obtained weather information
  4. Map display for the weather metrics and high-resolution picture saving

1- Assets preparation

We need to prepare a few assets… Everything freely accessible and just a mouse click away… Amazing, isn’t it?

  • The list of all PLZ with city name and the lat/long coordinates of a centroid (you can download this data from geonames)
  • The shapefiles for the PLZ to know how to draw them on a map (kindly made available out of the OpenStreetMaps at suche-postleitzahl.org)
  • A key for the weather API (you need to register at openweathermap.org, takes literally a second and they are not going to bother you with newsletters)

2-Downloading the weather data

Basically, it’s just a JSON call we perform for each PLZ, passing the lat/long coordinates to the OpenWeatherMap API endpoint. Each weather entry is then stored as a one-row data frame that we keep appending to the data frame holding all entries:

library(jsonlite)
#load the plz info you download from the geonames resource
plz.ort.de<-read.csv(file = "../plzgeo.csv")
weather.de<-NULL
for (i in 1:nrow(plz.ort.de))
{
  url<-paste0('http://api.openweathermap.org/data/2.5/weather?lat=',plz.ort.de[i,]$lat, '&lon=',plz.ort.de[i,]$lon,'&units=metric&APPID=PUT_YOUR_KEY_HERE')
  weather.entry<-jsonlite::fromJSON(url,simplifyMatrix = F,simplifyDataFrame = F,flatten = T)
  temperature<-weather.entry$main$temp
  humidity<-weather.entry$main$humidity
  wind.speed<-weather.entry$wind$speed
  wind.deg<-weather.entry$wind$deg
  snow<-weather.entry$snow$`3h`
  if (is.null(wind.speed)){ wind.speed<-NA}
  if (is.null(wind.deg)){ wind.deg<-NA}
  if (is.null(snow)){ snow<-NA}
  if (is.null(humidity)){ humidity<-NA}
  if (is.null(temperature)){ temperature<-NA}
  weather.de<-rbind(data.frame(plz=plz.ort.de[i,]$plz,temperature,humidity,wind.speed,wind.deg,snow),weather.de)  
# you might want to put your process to sleep for a few milliseconds here (e.g. Sys.sleep(0.1)) to give the API a breather
}
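Since the loop issues one API call per PLZ (over 8,000 requests), it may be worth persisting the collected data locally so you don't have to re-query it in a later session; a minimal sketch:

# cache the collected weather data on disk and reload it later
write.csv(weather.de, file = "../weatherde.csv", row.names = FALSE)
weather.de <- read.csv("../weatherde.csv")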

3-Map creation and PLZ-weather data frames merging

We use rgdal for the required spatial transformations. In this case, we use EPSG 4839 for the German geography (see spTransform).

library(ggplot2)
library(rgdal)           # required for readOGR and spTransform
library(RColorBrewer)
 
setwd("[your_path]/plz-gebiete.shp/")
# read the PLZ shapefile
map <- readOGR(dsn=".", layer="plz-gebiete")
map <- spTransform(map,CRS=CRS("+init=epsg:4839"))
map$region<-substr(map$plz, 1,1)
map.data <- data.frame(id=rownames(map@data), map@data)
map.data$cplz<- as.character(map.data$plz)
 
weather.de$cplz<- as.character(weather.de$plz)
#normalization to force all PLZs having 5 digits
weather.de$cplz<- ifelse(nchar(weather.de$cplz)<5, paste0("0",weather.de$cplz), weather.de$cplz)
map.data<-merge(map.data,weather.de,by=c("cplz"),all=T)
map.df   <- fortify(map)
map.df   <- merge(map.df,map.data,by="id", all=F)

4-Map display for the weather metrics and high-resolution picture saving

We just rely on standard ggplot functionality to plot whichever weather metric we’d like. To make it more readable, I facetted by region (the first digit of the PLZ).

temperature<-ggplot(map.df, aes(x=long, y=lat, group=group))+
  geom_polygon(aes(fill=temperature))+
  facet_wrap(~region,scales = 'free') +
  geom_path(colour="lightgray", size=0.5)+
  scale_fill_gradient2(low ="blue", mid = "white", high = "green", 
                       midpoint = 0, space = "Lab", na.value = "lightgray", guide = "legend")+  theme(axis.text=element_blank())+
  theme(axis.text=element_text(size=12)) +
  theme(axis.title=element_text(size=14,face="bold")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(panel.background = element_rect(fill = 'white')) +
  theme(panel.grid.major = element_line( color="snow2")) 
 
ggsave("../plz-temperature-300.png",  width=22.5, height=18.25, dpi=300)

[Map: temperature by postal code, facetted by region]

5-(Bonus) Underlying map tiles

You probably feel like having a reference map to see city names, roads, rivers and all that stuff within each PLZ. For that we can use ggmap, a really cool package for spatial visualization with Google Maps and OpenStreetMap.

library(plyr)
library(ggmap)
# reading the shapes
area <- readOGR(dsn=".", layer="plz-gebiete")
area.df <- data.frame(id=rownames(area@data), area@data)
# from factor to character
area.df$plz <- as.character(area.df$plz)
# using the normalized version with the leading "0" for the later join
weather.de$plz <- weather.de$cplz
# merging weather and geographical information
area.extend <- merge(area.df, weather.de, by=c("plz"), all=F)
# building the polygon data frame and joining the weather data
area.points <- fortify(area)
d <- join(area.points, area.extend, by="id")
# region extraction
d$region<-substr(d$plz, 1,1)
bavaria<-subset(d, region=="8")
# google map tiles request... location is where you want your map centered at
google.map <- get_map(location="Friedberg", zoom =8, maptype = "terrain", color = "bw", scale=4)
ggmap(google.map) +
  geom_polygon(data=bavaria, aes(x=long, y=lat, group=group, fill=temperature), colour=NA, alpha=0.5) +
  scale_fill_gradient2(low ="blue", mid = "yellow", high = "green", 
                       midpoint = 0, space = "Lab", na.value = "lightgray", guide = "legend")+  theme(axis.text=element_blank())+  
  labs(fill="") +
  theme_nothing(legend=TRUE)
 
ggsave("../plz-temperature-Bavaria.png",  width=22.5, height=18.25, dpi=300)

The results speak for themselves!
[Map: Temperature in Germany]
[Map: Temperature in the area around Munich only]
[Map: Snow across Germany]

To leave a comment for the author, please follow the link and comment on his blog: Big Data Doctor » R.


How-to go parallel in R – basics + tips


(This article was first published on G-Forge » R, and kindly contributed to R-bloggers)
Don’t waste another second, start parallelizing your computations today! The image is CC by Smudge 9000

Today is a good day to start parallelizing your code. I’ve been using the parallel package since its integration with R (v. 2.14.0) and it’s much easier than it at first seems. In this post I’ll go through the basics for implementing parallel computations in R, cover a few common pitfalls, and give tips on how to avoid them.

The common motivation behind parallel computing is that something is taking too long. For me that means any computation that takes more than 3 minutes – this is because parallelization is incredibly simple and most tasks that take time are embarrassingly parallel. Here are a few common tasks that fit the description:

  • Bootstrapping
  • Cross-validation
  • Multivariate Imputation by Chained Equations (MICE)
  • Fitting multiple regression models

Learning lapply is key

One thing I regret is not learning lapply earlier. The function is beautiful in its simplicity: it takes a vector/list and a function, applies the function to each element, and returns a list:

lapply(1:3, function(x) c(x, x^2, x^3))
[[1]]
 [1] 1 1 1

[[2]]
 [1] 2 4 8

[[3]]
 [1] 3 9 27

You can feed it additional values by adding named parameters:

lapply(1:3/3, round, digits=3)
[[1]]
[1] 0.333

[[2]]
[1] 0.667

[[3]]
[1] 1

The tasks are embarrassingly parallel as the elements are calculated independently, i.e. the second element is independent of the result from the first element. After learning to code using lapply you will find that parallelizing your code is a breeze.

The parallel package

The parallel package is basically about doing the above in parallel. The main difference is that we need to start by setting up a cluster, a collection of “workers” that will be doing the job. A good number of workers is the number of available cores minus 1. I’ve found that using all 8 cores on my machine will prevent me from doing anything else (the computer comes to a standstill until the R task has finished). I therefore always set up the cluster as follows:

library(parallel)
 
# Calculate the number of cores
no_cores <- detectCores() - 1
 
# Initiate cluster
cl <- makeCluster(no_cores)

Now we just call the parallel version of lapply, parLapply:

parLapply(cl, 2:4,
          function(exponent)
            2^exponent)
[[1]]
[1] 4

[[2]]
[1] 8

[[3]]
[1] 16

Once we are done we need to close the cluster so that resources such as memory are returned to the operating system.

stopCluster(cl)

Variable scope

On Mac/Linux you have the option of using makeCluster(no_cores, type="FORK"), which automatically inherits all environment variables (more details on this below). On Windows you have to use the Parallel Socket Cluster (PSOCK), which starts out with only the base packages loaded (note that PSOCK is the default on all systems). You should therefore always specify exactly what variables and libraries you need for the parallel function to work; e.g. the following fails:

cl<-makeCluster(no_cores)
base <- 2
 
parLapply(cl, 
          2:4, 
          function(exponent) 
            base^exponent)
 
stopCluster(cl)
 Error in checkForRemoteErrors(val) : 
  3 nodes produced errors; first error: object 'base' not found 

While this passes:

cl<-makeCluster(no_cores)
 
base <- 2
clusterExport(cl, "base")
parLapply(cl, 
          2:4, 
          function(exponent) 
            base^exponent)
 
stopCluster(cl)
[[1]]
[1] 4

[[2]]
[1] 8

[[3]]
[1] 16

Note that you need the clusterExport(cl, "base") in order for the function to see the base variable. If you are using some special packages you will similarly need to load those through clusterEvalQ, e.g. I often use the rms package and I therefore use clusterEvalQ(cl, library(rms)). Note that any changes to the variable after clusterExport are ignored:

cl<-makeCluster(no_cores)
clusterExport(cl, "base")
base <- 4
# Run
parLapply(cl, 
          2:4, 
          function(exponent) 
            base^exponent)
 
# Finish
stopCluster(cl)
[[1]]
[1] 4

[[2]]
[1] 8

[[3]]
[1] 16

Using parSapply

Sometimes we only want to return a simple value and directly get it processed as a vector/matrix. The lapply version that does this is called sapply, thus it is hardly surprising that its parallel version is parSapply:

parSapply(cl, 2:4, 
          function(exponent) 
            base^exponent)
[1]  4  8 16

Matrix output with names (this is why we need the as.character):

parSapply(cl, as.character(2:4), 
          function(exponent){
            x <- as.numeric(exponent)
            c(base = base^x, self = x^x)
          })
     2  3   4
base 4  8  16
self 4 27 256

The foreach package

The idea behind the foreach package is to create ‘a hybrid of the standard for loop and lapply function’, and its ease of use has made it rather popular. The set-up is slightly different; you need to “register” the cluster as below:

library(foreach)
library(doParallel)
 
cl<-makeCluster(no_cores)
registerDoParallel(cl)

Note that you can change the last two lines to:

registerDoParallel(no_cores)

But then, instead of calling stopCluster() at the end, you need to remember to do:

stopImplicitCluster()

The foreach function can be viewed as a more controlled version of parSapply that allows combining the results into a suitable format. By specifying the .combine argument we can choose how to combine our results; below is a vector, a matrix, and a list example:

foreach(exponent = 2:4, 
        .combine = c)  %dopar%  
  base^exponent
[1]  4  8 16
foreach(exponent = 2:4, 
        .combine = rbind)  %dopar%  
  base^exponent
         [,1]
result.1    4
result.2    8
result.3   16
foreach(exponent = 2:4, 
        .combine = list,
        .multicombine = TRUE)  %dopar%  
  base^exponent
[[1]]
[1] 4

[[2]]
[1] 8

[[3]]
[1] 16

Note that the last is the default and can be achieved without any tweaking, just foreach(exponent = 2:4) %dopar%. In the example it is worth noting the .multicombine argument that is needed to avoid a nested list. The nesting occurs due to the sequential .combine function calls, i.e. list(list(result.1, result.2), result.3):

foreach(exponent = 2:4, 
        .combine = list)  %dopar%  
  base^exponent
[[1]]
[[1]][[1]]
[1] 4

[[1]][[2]]
[1] 8


[[2]]
[1] 16

Variable scope

The variable scope constraints are slightly different for the foreach package. Variables within the same local environment are available by default:

base <- 2
cl<-makeCluster(2)
registerDoParallel(cl)
foreach(exponent = 2:4, 
        .combine = c)  %dopar%  
  base^exponent
stopCluster(cl)
 [1]  4  8 16

While variables from a parent environment will not be available, i.e. the following will throw an error:

test <- function (exponent) {
  foreach(exponent = 2:4, 
          .combine = c)  %dopar%  
    base^exponent
}
test()
 Error in base^exponent : task 1 failed - "object 'base' not found" 

A nice feature is that you can use the .export option instead of the clusterExport. Note that as it is part of the parallel call it will have the latest version of the variable, i.e. the following change in “base” will work:

base <- 2
cl<-makeCluster(2)
registerDoParallel(cl)
 
base <- 4
test <- function (exponent) {
  foreach(exponent = 2:4, 
          .combine = c,
          .export = "base")  %dopar%  
    base^exponent
}
test()
 
stopCluster(cl)
 [1]  4  8 16

Similarly you can load packages with the .packages option, e.g. .packages = c("rms", "mice"). I strongly recommend always exporting the variables you need as it limits issues that arise when encapsulating the code within functions.
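A minimal sketch combining both options (the stats package here is an arbitrary stand-in for whatever packages your workers actually need):

library(foreach)
library(doParallel)

base <- 2
cl <- makeCluster(2)
registerDoParallel(cl)
test <- function() {
  foreach(exponent = 2:4,
          .combine = c,
          .export = "base",        # ship the variable to the workers
          .packages = "stats") %dopar%
    base^exponent
}
test()
stopCluster(cl)
 [1]  4  8 16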

Fork or sock?

I do most of my analyses on Windows and have therefore gotten used to the PSOCK system. For those of you on other systems you should be aware of some important differences between the two main alternatives:

FORK: "to divide in branches and go separate ways"
Systems: Unix/Mac (not Windows)
Environment: Link all

PSOCK: Parallel Socket Cluster
Systems: All (including Windows)
Environment: Empty

Memory handling

Unless you are using multiple computers, or Windows, or planning on sharing your code with someone using a Windows machine, you should try to use FORK (capitalized here because of the makeCluster type argument). It is leaner on memory usage since the workers link to the same address space. Below you can see that the memory addresses of variables exported to a PSOCK cluster are not the same as the originals:

library(pryr) # Used for memory analyses
a <- 1:10     # an example object to export to the workers
cl <- makeCluster(no_cores)
clusterExport(cl, "a")
clusterEvalQ(cl, library(pryr))
 
parSapply(cl, X = 1:10, function(x) {address(a)}) == address(a)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

While they are the same for FORK clusters:

cl<-makeCluster(no_cores, type="FORK")
parSapply(cl, X = 1:10, function(x) address(a)) == address(a)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

This can save a lot of time during setup and also memory. Interestingly, you do not need to worry about variable corruption:

b <- 0
parSapply(cl, X = 1:10, function(x) {b <- b + 1; b})
# [1] 1 1 1 1 1 1 1 1 1 1
parSapply(cl, X = 1:10, function(x) {b <<- b + 1; b})
# [1] 1 2 3 4 5 1 2 3 4 5
b
# [1] 0

Debugging

Debugging is especially hard when working in a parallelized environment. You cannot simply call browser/cat/print in order to find out what the issue is.

The tryCatch approach

Using stop() for debugging without modification is generally a bad idea; while you will receive the error message, there is a large chance that you have forgotten about that stop(), and it gets invoked once you have run your software for a day or two. It is annoying to throw away all the previous successful computations just because one failed (yup, this is the default behavior of all the above functions). You should therefore try to catch errors and return a text explaining the setting that caused the error:

foreach(x=list(1, 2, "a"))  %dopar%  
{
  tryCatch({
    c(1/x, x, 2^x)
  }, error = function(e) return(paste0("The variable '", x, "'", 
                                      " caused the error: '", e, "'")))
}
[[1]]
[1] 1 1 2

[[2]]
[1] 0.5 2.0 4.0

[[3]]
[1] "The variable 'a' caused the error: 'Error in 1/x: non-numeric argument to binary operatorn'"

This is also why I like lists: the .combine option may look appealing, but it is easy to apply manually, and if you have a function that crashes when one of the elements is not of the expected type you will lose all your data. Here is a simple example of how to call rbind on a lapply output:

out <- lapply(1:3, function(x) c(x, 2^x, x^x))
do.call(rbind, out)
     [,1] [,2] [,3]
[1,]    1    2    1
[2,]    2    4    4
[3,]    3    8   27

Creating a common output file

Since we can’t have a console per worker we can set a shared file. I would say that this is a “last resort” solution:

cl<-makeCluster(no_cores, outfile = "debug.txt")
registerDoParallel(cl)
foreach(x=list(1, 2, "a"))  %dopar%  
{
  print(x)
}
stopCluster(cl)
starting worker pid=7392 on localhost:11411 at 00:11:21.077
starting worker pid=7276 on localhost:11411 at 00:11:21.319
starting worker pid=7576 on localhost:11411 at 00:11:21.762
[1] 2]

[1] "a"

As you can see, due to a race between the first and the second node the output is a little garbled and therefore, in my opinion, less useful than returning a custom statement.

Creating a node-specific file

A perhaps slightly more appealing alternative is to have a node-specific file. This could potentially be interesting when you have a dataset that is causing issues and you want to have a closer look at that data set:

cl<-makeCluster(no_cores, outfile = "debug.txt")
registerDoParallel(cl)
foreach(x=list(1, 2, "a"))  %dopar%  
{
  cat(dput(x), file = paste0("debug_file_", x, ".txt"))
} 
stopCluster(cl)

A tip is to combine this with the tryCatch approach above. That way you can write out any data that is not suitable for a simple message (e.g. a large data.frame), load it later, and debug it without parallelization. If x is too long for a file name, I suggest that you use digest as described below for the cache function.

The partools package

There is an interesting package, partools, that has a dbs() function that may be worth looking into (unless you’re on a Windows machine). It allows coupling terminals per process and debugging through them.

Caching

I strongly recommend implementing some caching when doing large computations. There may be a multitude of reasons why you need to exit a computation, and it would be a pity to waste all that valuable time. There is a package for caching, R.cache, but I’ve found it easier to write the function myself. All you need is the digest package. By feeding the data plus the function that you are using to digest() you get a unique key; if that key matches a previous calculation there is no need to re-run that particular section. Here is a function with caching:

cacheParallel <- function(){
  vars <- 1:2
  tmp <- clusterEvalQ(cl, 
                      library(digest))
 
  parSapply(cl, vars, function(var){
    fn <- function(a) a^2
    dg <- digest(list(fn, var))
    cache_fn <- 
      sprintf("Cache_%s.Rdata", 
              dg)
    if (file.exists(cache_fn)){
      load(cache_fn)
    }else{
      var <- fn(var); 
      Sys.sleep(5)
      save(var, file = cache_fn)
    }
    return(var)
  })
}

When running the code it is pretty obvious that the Sys.sleep is not invoked the second time around:

system.time(out <- cacheParallel())
# user system elapsed
# 0.003 0.001 5.079
out
# [1] 1 4
system.time(out <- cacheParallel())
# user system elapsed
# 0.001 0.004 0.046
out
# [1] 1 4
 
# To clean up the files just do:
file.remove(list.files(pattern = "Cache.+\\.Rdata"))

Load balancing

Balancing so that the cores have a similar workload and don’t fight for memory resources is central to a successful parallelization scheme.

Work load

Note that parLapply and foreach are wrapper functions. This means that they do not directly process the parallel code themselves, but rely on other functions to do so. In parLapply the function is defined as:

parLapply <- function (cl = NULL, X, fun, ...) 
{
    cl <- defaultCluster(cl)
    do.call(c, clusterApply(cl, x = splitList(X, length(cl)), 
        fun = lapply, fun, ...), quote = TRUE)
}

Note the splitList(X, length(cl)). This will split the tasks into even portions and send them to the workers. If many of those tasks are cached, or there is a big computational difference between the tasks, you risk ending up with only one worker actually doing any work while the others are inactive. To avoid this you should, when caching, try to remove the already-cached tasks from X, or try to mix everything into an even workload. E.g. if we want to find the optimal number of neurons in a neural network we may want to change:

# From the nnet example
parLapply(cl, c(10, 20, 30, 40, 50), function(neurons) 
  nnet(ir[samp,], targets[samp,],
       size = neurons))

to:

# From the nnet example
parLapply(cl, c(10, 50, 30, 40, 20), function(neurons) 
  nnet(ir[samp,], targets[samp,],
       size = neurons))

Memory load

Running large datasets in parallel can quickly get you into trouble. If you run out of memory the system will either crash or run incredibly slowly. The former happens to me on Linux systems while the latter is quite common on Windows systems. You should therefore always monitor your parallelization to make sure that you aren’t too close to the memory ceiling.

Using FORKs is an important tool for handling memory ceilings. As the forked workers link to the original variables’ addresses, they do not require any time for exporting variables, nor do they take up any additional space when using them. The impact on performance can be significant (my system has 16 GB of memory and eight cores):

> rm(list=ls())
> library(pryr)
> library(magrittr)
> a <- matrix(1, ncol=10^4*2, nrow=10^4)
> object_size(a)
1.6 GB
> system.time(mean(a))
   user  system elapsed 
  0.338   0.000   0.337 
> system.time(mean(a + 1))
   user  system elapsed 
  0.490   0.084   0.574 
> library(parallel)
> cl <- makeCluster(4, type = "PSOCK")
> system.time(clusterExport(cl, "a"))
   user  system elapsed 
  5.253   0.544   7.289 
> system.time(parSapply(cl, 1:8, 
                        function(x) mean(a + 1)))
   user  system elapsed 
  0.008   0.008   3.365 
> stopCluster(cl); gc();
> cl <- makeCluster(4, type = "FORK")
> system.time(parSapply(cl, 1:8, 
                        function(x) mean(a + 1)))
   user  system elapsed 
  0.009   0.008   3.123 
> stopCluster(cl)

FORKs can also allow you to run code in parallel that otherwise crashes:

> cl <- makeCluster(8, type = "PSOCK")
> system.time(clusterExport(cl, "a"))
   user  system elapsed 
 10.576   1.263  15.877 
> system.time(parSapply(cl, 1:8, function(x) mean(a + 1)))
Error in checkForRemoteErrors(val) : 
  8 nodes produced errors; first error: cannot allocate vector of size 1.5 Gb
Timing stopped at: 0.004 0 0.389 
> stopCluster(cl)
> cl <- makeCluster(8, type = "FORK")
> system.time(parSapply(cl, 1:8, function(x) mean(a + 1)))
   user  system elapsed 
  0.014   0.016   3.735 
> stopCluster(cl)

It won’t save you from yourself, though :-D, as you can see below when we create an intermediate variable that takes up additional storage space:

> a <- matrix(1, ncol=10^4*2.1, nrow=10^4)
> cl <- makeCluster(8, type = "FORK")
> parSapply(cl, 1:8, function(x) {
+   b <- a + 1
+   mean(b)
+   })
Error in unserialize(node$con) : error reading from connection

Memory tips

  • Frequently use rm() in order to avoid having unused variables around
  • Frequently call the garbage collector gc(). Although this should happen automatically in R, I’ve found that while it may release the memory locally it may not return it to the operating system (OS). This makes sense when running a single instance, as this is a time-expensive procedure, but if you have multiple processes it may not be a good strategy. Each process needs to get its memory from the OS and it is therefore vital that each process returns memory once it no longer needs it.
  • Although it is often better to parallelize at a large scale due to initialization costs, in memory-constrained situations it may be better to parallelize at a small scale, i.e. in subroutines.
  • I sometimes run code in parallel, cache the results, and once I reach the limit I change to sequential.
  • You can also manually limit the number of cores; using all the cores is of no use if the memory isn’t large enough. A simple way to think of it is: memory.limit()/memory.size() = max cores

Other tips

  • A general core detector function that I often use is:
    max(1, detectCores() - 1)
  • Never use set.seed(); use clusterSetRNGStream() instead to set the cluster seed if you want reproducible results (see the sketch after this list)
  • If you have a Nvidia GPU-card, you can get huge gains from micro-parallelization through the gputools package (Warning though, the installation can be rather difficult…).
  • When using mice in parallel remember to use ibind() for combining the imputations.
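A minimal sketch of reproducible parallel random numbers with clusterSetRNGStream():

library(parallel)
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)   # seeds the RNG streams of all workers
parSapply(cl, 1:4, function(i) runif(1))
stopCluster(cl)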


To leave a comment for the author, please follow the link and comment on his blog: G-Forge » R.


Online course: Survey Analysis in R with Thomas Lumley


On March 20, Thomas Lumley, the creator of the R package “survey”, will give an online course (at statistics.com) titled “Survey Analysis in R”.

The purpose of this 4-week online course is to teach survey researchers who are familiar with R how to use it in survey research. The course uses Lumley’s survey package. You will learn how to describe the design of a survey to R; both simple and complex designs are covered. You will then learn how to produce descriptive statistics and graphs from the survey data, and also how to perform regression analysis on the data. The instructor, Thomas Lumley, PhD, is a Professor of Biostatistics at the University of Auckland and an Affiliate Professor at the University of Washington. He has published numerous journal articles in his areas of research interest, which include regression modeling, clinical trials, statistical computing, and survey research. The course requires about 15 hours per week and there are no set hours when you must be online. Participants can ask questions and exchange comments directly with Dr. Lumley via a private discussion board throughout the period.

You can register for this online course by clicking here (use the promo code “LumleyR” to get a 10% discount).

 

Course Program:

WEEK 1: Describing the Survey Design to R

  • The usual ‘with-replacement’ approximation
    • svydesign()
    • svrepdesign()
  • Database-backed designs for large surveys
  • Full description of multistage surveys
  • Creating replicate weights for a design: as.svrepdesign()
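As a taste of what svydesign() looks like in practice, here is a minimal sketch using the api example data that ships with the survey package (not part of the course material itself):

library(survey)
data(api)                       # example data bundled with the package
# a stratified sample of California schools: strata, weights and population sizes
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
svymean(~api00, dstrat)         # design-based estimate of the mean API score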

 

WEEK 2: Summary Statistics

  • Computing summary statistics and design effects.
  • Extracting information from result objects
  • Tables of summary statistics
  • Contingency tables: svychisq(), svyloglin()

 

WEEK 3: Graphics

  • Boxplots, histograms, plots of tabular data.
  • Strategies for weighting in scatterplots: bubble plots, hexagonal binning, transparency
  • Scatterplot smoothers.

 

WEEK 4: Regression

  • Linear models
  • Generalized linear models
  • Proportional odds and other cumulative link models
  • Survival analysis

You can register for this online course by clicking here (use the promo code “LumleyR” to get a 10% discount).

Making R Files Executable (under Windows)


(This article was first published on Automated Data Collection with R Blog - rstats, and kindly contributed to R-bloggers)

Although it is reasonable that R scripts get opened in edit mode by default, it would be even nicer (once in a while) to run them with a simple double-click. Well, here we go ...

Choosing a new file extension name (.Rexec)

First, we have to think about a new file extension name. While double-click-to-run is a nice-to-have, the default behaviour should not be overridden. In the Windows universe one cannot simply attach two different behaviours to the same file extension, but we can register new extensions and associate custom defaults with those. Therefore we need another, new file extension.

To make the file extension as self-explanatory as possible, I suggest using .Rexec for R scripts that should be executable while leaving the default system behaviour for .R files as is.

Associating a new file type with the .Rexec extension

In the next step, we tell Windows that the .Rexec file extension is associated with the RScriptExecutable file type. Furthermore, we inform Windows how these kinds of files should be opened by default.

To do so, we need access to the command line interface, e.g., via cmd. Click Start and type cmd into the search bar. Instead of hitting enter right away, right click on the 'cmd.exe' search result, choose Run as administrator from the context menu, and click Yes on the following pop up window. The windows command line should pop up thereafter.

Within the command line, type first:

ASSOC .Rexec=RScriptExecutable

... then:

FTYPE RScriptExecutable=C:\Program Files\R\R-3.1.2\bin\x64\Rscript.exe %1 %*

... while making sure that the path used above really leads to your most recent/preferred Rscript.exe.

Testing

To test if everything works as expected, create an R script and write the following lines:

message(getwd())
for(i in 1:100) {
  cat(".")
  Sys.sleep(0.01)
}
message("nBye.")
Sys.sleep(3)

Save it as, e.g., 'test.Rexec' and double click on the file. A black box should now pop up, informing you about the current working directory, printing 100 dots on the screen, and terminating itself after saying 'Bye'.

Et voilà.

One more thing (or two)

While you are now able to produce executable R script files, note that it is also very easy to transform them back by simply changing the file extension from .Rexec to .R and vice versa.

If you execute your R scripts from the command line, you might want to save yourself from having to add the file extension every time. Simply register .Rexec as a file extension that is executable. The PATHEXT environment variable stores all executable file types. Either go to: Start > Control Panel > System > Advanced System Settings > Environment Variables and search for the 'PATHEXT' entry under System Variables and add .Rexec to the end of the line like that: '.COM;.EXE;.BAT;.Rexec', or go to the command line again and type:

set PATHEXT=%PATHEXT%;.Rexec 


To leave a comment for the author, please follow the link and comment on his blog: Automated Data Collection with R Blog - rstats.


The United States In Two Words


(This article was first published on Ripples, and kindly contributed to R-bloggers)

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I write to show the abilities of the twitteR package, and also the second post I write for KDnuggets. In this case my goal is to get an insight into what people tweet about American states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every state. Once I have the set of tweets for each state I do some simple text mining: cleaning, standardizing, removing empty words and crossing them with sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

[Map: States In Two Words v2]

Since the right side of the map is a little bit messy, in the original post you can see a table with the pair of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.perm'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.perm')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("'\"", table$"State name"[i], " is\"'", sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

To leave a comment for the author, please follow the link and comment on his blog: Ripples.


RStudio v0.99 Preview: Vim Mode Improvements


(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio’s code editor includes a set of lightweight Vim key bindings. You can turn these on in Tools | Global Options | Code | Editing:

[Screenshot: global options]

For those not familiar, Vim is a popular text editor built to enable efficient text editing. It can take some practice and dedication to master Vim style editing but those who have done so typically swear by it. RStudio’s “vim mode” enables the use of many of the most common keyboard operations from Vim right inside RStudio.

As part of the 0.99 preview release, we’ve included an upgraded version of the ACE editor, which has a completely revamped Vim mode. This mode extends the range of Vim key bindings that are supported, and implements a number of Vim “power features” that go beyond basic text motions and editing. These include:

  • Vertical block selection via Ctrl + V. This integrates with the new multiple cursor support in ACE and allows you to type in multiple lines at once.
  • Macro playback and recording, using q{register} / @{register}.
  • Marks, which allow you to drop markers in your source and jump back to them quickly later.
  • A selection of Ex commands, such as :wq and :%s, that allow you to perform editor operations as you would in native Vim.
  • Fast in-file search with e.g. / and *, and support for JavaScript regular expressions.

We’ve also added a Vim quick reference card to the IDE that you can bring up at any time to show the supported key bindings. To see it, switch your editor to Vim mode (as described above) and type :help in Command mode.

[Screenshot: the Vim quick reference card shown in the RStudio IDE]

Whether you’re a Vim novice or power user, we hope these improvements make the RStudio IDE’s editor a more productive and enjoyable environment for you. You can try the new Vim features out now by downloading the RStudio Preview Release.


To leave a comment for the author, please follow the link and comment on his blog: RStudio Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Monitoring progress of a foreach parallel job


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Andrie de Vries

R has strong support for parallel programming, both in base R and additional CRAN packages.

For example, we have previously written about foreach and parallel programming in the articles Tutorial: Parallel programming with foreach and Intro to Parallel Random Number Generation with RevoScaleR.

The foreach package provides simple looping constructs in R, similar to lapply() and friends, and makes it easy to execute each iteration of the loop in parallel.  You can find the packages at foreach: Foreach looping construct for R and doParallel.
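
For concreteness, here is a minimal sketch (mine, not from the post) of a foreach loop whose iterations run in parallel via the doParallel backend; the loop body is just a placeholder for real work:

library(foreach)
library(doParallel)

cl <- makeCluster(2)            # two local workers
registerDoParallel(cl)          # make them the %dopar% backend

# each iteration runs in a separate R worker; .combine collects the results
res <- foreach(i = 1:4, .combine = c) %dopar% {
  Sys.sleep(1)                  # stand-in for real work
  i^2
}

stopCluster(cl)
res
# [1]  1  4  9 16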

Tracking progress of parallel computing tasks

Parallel programming can help speed up the total completion time of your project.  However, for tasks that take a long time to run, you may wish to track progress of the task, while the task is running.

This sounds like a simple request, but it turns out to be remarkably hard to achieve. The reason boils down to this:

  1. Each parallel worker is running in a different session of R

  2. In some parallel computing setups, the workers don’t communicate with the initiating process, until the final combining step 

So, if it is difficult to track progress directly, what can be done?

It seems to me the typical answers to this question fall into three classes:

  • Use operating system monitoring tools, i.e. tools external to R.

  • Print messages to a file (or connection) in each worker, then read from this file, again outside of R

  • Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package

This is an area with many avenues of exploration, so I plan to briefly summarize each method and point to at least one question on StackOverflow that may help.

Method 1: Use external monitoring tools.

The question Monitoring Progress/Debugging Parallel R Scripts asks if it is possible to monitor a parallel job.

In his answer to this question, Dirk Eddelbuettel mentions that parallel back ends like MPI and PVM have job monitors, such as slurm and TORQUE.  However, simpler tools, like snow, do not have monitoring tools.  In this case, you may be forced to use methods like printing diagnostic messages to a file.

For parallel jobs using the doParallel backend, you can use standard operating system monitoring tools to see if the job is running on multiple cores.  For example, in Windows, you can use the "Task Manager" to do this. Notice in the CPU utilization how each core went to maximum once the script started:

[Screenshot: Windows Task Manager showing every core at maximum CPU utilization once the script started]

Method 2: Print messages to a file (or connection) in each worker, then read from this file, again outside of R

Sometimes it may be sufficient, or desirable, to print status messages from each of the workers.  Simply adding a print() statement will not work, since the parallel workers do not share the standard output of the master job.

The question How can I print when using %dopar% asks how to do this using a snow parallel backend.

Steve Weston, the author of foreach (and one of the original founders of Revolution Analytics) wrote an excellent answer to this question.

Steve says that output produced by the snow workers gets thrown away by default, but you can use the "outfile" argument of makeCluster() to change that. Setting outfile to the empty string ("") prevents snow from redirecting the output, often resulting in the output from your print messages showing up on the terminal of the master process.

Steve suggests creating and registering your cluster with something like:

library(doSNOW)
cl <- makeCluster(4, outfile="")
registerDoSNOW(cl)

He continues: Your foreach loop doesn't need to change at all. This works with both SOCK clusters and MPI clusters using Rmpi built with Open MPI. On Windows, you won't see any output if you're using Rgui. If you use Rterm.exe instead, you will. In addition to your own output, you'll see messages produced by snow which can also be useful.

Also note that this solution seems to work with doSNOW, but is not supported by the doParallel backend.
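
If coarse progress information is enough, one backend-agnostic workaround (again a sketch of mine, not from the post) is to have each task append a line to a log file and watch that file from outside R; the file name is a placeholder, and note that lines written by different workers may interleave:

library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

logfile <- "progress.log"       # hypothetical path; pick any writable location

res <- foreach(i = 1:8) %dopar% {
  out <- i^2                    # stand-in for the real work
  # each worker appends its own status line once its task is done
  cat(sprintf("task %d done at %s\n", i, format(Sys.time())),
      file = logfile, append = TRUE)
  out
}

stopCluster(cl)
readLines(logfile)              # or tail -f progress.log from a shell while it runs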

Method 3: Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package

The final approach is a novel idea by Brian Lewis, and uses the Redis database as a parallel back end.

Specifically, the R package rredis allows message passing between R and Redis. The package doRedis allows you to use foreach with redis as the parallel backend. What’s interesting about Redis is that this database allows the user to create queues and each parallel worker fetches jobs from this queue. This allows for a dynamic network of workers, even across different machines.
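
For reference, the basic doRedis pattern looks roughly like the sketch below (my own, assuming a Redis server is already running locally; the queue name "jobs" is arbitrary):

library(doRedis)

registerDoRedis("jobs")                     # use a Redis work queue as the backend
startLocalWorkers(n = 2, queue = "jobs")    # launch two workers on this machine

res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i)

removeQueue("jobs")                         # clean up the queue when done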

The package has a wonderful vignette.  Also take a look at the video demo at http://bigcomputing.com/doredis.html.

But what about actual progress bars?

During my research of available information on this topic, I could not find a published, reliable way of creating progress bars using foreach.

I came across some tantalising hints, e.g. at How do you create a progress bar when using the “foreach()” function in R?

Sadly, the proposed mechanism didn’t actually work.

What next?

I think there might be a way of getting progress bars with foreach and the doParallel package, at least in some circumstances.

I plan to pen my ideas in a follow-up blog post.

Meanwhile, can you do better?  Is there a way of creating progress bars with foreach in a parallel job?

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Visualizing Hubway Trips in Boston


Most Popular Hubway Stations (in order):

  1. Post Office Sq. – located in the heart of the financial district.
  2. Charles St. & Cambridge – the first Hubway stop after crossing from Cambridge over Longfellow Bridge.
  3. Tremont St & West – East side of the Boston Common
  4. South Station
  5. Cross St. & Hannover – entrance to the North End coming from the financial district.
  6. Boylston St & Berkeley – between Copley and the Common.
  7. Stuart St & Charles – Theatre district, just south of the Common.
  8. Boylston & Fairfield – located in front of the Boylston Street entrance to the Pru.
  9. The Esplanade (Beacon St & Arlington) – this stop is on the north end of the Esplanade running along the Charles.
  10. Chinatown Gate Plaza
  11. Prudential at Belvidere
  12. Boston Public Library on Boylston
  13. Boston Aquarium
  14. Newbury Street & Hereford
  15. Government Center


I received great feedback from the last post visualizing crime in Boston, so I’m continuing the Boston-related content.

Data

I used trip-level data, which Hubway has made available here. The data is de-identified, although some bicyclist information is provided – e.g. gender and address zip code of registered riders (there are over 4 times more trips taken by males than females).

I initially wanted to visualize the trips on a city-level map, but dropped the idea after seeing a great post on the arcdiagram package in R. The Hubway system is basically a network where the nodes are bike stations and the edges are trips from one station to another. Arc diagrams are a cool way to visualize networks.

Arc Diagram Interpretation

  • The arcs represent network edges, or trip routes.
  • The thickness of the arcs is proportional to the popularity of the route, as measured by the number of trips taken on that route.
  • The size of the nodes is proportional to the popularity of the node, as measured by “degree.” The degree of a node is defined as the number of edges connected to that node.

Data Cleaning

Some of the data was questionable. There were many trips which began and ended in the same station with a trip duration of 2 minutes. There were also trips that lasted for over 10 hours.

  • I dropped the trips with very low duration (1st duration percentile) and very high duration (99th duration percentile).
  • There were many trips which began and ended in the same station that were not questionable. I removed these because they were cluttering the arc diagram without adding much value.
  • I only used data from bicyclists in certain zip codes (see zip_code vector in the code below).
  • Since the dataset was so massive, I only plotted a random sample of 1000 trips.

Comments on Arcdiagram Package

  • My one issue with the arcdiagram package is that there is no workaround for very small node labels
  • Some arc diagrams have arcs both below and above the x-axis. This package doesn’t seem to offer that option.
 
install.packages('devtools')
library(devtools)      # load devtools before calling install_github()
install_github('arcdiagram', username ='gastonstat')
library(arcdiagram)
library(igraph)        # for graph.edgelist(), get.edgelist(), degree(), clusters()

input='.../Hubway/hubway_2011_07_through_2013_11'
setwd(input)

zip_code=c('02116','02111','02110','02114','02113','02109')

stations=read.csv('hubway_stations.csv')
trips=read.csv('hubway_trips.csv')

# clean data - there are negative values as well as outrageously huge values
# negative values 
trips_2=trips[which(trips$duration>=0),]

# remove very short trips (below the 1st duration percentile, e.g. clock resets)
# and trips that start and end at the same station
p1=as.vector(quantile(trips_2$duration,c(.01)))
trips_3=trips_2[which(trips_2$duration>=p1 & trips_2$strt_statn!=trips_2$end_statn),]
# remove outrageously long trips: anything above the 99th duration percentile
p99=as.vector(quantile(trips_3$duration,c(.99)))
trips_4=trips_3[which(trips_3$duration<=p99),]

# subset to only trips starting/ending in given zip codes
trips_5=trips_4[which(trips_4$zip_code %in% zip_code),]

set.seed(1000)
data=cbind(trips_5$strt_statn,trips_5$end_statn)
samp_n=seq(1,length(data)/2)
samp_set=sample(samp_n,1000,replace=FALSE)
samp=data.frame(data[samp_set,])

# merge on station names (m = start stations, m2 = end stations)
names(samp)=c('id','id2')
m=merge(x=samp,y=stations)
names(samp)=c('id2','id')
m2=merge(x=samp,y=stations)

# create sample matrix
samp_w_labels=data.frame(m[,'station'],m2[,'station'])
names(samp_w_labels)=c('start','end')
samp_mat=as.matrix(samp_w_labels)

# delete trips that end where they start
con=paste(samp_mat[,1],samp_mat[,2],sep='')
dup=duplicated(con)
dupp=samp_mat[dup,]
dupp=dupp[which(dupp[,1]!=dupp[,2]),]

# create weights for arcs... the weight is the frequency of trips
# that each arc represents
clist=data.frame(paste(dupp[,1],dupp[,2],sep=''))
names(clist)=c('clist')
ctab=data.frame(table(clist))
c_m=merge(x=clist,y=ctab)

# create network structure
g=graph.edgelist(dupp, directed=TRUE)
edges=get.edgelist(g)
deg=degree(g)
clus=clusters(g)

# create colors
pal=colorRampPalette(c('darkorchid1','darkorchid4'),bias=5)
colors=pal(length(clus$membership))
node_cols=colors[clus$membership]

# generate arcplot
arcplot(dupp, 
 lwd.arcs =.2*c_m$Freq,cex.nodes=.07*deg,
 col.nodes='black',bg.nodes=node_cols, pch.nodes = 21,
 ordering=order(deg,decreasing=TRUE),
 cex.labels=.18,
 horizontal=TRUE)

Some thoughts on Vim


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Gary R. Moser
Director of Institutional Research and Planning
The California Maritime Academy

I recently contacted Joseph Rickert about inviting Vim guru Drew Neil (web: vimcasts.org, book: "Practical Vim: Edit Text at the Speed of Thought") to speak at the Bay Area R User Group. Because Drew lives in Great Britain, that might not be easy to arrange, so Joe generously extended an invitation for me to share a bit about why I like Vim here.

When it comes to text editors, there are a number of strong contenders to choose from (...lots and lots of them). If you've found a tool with a rich feature set that makes you more productive (and, more importantly, is something you like), you should probably stick with it. Given that editing text of various types is such a big part of our lives, however, it's worth turning a critical eye toward how well our tools facilitate this specific activity.

Do you spend a lot of time editing text? Yeah, me too. Personally this includes R code, Markdown, Latex, as well as notes and outlines (including this article). When I rediscovered Vim fairly recently, it was from the perspective of being a proficient R user. Therefore, my values and beliefs about what constitutes good software are framed by my experience with R. That includes being OK with an initial learning curve.

I have been using RStudio as my primary IDE for R since it was first offered.  It's great software; they did something not too long ago that I really appreciate - they added a stripped-down Vim editing mode. Vim pushes users to the documentation pretty quickly (usually as the result of accidentally deleting large chunks of work), and as I dug in and began to discover its full functionality, I came to realize how much I was missing out on by using the emulator. The ability to set and jump to marks in a document, or to use multiple registers for copy/paste, are two good examples of essential features missing from RStudio.

Vim has been described as "a language for text editing," which I think is a useful way to view it. At the risk of sounding snotty, I would compare the experiences of using Vim (or another good editor) versus a plain-jane text editor to that of playing chess versus checkers. That is, there's an element of strategic and intentional action compared to simply playing one of a limited set of moves over and over again.

One of the things that makes Vim so interesting and different from other editors stems from its origins. As the result of being developed in the context of severe constraints (slow networks, no digital displays, limited system resources, and no mouse), Vim - then "ed" - had to accomplish the greatest amount of work with the least number of keystrokes. This requirement led to the development of a vast number of very specific commands that can be combined in useful ways. Drew Neil artfully compares this to playing notes, chords, and melodies on a piano. It's also an appropriate comparison for setting one's expectations toward becoming a skilled Vim user! Michael Mrozek's humorous plot cleverly suggests that, not unlike R, Vim doesn't hold your hand.

[Image: humorous editor learning curves]

It also speaks to my point about specificity. Emacs, for example, can be extended to be a web client or music player, hence the rabbit-hole learning curve, but isn't that getting away from the primary task of text editing?

The fundamental way that Vim differs from most other text editors is that it is explicitly modal; all software is technically modal in certain ways (that is, the same keys do different things under different circumstances), but with Vim it is a central design feature. Essentially, what this means is that by switching modes, a different keyboard comes into existence under your fingers.  Because Vim has four modes, and a very rich and terse set of key-bindings, it's like having four+ keyboards in one. The keyboard cheat sheet is a useful reference, especially in the beginning.

[Image: vi/vim keyboard cheat sheet]

Warning: after becoming familiar with Vim's basic functionality, going back to a typical text editor feels rather clumsy.

Vim as an interface to R using the Vim-R-plugin is mostly good for how I use it, but I expect to be dialing-in Vim for a long time before it's got all the features I want. I don't mind this, but I can see how someone else might. I encourage you to consider your own tools and how well they facilitate your most frequent tasks. If you're an RStudio user, try giving Vim mode a go. A visit to www.vim.org will connect you to the resources you'll need.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Ensemble Learning with Cubist Model


(This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers)

The tree-based Cubist model can be easily used to develop an ensemble classifier with a scheme called “committees”. The concept of “committees” is similar to that of “boosting” in that a series of trees is developed sequentially with adjusted weights. However, the final prediction is the simple average of the predictions from all “committee” members, an idea closer to “bagging”.

Below is a demonstration showing how to use the train() function in the caret package to select the optimal number of “committees” in the ensemble model with cubist, e.g. 100 in the example. As shown, the ensemble model is able to outperform the standalone model by ~4% in a separate testing dataset.

data(Boston, package = "MASS")
X <- Boston[, 1:13]
Y <- log(Boston[, 14])

# SAMPLE THE DATA
set.seed(2015)
rows <- sample(1:nrow(Boston), nrow(Boston) - 100)
X1 <- X[rows, ]
X2 <- X[-rows, ]
Y1 <- Y[rows]
Y2 <- Y[-rows]

pkgs <- c('doMC', 'Cubist', 'caret')
lapply(pkgs, require, character.only = T)
registerDoMC(cores = 7)

# TRAIN A STANDALONE MODEL FOR COMPARISON 
mdl1 <- cubist(x = X1, y = Y1, control = cubistControl(unbiased = TRUE,  label = "log_medv", seed = 2015))
print(cor(Y2, predict(mdl1, newdata = X2) ^ 2))
# [1] 0.923393

# SEARCH FOR THE OPTIMAL NUMBER OF COMMITTEES
test <- train(x = X1, y = Y1, "cubist", tuneGrid = expand.grid(.committees = seq(10, 100, 10), .neighbors = 0), trControl = trainControl(method = 'cv'))
print(test)
# OUTPUT SHOWING THE HIGHEST R^2 WHEN # OF COMMITTEES = 100
#  committees  RMSE       Rsquared   RMSE SD     Rsquared SD
#   10         0.1607422  0.8548458  0.04166821  0.07783100 
#   20         0.1564213  0.8617020  0.04223616  0.07858360 
#   30         0.1560715  0.8619450  0.04015586  0.07534421 
#   40         0.1562329  0.8621699  0.03904749  0.07301656 
#   50         0.1563900  0.8612108  0.03904703  0.07342892 
#   60         0.1558986  0.8620672  0.03819357  0.07138955 
#   70         0.1553652  0.8631393  0.03849417  0.07173025 
#   80         0.1552432  0.8629853  0.03887986  0.07254633 
#   90         0.1548292  0.8637903  0.03880407  0.07182265 
#  100         0.1547612  0.8638320  0.03953242  0.07354575 

mdl2 <- cubist(x = X1, y = Y1, committees = 100, control = cubistControl(unbiased = TRUE,  label = "log_medv", seed = 2015))
print(cor(Y2, predict(mdl2, newdata = X2) ^ 2))
# [1] 0.9589031

To leave a comment for the author, please follow the link and comment on his blog: Yet Another Blog in Statistical Computing » S+/R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A Speed Comparison Between Flexible Linear Regression Alternatives in R


(This article was first published on Publishable Stuff, and kindly contributed to R-bloggers)

Everybody loves speed comparisons! Is R faster than Python? Is dplyr faster than data.table? Is STAN faster than JAGS? It has been said that speed comparisons are utterly meaningless, and in general I agree, especially when you are comparing apples and oranges which is what I’m going to do here. I’m going to compare a couple of alternatives to lm(), that can be used to run linear regressions in R, but that are more general than lm(). One reason for doing this was to see how much performance you’d lose if you used one of these tools to run a linear regression (even if you could have used lm()). But as speed comparisons are utterly meaningless, my main reason for blogging about this is just to highlight a couple of tools you can use when you’ve grown out of lm(). The speed comparison was just to lure you in. Let’s run!

The Contenders

Below are the seven different methods that I’m going to compare by using each method to run the same linear regression. If you are just interested in the speed comparisons, just scroll to the bottom of the post. And if you are actually interested in running standard linear regressions as fast as possible in R, then Dirk Eddelbuettel has a nice post that covers just that.

lm()

This is the baseline, the “default” method for running linear regressions in R. If we have a data.frame d with the following layout:

head(d)
##         y      x1      x2
## 1 -64.579 -1.8088 -1.9685
## 2 -19.907 -1.3988 -0.2482
## 3  -4.971  0.8366 -0.5930
## 4  19.425  1.3621  0.4180
## 5  -1.124 -0.7355  0.4770
## 6 -12.123 -0.9050 -0.1259

Then this would run a linear regression with y as the outcome variable and x1 and x2 as the predictors:

lm(y ~ 1 + x1 + x2, data=d)
## 
## Call:
## lm(formula = y ~ 1 + x1 + x2, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      -0.293       10.364       21.225

glm()

This is a generalization of lm() that allows you to assume a number of different distributions for the outcome variable, not just the normal distribution as you are stuck with when using lm(). However, if you don’t specify any distribution glm() will default to using a normal distribution and will produce output identical to lm():

glm(y ~ 1 + x1 + x2, data=d)
## 
## Call:  glm(formula = y ~ 1 + x1 + x2, data = d)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      -0.293       10.364       21.225  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  27 Residual
## Null Deviance:	    13200 
## Residual Deviance: 241 	AIC: 156

bayesglm()

Found in the arm package, this is a modification of glm that allows you to assume custom prior distributions over the coefficients (instead of the implicit flat priors of glm()). This can be super useful, for example, when you have to deal with perfect separation in logistic regression or when you want to include prior information in the analysis. While there is bayes in the function name, note that bayesglm() does not give you the whole posterior distribution, only point estimates. This is how to run a linear regression with flat priors, which should give similar results as when using lm():

library(arm)
bayesglm(y ~ 1 + x1 + x2, data = d, prior.scale=Inf, prior.df=Inf)
## 
## Call:  bayesglm(formula = y ~ 1 + x1 + x2, data = d, prior.scale = Inf, 
##     prior.df = Inf)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      -0.293       10.364       21.225  
## 
## Degrees of Freedom: 29 Total (i.e. Null);  30 Residual
## Null Deviance:	    13200 
## Residual Deviance: 241 	AIC: 156

nls()

While lm() can only fit linear models, nls() can also be used to fit non-linear models by least squares. For example, you could fit a sine curve to a data set with the following call: nls(y ~ par1 + par2 * sin(par3 + par4 * x )). Notice here that the syntax is a little bit different from lm() as you have to write out both the variables and the parameters. Here is how to run the linear regression:

nls(y ~ intercept + x1 * beta1 + x2 * beta2, data = d)
## Nonlinear regression model
##   model: y ~ intercept + x1 * beta1 + x2 * beta2
##    data: d
## intercept     beta1     beta2 
##    -0.293    10.364    21.225 
##  residual sum-of-squares: 241
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 3.05e-08

mle2()

In the bbmle package we find mle2(), a function for general maximum likelihood estimation. While mle2() can be used to maximize a handcrafted likelihood function, it also has a formula interface which is simple to use, but powerful, and that plays nice with R’s built in distributions. Here is how to roll a linear regression:

library(bbmle)
inits <- list(log_sigma = rnorm(1), intercept = rnorm(1),
              beta1 = rnorm(1), beta2 = rnorm(1))
mle2(y ~ dnorm(mean = intercept + x1 * beta1 + x2 * beta2, sd = exp(log_sigma)),
     start = inits, data = d)
## 
## Call:
## mle2(minuslogl = y ~ dnorm(mean = intercept + x1 * beta1 + x2 * 
##     beta2, sd = exp(log_sigma)), start = inits, data = d)
## 
## Coefficients:
## log_sigma intercept     beta1     beta2 
##    1.0414   -0.2928   10.3641   21.2248 
## 
## Log-likelihood: -73.81

Note that we need to explicitly initialize the parameters before the maximization and that we now also need a parameter for the standard deviation. For an even more versatile use of the formula interface for building statistical models, check out the very cool rethinking package by Richard McElreath.

optim()

Of course, if we want to be really versatile, we can craft our own log-likelihood function to be maximized using optim(), also part of base R. This gives us all the options, but there are also more things that can go wrong: we might make mistakes in the model specification, and if the search for the optimal parameters is not initialized well the model might not converge at all! A linear regression log-likelihood could look like this:

log_like_fn <- function(par, d) {
  sigma <- exp(par[1])
  intercept <- par[2]
  beta1 <- par[3]
  beta2 <- par[4]
  mu <- intercept + d$x1 * beta1 + d$x2 * beta2
  sum(dnorm(d$y, mu, sigma, log=TRUE))
}

inits <- rnorm(4)
optim(par = inits, fn = log_like_fn, control = list(fnscale = -1), d = d)
## $par
## [1]  1.0399 -0.2964 10.3637 21.2139
## 
## $value
## [1] -73.81
## 
## $counts
## function gradient 
##      431       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

As the convergence code returned was 0, it hopefully worked fine (a 1 indicates non-convergence). The control = list(fnscale = -1) argument is just there to make optim() do maximum likelihood estimation rather than minimum likelihood estimation (which must surely be the worst estimation method ever).

Stan’s optimizing()

Stan is a standalone program that plays well with R, and that allows you to specify a model in Stan’s language which will compile down to very efficient C++ code. Stan was originally built for doing Hamiltonian Monte Carlo, but now also includes an optimizing() function that, like R’s optim(), allows you to do maximum likelihood estimation (or maximum a posteriori estimation, if you explicitly included priors in the model definition). Here we need to do a fair bit of work before we can fit a linear regression, but what we gain is extreme flexibility in extending this model, should we need to. We have come a long way from lm().

library(rstan)
## Loading required package: inline
## 
## Attaching package: 'inline'
## 
## The following object is masked from 'package:Rcpp':
## 
##     registerPlugin
## 
## rstan (Version 2.6.0, packaged: 2015-02-06 21:02:34 UTC, GitRev: 198082f07a60)
## 
## Attaching package: 'rstan'
## 
## The following object is masked from 'package:arm':
## 
##     traceplot
model_string <- "
data {
  int n;
  vector[n] y;
  vector[n] x1;
  vector[n] x2;
}

parameters {
  real intercept;
  real beta1;
  real beta2;
  real<lower=0> sigma;
}

model {
  vector[n] mu;
  mu <- intercept + x1 * beta1 + x2 * beta2;
  y ~ normal(mu, sigma);
}
"

data_list <- list(n = nrow(d), y = d$y, x1 = d$x1, x2 = d$x2)
model <- stan_model(model_code = model_string)
fit <- optimizing(model, data_list)
fit
## $par
## intercept     beta1     beta2     sigma 
##   -0.2929   10.3642   21.2248    2.8331 
## 
## $value
## [1] -46.24

An Utterly Meaningless Speed Comparison

So, just for fun, here is the speed comparison, first for running a linear regression with 1000 data points and 5 predictors:

[Plot: timing comparison for a linear regression with 1000 data points and 5 predictors]

This should be taken with a huge heap of salt (which is not too good for your health!). While all these methods produce a result equivalent to a linear regression, they do it in different ways, and not necessarily in equally good ways; for example, my homemade optim routine does not converge correctly when trying to fit a model with too many predictors. As I have used the standard settings, there is surely a multitude of ways in which any of these methods can be made faster. Anyway, here is what happens if we vary the number of predictors and the number of data points:

[Plot: timing comparison as the number of predictors and the number of data points vary]

To make these speed comparisons I used the microbenchmark package; the full script replicating the plots above can be found here. This speed comparison was made on my laptop running R version 3.1.2, on 32 bit Ubuntu 12.04, with an average amount of RAM and a processor that is starting to get a bit tired.

To leave a comment for the author, please follow the link and comment on his blog: Publishable Stuff.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

rud.is » R 2015-03-30 13:32:08


(This article was first published on rud.is » R, and kindly contributed to R-bloggers)

Over on The DO Loop, @RickWicklin does a nice job visualizing the causes of airline crashes in SAS using a mosaic plot. More often than not, I find mosaic plots can be a bit difficult to grok, but Rick’s use was spot on and I believe it shows the data pretty well. I also thought I’d take the opportunity to:

  • Give @jennybc’s new googlesheets a spin
  • Show some dplyr & tidyr data wrangling (never can have too many examples)
  • Crank out some ggplot zero-based streamgraph-y area charts for the data with some extra ggplot wrangling for good measure

I also decided to use the colors in the original David McCandless/Kashan visualization.

Getting The Data

As I mentioned, @jennybc made a really nice package to interface with Google Sheets, and the IIB site makes the data available, so I copied it to my Google Drive and gave her package a go:

library(googlesheets)
library(ggplot2) # we'll need the rest of the libraries later
library(dplyr)   # but just getting them out of the way
library(tidyr)
 
# this will prompt for authentication the first time
my_sheets <- list_sheets()
 
# which one is the flight data one
grep("Flight", my_sheets$sheet_title, value=TRUE)
 
## [1] "Copy of Flight Risk JSON" "Flight Risk JSON" 
 
# get the sheet reference then the data from the second tab
flights <- register_ss("Flight Risk JSON")
flights_csv <- flights %>% get_via_csv(ws = "93-2014 FINAL")
 
# take a quick look
glimpse(flights_csv)
 
## Observations: 440
## Variables:
## $ date       (chr) "d", "1993-01-06", "1993-01-09", "1993-01-31", "1993-02-08", "1993-02-28", "...
## $ plane_type (chr) "t", "Dash 8-311", "Hawker Siddeley HS-748-234 Srs", "Shorts SC.7 Skyvan 3-1...
## $ loc        (chr) "l", "near Paris Charles de Gualle", "near Surabaya Airport", "Mt. Kapur", "...
## $ country    (chr) "c", "France", "Indonesia", "Indonesia", "Iran", "Taiwan", "Macedonia", "Nor...
## $ ref        (chr) "r", "D-BEAT", "PK-IHE", "9M-PID", "EP-ITD", "B-12238", "PH-KXL", "LN-TSA", ...
## $ airline    (chr) "o", "Lufthansa Cityline", "Bouraq Indonesia", "Pan Malaysian Air Transport"...
## $ fat        (chr) "f", "4", "15", "14", "131", "6", "83", "3", "6", "2", "32", "55", "132", "4...
## $ px         (chr) "px", "20", "29", "29", "67", "22", "56", "19", "22", "17", "38", "47", "67"...
## $ cat        (chr) "cat", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A2", "A1", "A1", "A1...
## $ phase      (chr) "p", "approach", "initial_climb", "en_route", "en_route", "approach", "initi...
## $ cert       (chr) "cert", "confirmed", "probable", "probable", "confirmed", "probable", "confi...
## $ meta       (chr) "meta", "human_error", "mechanical", "weather", "human_error", "weather", "h...
## $ cause      (chr) "cause", "pilot & ATC error", "engine failure", "low visibility", "pilot err...
## $ notes      (chr) "n", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
 
# the spreadsheet has a "helper" row for javascript, so we nix it
flights_csv <- flights_csv[-1,] # js vars removal
 
# and we convert some columns while we're at it
flights_csv %>%
  mutate(date=as.Date(date),
         fat=as.numeric(fat),
         px=as.numeric(px)) -> flights_csv

A Bit of Cleanup

Despite being a spreadsheet, the data needs some cleanup and there’s no real need to include “grounded” or “unknown” in the flight phase given the limited number of incidents in those categories. I’d actually mention that descriptively near the visual if this were anything but a blog post.

The area chart also needs full values for each category combo per year, so we use expand from tidyr with left_join and mutate to fill in the gaps.

Finally, we make proper, ordered labels:

flights_csv %>%
  mutate(year=as.numeric(format(date, "%Y"))) %>%
  mutate(phase=tolower(phase),
         phase=ifelse(grepl("take", phase), "takeoff", phase),
         phase=ifelse(grepl("climb", phase), "takeoff", phase),
         phase=ifelse(grepl("ap", phase), "approach", phase)) %>%
  count(year, meta, phase) %>%
  left_join(expand(., year, meta, phase), ., c("year", "meta", "phase")) %>% 
  mutate(n=ifelse(is.na(n), 0, n)) %>% 
  filter(!phase %in% c("grounded", "unknown")) %>%
  mutate(phase=factor(phase, 
                      levels=c("takeoff", "en_route", "approach", "landing"),
                      labels=c("Takeoff", "En Route", "Approach", "Landing"),
                      ordered=TRUE)) -> flights_dat

I probably took some liberties lumping “climb” in with “takeoff”, but I’d’ve asked an expert for a production piece just as I would hope folks doing work for infosec reports or visualizations would consult someone knowledgable in cybersecurity.

The Final Plot

I’m a big fan of an incremental, additive build idiom for ggplot graphics. By using the gg <- gg + … style one can move lines around, comment them out, etc without dealing with errant + signs. It also forces a logical separation of ggplot elements. Personally, I tend to keep my build orders as follows:

  • main ggplot call with mappings if the graph is short, otherwise add the mappings to the geoms
  • all geom_ or stat_ layers in the order I want them, and using line breaks to logically separate elements (like aes) or to wrap long lines for easier readability.
  • all scale_ elements in order from axes to line to shape to color to fill to alpha; I’m not as consistent as I’d like here, but keeping to this makes it really easy to quickly hone in on areas that need tweaking
  • facet call (if any)
  • label setting, always with labs unless I really have a need for using ggtitle
  • base theme_ call
  • all other theme elements, one per gg <- gg + line

I know that’s not everyone’s cup of tea, but it’s just how I roll ggplot-style.

For this plot, I use a smoothed stacked plot with a custom smoother and also use Futura Medium for the text font. Substitute your own fav font if you don’t have Futura Medium.

gg <- ggplot(flights_dat, aes(x=year, y=n, group=meta)) 
gg <- gg + stat_smooth(mapping=aes(fill=meta), geom="area",
                       position="stack", method="gam", formula=y~s(x)) 
gg <- gg + scale_fill_manual(name="Reason:", values=flights_palette, 
                             labels=c("Criminal", "Human Error",
                                      "Mechanical", "Unknown", "Weather"))
gg <- gg + scale_y_continuous(breaks=c(0, 5, 10, 13))
gg <- gg + facet_grid(~phase)
gg <- gg + labs(x=NULL, y=NULL, title="Crashes by year, by reason & flight phase")
gg <- gg + theme_bw()
gg <- gg + theme(legend.position="bottom")
gg <- gg + theme(text=element_text(family="Futura Medium"))
gg <- gg + theme(plot.title=element_text(face="bold", hjust=0))
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg <- gg + theme(strip.background=element_rect(fill="#525252"))
gg <- gg + theme(strip.text=element_text(color="white"))
gg

That ultimately produces:

[Figure: smoothed stacked area charts of crashes by year, by reason & flight phase]

with the facets ordered by the takeoff, en route (flying), approach and landing phases. Overall, things have gotten way better, though I haven’t had time to look into the bump between 2005 and 2010 for landing crashes.

As an aside, Boeing has a really nice PDF on some of this data with quite a bit more detail.

To leave a comment for the author, please follow the link and comment on his blog: rud.is » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R / Finance 2015 Open for Registration


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The announcement below just went to the R-SIG-Finance list. More information is as usual at the R / Finance page.

Registration for R/Finance 2015 is now open!

The conference will take place on May 29 and 30, at UIC in Chicago. Building on the success of the previous conferences in 2009-2014, we expect more than 250 attendees from around the world. R users from industry, academia, and government will be joining 30+ presenters covering all areas of finance with R.

We are very excited about the four keynote presentations given by Emanuel Derman, Louis Marascio, Alexander McNeil, and Rishi Narang.
The conference agenda (currently) includes 18 full presentations and 19 shorter "lightning talks". As in previous years, several (optional) pre-conference seminars are offered on Friday morning.

There is also an (optional) conference dinner at The Terrace at Trump Hotel. Overlooking the Chicago river and skyline, it is a perfect venue to continue conversations while dining and drinking.

Registration information and agenda details can be found on the conference website as they are being finalized.
Registration is also available directly at the registration page.

We would like to thank our 2015 sponsors for the continued support enabling us to host such an exciting conference:

International Center for Futures and Derivatives at UIC

Revolution Analytics
MS-Computational Finance and Risk Management at University of Washington

Ketchum Trading
OneMarketData
RStudio
SYMMS

On behalf of the committee and sponsors, we look forward to seeing you in Chicago!

For the program committee:
Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson,
Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich

See you in Chicago in May!

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on his blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Registration Open for R/Finance 2015!


(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)
You can find registration information and agenda details (as they become available) on the conference website.  Or you can go directly to the registration page.  Note that there's an early-bird registration deadline of May 15.

The conference will take place on May 29 and 30, at UIC in Chicago.  Building on the success of the previous conferences in 2009-2014, we expect more than 250 attendees from around the world. R users from industry, academia, and government will be joining 30+ presenters covering all areas of finance with R.

We are very excited about the four keynote presentations given by Emanuel Derman, Louis Marascio, Alexander McNeil, and Rishi Narang.  The main agenda (currently) includes 18 full presentations and 19 shorter "lightning talks".  As in previous years, several (optional) pre-conference seminars are offered on Friday morning.

There is also an (optional) conference dinner that will once-again be held at The Terrace at Trump Hotel. Overlooking the Chicago river and skyline, it is a perfect venue to continue conversations while dining and drinking.

We would like to thank our 2015 sponsors for the continued support enabling us to host such an exciting conference:

International Center for Futures and Derivatives at UIC

Revolution Analytics
MS-Computational Finance at University of Washington

OneMarketData
Ketchum Trading
RStudio
SYMMS

On behalf of the committee and sponsors, we look forward to seeing you in Chicago!

For the program committee:
Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson, Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich

To leave a comment for the author, please follow the link and comment on his blog: FOSS Trading.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Coarse Grain Parallelism with foreach and rxExec


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Joseph Rickert

I have written a several posts about the Parallel External Memory Algorithms (PEMAs) in Revolution Analytics’ RevoScaleR package, most recently about rxBTrees(), but I haven’t said much about rxExec(). rxExec() is not itself a PEMA, but it can be used to write parallel algorithms. Pre-built PEMAs such as rxBTrees(), rxLinMod(), etc are inherently parallel algorithms designed for distributed computing on various kinds of clusters: HPC Server, Platform LSF and Hadoop for example. rxExec()’s job, however, is to help ordinary, non-parallel functions run in parallel computing or distributed computing environments.

To get a handle on this, I think the best place to start is with R’s foreach() function, which enables an R programmer to write “coarse grain” parallel code. To be concrete, suppose we want to fit a logistic regression model to two different data sets. And to speed things up, we would like to do this in parallel. Since my laptop has two multi-threaded cores, this is a straightforward use case to prototype. The following code points to two of the multiple csv files that comprise the mortgageDefault data set available at Revolution Analytics’ data set download site.

#----------------------------------------------------------
# load needed libraries
#----------------------------------------------------------
 library(foreach)
 
#----------------------------------------------------------
# Point to the Data
#----------------------------------------------------------
dataDir <- "C:\\DATA\\Mortgage Data\\mortDefault"
fileName1 <- "mortDefault2000.csv"
path1 <- file.path(dataDir,fileName1)
 
fileName2 <- "mortDefault2001.csv"
path2 <- file.path(dataDir,fileName2)
 
#----------------------------------------------------------
# Look at the first data file
system.time(data1 <- read.csv(path1))
#user  system elapsed 
   #2.52    0.02    2.55
dim(data1)
head(data1,3)
 #creditScore houseAge yearsEmploy ccDebt year default
#1         615       10           5   2818 2000       0
#2         780       34           5   3575 2000       0
#3         735       12           1   3184 2000       0

Note that it takes almost 3 seconds to read one of these files into a data frame.

The following function constructs the name and path of a data set from the parameters supplied to it, reads the data into a data frame, and then uses R’s glm() function to fit a logistic regression model.

#----------------------------------------------------------- 
# Function to read data and fit a logistic regression
#-----------------------------------------------------------
glmEx <- function(directory, fileStem, fileNum, formula){
  fileName <- paste(fileStem, fileNum, ".csv", sep="")
  path <- file.path(directory, fileName)
  data <- read.csv(path)
  model <- glm(formula=formula, data=data, family=binomial(link="logit"))
  return(summary(model))}
 
form <- formula(default ~ creditScore + houseAge + yearsEmploy + ccDebt)

Something like this might be reasonable if you had a whole bunch of data sets in a directory. To process the two data sets in parallel we set up an internal cluster with 2 workers, register the parallel backend and run foreach() with the %dopar% operator.

#----------------------------------------------------------
# Coarse grain parallelism with foreach	
#----------------------------------------------------------
library(doParallel)              # provides registerDoParallel() and loads the parallel package
cl <- makePSOCKcluster(2)        # Create copies of R running in parallel and communicating over sockets.
                                 # My laptop has 2 multi threaded cores 
registerDoParallel(cl)           # register parallel backend
system.time(res <- foreach(num = c(2000,2001)) %dopar% 
         glmEx(directory=dataDir,fileStem="mortDefault",fileNum=num,formula=form))
 
   #user  system elapsed 
   #5.34    1.99   43.54
stopCluster(cl)

The basic idea is that my two-core PC processes the two data sets in parallel. The whole thing runs pretty quickly: two logit models are fit on a million rows each in about 44 seconds.

Now, the same process can be accomplished with rxExec() as follows:

#-----------------------------------------------------------
# Coarse grain parallelism with rxExec
#-----------------------------------------------------------
rxOptions(numCoresToUse=2)               
rxSetComputeContext("localpar")               # use the local parallel compute context
rxGetComputeContext()
 
argList2 <- list(list(fileNum=2000),list(fileNum=2001))
 
system.time(res <- rxExec(glmEx,directory=dataDir,fileStem="mortDefault",formula=form,elemArgs=argList2))
#   user  system elapsed 
#   4.85    2.01   45.54

First notice that rxExec() took about the same amount of time to run. This is not  surprising since, under the hood, rxExec() looks a lot like foreach() (while providing additional functionality). Indeed, the same Revolution Analytics team worked on both functions.

You can also see that rxExec() looks a bit like an apply() family function in that it takes a function, in this case my sample function glmEx(), as one of its arguments. The elemArgs parameter takes a list of arguments that will be different for constructing the two file names, while the other arguments separated by commas in the call statement are parameters that are the same for both. With this tidy syntax we could direct the function to fit models that are located in very different locations and also set different parameters for each glm() call.

The really big difference between foreach() and rxExec(), however, is the line

rxSetComputeContext("localpar")  

which sets the compute context. This is the mechanism that links rxExec() and pre-built PEMAs to RevoScaleR’s underlying distributed computing architecture. Changing the compute context allows you to run the R function in the rxExec() call on a cluster. For example, in the simplest case where you can log into an edge node on a Hadoop cluster, the following code would enable rxExec() to run the glmEx() function on each node of the cluster.

myHadoopContext <- RxHadoopMR()

rxSetComputeContext(myHadoopContext)

In a more complicated scenario, for example where you are remotely connecting to the cluster, it will be necessary to include your credentials and some other parameters in the statement that specifies the compute context.
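
As a rough illustration only, a remote-connection context might look something like the sketch below; the argument names and values (sshUsername, sshHostname, shareDir, hdfsShareDir) are my assumptions and may differ across RevoScaleR versions, so check the RxHadoopMR() documentation for your installation:

# Hypothetical remote Hadoop compute context -- argument names are assumptions,
# not taken from the original post; consult ?RxHadoopMR for your RevoScaleR version
myRemoteHadoopContext <- RxHadoopMR(
    sshUsername  = "analyst",                    # placeholder credentials
    sshHostname  = "edge-node.example.com",      # placeholder host
    shareDir     = "/var/RevoShare/analyst",     # local share directory on the cluster
    hdfsShareDir = "/user/RevoShare/analyst")    # HDFS share directory

rxSetComputeContext(myRemoteHadoopContext)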

Finally, we can ratchet things up to a higher level of performance by using a PEMA in the rxExec() call. This would make sense in a scenario where you want to fit a different model on each node of a cluster while making sure that you are getting the maximum amount of parallel computation from all of the cores on each node. The following new version of the custom glm function uses the RevoScaleR PEMA rxLogit() to fit the logistic regressions:

#-----------------------------------------------------------
# Finer parallelism with rxLogit
#----------------------------------------------------------
glmExRx <- function(directory, fileStem, fileNum, formula){
  fileName <- paste(fileStem, fileNum, ".csv", sep="")
  path <- file.path(directory, fileName)
  data <- read.csv(path)
  model <- rxLogit(formula=formula, data=data)
  return(summary(model))}
argList2 <- list(list(fileNum=2000),list(fileNum=2001))
 
system.time(res <- rxExec(glmExRx,directory=dataDir,fileStem="mortDefault",formula=form,elemArgs=argList2))
#   user  system elapsed 
#   0.01    0.00    8.33

Here, still running just locally on my laptop, we see quite an improvement in performance. The computation runs in about 8.3 seconds. (Remember that over two seconds of this elapsed time is devoted to reading the data.) Some of this performance improvement comes from the additional, “finer grain” parallelism of the rxLogit() function. Most of the speedup, however, is likely due to careful handling of the underlying matrix computations.

In summary, rxExec() can be thought of as an extension of foreach() that is capable of leveraging all kinds of R functions in distributed computing environments.

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Recreating the vaccination heatmaps in R


(This article was first published on Benomics » R, and kindly contributed to R-bloggers)

In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on twitter and facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as for the measles vaccine:

Credit to the WSJ and creators: Tynan DeBold and Dov Friedman

How hard would it be to recreate an R version?

Base R version

Quite recently Mick Watson, a computational biologist based here in Edinburgh, put together a base R version of this figure using heatmap.2 from the gplots package.

If you’re interested in the code for this, I suggest you check out his blog post where he walks the reader through creating the figure, beginning from heatmap defaults.

However, it didn’t take long for someone to pipe up asking for a ggplot2 version (3 minutes in fact…) and that’s my preference too, so I decided to have a go at putting one together.

ggplot2 version

Thankfully the hard work of tracking down the data had already been done for me; to get at it, follow these steps:

  1. Register and log in to “Project Tycho”
  2. Go to level 1 data, then Search and retrieve data
  3. Now change a couple of options: geographic level := state; disease outcome := incidence
  4. Add all states (highlight all at once with Ctrl+A, or Cmd+A on Macs)
  5. Hit submit and scroll down to Click here to download results to excel
  6. Open in excel and export to CSV

Simple, right?

Now all that’s left to do is a bit of tidying. The data comes in wide format, so can be melted to our ggplot2-friendly long format with:

library(reshape2)   # for melt()
measles <- melt(measles, id.var=c("YEAR", "WEEK"))

After that we can clean up the column names and use dplyr to aggregate weekly incidence rates into an annual measure:

library(dplyr)
colnames(measles) <- c("year", "week", "state", "cases")
mdf <- measles %>% group_by(state, year) %>% 
       summarise(c=if(all(is.na(cases))) NA else 
                 sum(cases, na.rm=T))
mdf$state <- factor(mdf$state, levels=rev(levels(mdf$state)))

It’s a bit crude but what I’m doing is summing the weekly incidence rates and leaving NAs if there’s no data for a whole year. This seems to match what’s been done in the WSJ article, though a more interpretable measure could be something like average weekly incidence, as used by Robert Allison in his SAS version.
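
If you preferred that average-weekly-incidence measure, only the aggregation step needs to change; here is a quick sketch of mine (not from the original post), reusing the same objects as above:

# average weekly incidence per state and year instead of the annual sum
mdf_avg <- measles %>% group_by(state, year) %>%
           summarise(c=if(all(is.na(cases))) NA else
                     mean(cases, na.rm=TRUE))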

After trying to match colours via the OS X utility “digital colour meter” without much success, I instead grabbed the colours and breaks from the original plot’s javascript to make them as close as possible.

In full, the actual ggplot2 command took a fair bit of tweaking:

ggplot(mdf, aes(y=state, x=year, fill=c)) + 
  geom_tile(colour="white", linewidth=2, 
            width=.9, height=.9) + theme_minimal() +
    scale_fill_gradientn(colours=cols, limits=c(0, 4000),
                        breaks=seq(0, 4e3, by=1e3), 
                        na.value=rgb(246, 246, 246, max=255),
                        labels=c("0k", "1k", "2k", "3k", "4k"),
                        guide=guide_colourbar(ticks=T, nbin=50,
                               barheight=.5, label=T, 
                               barwidth=10)) +
  scale_x_continuous(expand=c(0,0), 
                     breaks=seq(1930, 2010, by=10)) +
  geom_segment(x=1963, xend=1963, y=0, yend=51.5, size=.9) +
  labs(x="", y="", fill="") +
  ggtitle("Measles") +
  theme(legend.position=c(.5, -.13),
        legend.direction="horizontal",
        legend.text=element_text(colour="grey20"),
        plot.margin=grid::unit(c(.5,.5,1.5,.5), "cm"),
        axis.text.y=element_text(size=6, family="Helvetica", 
                                 hjust=1),
        axis.text.x=element_text(size=8),
        axis.ticks.y=element_blank(),
        panel.grid=element_blank(),
        title=element_text(hjust=-.07, face="bold", vjust=1, 
                           family="Helvetica"),
        text=element_text(family="URWHelvetica")) +
  annotate("text", label="Vaccine introduced", x=1963, y=53, 
           vjust=1, hjust=0, size=I(3), family="Helvetica")

Result

[Figure: recreated measles incidence heatmap]

I’m pretty happy with the outcome but there are a few differences: the ordering is out (someone pointed out the original is ordered by two letter code rather than full state name) and the fonts are off (as far as I can tell they use “Whitney ScreenSmart” among others).

Obviously the original is an interactive chart which works great with this data. It turns out it was built with the highcharts library, which actually has R bindings via the rCharts package, so in theory the original chart could be entirely recreated in R! However, for now at least, that’ll be left as an exercise for the reader…


Full code to reproduce this graphic is on github.

To leave a comment for the author, please follow the link and comment on his blog: Benomics » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...