Four Experiments on the Perception of Bar Charts

At this year’s InfoVis, I published a paper with two of my Tableau Research colleagues, Vidya Setlur and Anushka Anand. The paper explores how people make perceptual comparisons in bar charts, building on a previous study by Cleveland and McGill.

The paper itself is available: Four Experiments on the Perception of Bar Charts. And here are all the stimuli, data, and analyses.

R packages

In class today we covered R packages. A quick attempt to create a package on Windows revealed that the Windows version of R does not come with the necessary build tools. I tried again on a Mac and ran into problems where package.skeleton failed to create the package directories because .find.package couldn’t find my newly created package. After a little experimentation I found that package names cannot contain an ‘_’ (at least on a Mac).

The R CMD check command is very nice. It expands on the idea of static code checking to also check documentation, the install process, example code, etc.
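
Here is a minimal sketch of that workflow; the package name “mypackage” and the one-line function are made up for illustration (and the name deliberately avoids an underscore):

# Hypothetical one-function package.
double_it <- function(x) 2 * x

# Creates a "mypackage" directory with R/, man/, and DESCRIPTION stubs
# for the objects listed (here, just double_it).
package.skeleton(name = "mypackage", list = c("double_it"))

# Then, from the shell:
#   R CMD check mypackage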

R complaints

I’ve recently read a number of complaints about the R programming language and thought I’d pull them together in one place.

  • Inconsistent return types, list/vector confusion (Andrew Gelman)
    I always get mixed up about when to use [] and [[]].
  • Lack of useful types (Andrew Gelman)
    Having nonnegative or [0, 1)-constrained floating-point types would be quite useful in many circumstances. I haven’t used factors enough to know whether they would work in most scenarios where an enumerated type is used in other languages. Having built-in random-variable types would be useful too.
  • Scalability and S4 complexity issues, mixed with R coding style issues (Andrew Gelman)
    I haven’t used S4, so I can’t comment on that. However, I have found it very useful to be able to type the name of a function on the command line and see its code directly. Unfortunately, built-in functions (e.g. lapply) don’t print out (probably because they really only exist in C code). It would be nice for such functions to print an equivalent R implementation with a note saying that the real version executes in C.
  • Vector indexing issues #1, #2, #2a, #3 (Radford Neal)
    As a CS guy I find 1-based vectors hard to justify, but Radford notes a number of other issues. I’ve been bitten by the automatic dimension-dropping “feature” rather frequently; both this and the []/[[]] confusion are illustrated in the sketch after this list.
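
A small R sketch of the two indexing pitfalls mentioned above; the object names are invented for illustration:

x <- list(a = 1:3, b = letters)
x["a"]                # single brackets return a sub-list of length one
x[["a"]]              # double brackets return the element itself (an integer vector)

m <- matrix(1:6, nrow = 2)
m[1, ]                # selecting a single row silently drops the result to a plain vector
m[1, , drop = FALSE]  # drop = FALSE keeps it as a 1 x 3 matrix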

Non-functional elements in R

This list is from John’s lecture:

  1. Operators and functions with side effects: (<<-, assign(), options(foo=))
  2. Nonstandard R objects: environments, connections.
  3. Random number generation
  4. Special mechanisms: “closures”

According to the R language definition, environments are mutable objects, so changes to them are visible outside the function. R closures can have side effects because when an R function returns a list of closures, those closures share the same environment. Since the environment is mutable, using <<- in any one of the closures affects the others as well.

Example (inspired by a more involved example in John’s Software for Data Analysis):

> Counter <- function(start) {
+   t <- start
+   list(
+     inc = function() { t <<- t + 1; t },
+     dec = function() { t <<- t - 1; t })
+ }
>
> counter <- Counter(5)
> counter[["dec"]]()
[1] 4
> counter[["dec"]]()
[1] 3
> counter[["inc"]]()
[1] 4

The Counter function returns a list of two functions, “inc” and “dec”. Both functions are associated with the same environment, so successive calls to “inc” and “dec” operate on the same t variable. This use of closures has become largely outdated with the addition of S4 objects to the language.

Data Analysis and Regression (Chapter 1)

In trying to develop a visual statistical system, it is frustrating to deal with the many limitations of common statistical tests: normality, equal-variance, or equal-sample-size conditions. These conditions make the tests brittle; that is, they only work on a subset of all interesting data sets, and it is difficult to determine which subset. How should this brittleness be accounted for in a statistical system? One possibility is to run additional tests and warn the user if some condition is not met (sketched below). In practice, warnings would occur frequently, and it would be up to the user’s judgment to decide whether they were meaningful. To overcome this issue, one could imagine a statistical system that attempts to mimic the decisions of an expert statistician: it would look at a battery of test results and automatically determine which additional tests to run depending on which conditions were met. How would one train such a system? And, I think more importantly, how would one evaluate the uncertainty or error in the results of such a complicated system?
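
As a rough sketch of the “run additional tests and warn” option, one could wrap an ordinary two-sample t-test in assumption checks; the helper name and the 0.05 threshold here are arbitrary choices for illustration:

check_and_test <- function(x, y, alpha = 0.05) {
  # Warn (rather than stop) if the normality or equal-variance checks fail,
  # leaving it to the user's judgment whether the warnings matter.
  if (shapiro.test(x)$p.value < alpha || shapiro.test(y)$p.value < alpha)
    warning("normality check failed for at least one sample")
  if (var.test(x, y)$p.value < alpha)
    warning("equal-variance check failed")
  t.test(x, y, var.equal = TRUE)
}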

I recently came across the idea of robust statistics, which originated with Tukey and associated statisticians in the middle of the last century. This concept appears here in the introductory chapter to Mosteller and Tukey’s Data Analysis and Regression. The goal is to find statistical tests that work on a broad range of data distributions. From a system implementation standpoint, this approach is much preferable to user input or a complicated expert system. However, I have not seen these techniques used in any of my Stats classes. Have they been superseded by Bayesian or bootstrapping approaches?
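
As a small illustration of the robustness idea (with simulated data, purely for illustration): a single wild outlier shifts the mean substantially, while resistant summaries such as the median or a trimmed mean barely move.

set.seed(1)              # arbitrary seed, for reproducibility
x <- c(rnorm(99), 50)    # 99 well-behaved points plus one outlier
mean(x)                  # pulled noticeably toward the outlier
median(x)                # essentially unaffected
mean(x, trim = 0.1)      # 10% trimmed mean, also resistant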

The other principal topic in this first chapter is “vague concepts”. The authors give the example of the standard deviation, which is a very specific method for measuring the spread of a distribution. However, to evaluate our use of the standard deviation we must be able to place it in the context of all other possible measures of spread. This meta-level, or “vague concept”, is lost in many introductions to statistics.

Intro to Computer Graphics courses

Pete Shirley has invited people to follow along with the assignments in his Intro to Graphics course covering the Reyes architecture.

Pat Hanrahan’s new intro course is covering an eclectic mix of topics. You can also check out his interesting use of a course Wiki.

At UNC, another former BYU student, Brandon Lloyd, is teaching a more typical intro course covering rasterization and raytracing.

And back at BYU, Robert Burton’s intro course is asking students to reimplement the OpenGL pipeline.

Workshop Report

A couple of weeks ago I attended a small Visualization workshop organized by the DHS and the CIA in New Mexico. It was my first visit to the state. I met a number of interesting people in the Visualization community, including David Salesin of Adobe and Stephen Few.

The discussion focused on how documents and presentations could be produced more effectively and efficiently. Two main issues were raised:

  1. Best practices need to be codified.
    • There is a lot of research on how to communicate effectively, spread across a large number of fields including education, rhetoric, HCI, visualization, cognition, etc.
    • Having this knowledge consolidated and organized would be helpful for people.
    • With this knowledge, programs could be developed to automatically apply (or suggest) best practices, reducing the time needed to create an effective document.
  2. Documents need more background.
    • The analysis process behind a document is often as important as the document itself (which contains only the final conclusions). If we had a way to track the analysis, it would be possible for readers to drill down into questionable conclusions. It would also be possible to check the analysis for consistency with changing conditions. For example, if the document relied on a piece of intelligence that was later shown to be false, the entire document could be flagged (automatically?) as “Overtaken by Events”.

If only everyone did this…

Today I came across the website of Marcia K. Johnson, a professor of Psychology at Yale. She links to PDFs of every single one of her publications all the way back to 1972. That’s impressive.

I was looking for her paper “Contextual prerequisites for understanding: Some investigations of comprehension and recall”, and was very surprised to find it online. Typically, anything written before about 1990 can be very hard to find online (unless it’s in the ACM Digital Library).