Publications

Automatic Selection of Partitioning Variables for Small Multiple Displays

Anushka Anand, Justin Talbot. InfoVis 2015.

Abstract

Effective small multiple displays are created by partitioning a visualization on variables that reveal interesting conditional structure in the data. We propose a method that automatically ranks partitioning variables, allowing analysts to focus on the most promising small multiple displays. Our approach is based on a randomized, non-parametric permutation test, which allows us to handle a wide range of quality measures for visual patterns defined on many different visualization types, while discounting spurious patterns. We demonstrate the effectiveness of our approach on scatterplots of real-world, multidimensional datasets.
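
The ranking idea can be sketched with a small randomized permutation test: score the partition induced by a candidate variable, then compare that score against scores obtained after shuffling the variable's values, so that variables producing structure no stronger than chance are discounted. The quality measure below (spread of per-group means) is a placeholder rather than one of the paper's pattern measures, and the example assumes NumPy; it is only meant to illustrate the permutation-test mechanics.

    import numpy as np

    def partition_score(values, groups):
        """Placeholder quality measure: how much the per-group means differ.
        The paper supports many visual-pattern measures; this is illustrative only."""
        return np.var([values[groups == g].mean() for g in np.unique(groups)])

    def permutation_rank_score(values, groups, n_perm=1000, seed=None):
        """Empirical p-value: fraction of shuffled partitions scoring at least as
        well as the real one (smaller = more interesting partitioning variable)."""
        rng = np.random.default_rng(seed)
        observed = partition_score(values, groups)
        shuffled = groups.copy()
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(shuffled)                 # break any real association
            if partition_score(values, shuffled) >= observed:
                hits += 1
        return (hits + 1) / (n_perm + 1)

    # Rank two candidate partitioning variables on synthetic data.
    rng = np.random.default_rng(0)
    y = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 200)])
    informative = np.repeat([0, 1], 200)          # aligned with the shift in y
    noise = rng.integers(0, 2, 400)               # unrelated to y
    print(permutation_rank_score(y, informative, seed=1))   # near 1 / (n_perm + 1)
    print(permutation_rank_score(y, noise, seed=1))         # large, so ranked low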

PDF

Four Experiments on the Perception of Bar Charts

Justin Talbot, Vidya Setlur, Anushka Anand. InfoVis 2014.

Abstract

Bar charts are one of the most common visualization types. In a classic graphical perception paper, Cleveland & McGill studied how different bar chart designs impact the accuracy with which viewers can complete simple perceptual tasks. They found that people perform substantially worse on stacked bar charts than on aligned bar charts, and that comparisons between adjacent bars are more accurate than between widely separated bars. However, the study did not explore why these differences occur. In this paper, we describe a series of follow-up experiments to further explore and explain their results. While our results generally confirm Cleveland & McGill’s ranking of various bar chart configurations, we provide additional insight into the bar chart reading task and the sources of participants’ errors. We use our results to propose new hypotheses on the perception of bar charts.

PDF
(Note that the version hosted by IEEE has some formatting problems in the figures, which have been fixed in this version. We have asked IEEE to update their version, but they have not done so yet.)
data and code for replication

Just-in-time Length Specialization of Dynamic Vector Code

Justin Talbot, Zach DeVito, Pat Hanrahan. ARRAY 2014.

Abstract

Dynamically typed vector languages are popular in data analytics and statistical computing. In these languages, vectors have both dynamic type and dynamic length, making static generation of efficient machine code difficult. In this paper, we describe a trace-based just-in-time compilation strategy that performs partial length specialization of dynamically typed vector code. This selective specialization is designed to avoid excessive compilation overhead while still enabling the generation of efficient machine code through length-based optimizations such as vector fusion, vector copy elimination, and the use of hardware SIMD units. We have implemented our approach in a virtual machine for a subset of R, a vector-based statistical computing language. In a variety of workloads, containing both scalar and vector code, we show near autovectorized C performance over a large range of vector sizes.
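
One way to picture the "partial" in partial length specialization is a compile cache that specializes on the exact vector length only when the vector is short (where unrolling and SIMD register mapping pay off) and shares a single length-generic kernel for long vectors. The sketch below is an invented stand-in that only shows that caching structure, assuming NumPy; the real system records traces and emits machine code rather than Python closures.

    import numpy as np

    SHORT = 16                  # made-up cutoff between "short" and "long" vectors

    def length_key(n):
        """Specialize on the exact length for short vectors; one bucket for the rest."""
        return n if n <= SHORT else "long"

    _kernels = {}               # (op name, dtype, length key) -> compiled kernel

    def get_kernel(op, dtype, key):
        """Stand-in for trace compilation: a real JIT would emit machine code here,
        unrolling or SIMD-mapping the loop when the exact length is known."""
        cache_key = (op.__name__, dtype, key)
        if cache_key not in _kernels:
            _kernels[cache_key] = lambda a, b: op(a, b)
        return _kernels[cache_key]

    def run(op, a, b):
        return get_kernel(op, a.dtype.str, length_key(len(a)))(a, b)

    x = np.arange(8.0)
    print(run(np.add, x, x))            # compiles and caches a length-8 kernel
    print(run(np.add, x + 1, x))        # reuses that kernel
    big = np.arange(1_000_000.0)
    print(run(np.add, big, big)[:3])    # one shared kernel for all long vectors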

PDF

The design of an efficient vector virtual machine for data analytics

Justin Talbot, Ph.D. Thesis, Stanford University.

Abstract

The amount of parallelism available in commodity hardware has been steadily increasing. Recent hardware trends suggest that within the next few years, widely available computers will boast substantial parallelism: 8 or more cores, each supporting hardware-vectorized operations on 4 or more double-precision floating point values at a time. In parallel with this development, the rise of big data, and data science more generally, has created enormous interest in analytics that can extract value from data. However, the most popular languages for data analysis and statistical computing (R, Matlab, Python, and Excel) are designed for ease of use, not performance, and their current implementations may see little performance gain from the increasing parallelism of commodity hardware. This dissertation focuses on the design of virtual machines that aim to deliver the full performance potential of upcoming commodity-level parallel hardware while still providing a high-level, easy-to-use analytic experience.

First, this dissertation describes Riposte, a JIT compiler for the R language, designed to execute vector code on parallel hardware. To execute code with scalars and short vectors, Riposte uses a tracing-based approach adapted from recent JavaScript VM work to extract hot loops. It then uses a novel partial length specialization pass to eliminate loop overhead and promote vectors into hardware SIMD units. To execute long vector code, Riposte uses a delayed evaluation approach to dynamically discover and extract sequences of vector operations. Once extracted, it fuses operations to eliminate unnecessary memory traffic and schedules them to run across multiple cores. On a variety of analytic workloads, we demonstrate that Riposte can run R code at near the speed of optimized C, and for data-parallel workloads we demonstrate good scalability out to 32 cores. At a high level, Riposte demonstrates that a vector-based, dynamically typed language can be an effective and efficient parallel programming model for data analysis tasks.

Second, this dissertation describes Phoenix++, a C++ library for MapReduce-style computation on shared-memory machines that leverages compile-time specialization to generate high-quality parallel code while greatly reducing the amount of code that users must write. We validate the approach by implementing a variety of common analytic workloads in Phoenix++ and demonstrating substantial speedups and improved scalability compared to previous work.

PDF via the Stanford Library

An Empirical Model of Slope Ratio Comparisons

Justin Talbot, John Gerth, Pat Hanrahan. InfoVis 2012.

Abstract

Comparing slopes is a fundamental graph reading task and the aspect ratio chosen for a plot influences how easy these comparisons are to make. According to Banking to 45 degrees, a classic design guideline first proposed and studied by Cleveland et al., aspect ratios that center slopes around 45 degrees minimize errors in visual judgments of slope ratios. This paper revisits this earlier work. Through exploratory pilot studies that expand Cleveland et al.’s experimental design, we develop an empirical model of slope ratio estimation that fits more extreme slope ratio judgments and two common slope ratio estimation strategies. We then run two experiments to validate our model. In the first, we show that our model fits more generally than the one proposed by Cleveland et al. and we find that, in general, slope ratio errors are not minimized around 45 degrees. In the second experiment, we explore a novel hypothesis raised by our model: that visible baselines can substantially mitigate errors made in slope judgments. We conclude with an application of our model to aspect ratio selection.
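
For reference, the banking rule the paper revisits can be stated compactly. If α is the vertical-to-horizontal scaling applied when the data are drawn, each data slope s_i appears at orientation θ_i = arctan(α s_i), and banking to 45 degrees chooses α so that a typical displayed orientation is 45 degrees. Using the median absolute slope as that typical value (one common variant, not necessarily the exact criterion studied here) gives:

    \theta_i = \arctan(\alpha\, s_i),
    \qquad
    \operatorname{median}_i |\theta_i| = 45^\circ
    \;\Longleftrightarrow\;
    \alpha = \frac{1}{\operatorname{median}_i |s_i|}.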

PDF (corrected)
PDF (original)
data and code for replication

Riposte: A Trace-Driven Compiler and Parallel VM for Vector Code in R

Justin Talbot, Zachary DeVito, and Pat Hanrahan. PACT 2012.

Abstract

There is a growing utilization gap between modern hardware and modern programming languages for data analysis. Due to power and other constraints, recent processor design has sought improved performance through increased SIMD and multi-core parallelism. At the same time, high-level, dynamically typed languages for data analysis have become popular. These languages emphasize ease of use and high productivity, but have, in general, low performance and limited support for exploiting hardware parallelism.

In this paper, we describe Riposte, a new runtime for the R language, which bridges this gap. Riposte uses tracing, a technique commonly used to accelerate scalar code, to dynamically discover and extract sequences of vector operations from arbitrary R code. Once extracted, we can fuse traces to eliminate unnecessary memory traffic, compile them to use hardware SIMD units, and schedule them to run across multiple cores, allowing us to fully utilize the available parallelism on modern shared-memory machines. Our evaluation shows that Riposte can run vector R code near the speed of hand-optimized C, 5–50x faster than the open source implementation of R, and can also linearly scale to 32 cores for some tasks. Across 12 different workloads we achieve an overall average speedup of over 150x without explicit programmer parallelization.
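
The long-vector strategy can be pictured with a small deferred-evaluation sketch: record elementwise vector operations instead of executing them eagerly, then evaluate the whole recorded pipeline in one fused, chunked pass so that no full-length temporaries are materialized. The LazyVec class below is an invented illustration of that idea, assuming NumPy; it is not Riposte, which traces arbitrary R code and compiles the fused operations for SIMD units and multiple cores.

    import numpy as np

    class LazyVec:
        """Record elementwise vector ops instead of executing them immediately."""
        def __init__(self, data, ops=()):
            self.data = np.asarray(data, dtype=float)
            self.ops = list(ops)                 # deferred elementwise operations

        def __add__(self, c):
            return LazyVec(self.data, self.ops + [("add", float(c))])

        def __mul__(self, c):
            return LazyVec(self.data, self.ops + [("mul", float(c))])

        def force(self, chunk=65536):
            """Apply all recorded ops in one fused pass over cache-sized chunks,
            so no full-length temporary array is ever allocated."""
            out = np.empty_like(self.data)
            for start in range(0, len(self.data), chunk):
                block = self.data[start:start + chunk].copy()
                for op, c in self.ops:           # the whole pipeline runs on this chunk
                    block = block + c if op == "add" else block * c
                out[start:start + chunk] = block
            return out

    v = LazyVec(np.arange(1e7))
    result = (v * 2.0 + 1.0) * 0.5               # nothing is computed yet
    print(result.force()[:5])                    # one fused pass: [0.5 1.5 2.5 3.5 4.5]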

PDF

Arc Length-based Aspect Ratio Selection

Justin Talbot, John Gerth, Pat Hanrahan. InfoVis 2011.

Abstract

The aspect ratio of a plot has a dramatic impact on our ability to perceive trends and patterns in the data. Previous approaches for automatically selecting the aspect ratio have been based on adjusting the orientations or angles of the line segments in the plot. In contrast, we recommend a simple, effective method for selecting the aspect ratio: minimize the arc length of the data curve while keeping the area of the plot constant. The approach is parameterization invariant, robust to a wide range of inputs, preserves visual symmetries in the data, and is a compromise between previously proposed techniques. Further, we demonstrate that it can be effectively used to select the aspect ratio of contour plots. We believe arc length should become the default aspect ratio selection method.
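
Because the abstract states the criterion completely, a compact sketch is possible: normalize the data, express the plot's width and height in terms of the aspect ratio so that their product stays constant, and pick the ratio minimizing the curve's arc length. The unit-square normalization and brute-force grid search below are choices of this illustration, assuming NumPy, not necessarily details of the paper.

    import numpy as np

    def arc_length(x, y, alpha):
        """Arc length of the data curve in a constant-area plot with
        height/width ratio alpha (width = 1/sqrt(alpha), height = sqrt(alpha))."""
        dx = np.diff(x) / np.sqrt(alpha)
        dy = np.diff(y) * np.sqrt(alpha)
        return np.sum(np.hypot(dx, dy))

    def select_aspect_ratio(x, y, alphas=np.logspace(-2, 2, 401)):
        """Normalize to the unit square, then grid-search for the
        arc-length-minimizing aspect ratio."""
        x = (x - x.min()) / (x.max() - x.min())
        y = (y - y.min()) / (y.max() - y.min())
        lengths = [arc_length(x, y, a) for a in alphas]
        return alphas[int(np.argmin(lengths))]

    # Example: a sine curve; the selected ratio "banks" the oscillations.
    t = np.linspace(0, 4 * np.pi, 500)
    print(round(select_aspect_ratio(t, np.sin(t)), 3))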

PDF

Phoenix++: Modular MapReduce for Shared-Memory Systems

Justin Talbot, Richard M. Yoo, Christos Kozyrakis. Second International Workshop on MapReduce and its Applications (MAPREDUCE) 2011.

Abstract

This paper describes our rewrite of Phoenix, a MapReduce framework for shared-memory CMPs and SMPs. Despite successfully demonstrating the applicability of a MapReduce-style pipeline to shared-memory machines, Phoenix has a number of limitations: its uniform intermediate storage of key-value pairs, inefficient combiner implementation, and poor task overhead amortization fail to efficiently support a wide range of MapReduce applications, encouraging users to manually circumvent the framework. We describe an alternative implementation, Phoenix++, that provides a modular, extensible pipeline that can be easily adapted by the user to the characteristics of a particular workload. Compared to Phoenix, this new approach achieves a 4.7-fold performance improvement and increased scalability, while allowing users to write simple, strict MapReduce code.

PDF
Implementation and more info

An Extension of Wilkinson’s Algorithm for Positioning Tick Labels on Axes

Justin Talbot, Sharon Lin, Pat Hanrahan. InfoVis 2010.

Abstract

The non-data components of a visualization, such as axes and legends, can often be just as important as the data itself. They provide contextual information essential to interpreting the data. In this paper, we describe an automated system for choosing positions and labels for axis tick marks. Our system extends Wilkinson’s optimization-based labeling approach to create a more robust, full-featured axis labeler. We define an expanded space of axis labelings by automatically generating additional nice numbers as needed and by permitting the extreme labels to occur inside the data range. These changes provide flexibility in problematic cases, without degrading quality elsewhere. We also propose an additional optimization criterion, legibility, which allows us to simultaneously optimize over label formatting, font size, and orientation. To solve this revised optimization problem, we describe the optimization function and an efficient search algorithm. Finally, we compare our method to previous work using both quantitative and qualitative metrics. This paper is a good example of how ideas from automated graphic design can be applied to information visualization.
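
A toy cousin of this optimization is sketched below, assuming NumPy: enumerate candidate step sizes built from preference-ordered "nice" multipliers, lay out the ticks each step implies, and keep the labeling with the best combined score. The particular score terms and weights here are simplifications for illustration; the paper's formulation also optimizes legibility (format, font size, orientation), which is omitted.

    import numpy as np

    NICE_STEPS = (1, 5, 2, 2.5, 4, 3)     # preference-ordered "nice" step multipliers

    def score(q_rank, ticks, lo, hi, target=5):
        simplicity = 1 - q_rank / len(NICE_STEPS)
        density = 1 - abs(len(ticks) - target) / target
        coverage = 1 - ((ticks[0] - lo) ** 2 + (ticks[-1] - hi) ** 2) / (0.1 * (hi - lo)) ** 2
        return simplicity + density + coverage    # unweighted sum, for illustration

    def nice_ticks(lo, hi, target=5):
        """Search step sizes q * 10^e and return the best-scoring tick positions."""
        best, best_score = None, -np.inf
        base = int(np.floor(np.log10(hi - lo)))
        for e in range(base - 1, base + 2):
            for q_rank, q in enumerate(NICE_STEPS):
                step = q * 10.0 ** e
                first = np.floor(lo / step) * step        # first nice multiple at or below lo
                ticks = np.arange(first, hi + step, step)
                if not 2 <= len(ticks) <= 10:
                    continue
                s = score(q_rank, ticks, lo, hi, target)
                if s > best_score:
                    best, best_score = ticks, s
        return best

    print(nice_ticks(0.0, 73.0))      # ticks at 0, 25, 50, 75
    print(nice_ticks(1.23, 1.37))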

PDF
Implementation

Online Aggregation and Continuous Query support in MapReduce

Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, John Gerth, Justin Talbot, Khaled Elmeleegy, Russell Sears. SIGMOD 2010 Demo.

Abstract

MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this demonstration, we describe a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We demonstrate a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see “early returns” from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.

PDF

EnsembleMatrix: Interactive Visualization to Support Machine Learning with Multiple Classifiers

Justin Talbot, Bongshin Lee, Desney Tan, Ashish Kapoor. CHI 2009.

Abstract

Machine learning is an increasingly used computational tool within human-computer interaction research. While most researchers currently utilize an iterative approach to refining classifier models and performance, we propose that ensemble classification techniques may be a viable and even preferable alternative. In ensemble learning, algorithms combine multiple classifiers to build one that is superior to its components. In this paper, we present EnsembleMatrix, an interactive visualization system that presents a graphical view of confusion matrices to help users understand relative merits of various classifiers. EnsembleMatrix allows users to directly interact with the visualizations in order to explore and build combination models. We evaluate the efficacy of the system and the approach in a user study. Results show that users are able to quickly combine multiple classifiers operating on multiple feature sets to produce an ensemble classifier with accuracy that approaches best-reported performance classifying images in the Caltech-101 dataset.
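
The combination step at the center of the interaction can be sketched as a weighted sum of per-classifier class scores, where the weights are what the user adjusts through the visualization and the resulting confusion matrix is what gets displayed. The random scores below are placeholders and the code assumes NumPy; the real system operates on classifiers trained over Caltech-101 features.

    import numpy as np

    def ensemble_predict(class_scores, weights):
        """Linearly combine per-classifier score matrices (n_samples x n_classes)
        with user-chosen weights, then predict the highest-scoring class."""
        weights = np.asarray(weights, dtype=float)
        weights = weights / weights.sum()
        combined = sum(w * s for w, s in zip(weights, class_scores))
        return combined.argmax(axis=1)

    def confusion_matrix(y_true, y_pred, n_classes):
        """Counts of (true class, predicted class), the quantity the matrices visualize."""
        m = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(y_true, y_pred):
            m[t, p] += 1
        return m

    # Toy example with two "classifiers" producing random class scores.
    rng = np.random.default_rng(0)
    y = rng.integers(0, 3, size=100)
    scores_a, scores_b = rng.random((100, 3)), rng.random((100, 3))
    pred = ensemble_predict([scores_a, scores_b], weights=[0.7, 0.3])
    print(confusion_matrix(y, pred, n_classes=3))
    print("accuracy:", (pred == y).mean())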

PDF

Vispedia: On-demand Data Integration for Interactive Visualization and Exploration

Bryan Chan, Justin Talbot, Leslie Wu, Nathan Sakunkoo, Mike Cammarano, Pat Hanrahan. ACM SIGMOD Demo Paper, 2009.

PDF

Vispedia: Interactive Visual Exploration of Wikipedia Data via Search-Based Integration

Bryan Chan, Leslie Wu, Justin Talbot, Mike Cammarano, Pat Hanrahan. IEEE Information Visualization, 2008.

Abstract

Wikipedia is an example of the collaborative, semi-structured data sets emerging on the Web. These data sets have large, non-uniform schema that require costly data integration into structured tables before visualization can begin. We present Vispedia, a Web-based visualization system that reduces the cost of this data integration. Users can browse Wikipedia, select an interesting data table, then use a search interface to discover, integrate, and visualize additional columns of data drawn from multiple Wikipedia articles. This interaction is supported by a fast path search algorithm over DBpedia, a semantic graph extracted from Wikipedia’s hyperlink structure. Vispedia can also export the augmented data tables produced for use in traditional visualization systems. We believe that these techniques begin to address the “long tail” of visualization by allowing a wider audience to visualize a broader class of data. We evaluated this system in a first-use formative lab study. Study participants were able to quickly create effective visualizations for a diverse set of domains, performing data integration as needed.
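
The path-search step can be pictured as a breadth-first search over a labeled entity graph: starting from the entities in the selected table, follow edges until a label or endpoint matches the user's keyword, and return the short edge-label paths found. The toy graph and matching rule below are invented for illustration; the actual system searches DBpedia and ranks paths with its own cost model.

    from collections import deque

    # Toy semantic graph: entity -> [(edge label, neighbor)], loosely DBpedia-like.
    GRAPH = {
        "Tokyo": [("country", "Japan"), ("population", "13,960,000")],
        "Japan": [("capital", "Tokyo"), ("currency", "Yen")],
    }

    def find_paths(start, keyword, max_depth=3):
        """Breadth-first search for short edge-label paths whose label or endpoint
        matches the keyword (a stand-in for the paper's path search over DBpedia)."""
        queue, seen, matches = deque([(start, [])]), {start}, []
        while queue:
            node, path = queue.popleft()
            if len(path) >= max_depth:
                continue
            for label, neighbor in GRAPH.get(node, []):
                new_path = path + [label]
                if keyword.lower() in label.lower() or keyword.lower() in neighbor.lower():
                    matches.append((new_path, neighbor))
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, new_path))
        return matches

    print(find_paths("Tokyo", "currency"))   # [(['country', 'currency'], 'Yen')]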

PDF

Structuring collections with Scatter/Gather extensions

Omar Alonso, Justin Talbot. SIGIR 2008: 697-698.

Abstract

A major component of sense-making is organizing—grouping, labeling, and summarizing—the data at hand in order to form a useful mental model, a necessary precursor to identifying missing information and to reasoning about the data. Previous work has shown the Scatter/Gather model to be useful in exploratory activities that occur when users encounter unknown document collections. However, the topic structure communicated by Scatter/Gather is closely tied to the behavior of the underlying clustering algorithm; this structure may not reflect the mental model most applicable to the information need. In this paper we describe the initial design of a mixed-initiative information structuring tool that leverages aspects of the well-studied Scatter/Gather model but permits the user to impose their own desired structure when necessary.

ACM Library

Visualization of Heterogeneous Data

Mike Cammarano, Xin Luna Dong, Bryan Chan, Jeff Klingner, Justin Talbot, Alon Y. Halevy, Pat Hanrahan.  IEEE Trans. Vis. Comput. Graph. 13(6): 1200-1207 (2007).

Abstract

Both the Resource Description Framework (RDF), used in the semantic web, and Maya Viz u-forms represent data as a graph of objects connected by labeled edges. Existing systems for flexible visualization of this kind of data require manual specification of the possible visualization roles for each data attribute. When the schema is large and unfamiliar, this requirement inhibits exploratory visualization by requiring a costly up-front data integration step. To eliminate this step, we propose an automatic technique for mapping data attributes to visualization attributes. We formulate this as a schema matching problem, finding appropriate paths in the data model for each required visualization attribute in a visualization template.

PDF

Two Stage Importance Sampling for Direct Lighting

David Cline, Parris K. Egbert, Justin F. Talbot, and David L. Cardon. In Rendering Techniques 2006 (Eurographics Symposium on Rendering), pp. 103-113, June 2006.

Abstract

We describe an importance sampling method to generate samples based on the product of a BRDF and an environment map or large light source. The method works by creating a hierarchical partition of the light source based on the BRDF function for each primary (eye) ray in a ray tracer. This partition, along with a summed area table of the light source, forms an approximation to the product function that is suitable for importance sampling. The partition is used to guide a sample warping algorithm to transform a uniform distribution of points so that they approximate the product distribution. The technique is unbiased, requires little precomputation, and we demonstrate that it works well for a variety of BRDF types. Further, we present an adaptive method that allocates varying numbers of samples to different image pixels to reduce shadow artifacts.
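
One way to build intuition for product importance sampling is to discretize: tabulate the product of a BRDF slice and the environment-map radiance over a set of directions, normalize it, and draw samples from that table. The sketch below does this for a toy one-dimensional "environment", assuming NumPy; it is a plain tabulated approach, not the paper's hierarchical partition and summed-area-table warping.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 1D setting: directions parameterized by theta in [0, pi/2].
    theta = np.linspace(0.0, np.pi / 2, 256)
    env = 1.0 + 10.0 * np.exp(-((theta - 1.1) ** 2) / 0.01)   # a bright "sun" region
    brdf = np.cos(theta) ** 8                                  # a glossy lobe near theta = 0

    def sample_product(env, brdf, n):
        """Draw direction indices with probability proportional to env * brdf,
        returning the samples and their discrete pdf values for unbiased weighting."""
        product = env * brdf
        pdf = product / product.sum()
        idx = rng.choice(len(pdf), size=n, p=pdf)
        return idx, pdf[idx]

    idx, pdf = sample_product(env, brdf, n=5)
    print(theta[idx])   # samples land where the product env * brdf is large
    print(pdf)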

PDF

Importance Resampling for Global Illumination

Justin Talbot, Master’s Thesis, Brigham Young University, 2005. bibtex

Abstract

This thesis develops a generalized form of Monte Carlo integration called Resampled Importance Sampling. It is based on the importance resampling sample generation technique. Resampled Importance Sampling can lead to significant variance reduction over standard Monte Carlo integration for common rendering problems. We show how to select the importance resampling parameters for near optimal variance reduction. We also combine RIS with stratification and with Multiple Importance Sampling for further variance reduction. We demonstrate the robustness of this technique on the direct lighting problem and achieve up to a 33% variance reduction over standard techniques. We also suggest using RIS as a default BRDF sampling technique.

PDF (2,307K)
Powerpoint (7,919K)

Energy Redistribution Path Tracing

David Cline, Justin F. Talbot, and Parris K. Egbert. ACM Transactions on Graphics (SIGGRAPH 2005 Proceedings). bibtex

Abstract

We present Energy Redistribution (ER) sampling as an unbiased method to solve correlated integral problems. ER sampling is a hybrid algorithm that uses Metropolis sampling-like mutation strategies in a standard Monte Carlo integration setting, rather than resorting to an intermediate probability distribution step. In the context of global illumination, we present Energy Redistribution Path Tracing (ERPT). Beginning with an initial set of light samples taken from a path tracer, ERPT uses path mutations to redistribute the energy of the samples over the image plane to reduce variance. The result is a global illumination algorithm that is conceptually simpler than Metropolis Light Transport (MLT) while retaining its most powerful feature, path mutation. We compare images generated with the new technique to standard path tracing and MLT.

PDF (6,345K)

Importance Resampling for Global Illumination

Justin F. Talbot, David Cline, and Parris K. Egbert. Eurographics Symposium on Rendering 2005, pages 139-146. bibtex

Abstract

This paper develops importance resampling into a variance reduction technique for Monte Carlo integration. Importance resampling is a sample generation technique that can be used to generate more equally weighted samples for importance sampling. This can lead to significant variance reduction over standard importance sampling for common rendering problems. We show how to select the importance resampling parameters for near optimal variance reduction. We demonstrate the robustness of this technique on common global illumination problems and achieve a 10%-70% variance reduction over standard importance sampling for direct lighting. We conclude that further variance reduction could be achieved with cheaper sampling methods.
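
The estimator itself is compact enough to demonstrate numerically: draw M proposals from an easy source distribution p, weight each proposal by w_i = g(x_i) / p(x_i) for a cheap approximation g of the integrand f, resample one proposal with probability proportional to its weight, and scale f/g at the chosen point by the average weight. The toy integrand and approximation below are made up and the code assumes NumPy, but the weighting follows the resampled importance sampling construction described here.

    import numpy as np

    rng = np.random.default_rng(0)

    f = lambda x: np.exp(-x) * np.sin(np.pi * x) ** 2    # toy integrand on [0, 1]
    g = lambda x: np.sin(np.pi * x) ** 2 + 1e-3          # cheap approximation of f

    def ris_estimate(M=32):
        """One resampled-importance-sampling estimate of the integral of f on [0, 1],
        using the uniform distribution on [0, 1] as the source (so p(x) = 1)."""
        x = rng.random(M)                     # proposals from the source distribution
        w = g(x) / 1.0                        # resampling weights g(x) / p(x)
        j = rng.choice(M, p=w / w.sum())      # resample one proposal proportional to w
        return (f(x[j]) / g(x[j])) * w.mean() # unbiased estimate of the integral

    estimates = np.array([ris_estimate() for _ in range(20000)])
    naive = f(rng.random(20000))              # plain uniform Monte Carlo, for comparison
    print("RIS   mean / std:", estimates.mean(), estimates.std())
    print("naive mean / std:", naive.mean(), naive.std())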

PDF (3,706K)
Powerpoint (1307K)

VTVS – A Complete Implementation of a Virtual Terrain Visualization System

Justin F. Talbot and Parris K. Egbert. IASTED VIIP 2003, Benalmadena, Spain, September 2003. bibtex

Abstract

VTVS (Virtual Terrain Visualization System) is a complete terrain visualization system for interactive exploration of models of real world locations. It includes a fast LOD system for terrain rendering and support for large numbers of plants and buildings on top of the terrain. In this paper we discuss the implementation of the system and various algorithms we have developed. We provide examples of the success of VTVS and we discuss possible directions for future research.

PDF (968K)