Documentation for Proximity Graphs

an accompaniment to the paper:
"Measuring and Extracting Proximity in Networks", by Yehuda Koren, Stephen C. North, Chris Volinsky
Submitted to KDD 2006

Proximity Graph Options

α (alpha): α is a parameter which controls the tradeoff between the desire for a small graph for visualization and the desire to capture as much of the proximity in the graph as possible. Small values (e.g. α = 1) will result in graphs of just a few nodes, where are large values (e.g. α = 100) will aim to capture maximum proximity. We have found α = 10 generally has the right balance (see Figure 7 in the paper)

MinSize: Provides a minimum number of nodes to present in the graph. Useful when a lot of the proximity is captured in a very small graph - such as when the names submitted are directly connected - and the user would like to see a broader community around the given nodes.

MaxSize: Provides a maximum number of nodes to present in the graph. Useful when the algorithm returns a graph that is large or complicated. Setting a maximum forces the algorithm to provide a smaller graph more amenable to visualization. (Note: due to computing restraints, we have set a global max size of 50 nodes).

Neighborhood Size: How much should we expand the {s,t} neighborhoods? Setting this to 1000 typically is enough to capture multiple paths between the actors; if they are far apart in the graph, or if the user is submitting many names, this parameter may need to be increased. (see Section 6 in our paper)

degree: Upper bound on node degrees in candidate graph. Limiting node degree can focus the proximity graph on the most relevant edges.

Layout Options

Digraph: Lays out the results in a k-layered directed graph, using the Graphviz function "dot". Actors are placed as closer to one endpoint or another based on their proximity to those endpoints. This sometimes allows for interesting features to emerge, such as the temporal progression of movie stars shown in Figure 12. (Note: Digraph layout only works currently with two actors).

Layout: these options control how the graph is laid out, and refer to values for the -overlap option in the Graphviz "neato" function. The three values "scalexy", "orthoxy", and "compress" refer to different constraint algorithms used to minimize node overlaps. We have found that "compress" generally results in the nicest layouts, but with small graphs and/or high-resolution screens, the other two may create more interesting layouts. Please see the documentation for more detail

Proximity Graph Gallery

Figures from the paper (and others):

Doris Day -> Halle Berry [ Database=IMDB, α =10, maxsize=9, neighborhood size = 500, degree=10 ]

Computer Scientists (Rivest, Adelman, Shamir, Knuth) [ Database=DBLP, α =100, minsize=12, neighborhood size = 500, degree=10 ]. This result is slightly different than the one in the paper due to a small bug that was detected in our processing of the database.

Meryl Streep, Groucho Marx, Judy Garland [Database=IMDB, α =10, minsize=12, neighborhood size = 500, degree=10 ]

John Lennon -- Ringo Starr [Database=IMDB, α =10, minsize=10, neighborhood size = 500, degree=10 ]. Despite the fact that IMDB is a movie database, it contains lots of musicians as well (who often show up in movies). All of the members of this graph have some connection to the Beatles.

KDD 2006 General Chair and Program Co-Chairs (Ungar, Craven, Gunopulous) [Database=DBLP, α =50, maxsize=20, neighborhood size = 1000, degree=20 ]. Collaboration graph of this year's KDD Program Chair and two Program co-chairs.