SIGCSE 2007 Workshop, March 10, 2007

Overview

Opening a Graph

In order to start manipulating graphs using DukeGuess, we first need to get a graph opened! In this example we will open up a graph representing the characters and relationships between characters in the first season of the popular show "24".

Start DukeGuess by double clicking on "DukeGuess.jar" in the "DukeGuess" folder.

s

DukeGuess now displays a file loader box. First, click the button labeled "..." and then select the file called "twentyFour.gdf". Then click "Ok"

You should now have a graph that looks something like this:


Basic Functions

What you see now is the main DukeGuess application. You will be doing all of your graph manipulations from this screen. The area showing blue blocks and green lines is called the Visualization area. In the visualization area you can zoom in and out, move nodes (squares) and edges(lines), and move the "camera" around.


Go ahead and try these things out now with our "24" graph! If you need to re-center the graph, click on the "Center" option underneath the "Display" menu at the top.
In order to move nodes around, first click on the "Manipulate Nodes" button at the very bottom of the DukeGuess window. Now if you left click a node, you can drag it around to a new location. You can also change the size of nodes.
The Manipulate Nodes Button

The white area below the visualization area with the ">>>" prompt is called the Interpreter. The function of the Interpreter is to take Python code commands and use them to manipulate the way the graph looks, or to found out more information about the graph. We will go into more detail on how to use the interpreter later.

Viewing More Information

In order to view more information about the graph, go to the "Display" menu at the top and click on "Information Window". DukeGuess now displays an information window to the left. When the mouse hovers over a node or edge, the information window will display detailed information for it in this new window.

Different Layouts

When DukeGuess first loads a graph, it automatically loads it with the "random" layout. This means that nodes placed at random locations within the window. While the random layout can have interesting results, it is generally not the most helpful for examining the structure of a graph. Fortunately, DukeGuess has several built-in layouts that are interesting to view and simple to run. To access them, go to the "Layout" menu at the top of the window and select the layout you desire. In this case, lets select the "GEM" Layout. It is a good layout that organizes things nicely.

Select the GEM Layout

Your graph should look like this after running the GEM Layout

For those who are not a fan of the show, Jack Bauer is the main character who often saves the day. Place your mouse over nodes and see if you can find a node named "Jack". As you can see, he holds a relatively central place in the graph.

Saving your Graph

To save all the work you have done to your graph, go to the "File" menu and select "Export GDF". Although it is recommended that you save your work as a *.gdf file, you can name the file anything you would like.

Node Functions Panel

The Node Functions Panel is an addition to DukeGuess that allows you to easily access different properties of nodes. This section will describe how to open up the panel and explain each of its different uses.

Opening the Panel

To open the panel, go to the "File" menu at the top of the window, and select "Run Script...". You should open the "nodeScript.py" file in the DukeGuess folder.

This will add a panel to the bottom of your DukeGuess window that contains several buttons and a small text box. It looks like the image below

Using the Panel

Selecting Nodes

Using the Panel is fairly simple. Before you may use any of the buttons, though, you must select a node. You may do this in two different ways: either click on a node, or type its name into the text box and hit enter (Note that capitalization DOES matter). This will make the node "selected" and turn its color red. To select a different node, simply click on a different node or enter a different name into the text box. To clear a selected node without selecting another, click on the "Clear Current Node" button.

Jack is selected

Outdegree and Betweenness Buttons

The buttons labeled "Get outdegree" and "Get Betweenness" operate in similar ways. Once you have selected a node, if you click on the outdegree button the outdegree of the selected node is printed to the text box. If you click on betweenness, the betweenness is printed to the text box.

For more information on outdegree and betweennness, see their Wikipedia articles: Degree on Wikipedia and Betweenness on Wikipedia

The outdegree of Jack is 9 (his betweenness is 309.83)
Highlight Neighbors

This is a simple button that allows you to highlight all the selected nodes neighbors. A neighbor is a node that is directly connected to the selected node. This button will turn all the neighbors orange. If you would like to return them to their original color either click the "Clear Current Node" button or click on a different node.

Jack with his neighbors highlighted
Highlight Nodes X Edges Away

This button is a more powerful version of the "Highlight Neighbors" button. Instead of highlighting just neighbors, you can dictate the distance away a node must be from your selected node in order to be highlighted orange. For example, if you select "2" for Jack, but it will select all the neighbors for Jack's neighbors whom Jack does not have a connection with. These nodes are a distance of 2 away.

When you click on the "Highlight Nodes X Edges Away" button, a box comes up asking you to enter the distance you would like to highlight. Either enter a distance or type "all". If you type "all", something a little bit different will happen. Instead of highlighting nodes at just one set distance, the program will highlight nodes at every distance according to their distance. For example, neighbors will be highlighted one color, nodes 2 away will be highlighted another color, nodes 3 away another color, and so on. Nodes that are not reachable from the selected node (a distance of infinity) will be highlighted black. A color key also pops up to help you identify the colors.

Type in all to get a color gradient

Below is what the "24" graph looks like if you do an "all-Highlight" on Jack:


The Grouping Buttons

The last two buttons, "Save Group" and "Clear Group" have a specialized function. If you hold the shift key and left click on a node, you'll notice it changes color to a light blue. This means the node has been added to a group. To deselect a node from the group, shift-click it again.If you would like to save this group click on the "Save Group" button. If you would like to completely clear the list, click on the "Clear Group" button.

Any groups that you save will be available for use in the Interpreter for the rest of your DukeGuess session.

Put Jack, Kim, and Victor into a group called "group1"


Using the Interpreter

The Interpreter is the most powerful part of the Guess program, but it is also the most complicated to learn how to use. It allows you to access the different qualities of individual nodes, groups of nodes, edges, and the entire graph.

In this section, code that you should type into the Interpreter will be in italics and seperated from the rest of the text.

Go ahead and bring the Interpreter back into view by clicking on the "Interpreter" tab at the bottom of the Guess window.

Accessing Attributes

Attributes are something that we have alluded to so far, but have not throroughly described. Both edges and nodes have several different attributes associated with them. This is the information that is displayed in the Information Window when you hover over a node or edge. We can also access this information using the Interpreter. We do this by using the '.' (dot) operator. For example, to access Jack's color, we would type: Jack.color

The Interpreter then prints out Jack's color. Go ahead and try it. In this case, Jack's color is blue.



In order to access edges, we need to type a little more. In DukeGuess, edge's are defined by their end nodes. DukeGuess uses the "-" symbol to represent the edge between two nodes. For example, an edge between Jack and Kim would be represented as "Jack-Kim" or "Kim-Jack". We can then access this node's color by typing: (Jack-Kim).color

We have to put "Jack-Kim" within parenthesis because the '.' operator has a higher priority than '-'. This means that DukeGuess would try to find an edge between Jack and Kim's color (which is green) instead of first finding the edge between Jack and Kim, and then finding that edge's color.


Changing Attributes

It is also possible to change the value of attributes for nodes and edges. To do this we use both the '.' and the equals sign, '=' to "assign" the attribute a new value. For an example of this, lets first make Jack purple, and then make the edge between Jack and Kim orange. Here is the code:

> Jack.color = purple > (Jack-Kim).color = orange

Now it is easier to find Jack and the edge between Jack and Kim.



Below is a list of node attributes that you may access, ones that are immutable are marked with an asterisk(*)

Final Example

Here is a final example that will help you put together a couple of the things you have learned.
Let's say we want to investigate what impact the removal of Jack and all of his neighbors would have on the "24" graph. First, make sure you have the "twentyFour.gdf" file loaded and that you have opened the Node Functions Panel. Select Jack and highlight his neighbors. Now shift click Jack and all of his neighbors and click on "Save Group" to save these nodes to a group. Check the image below to make sure you got all of the nodes.

Now save this group under the name "friends". To make it appear like we have removed Jack and his friends, we need to type some commands into the interpreter.

friends.visible = 0

You'll notice something strange about this line of code -- it doesn't contain any node names! In DukeGuess, if you have a group of nodes you can access and change the same attribute for all nodes in the group by using the '.' (dot) operator on the group name the same way you would use it on a node name.
You should end up with a graph that looks something like the one below. Clearly Jack and his friends are a key portion of this social network! Also notice that DukeGuess makes invisible edges attached to invisible nodes

Steven Johnson's popular book, Everything Bad Is Good for You: How Today's Popular Culture Is Actually Making Us Smarter, looks at how today's popular entertainment is actually far more complex than the culture of the past and requires more cognitive work of the pop culture consumer. As an example, he compares the social network of the characters in 24 (twentyFour.gdf) as compared to the popular 80s show Dallas (dallas.gdf).

In what ways, does the social network imply that 24 is indeed more complex?

Basic Graph Operations

Consider the following graph (in simplegraph.gdf):

This graph has six nodes (A-F) and eight edges. It can be represented by the following dictionary:

graph = {'A': ['B', 'C'], 'B': ['C', 'D'], 'C': ['D'], 'D': ['C'], 'E': ['F'], 'F': ['C']} This is a dictionary whose keys are the nodes of the graph. For each key, the corresponding value is a list containing the nodes that are connected by a direct arc from this node.

In Gython, the extension of Python used in GUESS, there is a -> operator that accesses a directed edge between two nodes. Therefore, you can create the graph above using with the following code.

> addNode(A) > addNode(B) > addNode(C) > addNode(D) > addNode(E) > addNode(F) > addDirectedEdge(A,B) # A -> B > addDirectedEdge(A,C) # A -> C > addDirectedEdge(B,C) # B -> C > addDirectedEdge(B,D) # B -> D > addDirectedEdge(C,D) # C -> D > addDirectedEdge(D,C) # D -> C > addDirectedEdge(E,F) # E -> F > addDirectedEdge(F,C) # F -> C

GUESS automatically defines a variable g that represents the entire graph. g.nodes and g.edges are the list of of all nodes and edges in the graph, respectively.

There are actually five different operators that can be used to represent an edge between node1 and node2.

	node1->node2 # directed edge (node 1 to node 2)
	node1<-node2 # directed edge (node2 to node1)
	node1-node2  # undirected edge
	node1<->node2 # undirected edge (or bidirectional)
	node1?node2 # any type of edge between node1 and node2

Nodes are objects. To access node and edge attributes, use the dot operator (".").

node1.color (node1?node2).color (node1->node2).weight node1.name (node2-node1).visible To change Node and Edge attributes, use the dot operator followed by the assignment operator.
node1.color = blue               # Make node1 blue
(node1?node2).color = cyan       # Make all types of edges between node1 and node2 cyan
(node1->node2).weight = 5        # Make the edge from node1 to node2 have a weight of 5
node1.visible = 0                # Make node1 invisible
For numerical attributes, you may access the max or min for that attribute over all nodes or edges by typing the attribute name plus the dot operator and min or max.

For example, weight.min returns the minimum weight for all edges, and salary.max returns the max salary for all nodes. See our GUESS Quick Reference Sheet for more information on node and edge attributes.

Path Finding

Let's write a simple function to determine a path between two nodes. It takes the start and end nodes as arguments. It will return a list of nodes (including the start and end nodes) comprising the path. When no path can be found, it returns None. The same node will not occur more than once on the path returned (i.e. it won't contain cycles). The algorithm uses an important technique called backtracking: it tries each possibility in turn until it finds a solution.

    # returns list of nodes in path from start to end
    #  or None if no path exists
    def find_path(start, end, path=[]):
        path = path + [start]
        if start == end:
            return path
        if start not in g.nodes:
            return None
        adj = start.getNeighbors()
        for node in adj:
            if node not in path:
                newpath = find_path(node, end, path)
                if newpath: return newpath
        return None

One new feature here is the use of default arguments with path=[]. The function can now be called with only two arguments as in: findpath(node1, node2). A sample run (using the graph above):

    > find_path(A, D)
    [A, B, C, D]
    > 
The second 'if' statement is necessary only in case there are nodes that are listed as end points for arcs but that don't have outgoing arcs themselves, and aren't listed in the graph at all. Such nodes could also be contained in the graph, with an empty list of outgoing arcs, but sometimes it is more convenient not to require this.

Note that while the user calls find_graph() with two arguments, it calls itself with a third argument: the path that has already been traversed. The default value for this argument is the empty list, '[]', meaning no nodes have been traversed yet. This argument is used to avoid cycles (the first 'if' inside the 'for' loop). The 'path' argument is not modified: the assignment path = path + [start] creates a new list. If we had written path.append(start) instead, we would have modified the variable path in the caller, with disastrous results.

Exercises

  1. It is simple to change this function to return a list of all paths (without cycles) instead of the first path it finds. Write find_all_paths
    
    A sample run:
    
    
        > find_all_paths(A, D)
        [[A, B, C, D], [A, B, D], [A, C, D]]
        > 
    
  2. Another variant would finds the shortest path, write find_shortest_path based off of the function above to return the shortest path from start to end

    A sample run is below:

        > find_shortest_path(graph, A, D)
        [A, C, D]
        > 
    
  3. The diameter of a graph is the longest shortest path between any two vertices. Write a function findDiameter that returns the longest path between any two vertices.

  4. Consider the following depth-first search function dfs that takes a node as an argument and then visits every node reachable from
    # visit each connected node in the graph (g) start from u in a depth-first manner 
    def dfs(u):
      stack = [] 
      addNodeField("visited",Types.BOOLEAN,FALSE) # add visited field if necessary
      stack.append(u) # append to back works as push
      while len(stack) > 0: 
        v = stack.pop() 
        v.visited = true
        adj = v.getNeighbors()
        for w in adj:
          if not w.visited:
            stack.append(w)
    

    Now, write a function bfs that traverses the graph from a node in a breadth-first manner. Hint: use a queue. You should also be able to keep track of the unweighted distance from the source as you proceed. We have started it for you below.

    # visit each connected node in the graph (g) start from u in a breadth-first manner 
    def bfs(u):
      queue = []
      maxDist = len(g.nodes)   # no path can be longer than the number of vertices
      addNodeField("dist",Types.INTEGER,Integer(maxDist)) # add field to store distance
    
    
Solutions

GUESS Centrality

From Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj,Exploratory Social Network Analysis with Pajek, Cambridge University Press, 2005.

In the 1970s, E.M. Rogers and D.L. Kincaid studies the diffusion of family planning methods in 24 villages in the Republic of Korea. In the DukeGUESS folder, there are two files, korea1.gdf and korea2.gdf, that represent the communication networks among women in the two villages. In both networks, each vertex represents a woman in the village and an edge between two vertices indicates that the two women discussed family planning. The vertices also have other attributes:

Thus, you can make all women who did not adopt the family-planning methods not visible with the following line:
(adopter == false).visible = false
Note that in GUESS, you can use 0 and 1 for false and true.

The successful village which had a successful family planning program is on the left below (korea1.gdf) and, the unsuccessful group is on the right (korea2.gdf). When looking at the graphs of these two networks below, it is easy to see that the network structure does differ.

korea1korea2
The centrality is the measure of how central a particular a vertex is in the social network. One would expect entities with high centrality to be hubs, having better access to information, and more opportunities to spread information. There are three relevant measures of centrality

Centralization is the measure of centrality for an entire network - as opposed to just one vertex. In a highly centralized network, information spreads easily, but the center is indispensable for the transmission of information.

  • Analyze the networks and hypothesize how the structure of the network, specifically the centrality of nodes, may have influenced the success of the family-planning program in one village and its relative failure in the other village. Justify your answer.

    A few tips:

    
    
    

    Answer

    In this assignment we are to compare to social networks from a family planning program in South Korea. In one network, the program succeeded, and in the other it failed. When looking at the graphs of these two networks below, it is easy to see that the network structure does differ. The successful group is on the left, the unsuccessful group on the right.

    When we compare these networks in terms of centrality and centralization, we get the following numbers:

    Successful Network:

    • Mean Degree: 3.54
    • Degree Centralization: 2.6245
    • Closeness Centralization: .34808
    • Betweenness Centralization: .50888

    Unsuccessful Network:

    • Mean Degree: 4.3
    • Degree Centralization: 2.4
    • Closeness Centralization: .28
    • Betweenness Centralization: .18
    *Note: In order to calculate Closeness and Betweenness Centralization for these graphs, I had to remove the nodes disconnected from the main group in each netwrok. Those centralization calculations require that a graph be connected.

    Conclusion

    When looking at the two graphs above, the successful graph on the left looks organized while the unsuccessful graph on the right looks disorganized, especially towards the center.

    As the numbers show, even though the unsuccessful graph had more connections per person, the successful graph is more centralized, especially when the betweenness centralization is used. This centralized structure allows for better communcation, resulting in the successful implementation of the family program by the network on the left.


    Prestige

    Diffusion of innovations resembles a contagion process because social contacts are needed to persuade people to adopt innovations. It is hypothesized, therefore, that prestige is associated with adoption time: less prestigious actors adopt later because they wait for more prestigious opinion leaders to adopt first. The file Galesburg_discussion.gdf contains a network of discussion ties among thirty-one physicians in Galesburg (Illinois) in the 1950s. The researchers asked each physician to name three doctors with whomthey would choose to discuss medicalmatters. For seventeen physicians, the date that they first prescribed a new drug (gammanym) was recorded. The attribute adoptionTime measures the adoption time as the number of months since the introduction of the drug. Note that the adoption time is unknown (code 9999998) for fourteen physicians. For most of them, the new drug was not relevant.

    Investigate whether adoption time is associated with the structural prestige rather than the centrality of doctors in the discussion network. Compute indices of prestige (indegree, restricted input domain with a maximum distance of 2, and proximity prestige) as well as the corresponding centrality measures in the undirected network. Use rank correlation and note that adoption time is higher when a doctor adopts later.

    Another hypothesis states that friendship relations are more important than discussion relations for the adoption of a new drug because it is easier to persuade friends than people you only know professionally. Physicians with many direct or indirect friends would adopt sooner than physicians with less central positions in the friendship network. The file Galesburg_friends.gdf contains the friendship network between the doctors. Is the adoption time of the new drug related to prestige or centrality in the friendship network rather than in the discussion network?

    Galesburg_discussionGalesburg_friends


    Answer

    Below are the numbers the assignment asks for:

    Discussion Graph

    • Degree prestige: Spearman: -.43 Pearson: -.37
    • Input Domain Prestige: Spearman: -.46 Pearson: -.41
    • Proximity Prestige: Spearman: -.46 Pearson: -.43
    • Degree Centrality: Spearman: -.32 Pearson: -.27
    • Betweenness Centrality: Spearman: -.19 Pearson: -.28
    • Closness Centrality: Spearman: -.18 Pearson: -.14

    Friends Graph

    • Degree prestige: Spearman -.61 Pearson -.63
    • Input Domain Prestige: Spearman: -.64 Pearson: -.64
    • Proximity Prestige: Spearman: -.63 Pearson: -.65
    • Degree Centrality: Spearman: -.64 Pearson: -.65
    • Betweenness Centrality: Spearman: -.50 Pearson: -.45
    • Closness Centrality: Spearman: -.52 Pearson: -.53
    This shows that friendship ties are more important for diffusion than the professional discussion ties. Both the correlations between prestige and diffusion time and centrality and diffusion time are moderate or weak in the Professional Discussion graph, while all the correlations in the Friends Graph are either moderate or strong. In addittion, we find that prestige is a better measure of social standing than centrality, as the centrality measurements always measured less than their prestige counterparts.

    Duke Scrobbler

    Demo system, show data and taste