Difference between revisions of "Basic Queries"

Revision as of 14:37, 3 February 2015

This tutorial assumes you have the HelloWorld Java project in your workspace and have just completed the Learning Atlas Setup tutorials.

If you haven't already enter the following queries in the Atlas Shell. Don't worry what they mean right now, by the end of this tutorial you will!

var declaresEdges = universe.edgesTaggedWithAny(Edge.DECLARES).retainEdges()
var appMethods = declaresEdges.forward(universe.project("HelloWorld"))
var methods = app.nodesTaggedWithAny(Node.METHOD)
var initializers = app.methods("<init>") union app.methods("<clinit>")
var constructors = app.nodesTaggedWithAny(Node.IS_CONSTRUCTOR)
var callEdges = universe.edgesTaggedWithAny(Edge.CALL).retainEdges()
var graph = (appMethods difference (initializers union constructors)).induce(callEdges)
show(graph)

Your Eclipse should now look something like the following screenshot.

Note: You can drag and drop Eclipse views around and rearrange them by pinning views to different areas in Eclipse.

Let's take a look at the graph that Atlas computed for us in response to our queries. We see a graph with 7 methods (A, B, ... G). We also see that all of these methods are public (green dot icon) static (S icon) methods. We see that all methods are declared inside the class MyClass. MyClass is declared by the package com.example which is declared in the project HelloWorld. Most importantly, we see that many of the methods have a directed call relationship to other methods. Atlas has recorded and abstracted the relationships in the MyClass.java source code file. You can see in the source that method A calls method B in its body. In the graph this is represented by a call edge from the A method node to the B method node.

As a developer if I wanted to know what methods method A calls it would be pretty easy to open up the MyClass.java source code file, find the method and read its source. However if I wanted to know what methods called method A I would have to search all source files in my project since method A could be called from anywhere. With Atlas you can use the graph abstraction to easily answer this question. First we start at method A on the graph and follow all call edges backwards to find the methods that call method A. In this case no methods call method A. If we look at the graph node for method B we quickly learn that method B is called by methods A and C. This graph with call edges is typically called a "call graph". A graph that starts at a set of method nodes and expands backwards along call edges is called a "reverse call graph". Call graphs show a rough granularity of program control flow at the method level. Atlas tracks many relationships between program artifacts, which we will explore more at a later time.

Query Language

Atlas indexes Eclipse projects in the workspace, and produces an index, which is essentially a graph data structure containing all the necessary program artifacts and relationships needed to perform program analysis. The entire graph is usually referred to as the "universe". The universe graph may be accessed directly via Common.universe(). Note that on the Atlas Shell we can just type universe instead of Common.universe(), because we have automatically imported the class Common and the ., (, and ) characters are implicit.

The Atlas query language is an internal DSL embedded within Java. The primary interface used to build queries is Q. Queries written using Q are evaluated, yielding a Graph, which is a subset of the "universe". One may write queries entirely in terms of Graph, but most routine queries are easier to write using Q. This tutorial focuses on the common uses of Q (query) objects.

Queries are methods of a Q object that tend to follow the form graph.operation(orgin). Here "graph" is the universe or a subgraph of the universe that the graph traversal will operate in. The "operation" is the Q method to perform in the context of the subgraph. The "origin" is Q that defines where the operation (typically a graph traversal) will begin. Note that Q’s can be chained together to form more complex queries.

In the sections below we are going to explore some of the most common Q operations in detail. To follow along you should declare a few variables to represent different methods in our example graph. Declare these variables by running the following queries in the Atlas Shell.

var A = app.methods("A")
var B = app.methods("B")
var C = app.methods("C")
var D = app.methods("D")
var E = app.methods("E")
var F = app.methods("F")
var G = app.methods("G")

In the examples below you will need to pass the query to the show method to display the graph. Example: show(graph.forward(D)).

graph.forward(origin)

Selects the subgraph reachable from the given nodes using a forward transitive traversal. This query includes the origin in the resulting graph.

graph.forward(D) will output the graph D->E and D->G.

graph.forward(C) will output the graph C->B->C, C->D->E and C->D->G.

graph.forwardStep(origin)

Selects the subgraph reachable from the given nodes along a path length of 1 in the forward direction. This query includes the origin in the resulting graph.

graph.forwardStep(D) will output the graph D->E and D->G.

graph.forwardStep(C) will output the graph C->B and C->D.

graph.forwardStep(F) will output a graph with only F.

graph.successors(origin)

Selects the successors reachable from the given nodes along a path length of 1. By definition this query does not include the origin unless a node succeeds itself (has a self pointer). The final result never includes edges.

graph.successors(C) will output a graph with only D and B.

graph.successors(F) will output an empty graph.

graph.reverse(origin)

Selects the subgraph reachable from the given nodes using reverse transitive traversal. This query includes the origin in the resulting graph.

graph.reverse(D) will output the graph D<-C<-B<-A and D<-C<-B<-C.

graph.reverse(C) will output the graph C<-B<-A and C<-B<-C.

graph.reverseStep(origin)

Selects the subgraph reachable from the given nodes along a path length of 1 in the reverse direction. This query includes the origin in the resulting graph.

graph.reverseStep(D) will output the graph D<-C.

graph.reverseStep(C) will output the graph C<-B.

graph.predecessors(origin)

Selects the predecessors reachable from the given nodes along a path length of 1. By definition this query does not include the origin unless a node precedes itself (has a self-pointer). The final result never includes edges.

graph.predecessors(C) will output a graph with only B.

graph.predecessors(F) will output an empty graph.

graph.between(from, to)

Selects the subgraph such that the given nodes in to are reachable from the nodes in from using forward traversal. Note this query selects all paths in the graph between the “from” and “to” nodes.

graph.between(C, A) will output an empty graph.

graph.between(C, E) will output the graph C->D->E, C->B->C.

graph.betweenStep(Q from, Q to)

Selects the subgraph such that the given nodes in to are reachable from the nodes in from using forward step traversal.

graph.betweenStep(C, D) will output the graph from C->D.

graph.betweenStep(D, C) will output an empty graph.

graph.betweenStep(C, E) will output an empty graph.

Note: We can intuitively say that if there is a direct edge from the 'from' node to the 'to' node then betweenStep() will output the nodes with the edge between them. A possible implementation of betweenStep could be:

graph.forwardStep(from).intersection(graph.reverseStep(to))

graph.union(otherGraphs...)

Yields the union of this graph and the given graphs. That is, the resulting graph's nodes are the union of all nodes, and likewise for edges.

B.union(C) outputs a graph with nodes B and C.

A.union(B, C) outputs a graph with nodes A, B, and C.

graph.union(C) outputs the entire graph (shown previously).

graph.difference(otherGraphs...)

Select the current graph, excluding the given graphs. Note that, because an edge is only in a graph if its nodes are in a graph, removing an edge will necessarily remove the nodes it connects as well. Removing either node would remove the edge as well. This behavior may seem counter-intuitive if one is thinking in terms of removing a single edge from a graph. Consider the graphs: g1: a->b->c and g2: a->b. Then g1.difference(g2) yields the graph containing only node c: because b is removed, so b->c is also removed. In general, this operation is useful for removing nodes from a graph, but may not be as useful for operating on edges.

B.difference(C) outputs a graph with only node B.

B.difference(A, B) outputs an empty graph.

graph.difference(C) outputs the figure above without the node C and any edges entering or leaving node C.

graph.differenceEdges(otherGraphs...)

Selects the current graph, excluding the edges from the given graphs.

graph.differenceEdges(universe.edgesTaggedWithAny(Edge.CALL)) outputs a graph with only the nodes A,B,C,D,E,F,G.

graph.intersection(otherGraphs...)

Yields the intersection of this graph and the given graphs. That is, the resulting graph's nodes are the intersection of all node sets, and likewise for edges.

A.intersection(B) outputs an empty graph.

graph.intersection(C) outputs a graph with only the node C.

graph.leaves()

Selects the nodes from the given graph with no successors.

graph.leaves() outputs a graph with the nodes E, F, and G.

graph.roots()

Selects the nodes from the given graph with no predecessors.

graph.roots() outputs a graph with the nodes A, F.

subgraph.induce(graph)

Adds edges from the given graph to the subgraph.

var subgraph = B.union(C)

subgraph.induce(graph) will induce the edges between B and C and gives the output graph B->C->B.

var subgraph2 = B.union(F)

subgraph2.induce(graph) will result in a graph with nodes B and F (with no new edges added since no edges between B and F existed in the graph).

graph.nodesTaggedWithAny(...)

Selects the subgraph where Nodes are tagged with at least one of the given tags. The final result contains only nodes.

graph.nodesTaggedWithAny(Node.METHOD) will output a graph with only the METHOD nodes A,B,C,D,E,F,G.

graph.nodesTaggedWithAny(Node.CLASS) will output an empty graph because graph does not contain CLASS nodes.

graph.nodesTaggedWithAny(Node.METHOD, Node.CLASS) will output a graph with only the METHOD nodes A,B,C,D,E,F,G.

graph.nodesTaggedWithAll(...)

Selects the subgraph where Nodes are tagged with all of the given tags. The final result contains only nodes.

graph.nodesTaggedWithAll(Node.METHOD) will output a graph with only the METHOD nodes A,B,C,D,E,F,G.

graph.nodesTaggedWithAll(Node.CLASS) will output an empty graph because graph does not contain CLASS nodes.

graph.nodesTaggedWithAll(Node.METHOD, Node.CLASS) will output an empty graph because graph does not contain nodes that are both CLASS and METHOD nodes.

graph.edgesTaggedWithAny(...)

graph.edgesTaggedWithAll(...)

graph.retainNodes()

graph.retainEdges()

Chaining Queries

Remember that we can chain queries together to form more complex queries and that Q objects may contain multiple nodes and edges (so an origin can include multiple starting points). Let’s take a look at the queries we started the example with again.

1) First create a subgraph (called declaresEdges) of the universe that only contains nodes and edges connected by a declares relationship.

var declaresEdges = universe.edgesTaggedWithAny(Edge.DECLARES).retainEdges()

Note: We called retainEdges() here because we want to throw away nodes that are not connected by a declares edge, otherwise those nodes may still be in the resulting declaresEdges graph.

2) We then define yet another subgraph (called app) which contains nodes and edges declared under the HelloWorld project.

var app = declaresEdges.forward(universe.project("HelloWorld"))

3) From the app subgraph we select all nodes that are methods.

var appMethods = app.nodesTaggedWithAny(Node.METHOD)

4) From the app subgraph we select all nodes that are methods that are named "<init>" (instance initializer methods) or "<clinit>" (static initializer methods). We are using a query method called methods(String methodName) that selects methods that have a name that matches the given string. We will explore more query methods in the next section.

var initializers = app.methods("<init>") union app.methods("<clinit>")

Note that since this is Scala the ".", "(", ")", and ";" characters are implied. So the above could also be written as the following:

var initializers = app.methods("<init>").union(app.methods("<clinit>"));

5) From the app subgraph we select all nodes that are constructors.

var constructors = app.nodesTaggedWithAny(Node.IS_CONSTRUCTOR)

6) From the universe create a subgraph (called callEdges) that only contains nodes and edges connected by a call relationship.

var callEdges = universe.edgesTaggedWithAny(Edge.CALL).retainEdges()

7) Define graph to be the methods in the app ignoring initializers and constructors with call edges added in where they exist.

var graph = (appMethods difference (initializers union constructors)).induce(callEdges)

8) Display the graph.

show(graph)

Note that until you called show(graph) you were only defining and chaining together Q's. The show(graph) is the step that actually evaluates your queries, up until that point you are simply defining a recipe for a graph. You can think of Q's as a way to specify constraints to a constraint satisfaction problem (CSP) that is solved when the graph is finally evaluated (in this example when show(graph) is invoked). Just like most CSPs there are many ways you could specify the constraints that get you the same answer, some more efficient than others. As you are writing queries consider the cost and whether you could write the query in a more efficient manner. As a rule of thumb, if you are able to reduce the size of your node/edge sets early (i.e. perform your main query on a smaller subgraph of the universe) then your queries are likely to be much more efficient. In later sections we will be pointing out optimizations that you could make to your queries as we go.

Revision as of 14:28, 3 February 2015 (view source) BenHolland (Talk \| contribs) ← Older edit		Revision as of 14:37, 3 February 2015 (view source) BenHolland (Talk \| contribs) (→‎Chaining Queries) Newer edit →
Line 223:		Line 223:

	<code>show(graph)</code>		<code>show(graph)</code>
		+
		+	Note that until you called <code>show(graph)</code> you were only defining and chaining together <code>Q</code>'s. The <code>show(graph)</code> is the step that actually evaluates your queries, up until that point you are simply defining a recipe for a graph. You can think of Q's as a way to specify constraints to a constraint satisfaction problem (CSP) that is solved when the graph is finally evaluated (in this example when <code>show(graph)</code> is invoked). Just like most CSPs there are many ways you could specify the constraints that get you the same answer, some more efficient than others. As you are writing queries consider the cost and whether you could write the query in a more efficient manner. As a rule of thumb, if you are able to reduce the size of your node/edge sets early (i.e. perform your main query on a smaller subgraph of the universe) then your queries are likely to be much more efficient. In later sections we will be pointing out optimizations that you could make to your queries as we go.