Extensible Common Software Graph

From AtlasWiki
Revision as of 04:40, 27 February 2014 by Pi

Introduction

The eXtensible Common Software Graph (XCSG) schema defines a semantically rich graph representation of software (i.e. source code or binaries). It is suitable for many applications, such as mining software for patterns, detecting malware and defects, building static analysis tools, and code comprehension. XCSG is based on the eXtensible Common Intermediate Language (XCIL) developed for the DARPA Software Enabled Control program.

Like XCIL, XCSG has semantically precise definitions for program artifacts to enable a harmonious representation of software written in different languages. Without precise semantics, analysis tools can easily develop a language bias that leads to incorrect processing of other languages, especially when analyzing software written in several languages. For example, the static keyword in C and the static keyword in Java have overlapping but incompatible uses. However, unlike XCIL, XCSG is tailored for use with a graph database and encompasses definitions for analysis results such as control flow and data flow graphs.

Graph Database Requirements

XCSG assumes that a graph database allows nodes and edges to have tags and attributes.

XCSG also requires that more than one edge can exist between two nodes. Typically, though this is not strictly enforced, each edge between the same two nodes has distinct tags or attributes. One can therefore think of this requirement as several colored graphs superimposed on one another.
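The two requirements above can be illustrated with a minimal in-memory sketch. This is not the Atlas graph database or its API; the class names, the "Variable" and "ControlFlow" tags, and the "line" attribute are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    tags: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: Node
    target: Node
    tags: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


class Graph:
    """A multigraph: edges live in a list, so parallel edges are allowed."""

    def __init__(self):
        self.nodes = []
        self.edges = []

    def add_node(self, name, tags=(), **attributes):
        node = Node(name, set(tags), dict(attributes))
        self.nodes.append(node)
        return node

    def add_edge(self, source, target, tags=(), **attributes):
        edge = Edge(source, target, set(tags), dict(attributes))
        self.edges.append(edge)
        return edge

    def edges_between(self, source, target):
        """All edges from source to target, compared by identity."""
        return [e for e in self.edges
                if e.source is source and e.target is target]

    def layer(self, tag):
        """Edges carrying the given tag: one 'color' of the superimposed graphs."""
        return [e for e in self.edges if tag in e.tags]


g = Graph()
a = g.add_node("a", tags={"Variable"})      # hypothetical node tag
b = g.add_node("b", tags={"Variable"})
g.add_edge(a, b, tags={"DataFlow"})
g.add_edge(a, b, tags={"ControlFlow"}, line=12)  # second edge, same endpoints
```

Here the two edges between a and b are distinguished only by their tags and attributes, and layer("DataFlow") picks out one of the superimposed graphs.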

The graph database in Atlas meets these requirements.

Kinds

Nodes and edges in XCSG can be of one or more kinds. The word "kind" is used instead of "type" so that Java or C++ types and node or edge kinds can be discussed without confusion. Nonetheless, XCSG kinds are roughly equivalent to types in languages like Java or C++.

Kinds specify the expected attributes and tags for a node or edge. Kinds for nodes and edges live in separate namespaces. This means that the expected attributes and tags for a DataFlow node may differ from the expected attributes and tags for a DataFlow edge. By allowing related node and edge kinds to have the same name, similar concepts can be conveniently grouped. For example, when we talk about a DataFlow graph we are referring to a graph composed of DataFlow nodes and DataFlow edges.

Tags

Every kind is also a tag. We know that a node or edge is of a given kind by checking that it has been tagged with said kind.
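As a rough sketch of how kinds, namespaces, and tags fit together, the following keeps node kinds and edge kinds in two separate tables and treats a kind check as a plain tag lookup. The attribute names "name" and "weight" are invented for illustration and are not part of the XCSG schema.

```python
# Separate namespaces: the same kind name may mean different things
# for nodes and for edges. Attribute names here are hypothetical.
NODE_KINDS = {"DataFlow": {"attributes": {"name"}}}
EDGE_KINDS = {"DataFlow": {"attributes": {"weight"}}}


def is_kind(element_tags, kind):
    """An element is of a kind exactly when it carries that kind as a tag."""
    return kind in element_tags


def missing_attributes(kinds, tags, attributes):
    """Map each kind found among the tags to the expected attributes it lacks."""
    missing = {}
    for tag in sorted(tags):
        spec = kinds.get(tag)
        if spec is None:
            continue  # a tag with no kind specification in this namespace
        absent = spec["attributes"] - set(attributes)
        if absent:
            missing[tag] = absent
    return missing
```

Passing NODE_KINDS or EDGE_KINDS as the first argument selects the namespace, so the same "DataFlow" tag is checked against different expectations depending on whether the element is a node or an edge.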

Some tags can be grouped by concept, for example public, protected, and private. For both readability and compactness, XCSG specifies tag groups, and in turn the specifications of kinds may refer to a tag group instead of referring directly to the tags in that group.
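One way to picture a tag group is as a named set that is expanded wherever a kind specification mentions it. In the sketch below the group name "Visibility" and the helper expand_tags are assumptions; only the three member tags come from the text above.

```python
# Hypothetical group name; the member tags are the ones named in the text.
TAG_GROUPS = {"Visibility": {"public", "protected", "private"}}


def expand_tags(references, groups=TAG_GROUPS):
    """Replace each group name with its member tags; keep other tags as they are."""
    expanded = set()
    for ref in references:
        expanded |= groups.get(ref, {ref})
    return expanded


# A kind specification can then say "Visibility" instead of listing all three tags.
allowed = expand_tags(["Visibility", "static"])
```

After expansion, allowed contains public, protected, private, and static, so the specification stays compact while checks still operate on individual tags.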

Connectivity

Kinds for nodes can specify the expected kinds of their incoming and outgoing edges. Likewise, kinds for edges can specify the expected kinds of their from and to nodes.
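A connectivity expectation of this sort could be checked as in the sketch below, which covers the edge side only. The rule table and the function name are assumptions, and the rule that a DataFlow edge connects DataFlow nodes is inferred from the DataFlow graph example rather than taken from the schema itself.

```python
# Hypothetical rule table: for each edge kind, the node kinds expected
# at its "from" and "to" ends.
EDGE_ENDPOINT_RULES = {
    "DataFlow": {"from": {"DataFlow"}, "to": {"DataFlow"}},
}


def endpoints_ok(edge_kind, from_tags, to_tags, rules=EDGE_ENDPOINT_RULES):
    """True if both endpoints carry at least one node kind the edge kind expects."""
    rule = rules.get(edge_kind)
    if rule is None:
        return True  # no expectation recorded for this edge kind
    return bool(rule["from"] & set(from_tags)) and bool(rule["to"] & set(to_tags))
```

The node-side expectation (which edge kinds may enter or leave a node of a given kind) could be expressed with a second table of the same shape.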