Extensible Common Software Graph
The eXtensible Common Software Graph (XCSG) schema defines a semantically rich graph representation of software (i.e. source code or binaries) suitable for many applications such as mining software for patterns, malware and defect detection, building static analysis tools, and code comprehension. XCSG is based on the eXtensible Common Intermediate Language developed for the DARPA Software Enabled Control program.
Like XCIL, XCSG has semantically precise definitions for program artifacts to enable a harmonious representation of software written in different languages. Without precise semantics, analysis tools can easily develop a language bias that leads to incorrect processing of other languages, especially while analyzing software written in multiple languages. For example, the static keyword in C and Java have overlapping but incompatible uses, which XCSG disambiguates. XCSG improves upon XCIL by being tailored for a graph database, and by encompassing representations of analysis results such as control flow and data flow graphs.
XCSG retains much of the UML-based nomenclature in XCIL, with some departures for the sake of user friendliness and to better fit a graph database.
This material is based on research sponsored by DARPA under agreement number FA8750-12-2-0126. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
Graph Database Requirements
XCSG assumes that a graph database allows nodes and edges to have tags and attributes.
XCSG also requires that more than one edge can exist between two nodes. Invariably, though not strictly enforced, each edge between two nodes has distinct tags or attributes. Therefore one could also think of this requirement as multiple colored graphs that have been superimposed.
The graph database in Atlas meets these requirements.
Nodes and edges in XCSG can be of one or more kinds. The word "kind" is used instead of "type" to make it easier to talk about Java or C++ types and node or edge kinds without causing confusion. Nonetheless XCSG kinds are more or less equivalent to types in languages like Java or C++.
Kinds specify the expected attributes and tags for a node or edge. Kinds for nodes and edges live in separate name spaces. This means that the required attributes and tags for a DataFlow node may be different than the required attributes and tags for a DataFlow edge. By allowing related node and edge kinds to have the same name, similar concepts can be conveniently grouped. For example, when we talk about a DataFlow graph we are referring to a graph composed of DataFlow nodes and DataFlow edges.
Nodes and edges in XCSG can be tagged. XCSG defines a set of standard tags, however users can tag nodes and edges with any arbitrary String tag. Unlike UML tags, XCSG tags do not have values.
Every kind is also a tag. We know that a node or edge is of a given kind by checking that it has been tagged with said kind.
Some tags can be grouped by concept, for example public, protected, and private. XCSG defines tag enumerations to group related tags. In turn the specifications of kinds may refer to tag groups instead of directly to the graphs in a group.
In principle a user could implement their own enumerations and place them in attributes. However this is highly discouraged. Well designed tag enumerations should seldom collide, and lead to more compact queries on the graph database.
Nodes and edges in XCSG can have a set of attributes in the form of key-value pairs. Keys are unique, therefore if a subsequent write to a key will override the previous value associated with a key.
XCSG prescribes these possible types for attribute values:
Although your graph database may support placing arbitrary objects into attribute values, these are not guaranteed to be serialized or otherwise interchangeable. Although arbitrary objects may be useful to store temporary values while processing an XCSG graph, they should not be part of the result of said processing.
It's important to note that there is no boolean type. If a concept is boolean it should be represented as a tag. Encoding booleans as Strings or Numbers is highly discouraged.
Kinds for nodes can specify expected in and out edge kinds. Likewise, kinds of edges can specify expected from and to node kinds.