This document is the current gospel. Please use and refer to it,
preferably with version number. Parts likely to be slightly updated
will be marked as such.
The Software Development Foundation (SDS) is an open architecture
designed for developing tools for software development. Based on XML,
the SDS makes it easy for most languages and other systems to
incorporate its tools. The core of SDS is the Code Structure Format
(CSF), which collects the most interesting information about source
code in a form that tools can easily use.
2. Thoughts
Most popular languages share certain features, but some have their own
specialties, which can be good or bad. The shared features are
basically what CSF should capture, but CSF should also leave room for
specialties; e.g. multimethods are a central part of CLOS, and to be
able to get useful information about CLOS we need multimethod
support.
Multi-dispatch
Ironically, the CLOS approach with multimethods, which can be
seen as a special case, is in fact a much cleaner and more general
solution than the classic message-passing paradigm of Smalltalk, C++,
Java, etc. Message-passing works decently in Smalltalk and to a certain
extent in Java. In C++ it works for the cases it shares with Java and
Smalltalk, but is clearly broken when it comes to operator overloading
and templates. In most ways the CLOS approach is the best way to proceed
on methods and classes, as it is a superset of the functionality in
other languages. This will however change how some functions are
represented, i.e. a C++ function DRAW in class SHAPE will now look like
DRAW(SHAPE *) instead of the C++ syntax SHAPE.DRAW(). Serious
developers know that the dispatch is done on the first argument, which
is the this-pointer and has type "SHAPE *".
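The rewriting described above can be sketched in a few lines of Python. The function name as_multimethod and its parameters are made up for this illustration and are not part of CSF; the sketch only shows how the implicit receiver becomes an explicit first parameter.

```python
def as_multimethod(owner, name, param_types=(), receiver_suffix=" *"):
    """Render a member function as a function dispatching on its first argument.

    The receiver (the implicit this-pointer) becomes an explicit first
    parameter, so SHAPE.DRAW() is rendered as DRAW(SHAPE *).
    """
    receiver = owner + receiver_suffix
    return "{}({})".format(name, ", ".join([receiver, *param_types]))


print(as_multimethod("SHAPE", "DRAW"))                  # DRAW(SHAPE *)
print(as_multimethod("SHAPE", "MOVE", ["int", "int"]))  # MOVE(SHAPE *, int, int)
```

A front-end for a language without pointers would presumably pass an empty receiver_suffix.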
Modules
However, CLOS doesn't have the classic module-concept,
but has packages, which are more orthogonal to the Lisp-reader. Java
has a module-concept (ironically called packages) which is quite
useful and orthogonal to the language, but lacks some features. C++
has something called namespaces, which aren't modules but may resemble
a poor version of Common Lisp packages (which basically are "spaces"
for "names"). C++ inherited several concepts from C which can be
thought of as modules (separate header files, libraries, prefix-hacks)
but is far from having a module-concept. Python has modules which seem
to work as modules, and that is a redeeming feature of a language
composed of special cases.
Modules are a good thing, and most people I know design and think of
code as being in modules and as being modular. CSF must in some way
capture modular information but should leave actual modules to other
formats, e.g. SDOC, because of the big differences between languages.
Possibly barring some Python features, going for a "namespace"
solution will capture most of the information.
File-orientation
Some languages are file-oriented, while others are not. C and C++
are clearly file-oriented, Java is in many ways
directory/package-oriented, while CLOS is really difficult to
place. Python is also file-oriented, but can be seen from different
perspectives due to its module-concept. Assuming that code that is
parsed for CSF comes from actual files, continuing to build on a
LOCATION tag with FILE info seems decent enough. Room must also be
given for the "static" specifier in C/C++, which gives the method or
variable file scope.
Declarations
As said, multimethods are the right way to go for representing
methods, but the representation of methods (or functions or procedures)
in code behaves very differently in various languages. Some languages
insist that the method being called be known in advance, thereby
requiring declarations. Declarations in themselves are interesting as
they provide much information, and because their location may be of
interest (e.g. "Where did I declare that method? Where must I change my
code?"). It must also be possible to combine a declaration with the
definition.
Method names
As for method names, C doesn't dispatch on arguments but only on the
name, and therefore only one method/function may be registered on a
name. The other mentioned languages don't have this restriction, and
by accepting multimethods we cannot compare functions on name only
and need to compare types when trying to find matches. This increases
complexity and requires a scheme for calculating the ids
of various objects. The fact that not all languages are
case-sensitive makes things even more difficult.
Method arguments
Methods in some languages have support for optional arguments,
arguments with default values, "rest" arguments and keyword
parameters, and this is something CSF will support.
Short summary
This basically means that no language fits 100% into CSF, but the
most common languages should easily have support for 80% of their
constructs. For some languages, CSF will be an upgrade (e.g.
multimethod representation) while others might miss their advanced
features.
3. The CSF-structure
CSF is an XML-based data format and therefore follows the XML
standard, implying that a CSF data-file is required to follow the
XML standard as well. It is preferred that generated CSF files are in
UTF-8 format. The actual format of each element can be found in the
DTD (Appendix B) and in the DTD documentation.
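A minimal sketch of producing such a file with Python's standard xml.etree.ElementTree module follows. The helper name build_minimal_csf and the sample package are hypothetical; the element and attribute names are taken from the DTD in Appendix B.

```python
import xml.etree.ElementTree as ET


def build_minimal_csf(language):
    """Build a tiny CSF tree: a csf root holding one package with a location."""
    root = ET.Element("csf", {"language": language})
    package = ET.SubElement(
        root, "package",
        # Hypothetical id, written according to the BNF for ids.
        {"id": "@package@foo@foo@@[A.java:1,1,,]@", "name": "foo"},
    )
    ET.SubElement(package, "location",
                  {"file": "A.java", "startline": "1", "endline": "1"})
    return root


root = build_minimal_csf("java")
# Serialise as UTF-8 bytes with an XML declaration, as preferred for CSF files.
data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(data.decode("utf-8"))
```

Writing the bytes to disk unchanged keeps the declared encoding and the actual encoding in agreement.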
Identification
Most major elements of the CSF dataformat are identifiable with an
ID attribute. This has type CDATA (not type ID) and has a specified
format in BNF (E is the empty string):
SEP -> '@'
ID -> SEP TYPE SEP NAME SEP FULLNAME SEP PARAMS SEP LOCATIONS SEP
TYPE -> "method" | "class" | "enum" | "typespec" |
"package" | "variable"
NAME -> "the name of the object"
FULLNAME -> E | "the fully qualified name of the object"
PARAMS -> E | PARAMLIST
PARAMLIST -> PARAMTYPE | PARAMLIST ',' PARAMTYPE
PARAMTYPE -> "name of parameter type"
LOCATIONS -> E | LOCLIST
LOCLIST -> LOC | LOCLIST LOC
LOC -> '[' "filename" LINEINFO ']'
LINEINFO -> E | ':' STARTLINE ',' ENDLINE ',' STARTCOL ',' ENDCOL
STARTLINE -> E | number
ENDLINE -> E | number
STARTCOL -> E | number
ENDCOL -> E | number
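The grammar can be read back into its parts with a small parser. The Python sketch below is one possible reading, not a reference implementation: the names parse_id, CsfId and Loc are invented here, and it assumes that names and types contain no '@', that parameter types contain no ',', and that filenames contain neither ':' nor ']'.

```python
import re
from typing import NamedTuple, Optional


class Loc(NamedTuple):
    filename: str
    startline: Optional[int]
    endline: Optional[int]
    startcol: Optional[int]
    endcol: Optional[int]


class CsfId(NamedTuple):
    type: str
    name: str
    fullname: str
    params: tuple
    locations: tuple


# LOC -> '[' "filename" LINEINFO ']', with LINEINFO optional.
_LOC = re.compile(r"\[([^\]:]+)(?::(\d*),(\d*),(\d*),(\d*))?\]")


def _num(text):
    """Convert a possibly empty or missing number field to int or None."""
    return int(text) if text else None


def parse_id(text):
    """Split an id of the form @TYPE@NAME@FULLNAME@PARAMS@LOCATIONS@."""
    if len(text) < 2 or not (text.startswith("@") and text.endswith("@")):
        raise ValueError("id must start and end with '@'")
    fields = text[1:-1].split("@")
    if len(fields) != 5:
        raise ValueError("expected five '@'-separated fields")
    type_, name, fullname, params, locations = fields
    locs = tuple(
        Loc(m.group(1), *(_num(g) for g in m.groups()[1:]))
        for m in _LOC.finditer(locations)
    )
    return CsfId(type_, name, fullname,
                 tuple(params.split(",")) if params else (), locs)


print(parse_id("@class@A@A@@[file.c:1,5,,]@"))
```

Empty line and column fields come back as None, which keeps "not specified" apart from a real value of 0.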
The primary way for CSF tools to identify an object is by TYPE and
LOCATIONS. If that fails, one should check FULLNAME or, if that
fails, NAME. If any ambiguities remain, PARAMS is used.
FULLNAME and PARAMS are language-dependent and may be skipped by
any tool. The language is specified in the CSF-tag.
Some examples of the use of ids are found in Appendix A.
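One way to read the lookup order described above is sketched below in Python. The helpers entry and find_matches are hypothetical, ids are represented as plain dictionaries of already-parsed fields, and an empty PARAMS is treated as "unknown" rather than "no parameters"; a real tool may well choose differently.

```python
def entry(type_, name, fullname="", params=(), locations=()):
    """Hypothetical record holding the parsed parts of an id."""
    return {"type": type_, "name": name, "fullname": fullname,
            "params": tuple(params), "locations": tuple(locations)}


def find_matches(target, candidates):
    """Match on TYPE, then LOCATIONS, then FULLNAME, then NAME.

    PARAMS is only consulted when more than one candidate is left.
    """
    same_type = [c for c in candidates if c["type"] == target["type"]]
    hits = []
    for key in ("locations", "fullname", "name"):
        if not target[key]:
            continue  # this part of the id is empty, try the next one
        hits = [c for c in same_type if c[key] == target[key]]
        if hits:
            break
    if len(hits) > 1 and target["params"]:
        hits = [c for c in hits if c["params"] == target["params"]]
    return hits


decl = entry("method", "foo", "A::foo", ["const char*", "A*"], ["file.c:3"])
defn = entry("method", "foo", "A::foo", ["const char*", "A*"], ["file.c:8"])
overload = entry("method", "foo", "A::foo", ["int"], ["file.c:12"])
print(find_matches(decl, [defn, overload]) == [defn])
```

Here the declaration and the definition have different locations, so the match falls through to FULLNAME and is then narrowed by PARAMS.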
Info
Most of the major tags have an INFO-field to contain most of the
information for that element. The fields have different meanings for
each of the major elements, and the documentation should be checked
for each of them. The form of info is simple:
<info type="infotype" value="the_value" info="extra_info"/>
Please note that the info-field is made this way to be able to
handle changes to elements in CSF in a graceful way. This
document will be updated with new known values when they are
introduced. A changelog and version-system will be used to make this
simple to keep track of. Tools should use only those info-fields that
are of interest to them and ignore the rest. Several info-fields
with the same type are usually allowed. Expect a notice when there
should only be one occurrence of a specified type.
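A tool might read info-fields along the following lines. This is a sketch using Python's standard xml.etree.ElementTree; the function name collect_info and the info type "future-feature" are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_info(element, wanted):
    """Group the (value, info) pairs of the wanted info types.

    Unknown types are skipped, and repeated types are all kept.
    """
    result = defaultdict(list)
    for info in element.findall("info"):
        info_type = info.get("type")
        if info_type in wanted:
            result[info_type].append((info.get("value"), info.get("info")))
    return dict(result)


method = ET.fromstring(
    '<method id="x" name="draw">'
    '<info type="mod" value="member"/>'
    '<info type="mod" value="const"/>'
    '<info type="dispatch" value="single"/>'
    '<info type="future-feature" value="whatever"/>'
    '</method>'
)
print(collect_info(method, {"mod", "dispatch"}))
```

Because unknown types are simply skipped, a tool written this way keeps working when new info types are introduced.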
3.2 Method
Methods are probably the most used abstraction mechanism and need
careful design. Methods also come in many forms and shapes and might
have different semantics in different languages. Most of the work
needed for front-ends and tools will be related to methods, so we
should do this tag right and make it as convenient as possible.
The basic method looks like this:
<method id=some_id name=some_name>
<where ...> +
<access ...>
<info ...> *
<retval ...> *
<arg ...> *
... (the content of the method)
</method>
Needless to say, much of the complexity is hidden in the
subfields. Please note that for various reasons the id-field is mere
CDATA. The content can be new methods, variables, directives, etc.
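A front-end could emit a method element roughly as follows. The sketch uses Python's standard xml.etree.ElementTree; make_method is a made-up helper, and the visibility value "public" is an assumption, since the legal values of the access element are not listed here.

```python
import xml.etree.ElementTree as ET


def make_method(method_id, name, where, visibility, infos):
    """Build a method element with children in the order where, access, info."""
    method = ET.Element("method", {"id": method_id, "name": name})
    for what, filename, line in where:
        holder = ET.SubElement(method, "where", {"what": what})
        ET.SubElement(holder, "location",
                      {"file": filename, "startline": str(line),
                       "endline": str(line)})
    ET.SubElement(method, "access", {"visibility": visibility})
    for info_type, value in infos:
        ET.SubElement(method, "info", {"type": info_type, "value": value})
    return method


foo = make_method(
    "@method@foo@A::foo@const char*,A*@[file.c:3,3,,][file.c:8,8,,]@",
    "foo",
    where=[("declaration", "file.c", 3), ("definition", "file.c", 8)],
    visibility="public",  # assumed value, not defined by this document
    infos=[("class", "A"), ("mod", "member")],
)
print(ET.tostring(foo, encoding="unicode"))
```

One where element per declaration or definition lets a single method entry point at both places in the source.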
Type: | Value: | Info-field: | Explanation |
package | name of package | | Names the package it is a member of (can
also be an id), where applicable. Helpful for tools with limited info. |
class | name of class | | Names the owning class, i.e.
the class where the method is. If the method is in a package-scope, use
the package field instead. This field eases work when sorting
elements later and making pointers to "parent". Using an id is also
allowed. |
dispatch | none | | Specifies how
dispatch is done for the function. The default is that this is not
specified but this can be done when one feels like it. |
| single | | This is the
default in Java and is the same as virtual functions in C++. The
dispatch is done on the "owning" object. |
| multi | | This is the
default for CLOS generic functions. The dispatch is done on the type
of all passed arguments. |
language | name of language | | Specifies which language the function
is in. This is useful for languages where one can have functions from
several other languages. This is mainly a feature for documentation. |
mod | member | | The mod field
specifies what kind of function we're dealing with. Specifying this as
'member' tells us that it is a member-function and "belongs to" a
class. |
| friend | | Specifying mod as
'friend' tells us that this function is a friend of some class. (C++ specific) |
| static | | Specifying mod as
'static' tells us that this function is a class-function (i.e.
static). (C++/Java) |
| abstract | | Specifying mod as
'abstract' tells us that this function has to be implemented in a
subclass. This is the same as a pure virtual function in C++. When
type is abstract, virtuality is implied. |
| constructor | | Specifying mod as
'constructor' tells us that this function constructs/creates an object. |
| destructor | | Specifying mod as
'destructor' tells us that this function is called when an object is
deleted/wiped. (C++) |
| operator | | Specifying mod as
'operator' tells us that this function is really an operator (which
means overloading in C++) |
| virtual | | Specifying mod as
'virtual' tells us that this function is a polymorphic function with
single dispatch. It is the default in Java, but you must specify it in C++. |
| native | | Specifying mod as
'native' tells us that this function is a native function and is not
part of the interpreted system. (Java) |
| function | | Specifying mod as
'function' tells us that this is a normal function. This is redundant
but may be added. |
| method | | Specifying mod as
'method' tells us that this is a method with dispatch (single or
multi) and is redundant when specifying dispatch. But a CSF-frontend
might want to add this when it sees fit. |
| generic | | Specifying mod as
'generic' tells us that this is a sort of declaration for later methods
and is not specialised and contains no implementation. (CLOS) |
| explicit | | Specifying mod as
'explicit' tells us that this function must be explicitly called and
should not be used by auto-converters. (C++) |
| final | | Specifying mod as
'final' tells us that this function cannot be reimplemented in a
subclass. (Java) |
| const | | Specifying mod as
'const' tells us that this function is not allowed to change the
object it belongs to, nor to call non-const member-functions. (C++) |
| macro | | Specifying mod as
'macro' tells us that this is a powerful macro (e.g. as in CL) and
might not follow normal evaluation of arguments. |
| accessor | | Specifying mod as
'accessor' tells us that this is a function wrapper for an object, and
might provide e.g. a reader and a writer. |
| reader | | Specifying mod as
'reader' tells us that this is a function wrapper which reads the
value of an object. A typical getXXX() function is a reader. |
| writer | | Specifying mod as
'writer' tells us that this is a function wrapper which assigns a
value to an object. A typical setXXX() function is a writer. |
optim | inline | | The optim field
specifies what kind of optimisations are applied to the function.
Specifying this as 'inline' tells us that it is meant to be inlined. |
| memoized | | Specifying the optim field as
'memoized' tells us that this function's results are memoised. |
calls | function called | comma-separated
arguments | The calls field names a function (may be an id) that this
function calls. |
calledby | calling function | comma-separated
arguments | The calledby field names a function (may be an id) that calls
it. |
calling | convention | | Specifies the
calling convention/mangling used for the function. Common conventions
are pascal, c, fortran or c++. |
throw | exception | |
The throw field names an exception (may be an id) that the function may throw. |
advise | before | |
Says that this function is a before-method for the real method. (CLOS) |
| after | |
Says that this function is an after-method for the real method. (CLOS) |
| around | |
Says that this function is an around-method for the real method. (CLOS) |
parent | the id | |
Names the "parent" method of this method. This means the method one
step up in the (inheritance) hierarchy (or, if that is difficult to
compute, the unspecialised one), and if none is above, possibly the
generic. (any) |
arginfo | allow-any-keyword | |
Specifies that the method accepts all keywords given. (CLOS &allow-other-keys, ??) |
metaclass | metaclass-spec | |
Specifies the metaclass of the method. Can be an id or a name. (CLOS) |
documentation | text | |
Documentation which is part of the language, e.g. the first string
in a method in CL. Should not be used for comments in
languages like C++/Java. |
pattern | name | explanation |
Specifies which pattern this function is or is part of. Examples are
higher-order, function-builder, .. The explanation of the pattern
should be in the info-field. |
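As an example of what a tool can do with these fields, the Python sketch below derives a simple call graph from the calls and calledby info-fields. The sample document is trimmed for brevity (the where and access children are left out and the ids are placeholders), and call_graph is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<csf language="c">
  <method id="m1" name="main">
    <info type="calls" value="parse" info="argc,argv"/>
    <info type="calls" value="report"/>
  </method>
  <method id="m2" name="parse">
    <info type="calledby" value="main"/>
  </method>
  <method id="m3" name="report"/>
</csf>
"""


def call_graph(root):
    """Map each method name to the set of names it calls."""
    graph = {}
    for method in root.iter("method"):
        name = method.get("name")
        graph.setdefault(name, set())
        for info in method.findall("info"):
            if info.get("type") == "calls":
                graph[name].add(info.get("value"))
            elif info.get("type") == "calledby":
                # calledby is the reverse edge: the named function calls us.
                graph.setdefault(info.get("value"), set()).add(name)
    return graph


graph = call_graph(ET.fromstring(SAMPLE.strip()))
print(sorted(graph["main"]))
```

Both fields feed the same graph, so a front-end only needs to emit whichever direction it can compute.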
3.3 Class
Classes are an important abstraction mechanism and in most respects
replace older constructs like structs, records, and to a certain
degree unions. Classes in CSF do not differ a lot from classes in
Java, C++ or CLOS, but also serve as placeholders for info about a
struct, union, etc. The form is relatively easy to understand:
<class id=class_id name=name_of_class>
<location ...>
<access ...>
<inherit ...> *
<info ...> *
... (the content of the class)
</class>
The content of the class can basically be just about any content
(class, method, enum, variable, typespec, comment or directive). The
DTD allows packages as well, but this is at best uncommon in actual code.
INFO
The INFO-field serves the same role for CLASS as INFO
does for METHOD. Several INFO fields with the same type are allowed. Some
known values and their meanings:
Type: | Value: | Info-field: | Explanation |
mod | normal | | Specifies
what kind of class we deal with. normal is an ordinary class. |
| struct | | Specifies
that the "class" is a struct (C/C++). |
| union | | Specifies
that the "class" is a union (C/C++). |
| interface | | Specifies
that the "class" is an interface (Java). |
| final | | Specifies
that the "class" is 'final', i.e. cannot be inherited from (Java). |
| abstract | | Specifies
that the "class" is 'abstract', i.e. cannot be instantiated (Java++). |
| template | | Specifies
that the "class" is a template/parametrised class (C++/Pizza/++). |
friend | name of class | | Specifies
a friend class of a class. Using an id is allowed. Several friends may be
specified. (C++) |
param | full text of param | | Specifies
a parameter to the (template) class. Several parameters may be specified. (C++) |
metaclass | metaclass-spec | |
Specifies the metaclass of the class. Can be an id or a name. (CLOS/Smalltalk) |
documentation | text | |
Documentation which is part of the language, e.g. the first string
in a method in CL. Should not be used for comments in
languages like C++/Java. |
pattern | name | explanation |
Specifies which pattern this class is or is part of. Examples are
singleton, iterator, .. The explanation of the pattern
should be in the info-field. |
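The class fields above lend themselves to simple queries. The following Python sketch lists, for each class, its mod values and the classes it inherits from; the sample is trimmed (location and access are omitted, ids are placeholders) and class_summary is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<csf language="java">
  <class id="c1" name="Shape">
    <info type="mod" value="abstract"/>
  </class>
  <class id="c2" name="Circle">
    <inherit name="Shape"/>
    <info type="mod" value="final"/>
  </class>
  <class id="c3" name="Point">
    <info type="mod" value="struct"/>
  </class>
</csf>
"""


def class_summary(root):
    """Map each class name to its mod values and its parent class names."""
    summary = {}
    for cls in root.iter("class"):
        mods = [i.get("value") for i in cls.findall("info")
                if i.get("type") == "mod"]
        parents = [p.get("name") for p in cls.findall("inherit")]
        summary[cls.get("name")] = {"mods": mods, "parents": parents}
    return summary


summary = class_summary(ET.fromstring(SAMPLE.strip()))
print(summary["Circle"])
```

Since root.iter visits nested elements too, classes declared inside other classes or packages are picked up as well.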
Type: | Value: | Info-field: | Explanation |
type | some-value | extra | Specifies
the type of the variable. If it is a linkable type, specify a legal
id. The info-field should be "array" if it is an array field and then
the dimension field should be used. |
dimension | dim string | | The dimension string is of the form [n]*
where n is the size of that particular array dimension, or 0 when it
isn't specified. A C++ array int foo[4][5] would be [4][5], while a
three-dimensional array in Java would be [0][0][0]. |
mod | static | | Specifies
that the variable is a static member of a class. |
| auto | | Specifies
that the variable is automatically allocated on the stack (C/C++). |
| volatile | | Specifies
that the variable is supposed to be a real variable and not to be
optimised away or tucked away in a register (C/C++). |
| register | | Specifies
that the variable is allowed to be tucked away in a register (C/C++). |
| extern | | Specifies
that the variable is external and is specified somewhere else (C/C++). |
| mutable | | Specifies
that the variable is mutable, and may be altered even by
const-functions and in constant classes (C++). |
| defparameter | | Specifies
that the variable is a DEFPARAMETER and is reset every time the file
is loaded (CL). |
| dynamic | | Specifies
that the variable should be bound dynamically (CL). |
| lexical | | Specifies
that the variable should be bound lexically (CL). |
class | name of class | | Names the class it is a member of (can also
be an id), where applicable. Helpful for tools with limited info. |
package | name of package | | Names the package it is a member of (can
also be an id), where applicable. Helpful for tools with limited info. |
documentation | text | | The
documentation which is part of a variable declaration, as in CL's
DEFVAR or DEFCONSTANT. |
pattern | name | explanation | If the variable is part of a specific
pattern, that could be added here, e.g. hook. The explanation might be
in the extra info-field. |
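The dimension string is easy to convert to and from a list of sizes. The Python sketch below shows one way; parse_dimensions and format_dimensions are hypothetical names, and None is used here for an unspecified size on the way out.

```python
import re


def parse_dimensions(dim):
    """Turn a dimension string such as "[4][5]" into [4, 5].

    A size of 0 means that the size is not specified.
    """
    if not re.fullmatch(r"(\[\d+\])*", dim):
        raise ValueError("not a dimension string: %r" % dim)
    return [int(n) for n in re.findall(r"\[(\d+)\]", dim)]


def format_dimensions(sizes):
    """Turn [4, 5] into "[4][5]"; None is written as 0 (unspecified)."""
    return "".join("[%d]" % (size or 0) for size in sizes)


print(parse_dimensions("[4][5]"))             # [4, 5]
print(format_dimensions([None, None, None]))  # [0][0][0]
```

The number of entries gives the number of array dimensions, so the three-dimensional Java array still reports three dimensions even though no size is known.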
3.9 Directive
Directives take many forms and shapes in the various languages. To
support them all is insane, but some can be supported and can be useful,
e.g. #include can be useful for include-trees, or #define for finding
specific macros. How these directives are treated is up to the
individual app, and they can easily be ignored.
<directive type=dir_type value=dir_value info=extra_info/>
The problem is that this is not directly intuitive and is
language-dependent. Some examples
are included:
#define NULL 0L
<directive type="define" value="NULL" info="0L"/>
#define MAX(a,b) (((a) > (b)) ? (a) : (b))
<directive type="define" value="MAX(a,b)" info="(((a) > (b)) ? (a) : (b))"/>
#include <stdio.h>
<directive type="include" value="&lt;stdio.h&gt;"/>
#pragma align 1
<directive type="pragma" value="align" info="1"/>
A better solution would be nice.
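The mapping used in the examples can be sketched as a small converter for C preprocessor lines. This Python version is illustrative only: directive_element and its regular expressions are assumptions, it covers just #define, #include and #pragma, and it follows the examples in using a type attribute (the DTD in Appendix B declares name instead).

```python
import re
import xml.etree.ElementTree as ET

# (directive type, pattern); group 1 becomes value, group 2 (if any) info.
_PATTERNS = (
    ("define", re.compile(r"#\s*define\s+(\w+(?:\([^)]*\))?)\s*(.*)")),
    ("include", re.compile(r"#\s*include\s+(\S+)")),
    ("pragma", re.compile(r"#\s*pragma\s+(\S+)\s*(.*)")),
)


def directive_element(line):
    """Convert one preprocessor line to a directive element, or None."""
    line = line.strip()
    for dir_type, pattern in _PATTERNS:
        match = pattern.fullmatch(line)
        if match:
            attrs = {"type": dir_type, "value": match.group(1)}
            if pattern.groups >= 2 and match.group(2):
                attrs["info"] = match.group(2)
            return ET.Element("directive", attrs)
    return None  # unsupported directives are simply ignored


for text in ("#define NULL 0L", "#include <stdio.h>", "#pragma align 1"):
    print(ET.tostring(directive_element(text), encoding="unicode"))
```

Serialising through an XML library takes care of escaping, so the angle brackets of an include come out as entity references in the attribute value.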
3.13 Access
The access element is also used in several CSF elements to describe
which access the described object has in its context. Its form is
simple:
<access visibility=the_visibility scope=the_scope/>
[Details to be added later; see the DTD documentation for the time being.]
3.14 Other elements
Other elements mentioned but not defined include LOCATION and
ACCESS. Those remain the same as their CSF1 versions.
Appendix A - Id examples
C++ - file.c
1 class A {
2
3 void foo(const char *, A *a);
4
5 };
6
7 void
8 A::foo(const char*, A *a) { }
A has id: @class@A@A@@[file.c:1,5,,]@
foo declaration has id: @method@foo@A::foo@const char*,A*@[file.c:3,3,,]@
foo def has id: @method@foo@A::foo@const char*,A*@[file.c:8,8,,]@
combined foo has id: @method@foo@A::foo@const char*,A*@[file.c:3,3,,][file.c:8,8,,]@
Java - A.java
1 package foo;
2
3 class A {
4
5 void bar(String s, A a) { }
6
7 }
foo has id: @package@foo@foo@@[A.java:1,1,,]@
A has id: @class@A@foo.A@@[A.java:3,7,,]@
bar has id: @method@bar@foo.A.bar@String,A@[A.java:5,5,,]@
Appendix B - The CSF DTD
<!--
DTD for proposed Code Structure Format (CSF 2)
November 99
-->
<!ENTITY % text " #PCDATA ">
<!-- add preproc? -->
<!-- these may occur anywhere in a csf-document -->
<!ENTITY % freeforall " package | class | method | enum | variable | typespec | comment | directive ">
<!-- the element -->
<!ELEMENT csf ((%freeforall;)*)>
<!ATTLIST csf
language CDATA ""
>
<!-- describes a location -->
<!ELEMENT location EMPTY>
<!ATTLIST location
file CDATA ""
startline CDATA "-1"
startcol CDATA "-1"
endline CDATA "-1"
endcol CDATA "-1"
>
<!ELEMENT info EMPTY>
<!ATTLIST info
type CDATA #IMPLIED
value CDATA #IMPLIED
info CDATA #IMPLIED
>
<!-- describes some namespace/package, needs more work -->
<!ELEMENT package (location?,info*,(%freeforall;)*) >
<!ATTLIST package
id CDATA #REQUIRED
name CDATA #IMPLIED
>
<!-- The class abstraction -->
<!ELEMENT class (location,access,inherit*,info*,(%freeforall;)*) >
<!ATTLIST class
id CDATA #REQUIRED
name CDATA #IMPLIED
>
<!-- function/method -->
<!-- itsdecl should be changed.. may be more than one decl -->
<!ELEMENT method (where+,access,info*,retval*,arg*) >
<!ATTLIST method
id CDATA #REQUIRED
name CDATA ""
>
<!ELEMENT where (location)>
<!ATTLIST where
what (declaration|definition|unknown) "unknown"
>
<!-- see other docs -->
<!ELEMENT retval (info*)>
<!ELEMENT arg (info*)>
<!-- some variable -->
<!ELEMENT variable (location,access,info*) >
<!ATTLIST variable
id CDATA #REQUIRED
name CDATA ""
>
<!ELEMENT enum (location,access,enumval*) >
<!ATTLIST enum
id CDATA #REQUIRED
name CDATA ""
>
<!ELEMENT enumval EMPTY>
<!ATTLIST enumval
name CDATA ""
value CDATA ""
>
<!ELEMENT typespec (location,access,info*) >
<!ATTLIST typespec
id CDATA #REQUIRED
name CDATA #REQUIRED
>
<!ELEMENT access EMPTY>
<!ATTLIST access
visibility CDATA ""
scope CDATA ""
>
<!ELEMENT inherit (info*)>
<!ATTLIST inherit
name CDATA ""
>
<!ELEMENT comment (location,text)>
<!ELEMENT directive (location)>
<!ATTLIST directive
name CDATA ""
value CDATA ""
info CDATA ""
>
<!ELEMENT text (%text;)>