|
CSF Specification

Written by Stig E Sandø

Status of this document

This document is the current gospel. Please use and refer to it, preferably with its version number. Parts likely to be slightly updated will be marked as such.
Abstract

The Software Development Foundation (SDS) is an open architecture designed for developing tools for software development. Because it is based on XML, the SDS makes it easy for most languages and other systems to incorporate its tools. The core of SDS is the Code Structure Format (CSF), which collects the most interesting information about source code in a form that tools can easily use.
1. Introduction

Part of the idea behind CSF v1 was to capture a large subset of the information available about code in its various forms. CSF v1 didn't capture enough information, and parts of it were confusing. This specification presents one way to look at code, and hopefully it will capture a large enough subset without ruining readability and usability. From this point on, "CSF" means CSF v2; any reference to the old CSF will say CSF v1.

The description of CSF first includes some thoughts about language features, which should be read. A description of the most important elements then follows. Please check the DTD documentation and the DTD where this specification is unclear.

All values for info-fields in this document are suggestions and might differ from the values you should use when implementing. Please check the collected list of values and their meanings when implementing CSF support.

2. Thoughts

Most popular languages share certain features, but some have their own specialties, which can be good or bad. The shared features are basically what CSF should capture, but CSF should also leave room for specialties; e.g. multimethods are a central part of CLOS, and to get useful information about CLOS we need multimethod support.

Multi-dispatch

Ironically, CLOS multimethods, which can be seen as a special case, are in fact a much cleaner and more general solution than the classic message-passing paradigm of Smalltalk, C++, Java, etc. Message-passing works decently in Smalltalk and to a certain extent in Java. In C++ it works for the cases that Java and Smalltalk also cover, but it is clearly broken when it comes to operator overloading and templates. In most ways the CLOS approach is the best way to proceed on methods and classes, as it is a superset of the functionality in other languages.

This will however change how some functions are represented, i.e. a C++ function DRAW in class SHAPE will now look like DRAW(SHAPE *) instead of the C++ syntax SHAPE.DRAW(). Serious developers know that the dispatch is done on the first argument, the this-pointer, which is of type "SHAPE *".

Modules

CLOS, however, doesn't have the classic module concept, but has packages, which are more orthogonal to the Lisp reader. Java has a module concept (ironically called packages) which is quite useful and orthogonal to the language, but lacks some features. C++ has something called namespaces, which aren't modules but may resemble a poor version of Common Lisp packages (which basically are "spaces" for "names"). C++ inherited several concepts from C which can be thought of as modules (separate header files, libraries, prefix hacks) but is far from having a module concept. Python has modules which seem to work as modules, and that is a redeeming feature of a language composed of special cases.

Modules are a good thing, and most people I know design and think of code as being in modules and as being modular. CSF must in some way capture modular information but should leave actual modules to other formats, e.g. SDOC, because of the big differences. Possibly barring some Python features, going for a "namespace" solution will capture most of the information.

File-orientation

Some languages are file-oriented, while others are not. C and C++ are clearly file-oriented, Java is in many ways directory/package-oriented, while CLOS is really difficult to place. Python is also file-oriented, but can be seen from different perspectives due to its module concept. Assuming that code parsed for CSF comes from actual files, continuing to build on a LOCATION tag with FILE info seems decent enough. Room must also be made for the "static" keyword in C/C++, which gives the method or variable file scope.
Declarations

As said, multimethods are the right way to represent methods, but the representation of methods (or functions or procedures) in code behaves very differently in various languages. Some languages insist that a method being called must be known in advance, thereby requiring declarations. Declarations in themselves are interesting as they provide much information, and because their location may be of interest (e.g. "Where did I declare that method? Where must I change my code?"). It must also be possible to combine a declaration with the definition.

Method names

As for method names, C doesn't dispatch on arguments but only on the name, and therefore only one method/function may be registered under a name. The other mentioned languages don't have this restriction, and by accepting multimethods we cannot compare functions on name only but need to compare types when trying to find matches. This increases complexity and requires some scheme for calculating ids of the various objects. The fact that not all languages are case-sensitive makes things even more difficult.

Method arguments

Methods in some languages have support for optional arguments, arguments with default values, "rest" arguments and keyword parameters, and this is something CSF will support.

Short summary

This basically means that no language fits 100% into CSF, but the most common languages should easily have support for 80% of their constructs. For some languages, CSF will be an upgrade (e.g. multimethod representation) while others might miss their advanced features.

3. The CSF-structure

CSF is an XML-based data format, which implies that a CSF data file is required to follow the XML standard. It is preferred that generated CSF files are encoded in UTF-8. The actual format of each element can be found in the DTD (Appendix B) and in the DTD documentation.

Identification

Most major elements of the CSF data format are identifiable with an ID attribute. This has type CDATA (not type ID) and has a specified format, given here in BNF (E is the empty string):

    SEP       -> '@'
    ID        -> SEP TYPE SEP NAME SEP FULLNAME SEP PARAMS SEP LOCATIONS SEP
    TYPE      -> "method" | "class" | "enum" | "typespec" | "package" | "variable"
    NAME      -> "the name of the object"
    FULLNAME  -> E | "the fully qualified name of the object"
    PARAMS    -> E | PARAMLIST
    PARAMLIST -> PARAMTYPE | PARAMLIST ',' PARAMTYPE
    PARAMTYPE -> "name of parameter type"
    LOCATIONS -> E | LOCLIST
    LOCLIST   -> LOC | LOCLIST LOC
    LOC       -> '[' "filename" LINEINFO ']'
    LINEINFO  -> E | ':' STARTLINE ',' ENDLINE ',' STARTCOL ',' ENDCOL
    STARTLINE -> E | number
    ENDLINE   -> E | number
    STARTCOL  -> E | number
    ENDCOL    -> E | number

The primary way for CSF tools to match ids is by TYPE and LOCATIONS. If that fails, one should check FULLNAME or, if that fails too, NAME. If any ambiguities are present, PARAMS is used. FULLNAME and PARAMS are language-dependent and may be skipped by any tool. The language is specified in the CSF-tag. Some examples of ids are found in Appendix A.

Info

Most of the major tags have an INFO-field to contain most of the information for that element. The fields have different meanings in each of the major elements, and the documentation should be checked for each of them. The form of info is simple:

    <info type="infotype" value="the_value" info="extra_info"/>

Please note that the info-field is made this way to be able to handle changes to elements in CSF in a graceful way. This document will be updated with new known values when they are introduced. A changelog and version system will be used to make this simple to keep track of. Only those info-fields of interest to a tool should be used; the rest should be ignored. Several info-fields with the same type are usually allowed. Expect a notice when there should be only one occurrence of a specified type.
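The id grammar given under Identification can be read mechanically. The following is a minimal sketch in Python, not part of the specification: the names `parse_csf_id`, `CsfId` and `Loc` are hypothetical, and the sketch assumes that '@' never occurs inside a field and that filenames contain neither ':' nor ']'.

```python
import re
from typing import List, NamedTuple, Optional


class Loc(NamedTuple):
    filename: str
    startline: Optional[int]
    endline: Optional[int]
    startcol: Optional[int]
    endcol: Optional[int]


class CsfId(NamedTuple):
    type: str
    name: str
    fullname: str
    params: List[str]
    locations: List[Loc]


# LOC -> '[' "filename" LINEINFO ']', where LINEINFO is empty or ':' plus numbers
_LOC = re.compile(r"\[([^\]:]+)(?::([^\]]*))?\]")


def _num(text: str) -> Optional[int]:
    """An empty field (E in the grammar) becomes None."""
    return int(text) if text else None


def parse_csf_id(text: str) -> CsfId:
    """Split an id of the form @TYPE@NAME@FULLNAME@PARAMS@LOCATIONS@."""
    if len(text) < 2 or not (text.startswith("@") and text.endswith("@")):
        raise ValueError("a CSF id starts and ends with '@'")
    fields = text[1:-1].split("@")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    type_, name, fullname, params, locs = fields
    locations = []
    for filename, lineinfo in _LOC.findall(locs):
        # Pad to four entries so a missing LINEINFO yields four None values.
        parts = (lineinfo.split(",") + [""] * 4)[:4]
        locations.append(Loc(filename, *[_num(p) for p in parts]))
    return CsfId(type_, name, fullname,
                 params.split(",") if params else [], locations)


parsed = parse_csf_id("@class@A@A@@[file.c:1,5,,]@")
print(parsed.type, parsed.name, parsed.locations)
```

A tool following the matching order described above would compare `type` and `locations` first and fall back to `fullname`, `name` and `params` only when needed.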
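3.1 The Toplevel

The toplevel of a CSF document is the CSF-tag, which has a language attribute specifying the language of the content in the file.

    <CSF language="language name">
      ...
    </CSF>

The three small dots stand for the actual content, which can be any of the elements the DTD (Appendix B) allows anywhere in a CSF document: package, class, method, enum, variable, typespec, comment or directive.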
3.2 Method

Methods are probably the most used abstraction mechanism and need careful design. Methods also come in many forms and shapes and might have different semantics in different languages. Most of the work needed for front-ends and tools will be related to methods, so we should do this tag right and make it as convenient as possible. The basic method looks like this:

    <method id=some_id name=some_name>
      <where ...> +
      <access ...>
      <info ...> *
      <retval ...> *
      <arg ...> *
      ... (the content of the method)
    </method>

Needless to say, much of the complexity is hidden in the subfields. Please note that for various reasons the id-field is mere CDATA. The content can be new methods, variables, directives, etc.

WHERE

The WHERE-field is a wrapped LOCATION-field where we add some info about the declaration. It might be expanded. It currently has this form:

    <where type=declaration|definition|unknown>
      <location ...>
    </where>

INFO

Several INFO with the same type are allowed. Some known values with meaning:
3.3 Class

Classes are an important abstraction mechanism and in most respects replace older constructs like structs, records, and to a certain degree unions. Classes in CSF do not differ a lot from classes in Java, C++ or CLOS, but also serve as a placeholder for info about a struct, union, etc. The form is relatively easy to understand:

    <class id=class_id name=name_of_class>
      <location ...>
      <access ...>
      <inherit ...> *
      <info ...> *
      ... (the content of the class)
    </class>

The content of the class can basically be just about any content (class, method, enum, variable, typespec, comment or directive). The DTD allows packages as well, but this is at best uncommon in actual code.
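To illustrate how a tool might read the content of a CLASS element, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document, the class `Shape`, its ids and the visibility values "public" and "private" are made up for illustration, and element names are written in lowercase as in the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical CSF fragment; ids and access values are illustrative only.
SAMPLE = """\
<csf language="C++">
  <class id="@class@Shape@Shape@@[shape.cpp:1,6,,]@" name="Shape">
    <location file="shape.cpp" startline="1" endline="6"/>
    <access visibility="public" scope=""/>
    <variable id="@variable@count@Shape::count@@[shape.cpp:2,2,,]@" name="count">
      <location file="shape.cpp" startline="2" endline="2"/>
      <access visibility="private" scope=""/>
    </variable>
    <comment>
      <location file="shape.cpp" startline="4" endline="4"/>
      <text>end of members</text>
    </comment>
  </class>
</csf>
"""

# Element names that may appear as class content (package is legal but rare).
CONTENT = {"package", "class", "method", "enum", "variable",
           "typespec", "comment", "directive"}


def class_content(class_elem):
    """Return (tag, name) pairs for the content elements of a <class>."""
    return [(child.tag, child.get("name"))
            for child in class_elem
            if child.tag in CONTENT]


root = ET.fromstring(SAMPLE)
for cls in root.iter("class"):
    print(cls.get("name"), class_content(cls))
```

The LOCATION, ACCESS, INHERIT and INFO children are skipped because they describe the class itself rather than its content; a tool interested only in some element types can narrow `CONTENT` further.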
INFO

The INFO-field serves the same role for CLASS as INFO does for METHOD. Several INFO with the same type are allowed. Some known values with meaning:
3.4 Package

Packages (or modules or namespaces) are wildly different in various languages, but a simple subset may be represented as:

    <package id=pack_id name=pack_name>
      <location ...> ? (where declared, if declared)
      <info ...> *
      ... (the content of the package)
    </package>

Several INFO with the same type are allowed. Some known values with meaning:
3.5 Variable

Variables tend to be useful and there are few languages without them. They might have specified static types, or the value they point to might have a type, or they might be typeless. Variables differ a lot between languages and they get some of the same treatment as methods and classes.

    <variable id=var_id name=variable_name>
      <location ...>
      <access ...>
      <info ...> *
    </variable>

Several INFO with the same type are allowed. Some known values with meaning:
3.6 Enum

Enumerations seem to be C/C++ specific, though I think Pascal had something along the same lines. The ENUM element is pretty simple:

    <enum id=var_id name=enum_name>
      <location ...>
      <access ...>
      <enumval ...> *
    </enum>

where ENUMVAL has the obvious form:

    <enumval name=enumval_name value=the_value/>

3.7 Typespec

Type aliases and type specifiers take on many forms in various languages, and the C/C++ typedef is the most famous. Several other constructs exist in other languages that we also want to cover, e.g. DEFTYPE in CL.

    <typespec id=type_id name=type_name>
      <location ...>
      <access ...>
      <info ...> *
    </typespec>
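As an illustration of the ENUM form in section 3.6, a front-end could emit the element roughly as follows. This is a sketch only, using Python's standard xml.etree.ElementTree module: the helper name `make_enum`, the enumeration `Colour`, the file `colour.h` and the id are hypothetical, and the access attributes are left empty because their values are not defined in this document.

```python
import xml.etree.ElementTree as ET


def make_enum(enum_id, name, filename, startline, endline, values,
              visibility="", scope=""):
    """Build an <enum> element; `values` is a list of (name, value) pairs."""
    enum = ET.Element("enum", {"id": enum_id, "name": name})
    # Children follow the order given in the form above:
    # location, access, then any number of enumval elements.
    ET.SubElement(enum, "location", {
        "file": filename,
        "startline": str(startline),
        "endline": str(endline),
    })
    ET.SubElement(enum, "access", {"visibility": visibility, "scope": scope})
    for val_name, val in values:
        ET.SubElement(enum, "enumval", {"name": val_name, "value": str(val)})
    return enum


# Hypothetical input: enum Colour { RED, GREEN, BLUE }; on line 1 of colour.h
colour = make_enum("@enum@Colour@Colour@@[colour.h:1,1,,]@", "Colour",
                   "colour.h", 1, 1, [("RED", 0), ("GREEN", 1), ("BLUE", 2)])
print(ET.tostring(colour, encoding="unicode"))
```

A TYPESPEC element could be built the same way, with INFO children in place of the ENUMVAL children.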
3.8 Comment

Comments are kept separately because they usually are not part of ASTs and often do not contain info directly related to code. They do however often carry important info, as in Javadoc comments, and should be saved for processing by other formats. The format is really simple and allows reconstruction of which language element the comment belonged to.

    <comment>
      <location ...>
      <text>the comment</text>
    </comment>

3.9 Directive

Directives take many forms and shapes in the various languages. Supporting them all would be insane, but some can be supported and can be useful, e.g. #include can be useful for include-trees, or #define to find specific macros. How these directives are treated is up to the individual app, and they can easily be ignored.

    <directive type=dir_type value=dir_value info=extra_info/>

The problem is that this is not directly intuitive and is language-dependent. Some examples are included:

    #define NULL 0L
    <directive type="define" value="NULL" info="0L"/>

    #define MAX(a,b) (((a) > (b)) ? (a) : (b))
    <directive type="define" value="MAX(a,b)" info="(((a) > (b)) ? (a) : (b))"/>

    #include <stdio.h>
    <directive type="include" value="&lt;stdio.h&gt;"/>

    #pragma align 1
    <directive type="pragma" value="align" info="1"/>

A better solution would be nice.

3.10 Inherit

The INHERIT-field contains info about what a class inherits. It's a relatively simple field:

    <inherit name=name_of_class>
      <info ...> *
    </inherit>

Several INFO with the same type are allowed. Some known values with meaning:
3.11 Arg and Retval

These two fields have the same structure for simplicity, though a return value usually does not have a specified name. The form is simple:

    <arg>
      <info ...> *
    </arg>

Needless to say, the INFO field is vital here, and several INFO with the same type are allowed. Some known values with meaning:
3.12 Location

The location element is central to most major CSF elements and specifies the position of the object described by the CSF element. It is very lenient and has the following structure, of which only some attributes need be used:

    <location file=filename startline=num startcol=num endline=num endcol=num position=num/>

3.13 Access

The access element is also used in several CSF elements to describe which access the described object has in its context. Its form is simple:

    <access visibility=the_visibility scope=the_scope/>

[Add details later; see the DTD documentation for the time being]

3.14 Other elements

Other elements mentioned but not defined include LOCATION and ACCESS. Those remain the same as their CSF1 versions.

Appendix A - Id examples

C++ - file.c

    1 class A {
    2
    3   void foo(const char *, A *a);
    4
    5 };
    6
    7 void
    8 A::foo(const char*, A *a) { }
A has id: @class@A@A@@[file.c:1,5,,]@

Java - A.java

    1 package foo;
    2
    3 class A {
    4
    5   void bar(String, A a) { }
    6
    7 }
foo has id: @package@foo@foo@@[A.java:1,1,,]@

Appendix B - The CSF DTD

    <!-- DTD for proposed Code Structure Format (CSF 2) November 99 -->

    <!ENTITY % text " #PCDATA ">

    <!-- add preproc? -->
    <!-- these may occur anywhere in a csf-document -->
    <!ENTITY % freeforall " package | class | method | enum | variable | typespec | comment | directive ">

    <!-- the element -->
    <!ELEMENT csf ((%freeforall;)*)>
    <!ATTLIST csf
        language CDATA "" >

    <!-- describes a location -->
    <!ELEMENT location EMPTY>
    <!ATTLIST location
        file      CDATA ""
        startline CDATA "-1"
        startcol  CDATA "-1"
        endline   CDATA "-1"
        endcol    CDATA "-1" >

    <!ELEMENT info EMPTY>
    <!ATTLIST info
        type  CDATA #IMPLIED
        value CDATA #IMPLIED
        info  CDATA #IMPLIED >

    <!-- describes some namespace/package, needs more work -->
    <!ELEMENT package (location?,info*,(%freeforall;)*) >
    <!ATTLIST package
        id   CDATA #REQUIRED
        name CDATA #IMPLIED >

    <!-- The class abstraction -->
    <!ELEMENT class (location,access,inherit*,info*,(%freeforall;)*) >
    <!ATTLIST class
        id   CDATA #REQUIRED
        name CDATA #IMPLIED >

    <!-- function/method -->
    <!-- itsdecl should be changed.. may be more than one decl -->
    <!ELEMENT method (where+,access,info*,retval*,arg*) >
    <!ATTLIST method
        id   CDATA #REQUIRED
        name CDATA "" >

    <!ELEMENT where (location)>
    <!ATTLIST where
        what (declaration|definition|unknown) "unknown" >

    <!-- see other docs -->
    <!ELEMENT retval (info*)>
    <!ELEMENT arg (info*)>

    <!-- some variable -->
    <!ELEMENT variable (location,access,info*) >
    <!ATTLIST variable
        id   CDATA #REQUIRED
        name CDATA "" >

    <!ELEMENT enum (location,access,enumval*) >
    <!ATTLIST enum
        id   CDATA #REQUIRED
        name CDATA "" >

    <!ELEMENT enumval EMPTY>
    <!ATTLIST enumval
        name  CDATA ""
        value CDATA "" >

    <!ELEMENT typespec (location,access,info*) >
    <!ATTLIST typespec
        id   CDATA #REQUIRED
        name CDATA #REQUIRED >

    <!ELEMENT access EMPTY>
    <!ATTLIST access
        visibility CDATA ""
        scope      CDATA "" >

    <!ELEMENT inherit (info*)>
    <!ATTLIST inherit
        name CDATA "" >

    <!ELEMENT comment (location,text)>

    <!ELEMENT directive (location)>
    <!ATTLIST directive
        name  CDATA ""
        value CDATA ""
        info  CDATA "" >

    <!ELEMENT text (%text;)>
|