
cclang.c

The file cclang.c contains an array called cicadaLanguage[], which defines nearly every symbol that can appear in a script. Each array element is a commandTokenType structure that defines one operator in the Cicada language:


    typedef struct {
       const char *cmdString;
       ccInt precedence;
       const char *rtrnTypeString;
       const char *translation;
    } commandTokenType;
   

The first string, cmdString, is the operator symbol or name as written in Cicada. The precedence level determines how operators are grouped (see Table 2). Next comes rtrnTypeString, which tells the Cicada compiler what type(s) of data this operator ‘returns’ to the surrounding expression. The final string, translation, either encodes the operator directly as bytecode (Cicada’s native language), or else ‘expands’ the operator in terms of other Cicada commands. Most operators have a direct bytecode translation, indicated by the fact that their translation begins with an inbytecode marker (which concatenates to the following string as an unprintable character). The remaining operators lack the inbytecode marker, and their translation string is just an expression built from previously-defined operators.

To add a new operator into the language, simply add a new entry to the end of the cicadaLanguage[] array, and fill in the four fields.
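
As a rough illustration of filling in the four fields, here is what a new scripted entry might look like. Everything specific in it is an assumption made for the sake of the example: the "minus" operator is invented, ccInt is stood in for by a plain int, the keyword macros are given placeholder values rather than their real definitions in the Cicada source, and the precedence of 10 is a guess (cclang.c uses named precedence keywords instead of bare numbers).

```c
typedef int ccInt;   /* assumption: stand-in for Cicada's integer type */

typedef struct {
    const char *cmdString;
    ccInt precedence;
    const char *rtrnTypeString;
    const char *translation;
} commandTokenType;

/* Placeholder encodings, NOT the real ones from the Cicada headers.
   Each keyword is assumed to be a string macro, so that it concatenates
   with the neighboring string literals. */
#define type6arg "\x02"
#define arg1     "\x03"
#define arg2     "\x04"

/* Hypothetical scripted operator:  A minus B  expands to  A - B  */
static const commandTokenType exampleEntry = {
    type6arg "minus" type6arg,   /* cmdString: left argument, operator string, right argument */
    10,                          /* precedence: assumed value */
    "567",                       /* rtrnTypeString: copied from the addition operator */
    arg1 " - " arg2              /* translation: no inbytecode marker, so it is scripted */
};
```

Because the translation relies on the existing ‘-’ operator, an entry like this would have to come after the definition of subtraction in the array.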


cmdString:

The command string defines the various operators that Cicada expects to see in the command. Typically a command string consists of a concatenation of constant operator strings, which serve as signposts for the compiler, placeholders for sub-expressions within the command, and tokens indicating miscellaneous objects like variable names and hardcoded constants. There must be some distinctive constant string in the first or second position of a command sequence, so that the compiler knows what it’s looking at.

Many operators take left-hand and/or right-hand arguments, as indicated by keywords like type3arg to the left and/or right of the respective operator string. For example, the full command string of the define operator is


    type3arg "::" type7arg
   

The define command requires both a left-hand expression or argument (the member to define) and a right-hand argument, which can be a member name but also a type like bool. The two arguments expect different objects and therefore require a different ‘type’ of expression. These types are unrelated to variable/member types. Looking at the comment before the cicadaLanguage[] array, we see that a type 3 argument represents a variable or function, and a type 7 argument represents a code-containing expression, as one would expect. The type specifications allow the compiler to throw type-mismatch errors when expressions don’t make sense.

In some cases two different commands will use the same operator string. For example, ‘*’ is used as both the multiplication operator and the void operator, and ‘-’ is either subtraction or negation. This is only allowed if one of the operators expects a left-hand argument and the other doesn’t, so that the compiler will immediately know which of the two operators it is looking at when it sees the operator string in a script. For example, when it encounters a ‘-’, that symbol will be interpreted as a subtraction if there is an unattached expression just to the left, and a negation otherwise.

More complex definitions like while ... do can involve several operators.


    "while" type6arg "do" type1arg
   

The pattern is always: operator strings like ‘do’ alternating with arguments. Some complex definitions involve an optionalargs keyword: everything before the keyword is required, but everything afterwards is optional. For example, the if command


    "if" type6arg "then" type1arg optionalargs "else" type1arg
   

requires an if and a then, but the else is optional.

There are 10 allowed argument types: type0arg through type9arg. There are also a few special types. A typeXarg accepts any type of argument, and is used by the (...) operator with no left-hand argument (i.e. the grouping operator, not a function call) to allow the user to group any sort of expression, even entire commands. The commentarg keyword denotes a block of text to be entirely ignored until the next operator string is encountered (i.e. everything from a comment bar ‘|’ to an end-of-line is skipped). chararg and stringarg treat the argument as text containing one or several characters respectively.

Finally, there are several special operators that don’t have any operator string at all. If the operator string is simply int_constant, then the operator is read when the compiler encounters a number that it deems to be an integer; and the operator whose operator string is double_constant corresponds to a floating-point number. The variable_name operator is assumed to apply whenever the compiler encounters a novel word beginning with a letter (which may be followed by underscores and numbers). In these three cases the number or word should be thought of as an argument, insofar as the bytecode is concerned.

The final class of special operator strings is the adapters, an important element of Cicada scripting that is explained in the next section. Suffice it to say that there is an adapter for each of the 10 argument types, type0arg_adapter through type9arg_adapter, along with a noarg_adapter.


precedence:

The precedence level determines how operators are bound into expressions. The high-precedence operators are grouped most tightly to their neighbors, and evaluated before the low-precedence operators. Thus A = 2 * B - 2 is grouped as A = ( (2*B) - 2), because, of the three gluing operators ‘= * -’, multiplication ‘*’ has the highest precedence and assignment ‘=’ has the lowest. The precedence level is just an integer, although notice that cclang.c predefines a keyword for each precedence level and uses those keywords instead of numbers.

The cicadaLanguageAssociativity[] array in cclang.c explains how to group operators of the same precedence level, when there are no parentheses to break the tie. This can be important. For example, multiplication and division operators have precedence level 11, and the eleventh entry of the associativity array (i.e. cicadaLanguageAssociativity[10]) is l_to_r signifying left-to-right grouping. Therefore the expression 8/2/4 groups as (8/2)/4 which equals 1, as opposed to 8/(2/4) which equals 16. On the other hand, assignment works at precedence level 5, which has r_to_l or right-to-left grouping. Therefore A = B = C = 2 groups as A = (B = (C = 2)), so that 2 copies to C, then to B, then to A. If the grouping were the other way, then each assignment would only rewrite A.
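
The effect of the two grouping directions can be checked with a small stand-alone sketch. It is not taken from cclang.c: the enum and the divideChain() helper are hypothetical, and only the names l_to_r and r_to_l are borrowed from the text above.

```c
/* Hypothetical enum; only the two names come from the Cicada docs. */
typedef enum { l_to_r, r_to_l } groupingDirection;

/* Evaluates x[0] / x[1] / ... / x[n-1] under the given grouping.
   Assumes n >= 1. */
static double divideChain(const double *x, int n, groupingDirection dir)
{
    double result;
    int i;

    if (dir == l_to_r) {
        /* ((x0 / x1) / x2) / ... */
        result = x[0];
        for (i = 1; i < n; i++) result = result / x[i];
    } else {
        /* x0 / (x1 / (x2 / ...)) */
        result = x[n - 1];
        for (i = n - 2; i >= 0; i--) result = x[i] / result;
    }
    return result;
}
```

Calling divideChain() on the operands 8, 2, 4 gives 1 with l_to_r and 16 with r_to_l, matching the two groupings of 8/2/4 described above.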

The size of the cicadaLanguageAssociativity[] array determines the allowed precedence levels. So by adding an entry to that table we would bump up the maximum allowable operator precedence level to 16. Anything outside the interval [1, max_precedence] will cause an out-of-range compiler error.


rtrnTypeString:

Many operators ‘return’ a value to the enclosing expression, and which type(s) of value they are allowed to return is encoded in the return-types string. For example, the addition operator has the return-types string "567", so its return value can be construed as being of type 5, 6, or 7. The return types correspond to the argument types from the cmdString field (not precedence levels), so the expression A = 2+5 is legal because the assignment operator expects a type-6 right-hand argument, but 2 + 5 = A is illegal because the left-hand argument of the assignment operator should be of type 3.
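
The matching rule can be pictured as a simple membership test: an operator fits into an argument slot if the digit for that slot’s type appears in the operator’s return-types string. The helper below is a hypothetical sketch of that test, not the compiler’s actual code.

```c
#include <string.h>

/* Returns 1 if an operator whose return-types string is rtrnTypes may be
   used where an argument of type argType (0-9) is expected, 0 otherwise.
   Hypothetical helper, for illustration only. */
static int returnTypeMatches(const char *rtrnTypes, int argType)
{
    if (argType < 0 || argType > 9) return 0;
    return strchr(rtrnTypes, '0' + argType) != NULL;
}
```

With the addition operator’s string "567", returnTypeMatches("567", 6) succeeds (the right-hand side of A = 2+5), whereas returnTypeMatches("567", 3) fails (the left-hand side of 2 + 5 = A).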

There is a special argXtype return type which is paired with a typeXarg argument type. This is used by the grouping operator (...), causing the type inside the parentheses (its ‘argument’) to be the type returned back to the enclosing expression. The parentheses only force a grouping, without affecting the type of the enclosed expression.

An entire script must be of type 0 -- Cicada enforces this using adapters (see below).


translation:

The last field of an operator definition explains how it will be translated into bytecode. If the bytecode string begins with an inbytecode keyword, then the string contains a list of integers which are the bytecode representation of the operator. For ease of reading, the bytecode translations in cclang.c are built from string macros defined in cicada.h. If there is no inbytecode keyword, then the string is interpreted as a fragment of Cicada code that will be translated into bytecode using previously-defined operators -- so it is best to define these scripted operators last, after the operators they depend on.

The translation strings of bytecode-written operators consist of numbers separated by spaces, but also some funny letters: ‘a’, ‘j’ and ‘p’. The ‘a’ letter stands for an argument that is to be substituted into the bytecode at the given location, and is followed immediately (no space) by a number from 1 to 9 indicating which argument. (Cicada only supports up to 9 arguments in an operator.) For example, the assignment operator has a bytecode string


    inbytecode "8 1 a1 a2"
   

meaning that the operation consists of two integers (8 followed by 1), then the first (left-hand) argument, and last the second (right-hand) argument. a1 and a2 will each be replaced by potentially-long bytecode expressions (think f().a = 5+cos(b)). In cclang.c the macro bcArg(x) produces the a1 and a2 strings, so the translation string reads


    inbytecode bc_define(equFlags) bcArg(1) bcArg(2)
   

where bc_define(equFlags) produces a define operator "8" with equate flags: "1".

The ‘j’ and ‘p’ bytecode symbols are used to specify jump offsets (‘j’ -- effectively gotos) and jump positions (‘p’) in the bytecode. Offsets are the number of code words to jump ahead from the offset word (negative offsets jump backwards), and the Cicada compiler calculates these as the difference between a jump (j) marker and a target position (p) marker. Each position/jump marker is followed immediately by a number 1-9 indicating which position to define/jump to. For example, the bytecode string of the if-then-else command which potentially takes 3 arguments is:


    inbytecode "3 j1 a1 a2 1 j2 p1 a3 p2"
   

In cclang.c the position markers are produced using bcPosition() macros, and the jump operators have dedicated macros taking the jump offsets as arguments, so this same operator definition reads


    inbytecode bc_jump_if_false(1) bcArg(1) bcArg(2)
                       bc_jump_always(2) bcPosition(1) bcArg(3) bcPosition(2)
   

In this case the first bytecode command -- bytecode operator 3 which is the jump-if-false command -- jumps to the position of the first position marker, so the compiler calculates this offset by taking the difference in the code position between the j1 command and p1, and puts that value in place of the j1 word. Likewise, there is an unconditional jump later on to the end (j2). The third argument a3 may or may not exist because the third argument is in the optional else block: if there is no else then a3 is basically ignored, but the second position marker is still defined.
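
The offset calculation can be sketched in a few lines of C. This is an illustration under stated assumptions rather than the compiler’s real algorithm: the jumpOffset() helper is hypothetical, it assumes that each number and each ‘j’ marker occupies one code word, that a ‘p’ marker occupies none, and that the caller supplies the length in code words of each substituted argument.

```c
/* Computes the offset stored in place of j<which> for a space-separated
   template such as "3 j1 a1 a2 1 j2 p1 a3 p2".
   argLen[k] is the assumed length, in code words, of argument k (1-9);
   a missing optional argument has length 0.
   Assumes both j<which> and p<which> appear in the template. */
static int jumpOffset(const char *tmpl, const int *argLen, int which)
{
    int pos = 0;              /* current position, in code words */
    int jPos = -1, pPos = -1;
    const char *c = tmpl;

    while (*c != '\0') {
        while (*c == ' ') c++;                     /* skip separators */
        if (*c == '\0') break;

        if (*c == 'a') {
            pos += argLen[c[1] - '0'];             /* argument bytecode is spliced in */
        } else if (*c == 'j') {
            if (c[1] - '0' == which) jPos = pos;   /* the offset word itself */
            pos++;
        } else if (*c == 'p') {
            if (c[1] - '0' == which) pPos = pos;   /* marker takes no space */
        } else {
            pos++;                                 /* an ordinary numeric code word */
        }

        while (*c != ' ' && *c != '\0') c++;       /* advance to the next token */
    }
    return pPos - jPos;
}
```

For instance, if the three arguments of an if-then-else were 2, 3 and 4 words long, the sketch gives an offset of 8 for j1 and 5 for j2. With no else block (a3 of length 0) the j2 offset shrinks to 1, while p2 is still defined.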

Many of the adapter operators (explained in the next section) have anonymousmember keywords in their bytecode. These are replaced by unique (and negative) member IDs that are found nowhere else in the script: the first use of an anonymousmember in the bytecode becomes the number -1, the second use represents a -2 in the bytecode, etc. These are used to define hidden member IDs that won’t conflict with the positive IDs of user-defined members.
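
The numbering scheme amounts to a counter that runs downward from -1. The sketch below is hypothetical (the function name and the caller-held counter are assumptions, not Cicada’s internals) and only illustrates why the generated IDs cannot collide with positive user-defined ones.

```c
/* Returns the next anonymous member ID: -1, then -2, then -3, ...
   *usedSoFar counts how many anonymous members have been handed out so far
   in the script being compiled, and should start at 0.
   Hypothetical helper, for illustration only. */
static int nextAnonymousMemberID(int *usedSoFar)
{
    (*usedSoFar)++;
    return -(*usedSoFar);
}
```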

Scripted operators -- those without inbytecode keywords -- work basically the same as bytecoded operators except that the arguments have to be encoded with the special keywords arg1 through arg9. For example, the remove function is defined at the beginning of cclang.c in bytecode, and the [-< ... >] syntax for removal cites remove in its scripted translation:


    "remove " arg1 "[<" arg2 "," arg3 ">]"
   

So the command arr[-<2,3>] is first translated into remove arr[<2,3>], then into bytecode.
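
At the level of text, this expansion behaves like a template with three slots. The following sketch mimics it with snprintf(); the expandRemoval() helper is hypothetical and works on plain strings, which is a simplification of whatever representation the compiler actually substitutes.

```c
#include <stdio.h>

/* Fills the template  "remove " arg1 "[<" arg2 "," arg3 ">]"  with the text
   of three arguments. Returns the value of snprintf(): the number of
   characters the full expansion needs, excluding the terminating null.
   Hypothetical helper, for illustration only. */
static int expandRemoval(char *out, size_t outSize,
                         const char *a1, const char *a2, const char *a3)
{
    return snprintf(out, outSize, "remove %s[<%s,%s>]", a1, a2, a3);
}
```

Passing "arr", "2" and "3" as the three arguments writes remove arr[<2,3>] into the output buffer, the same intermediate form as in the example above.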

Some operators (usually comments) have no effect on the bytecode whatsoever, and for those we give neither bytecode nor a script translation but instead write removedexpression for their translation string. The |* ... *| comment block uses this keyword, as does the line-continuation operator & which ignores everything to the end of the line. Oddly enough the single-line comment | ... doesn’t use this keyword, and the reason is that it always separates two sentence-level commands -- so in terms of bytecode it works a lot like a comma or end-of-line.




Last update: November 12, 2025