cicada.c

The file cicada.c contains an array called cicadaLanguage[] which defines essentially every symbol that can appear in a script. Each array element is a commandTokenType structure variable that defines one operator in the Cicada language:


    typedef struct {
       const char *cmdString;
       ccInt precedence;
       const char *rtrnTypeString;
       const char *translation;
    } commandTokenType;
   

The first string, cmdString, is the operator symbol or name as written in Cicada. Then we give the precedence level of the operator (see Table 2). Next comes a string that tells the Cicada compiler what type(s) of object this operator ‘returns’ to the surrounding expression. The final string, translation, either encodes the operator directly as bytecode (Cicada’s native language), or else ‘expands’ the operator in terms of other Cicada commands. The first batch of operators have direct bytecode translations, indicated by the fact that their translation begins with an inbytecode marker (which concatenates to the following string as an unprintable character). For the second group of operators, towards the end of the array and lacking an inbytecode marker, the translation is just a fragment of a script built from the previous operators.

To add a new operator into the language, simply add a new entry to the end of the cicadaLanguage[] array, in the indicated space, and fill in the four fields.


cmdString:

The command string defines what sort of object in the script will be recognized as the operator. Most operators are recognized by a string of letters, symbols, etc. that marks the operator in the script. (The only restriction on this string is that it consist of printable characters.) For example, looking through the array we find simple operators like \ (written with two backslashes in the C string) and exit.

Many of the operators take left-hand and/or right-hand arguments, as indicated by keywords like type3arg to the left and/or right of the operator string. These keywords are part of the command string: they are just unprintable characters that the C compiler concatenates onto the operator strings (adjacent string literals separated only by whitespace are merged into one). For example, the full command string of the define operator is


    type2arg "::" type6arg
   

The define operator requires both a left-hand expression or argument (the member to define) and a right-hand argument, which can be a member name but also a type like bool. The two arguments therefore have a different ‘type’. Looking at the comment before the cicadaLanguage[] array, we see that a type 2 argument represents a variable or function, and a type 6 argument represents a code-containing expression, as expected. Having different types allows the compiler to throw type-mismatch errors when expressions don’t make sense.

In some cases two different operators will have the same operator string. For example, compare ‘*’ as a multiplication operator versus the void, or ‘-’ as either subtraction or negation. This is only allowed if one of the operators expects a left-hand argument and the other doesn’t, so that the compiler will immediately know which of the two operators it is looking at when it sees the operator string in a script. For example, when it stumbles upon a ‘-’, that symbol will be interpreted as a subtraction if there is a dangling expression just to the left, and a negation otherwise.

More complex definitions like while ... do can involve several operator strings.


    "while" type5arg "do" type1arg
   

The pattern is always: operator strings like ‘do’ alternating with arguments. Sometimes these complex definitions involve an optionalargs keyword: everything before the keyword is required, but everything afterwards is optional. For example, the if command


    "if" type5arg "then" type1arg optionalargs "else" type1arg
   

requires an if and a then, but the else is optional.

There are 10 allowed argument types: type0arg through type9arg. (Cicada only uses 9 of these types.) There are also a few special types. A typeXarg accepts any type of argument, and is used by the (...) operator with no left-hand argument (i.e. the grouping operator, not a function call) to allow the user to group any sort of expression, even entire commands. The commentarg keyword denotes a block of text to be entirely ignored until the next operator string is encountered (i.e. everything from a comment bar ‘|’ to an end-of-line is skipped). chararg and stringarg treat the argument as text containing one or several characters respectively.

Finally, there are several special operators that don’t have any operator string at all. If the operator string is simply int_constant, then the operator is read when the compiler encounters a number that it deems to be an integer; and the operator whose operator string is double_constant corresponds to a floating-point number. The variable_name operator is assumed to apply whenever the compiler encounters a novel word beginning with a letter (which may be followed by underscores and numbers). In these three cases the number or word should be thought of as an argument, insofar as the bytecode is concerned.

The final class of special operator strings is the adapters, an important element of Cicada scripting that is explained in the next section. Suffice it to say that there is an adapter for each of the 10 argument types, type0arg_adapter through type9arg_adapter, along with a noarg_adapter.


precedence:

The precedence level determines how operators are bound into expressions. The high-precedence operators are grouped most tightly to their neighbors, and evaluated before the low-precedence operators. Thus A = 2 * B - 2 is grouped as A = ( (2*B) - 2 ) because, of the three gluing operators ‘= * -’, multiplication ‘*’ has the highest precedence and assignment ‘=’ has the lowest. The precedence level is just an integer, although notice that cicada.c predefines a keyword for each precedence level and uses those names instead in the operator definitions.

The cicadaLanguageAssociativity[] array in cicada.c explains how to group operators of the same precedence level, when there are no parentheses to break the tie. This can be important. For example, multiplication and division operators have precedence level 11, and the eleventh entry of the associativity array (i.e. cicadaLanguageAssociativity[10]) is l_to_r signifying left-to-right grouping. Therefore the expression 8/2/4 groups as (8/2)/4 which equals 1, as opposed to 8/(2/4) which equals 16. On the other hand, assignment works at precedence level 5, which has r_to_l or right-to-left grouping. Therefore A = B = C = 2 groups as A = (B = (C = 2)), so that 2 copies to C, then to B, then to A. If the grouping were the other way, then each assignment would rewrite A.

The size of the cicadaLanguageAssociativity[] array determines the allowed precedence levels, so adding one entry to the array raises the maximum allowable operator precedence level to 16. Any precedence outside the interval [1, max_precedence] will cause an out-of-range compiler error.


rtrnTypeString:

Many operators ‘return’ a value to the enclosing expression, and which type(s) of value they are allowed to return is encoded in the return-types string. For example, the addition operator has the return-types string "456", so its return value can be construed as being of type 4, 5 or 6. The return types correspond to the argument types from the cmdString field, so the expression A = 2+5 is legal (because the assignment operator expects a type-5 right-hand argument) but 2 + 5 = A is illegal (because the left-hand argument of the assignment operator should be type 2). Notice how each ‘type’ is really an operator argument type: arguments have one type and entire operators have many, which is maybe backwards to the way we usually think.

There is a special argXtype return type which is paired with a typeXarg argument type. This is used by the grouping operator (...), causing the type inside the parentheses (its ‘argument’) to be the type returned back to the enclosing expression. The parentheses only force a grouping, without affecting the type of the enclosed expression.

An entire script must be of type 0 -- Cicada enforces this using adapters (see below).


translation:

The last field of an operator definition explains how it will be translated into bytecode. If the bytecode string begins with an inbytecode keyword, then the string contains a list of integers which are the bytecode representation of the operator. For ease of reading, the bytecode translations in cicada.c are built from string macros defined in cicada.h. If there is no inbytecode keyword, then the string is interpreted as a fragment of Cicada code that will be translated into bytecode using previously-defined operators, so it is best to define these operators last.

The translation strings of bytecode-coded operators contain numbers separated by spaces, but also some funny letters: ‘a’, ‘j’ and ‘p’. The ‘a’ letter stands for an argument that is to be substituted into the bytecode at the given location, and is followed immediately (no space) by a number from 1 to 9 indicating which argument. (Cicada only supports up to 9 arguments in an operator.) For example, the assignment operator has a bytecode string


    inbytecode "8 1 a1 a2"
   

meaning that the operation consists of two integers (8 followed by 1), then the first (left-hand) argument, and last the second (right-hand) argument. Each of these arguments can themselves be expressions (think f().a = 5+cos(b)), in which case the entire expression translated into bytecode is substituted for the argument. In cicada.c the macro bcArg(x) produces the “ax” string, so the assignment operator reads


    inbytecode bc_define(equFlags) bcArg(1) bcArg(2)
   

where bc_define(equFlags) produces a define operator with equate flags: "8 1".

The ‘j’ and ‘p’ bytecode symbols are used to specify jump offsets (‘j’ -- effectively gotos) and jump positions (‘p’) in the bytecode. Offsets are the number of code words to jump ahead from the offset word (negative offsets jump backwards), and the Cicada compiler calculates these as the difference between a jump (j) marker and a target position (p) marker. Each position/jump marker is followed immediately by a number 1-9 indicating which position to define/jump to. For example, the bytecode string of the if-then-else command, which potentially takes 3 arguments, is:


    inbytecode "3 j1 a1 a2 1 j2 p1 a3 p2"
   

In cicada.c the position markers are produced using bcPosition() macros, and the jump operators have dedicated macros taking the jump offsets as arguments, so this same operator definition reads


    inbytecode bc_jump_if_false(1) bcArg(1) bcArg(2)
                       bc_jump_always(2) bcPosition(1) bcArg(3) bcPosition(2)
   

In this case the first bytecode command (bytecode operator 3, the jump-if-false command) jumps to the first position marker. The compiler calculates this offset by taking the difference in code position between the j1 word and p1 (which is effectively the beginning of argument 3, because p1 is not a code word) and puts that value in place of the j1 word. Likewise, there is an unconditional jump later on to the end (j2). The third argument a3 may or may not exist because it sits in the optional else block: if there is no else then a3 is basically ignored, but the second position marker is still defined.

Many of the adapter operators (explained in the next section) have anonymousmember keywords in their bytecode. These are replaced by unique (and negative) member IDs that are found nowhere else in the script: the first use of an anonymousmember in the bytecode becomes the number -1, the second use becomes -2, etc. These are used to construct hidden members that won’t bother anyone by conflicting with user-defined members.

Scripted operators -- those without inbytecode keywords -- work basically the same as bytecoded operators except that unfortunately the arguments have to be encoded with special keywords (arg1 through arg9) rather than directly in the string. For example, a cicadaLibraryFunction() function is defined at the beginning of cicada.c in bytecode, and the call() function uses its definition in its script translation:


    "cicadaLibraryFunction#0(" arg1 ")"
   

So the command call("myF", 12) is first translated into cicadaLibraryFunction#0("myF", 12) before being converted into bytecode.

Some operators (usually comments) have no effect on the bytecode whatsoever, and for those we give neither bytecode nor a script translation but instead write removedexpression for their translation string. The |* ... *| comment block uses this keyword, as does the line-continuation operator & which ignores everything to the end of the line. Oddly enough the single-line comment | ... doesn’t use this keyword, and the reason is that it always breaks two separate commands -- so in terms of bytecode it works a lot like a comma or end-of-line.




Last update: May 8, 2024