A cql Tutorial

Don Caldwell


Introduction

Cql is a small language designed to query data. It combines elements of C, awk, and simple databases. A goal of cql is to provide features that allow optimizations not available in tools like awk. The basic cql model is much like that of awk: execute any begin operations, select rows from the input (possibly using regular expression patterns), execute actions on the selected rows, then perform any end operations. This tutorial is intended to help new users get started using cql to query data.

Basics

A cql program consists of a declaration part followed by an expression part. The declaration part consists of one or more schema declarations. A schema declaration resembles a C structure (except that there is no struct keyword) and may be followed by optional attributes that describe the input. The expression part consists of functions or labeled expressions that describe the computation associated with the query. The form of a cql program is:
    declaration part
    expression part
The legal data types for entries in a schema declaration are:

With the exception of schema types, variables with these types may be defined in cql functions. This restriction is likely to change.

Cql requires a schema to describe the data to be queried. The schema gives the data fields names and types, field and line delimiters, input filenames and comments. Consider the following dataset. The emp file is similar to that in the awk book:
 
Name     Rate    Hours
John     8.75    20
Ken      9.10    60
Dave     10.50   55
Phong    10.50   40
Glenn    10.50   95
Don      5.00    42
Lefty    9.50    60
Andrea   8.50    41
Becky    8.00    43
Lynn     8.40    42

The dataset includes the names of employees, their rate of pay and the number of hours that they have worked in the current pay period. A schema file for the emp table called emp.cql contains the following:

       1        Emp { char* name; float rate;  int   hours; }
       2
       3        Emp.delimiter = "\t";
       4        Emp.comment = "worker bees and managers";
       5        Emp.input = "emp";
Line 1 in emp.cql is the schema. It describes the three columns in the emp file. Line 3 tells cql that the tab character separates fields in the file. Line 4 is obviously a comment for the schema. Line 5 gives the name of the input file. The input file may also be a named pipe or "/dev/fd/0" for operating systems having an fd filesystem so that a cql program may execute as part of a pipeline. This is actually a complete cql program. Cql provides a default select expression that selects everything and a default action expression that prints the line.

Include Syntax

Cql supports an #include syntax similar to C:
#include "filename"
To include a file, cql searches for the file in the current directory first. If cql does not find the file, it searches, in turn, all of the ...lib/cql directories corresponding to the ...bin directories in the command search path (PATH). For example, imagine that we have the search path PATH=/usr/bin:/usr/common/bin:/usr/local/bin. If cql does not find "filename" in the current directory, it then searches for the file in /usr/lib/cql, /usr/common/lib/cql, and /usr/local/lib/cql. It stops searching when it finds the file. If the include file is not in one of these directories, the user must provide the paths of directories to search on the command line. Like the C compiler, cql has an -I command line argument for this purpose. The file minemp.sh uses the include syntax to include the emp.cql file and simply prints its input:
       1     cql -e '
       2     #include "emp.cql"
       3     schema=Emp;
       4     '
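The search order described above can be sketched outside of cql. The following Python fragment is only an illustration of the rule as stated in the text, not cql's implementation: the function name include_candidates is hypothetical, and the sketch ignores the -I option because the text does not say where those directories fall in the search order.

```python
import posixpath

def include_candidates(filename, path_env):
    """List the locations an include lookup would try, in order.

    Assumption: only PATH entries whose last component is "bin"
    contribute a sibling lib/cql directory.
    """
    candidates = [posixpath.join(".", filename)]  # current directory first
    for bindir in path_env.split(":"):
        parent, leaf = posixpath.split(bindir.rstrip("/"))
        if leaf == "bin":
            # .../bin maps to .../lib/cql
            candidates.append(posixpath.join(parent, "lib", "cql", filename))
    return candidates

for path in include_candidates("emp.cql",
                               "/usr/bin:/usr/common/bin:/usr/local/bin"):
    print(path)
```

The first existing file in the returned list would be the one included.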
Expressions can be in two possible forms: labeled or functional. The labeled form is:
    label: statement;
The functional form is:
    void label() { statement; }
These forms are equivalent except that the functional form can take integer and floating point parameters and can be called by other functions. The labeled form is mainly for convenience to programmers invoking cql with expressions on the command line. We will use the functional form in this tutorial. By far, the most common expressions are the begin, select, action and end expressions because they correspond with the awk BEGIN and END actions and pattern-action pairs and are invoked by cql in a sequence appropriate to their name. We will see in the next paragraph that other function names are sometimes useful.

The following cql script is a translation of the Ackermann function in awk as shown on p. 284 of 'The UNIX Programming Environment' by Kernighan and Pike.

       1     null { char*    null; }
       2     null.input = "/dev/null";
       3
       4     int ack(int a, int b) {
       5             if (a == 0) return b + 1;
       6             if (b == 0) return ack(a - 1, 1);
       7             return ack(a - 1, ack(a, b - 1));
       8     }
       9
       10     int end() { printf("%d\n", ack(3, 3)); }
The first two lines are some goo to make cql happy -- cql requires a schema. The end function on line 10 calls the ack function defined on line 4. If we delete the schema, remove the type specifications from the ack parameters, and remove the semicolons from the lines in the function, we have the awk version of the Ackermann function.

The expression part is similar to an awk program. The begin expression is similar to the awk BEGIN construct in that it is the place for initializing values and printing prefaces to reports. The end expression likewise executes its actions when all data has been consumed. In cql, however, all variables must be declared (as in C), and the begin expression is the place to do this when we want the variables to be globally visible. Variables defined in functions not named begin are local to the function.

The pattern-action pairs in awk programs are separated in cql programs into select and action expressions. For example, the next file, hipay.sh, prints information for highly paid employees:

       1     cql -e '
       2     #include "emp.cql"
       3     schema=Emp;
       4     Emp.rate > 8.0;
       5     '
Here we see on line 4 that unlabeled expressions are given the select label by default. We have seen the first 3 lines in a previous example. If we want to find the hardest working employee, we must introduce variables to store the information on the hardest working employee seen so far. Awk would find this hard worker with the following two-liner:
       1    awk '$3 > max { max = $3; maxline = $0 }
       2    END           { print max, maxline }' < emp
Cql would require a few more lines, but the program is just as straightforward.
       1     cql -e '
       2     #include "emp.cql"
       3     schema=Emp;
       4     void begin()  { int max; char* maxline; }
       5     void action() { if(Emp.hours > max)
       6                   { max = Emp.hours; maxline = cql.input; }}
       7     void end()    { printf("%d %s\n", max, maxline); }
       8     '
Notice that the cql equivalent for $0 is the cql pseudo variable cql.input. Other pseudo variables are described in the cql manual page. Line 2 includes the file "emp.cql" that we described above. Line 3 confirms to cql that we want the main schema to be Emp. By default, the first schema defined is the main schema. The begin expression on line 4 defines the two variables in which we will save the information. We have no select expression, meaning that cql selects everything. In the action expression we test to see if we have found a new maximum and save it if we have. When all input has been consumed, line 7 reports the result.

If we want to find all of the most highly paid employees, we have to work a little harder. Two solutions to this problem are described in the section on multiple traversals below.
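For readers who know neither awk nor cql, the same begin/action/end flow can be sketched in Python. This is an analogue under stated assumptions, not cql: the function name hardest_worker is hypothetical, and the starting values of 0 and the empty string stand in for the default values cql gives to newly defined variables.

```python
# The emp data from the tutorial, one tab-separated record per entry.
EMP_LINES = [
    "John\t8.75\t20", "Ken\t9.10\t60", "Dave\t10.50\t55",
    "Phong\t10.50\t40", "Glenn\t10.50\t95", "Don\t5.00\t42",
    "Lefty\t9.50\t60", "Andrea\t8.50\t41", "Becky\t8.00\t43",
    "Lynn\t8.40\t42",
]

def hardest_worker(lines):
    # "begin": define the state shared by the later steps
    max_hours, max_line = 0, ""
    # "action": runs once per input record
    for line in lines:
        name, rate, hours = line.split("\t")
        if int(hours) > max_hours:
            max_hours, max_line = int(hours), line
    # "end": report after all input has been consumed
    return max_hours, max_line

hours, line = hardest_worker(EMP_LINES)
print(hours, line)
```

As in the cql and awk versions, only the first record with the maximum number of hours is kept.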

Selection

The awk style of programming matches patterns with actions, while the cql style separates the processing into exactly one select expression and exactly one action expression per loop through the data. To perform awk-like processing in cql, the action expression must contain a set of if-else if-else style statements. While not as elegant as the awk model, the speed improvement afforded by cql makes the effort worthwhile. Here is an example preceded by its awk equivalent:
       1     NF != 3   { print $0, "number of fields is not equal to 3" }
       2     $2 < 3.35 { print $0, "rate is below minimum wage" }
       3     $2 > 10   { print $0, "rate exceeds $10 per hour" }
       4     $3 < 0    { print $0, "negative hours worked" }
       5     $3 > 60   { print $0, "too many hours worked" }
       1     #include "emp.cql"
       2     schema=Emp;
       3     void action()
       4     {       /* we do not handle NF */
       5         if(Emp.rate > 10.0)
       6             printf("%s rate exceeds $10 per hour\n", cql.input );
       7         else if(Emp.rate < 3.35)
       8             printf("%s rate is below minimum wage\n", cql.input );
       9         if(Emp.hours > 60)
      10             printf("%s too many hours worked\n", cql.input );
      11         else if(Emp.hours < 0)
      12             printf("%s negative hours worked\n", cql.input );
      13     }
Since we are selecting on multiple criteria, we use control flow in the action function for filtering instead of a select function. In our example, the awk script would test all of the patterns on all of the rows. The cql program will only test all criteria on rows where the rate of pay is less than or equal to 10.0 and the hours worked is less than or equal to 60 hours. Awk could of course also use control flow. A select expression could have been used to find rows satisfying the conditions.
    void select()
    { Emp.rate > 10.0 || Emp.rate < 3.35 || Emp.hours > 60 || Emp.hours < 0; }
The point is that if we carefully structure our control flow in the action function, we can speed up the execution of the program by avoiding execution of match tests for uncommon cases, or we can use a select function to take advantage of an index. Cql does not have an awk-like NF. Instead, skipped fields are given values appropriate for the type. For example, skipped string fields are given the value "" and integers are given the value 0. In this example we did not wrap the cql program in the familiar cql -e ' program ' goo. To run this program in outliers.cql, execute the command:

    cql -f outliers.cql

Regular Expressions

Typical programs compare string variables to regular expressions. Cql uses two styles of regular expression. For the string comparison operations (==, !=), cql regards the right hand side expression as a ksh file match pattern. Cql also supports POSIX egrep(1) style regular expression matching with ed(1) style substitution in the cql.sub() function described below. User experience has shown that ksh style pattern matching seems to fit more naturally in the string comparison operations, while cql.sub() provides the power of full regular expressions where it is needed. The following cql program is equivalent to the awk program - awk '$1 ~ "[Ll]efty"' < emp:
       1         cql -e '
       2         #include "emp.cql"
       3         schema=Emp;
       4         Emp.name == "[Ll]efty";
       5         '
The + operator applied to strings concatenates them as shown on line 4 below.
       1     null { char*    null; }
       2     null.input = "/dev/null";
       3
       4     int end() { printf("%s\n", "The Amazing " + "Dr. Ek"); }
The printf statement prints the strings combined by the plus sign. The - sign is illegal with strings. If we want to remove part of a string we must use the cql.sub() function described in the next paragraph.

To alter a matched string, we must use the regular expression substitution function cql.sub(string,old,new,flags). This function returns the value of string after substituting new for matches of the egrep style regular expression old. Flags can be a combination of

If flags does not include g, then only the first match participates in the substitution. The next example replaces the first match of the regular expression "[Ll]efty" with "Amazing". It then promotes a form of local heresy.
       1         cql -e '
       2         #include "emp.cql"
       3         schema=Emp;
       4         void begin() { char* cheer; cheer = "How about those Packers!!"; }
       5         void select() { name == "[Ll]efty"; }
       6         void action() {
       7             printf("%s, ", cql.sub(name, "[Ll]efty", "Amazing", ""));
       8             printf("%s\n", cql.sub(cheer, "P[a-z]*", "Cowboys",""));
       9         }
      10         '
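The effect of the g flag can be illustrated with Python's re module. This is an analogue, not cql.sub() itself: the helper name sub is hypothetical, only the g flag is modeled, and Python regular expressions are not identical to POSIX egrep expressions.

```python
import re

def sub(string, old, new, flags=""):
    """Replace the first match of old, or every match when flags has "g"."""
    count = 0 if "g" in flags else 1  # re.sub treats count=0 as "replace all"
    return re.sub(old, new, string, count=count)

print(sub("Lefty", "[Ll]efty", "Amazing"))
print(sub("How about those Packers!!", "P[a-z]*", "Cowboys"))
print(sub("a-b-c", "-", "+"))       # first match only
print(sub("a-b-c", "-", "+", "g"))  # every match
```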

Multiple Traversals

We could check the rate of pay of all employees to find out what the maximum is and then find those employees who make that amount. Cql provides a hook to traverse the data more than once. Let's look at the code:
       1     cql -e '
       2     #include "emp.cql"
       3     schema=Emp;
       4     void begin()    { float max; }
       5     void action()   { if(Emp.rate > max) max = Emp.rate; }
       6     void end()      { cql.loop("select_2", "action_2", ""); }
       7     void select_2() { Emp.rate == max; }
       8     void action_2() { printf("%s\n", cql.input); }
       9     '
The beginning of the file looks like the previous example. The begin function declares the variable max that will save the maximum rate of pay for us. The action function on line 5 collects the highest rate of pay seen so far. When we reach the end of input, the end function on line 6 sets up the second pass. Cql.loop() provides the linkage to the next iteration. It tells cql to traverse the input again, this time using the select_2 function to filter the data, and to execute the action_2 function when the selection criterion is satisfied. If the volume of data is large, using a select expression in the second pass, as we have done on line 7 above, could save I/O since we could take advantage of the index cql builds. Since we don't need to do anything when we are finished, the third argument to cql.loop() is the empty string, indicating that there is no end function. The name cql.loop() is somewhat misleading. It simply chains two select-action-end expression triples together. The name was chosen to indicate that cql is to loop over the data again.

It turns out that there is an easier way to solve our problem. Since strings are first class in cql, we may simply append the lines containing our desired output to a string variable:
       1     cql -e '
       2     #include "emp.cql"
       3     schema=Emp;
       4     void begin()  { float maxrate; char* maxline; }
       5     void action() {
       6         if(Emp.rate > maxrate) { maxline = cql.input; maxrate = Emp.rate; }
       7         else if(Emp.rate == maxrate) { maxline = maxline + "\n" + cql.input; }
       8     }
       9     void end() { printf("%s\n", maxline); }
       10     '
If we subsequently find employees whose rate of pay is equal to the current maximum, we execute the body of the if statement on line 7, appending the current line to maxline. You may have noticed in the previous examples that we didn't initialize the variables max and maxline. Cql sets variables to sane values when they are defined.
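The two approaches in this section can be compared side by side in a short Python sketch. It is an analogue of the logic only: the function names are hypothetical, and a list of names stands in for the string that the cql version builds by concatenation.

```python
# (name, rate) pairs taken from the emp file.
EMP = [("John", 8.75), ("Ken", 9.10), ("Dave", 10.50), ("Phong", 10.50),
       ("Glenn", 10.50), ("Don", 5.00), ("Lefty", 9.50), ("Andrea", 8.50),
       ("Becky", 8.00), ("Lynn", 8.40)]

def top_paid_two_pass(rows):
    # First traversal: find the maximum rate.
    max_rate = 0.0
    for _, rate in rows:
        if rate > max_rate:
            max_rate = rate
    # Second traversal: select the rows that have that rate.
    return [name for name, rate in rows if rate == max_rate]

def top_paid_one_pass(rows):
    # Single traversal: restart the list on a new maximum, extend it on a tie.
    max_rate, names = 0.0, []
    for name, rate in rows:
        if rate > max_rate:
            max_rate, names = rate, [name]
        elif rate == max_rate:
            names.append(name)
    return names

print(top_paid_two_pass(EMP))
print(top_paid_one_pass(EMP))
```

Both functions return the same names in input order. The single pass version reads the data once but keeps all candidate lines in memory.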

A More Advanced Example

The following example makes use of many of cql's advanced features. The example is a translation of an example from the 'Reports and Databases' chapter in 'The Awk Programming Language'. Consider the following countries file:
Country     Area   Population   Continent
Brazil      3286   163          South America
Russia      6592   148          Asia
Canada      3849   28           North America
China       3695   1210         Asia
USA         3717   267          North America
Nigeria     356    126          Africa
India       1222   952          Asia
Mexico      756    96           North America
France      210    58           Europe
Chile       292    14           South America
Japan       145    125          Asia
Germany     137    84           Europe
Australia   2966   18           Australia
England     94     46           Europe
Indonesia   705    194          Asia
Denmark     16     5            Europe
Greece      50     10           Europe
Zimbabwe    150    12           Africa

The problem is to generate a report like that on p. 97 of the awk book. The solution calls for two awk programs: one to accumulate totals and reformat the data, and one to print the report. The first program takes two passes: one to accumulate totals, and one to print out the reformatted lines including the totals. To do this, awk provides a way to pass variable assignments in the data file list. This mechanism is necessary for passing state so that awk can combine the passes in one file. The cql loop mechanism replaces this functionality. By placing the variable definitions in the begin function, we make the variables visible globally. First we declare a schema for the raw data in a file called 'countries.cql':

       1     Countries {
       2         char* country;
       3         int   area, pop;
       4         char* continent;
       5     }
       6     Countries.delimiter = "\t";
       7     Countries.comment = "awkbook p24";
       8     Countries.input   = "countries.gz";
Next we invoke the following cql script called prep3.sh (to match the awk script name) to accumulate the totals and arrange the data for the formatting script:
       1     cql -e '
       2     #include "countries.cql"
       3     schema=Countries;
       4
       5     void begin() { int ararr[]; int poparr[]; int areatot, poptot; float den; }
       6
       7     void action()
       8     {
       9         ararr[continent] += area; areatot += area;
      10         poparr[continent] += pop; poptot += pop;
      11     }
      12
      13     void end()
      14     {
      15         cql.loop("", "action_2", "");
      16     }
      17
      18     void action_2()
      19     {
      20          den = 1000.0*pop/area;
      21          printf("%s:%s:%d:%f:%d:%f:%f:%d:%d\n", continent, country,
      22                 pop, 100.0*pop/poptot, area, 100.0*area/areatot,
      23                 den, poparr[continent], ararr[continent]);
      24     } 
      25
      26
      27     ' | /bin/sort -t: +0 -1 +6rn
The begin expression on line 5 defines several scalar and array variables for our accumulations. The action expression on lines 7-11 accumulates the totals in the first pass of this program. The end expression on lines 13-16 has a cql.loop() statement that is the key to cql's multipass mechanism. Cql.loop() statements are allowed only in end expressions. As we mentioned before, with cql.loop() statements, any number of passes through the data are possible. In this case we are interested only in another action statement, so we provide the empty string for the select and end arguments to cql.loop(). On line 20, where we compute the population density, note that we use a floating point constant in the numerator. This forces cql to evaluate the expression as a floating point number. We apply the same principle when we compute the arguments to the printf statement on lines 21-23. Finally, we pass the output of the cql program to the sort command.

Cql can transparently recognize files that have been compressed with gzip or vdelta. The file "countries.gz" in the previous example was compressed with gzip, and the output of cql is the same as it would have been if the file were not compressed.

The next cql example takes the output of the previous cql program as input and generates output compatible with the troff(1) tbl macros. The schema in the file conform.cql describes the output of prep3.sh and the input to the next cql program:
       1     Conform {
       2         char* continent;
       3         char* country;
       4         int   pop;
       5         float poppct;
       6         int   area;
       7         float areapct;
       8         float density;
       9         int   poptot;
      10         int   areatot;
      11     }
      12     Conform.delimiter = ":";
      13     Conform.comment = "awkbook p94";
      14     Conform.input   = "/dev/fd/0";
Notice the last line in the schema. This example takes advantage of the fd filesystem type, common on many UNIX-like systems, for accessing files opened by a process. In this case, cql programs using this schema would take input from the file descriptor normally associated with the standard input. The reason for this choice of input will be clear when we show how everything is invoked below. The following script, called form4.sh, takes the output of prep3.sh and produces a report in tbl format identical to that produced by the awk script on page 96 of the awk book.
       1        cql -e '
       2        #include "conform.cql"
       3        schema=Conform;
       4        
       5        void begin()
       6        {
       7            char* datestr;
       8            char* prev="", cs;
       9            float popp, areap;
      10            int   gpop, garea;
      11            float gpoppct, gareapct;
      12            int   spoptot, sareatot;
      13        
      14            datestr = "January 1, 1988";
      15            popp = 0.0; areap=0.0;
      16        
      17            printf(".TS\ncenter;\n");
      18            printf("l c s s s r s\nl l c s c s c\nl l c c c c c.\n");
      19            printf("%s\t%s\t%s\n\n%s\t%s\t%s\t%s\t%s\n\n",
      20                   "Report No. 3", "POPULATION, AREA, POPULATION DENSITY", datestr,
      21                   "CONTINENT", "COUNTRY", "POPULATION", "AREA", "POP. DEN.");
      22            printf("\t\t%s\t%s\t%s\t%s\t%s\n",
      23                   "Millions ", "Pct. of", "Thousands ", "Pct. of", "People per");
      24            printf("\t\t%s\t%s\t%s\t%s\t%s\n",
      25                   "of People", "Total ", "of Sq. Mi.", "Total ", "Sq. Mi. ");
      26            printf("\t\t_\t_\t_\t_\t_\n.T&\nl l n n n n n.\n");
      27        }
      28        
      29        void totalprint()
      30        {
      31            printf(".T& \nl s n n n n n.\n");
      32            printf("\t_\t_\t_\t_\t_\n");
      33            printf("   TOTAL for %s\t%d\t%.1f\t%d\t%.1f\n",
      34                   prev, spoptot, popp, sareatot, areap);
      35            printf("\t=\t=\t=\t=\t=\n.T&\nl l n n n n n.\n");
      36        }
      37        
      38        void action()
      39        {
      40            if(continent != prev)
      41            {
      42                if(cql.record > 1)
      43                    totalprint();
      44                cs = continent; prev = continent;
      45                popp = poppct; areap = areapct;
      46            } else {
      47                cs = "";
      48                popp += poppct; areap += areapct;
      49            }
      50            printf("%s\t%s\t%d\t%.1f\t%d\t%.1f\t%.1f\n",
      51                   cs, country, pop, poppct,area,areapct,density);
      52            gpop += pop; gpoppct += poppct;
      53            garea += area; gareapct += areapct;
      54            spoptot = poptot; sareatot = areatot;
      55        }
      56        
      57        void end()
      58        {
      59            totalprint();
      60            printf(".T&\nl s n n n n n.\n");
      61            printf("GRAND TOTAL\t\t%d\t%.1f\t%d\t%.1f\n",
      62                   gpop, gpoppct, garea, gareapct);
      63            printf("\t=\t=\t=\t=\t=\n.TE\n");
      64        }
      65        '
To get output, we would execute the pipeline
    sh ./prep3.sh | sh ./form4.sh > report.tbl
The cql code is similar in most ways to the awk equivalent. As with the awk example, we are saving information in global arrays and variables. A cleaner implementation would have passed in parameters. Because the begin, select, action and end functions are invoked implicitly and therefore cannot take parameters, we are forced to pass state to those functions in the variables defined in the begin function. One striking difference is the use of named fields in the cql code. This eliminates a common source of error in awk programs, the need to address fields with the $ syntax.

Joining Files

The awk book contains an example of a natural join of their countries file shown above with a capitals file that contains country names and their capital cities. Our version of the capitals file looks like this:
 
Country     Capital
Brazil      Brasilia
Russia      Moscow
Canada      Ottawa
China       Beijing
USA         Washington DC
Nigeria     Abuja
India       New Delhi
Mexico      Mexico City
France      Paris
Chile       Santiago
Japan       Tokyo
Germany     Bonn
Australia   Canberra
England     London
Indonesia   Jakarta
Denmark     Copenhagen
Greece      Athens
Zimbabwe    Harare

The awk book shows two ways to perform a natural join of these files, an ad hoc way and a general way. The general way is a somewhat painful set of functions to handle the details of matching the lines. Further, awk requires that the files be sorted on the join field. The natural join is much easier to express in cql. The following is the cql version of the awk natural join example:

       1     Countries {
       2         Capitals* country;
       3         int   area, pop;
       4         char* continent;
       5     }
       6     Countries.delimiter = "\t";
       7     Countries.comment = "awkbook p24";
       8     Countries.input   = "countries";
      09     
      10     Capitals {
      11         extern char* country;
      12         char* capital;
      13     }
      14     Capitals.delimiter = "\t";
      15     Capitals.comment = "awkbook p102";
      16     Capitals.input   = "capitals";
      17     
      18     schema=Countries;
      19     
      20     void action() {
      21         printf("%s\t%d\t%d\t%s\t%s\n",
      22                country, area, pop, continent, country.capital);
      23     }
On line 2 we see that the type of the country field is Capitals, the name of the schema for the capitals file. This is the way cql encodes the key relationship. Cql, by default, assumes that the first field of Capitals relates to the contents of the country field in Countries. If that is not the case, then the keyword "extern" must be used to inform cql of the key relationship. We have made this explicit on line 11 although, in this case, it is unnecessary. Notice the dot notation used to access the capital on line 22.
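Conceptually, the key relationship behaves like a lookup table keyed on the country name. The Python sketch below shows that idea with three rows from each file. It is not how cql implements joins: the function name natural_join is hypothetical, and printing an empty capital for a country with no match is an assumption, since the text does not say what cql does in that case.

```python
# Three rows from each file, as shown earlier in the tutorial.
COUNTRIES = [
    ("Brazil", 3286, 163, "South America"),
    ("Russia", 6592, 148, "Asia"),
    ("Canada", 3849, 28, "North America"),
]
CAPITALS = {"Brazil": "Brasilia", "Russia": "Moscow", "Canada": "Ottawa"}

def natural_join(countries, capitals):
    """Yield each countries row extended with the matching capital."""
    for country, area, pop, continent in countries:
        # country.capital in the cql listing corresponds to this lookup
        yield (country, area, pop, continent, capitals.get(country, ""))

for row in natural_join(COUNTRIES, CAPITALS):
    print("%s\t%d\t%d\t%s\t%s" % row)
```

Because the lookup is by key, neither input needs to be sorted on the join field.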

A More Complicated Join

This example shows three schemas in two files. The group.db file relates to information in the passwd.db file. Our passwd.db and group.db files are similar in structure to the /etc/passwd and /etc/group files on UNIX and Linux systems. Let's say that the group.db file contains:
 
gsf : * : 10001 : gsf
byers : * : 20001 : gsf,byers
abc : * : 20004 : gsf,byers,jpl
ac : * : 20002 : gsf,jpl
byers : * : 10002 : byers
bc : * : 20003 : byers,jpl
jpl : * : 10003 : jpl

and the passwd.db file contains:
 

gsf : * : 90001 : 10001 : Glenn Fowler, 123 F St, rm 1, /home/gsf : /home/gsf : /bin/ksh
byers : * : 90002 : 10002 : Simon Byers, 345 B St, rm 2, /home/byers : /home/byers : /bin/tcsh
jpl : * : 90003 : 10003 : John Linderman, 222 L St, rm 0, /home/jpl : /home/jpl : /bin/sh

The gid field in the passwd.db file provides the key to a row in the group.db file. The passwd.db file has a field that we call info here that has information on the person who owns the login. The last field in the group.db file may be a list of the group members. The elements of the members array are keys to rows in the passwd.db file. Here are the contents of the cql schema file passwd.cql:

       1     passwd {
       2         extern char*   pname;
       3         char*          ppass;
       4         int            uid;
       5         group*         gid;
       6         info           info;
       7         char*          home;
       8         char*          shell;
       9     }
      10     group {
      11         char*          gname;
      12         char*          gpass;
      13         int            gid;
      14         passwd*        members[];
      15     }
      16     info {
      17         passwd*        name;
      18         char*          address;
      19         char*          office;
      20         char*          home;
      21     }
      22
      23     passwd.delimiter          = ":";
      24     passwd.comment            = "password file";
      25     passwd.input              = "passwd.db";
      26     group.delimiter           = ":";
      27     group.comment             = "group file";
      28     group.input               = "group.db";
      29     group.members.delimiter   = ",";
      30     group.members.comment     = "group member list";
      31     passwd.info.delimiter     = ",";
      32     passwd.info.comment       = "not enforced";
      33
The passwd record type includes an element of info type. To distinguish the info component from the rest of the passwd record, the cql schema includes delimiters for both on lines 23 and 31. Notice the array notation for the members on line 14. This is necessary for us to iterate over the members later. Since the type of members is passwd*, we will be able to relate each to its corresponding entry in the passwd.db file. The extern keyword used in the passwd schema informs cql that passwd may be joined with another schema on pname.

In addition to the delimiter attribute added to record types by cql, the user may specify comment, format, input, permanent, or scanlimit. In the example above, lines 24, 27 and 30 comment on the schemas. A schema input specification tells cql the name of the file containing data for that schema. For example, line 25 specifies that the input for the passwd schema is in the file 'passwd.db'. The input specification applies to all schemas enclosed in the schema for which an input is specified. This condition is denoted through the use of the dot notation describing the schema attributes. In the example, since the info attributes on lines 31 and 32 are prefixed by passwd, the enclosing schema, the info schema input is also 'passwd.db'. The same is true for group members as specified on lines 29 and 30.

Printing the members of each group is a simple matter:

       1    #include "passwd.cql"
       2
       3    schema = group;
       4
       5    void action() {
       6        int     i;
       7
       8        printf("%s:", gname);
       9        for (members[i])
      10            printf(" %s", members[i]);
      11        printf("\n");
      12    }
Line 3 changes the main schema to group. Line 6 defines a variable needed to iterate over the members array. Lines 9 and 10 are the actual iteration. This is a special form of the for loop for arrays.
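The nested delimiters can be pictured as two levels of splitting: ":" for the record and "," for the members list. The Python sketch below mimics the output of the listing above for two sample lines. It is an illustration only: the function names are hypothetical, and stripping the spaces around ":" is an assumption made so that the sample lines, as printed in this tutorial, parse cleanly.

```python
GROUP_LINES = [
    "gsf : * : 10001 : gsf",
    "abc : * : 20004 : gsf,byers,jpl",
]

def parse_group(line):
    """Split a group record on ":" and its members field on ","."""
    gname, gpass, gid, members = [field.strip() for field in line.split(":")]
    return {
        "gname": gname,
        "gpass": gpass,
        "gid": int(gid),
        "members": members.split(",") if members else [],
    }

def format_group(line):
    group = parse_group(line)
    # Same layout as the printf calls in the cql action function.
    return group["gname"] + ":" + "".join(" " + m for m in group["members"])

for line in GROUP_LINES:
    print(format_group(line))
```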

Permanent gives cql a hint that the integer argument is the number of permanent fields. If the data is to be stored in cdb(ast3) method format, this hint gives the cdb subsystem the opportunity to save space. If there is no permanent hint, then all fields are considered permanent. Scanlimit gets an integer argument that tells cql to limit the number of index partitions to search.

Multiline Records

Some datafiles contain multiline records. Consider the following datafile of fun places:
 
Bell Laboratories
Murray Hill
New Jersey
USA
AT&T Laboratories
Florham Park
New Jersey
USA
Disneyland Paris
Paris
France
An awk script like
BEGIN { RS = ""; FS = "\n" }
would capture the format. In cql, we declare the field delimiter and the record terminator to be newline.
       1     funplaces {
       2         char* name;
       3         char* location;
       4         char* country;
       5     };
       6
       7     funplaces.input = "funplaces";
       8     funplaces.delimiter = "\n";
       9     funplaces.terminator = "\n";

Cdb Integration

A schema format allows the user to assign a cdb format method. Cql currently supports the flat and cdb format methods. The default format is flat. The following example in file dater.cql illustrates:
       1     dater
       2     {
       3         double:format="8{be}F"   d;
       4         long:format="4{be}L"      l1;
       5         long:format="4{be}L"      l2;
       6     }
       7
       8     schema=dater;
       9
      10     dater.input = "dater";
      11
      12
      13     void action() { printf("%f %ld %ld\n", d, l1, l2); }
This schema describes a binary file containing three fields with field delimiters or record terminators. Line 3 declares a big endian double precision floating point number of length 8. Lines 4 and 5 declare big endian long integers of length 4.
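To picture the byte layout that lines 3-5 declare, the following Python sketch packs and unpacks records with the standard struct module. It is an analogue, not a description of cdb: it assumes that each record is exactly one 8-byte big endian double followed by two 4-byte big endian integers, with nothing between fields or records, and the sample values are made up.

```python
import struct

# ">" selects big endian with standard sizes: d is 8 bytes, l is 4 bytes.
RECORD = struct.Struct(">dll")

def read_records(buf):
    """Decode a buffer holding back-to-back fixed-size records."""
    return [RECORD.unpack_from(buf, offset)
            for offset in range(0, len(buf), RECORD.size)]

# Made-up sample data: two records packed back to back.
data = RECORD.pack(3.5, 7, -2) + RECORD.pack(0.25, 100, 200)

for d, l1, l2 in read_records(data):
    print("%f %d %d" % (d, l1, l2))
```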

Appendix

Expression Syntax

Cql's expression syntax was borrowed from C so anyone familiar with C can become comfortable with cql expressions effortlessly. Here is the list of operators and their precedence:
token        operator                  class     precedence  associativity
,            sequential evaluation     binary    1           left
=            assignment                binary    2           right
? :          conditional               ternary   3           right
||           logical or                binary    4           left
&&           logical and               binary    5           left
|            bitwise or                binary    6           left
^            bitwise xor               binary    7           left
&            bitwise and               binary    8           left
== !=        equality test             binary    9           nonassoc
< > <= >=    relational                binary    10          nonassoc
<< >>        left/right shift          binary    11          left
+ -          additive                  binary    12          left
* / %        multiplicative            binary    13          left
! ~          logical not, bitwise not  unary     14          right
++ --        increment/decrement       unary     15          right
(type name)  cast                      unary     16          right
f()          function call             postfix   17          left

The most striking difference between C and cql is that in cql character strings are first class types. This means that character strings may participate in any of the expressions that are legal for integers. Typically, a character string variable would be compared to a string literal or a regular expression. Cql uses the cdb library for I/O. This is the reason why cql is able to handle compressed files transparently, for example.

Cdb formatting

The cdb library also makes it possible for cql to handle binary files. The schema definition must have its elements augmented with some additional information for this to be possible. This additional information is the format specification language used by the cdb command. The general form is:
    type:format="string"        varname;
The format string is of the form:
    D*L{l}T
D is optional and tells cdb that there are D values of T. The * character is a literal and is required only when a D value is supplied. L is optional and specifies the length of the variable. If l is present, it must be surrounded by braces and tells cql the layout. If a cql format specification has D as part of the specification, cql regards the schema element as an array. D and L are integers. T may take one of the following values:
symbol  type               delimited
b       binary             n
B       binary             y
f       double             n
F       double             y
i       signed integer     n
I       signed integer     y
l       long               n
L       long               y
s       character string   n
S       character string   y
x       ignored field      n
X       ignored field      y
u       unsigned integer   n
U       unsigned integer   n
w       unsigned long      n
W       unsigned long      y

The symbols w and W are intended to be able to contain the largest integral type supported by the platform. The layout value l, if present, may take one of the following values:

layout  meaning
ebcdic  ibm ebcdic encoding
ascii   ascii encoding
le      little endian numeric type (e.g. intel)
be      big endian numeric type (e.g. sun)
sf      sfio integer encoding
ibm     ibm 370 floating point
bcd     binary coded decimal
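The D*L{l}T form described above is regular enough to take apart with a single regular expression. The following Python sketch does so. It is an illustration under stated assumptions, not the cdb parser: the function name parse_format is hypothetical, the type symbol is assumed to be a single letter, and the layout is assumed to be a single word such as be or le. The spec "3*4{le}l" is a made-up example of an array specification.

```python
import re

# Optional "D*", optional length L, optional "{layout}", then the type symbol T.
FORMAT_RE = re.compile(r"(?:(\d+)\*)?(\d+)?(?:\{(\w+)\})?([A-Za-z])")

def parse_format(spec):
    """Split a format string of the form D*L{l}T into its parts."""
    match = FORMAT_RE.fullmatch(spec)
    if match is None:
        raise ValueError("not a D*L{l}T format string: %r" % spec)
    count, length, layout, type_symbol = match.groups()
    return {
        "count": int(count) if count else None,    # D: number of values
        "length": int(length) if length else None, # L: length of each value
        "layout": layout,                          # l: for example "be"
        "type": type_symbol,                       # T: symbol from the type table
    }

print(parse_format("8{be}F"))    # from dater.cql
print(parse_format("4{be}L"))    # from dater.cql
print(parse_format("3*4{le}l"))  # hypothetical array of three values
```

A spec that supplies a count, such as the last one, corresponds to a schema element that cql regards as an array.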