In Grouping and Summarizing Data, we learned how to group data in a table as needed and to summarize the grouped data. The data grouping and summarization groups data by a single or multiple fields or an expression in ascending order based on equivalent values. Alignment grouping or enumeration grouping apply to data analysis tasks where data is required to be sorted in a specified order or be grouped by specified conditions.
Usually sort function is used to sort data in a sequence or a table sequence, with a result listed either in ascending order or in descending order. To sort data in a specified order, esProc provides the align function. For example:
|
A |
B |
1 |
[three,one,four,six,two] |
[one,two,three,four,five,six] |
2 |
=A1.sort() |
=A1.align(B1) |
3 |
=demo.query("select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE") |
[California,Texas,Florida,Illinois] |
4 |
=A3.align(B3,STATE) |
|
The sequence in A1 is composed of string members, and B1 provides a sequence used as the standard for alignment:
Both A2 and B2 re-sort members of the sequence in A1. A2 uses the common sort function, while B2 uses the align function to sort the sequence according to B1’s sequence. Results of A2 and B2 are as follows:
The sort function only sorts the members alphabetically in ascending or descending order, while align function sorts them according to the positions of members of the standard sequence. If a member in the standard sequence can’t find a matching member in the target sequence, the value in the corresponding position will be a null.
The most cases in which the align function is applied are sorting records of a table sequence or a record sequence. For example, A3 selects records of some employees, and A4 gets the records of certain states from them in the specified order. A3’s result is as follows:
According to the order of the sequence in B3, A4 finds records of the employees in these certain states:
The alignment operation finds the record of the first employee in each of these states and returns a record sequence composed of these records.
In the previous example, only the first-found eligible record for each state has been got after the alignment operation is performed. In many cases, however, you need all the eligible records, which can be realized with align@a. For example:
|
A |
B |
1 |
=demo.query("select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE") |
[California,Texas,Florida,Illinois] |
2 |
=A1.align@a(B1,STATE) |
|
3 |
=A2.new(STATE,~.count(GENDER=="M"):Male,~.count(GENDER=="F"):Female) |
|
4 |
=A1.align@n(B1,STATE) |
|
By adding @a option to the align function, A2 gets all eligible records for each state:
This time the return value of A2’s alignment grouping is a sequence consisting of groups (sub-sequences). Similar to the result of group function, the result of align@a function is also used for further handling. For example, A3 takes a further step to calculate the numbers of male and female employees in each state based on the grouping result:
In alignment grouping, there may be some data that cannot be matched with any member of the standard sequence. For this type of scenario, esProc provides @n option to realize another type of alignment grouping, which puts all records that haven’t corresponding members in the standard sequence into a separate group. As is shown by the grouping result of A4:
By comparing the alignment grouping results of A2 and A4, you can notice that the result of A4 has one more group which stores the records the employees who are not from the four specified states.
The sequence based on which the alignment grouping is carried out can also be obtained through computation. For example:
|
A |
1 |
=demo.query("select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE") |
2 |
=A1.groups(STATE;count(~):Count) |
3 |
=A2.sort(-Count) |
4 |
=A1.align@a(A3:STATE,STATE) |
A2 groups the records and computes the number of employees of each state using groups function, and then A3 sorts the result by the count of employees in descending order. Results of A2 and A3 are as follows:
A4 gets the sequence of states from A3’s sorting result to use it as the standard for the alignment grouping. The result is as follows:
The fourth type of alignment grouping is P.align(n,y), which divides members into n groups and then directly puts each member into the corresponding group by evaluating the grouping expression y. For example:
|
A |
1 |
=demo.query("select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE") |
2 |
=A1.align@a(26,asc(left(NAME,1))-64) |
3 |
[HR,R&D,Finance,Marketing] |
4 |
=A1.align@a(A3.len(),A3.pos(DEPT)) |
In A2’s expression =A1.align@a(26,asc(left(NAME,1))-64), left(NAME,1) gets the initial of the name of an employee, which is then converted by asc function into ASCII code, and, by subtracting 64 from it, the initial will be placed into a certain group number within 1~26. The alignment grouping result is as follows:
The employees in the same group have the same initial.
Sometimes locating functions, like pos and pmax, are used in P.align(n,y) to compute the group number while performing alignment grouping. A4 groups the records of employees in alignment according to the sequence of specified departments. Its result is as follows:
With group function, the resulting groups will be sorted by values of the grouping field in ascending order, whereas the alignment grouping can produce a result sorted in specified order.
Sometimes data are to be grouped not by the equivalent values but by some conditions, like the range of the order amount and the employees’ age groups.
The grouping conditions can be represented by strings, like "?>3", "[2,3,5,7].pos(?)>0". The penum function is used to judge which condition(s) a number can satisfy according to a sequence of strings of conditional expressions. For example:
|
A |
1 |
[?>=85,?>=70,?>=60] |
2 |
=A1.penum(100) |
3 |
=A1.penum(66) |
4 |
=A1.penum(54) |
There is a sequence consisting of strings of conditional expressions in A1:
Respectively, A2 and A3 find the condition(s) 100 and 66 satisfy. Results are as follows:
As is shown by the result, when a certain number satisfies multiple conditions, only the ordinal number of the first-found condition will be returned. 100, for example, can satisfy all three conditions, but the returned result is 1.
As the number 54 in the expression in A4 doesn’t satisfy any of the conditions, the returned result is null.
Similar to the alignment grouping, penum function also uses @n option, which returns k+1 when data doesn’t satisfy any of the k conditions. In particular, if a conditional expression is null, it can be satisfied by any data. For example:
|
A |
1 |
[?>=85,?>=70,?>=60] |
2 |
=A1.penum@n(54) |
3 |
[?>=85,?>=70,?>=60,null] |
4 |
=A3.penum(54) |
Results of A2 and A4 are as follows:
The number 54 in A2 doesn’t satisfy any of the three conditions, so penum@n function returns a result of 4. But in A4, it satisfies the fourth condition – null. Note that if the null is listed in the conditions, it must be put at the end.
The penum function also uses @r option to allow returning the overlapped conditions the specified data can satisfy. In this case, a sequence composed of ordinal numbers of all eligible conditions will be returned. For example:
|
A |
1 |
[?>=85,?>=70,?>=60,null] |
2 |
=A1.penum@r(100) |
3 |
=A1.penum@r(66) |
4 |
=A1.penum@r(54) |
A2, A3 and A4 return respectively a sequence composed of ordinal number(s) of the condition(s) 100, 66 and 54 satisfy. Results are as follows:
Note: When using the penum@r function, though the condition null is defined to be satisfied by any data, its ordinal number won’t appear in the result if there is at least one eligible condition. Besides, with the @r option the function returns a sequence, instead of a number, when there is only one eligible condition.
You can group members of sequence P using P.enum(E,y) function based on a sequence E consisting of strings of conditional expressions. The operation will compute the expression y using each member of sequence P, judge which condition(s) the result satisfies, and place the member of P into the group corresponding to the condition. By default, each member of P will be grouped according to the first condition it satisfies. For example:
|
A |
1 |
=demo.query("select EID,NAME,GENDER, BIRTHDAY,SALARY from EMPLOYEE") |
2 |
[?>=15000,?>=12000,?>=9000] |
3 |
=A1.enum(A2,SALARY) |
A1 gets a table sequence by query and A2 is a sequence of enumeration conditions as shown below:
According to the sequence of conditions, the result of enumeration grouping in A3 is as follows:
In each group produced by enumeration grouping, not all SALARY values are equal; they just satisfy the same condition. Similarly, though a record may satisfy both the first condition and the other two conditions, it will, by default, be put into the group where only the first one is satisfied. For the records that don’t satisfy any conditions, like the first and the fourth records , they won’t be grouped.
Similar to the penum function, a null condition can be appended at the end, under which all records that don’t satisfy all the previous conditions can be placed when performing enumeration grouping. Or you can use enum@n, which has a same effect. For example:
|
A |
1 |
=demo.query("select EID,NAME,GENDER, BIRTHDAY,SALARY from EMPLOYEE") |
2 |
[?>=15000,?>=12000,?>=9000] |
3 |
=A1.enum@n(A2,SALARY) |
4 |
[?>=15000,?>=12000,?>=9000,null] |
5 |
=A1.enum(A4,SALARY) |
Results of A3 and A5 are same:
By comparing the result with that of the previous example, it can be seen that both the use of enum@n function and the appending of null at the end of the sequence of conditions will put all records that don’t satisfy the other conditions into a separate group.
By default, it is assumed for the enum function that there are no overlapped groups, that is, each member of P won’t satisfy more than one conditional expression at the same time. The @r option is needed if members should be put into more than one group. For example:
|
A |
1 |
=demo.query("select EID,NAME,GENDER, BIRTHDAY,SALARY from EMPLOYEE") |
2 |
[?>=15000,?>=12000,?>=9000] |
3 |
=A1.enum@r(A2,SALARY) |
A3 uses enum@r to perform enumeration grouping allowing overlapped groups. Result is as follows:
As can be seen, members of the first group appear in all the three groups and members of the second group appear repeatedly in the latter two groups after the records are repeatedly grouped.