Primary Key and Index

Read(375) Label: primary key, index,

esProc allows specifying one or more fields as a table sequence’s primary key, based on which certain functions make queries. This simplifies code and effectively increases performance.

find and pfind functions

Primary keys are common in database tables. A value of the primary key field is used to uniquely identify a record. A primary key must not contain identical values, which is a mandatory requirement by many databases.

esProc allows multiple primary key fields in a TSeq. The keys’ order should be fixed. Key values are called primary keys. We consider that a primary key has unique values among records. But no mandatory check will be executed and error won’t be reported even if there are identical key values. Both T.pfind(k) and T.find(k) search table sequence T for records according to primary key value k. find returns the first-found record and pfind returns the sequence number of this record. k can be a single value or a sequence depending on the number of the keys in T.

To specify a primary key consisting of one or more fields, use T.keys(Fi,…) function, which defines Fi,… as the primary key of table sequence T. create(Fi,…) function allows specifying key fields by adding the sign # before field names when creating an empty table sequence, like =create(#OrderID,Client,SellerId,Amount,OrderDate). OrderID is the primary key of this newly-created empty table sequence.

Locating functions T.select() and T.pselect() are conventionally used to search records in a table sequence. We’ll compare their uses with pfind and find.

Here EMPLOYEE table in demo database is the table sequence on which a query will be executed according to a newly-added FullName field:

To illustrate the performance advantage of primary key in data query, 10,000 random full names will be generated, on which pselect and pfind will respectively make query. The time taken by both will be recorded.

 

A

1

=demo.query("select * from EMPLOYEE")

2

=A1.derive(NAME+" "+SURNAME:FullName)

3

=10000.(A2(rand(A2.len())+1).FullName)

4

=now()

5

=A3.(A2.pselect(FullName:A3.~))

6

=interval@ms(A4,now())

7

=now()

8

>A2.keys(FullName)

9

=A3.(A2.pfind(A3.~))

10

=interval@ms(A7,now())

Based on the same data, A5 and A9 query the table sequence to get positions of the requested records using pselect and pfind respectively. In A9, keys function is used to set the primary key for the table sequence before pfind starts to work. A6 and A10 compute respectively the milliseconds the queries take:

 

But the query results in A5 and A9 are same:

Similarly, you can compare select function and find function. To be functionally in line with find function, @1 option is added to select function to get only the first-found result and return it:

 

A

1

=demo.query("select * from EMPLOYEE")

2

=A1.derive(NAME+" "+SURNAME:FullName)

3

=10000.(A2(rand(A2.len())+1).FullName)

4

=now()

5

=A3.(A2.select@1(FullName==A3.~))

6

=interval@ms(A4,now())

7

=now()

8

>A2.keys(FullName)

9

=A3.(A2.find(A3.~))

10

=interval@ms(A7,now())

A6 and A10 compute respectively the milliseconds the queries take:

 

Still, the query results in A5 and A9 are same:

It is clear that the primary-key-based query functions are much more efficient than the traditional locating functions.

The primary key and the index table

Why, in esProc, will the efficiency increase significantly when the primary key is used to make query? The reason is that the index table of the primary key has been used for computation.

When calling the keys function or making a primary-key-based query on a table sequence for the first time without an index table, an index table will be generated according to the primary key. A hash table will have been created according to all primary key values when the index table is generated. It divides the primary key values into multiple groups by the hash values, which correspond to the group numbers.

Normally, records of a table sequence need to be traversed when performing a key-based query until each targeted record is found. An average of n/2 comparisons are needed for querying a table sequence containing n records.

Thanks to the index table, searching a table sequence for a certain record according to the primary key value would make a difference. First, the hash values will be computed according to the primary key values. According to the hash values, the group corresponding to the targeted record will be directly found in the index table. Then we just need to compare the key value with records of the identified group. In this way, for a table sequence containing n records and with its primary key values distributed in k groups by hash values, only an average of n/2k comparisons are needed. Though hash values must be computed before an index table is generated and the query is executed, the number of comparisons is reduced significantly, and particularly, the index table needs to be generated only once. Therefore, the more the data in a table sequence and the more the queries, the higher the efficiency when performing the primary-key-based queries.

T.index(n) function can be used to create an index table for the table sequence T’s primary key in advance. n is the length of the index table. Default value will be used if n isn’t supplied.

What we should know is that the reason the primary-key-based query functions, such as find and pfind, can enhance computational performance effectively is that an index table has been created for the table sequence’s primary key. So if the primary key itself can be used as the index to locate records, it is unnecessary to create one.The EID field in the above EMPLOYEE table, for example, represents the positions of the records in the table sequence, thus it is more efficient to use it to find the corresponding records.

 

A

1

=demo.query("select * from EMPLOYEE").keys(EID)

2

=10000.(A1(rand(A1.len())+1).EID)

3

=now()

4

=A2.(A1(A2.~))

5

=interval@ms(A3,now())

6

=now()

7

=A2.(A1.find(A2.~))

8

=interval@ms(A6,now())

This time 10,000 sequence numbers of employees are generated randomly. A4 finds the corresponding records according to these sequence numbers; A7 still uses find function to find records. A5 and A8 compute respectively the time the two methods take:

 

You see that it is much faster to locate records using sequence numbers directly. Because this method doesn’t compare field values, nor does it compute the hash values and create an index table. The query results in A4 and A7 are same:

Thus you need to weigh the pros and cons when trying to increase efficiency by creating an index table for a table sequence’s primary key in esProc.

switch function

Besides find and pfind functions, switch function also searches for records in the table sequence by automatically using the index table created for the primary key values. For example:

 

A

1

=file("D:/files/txt/PersonnelInfo.btx")

2

=A1.import@b()

3

=A1.import@b()

4

=demo.query("select * from STATES")

5

=now()

6

>A2.(State=A4.select@1(ABBR:A2.State))

7

=interval@ms(A5,now())

8

=now()

9

>A3.switch(State,A4:ABBR)

10

=interval@ms(A8,now())

Both A2 and A3 contain personnel information imported from the following bin file PersonnelInfo:

A4 contains states information:

In both A6 and A9, State field of the PersonnelInfo file is switched into corresponding states information. A6 uses select@1 function while A9 uses switch function. A7 and A10 compute respectively the time the two methods take:

 

After the cellset code is executed, values of A2 and A3 are same:

State field values have been converted into corresponding records in states table.

Before switch function is executed, an index table will also be created for the corresponding field in the other table sequence. In this example, an index table is created for the ABBR field of A4’s states table to increase the matching efficiency. It’s good for you to learn to use the function properly when a foreign key field is generated to reference records.