Exercise 6 (by Mikael Eriksson, TA HT2015)


Concerning binary trees: Substitue ”except” for ”bar” and the sentence makes sense: ”A binary tree is full if every node except the leaves has 2 children”. The other concept that I was mentioning, is stronger and is actually called a ”complete” binary tree.


We covered inorder, preorder and postorder traversals of trees. You can use postorder traversal of an arithmetic expression tree in order to evaluate (compute) the resulting value. This is related to lab 1, but you computed the evaluation with a stack instead of tree traversal.


Bertillon and anthropometry


We discussed both hashing and tree-structure for storing and searching records. We assume that the possible values are quantized/discretized into a number of categories. Also, we spent some time discussing the issue with uncertainties when values are close to the edges between the categories. For example, if someone is measured to be 180 cm, we may be uncertain if that is filed in the 170-180 cm box, or the 180-190 cm box.


Hasing: The hash-function maps a document (all of the measurements) to an integer that corresponds to the document's location. The hashfunction should have few collisions, but we can never expect to be able to make it injective (”1 to 1”). One hash search operation is O(1) (constant time), but with n uncertainties, we have to check 2^n possible locations.


With a tree, one search is O(log N) (where N is the total number of documents), but with n uncertainties, we have to check more than O(log N), but specifying how much exactly analytically is probably a bit messy. This problem could be the source for an interesting lab: Create a data set with some random distribution and test by simulation what effect the quantization size and total number of documents has on the search times for hash-table and tree respectively, by experimental measurement.


To quote Paul Kneppter et al., ”Urban Crime Prevention, Surveillance, and Restorative Justice...”:


””” Bertillon solved this problem [of storing and finding criminal records] by translating the body measurements into a number for cataloguing. Five body measurements, subdivided into three categories, corresponded with praticular drawers in the cabinet; and the combination yielded a practical search-and-retrieval system. Henry implemented Bertillon's system, but its major limitation became apparent: the results depended on consistency of measurement.”””


The quote speaks of ”three categories”, which indicates a *very* coarse quantization, so the number of uncertainties should be very low – but even then, apparently, they had problems with the issue of consistency of measurement! The number of drawers in their hash-scheme should be 3^5 = 243. So when the measurements have no uncertainties, in the best case, the average number of records you have to look through will be 1/243th, of the total number of records (e.g. 41 instead of 10.000).


Huffman Coding


I showed how to find the Huffman code for an alphabet of symbols with known probabilities (relative frequencies) using a style that builds a tree from the bottom (the individual symbols being the leaves) up. The code for each symbol is determined by the path from the root of the final tree where each branch taken to the left represents a 0 and each branch taken to the right represents a 1.


I also mentioned so called run-length encoding which is sometimes more efficient than Huffman. It all depends on what the data looks like.


Tips for lab 5


You don't have to write the updated vectors back to file.


Please make use of the idea of ADT, so that you don't try to solve everything in one place. Suggestions for functions / ADTs:


toknizer-function in order to convert user's input string to something more practical (this should be very simmilar to lab 1 actually, so look back at how you did there), stored in a ADT-object (struct + functions) that represents an operation (fields for operation, operand1, operand2, target [in case of assignement]). The same field can not have two different types (vector and scalar), but you can solve that by having two fields for storing the potential two vectors and one field for a scalar. The ADT will be able to figure out from the contents of the fields what the operations means.


Keep the vector operations together in an ADT. The algebraic operations on vectors belongs in a natural way in it's own ADT. Also, the name of the vector would be nice to have belong to the vector.


I would personally make an ADT for storing all the vectors, where there will be the typical ”dictionary” operations of findVector(char* name), existsVector(char* name), setVector(char* name, Vector* v). This way, the ”temp” variable doesn't even have to be taken care of in some special way (like using a separate variable in the program just to handle ”temp” - what if you want a ”temp2” then?).


Rememeber to document all your functions with PRE- and POST or the IN, OUT, DEPENDENCIES, EFFECTS.