# Computing Triplet and Quartet Distances

## Abstract

In this thesis we present an implementation of, improvements of and experiments on the two algorithms presented by [SODA 2013].

The algorithms calculate the triplet and quartet distances on two respectively rooted and unrooted trees, of arbitrary degree. Both the triplet and the quartet distances are measures for comparing the similarity between two trees.

The triplet distance calculation operates on sets of three leaves (triplets). The quartet distance calculation operates on sets of four leaves (quartets). The distances are defined as the number of triplets or quartets with different structures in the two trees.

The triplet distance calculation algorithm presented in [SODA 2013] runs in time $O(n \cdot \lg n)$. The space usage of the algorithm is $O(n \cdot \min(d_1, \lg n))$, where $d_1$ is the degree of the first input tree, $T_1$.

The quartet distance calculation algorithm runs in time $O(\max(d_1, d_2) \cdot n \cdot \lg n)$ where $d_2$ is the degree of the second input tree, $T_2$. The space usage of the algorithm is $O(\max(d_1, d_2) \cdot n \cdot \min(\max(d_1, d_2), \lg n))$.

We improve the quartet distance calculation algorithm, reducing the runtime to $O(\min(d_1, d_2) \cdot n \cdot \lg n)$. This improvement furthermore decreases the space usage to $O(\min(d_1, d_2) \cdot n \cdot \min(d_1, d_2, \lg n))$.

We furthermore significantly reduce the number of calculations needed to calculate the quartet distance.

Through experiments we provide empirical evidence that both the algorithms presented in [SODA 2013], as well as our improvements, are feasible and perform well in practice.

## Compilation and Usage

The Makefile specifies -m64, meaning that all compiles will result in 64-bit programs. This is only useful when your compiler supports it, and you intend to run the program on a 64-bit system.

To instead compile a 32-bit version, manually edit the file Makefile to state -m32 in the two places (CFLAGS and LDFLAGS at the top) where it currently specifies -m64. As noted below, you will further have to specify NO_N4_128=1 for all compiles.

## Compilation

Many different variations are supported. The combination is specified at compile-time. We here list a number of useful commands.

• Triplets:
make CONTRACT_NUM=20000
• Quartets, [SODA 2013]:
make QUARTETS=1 CONTRACT_NUM=20000
• Quartets, our variation, calculating $A$ and $B$:
make QUARTETS=1 NOSWAP=1 CONTRACT_NUM=20000
• Quartets, our variation, calculating $A$ and $E$:
make QUARTETS=1 NOSWAP=1 CONTRACT_NUM=20000 CALCE=1
• Furthermore note that
• CONTRACT_NUM=<?> denotes $Q$. 20,000 has been specified in the commands above as this was found to perform the best.
• Contract can be disabled by specifying NOEXTRACT=1.
• 128-bit integers are used by default for $n^4$ sums but can be disabled with NO_N4_128=1. Note that 128-bit integers only works for 64-bit compiles via gcc and similar. By disabling 128-bit integers, only 64-bit integers are used, which limits the size of the input, the program can meaningfully process.

## Running the Program

• All compiles can calculate the triplet distance, but only the quartet-compiles can calculate the quartet distance. Also note that the triplet distance calculation for quartet compiles will be slower, as they will maintain all counters used for the quartet distance calculation.
• Input trees should be given in the Newick format (and the file has to end on the ; character, i.e. line breaks will not be tolerated).
• Triplets:
<program file> fancy calcTripDist <t1> <t2>
• Quartets:
<program file> fancy calcQuartDist <t1> <t2>

## Compilation, Simplified Version

A few variations are supported. The combination is specified at compile-time. We here list the commands.

• Triplets:
make
• Quartets, our variation, calculating $A$ and $E$:
make QUARTETS=1
• Furthermore note that the following can be added
• CONTRACT_NUM=<?> denotes $Q$. 20,000 is the default as this was found to perform the best.
• Contract can be disabled by specifying NOEXTRACT=1.
• 128-bit integers are used by default for $n^4$ sums but can be disabled with NO_N4_128=1. Note that 128-bit integers only works for 64-bit compiles via gcc and similar. By disabling 128-bit integers, only 64-bit integers are used, which limits the size of the input, the program can meaningfully process.

## Running the Program, Simplified Version

• All compiles can calculate the triplet distance, but only the quartet-compiles can calculate the quartet distance. Also note that the triplet distance calculation for quartet compiles will be slower, as they will maintain all counters used for the quartet distance calculation.
• Input trees should be given in the Newick format (and the file has to end on the ; character, i.e. line breaks will not be tolerated).
• Triplets:
<program file> calcTripDist <t1> <t2>
• Quartets:
<program file> calcQuartDist <t1> <t2>