Network Clustering - Bernhard Pfann, CFA

## Triangle Density The number of actual triangles divided by the number of potential triangles that could be formed in a fully connected network of $n$ nodes. $ \text{Triangle density}=\frac{\text{\# of triangles}}{n \choose 3} $ ## Network Transitivity Also called clustering coefficient. Describes the number of closed triplets divided by the number of open and closed triplets. $ C= \frac{\text{\# of closed triplets}}{\text{\# of triplets}} $ It can be written in terms of the [[Network Types#^daa3e3|Adjacency Matrix]]: $ C= \frac{\sum_{i,j,k} A_{ij}A_{jk}A_{ki}}{\sum_ik_i(k_i-1)} = \frac{\frac{1}{2}\text{tr } A^3}{\sum_ik_i(k_i-1)} $ - In the numerator each product of $(A_{ij}A_{jk}A_{ki})$ results in $1$ when it is a closed triplet and $0$ otherwise. This is the same as considering $A^3$, which gives the number of [[Network Types#^7075f3|Walks]] from $i$ to $j$ in $3$ steps. As a closed triplet needs to start and end at the same node, only elements on the diagonal $[A^3]_{ii}$ are of interest. So we simply sum the diagonal of $A^3$, which is the trace $(\text{tr } A^3)$. - Since each element of the diagonal is either $0$ (if no closed triplet) or $2$ (if closed triplet), we still need to divide the trace by $2$. This is because there are always 2 possible walks around a triangle. - For the denominator, the number of triplets (open or closed) depends on the degree $k_i$ of node $i$. ## Node Transitivity Also called local clustering coefficient. The same concept as network transitivity, but applied on a single node. $ C= \frac{3* \text{\# triangles connected to node }i}{\text{\# connected triples centered at node }i} $ $ C= \frac{\sum_{j,k} A_{ij}A_{jk}A_{ki}}{k_i(k_i-1)} $ >[!note:] >A triangle and a closed triplet describe the same thing. However, for network statistics triplets are counted nodewise. Therefore each triangle is counted 3 times. ## Homophily Answers the question of how similar is a node to its neighboring nodes. Modularity is a metric to measure this. It considers the fraction of edges where the node attribute $t$ is the same in both of its nodes. To see the uplift, we subtract this by the fraction of such edges, that would appear in a random graph (of certain configuration). **Number of edges of same type:** $ \begin{align} &=\sum_{(i,j)\in E}\delta(t_i, t_j) \tag{1}\\[8pt] &= \frac{1}{2}\sum_{i,j}A_{ij} *\delta(t_i, t_j) \tag{2} \end{align} $ where: - (1) We denote $\delta(t_i,t_j)$ as an indicator variable that equals $1$ if the node attribute $t$ is the same for node $i$ and $j$. We sum over all $(i,j)$ tuples in the edge list $E$. - (2) This is the same as summing over all possible combinations. The product with the adjacency matrix $A_{ij}$ we only return $1$ for connected nodes, where the attribute matches. We need to divide by $2$ as each edge is counted twice this way ($i \mapsto j$ and $j \mapsto i$). **Number of expected edges of same type:** $ \begin{align} =\frac{1}{2}\sum_{i,j}\frac{k_i k_j}{2m} \delta(t_i, t_j) \end{align} \tag{3}$ (3) The terms $k_i,k_j$ denote the degrees of the respective nodes. The term $m$ denotes the total number of edges of the network (actually of the configuration model). Thus the fraction $\frac{k_ik_j}{2m}$ is the expected number of edges between node $i$ and $j$ (in a configuration model). ## Modularity Now we subtract the two terms and scale by $2m$. This results in a metric that can be within the interval $\in [-1,1]$. $ M=\frac{1}{2m}\sum_{i,j}\Big(A_{ij}-\frac{k_i k_j}{2m}\Big) \delta(t_i, t_j) $