How infections spread in a social network: COVID-19

What better way to use quarantine time than to do a network analysis of what is causing this lockdown and the resultant quarantine?

In this post, we’ll go over the well-known network dynamics of infection spreading [1]. After understanding them and implementing them on a toy network, we will implement COVID-19 relevant numbers on a social network of a larger size. Since it is hard to extract a social network, we’ll assume that it is a Barabasi-Albert (BA) graph of the same size [2]. It has been shown in previous research articles that social networks follow the power-law dynamics of BA networks hence it makes a good analysis model for this blog post. The same analysis can be done on real social networks as well.

Constructing the social network. The social network will comprise of nodes and edges where nodes represent people and edges represent a relationship where there is a good possibility of in-person interaction between them. For example neighbors, colleagues, family members would be linked to each other by edges. For this reason, extracting a social network of tweets or Facebook wall posts will not serve the same purpose. We will start with a network of who knows who in a small neighborhood, i.e., there is an edge between two people if they know each other and live in the same neighborhood. We do not necessarily meet everyone we know every day and for this reason, we will consider a meeting fraction detailed later in this post. We will use the preferential attachment method to construct this social network [2]. Preferential attachment means a new node in the network will connect to other nodes in the network with a probability proportional to their existing degree. In other words, the node in the network with most node-neighbors will be most likely to create a link with a new node. If you are new to a town and you’re trying to make friends, you are most likely to be introduced to someone who has many friends in the town. Consider a network with only three nodes: A, B, and C. A new node, which we label as 1 comes into this network. This new node connects to two of the three nodes at random with a higher probability for a high degree node. In this case, all the nodes A, B, and C have the same degree and hence the node 1 picks any two of these three nodes at random – as a result, this node connects to nodes A and B. Then, another node, labeled 2 is introduced to the network. This node picks two of the nodes A, B, C, and 1 each with degrees 3, 3, 2, and 2 respectively. I ran this code on a computer and the random chance results in node 2 connecting with nodes A and C. Note that the highest degrees at this point is of nodes A and B but since we are choosing nodes at random with probability proportional to the degree, C can also get picked. Thereafter, another node 3 is introduced to the network which connects with nodes A, B, and C.

example_PA

This network generation model is called the Barabasi-Albert preferential attachment model and I’ll use the networkx package in python to implement it. The figure below shows a BA network with 200 edges. The size and the color of the node denote its degree. A node with a bigger size and darker shade of blue has more edges than a smaller size and lighter shade of blue node. The nodes with very high degrees in a network are called “hubs”.

BA_big2

Spread of infectious disease on a network

We’ll try to understand the spreading of infectious disease on a social network. There are various models of epidemic spreading, read more here. We’ll start with a simple model on networks, the SIR model, where a fraction of the population is Susceptible to catching the infection, that is, everyone who is currently healthy and hasn’t been infected; another fraction is Infected, that is, people who currently have the infection; and a fraction is Removed, that is, people who are removed from the two groups because they became infected and recovered and are now immune or people who died from the disease. For illustration, we will start with a Barabasi-Albert network model of 600 nodes, for example, a network of people living in a housing complex. We’ll assume that to start with, 10 people in the building caught an infectious disease and they have a likelihood of spreading it to people they come in physical contact with. In a world without social distancing, we will assume that a person interacts with a certain fraction of everyone she knows in one day. My guesstimate on this fraction would be 10%. For the fun sake of it, we assume that this infectious disease kills 60% of the people who get it at the end of 6 days of being infected and the other 40% become immune. Running this simulation for 35 days leads to the outbreak completing its course resulting in almost everyone infected at least once hence killing nearly 60% of the total population and the other 40% surviving and becoming immune to further infection. The network evolution resulting from simulations using these numbers is in the gif below. The green nodes are healthy and susceptible, the red nodes are the infected ones, the black nodes are dead and the blue nodes are recovered from the infection and hence immune.

try_gif

timeseries

The timeseries plot of the number of healthy, infected, immune, and dead people over time tells how the outbreak will likely die out. It is interesting to note that once the infection dies out, that is, the number of infected people goes down to zero, some people still stay healthy. In other words, the outbreak can lead to some people never catching the infection. To think about it, it would be the best idea to ensure that the most vulnerable form this group of people ensuring the least deaths.

COVID-19 case

The current pandemic of COVID-19 is different in many ways. There are two cases to consider, one is where there is still no social distancing which is the beginning of the spread in many countries and the other is where there is social distancing. The case without social distancing would have the same transmission rate as we used in the toy model above, that is, 10%. When everyone is practicing a great extent of social distancing, this rate would reduce – I estimate it at 2.5%. The death rate of COVID-19 is nearly 5% in the US. However, not all of the other 95% are becoming immune since we have been hearing cases of re-infection. I’ll assume that 85% become immune and the other 10% become susceptible again. Another modification is based on that the patients who die from COVID-19, do so within a week but those who survive and recover need 14 days to become uninfected. Since the COVID-19 evolution rate is expected to be much slower, we will consider a network of 10000 nodes. With these updated numbers, running the simulation for 60 days where the lockdown started after 20 days after the beginning of the outbreak, we get the following results.

try10k_v4_new

timeseries_COVID_10k_v4

The time-series plot above shows how the numbers vary over a 60 day period for the COVID-19 case. The vertical grey line is the time when the lockdown became effective drastically reducing the transmission probability. It is interesting to note that while the lockdown causes kinks in the plot the next day, there is another kink in the number of deaths plot ~6 days after, which is the time after which the sick people who cannot recover die. There is yet another kink in the plots of immune plot and infected plot ~14 days after, which is the time taken by an infected person to get rid of the virus. This shows that in the long run, ~70% of the entire neighborhood will be infected and some ~5% will die from the infection. We can also clearly see the lockdown leads to a reduction in numbers.

We can look at disease spread mechanisms on various other network models or even on the data of real social networks. There were various assumptions in this analysis which can be improved by the match of the predictions with real-time data. We can also involve node attributes to find the vulnerability of a particular person to this disease. The parameters we use here will vary with location. In particular, the death rate would be dependent on the quality of healthcare in the neighborhood. Hopefully, these would make a future blog post. Till then, stay safe! 🙂

References:

1. Newman, Mark EJ. “Spread of epidemic disease on networks.” Physical review E 66.1 (2002): 016128.
2. Barabási, Albert-László, and Réka Albert. “Emergence of scaling in random networks.” science 286.5439 (1999): 509-512.

Temporal Social Network of High School Students

The analysis on a network is highly insightful and can predict interesting information, for example, predicting different groups that would be formed if the network splits (see this). We can calculate various node centrality measures and obtain their distributions to understand the importance of different nodes and their ability to control the other nodes. All of this analysis is typically done on a fixed network structure. In real life, however, networks are not so fixed and with time, different nodes and edges are removed or added which makes it interesting to see the evolution of a network and information we can deconstruct from there.

A basic example of a network evolving with time is the one where the nodes stay fixed while different edges exist at different times. An example of this is a social network where an edge between two people exists at a particular time only if the two people were talking to each other at that time. The complete network is a collection of various edges each with none or many timestamps. In such a network, a time snapshot shows the interaction at a particular time, i.e., it tells who was talking to whom at a given time. In a temporal social network, group interactions would be reflected in cliques where all edges carry the same timestamp. When there is a path such that every consecutive edge has a timestamp after the timestamp of the previous edge, we have a time respecting pathway. For example, A and B talk at t=0, B and C talk at t=2, C and D talk at t=3 then we can say that A -> B -> C -> D is a time-respecting path. Time respecting paths are important because they could be denoting the trajectory of a particular information. For example, A told B about a new announcement by the mayor which B told C and C told D. Hence identifying these paths can tell us the possible origins and carriers of information. If the network has a characteristic radial pattern of multiple time respecting pathways originating from the same node, it could possibly indicate the spread of a rumor. A visualization of the timeline plot which shows how the network evolves over time can indicate what times of the year students are more talkative. If we find certain edges which exist in most time-line plots, we can possibly call them strong friendships (more on timeline plots later in this post). There are network centrality measures specific to temporal networks that can help us understand the network. For example, the temporal distance (or latency) between two nodes reflects how close two nodes are, and how likely they are to participate in the spreading of a rumor. It is interesting to note that identifying node pairs with small latency can be exploited to propagate a belief or political opinion.

In this post, we’ll look at a social network of interaction between high school students where the nodes are the students and an edge with a timestamp tells when a pair of students interacted. Most of the analysis is done using the networkx and pathpy libraries in python. This network has 126 nodes (with anonymized student names, of course) and 28,562 edges. Since the complete network is too huge and dense for any convenient visualization, we’ll be focusing on visualizing and analyzing just a subset of the network. Hence, any conclusions made on this subset may not be representative of the entire network. There are various ways of representing a temporal network – as a multigraph where each edge label represents the timestamp of the interaction (this is simple to construct but very cumbersome to read); as a graph where each edge label is a list of timestamps of interactions; or as a timeline plot where we can see the flattened version of the network at each time stamp. Below are two of the representations of a 21 node subset of the complete network. The timestamps are in seconds and each interaction lasts 20 seconds. Each row in the timeline plot corresponds to a student and there is a line between two rows at a particular position if the two students interact with each other at that time. The bottom line in the timeline plot marks the time-axis. Depending on what we want to understand from the system, either of the representations could be more useful. For example, a quick look at the graph with a list of timestamps as edge labels tells us that nodes 47 and 61 never interact, the same piece of information is difficult to read from the timeline plot. While some other information could be easier understood from the timeline plot, for example, between t=100 to t=300, none of the students interacted with each other – this could be the time a teacher gave a brief instruction to the class. However, there are still pieces of information that could be conveniently understood from either of the representations, for example, which students are more talkative, i.e. high degree nodes in the graph or many dots on a horizontal line in the timeline plot, or, which node pairs are highly interacting, i.e. who are best friends – in the graph, this would correspond to an edge with a really long list of timestamps and in the timeline plot, this would correspond to multiple lines between two students at different times. An example of a best-friend pair is nodes 5 and 20 which interact with each other at 10 different times.

final1

final1

Students 8, 47, 72 and 116 interact with each other at the exact same timestamp, i.e., at t=54940. This could happen only if they are collectively interacting in a group. The identifier of a group interaction is hence a clique with at least one timestamp that is common for all nodes.

group1

This network of high school students has multiple time-respecting paths and I used in-built functions from the pathpy library to find these paths. Of multiple nodes, node 106 is one which is the origin of many time-respecting paths. Four of these time respecting paths are shown in the network subset below. The delta of the time respecting path is the upper limit of the allowed time difference between two consecutive edges. If this delta is small, we are surer of the information spreading “as heard” while if we allow the delta to be large, people are less likely to be talking about the same thing through the complete path. The delta for this network subset was 10 minutes. Please note that every pair of nodes shown in the example has multiple edges with each other, the display is cut down to only relevant information for ease of understanding.

rumor_just

A temporal network is a highly informative instrument and this post covers a very small fraction of what all we can understand from one. Modeling and analysis of a temporal network is a field with a lot of ongoing research. With the presence of online social networks, temporal interaction networks are more accessible to us. While many important network measures have more significance if the nodes are not anonymized, there are still motifs and artifacts that hold significance and tell us a lot about the underlying system. I hope this post gives you a good primer on the analysis and significance of temporal networks.

Data source: High School temporal network.

Understanding networks

A network or a graph is a collection of vertices (or nodes) and edges. The number of edges incident to a node gives its degree. In a directed graph, we have in-degree (number of incoming edges) and out-degree (number of outgoing edges). The degree of a node is one of the most basic and crucial pieces of information about a node. The degree can help us identify hubs or weakly connected nodes. Degree distribution P(k) gives the fraction of nodes with degree k. In a uniform network, every node has more or less the same degree and hence the degree distribution is uniform. Thanks to the great work by Barabasi and Albert [1], degree distribution can provide incredible insights into how the network possibly evolved. In addition to the degree, we can find the clustering coefficient of a node (or local clustering coefficient) which quantifies how closely bound a node’s neighborhood is. The local clustering coefficient is given as C_i = \frac{n_i}{k_i(k_i-1)/2} where n_i is the number of edges among nodes adjacent to i, and k_i is the number of nodes adjacent to i. This formula is the same as C_i = \frac{no.\: of\: triangles\: connected\: to\: i}{no.\: of\: triplets\: centered\: at\: i}. The global or graph clustering coefficient is given by the formula: C = \frac{no.\: of\: closed\: triplets}{no\: of\: connected\: triplets} = \frac {6 (no.\: of\: triangles)}{no.\: of\: simple\: paths\: of\: length\: two}. The global clustering coefficient is 0 for a tree and 1 for a complete graph. Another interesting centrality measure is the betweenness centrality of a node which is the sum of fractions of shortest paths that pass through the node. For a node pair (i,j), if C(i,j) is the number of all shortest paths between i and j and C_k(i,j) is the number of these shortest paths that pass through node k, then the betweenness centrality of node k is given as g_k = \sum_{i\neq j} \frac{C_k(i,j)}{C(i,j)}. As is evident from the formula, betweenness centrality is a measure of how much a node stands in the path of other nodes and has wide applications in transport and telecommunication networks. Now that we have looked at measures describing properties of individual nodes, it would be interesting to take a peek at measures that describe the characteristics of the whole network topology in general. The diameter of a network is the maximum shortest path between any pair of nodes and is representative of the linear size of a network. While the network diameter is a depiction of worst-case distance in a network, the average path length comes in handy to estimate the efficiency of information or mass transport in many networks. The average path length <d> is the average of all n(n-1) shortest path lengths. Of course, this average is considered only over pairs of nodes that have at least one path with a finite path length. A network is connected if each pair of nodes has a path between them. When a network is not connected, it can be separated into connected components. By definition, a connected component in an undirected network or a graph is a subgraph in which any two nodes are connected to each other by a path and no node in this subgraph is connected to a node outside of this subgraph. In directed graphs, we define strongly and weakly connected components. A strongly connected component (SCC) is a subgraph within which, there is a directed path for each ordered pair of nodes. Every node in an SCC has the capability to reach every other node in it. The SCCs of a directed graph can be found in linear time using a DFS (depth-first search) based algorithm, for example, Kosaraju’s algorithm. A weakly connected component (WCC) is a subgraph of a directed graph such that there is an undirected path for each unordered pair of nodes in the WCC.

For almost all centrality measures discussed here, the networkx library in python has in-built functions making it really easy and fun to analyze real networks. In the rest of this post, I use networkx to analyze a real world network – Zachary’s karate club, which is a social network of friendships between 34 members of a karate club at a US university in the 1970s [2]. This karate club was not indeed Zachary’s, he collected the data on how different members of the club interacted outside the club and after constructing the network, he observed that the club is a cumulation of two social groups each centered around a different leader which can be identified by the degrees of these nodes. These leaders were the instructor of the club and the club president. Interestingly, when the club split up, the two new groups centered around these leaders. Zachary used Ford and Fulkerson’s maximum flow – minimum cut labeling procedure to find which club member leaned towards which social group. Except for one guy, the club split exactly as per Zachary’s labeling. I plan to cover various network labeling methods in a future post.

karate_CC

The karate club network shown in the image above shows that the two leader nodes are easily distinguishable. The node sizes depict the degree – bigger nodes have a higher degree and the color depicts the betweenness centrality – red node has the highest betweenness while blue has least. The two most connected nodes or the leader nodes have degrees 16 and 17. These nodes have clustering coefficients 0.15 and 0.11 respectively, while their betweenness centrality measures are 0.44 and 0.30 respectively. It is interesting to note that both leaders have rather low clustering coefficients and high betweenness centralities implying that it is the leaders who are holding together the social network and without the leaders, this would indeed be a very sparse network with starkly different labeling in the case of splitting.

karate_pk

The degree distribution further reinforces the importance of leader nodes as most nodes have a rather small degree (2-4) and there are very few nodes with high degrees. The global clustering coefficient of this network is 0.26 and the diameter of the network is 5.

Before I complete this post, I would like to touch upon two things, both of which I plan to cover in future posts. We looked at a stationary network in this post, real networks, especially social networks evolve with time as new nodes and edges are added and some nodes and edges are deleted. It is interesting to observe how the centrality measures of nodes and the network topology grow with this evolution. I plan to study the evolution of a social network and hopefully post on this blog soon! The second thing is that here we looked strictly at the network topology or network structure without going into the exact dynamics of an edge. In dynamical networks, nodes have states which can be continuous or discrete (0 or 1 for example) and edges between nodes have associated dynamics which denote how exactly the state of one node affects another. These dynamics can be continuous, differential equations for example or can be discrete like Boolean logic rules. While there is a lot that network structure alone can convey about the network, dynamics are crucial for understanding the underlying mechanics.

Resources:

  1. Barabási, Albert-László, and Réka Albert. “Emergence of scaling in random networks.” science 286.5439 (1999): 509-512.
  2. Zachary, Wayne W. “An information flow model for conflict and fission in small groups.” Journal of anthropological research33.4 (1977): 452-473.

Enumeration of spanning trees

An undirected connected graph can have multiple spanning trees. Sometimes, some of these spanning trees are edge-disjoint, that is, they do not share any edge with each other. Edge-disjoint spanning trees have applications to construct alternate message transmission networks; to design new transport systems where a transport system already exists. “Bridg-it”, a game devised by David Gale (copyright 1960 by Hasbro Toys) has a winning strategy which relies on constructing disjoint spanning trees on the board [1]. You can play the game here. The complete graph K_5 has 2 disjoint spanning trees.

K5

The upper bound on the number of edge-disjoint spanning trees in a complete graph K_n is \lfloor \frac{n}{2} \rfloor since the total number of edges in the graph is \frac{n(n-1)}{2} and every tree has n-1 edges.

If a graph is connected labeled and not a tree, it contains more than one spanning tree which may or may not be disjoint. The problem of finding the number of spanning trees of a graph has interesting applications, for example, counting certain hydrocarbons containing a given number of carbon atoms. Recently, I learned that it is also useful for computer scientists (or data scientists more particularly) to estimate the length of database calculation. This problem was first solved by Gustav Kirchoff [2]. I’ll first illustrate the counting of spanning trees in a graph by case counting and then by using Kirchoff’s theorem. Consider the 5-vertex undirected graph in the following image.

MST

Let us call edge (2,5) e1 and edge (2,4) e2. A spanning tree of this graph may contain both, either or none of these two edges. If the spanning tree contains none of these edges, then we are basically looking at G – {e1, e2}. This graph is a cycle and removing any edge from this cycle gives us a tree, hence there are 5 spanning trees in this case. Consider the case when the spanning tree contains e2 but not e1. We have to remove either of the three edges (1,2), (1,5), (5,4) and either of the two edges (2,3), (3,4), so we have 2×3 = 6 trees in this case. When the spanning tree contains e1 but not e2, in G – {e2}, we have to remove either of the two edges (1,2), (1,5) and either of the three edges (2,3), (3,4), (4,5) resulting in 2×3 = 6 trees in this case as well. Now the last case to consider is when the spanning tree contains both the edges e1 and e2. This spanning tree cannot contain the edge (4,5) since that forms a cycle, hence it must have either of the two edges (1,2), (1,5) and either of the two edges (2,3), (3,4) giving us 2×2 = 4 trees in this case. The total number of spanning trees of this graph is hence 5+6+6+4 = 21. The way we found out the spanning trees is similar to deletion-contraction method in which the number of spanning trees is obtained by the formula: \tau(G) = \tau(G-e) + \tau(G.e) where \tau(G) is the number of spanning trees in the graph G, G-e is the graph obtained by deleting edge e from the graph G and G.e is the graph obtained by contracting the edge e from the graph G. We can find the same result of 21 spanning trees by using Kirchoff’s matrix tree theorem which says that the number of spanning trees is given by the value of any cofactor of the Laplacian matrix of the graph. The Laplacian matrix Q is obtained by: Q_{ij} = deg(v_i)\ \text{if}\ i=j;\ -a_{ij}\ \text{if}\  i\neq j where a_{ij} are the elements of the adjacency matrix of the graph. The Laplacian matrix of the discussed 5-vertex graph G is: \begin{bmatrix} 2 & -1 & 0 & 0 & -1 \\ -1 & 4 & -1 & -1 & -1 \\ 0 & -1 & 2 & -1 & 0 \\ 0 & -1 & -1 & 3 & -1 \\ -1 & -1 & 0 & -1 & 3 \\ \end{bmatrix} and the value of any cofactor can be computed easily, which indeed is 21.

Side note:  I feel there is a possibility that K_n has exactly \lfloor \frac{n}{2} \rfloor edge-disjoint spanning trees where each of the trees is a path. I do not have a mathematical proof or contradiction to this, so this is merely my intuition at this point. If I find either, I’ll update it on this blog!

Resources:

  1. West, Douglas Brent. Introduction to graph theory. Vol. 2. Upper Saddle River: Prentice hall, 2001.
  2. Chartrand, Gary, and Ping Zhang. A first course in graph theory. Courier Corporation, 2012.

The spanning tree problem

Hello, world! Welcome to the first post on mathbrewcode blog. This blog is going to be about network science and graph theory, two fields which are highly intertwined but at the same time distinct in terms of their applications and overall philosophy. Most posts would be focused on either of these two fields and I plan to tag them appropriately. The posts on this blog assume some basic knowledge of at least either of these fields. If you are completely new to network science, I would recommend this brilliant (and free) book to get started Network Science by Albert-László Barabási – I would suggest you take a quick look even if you are very well versed with graph theory. And if you are new to graph theory, it is best to start with some introductory textbook, I used “A First Course in Graph Theory” by Gary Chartrand and Ping Zhang, and I really loved going through that book.

In this post, I want to discuss one of my favorite graph theory problems, finding the spanning tree of a connected graph. A graph is a collection of vertices (or nodes) which are interlinked with each other via edges. In graph theory, V denotes the set of vertices and E denotes the set of edges. In the simplest definition, a tree is a connected acyclic graph. Of the many interesting properties of trees, the most notable ones are i) every edge is a bridge i.e. removal of any one edge from the graph splits the graph into two connected components; ii) the number of edges is exactly one less than the number of vertices in the graph, that is, |E| = |V|-1. The spanning tree of an undirected connected graph G is a tree containing all the vertices of G. A graph can have multiple spanning trees and there are many well-known algorithms to find one of them, for example, Kruskal’s algorithm and Prim’s algorithm. When the edges of a graph each have a weight assigned to themselves, a more significant problem is that of finding the minimum spanning tree. Mathematically, we concern ourselves with minimizing the total weight of the spanning tree T obtained by \Sigma_{e \in T} weight(e). The problem of finding the minimum spanning tree has useful applications. For example, for a given roadmap, constructing the least cost bus transport system or finding the most economical layout of a power-line network. In addition to finding the (minimum) spanning tree, there are mathematically interesting problems of finding disjoint spanning trees and the enumeration of spanning trees in an unweighted connected graph. Interestingly, this latter problem of finding the number of spanning trees can be solved in polynomial time. Since this is just an introductory post, I’ll stop here and will delve into these problems in the next post!