All models are wrong
In the previous part of this post, I talked about the mathematical idea of networks as being formed out of vertices (people, say) some of whom are connected by edges (if they are friends, for example). In particular, I described how the Erdős–Rényi model generates networks at random. However, while that makes a useful starting point, it’s too simplistic to capture the behaviour of real-world social networks.
Just to remind you, in Erdős and Rényi’s model, we look at each pair of vertices separately, and decide whether to place an edge between them with the same probability p, independently of the other choices. Although this is a simple model, it has some nice features: for example, pairs of vertices are often not too many edges apart, capturing the ‘small world’ effect that allows us to play the Kevin Bacon game.
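To make the recipe concrete, here is a minimal Python sketch of the independent-edge model described above. The function name `erdos_renyi`, the small values n = 200 and p = 0.05, and the fixed seed are illustrative assumptions rather than anything from the original analysis.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Return a random graph on vertices 0..n-1 as a dict of neighbour sets."""
    rng = random.Random(seed)
    neighbours = {v: set() for v in range(n)}
    # Look at each pair of vertices separately and add an edge with
    # probability p, independently of every other pair.
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            neighbours[u].add(v)
            neighbours[v].add(u)
    return neighbours

# Illustrative sizes only (assumed, chosen so the sketch runs quickly).
graph = erdos_renyi(n=200, p=0.05, seed=1)
mean_degree = sum(len(friends) for friends in graph.values()) / len(graph)
print(f"average number of friends: {mean_degree:.2f}")
```

With these values the average should land somewhere near (n - 1) * p = 9.95, though the exact figure depends on the seed.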
But of course, the independent edge model of Erdős and Rényi isn’t how friendships form in real life. If you learn that Alice is friends with both Bella and Charlotte, you’d expect the chance that Bella and Charlotte are also friends with one another to be higher than p. Of course, it’s not guaranteed, but the fact they both know Alice makes it more likely that they all live in the same city, that they might socialize in a group, be of a similar age, have common interests, and so on.
In other words, we expect that triangles (three vertices, each pair of which is connected by an edge) would be more common in a real-life network than the Erdős–Rényi model suggests. In the actual friendship network, the existence of edges is positively correlated, not independent. We could attempt to tweak that model in some way, to make triangles more likely, but we would then still have issues with sets of four, five or six people, all of whom are more likely to be mutual friends.
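One way to see the lack of triangles is to measure how often two people who share a friend are themselves friends. The sketch below does this for a small independent-edge graph; under independence the rate should sit close to p itself, with no boost from the shared friend. The helper names, n = 300, p = 0.1 and the seed are all assumptions made for illustration.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Independent-edge random graph, returned as a dict of neighbour sets."""
    neighbours = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            neighbours[u].add(v)
            neighbours[v].add(u)
    return neighbours

def friends_of_friend_rate(neighbours):
    """Fraction of (Alice; Bella, Charlotte) wedges in which Bella and Charlotte are friends."""
    pairs = closed = 0
    for friends in neighbours.values():
        # Every pair of Alice's friends is one chance to close a triangle.
        for bella, charlotte in combinations(friends, 2):
            pairs += 1
            closed += charlotte in neighbours[bella]
    return closed / pairs if pairs else 0.0

p = 0.1  # assumed edge probability for this toy example
graph = erdos_renyi(n=300, p=p, rng=random.Random(1))
rate = friends_of_friend_rate(graph)
print(f"edge probability p = {p}, friends-of-a-friend rate = {rate:.3f}")
```

In a real friendship network you would expect the second number to be far larger than the overall edge density, which is exactly the correlation the independent model cannot produce.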
Follower counts
There are other issues with using Erdős–Rényi to model social networks. Since every vertex has its edges generated in the same way, everybody’s number of connections tends to lie in a roughly similar range. As I describe in Numbercrunch, the Central Limit Theorem tells us that this number will be roughly normally distributed (following a bell-shaped curve).
For example, in an Erdős–Rényi model with n = 100,000 and p = 0.0005, the average person will have about 50 friends, and a histogram of a thousand people’s friend counts will look something like this:
The histogram is pretty tightly bunched together: almost nobody has fewer than, say, 20 friends, and although in theory someone could have 100,000 friends, in practice very few people have more than 100.
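The bunching can be reproduced without building the whole network, because one person's friend count in this model is a Binomial(n - 1, p) draw. The following is a rough sketch under that assumption: the helper `binomial_draw`, the bin width and the seed are illustrative, and the thousand sampled people are treated as independent, which ignores the tiny dependence through shared edges.

```python
import math
import random
from collections import Counter

def binomial_draw(n, p, rng):
    """Draw from Binomial(n, p) by jumping between successes with geometric gaps."""
    log_q = math.log(1.0 - p)
    successes, position = 0, 0
    while True:
        # Gap to the next success; 1 - random() lies in (0, 1], so the log is finite.
        position += int(math.log(1.0 - rng.random()) / log_q) + 1
        if position > n:
            return successes
        successes += 1

rng = random.Random(1)
n, p = 100_000, 0.0005  # the values used in the text
friend_counts = [binomial_draw(n - 1, p, rng) for _ in range(1000)]

print("mean:", sum(friend_counts) / len(friend_counts))
print("min:", min(friend_counts), "max:", max(friend_counts))

# Crude text histogram in bins of width 5 (one '#' per five people).
bins = Counter(5 * (count // 5) for count in friend_counts)
for start in sorted(bins):
    print(f"{start:3d}-{start + 4:3d} {'#' * (bins[start] // 5)}")
```

The geometric-gap trick is only there to keep the sketch fast with the standard library; any binomial sampler would do the same job.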
While this isn’t a bad approximation for what real-life friend counts might look like, it isn’t a realistic model for social media, which tends to have a small number of extremely heavy hitters, follower-wise. For example, Taylor Swift currently has more than 90 million Twitter followers, Cristiano Ronaldo’s Instagram account has over half a billion followers, and so on.
There’s a certain amount of controversy as to what histograms of social media followers really look like, with some people arguing for what is called a power law and others arguing for log-normal. But the key thing is that in either case these are what are called heavy-tailed distributions (there are a few users with crazy follower counts), and you won’t see these from the Erdős–Rényi network model.
Building a better model
We’ve seen that in some ways (simple to describe, gives well-connected networks) Erdős–Rényi gives a good model of social networks, but in others (assuming independence, not allowing heavy hitters) it falls down badly. Towards the end of the last century there was an upsurge of interest in new network models which worked more realistically for Internet applications, and a huge amount of research on what this meant in practice.
One class of models that attempted to improve on Erdős–Rényi was introduced by Watts and Strogatz. These were based on adding and removing edges in a particular structured way, and overcame some of the issues around lack of triangles while maintaining the small-world structure. However, they still didn’t capture the heavy-hitter behaviour that we see in real-world social networks.
An alternative way of thinking was introduced by Barabási and Albert, and to my mind captures better what we see on social media networks such as Twitter. Think about the dynamics of joining this kind of network, and deciding who to follow.
When deciding who to trust and who to follow, existing follower counts can play a significant part in your decision: if an account has hundreds of thousands of followers, including some famous people, you are more inclined to think it must be worth following. Whereas maybe that other account only has five followers for a reason? Further, well-followed accounts are more likely to generate retweets and will generally gain a bigger reach with their posts, meaning that you are more likely to notice them by accident.
As a result, the act of deciding who to follow on Twitter is nothing like the ‘every edge has the same chance of being there’ model that Erdős and Rényi introduced. In fact we observe a so-called Matthew effect, named after the Bible verse “For whoever has will be given more, and they will have an abundance” (Matthew 25:29). Successful accounts become more successful, and there is a first-mover advantage where making an early impact on a new social network can lead to sustained gains in followers in the long term.
Of course, it is important to remember that large follower counts are not in themselves a guarantee of reliability or accuracy. We might hope that these qualities would be rewarded by more followers, and perhaps this is true to some extent, but there are certainly examples of accounts which have been rewarded by the Matthew effect despite generating questionable content. Depressingly, similar effects are often noticed in academic career trajectories (where a track record of previous grant funding is often required to win grants) and citation counts (where the papers that have enough citations to be ranked on the front page of Google Scholar then tend to get cited more).
Although it predates Twitter, the Barabási–Albert model captures something of these dynamics. In this model, new users join a network one by one, and choose whether to form links to existing vertices. Instead of the probability of doing this being constant for each potential vertex, as in the Erdős–Rényi model, it is chosen to be proportional to the number of edges that the vertex already has. In other words, you are more likely to follow someone who has lots of followers already. This is often referred to as a preferential attachment model, and it can be proved that it yields the kind of heavy-tailed distributions for the numbers of friends that we see in real social networks.
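Preferential attachment is short enough to sketch directly. The version below is one common way of setting it up, not necessarily the exact formulation in the original paper: it assumes each newcomer makes a fixed number m of links, starts from a small fully connected core, and uses the illustrative values n = 10,000 and m = 3.

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=None):
    """Grow a network to n vertices; each newcomer links to m distinct existing vertices."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m + 1 vertices.
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # 'endpoints' lists each vertex once per edge it touches, so a uniform
    # draw from it picks a vertex with probability proportional to its degree.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # redraw on repeats so the m targets are distinct
            targets.add(rng.choice(endpoints))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges

edges = preferential_attachment(n=10_000, m=3, seed=1)
degree = Counter(v for edge in edges for v in edge)
counts = sorted(degree.values(), reverse=True)
print("five largest degrees:", counts[:5])
print("median degree:", counts[len(counts) // 2])
```

Comparing the handful of largest degrees with the median gives a quick feel for the heavy tail: a few early vertices collect far more links than a typical one.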
As I say, other models are possible: for example, I mentioned that follower counts might follow a log-normal distribution. That is, you might expect to see a bell-shaped curve on plotting a histogram of their logarithms, rather than the raw values. This might arise if every week everyone gains a number of followers that is a random small fraction of their previous follower count (so their logarithm has a small amount added to it), again following the principle of the Central Limit Theorem that I talk about in Numbercrunch. Indeed, the Barabási–Albert model doesn’t really generate triangles in the way that we would hope, partly because it misses the dynamic that people will also tend to follow their real-life friends when joining a social network. So, as usual, we should think that “all models are wrong, but some are useful”.
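The multiplicative-growth story behind the log-normal can also be tried out in a few lines. This is a toy sketch under made-up assumptions: 5,000 accounts, 200 weeks, a common starting count of 10, and a weekly gain drawn uniformly between 0% and 10% of the current count.

```python
import math
import random
import statistics

rng = random.Random(1)
users, weeks = 5000, 200  # assumed sizes for illustration

followers = []
for _ in range(users):
    count = 10.0  # arbitrary common starting point
    for _ in range(weeks):
        # Gain a random small fraction (here up to 10%) of the current count,
        # which adds a small random amount to the logarithm.
        count *= 1.0 + rng.uniform(0.0, 0.10)
    followers.append(count)

logs = [math.log(count) for count in followers]
raw_ratio = statistics.mean(followers) / statistics.median(followers)
log_gap = statistics.mean(logs) - statistics.median(logs)
print("raw counts, mean / median:", round(raw_ratio, 3))
print("log counts, mean - median:", round(log_gap, 3))
```

A mean noticeably above the median is a sign of a right-skewed distribution in the raw counts, while a mean and median that nearly agree for the logarithms is what a symmetric bell curve would give.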
Love will tear us apart
However, having seen that the Barabási–Albert model gives a plausible description of how heavy-tailed follower counts might emerge on social networks, it’s worth thinking about what this means for Twitter.
One key aspect of this model is that, just like the previous ones, it tends to have short routes between pairs of vertices, meaning that the network is well connected. However, this connectivity forms in an interesting way, with these high-connectivity vertices acting something like hubs in the airline network.
If I want to fly from Bristol in the UK to a relatively small city like Tucson in the United States, my best strategy may be to fly from Bristol to a European hub such as Schiphol, from Schiphol to an American hub such as Chicago, and then from Chicago to Tucson. The existence of these very high-connectivity hubs (and the fact that the hubs are themselves connected) ensures that I can reach a very large number of destinations worldwide without changing planes more than twice.
In the same way, the high-connectivity vertices in the Barabási–Albert model ensure that the resulting social network is well connected, in the sense that I described in Part 1. However, this leads to vulnerabilities. In the previous part I explained how the Erdős–Rényi model suggested that gradual removal of edges in a failing social network could lead to a phase transition into small siloed discussions. In a preferential attachment model, the situation can be much worse, in the sense that removing a few high-connectivity hubs can cause the whole network to degrade significantly. If Schiphol or Chicago airports closed down, a huge number of journeys would no longer be anything like as easy, and in the same way Twitter accounts with high follower numbers generate a disproportionate amount of the network traffic and linkages.
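A toy experiment gives a feel for this vulnerability. The sketch below grows a small preferential attachment network, then compares the largest connected group left after deleting 5% of accounts at random with what is left after deleting the 5% best-connected hubs. The sizes (n = 2000, m = 2), the 5% figure, the seed and the use of "largest connected group" as the measure of damage are all assumptions for illustration, and the outcome will vary with them.

```python
import random

def preferential_attachment(n, m, rng):
    """Grow a network in which newcomers favour well-connected vertices."""
    # Assumed starting point: a fully connected core of m + 1 vertices.
    neighbours = {v: set(range(m + 1)) - {v} for v in range(m + 1)}
    # Each vertex appears in 'endpoints' once per edge it touches.
    endpoints = [v for v in neighbours for _ in neighbours[v]]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        neighbours[new] = targets
        for target in sorted(targets):
            neighbours[target].add(new)
            endpoints.extend((new, target))
    return neighbours

def largest_component(neighbours, removed):
    """Size of the biggest connected group once the 'removed' vertices are deleted."""
    seen = set(removed)
    best = 0
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first search without recursion
            vertex = stack.pop()
            size += 1
            for other in neighbours[vertex]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        best = max(best, size)
    return best

rng = random.Random(1)
network = preferential_attachment(n=2000, m=2, rng=rng)
k = 100  # remove 5% of the accounts

hubs = sorted(network, key=lambda v: len(network[v]), reverse=True)[:k]
randoms = rng.sample(sorted(network), k)

intact = largest_component(network, set())
after_random = largest_component(network, randoms)
after_hubs = largest_component(network, hubs)
print("largest group, intact network:         ", intact)
print("largest group, random accounts removed:", after_random)
print("largest group, biggest hubs removed:   ", after_hubs)
```

The interesting comparison is between the last two lines: the same number of accounts is removed in each case, and only the choice of which accounts differs.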
For this reason, it feels as if the current Twitter strategy of down-weighting the value of previously verified accounts, by removing their blue ticks and generally fighting with the legacy media, is a risky one. In many cases these are the high-connectivity hubs that help keep the site together, and should they choose to walk away from the site it could significantly degrade its performance. In the long term it may be that the network will transform into something flatter and more akin to an Erdős–Rényi model, but in the meantime there is a risk that the value of Twitter will be significantly affected.
https://twitter.com/scottdetrow/status/1646139529072451585
For this reason and many others, it will be extremely interesting to see if any other large accounts follow the lead of NPR and PBS to move away from Twitter in the coming weeks, and what effect that has on the long run future of the site.
Note: because of the kind of network effects I have described, there is great power in you passing on this article. So if you found it interesting, please share it and if you aren’t already subscribing then please do!