Hendo’s Blog

Epidemiology

2026-05-17T00:00:00-05:00

Epidemiology is a serious subject and one I'm not really an expert in. So while this is a fun experiment don't treat it as a serious public health thingy.

I was playing the board game Twilight Imperium and came across an interesting scenario for analysing and modelling infections. It's a complex board game, with a large hexagonal map:

The map is made up of differing systems, each represented by a hexagon, which may contain planets. Different systems have differing planets, some with up to 3, some with none. One of the game's central mechanics is occupying and controlling these systems by placing "infantry" on them, representing populations (there are no civilian populations in the game, just military). As the game progresses, players spread out from home systems, pushing through unoccupied planets, eventually colliding into skirmishes and forming frontlines.

Systems from left to right, 3 Red infantry, 1 Blue and 1 Green infantry, no infantry

In a recent game a particular action card came up, Locust:

The idea is to unleash a targeted plague on the galaxy, aimed at your opponents' systems. But it depends partially on chance, the roll of a die, to determine its spread. So in some ways, this might behave like a real infection. I thought it would be fun to treat this as an experiment in statistical modelling, and see if we can predict the spread of the plague.

Now if you think about it, this is not a disease, and adheres to the card's name: it simulates a single "plague" of locusts moving on its own path, destroying things in its path. By contrast, diseases spread in all directions, multiplying and re-infecting, leading to much greater infection:

Nonetheless, in this scenario the locusts spread through connected populations and therefore behave somewhat like a communicable disease, so we're going to continue using the terminology "epidemic", "infection", etc.

Real epidemiology

In the real world, analysis and depiction of diseases is a much more complex topic than in our example. In real life, analysis has different purposes (there are probably official terms for these):

Retro-active analysis: looking backwards to ascertain truth, where an infection started, or what's causing it
Trying to analyse behaviour of disease (how it's transmitted)
Future facing analysis: trying to predict future infections and stop them

Retro-active analysis of an epidemic was the function of one of the most famous statistical illustrations out there: John Snow's Cholera Map of London.

The original can be found here: Project Gutenberg - On the Mode of Communication of Cholera https://www.gutenberg.org/files/72894/72894-h/72894-h.htm

John Snow, A Geography of Life and Death

Snow recorded deaths in London during a cholera pandemic, shown here by dots, and when viewed on a map, their clustering identified the Broad Street pump as the source of infection. When the pump was closed, this reduced rates of infection in the area (https://www.londonmuseum.org.uk/collections/london-stories/john-snow-cholera-broad-street-pump/). This nicely showcases the power of effective illustration in communicating data, and is a nice prelude to our example.

Real epidemiology is further complicated by the fact that the data itself is uncertain. Estimates for "transmissibility" vary, and are often revised retroactively. Infections and casualties may be over or under diagnosed, mis-attributed, and take time to analyse. Different people and populations have different susceptibilities to diseases, depending on age, genetics, access to medicine, climate, population density, etc.

There's a great chapter discussing these complexities in "The Signal and the Noise" (https://www.amazon.co.uk/Signal-Noise-Art-Science-Prediction/dp/0141975652). This discusses challenges in predicting epidemics, and how prediction / reporting of a disease is itself a variable in its spread.

In the scenario we'll be looking at we have a well defined population, and a perfect definition of how the infection spreads which remains constant across every transmission, greatly simplifying our model.

Plan

Anyway, let's go back to the game scenario. We want to model and illustrate the possible outcomes of the infection, and to do this we need two input variables:

Starting state: population of the map before the disease
Spreading behaviour

Our output will be a map of expected casualties.

Model

In epidemiology, transmissibility of a disease is expressed as an 'R-number' (https://en.wikipedia.org/wiki/Basic_reproduction_number). This represents the average number of people infected by each existing infection, and determines the rate of spread of the disease. Obviously factors like immunity and variations affect the real numbers, but the nice thing about large populations is that they tend to obey statistical laws at a macro level.

In our game the card says there is a constant chance of infection from one system to the next, and that the locusts can travel to "the same or adjacent" system, so systems can be re-infected. (The infection doesn't care about the number of planets, just systems and infantry.)

In reality the player playing the card has agency and can choose which system to target. They're likely to target it at systems with larger numbers of infantry, or protecting more strategic locations. We could attempt to model this by giving a greater likelihood to systems with higher populations, but then we're really just introducing more bias into the model.

So I'll go with the simplest possible model, where every adjacent system, plus the original system, has an equal likelihood of infection.

The chain of infections ends when it fails to infect, which is when the player rolls < 3. The game uses a 10 sided dice and 0's count as 10, so there is a 0.8 chance of the infection continuing, and a 0.2 chance of it stopping.

            1 2 = Infection stops
3 4 5 6 7 8 9 0 = Infection continues 

p = 0.8 (infection continues)
q = 0.2 (infection stops)

Each infection attempt is therefore a Bernoulli trial (a single yes / no outcome), and we can model its spread as a Geometric distribution with probability of infection (p) = 0.8, where k is the number of systems infected in the chain. This is because successes have to be successive. If we relate this to coin flipping, we are measuring the number of times you get heads in a row, not the number of heads you get when flipping a coin 10 times. That would be described by a Binomial Distribution.

We can see the expected spread in the table and graph below.

The final column of the table is probably what we're interested in. This is essentially the appropriately named Survival function, P(X ≥ k) = 0.8^k, and is equivalent to "what's the probability the infection gets to at least this number of infections before ending".

k (successes)	calculation	P(X = k)	P(X ≥ k)
0	0.2	0.2000	1.0000
1	0.8 × 0.2	0.1600	0.8000
2	0.8² × 0.2	0.1280	0.6400
3	0.8³ × 0.2	0.1024	0.5120
4	0.8⁴ × 0.2	0.0819	0.4096
5	0.8⁵ × 0.2	0.0655	0.3277
6	0.8⁶ × 0.2	0.0524	0.2621
7	0.8⁷ × 0.2	0.0419	0.2097
8	0.8⁸ × 0.2	0.0336	0.1678
9	0.8⁹ × 0.2	0.0268	0.1342
10	0.8¹⁰ × 0.2	0.0215	0.1074
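If you'd rather not trust hand arithmetic, the table can be regenerated in a few lines. This is a minimal sketch, assuming the 0.8 chance of continuing described above; the function names `pmf` and `at_least` are just mine.

```python
# Chance that each infection attempt succeeds (a d10 roll of 3 to 10).
P_CONTINUE = 0.8
P_STOP = 1 - P_CONTINUE


def pmf(k: int) -> float:
    """P(X = k): exactly k successful infections, then one failed roll."""
    return P_CONTINUE ** k * P_STOP


def at_least(k: int) -> float:
    """P(X >= k): the first k rolls all succeed."""
    return P_CONTINUE ** k


if __name__ == "__main__":
    print("k   P(X = k)  P(X >= k)")
    for k in range(11):
        print(f"{k:<3d} {pmf(k):.4f}    {at_least(k):.4f}")
```

Summing `pmf(k)` over a long enough range should come out at 1, which is a handy sanity check that the two columns agree with each other.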

As we can see, the geometric distribution means it trails off. If the infection travels in a straight line from the centre, we can map its expected reach as a heatmap.

But in reality the process is more involved. The infection can change direction, and not every system can be infected.

The system of actions is:

System is infected
10 sided dice is rolled
If result <= 2, the infection halts (remember 0 = 10 in this system)
If result >= 3, an infantry is killed
The infection moves to the same or adjacent system and repeats

This means that if a system has 1 infantry, and it's destroyed, it can't be re-infected. This makes the simulation more complex and means it probably can't be modelled with a simple geometric distribution.

Instead, the best way to model this will be a Monte Carlo simulation, in which we simulate potential outcomes and record the results.

To do this I'll build a model of the galaxy, simulate an infection until it peters out, reaching an eventual end-state with casualties. Then I'll run a number of simulations, and average the results, modelling expected outcomes and the average fatalities per system.

Building a model

All we need is an object for each system, with a population count (for each player), and list of adjacent systems.
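As a rough illustration of that object, here's a minimal sketch. The class name, field names and the `link` helper are all my own invention for this example (not necessarily what the real script uses), and the three-system strip is made up rather than taken from the board.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """One hexagon on the map."""
    name: str
    # Infantry per player, e.g. {"orange": 2}. Planets are ignored.
    population: dict[str, int] = field(default_factory=dict)
    # Names of the systems sharing an edge with this one.
    adjacent: list[str] = field(default_factory=list)

    def total(self) -> int:
        """Total infantry in the system, across all players."""
        return sum(self.population.values())


def link(galaxy: dict[str, System], a: str, b: str) -> None:
    """Record that systems a and b are adjacent (in both directions)."""
    galaxy[a].adjacent.append(b)
    galaxy[b].adjacent.append(a)


# A made-up strip of three systems, not the real game state.
galaxy = {
    "A": System("A", {"red": 3}),
    "B": System("B", {"blue": 1, "green": 1}),
    "C": System("C"),
}
link(galaxy, "A", "B")
link(galaxy, "B", "C")
```

Keying adjacency by name keeps the sketch independent of whichever hex coordinate scheme the map uses.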

Then we configure the model with the number of infantry in each system. I did this from a snapshot of the game state, and a Python script to parse the populations. Shown below, the populations turn out to be very sparse, significantly reducing the likelihood of a significant epidemic.

Start state

The originating player can choose any system adjacent to theirs in which to begin the infection. We could assume an equal likelihood amongst adjacent systems:

Or we could simulate from the point at which the player has chosen the first system to infect. This is the easiest option so is what we'll go with. The system chosen in the game was (-1,-1) on the above map, the tile with 2 orange population, below left of the centre.

Transmission

Now we model an infection. This is nice and simple based on the list of adjacent systems and their per-player populations. The player who played the card is treated as immune. (Technically, the player could deliberately target their own systems, and might if it provided a bridge to a particularly high value target.) The core loop is:

for _ in range(max_steps):
    roll = roll_d10()

    if roll <= 2:
        # infection halts
        break

    # roll >= 3: kill one unit in current system
    # pick a random player with population > 0
    mortal_players = [
        p for p in PLAYERS
        if p != IMMUNE_PLAYER and populations[current].get(p, 0) > 0
    ]

    victim = random.choice(mortal_players)
    populations[current][victim] -= 1
    casualties[current][victim] += 1

    # check if hex is now cleared
    if mortal_population(populations[current]) == 0:
        cleared.add(current)

    # find valid next systems
    candidates = infectable_candidates(current, populations, cleared)

    if not candidates:
        # nowhere left to spread - halt
        break

    current = random.choice(candidates)

You can find the complete script here: simulation.py
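If you just want to see the shape of the whole experiment without the full script, here's a self-contained toy version. It's a sketch under heavy assumptions: the four-system map and its populations are invented, there's only one (non-immune) player, and the function names are mine.

```python
import random

# Toy map: a strip of four systems. Populations and adjacency are invented.
ADJACENT = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
START_POPULATION = {"A": 2, "B": 1, "C": 0, "D": 3}


def run_once(start: str, rng: random.Random) -> dict[str, int]:
    """Simulate one plague and return the casualties per system."""
    population = dict(START_POPULATION)
    casualties = {name: 0 for name in population}
    current = start
    while population[current] > 0:
        roll = rng.randint(1, 10)  # a "0" on the physical die counts as 10
        if roll <= 2:
            break  # infection halts
        population[current] -= 1
        casualties[current] += 1
        # The plague may stay put or move to a neighbour, but only to
        # systems that still have infantry left.
        options = [current] + ADJACENT[current]
        candidates = [s for s in options if population[s] > 0]
        if not candidates:
            break  # nowhere left to spread
        current = rng.choice(candidates)
    return casualties


def mean_casualties(start: str, runs: int, seed: int = 0) -> dict[str, float]:
    """Average the per-system casualties over many simulated plagues."""
    rng = random.Random(seed)
    totals = {name: 0 for name in START_POPULATION}
    for _ in range(runs):
        for name, dead in run_once(start, rng).items():
            totals[name] += dead
    return {name: total / runs for name, total in totals.items()}


if __name__ == "__main__":
    for name, mean in mean_casualties("A", runs=500).items():
        print(f"{name}: {mean:.2f}")
```

In this toy map the empty system C acts as a firebreak, so D never takes casualties however many runs you do.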

Results

So what happens when we run our simulation?

I set the number of experiments at 500 simulations. Here we can see the first 9, showing the variation. In 2 of them the infection fails to inflict any casualties (this fits nicely with our q = 0.2), and in a couple the infection goes up and down the entire galaxy. This nicely displays the variation in results we can expect.

Runs 1 to 9

And, after 500 simulations, here's the map of average casualties per system, across all runs (results < 0.1 are hidden):

The mean casualties by player make sense:

Mean casualties by player:
  green       :     0.73
  orange      :     1.87
  pink        :     0.00 [IMMUNE]
  purple      :     0.91
  red         :     0.05
  yellow      :     0.00

Yellow is unaffected: all of yellow's populations are disconnected from the main system cluster
Yellow's nice little quarantine zone
Red is also largely unaffected: they only have 1 system reachable from the infection site, and it's 4 hops away. (Remembering our geometric distribution earlier, the infection only has a 40% chance of making it 4 tiles, and will only go this way in a small number of simulations.)
Orange is the most heavily affected. They have the most population to start with by far, and the infection starts in a system with 2 of theirs. The only thing limiting the spread is that orange's most populous systems are not directly connected to the infection site. The weird isolated 0.1 in (-3,3) is presumably because that system had a total of 9 population, increasing the chance the disease re-infects that system.
Purple and green are roughly equivalent, but purple is closer to the initial infection site, hence higher casualty count.

Overall the mean number of casualties was 3.6, and in 90% of runs no more than 8 were inflicted, although across all 500 runs one simulation resulted in 18.

Casualties per run:
  mean : 3.6
  min  : 0
  max  : 18
  p10  : 0
  p90  : 8

Reality

In reality the infection successfully hit system (-1,1), killing 1 infantry, and then moving to (0,1), where a 2 was rolled, and the infection stopped. This matches our expected results exactly, if we convert probabilities to integers.

Curve + exponential AI

This is a slight tangent to the original subject, but ties in ideas of prediction, data graphics and epidemics.

This article was written around the time of the "Mythos hypecycle", Anthropic's teasing of their "Mythos" LLM model and some fairly inaccurate analysis that's been reported. Anthropic, and many others, have been making this out to be a step change in capability, in fact, so insane at finding vulnerabilities that it is just too dangerous to be released (as opposed to maybe, far too expensive to run or buy). This is pretty much marketing BS. It is a step up, and in some ways significantly so, but not what's being made out.

The team at Mozilla wrote a good article, and reported that they were able to use Mythos to find a substantial number of bugs across their systems (https://hacks.mozilla.org/2026/05/behind-the-scenes-hardening-firefox). However, exclusively attributing this to the performance of Mythos would be a mistake:

The Mozilla fuzzing team are world experts in fuzzing, and have clearly built an excellent bug hunting pipeline
In fact, they state that their harness performs almost as well with other models

There's a great statistical analysis here: https://pointestimate.substack.com/p/how-good-is-mythos, by far the most rational thing I've read on this subject. It presents lots of graphs side by side, commenting on the trends, the error bars, and the reliability of the metrics. (I like this article, but its provenance is a bit odd: the substack has only a single post, and no linked identity etc. That absolutely doesn't mean it can't be trusted, but there's no historic record of other analyses / opinions to read up on to understand the source and its potential biases.)

That article analyses outcomes from the UK's AI Security Institute (AISI) research (https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities), one of the more reputable bodies in LLM analysis. Almost everyone else on the internet is just losing their heads and spouting nonsense. I'd recommend reading the full article, but a few points are worth extracting:

Mythos (and OpenAI's competitor, GPT5.5) are improvements over other models
However they are fairly well within existing trends of model improvement, and not outside anticipated performance
This result varies depending on which benchmark is used
They have downsides: they are very expensive to run for complex tasks

Here's a trend line from the pointestimate article:

This shows the new models (Mythos and GPT5.5) as well within the thresholds of existing trends. Again bear in mind this is just showing performance on one benchmark, with relatively few datapoints. It's not immediately clear why a curve with that number of parameters was chosen, but you could probably fit a linear trend too.

Now here's another fucking monstrosity of a graph that I saw doing the rounds on social media:

The original source can be found here: https://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html, which to be fair, has a bit more of a nuanced explanation. Additionally it's from 2015, discussing potential future advances in AI.

While this particular graph doesn't claim to show actual AI advances, and isn't based on any recent data, some people have been comparing this "prediction" with the now-famous Model Evaluation & Threat Research group (METR) report: Measuring AI Ability to Complete Long Tasks (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/), which has been widely shared as showing an exponential trend in capability for new AI models. There's a good analysis of this in MIT Technology Review: https://www.technologyreview.com/2026/02/05/1132254/this-is-the-most-misunderstood-graph-in-ai/.

(Just a warning: this graph has been doing the rounds both in linear form (shown here) AND logarithmic form. It's good of the authors to provide both, but it's a bit confusing that there are now 2 different versions of the graph being circulated.)

When people share this graph, they're attempting to convey the following:

Historic analysis suggests a linear improvement in model capability
New models (Mythos / GPT5.5) break that trend
The overall trend in model performance is exponential, and this is the start of that

This is flawed analysis for a number of reasons:

Lack of data

First off, this is a pretty sparse dataset to use for any sort of prediction. The graphs from METR have ~ 25 datapoints belonging to state of the art (SOTA) frontier models, not really enough for confident long term predictions, although it's all we have to work with for the time being.

Mis-representation of complex data

This graph is representing the performance of LLMs as a scalar variable: a single number that goes up or down. Obviously this is over-simplistic. You can measure models on a wide variety of different benchmarks. Our situation is even worse, since plenty of the people publishing benchmarks are motivated to fudge the numbers, and there is evidence the models are just learning the training / evaluation data, leading to having to adjust and shift benchmarks (discussed here: https://pointestimate.substack.com/p/how-good-is-mythos#footnote-3). If you look on the Ollama model pages, every single model somehow comes up with a benchmark that shows itself as the most performant model.

Lack of prediction of exponential trends

This is neatly shown in that chapter of "The Signal and the Noise", when it comes to extrapolating exponential trends. It is very hard to predict exponential trends without a real understanding of the system, as you simply do not know the exponent. And since the data is exponential, small errors compound massively. This graph shows possible predictions for the spread of AIDS, based on 5 years of data, with a 95% prediction band. In other words at any point on the X axis, given the observed trend, 95% of possible futures are between the error bands. (In fact, scientists had miscalculated the exponent, and the true rate of infection was significantly higher.)

As you can see, there is an insane range of possible future outcomes and results based on this data. And this is when we know that the data obeys an exponential trend, as infectious diseases do, not when you are guessing the shape of the curve.
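To get a feel for how quickly a small error in the exponent compounds, here's a quick sketch. Every number in it is invented for illustration (1,000 starting cases, a growth rate of 0.5 per year give or take 0.1); none of it comes from the book or from real data.

```python
import math


def extrapolate(start: float, rate: float, years: float) -> float:
    """Project start * e^(rate * years)."""
    return start * math.exp(rate * years)


# Invented numbers: 1,000 cases today, growth rate estimated at 0.50 per
# year, but the estimate could be off by 0.1 either way.
START_CASES = 1_000
RATES = {"low": 0.40, "central": 0.50, "high": 0.60}

if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        row = {name: extrapolate(START_CASES, r, years) for name, r in RATES.items()}
        spread = row["high"] / row["low"]
        cells = ", ".join(f"{name}={value:,.0f}" for name, value in row.items())
        print(f"{years:>2} years: {cells}  (high/low = {spread:.1f}x)")
```

The gap between the high and low projections is e^(0.2 × years), so it's under 3x after 5 years but around 55x after 20, from what looked like a modest uncertainty in the rate.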

By contrast, the METR graph does not have 5 years of data behind it, nor the decades of study that communicable diseases have had.

Now, we can't say that the increase in LLM capability won't be exponential, it might be. But you can't take a tiny number of studies, take 1 or 2 data points, plot arbitrary tangents and then confidently predict an exponential trend from that.

If you don't believe me, take a look at my awesome graph below, which magically can show almost any trend you want:

Anyway, this is not to dismiss the work of the various safety institutes. There is serious work being done, and the research and graphics aren't to blame. The issue is problematic mis-interpretation of the data, or over-willingness to read trends from the data.

Summary

Okay, rant on misleading graphs aside, this was an enjoyable experiment. Things wot I learned:

Geometric vs binomial distributions, and survival functions
Difficulty inherent in modelling and predicting exponential trends
Factors affecting ability to predict infectious diseases, from Chapter 7, The Signal and the Noise

If you want another good book on the subject I can recommend The Art of Uncertainty, by David Spiegelhalter. It too covers the ability to predict things, and how we reason about unknowns.

RADIO

2025-02-27T00:00:00-06:00

In February last year I attended Disobey in Finland (https://disobey.fi/2025/). It was a great event, although very cold. At the event was a good CTF, with a bunch of physical challenges that were hosted in the building. I didn't spend too much time in the CTF, but in this post I'm going to write up my failed attempt to solve one of the challenges.

I didn't solve this at the event, just downloaded the capture file, and later took a look. As such, I didn't have the original challenge name, the description, or any other supporting files. All I had was this one file: capture.complex16. This was also my first time diving into practical signal analysis and radio hacking, so was a learning journey.

Radio basics

First off, it may be helpful to quickly familiarise yourself with the basics of radio signals:

Radio signals are transmitted via Sine waves in the RF spectrum. At a given point in time we can measure the FREQUENCY at which the wave oscillates, and the AMPLITUDE, or signal strength, of waves, using an antenna.

From The Wireless Cookbook, figure 1-1

We communicate information by encoding the information into 1's and 0's, and then super-imposing this onto a sine wave. The wave without information is known as a carrier frequency, and the act of super-imposing information onto a carrier is known as MODULATION. There are multiple forms of modulation:

changing the Amplitude over time: this is known as Amplitude modulation (AM). AM of 1's and 0's is known as Amplitude Shift Keying (ASK).
changing the Frequency over time: this is known as Frequency modulation (FM). FM of 1's and 0's is known as Frequency Shift Keying (FSK).
changing the "phase" of the sine wave: this is known as Phase Shift Keying (PSK)
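To make the three schemes a bit more concrete, here's a small sketch that imposes a bit string onto a sine wave in each of the three ways. All the numbers (samples per bit, cycles per bit, the quiet amplitude) are invented for illustration, and real modulators are considerably more careful than this.

```python
import math

SAMPLES_PER_BIT = 100  # invented; real captures depend on the sample rate


def modulate(bits: str, scheme: str) -> list[float]:
    """Return carrier samples with `bits` imposed using ASK, FSK or PSK."""
    samples = []
    phase = 0.0
    for bit in bits:
        one = bit == "1"
        # Defaults: the unmodulated carrier, 5 cycles per bit period.
        amplitude, cycles, offset = 1.0, 5.0, 0.0
        if scheme == "ASK":
            amplitude = 1.0 if one else 0.2   # loud = 1, quiet = 0
        elif scheme == "FSK":
            cycles = 8.0 if one else 4.0      # fast = 1, slow = 0
        elif scheme == "PSK":
            offset = math.pi if one else 0.0  # flipped = 1, normal = 0
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        step = 2 * math.pi * cycles / SAMPLES_PER_BIT
        for _ in range(SAMPLES_PER_BIT):
            samples.append(amplitude * math.sin(phase + offset))
            phase += step
    return samples
```

Plotting `modulate("1011", "FSK")` next to the ASK and PSK versions gives roughly the picture in the diagrams: same bits, three different things being wiggled.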

This diagram shows FSK as "Digital Modulation"

To retrieve information from a radio signal, we simply perform this process in reverse:

we receive the signal with an antenna, and do some horrible maths to recover the frequencies it contains
we identify and isolate a particular signal within the data (A frequency range containing a signal is referred to as a BAND or CHANNEL)
we DE-modulate the signal to recover 1's and 0's
we deconstruct the protocol and decode the data to recover the original information

At some point we may need to perform some decryption as well, depending on the protocol.

This is what we are going to do for this challenge: we have a captured signal, and will deconstruct it to find the original information (a flag).

Loading the file

The challenge gives us a file: capture.complex16. The file name tells us this is a captured radio signal using 16 bits per I/Q sample (https://github.com/jopohl/urh/wiki/Supported-signal-file-formats). Therefore we're going to use the excellent and intuitive tool URH (https://github.com/jopohl/urh?tab=readme-ov-file) to analyse the data, and hopefully extract a flag.

First off, we download and open URH:

pipx install urh

We then load the capture.complex16 file, and see ... err nothing? Taking a closer look at the file formats supported, it looks like URH relies on the file extension to guess the file type, since it imports raw binary data with no headers etc.

We copy the file to 2 files for the 16bit format: capture.complex16s / capture.complex16u, and open these:

.complex16u shows a signal, but it doesn't look like a proper spectrum and URH complains it cannot understand it
.complex16s actually looks much more like what we expect, and URH instantly is able to understand it. So it must be this one.
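For the curious, the difference between the signed and unsigned variants is easy to sketch in Python. This assumes the layout is interleaved 8 bit I and Q values, which is my reading of the URH wiki page; the subtraction of 128 in the unsigned reader is an assumed centring, not something taken from URH's source.

```python
import array


def read_complex16s(raw: bytes) -> list[complex]:
    """Interpret bytes as interleaved signed 8 bit I and Q values."""
    values = array.array("b", raw[: len(raw) - len(raw) % 2])
    return [complex(values[i], values[i + 1]) for i in range(0, len(values), 2)]


def read_complex16u(raw: bytes) -> list[complex]:
    """Same layout, but unsigned bytes centred on 128 (an assumed offset)."""
    values = array.array("B", raw[: len(raw) - len(raw) % 2])
    return [
        complex(values[i] - 128, values[i + 1] - 128)
        for i in range(0, len(values), 2)
    ]
```

Reading the capture would then look something like `read_complex16s(open("capture.complex16", "rb").read())`. Interpreting signed data as unsigned turns every small negative value into a large positive one, which is presumably why the wrong guess produces something that doesn't look like a sensible signal.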

The first thing we see is the Spectrogram view. This shows a view of the strength of each frequency over time, as a visualisation of the RF spectrum. We can see distinct "bursts" of traffic, we will call these "packets", which we will hypothesize contain a number of "symbols" (1's / 0's), and we can see our data is contained in a single band.

URH also tells us the length of the conversation is 14 seconds, although this isn't hugely helpful here. If we look at the number of symbols within the timespan (you can just make them out as dark and light bars within each packet shown above) it tells us the thing encoding / decoding this signal is operating faster than a human could feasibly process: we are not dealing with human created morse code, typed out by hand, but with a digital system meant to be sent and processed by a computer.

First off we highlight just the frequency range we're interested in in URH, and select "Apply bandpass filter" from the right click menu. This crops the data to only these frequencies, like cropping the edges of an image. Removing the redundant background noise helps later analysis and makes the image clearer. We also adjust the Data_min and Data_max values in URH to make the visualisation clearer.

Demodulation

So the first step is to Demodulate the signal, ie recover 1's and 0's from the sine wave. URH in fact is pretty good at this for basic modulations such as ASK and FSK, and can do it automatically, but we'll just verify it manually.

First off we zoom in on a packet in the spectrogram view:

We can see here that the signal is switching between 2 distinct frequencies:

If we switch to the Analogue view in URH, we can see amplitude vs time, which confirms that the frequency is changing over time within the signal. The darker sections have more cycles per unit of time, and are therefore higher frequency, while the lighter sections are lower:

Since we can observe the frequency switching between 2 distinct values (and not a continuous range of values) we will deduce this is a form of digital information (ie 1's and 0's) and that the modulation is Frequency Shift Keying (FSK). In URH we can now switch the modulation type to FSK, and move to the "Demodulated view" which shows us how to interpret the data. We drag the centre line between the clear ridges, and this is how we determine each "bit" of data. If the frequency is above the centre it is a 1, and if it is below it is a 0. We can invert these later if we want, when doing further decoding, but for the time being we only care about separating the 1's from the 0's. We can tell this is the correct decoding, as there is a clear separation between the peaks and troughs:
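The centre line idea can be sketched in plain Python. This is not how URH implements it, just a minimal illustration of the principle: estimate the frequency from the phase step between consecutive I/Q samples, then compare the average over each symbol with a threshold. The `synth_fsk` helper and all the frequencies used with it are invented for testing.

```python
import cmath
import math


def instantaneous_frequency(iq: list[complex]) -> list[float]:
    """Frequency in cycles per sample, from the phase step between samples."""
    return [
        cmath.phase(b * a.conjugate()) / (2 * math.pi)
        for a, b in zip(iq, iq[1:])
    ]


def slice_bits(freqs: list[float], centre: float, samples_per_symbol: int) -> str:
    """Average each symbol-length window and compare it with the centre line."""
    bits = []
    for start in range(0, len(freqs), samples_per_symbol):
        window = freqs[start : start + samples_per_symbol]
        if len(window) < samples_per_symbol // 2:
            break  # ignore a trailing fragment shorter than half a symbol
        bits.append("1" if sum(window) / len(window) > centre else "0")
    return "".join(bits)


def synth_fsk(bits: str, f0: float, f1: float, samples_per_symbol: int) -> list[complex]:
    """Test signal: a complex tone at f1 for each 1 and f0 for each 0."""
    iq, phase = [], 0.0
    for bit in bits:
        step = 2 * math.pi * (f1 if bit == "1" else f0)
        for _ in range(samples_per_symbol):
            iq.append(cmath.exp(1j * phase))
            phase += step
    return iq
```

Feeding `synth_fsk("0110100", 0.01, 0.03, 50)` through the two functions with a centre of 0.02 should hand back the same bit string. A real capture also has noise and needs the symbol boundaries aligned first, which this sketch ignores.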

We can see that the various peaks and troughs differ in length. We usually interpret this as being repeated symbols, ie 11 would be represented as a peak twice as wide as a 1 by itself. To fully demodulate the data we need to work out the length of each individual symbol. Luckily URH does this for us, and guesses 1200, which turns out to be right. But we can also do it ourselves. Typically we look for the greatest common divisor of the various lengths represented, and the smallest individual peak / trough. Here, if we select a small bump, URH tells us that it is ~ 1200 samples long. If we select a few other peaks / troughs at random we can see they are all multiples of 1200 (2400, 4800). This gives a good level of confidence that the width of each symbol is 1200 samples.
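The "common divisor of the run lengths" trick is simple enough to sketch. The measured lengths are never exact, so this version first snaps each one to a coarse grid before taking the GCD; the grid size of 50 samples is an arbitrary choice of mine, and a noisy capture could still fool it.

```python
from functools import reduce
from math import gcd


def run_lengths(levels: list[int]) -> list[int]:
    """Lengths of each run of identical consecutive values."""
    runs, count = [], 1
    for previous, current in zip(levels, levels[1:]):
        if current == previous:
            count += 1
        else:
            runs.append(count)
            count = 1
    if levels:
        runs.append(count)
    return runs


def estimate_symbol_length(runs: list[int], tolerance: int = 50) -> int:
    """GCD of the run lengths after snapping them to a grid of `tolerance` samples."""
    snapped = [round(r / tolerance) * tolerance for r in runs]
    return reduce(gcd, [s for s in snapped if s > 0])
```

For example, slightly noisy measurements such as `[1203, 2395, 4810, 1198]` snap to 1200, 2400, 4800 and 1200, giving an estimate of 1200 samples per symbol.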

If we take a closer look at both this demodulated view, and the original spectrogram, we can see some interesting properties of the underlying radio signal:

At each "Square" wave, we see a spike, that calms down to a more stable frequency, these are "overshoot" and "ringing" signals, https://en.wikipedia.org/wiki/Ringing_(signal), as described in CRYPTONOMICON's illustration of Van Eck phreaking:

We can also see radiating "echoes" of the original frequencies, diminishing in strength the further from the carrier frequency. These are SIDELOBES, generated as an artifact of antenna mechanics (https://en.wikipedia.org/wiki/Sidelobes):

These give us a strong indication this signal was generated from an actual capture, and not synthetically generated.

Decoding

Now we think we've worked out how to convert the wave to 1's and 0's by demodulating it, let's take a look at retrieving actual information from the bits, ie decoding. Given our demodulation parameters: FSK with a centre frequency of 0.015 and a samples/symbol of 1200, we get the following bits for each message:

011110000111100001000001011111100100010100111001100000000000000011110010000
111001010010101110011101001001010001001111111010011100011001100000101000000
[Pause: 2080068 samples]

011110000111100001000001011111100100010100111001100000001000000000000000001
1010110000
[Pause: 2174415 samples]

011110000111100001000001011111100100010100111001100000010000000101010110001
01000001101011010
[Pause: 2049324 samples]

011110000111100001000001011111100100010100111001100000011000000000000001010
11001100
[Pause: 2115700 samples]

000001111000011110000100000101111110010001010011100110000010000000010101000
000111000000011001101010000000101100000001101001111110011100100010110101000
111011101111101101110000111111000110110110010101101001001010011101010010000
100001111101101110000000001001110111101001111010100100111001101101000110110
001110010110101100111100000000111110011111100010100111001100001010101100000
1110011000100101111101011101111100011011100001110100100000000
[Pause: 3539490 samples]

Here again in hexadecimal:

7878417e45398000f21ca573a4a27f4e330500
[Pause: 2080029 samples]

7878417e45398080003580
[Pause: 2174377 samples]

7878417e45398101562835a
[Pause: 2049300 samples]

7878417e4539818001598
[Pause: 2115661 samples]

07878417e4539820150380cd40580d3f3916a3bbedc3f1b656929d4843edc013bd3d49cda3639
6b3c03e7e29cc2ac1cc4bebbe370e90000
[Pause: 3539490 samples]

And here in ASCII (printable chars only):

xxA~E9���s��N3%

xxA~E9��5�%

xxA~E9�V(5� (LSB padded with 0)

?9�����V��HC���=Iͣc���>~)�*��K�7�%

So far, so meaningless. If we were hoping this would nicely translate to human readable text we were wrong. First off, let's dump this into CyberChef, using the binary representation, and use a recipe From Binary > Brute force text encoding (decode). (CyberChef recipe link: https://gchq.github.io/CyberChef/#recipe=From_Binary(Space,8)Text_Encoding_Brute_Force(Encode)&input=MDExMTEwMDAwMTExMTAwMDAxMDAwMDAxMDExMTExMTAwMTAwMDEwMTAwMTExMDAxMTAwMDAwMDAwMDAwMDAwMDExMTEwMDEwMDAwMTExMDAxMDEwMDEwMTAxMTEwMDExMTAxMDAxMDAxMDEwMDAxMDAxMTExMTExMDEwMDExMTAwMDExMDAxMTAwMDAwMTAxMDAwMDAw&oeol=NEL ) This will show us a table of the messages decoded using a variety of text encodings, which helps identify any unusual encodings. We also run this with a bit length of 7 and 8, to account for different character lengths. We can also try inserting a "swap endianness" block, to check both LSB and MSB, as radios / networks often work with either. Nothing massively helpful comes out of this: if we were hoping a flag would jump out at us, all the potential decodings include seemingly random non-printable characters.

The only clear pattern we can make out is that the first 4 messages begin with the binary / hexadecimal string:
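The same kind of poking can be done outside CyberChef. Here's a minimal sketch that reads a bit string in groups of 7 or 8 bits, optionally reversing the bit order inside each group (my interpretation of the LSB / MSB check), and shows non-printable values as dots. The function names are mine and this only covers plain ASCII, not the full brute force table.

```python
def chunks(bits: str, width: int) -> list[str]:
    """Split into groups of `width` bits, dropping any incomplete tail."""
    return [bits[i : i + width] for i in range(0, len(bits) - width + 1, width)]


def decode(bits: str, width: int = 8, lsb_first: bool = False) -> str:
    """Read each group as a character code; non-printables become dots."""
    out = []
    for group in chunks(bits, width):
        value = int(group[::-1] if lsb_first else group, 2)
        out.append(chr(value) if 32 <= value < 127 else ".")
    return "".join(out)


if __name__ == "__main__":
    sample = "0111100001111000"  # the first 16 bits of the first message
    for width in (7, 8):
        for lsb_first in (False, True):
            print(width, lsb_first, decode(sample, width, lsb_first))
```

With the default settings the sample decodes to `xx`, matching the start of the ASCII output above.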

Binary: 0111100001111000010000010111111001000101001110011000000
Hexadecimal: 7878417e45398

Which decodes in ASCII to xxA~E9, which has no obvious meaning.

This is 55 bits long, an odd length, not a multiple of 7 or 8 commonly associated with ASCII / byte encodings. It is divisible by 5, which is the symbol length of the original Baudot encoding, used in telegraphs and punched tape readers (https://en.wikipedia.org/wiki/Baudot_code), but putting the messages into https://www.dcode.fr/baudot-code yields no obvious plaintext either:

A link in Wikipedia mentions the Bacon cipher, which uses groups of 5 binary symbols. A quick experiment in CyberChef also shows nothing from this.

At this point we google the string xxA~E9, in case it's a common identifier for some known radio protocol. This reveals nothing however. Another potential approach is to shave bits off the beginning of the message, effectively bitshifting the result. This is in case the protocol begins with a number of bits before beginning the 7 / 8 bit encoding scheme, which would misalign CyberChef's results. However looking at each of the various offsets between 1 and 8 reveals nothing.
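Shaving bits off the front is easy to automate. A minimal sketch, assuming 8 bit characters and treating anything outside printable ASCII as a dot (the helper names are mine):

```python
def printable_ascii(bits: str) -> str:
    """Decode whole bytes only; non-printable values become dots."""
    out = []
    for i in range(0, len(bits) - 7, 8):
        value = int(bits[i : i + 8], 2)
        out.append(chr(value) if 32 <= value < 127 else ".")
    return "".join(out)


def try_offsets(bits: str, max_offset: int = 8) -> dict[int, str]:
    """Decode the message after dropping 0..max_offset leading bits."""
    return {
        offset: printable_ascii(bits[offset:])
        for offset in range(max_offset + 1)
    }


if __name__ == "__main__":
    # A made-up message: three junk bits followed by the bytes of "flag".
    message = "101" + "".join(f"{b:08b}" for b in b"flag")
    for offset, text in try_offsets(message).items():
        print(offset, text)
```

If the payload really were byte-aligned text behind a few leading bits, one of the offsets would stand out as readable, as offset 3 does in the made-up example.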

If we look at the final message we see it doesn't begin with this sequence, but that the sequence is offset into the start of the message, except for the 55th bit, which is different:

This could mean the motif serves as some kind of identifier to signify the start of data, in case of glitches. This is known as a PREAMBLE.

If we look at the binary representation, a couple of things can be made out:

while offset by the first bit, we see oscillating patterns of 4:
- 4 1's
- 4 0's
- 4 1's
- 4 0's

if we take these as 8 bit symbols we have something nice and symmetrical:

without starting 0:
11110000 (240 decimal)
11110000 (240 decimal)

with starting 0:
01111000  (120 decimal, x ASCII)
01111000  (120 decimal, x ASCII)

This also means that taken from the first symbol and divided into groups of 4 or 8, the chunks are inverses:

However, both these patterns only hold for the first 16 / 17 symbols, and don't explain the rest of the pattern. They do kind of feel like a preamble, since preambles often use alternating patterns of a fixed length to synchronise the bitrate, but they convey no clear meaning. They also don't really help us understand the rest of the message.

URH actually has an entire dedicated tab for analysis, so let's use this:

This allows us to select various decoding techniques, visualise the output, and analyse the structure of each message. Here we have selected the default Non Return to Zero (NRZ) encoding, which is just an irritating way of saying high=1, low=0, as we'd already assumed. On the right we can see that URH has auto-highlighted sections of the message. These highlights are the tool's way of guessing what the different segments mean, and here it's guessed the same as we did: the xxA~E9 is some kind of preamble / sync word. It thinks the green blobs are checksums (checksums are usually at the end of the message), which might be the case, but we'd need to find an algorithm that makes sense. So this decoding doesn't yield results, but what about the others? I tried cycling through all the inbuilt encodings URH knew about:

morse code
inverse NRZ
manchester encoding
differential manchester encoding

But no luck. URH also allows you to create your own decodings, either using the inbuilt drag and drop interface, or via an external program. An external program would help if I knew what the encoding scheme was, but I don't. Instead I cobble together a few basic potential encodings with the editor:

Invert bits
Invert endianness
Remove data whitening
Differential encoding

For each of these, I take the binary output and run through the earlier cyberchef recipe to detect potential text encodings for both 7 / 8 bit digits.

Still nothing.
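For reference, a few of the transforms tried above are simple enough to sketch by hand. These are not URH's implementations, just hypothetical helpers following common textbook conventions; in particular the Manchester mapping shown is only one of the two conventions in use (the other is simply its inverse).

```python
# The 55-bit motif shared by the first four messages (copied from above).
MOTIF = "0111100001111000010000010111111001000101001110011000000"


def invert(bits: str) -> str:
    """Flip every bit: 0 -> 1 and 1 -> 0."""
    return "".join("1" if b == "0" else "0" for b in bits)


def differential_decode(bits: str) -> str:
    """Emit 1 where a bit differs from the previous one, 0 where it repeats.
    The output is one bit shorter than the input."""
    return "".join("1" if a != b else "0" for a, b in zip(bits, bits[1:]))


def manchester_decode(bits: str) -> str:
    """Read the stream two half-bits at a time: '01' -> 1, '10' -> 0.
    '00' and '11' are not valid Manchester pairs and are shown as '?'."""
    table = {"01": "1", "10": "0"}
    pairs = (bits[i:i + 2] for i in range(0, len(bits) - 1, 2))
    return "".join(table.get(pair, "?") for pair in pairs)


if __name__ == "__main__":
    for name, transform in (("inverted", invert),
                            ("differential", differential_decode),
                            ("manchester", manchester_decode)):
        print(f"{name:>12}: {transform(MOTIF)}")
```

A stream that really was Manchester encoded would decode without any '?' symbols, so a result full of them is a quick hint that the guess is wrong.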

Conclusions:

the message is not a straightforward text encoding

Symbol analysis

Let's see if the contents of the symbols tell us something about the message. The characters used and their frequency can often give an interesting insight into the contents and structure of a binary message. To do this we'll generate a quick frequency histogram of characters taken from the binary decoding using CyberChef:

If this had resulted in a small spike of, say 26 symbols (the length of the alphabet), it would imply we were dealing with a nice and simple substitution cipher, ROT13 etc. We can also intuit those types of simple ciphers by looking for common patterns in the flag format that are echoed in the ciphertext:

However the data looks pretty much random and evenly distributed between 0 and 255, which potentially implies one of several conclusions:

The data is encrypted, as encryption algorithms are often engineered to produce data indistinguishable from random streams
The data is encoded / compressed: compression algorithms will seek to make full use of the available symbol alphabet, obscuring frequency patterns in the original data

However with this short a signal, we are unlikely to be able to draw many conclusions from this: frequency analysis is only effective on sufficiently long texts for patterns to emerge, and relative frequencies to be realised. Compression seems redundant for messages this short, but as this is a CTF we'll briefly explore the hypothesis that the data is encrypted.
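The histogram idea can be put into numbers with a rough Shannon entropy estimate. This is a sketch only: the sample bytes are the hex for the first message as used later in the file analysis, and the helper name is an illustrative choice.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for a perfectly
    even spread over all 256 byte values."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    sample = bytes.fromhex("7878417e45398000f21ca573a4a27f4e330440")
    print(f"{len(sample)} bytes, {len(set(sample))} distinct values")
    print(f"entropy: {shannon_entropy(sample):.2f} bits per byte")
    # A message of n bytes can never score above log2(n), however random it is.
    print(f"ceiling for this length: {math.log2(len(sample)):.2f} bits per byte")
```

The last line is the reason for caution above: a message of only a few dozen bytes cannot get anywhere near 8 bits per byte, so a "random looking" histogram on its own proves very little.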

We've already established that we don't think it's a form of encryption based on a substitution cipher, due to the entropy of the data and the lack of repeated patterns in the ASCII. One common form of simple encryption, used in insecure protocols and commonly seen in CTFs, is XOR. XOR encryption would result in output characters across the full 8-bit range (0-255), as we appear to be seeing here.

If the encryption used was XOR, the repeated pattern of xxA~E9 could be an artifact of XOR encryption using a key against a preamble consisting of 0's: XORing a key with zeros simply outputs the key, which makes this a form of "known plaintext" attack against XOR. Trying this in CyberChef reveals no meaningful data however. A couple of other quick and dirty checks for XOR encryption:

XOR brute force (key len 1): nothing
XOR brute force (key len >1, with crib: disobey): nothing
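The single-byte check is small enough to sketch directly. This is a hypothetical stand-in for the CyberChef recipe, not a copy of it; the crib "disobey" is taken from the list above, and the helper names are made up.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR `data` with `key`, repeating the key as often as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def single_byte_hits(data: bytes, crib: bytes) -> list:
    """Try all 256 one-byte keys and keep those whose output contains the crib."""
    hits = []
    for key in range(256):
        plain = xor_bytes(data, bytes([key]))
        if crib in plain:
            hits.append((key, plain))
    return hits


if __name__ == "__main__":
    message = bytes.fromhex("7878417e45398000f21ca573a4a27f4e330440")
    print(single_byte_hits(message, b"disobey"))

    # Known-plaintext idea: a preamble of zero bytes XORed with a key leaks the key.
    print(xor_bytes(bytes(6), b"xxA~E9"))
```

The second print shows why an all-zero preamble would be such a gift: the ciphertext would simply be the key itself.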

At this point, we are likely over-complicating things, and should fall back on Occam's razor: "other things being equal, simpler explanations are generally better than more complex ones". We've begun to make too many assumptions, assuming the message is encrypted, assuming the form of the encryption (e.g. XOR), and then beginning to assume the key length and the plaintext format.

This is the NSA's first law of cryptanalysis: look for plaintext.

In fact the simplest explanation for the xxA~E9 preamble is that it is simply a preamble, and not any form of leaked XOR key, and we should fall back to this line of thought before beginning to engage in fanciful explorations of cryptography.

Conclusions:

Until we find evidence, the signal is unlikely to be encrypted or compressed

ASK??

Near the beginning of the challenge, we noticed the signal appeared to be modulated in the frequency domain (ie, using FSK). But was that an assumption we should have proceeded with? Let's quickly double check whether the signal could in fact be modulated in the Amplitude domain, ie, using ASK.

If we examine the analog signal, we can in fact see slight differences in amplitude; these could be artefacts of ASK modulation

However, if we actually look closer at the signal we can see 2 reasons this doesn't quite make sense:

The differences in amplitude are too small to be discernible amongst the noise: we would expect much more distinct peaks and troughs
The differences in amplitude actually align with the changes in frequency, and don't appear to convey any extra information

Conclusion:

the signal is not ASK

Signal lengths

Another point about the signals we haven't examined is the length of each message: for ease of processing radio signals will often be sent in "packets" of predictable lengths, to help the receiver distinguish individual messages in a potentially noisy channel. If we look at the bit lengths of our messages we see the following:

Message 1	150 bits
Message 2	85 bits
Message 3	92 bits
Message 4	83 bits
Message 5	436 bits

A couple of observations here:

The messages are not all the same length, as a very rough estimate we have:
- 1 medium (msg1 = 150)
- 3 short (msg2, msg3, msg4 ~= 86)
- 1 long (msg5 = 436)
The fifth and final message is significantly longer than the others. One interpretation of this is that packets 1-4 are some handshake, and the fifth contains the data payload:
- short messages are usually used to convey signalling data (think like TCP SYN/ACK/SYNACK meta packets)
- long messages are often associated with the actual transmission of data

Therefore the fifth packet is most likely where our data (flag) is, as it is longer than the others, and long enough to contain a flag, although this is an assumption.

There is no common divisor shared by all the lengths (83 is prime). This has multiple explanations:
- minor glitches in the signal have resulted in incorrect demodulation, throwing off some messages length by 1 or 2
- packets do not represent compositions of even-length symbols. For example, if all our packets could be decoded to ASCII, we would expect them to all be multiples of 7 / 8. If messages were AES encrypted, we could expect them to roughly align with the blocksize of AES, and if Baudot code, 5 bits.
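The divisibility claims are quick to check. A small sketch using the lengths from the table above; the candidate symbol widths of 5, 7 and 8 bits are the ones already discussed (Baudot, 7-bit ASCII, bytes).

```python
from functools import reduce
from math import gcd

# Bit lengths of the five messages, from the table above.
LENGTHS = {1: 150, 2: 85, 3: 92, 4: 83, 5: 436}


def common_divisor(lengths) -> int:
    """Greatest common divisor of all the lengths (1 means nothing is shared)."""
    return reduce(gcd, lengths)


def remainders(lengths: dict, width: int) -> dict:
    """Bits left over in each message after splitting it into `width`-bit symbols."""
    return {msg: bits % width for msg, bits in lengths.items()}


if __name__ == "__main__":
    print("common divisor:", common_divisor(LENGTHS.values()))
    for width in (5, 7, 8):
        print(f"leftover bits at width {width}: {remainders(LENGTHS, width)}")
```

A leftover of zero for every message at some width would have been a strong hint towards that symbol size; no width here manages it.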

Participants

Radio communications can either be unidirectional (ie broadcast), whereby a single party distributes information to a large number of potential receivers (e.g. television), or a bi-directional communication between 2 or more parties. In other words, a given signal can represent information from A->B, or from both A->B and B->A. In fact radio signals can even represent communications between much larger numbers of participants (https://en.wikipedia.org/wiki/Channel_access_method).

It might be relevant to the challenge to understand if we're looking at the messages between multiple parties, or listening to a solitary ALICE screaming into the void. For example if we're looking at 2 parties, we might want to see if we can distinguish some form of key-exchange happening that results in later messages being encrypted.

Let's attempt to work out if the traffic we're examining belongs to one or more parties*.

*of course, we could simply be observing one half of a bi-directional conversation, with our capture cropped to only one side of the conversation

Multi-party comms

First off, we are going to discard the idea that our signal could be a conversation between multiple parties, for the following reasons:

the number of packets (5) is too few
the length of packets is too short
the aforementioned "Occam's razor": we shouldn't assume too much complexity, trying to discern a mesh network for a simple CTF challenge is getting a little silly

Why the number of packets is relevant: We only have 5 packets: a meaningful conversation between >2 parties is likely going to be significantly more involved than this, as the total number of packets would be divided somehow between the number of parties present. On the other hand 5 signals could reasonably represent a short but meaningful exchange between 2 parties:

Hi I'm ALICE
Hi ALICE, I'm BOB
Nice to meet you BOB, do you want a FLAG?
Yes please ALICE, send that over
Sure thing BOB: here's the FLAG: FLAG{askjdhakdhhasd}

Why the length of messages is relevant:

When you receive a radio message for a multi-party communication, you have to work out who it is to / from. For a bi-party channel this could be simply represented with a single bit: 0 to represent ALICE and 1 to represent BOB. However when you have more participants, and especially when you have a communication medium with an unknown and potentially high number of parties, the process of establishing who is who, and who the message is meant for, begins to take substantially more bits.

As an example of this let's take Wi-Fi: https://community.cisco.com/t5/wireless-mobility-knowledge-base/802-11-frames-a-starter-guide-to-learn-wireless-sniffer-traces/ta-p/3110019.

Wi-Fi networks are designed to support large numbers of devices all communicating on the same frequency. As a result each Wi-Fi packet needs to include headers / footers with the following properties:

who the sender is
who the destination is
various sequence numbers / checksums to avoid confusion when messages interfere / are desynchronised

As you can see in a Wi-Fi packet header, this takes up a significant amount of data, purely for the header. In fact a Wi-Fi header by itself is longer than our shortest packet (~36 bytes vs 83 bits).

Additionally, to avoid interference, the parties need to observe a strict timing pattern to avoid collisions (https://en.wikipedia.org/wiki/Duplex_(telecommunications)#Time-division_duplexing). This involves distributing data in short sharp bursts, with gaps between. The parties need to synchronise their communications and also issue a bunch of packets to indicate who can send data (CTS = Clear To Send: https://en.wikipedia.org/wiki/IEEE_802.11_RTS/CTS). As a result here's what a radio spectrogram of a Wi-Fi network looks like: it's much noisier and choppier than our comms, as any available time becomes filled with synchronisation signals.

Now these are all assumptions we can argue with: there are simpler multi-party protocols than Wi-Fi that don't require as much overhead. But the point remains that if we were looking at a capture of a communications protocol between N (N>2) parties, we would expect it to be noisier, and the messages longer. So we will proceed with the assumption that our capture is of either 1 or 2 parties.

Conclusions:

our capture does not represent communications between more than 2 participants

figuring out how many participants: signal strength

According to the laws of physics, the strength of a radio signal falls off steeply with distance: in free space, received power follows an inverse-square law, so doubling the distance quarters the power (https://en.wikipedia.org/wiki/Attenuation). This means that if we record a radio signal that represents a multi-party communication, messages belonging to different parties may have different average strengths (amplitudes).

Think about if you are stood between 2 people yelling at each other from a long distance away: if you are stood next to one, and far from the other you will hear one very loud voice, and one quieter. Therefore, even if you could not recognise the distinct voices of the participants, you could make a good guess that you are hearing a conversation between 2 parties, simply from the fact you can make out 2 distinct volumes.

We can apply this principle to radio analysis in the same way. A captured radio signal has to be captured from somewhere. (Actually, we could be looking at a purely synthetic / simulated signal generated in software, or drawn on graph paper. But if we look at the ringing, sidelobes etc we showed earlier, it seems likely this is a real signal capture.) So depending on where the signal was captured, our interception point will be at some distance from each side of the conversation, as shown in the below image:

Unless we are at point Z, equidistant from both parties, one party will be stronger and the other weaker. If we are at X, we will observe A as the stronger signal, and vice versa if we are at point Y.

Now in fact, we don't just consider points on a straight line: antennas broadcast signals in multiple directions, depending on antenna construction. We can think of the distance from a signal as a radius around the transmitter, and the relative signal strengths represented in 2 dimensions, in a kind of Venn diagram. This can be taken further with 3 signals, used for triangulation, as in GPS or radio direction finding (GPS / DF / CRYPTO: https://www.cryptomuseum.com/df/df.htm)

Anyway, let's put this principle in practice. If we are looking at comms between 2 parties, we may be able to determine who each is, by looking at the amplitude of each packet. An example capture that shows distinct amplitudes belonging to multiple participants is shown below:

Do we see this when looking at our capture??

not really: there's no massively discernible difference in strength between packets
because of the steep drop-off in signal strength with distance, we'd expect even a modest difference in relative distance to result in discernibly different amplitudes
If we activate URH's auto "participant detection", it also doesn't distinguish relative RSSIs of different packets
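To get a feel for the size of the effect, here is a rough sketch under an idealised free-space model, where received power follows an inverse-square law (an assumption: real environments add reflections and obstacles). The helper names and the sample values are made up for illustration and are not taken from the capture.

```python
import math


def path_loss_difference_db(d_near: float, d_far: float) -> float:
    """How many dB weaker the far transmitter is than the near one in free
    space. Power falls with distance squared, hence the factor of 20."""
    return 20 * math.log10(d_far / d_near)


def rms(samples) -> float:
    """Root-mean-square amplitude of real or complex samples."""
    return math.sqrt(sum(abs(s) ** 2 for s in samples) / len(samples))


def level_difference_db(packet_a, packet_b) -> float:
    """Relative strength of two packets in dB, from their RMS amplitudes."""
    return 20 * math.log10(rms(packet_a) / rms(packet_b))


if __name__ == "__main__":
    for ratio in (1.5, 2, 10):
        print(f"{ratio}x further away: {path_loss_difference_db(1, ratio):.1f} dB weaker")
    # Two made-up packets, the second at half the amplitude of the first.
    loud = [1.0, -1.0, 1.0, -1.0]
    quiet = [0.5, -0.5, 0.5, -0.5]
    print(f"measured difference: {level_difference_db(loud, quiet):.1f} dB")
```

Applied to real data, `level_difference_db` would be fed the raw samples of two packets cut out of the recording; packets from a single transmitter should come out within a fraction of a dB of each other.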

Conclusions: Either
- our capture source is perfectly equidistant between both parties
- our capture source is close enough to both parties that there is no discernible strength difference
- we are only seeing communications from a single party
- the captured signal has been artificially edited to make both parties equally strong (sometimes used to make decoding easier in a noisy channel for modulations that are not ASK)

Either way, it seems highly unlikely that the party a message comes from conveys any meaningful information needed to obtain the flag, as it cannot be easily discerned. We will assume it is not important.

Conclusions:

the sender of each packet is not relevant to its decoding

There's yet another potential complication: multi path signals. In the real world there are multiple surfaces off which radio signals are reflected before hitting the receiver. In a building, signals bounce off walls (explaining odd patches of wifi strength / weakness in houses), and outside, tunnels, mountains, clouds, and layers of the ionosphere all reflect different signals. The degree of reflection depends on the frequency / characteristics of the signal, but the resulting effect is that if you take a single signal and broadcast it, the waves will be scattered by the different objects they encounter. This creates multiple paths from source to destination, and the receiver will likely receive the signal multiple times at different strengths and at different times. This is the same principle as shouting "ECHO" in a cave or tunnel: you hear the sound reflected back at different intervals, representing the different path lengths that the sound wave took to be reflected back to you (https://en.wikipedia.org/wiki/Multipath_propagation).

multipath reflections scatter a single signal across multiple routes

This reflection can even be harnessed to exchange information round corners or over the horizon, frequently used in satellite signals / long distance communications, and more recently used to generate 3D maps of structures from received radio signals (https://ieeexplore.ieee.org/document/10025551). This is not the same as systems such as SONAR, RADAR, DOPPLER RADAR, or LIDAR, which typically subtract multi-path echoes to discard noise.

Our signal doesn't appear to be complicated by this effect, so we can safely proceed and ignore its implications.

File type analysis

CTFs like to combine different disciplines in interesting ways, so while radios are pretty analogue, maybe there's some other digital funkiness going on. A common category of CTF challenge is forensics / steganography: looking for hidden meaning encoded in digital files, and that's effectively what we're doing here. We have a seemingly meaningless stream of binary data, and are aiming to recover a meaningful flag from it.

Let's try that with binwalk, and the file utility, to pick out any file signatures in the data. We'll do this with the first and last messages, as they have the greatest lengths.

$ echo "7878417e45398000f21ca573a4a27f4e330440" | unhex > blob1.bin

$ file blob1.bin       
blob1.bin: data

$ binwalk --dd=".*" blob1.bin 

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------


$ echo "07878417e4539820150380cd40580d3f3916a3bbedc3f1b656929d4843edc013bd3d49cda36396b3c03e7e29cc2ac1cc4bebbe370e9000" | unhex > blob5.bin

$ file blob5.bin       
blob5.bin: data

$ binwalk --dd=".*" blob5.bin 

DECIMAL       HEXADECIMAL     DESCRIPTION
--------------------------------------------------------------------------------

$

Neither tool finds anything.
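What these tools do at their core is look for known "magic" byte sequences. The sketch below is a toy version of that idea, not binwalk's or file's actual logic: it only knows a handful of well-known signatures, and the function name is hypothetical.

```python
# A few well-known file signatures ("magic bytes").
SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
    "gzip": b"\x1f\x8b",
    "pdf": b"%PDF-",
    "jpeg": b"\xff\xd8\xff",
    "elf": b"\x7fELF",
}


def find_signatures(data: bytes) -> list:
    """Return sorted (offset, name) pairs for every signature found in `data`."""
    hits = []
    for name, magic in SIGNATURES.items():
        start = data.find(magic)
        while start != -1:
            hits.append((start, name))
            start = data.find(magic, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    blob1 = bytes.fromhex("7878417e45398000f21ca573a4a27f4e330440")
    print(find_signatures(blob1))  # an empty list means no known signature
```

Short signatures such as the two-byte gzip one will turn up by chance in long random data, which is one reason real tools check more than just the magic bytes.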

Conclusion:

the message does not represent an encoded file with a known filetype

Eventual analysis

I wasn't able to decode the flag or extract any meaning from the messages. If we put together our conclusions we have the following hints:

the capture is a digital transmission, sent and received by a computer not a human
the capture represents a real signal captured with radio hardware
the message is not a straightforward text encoding
the signal is not ASK
the signal is likely to be FSK encoded, with a sample/symbol of 1200
Until we find evidence, the signal is unlikely to be encrypted or compressed
the sender of each packet is not relevant to its decoding
our capture does not represent communications between more than 2 participants
The fifth packet is most likely where our data (flag) is
the message does not represent an encoded file with a known filetype
Either:
- minor glitches in the signal have resulted in incorrect demodulation, throwing off some messages length by 1 or 2
- packets do not represent compositions of even-length symbols.

So, in keeping with the spirit of the blog, I've failed to solve this challenge, but in the process I've learned a lot about radio signals and decoding, and found out about the awesome URH tool, so it's still been fun.

To be continued...

Glossary

PREAMBLE: A series of symbols to indicate the start of a message / packet
Frequency: The rate at which a radio wave oscillates, typically measured in Hertz (Hz). It determines the wave's position on the electromagnetic spectrum.
Amplitude: The strength of a radio wave, representing the power or intensity of the signal.
Modulation: The process of varying a property of a carrier wave (e.g., amplitude, frequency, or phase) to encode information for transmission.
- ASK (Amplitude Shift Keying): A digital modulation technique where the amplitude of the carrier wave is varied to represent binary data (e.g., 0 or 1).
- FSK (Frequency Shift Keying): A digital modulation method where the frequency of the carrier wave is shifted between discrete values to represent binary data.
Channel: A specific frequency band allocated for the transmission of a signal, ensuring separation from other signals
Sidelobe: Unintended emissions of power from a signal outside its main lobe, often caused by imperfections in transmission

Cheating

2024-09-29T00:00:00-05:00

Lots of fiction, in all forms, focuses on the world of spying and espionage. There are thousands of books, films, and games that cover the subject, and from different angles. There are non-fiction books and documentaries that study the true history of the subject, romantic fiction, and all sorts in-between.

The most common stereotype among these is probably that of the action packed adventure: fast cars, gunfights, doomsday weapons. James Bond, Jason Bourne, Jack Ryan, all rush from danger to danger in exotic locales across the world.

Then there's another favourite subject of fiction and documentary: police procedurals, murder mysteries, true crime. FBI agents, police inspectors, and complete novices crack mysteries, piece together clues, and the mystery concludes with a dramatic gunfight with the villain, in abandoned fortresses, the faces of mount Rushmore, serial killer's basements.

But these are romanticised depictions. Instead, real mysteries are solved by sifting through information and painstakingly building a picture of the relationships between people and events. The same goes for espionage: more damage can be done by poring over serial numbers (https://www.numberphile.com/videos/clever-way-to-count-tanks) than with silenced pistols.

This is the idea behind "A Hand with Many Fingers" (https://store.steampowered.com/app/1229030/A_Hand_With_Many_Fingers/), a game that is inspired by another genre trope. It's a scene found in several places in the espionage/mystery genre: the detective/spy is stumped and has no leads. Or they've pissed off their bosses by overstepping a line. Either way, they're relegated to the basement archives to hunt for clues the hard way: poring through documents. Here's Rust Cohle at it, in True Detective.

There's often a corkboard and red string involved, and after a montage of looking tired, photocopying files and drinking coffee out of polystyrene cups, the hero is rewarded by a new connection. This is a plot device used to take a break from the action, act as an ordeal and Apotheosis in a character arc, and move the plot forward.

In "A Hand with Many Fingers" the premise is this: you are an un-named, un-described individual in the employ of an implied but undefined government agency (FBI?). You are granted a corkboard, some filing cabinets, and a basement archive store with hundreds of files.

https://store.steampowered.com/app/1229030/A_Hand_With_Many_Fingers/

You start with a single newspaper cutting: the death of a man in Australia by the name of Nugan Hand. From this single piece of information, and the archive files, you use the corkboard to unravel the connections and events surrounding Hand. You don't know what you're going to find: maybe you will discover the truth behind his death, or something else entirely. From a suspicious death, you trace Hand's connections, and it becomes apparent that Hand was closely linked to schemes such as Air America (https://en.wikipedia.org/wiki/Air_America_(airline)) and was close friends with a former CIA director. The story turns from an isolated death to global geopolitics, history, and espionage.

The game's mechanics are also pleasingly simple. You start from the one newspaper clipping. You can use information from that to extract a triplet of name, year, and country of interest. You then use this to consult the filing cabinets. There's a bank of cabinets for each region of the world, a drawer per-year, and an alphabetical list of names in each drawer.

You go to the drawers, lookup the information by year and name, and are given a card with several numbers on it. The numbers are in turn the numbers of boxes, stored in the basement. You trundle down to the basement, retrieve the numbered boxes and examine the contents. The contents will be more newspaper clippings, or redacted intelligence reports, or torn bank statements. These may reveal more dates, names, and places, and you continue the hunt. On your corkboard you can pin these clippings, and draw pleasing red strings between them to show connections.

Slowly you build a network of connections, and begin to understand the relationships between the key characters, and their shady dealings across Vietnam, Angola, and the Middle East. At this point it's worth mentioning that this is based on real events, although official collusion between Nugan Hand Bank and the CIA is disputed (warning, spoilers for the game: https://en.wikipedia.org/wiki/Nugan_Hand_Bank). The characters and events depicted in the game were the real subject of conspiracy, and investigation. Nugan Hand Bank was ultimately found guilty of fraud, money laundering, and funding drug smuggling, but the connections to arms smuggling for the CIA remain unproven allegations.

That method of manually hunting down and cross referencing dates is a great way of drawing you into the conspiracy: you slowly uncover new links and start thinking about new theories as to why characters are really connected. But what if there was another way of solving the mystery? To find out, we're going to look at a mathematical technique for analysing information, used by real world spies, investigators, and private surveillance organisations. The technique is Social Network Analysis (SNA). Not "social network" as in social media, although there is a very close relationship between the two, but the technique of analysing connections that represent "social", i.e. human, relationships, and "networks" in the sense of groups of people.

Social Network Analysis is an application of graph theory (https://en.wikipedia.org/wiki/Graph_theory), which represents data in a Graph. A graph consists of nodes, and edges which link them. These can be used to represent whatever you want: it could be roads between points on a map, abstract mathematical concepts, or in the case of social network analysis, people and the connections between them. In fact social network analysis doesn't just have to be people, it could be countries, political movements, etc. But the theories are mostly applicable to people, or things that act "like" people, such as groups of people.

The beauty of social network analysis is how general purpose it is: it can give fascinating insights into people and behaviour from seemingly nonsensical data. There's no particular question it can answer, but instead, given some input data it will assign "weights" and patterns in the data that a human being can interpret. The algorithms are in fact completely ignorant of the "meaning" of data, the techniques operate purely on numbers, and it's humans that supply the context of input, and the resulting interpretation. This gives incredible flexibility: if your input data shows text messages between people, you can see social structures drawn out and draw conclusions about friendships, and relationships. But if you change that dataset to who speaks to each other in person, you get a different output, and different conclusions.

Introduction to graph theory

Let's start off with a classic example: a graph between characters. We start off with our list of people, and the strength of their relationships on a scale from 0-10.

Character A	Character B	Weight
Acciaiuoli	Medici	7
Medici	Barbadori	8
Medici	Ridolfi	9
Medici	Tornabuoni	9
Medici	Albizzi	9
Medici	Salviati	8
Castellani	Peruzzi	6
Castellani	Strozzi	7
Castellani	Barbadori	5
Peruzzi	Strozzi	7
Peruzzi	Bischeri	6
Strozzi	Ridolfi	7
Strozzi	Bischeri	7
Ridolfi	Tornabuoni	6
Tornabuoni	Guadagni	7
Albizzi	Ginori	4
Albizzi	Guadagni	7
Salviati	Pazzi	3
Bischeri	Guadagni	7
Guadagni	Lamberteschi	5

Now we draw a diagram, with lines between characters if they have a relationship. We've already done some simple network analysis here, just by depicting relationships. Suddenly, out of a list of names we have shape and pattern, and we can begin to see meaning in this. One of the most common forms of analysis is of "cliques" and "clusters" (https://www.oreilly.com/library/view/social-network-analysis/9781449311377/ch04.html, you can find the full version online). Graphs of real world data are not uniformly distributed, and instead we can in fact make out sub-groups, or clusters. These can be connected, or entirely separate, with no connections between them. Again, it's up to humans to provide interpretation onto this, but if we take graphs of nodes that represent people, these clusters are likely to show friendships, relationships, families etc.

These can be aligned with other attributes people have, such as politics, religion, nationality, interests, beliefs. Suddenly you can see why spies and companies such as Cambridge Analytica (https://en.wikipedia.org/wiki/Cambridge_Analytica#Methods) are so interested in the subject, and where all of Meta's profits come from.

Now let's do something else, instead of a simple graph with nodes and edges, let's depict some "weights". Weights are where we assign a value to edges or nodes. Again, this number can represent whatever we want it to. In our example, we'll use the relationship score as our weight for edges. Now let's draw the graph again, and visualise the edge weights by size.

We've got the same graph as before, but there's even more information. Now we can see that not only are characters connected, but some are more connected than others. If we supply our human interpretation to the cold numbers again, we can think of these weights as the strength of a relationship.
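In code, a weighted graph like this can be held as a dictionary of dictionaries. A minimal sketch using the table above; treating every relationship as two-way (undirected) is an assumption here, and the helper names are made up.

```python
from collections import defaultdict

# (character A, character B, weight), copied from the table above.
EDGES = [
    ("Acciaiuoli", "Medici", 7), ("Medici", "Barbadori", 8),
    ("Medici", "Ridolfi", 9), ("Medici", "Tornabuoni", 9),
    ("Medici", "Albizzi", 9), ("Medici", "Salviati", 8),
    ("Castellani", "Peruzzi", 6), ("Castellani", "Strozzi", 7),
    ("Castellani", "Barbadori", 5), ("Peruzzi", "Strozzi", 7),
    ("Peruzzi", "Bischeri", 6), ("Strozzi", "Ridolfi", 7),
    ("Strozzi", "Bischeri", 7), ("Ridolfi", "Tornabuoni", 6),
    ("Tornabuoni", "Guadagni", 7), ("Albizzi", "Ginori", 4),
    ("Albizzi", "Guadagni", 7), ("Salviati", "Pazzi", 3),
    ("Bischeri", "Guadagni", 7), ("Guadagni", "Lamberteschi", 5),
]


def build_graph(edges) -> dict:
    """Map each node to a {neighbour: weight} dict, storing every edge both ways."""
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight
    return dict(graph)


def strength(graph: dict, node: str) -> int:
    """Sum of the weights on all of a node's edges."""
    return sum(graph[node].values())


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for node in sorted(graph, key=lambda n: strength(graph, n), reverse=True):
        print(f"{node:<13} ties={len(graph[node])} total weight={strength(graph, node)}")
```

Sorting by total weight is just one way to read the numbers; as before, what a "strong" tie means is down to the human interpreting it.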

CENTRALITY

And now we're going to dive into even more interesting territory: the concept of centrality. CENTRALITY (https://www.oreilly.com/library/view/social-network-analysis/9781449311377/ch03.html) is a function in graph theory that has significant implications in the world of social network analysis. Centrality, as the name implies, is a measure of how "central" a node is. Consider the following graph:

What's the most "central" / important node? Clearly the one surrounded by the 3 other nodes: node A. "Important" here is a really vaguely defined concept, and this comes back to the idea earlier: graph theory has no actual knowledge of the "meaning" of data, it just crunches numbers. Importance is a concept superimposed onto these random shapes by people.

In fact in the above diagram there's no indicator as to why any of these nodes in particular is "important" in any real sense, we have no way of knowing what they represent, but we can measure centrality. We could sort of define importance as a synonym for centrality by thinking "How much would it disrupt the graph if this node was removed?". We can see here that if we removed any one of the outside nodes we'd still be left with most of the graph. But the second we remove the centre we no longer have a graph, just some scattered and disconnected dots.

Just to be clear, centrality isn't anything to do with placement or spacing, while we are drawing the nodes here on a 2 dimensional plane, that's just a visual representation to conceptualise the idea: the nodes have no inherent property of x,y coordinates. Node A is still the most "central" to this graph, regardless of how we visually represent it:

No matter which way we re-arrange the nodes, A is still more central than the others, due to the connections

Ok, so how is centrality defined? In the examples above the answer seems sort of obvious to us as humans, but what about a more complex example? What are the centralities of the nodes in this graph?

If we go back to our simple example, why was node A the most central? Well if we take each node, and assign it a number based on the number of edges it has, we can see this gives us our result, where the middle node has a centrality of 3, and the others 1:
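Counting edges per node is easy to sketch in plain Python. The outer node names B, C and D are assumptions for illustration. The second dictionary shows the normalised form (degree divided by n - 1) that libraries such as NetworkX report, which is why their scores come out as fractions rather than whole numbers.

```python
from collections import Counter

# The star graph from the diagram; B, C and D are hypothetical outer node names.
edges = [("A", "B"), ("A", "C"), ("A", "D")]

# Each edge contributes one to the degree of both of its endpoints.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Normalised variant: divide by the maximum possible degree (n - 1).
n = len(degree)
normalised = {node: d / (n - 1) for node, d in degree.items()}

print(dict(degree))
print(normalised)
```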

This is the first measure of centrality, known as DEGREE CENTRALITY: the centrality of a node is the number of edges it has (https://www.sciencedirect.com/topics/computer-science/degree-centrality). This is a pleasingly simple measure, and a building block of other forms. There are other common techniques for assigning centrality, but we'll only look at one: EIGENVECTOR CENTRALITY (https://en.wikipedia.org/wiki/Eigenvector_centrality). This is probably one of the most commonly used and important measures. The maths is more complex than I really understand, but we can say the way it works is that it takes into account the relative centralities of the neighbours of each node. This means that centrality "flows" through the graph. This results in it disproportionately picking out particular clusters of high importance.

This measure seems to have several impactful consequences in social networks. For example, think about using the idea of centrality as an indicator of "importance" or "influence" among people. Let's say Alice influences 10 people (10 edges), and Bob also influences 10 people. Who's more influential? Degree centrality would say they're equally important. But eigenvector centrality would calculate scores based on the influence that each of Alice's or Bob's connections have. If all Bob's friends are nobodies who don't know anyone (they have 0 connections, other than to Bob) then really how much influence does he have? But if all Alice's 10 connections are celebrities and powerful politicians, who in turn influence thousands of other people, then Alice's ultimate influence is greater. EIGENVECTOR centrality would assign Alice a much, much higher score than Bob.
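Here's a rough sketch of the Alice and Bob idea, scaled down to a handful of nodes. It's a toy power iteration rather than a production implementation, and the graph is made up: the node names, the three-friends-each sizing, and the single Alice-Bob edge (added only so the graph is one connected piece) are all my assumptions.

```python
import math


def eigenvector_centrality(edges, iterations=200):
    """Power iteration on (A + I): each node keeps its score and adds its neighbours'."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    score = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Keeping the node's own score (the "+ I") stops the values oscillating
        # on tree-like graphs such as this one.
        new = {n: score[n] + sum(score[m] for m in neighbours[n]) for n in neighbours}
        norm = math.sqrt(sum(x * x for x in new.values()))
        score = {n: x / norm for n, x in new.items()}
    return score, neighbours


# Hypothetical network: Alice knows 3 celebrities (each with 3 fans),
# Bob knows 3 nobodies (who know only Bob), and Alice knows Bob.
edges = [("Alice", "Bob")]
for i in range(3):
    celeb = f"celebrity{i}"
    edges.append(("Alice", celeb))
    edges.append(("Bob", f"nobody{i}"))
    edges += [(celeb, f"fan{i}_{j}") for j in range(3)]

score, neighbours = eigenvector_centrality(edges)
for name in ("Alice", "Bob"):
    print(name, "degree:", len(neighbours[name]), "eigenvector:", round(score[name], 3))
```

Both have the same number of edges, so degree centrality can't separate them, but Alice ends up with the higher eigenvector score because her neighbours are themselves well connected.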

We can briefly see the difference between degree and eigenvector centralities in these renderings, showing the exact same data marked using different centrality techniques.

https://commons.wikimedia.org/wiki/File:Wp-01.png

There's no "one" interpretation of centrality when applied to social networks, but we often associate the number of connections people have with power and influence: think about the phrase "well connected". Maybe it's the number of people who'd lend you money, vote for you, or tell you information they shouldn't, but the point is, it's generally regarded as beneficial. So with people we can think of centrality as a measure of social power / influence. In a network representing people, we can often think of high centrality as a representation of leadership, or control of a group. There's a fantastic example (that inspired this post) showing how social network analysis and centrality could be applied to information from 1775 to identify the ringleader(s) of the American revolution:

If the British counter-revolutionary police had known of this technique then, history might look different. And this shows one of the main uses of SNA: identifying important groups and leadership structures from data. It doesn't take much imagination to work out who's using this and what for.

Before we finish discussing SNA there's something important to point out. Earlier I said the beauty of the idea is that there's no one interpretation of the outcome: the algorithms only input and output numbers, and only humans deal in "meaning". There's a flip-side to this as well: as there's no one answer or interpretation of the data, there's no way of knowing if the interpretation you've placed on the data is correct. The maths draws connections that might not have any real-world relevance, or whose meaning is hard to interpret. The connection between siblings and the connection between a couple look the same to the maths, but we as humans know those are very different types of relationships. Making definitive judgements on the meaning of the data can be hard, and is fraught with risk. It can be tempting to impose simple, pleasing conclusions on the data, and as we know from examples like the birthday paradox, humans are bad at processing probability, and what might sound unlikely can occur much more frequently than we think. In short, jumping to conclusions because SNA makes something appear a certain way isn't a great idea.

Ok, so back from the mathematical sojourn: why do we care about SNA / centrality when it comes to "A Hand With Many Fingers"? Well if we think about it, the game's central puzzle is this: here's 100 filing cabinets' worth of unstructured data on people, and the player's job is to find the pattern between them.

So what if we take this information and, instead of actually solving the puzzle in-game, just find a way of representing it as a social network and see what the data shows us? Can we use SNA to find the network of people connected to Hand?

Remember that the game works as follows:

we have banks of filing cabinets, one per region
we have a drawer in each bank for each year
each drawer contains a list of cards with names
each card has one or more references to files in the basement archives

As in our earlier examples, we need to pick what our nodes will be. This is simple, each person in the game is represented as a surname on a card index. Now we need to pick relationships, and this is more complex. Several potential definitions immediately come to mind:

option 1: create a connection between two people if they have been to the same region ever. In the game this means if their names are in the same bank of cabinets.
option 2: create a connection between two people if they were in the same region in the same year. In the game this means the two names would appear in the same drawer
option 3: create a connection between two people if they are both referenced in the same file. This would mean both their names share a file index.

Let's dismiss the first: the game has 8 regions, which isn't really a rich enough dataset. We're likely to just end up with a graph that's either just 8 clusters of each region, or just a complete mess. This might tell us who amongst the group is the most frequent traveller, but if 2 people have been to the same region somewhere in a 10 year period, that doesn't mean much, especially considering the regions are often entire continents.

The second, create a connection between two people if they were in the same region in the same year, is immediately more interesting, as it's effectively how you're making connections at the start of the game. You have a newspaper clipping with Hand, Australia, 1979, and start looking for who else was in the region at the same time.

The third is almost better: if two people are mentioned in the same document (news article, photo, bank transfer), chances are there's a real, tangible connection, and our graph should be very powerful as a result.

The last two are almost equally attractive, so let's start with these. But before we can start trying out our theory we have to actually gather the data needed.
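To make the difference between those two definitions concrete, here's a small sketch. The card entries, the tuple layout and the function names are all hypothetical; the game's real data isn't structured like this.

```python
from itertools import combinations

# Hypothetical card-index entries: (surname, region, year, file references).
cards = [
    ("ALPHA", "AU", 1979, {"OS 100/1"}),
    ("BRAVO", "AU", 1979, {"OS 200/2"}),
    ("CHARLIE", "ME", 1978, {"OS 100/1"}),
]


def edges_same_drawer(cards):
    """Option 2: link two people whose cards share a region and a year."""
    return {
        frozenset((a[0], b[0]))
        for a, b in combinations(cards, 2)
        if a[0] != b[0] and (a[1], a[2]) == (b[1], b[2])
    }


def edges_same_file(cards):
    """Option 3: link two people whose cards share at least one file reference."""
    return {
        frozenset((a[0], b[0]))
        for a, b in combinations(cards, 2)
        if a[0] != b[0] and a[3] & b[3]
    }


print(sorted(sorted(e) for e in edges_same_drawer(cards)))
print(sorted(sorted(e) for e in edges_same_file(cards)))
```

The same three cards produce different graphs: option 2 links ALPHA and BRAVO (same drawer), while option 3 links ALPHA and CHARLIE (same file).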

Collecting data

I could sit in the game for several hours and manually copy the data out into a spreadsheet, but that sounds time-consuming, so let's cheat. If the information is in the game, it has to exist somewhere in the game's files. So let's use the venerable strings.exe (https://learn.microsoft.com/en-us/sysinternals/downloads/strings) to take a peek at the files. The first thing I'm going to do is take the name "ANDERSON" and search the game's files for it. It's worth pointing out I've changed the name "ANDERSON" from what it is in the game, to not spoil the plot. When we get the data and graph it, I've also switched every name in the game for another, to avoid spoilers (although they're real names, findable on Wikipedia, so go figure...). The reason I didn't use the name "Hand" that we already knew is that it's only 4 characters. Searching for strings shorter than 5 characters in unsorted binary data is likely to find false positives. To be honest this isn't actually a big problem here, but it would be on larger files: the likelihood of a completely random 4-byte sequence spelling the ASCII for "Hand" is 1/256^4, or 1 in 4294967296, so a ~3GB file (3221225472 bytes) of random bytes has roughly a 53% chance of that cropping up at least once. The real reason for not picking "Hand" is that it's in the name of the game, and therefore probably crops up in all sorts of metadata / asset data / class names that we might find in the files.
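The false-positive estimate can be worked through in a few lines. This is a back-of-the-envelope sketch that assumes every byte in the file is uniformly random and treats each starting offset as independent, neither of which is strictly true of real game files.

```python
import math

PATTERN_LENGTH = 4            # bytes in "Hand"
FILE_SIZE = 3 * 1024**3       # 3,221,225,472 bytes

p_single = 1 / 256**PATTERN_LENGTH          # chance of a match at one given offset
offsets = FILE_SIZE - PATTERN_LENGTH + 1    # places the pattern could start

expected_hits = offsets * p_single
# P(at least one hit) = 1 - (1 - p)^offsets, computed stably via log1p/expm1.
p_at_least_one = -math.expm1(offsets * math.log1p(-p_single))

print(f"expected chance matches: {expected_hits:.2f}")
print(f"chance of at least one:  {p_at_least_one:.1%}")
```

So on average you'd expect about 0.75 accidental matches in a file that size, which works out to a little better than a coin flip of seeing at least one.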

Anyway, we search for "ANDERSON", and something immediately pops up:

PS > strings.exe -n 10 'A Hand With Many Fingers_Data\level1' | select-string anderson

ME-1978-ANDERSON
AU-1980-ANDERSON
Text-ANDERSONFlights N
BC-ANDERSON
Text-BakerANDERSON1978
Text-WilsonANDERSON1975
Photo-BernieANDERSON N
ANDERSON. B

Fab, this looks exactly like the data we're looking for: we see references to regions (ME = Middle East, AU = Australasia), and years. Now just to check, let's search for a random name that's not connected to the plot, to make sure it's there. The name "gentry" should work: it appears in some files, but has no relevance to the story.


PS > strings.exe -n 10 'A Hand With Many Fingers_Data\level1' | select-string gentry
PS >

Uh-oh. No hits for "gentry" in that file where we found the "anderson" references. This suggests the file we're searching only contains the plot-relevant names. In other words: if a name is in this file it appears in the "main story", and if it isn't, it's a random entity used as filler. This would allow us to work out the main story actors by simply seeing if they exist in this file. That's cool, but if we put ourselves in the shoes of the unnamed detective in the archive, we wouldn't have this information, and so it's extra-cheaty. Let's keep looking.

The second file that jumps out looks much more promising: This looks exactly like the definition of the filing cabinets, wrapped in JSON.

PS > strings.exe -n 5 'A Hand With Many Fingers_Data\sharedassets1.assets' | select-string gentry

GENTRY
[{"surname":"ABBOTT\r","initial":"E","refs":["OS 654/49","OS
331/91","OS 272/94"]},{"surname":"ADAMS\r","initial":"O","refs":["OS
225/85","OS 472/82","OS
... SNIPPED ...

667/30","OS 682/65"]},{"surname":"ZAMORA","initial":"W","refs":["OS
236/56","OS
314/24"]}]

If we unpick this we get a JSON list with entries such as the following:

{
    "surname": "CARSON\r",
    "initial": "D",
    "refs": [
      "OS 370/8"
    ]
  },

That's a bit confusing: there's no grouping there by "region", or even year, just names and their card index references. So this only gives enough data for method 3: linking two people if they share a card reference. Unfortunately, this has the opposite problem to the first file: it contains every entry except the key characters, so back to the drawing board. At this point I tried a few more approaches:

searching for the reference numbers themselves in files, to see if any file contains both story critical and non-critical references. Nope.
trying to recreate the above JSON structure of the non-story entries found above by recreating data from the level1 file. Unfortunately this is also too hard: the binary structure doesn't seem to keep OS numbers next to story character names, so can't automatically link them. You can see an example of a file opened in a hex editor below showing this
trying to look not in the files, but in the game's memory at runtime, as maybe the data is rearranged at runtime. Nope.

At this point I was about to give up and trawl through every filing cabinet in the game to gather the information, but then I found a guide on the Steam guides page for the game that lists the story-relevant card entries. For example, we now have information such as Australia 1980 - Hand: OS 633/75, OS 385/14. I'll adapt this to be a mapping between (sur)names and OS numbers, and we get:

Hand: OS 633/75, OS 385/14, OS 385/14, OS 385/14, OS 449/13, OS 385/14, OS 590/7, OS 385/14, OS 449/13
Nugan: OS 109/72, OS 633/75
Baker: OS 161/15, OS 615/53, OS 449/13, OS 161/15, OS 786/95
REEDER: OS 633/75, OS 802/84, OS 161/15, OS 802/84
Collins: OS 488/17
Martello: OS 267/4, OS 536/84

(Again, I've obfuscated the original names here.)

There's a lot of duplication, since each box might contain multiple files which reference other dates and places; this is the point of an archiving system like this. This doesn't matter for now. Let's store the data as a dict of sets, so we get a key-value mapping of names to sets of OS numbers. If we also adapt the non-story examples to this format we should have a good enough data source to work from:

#!/usr/bin/env python3

import json
import networkx as nx

data_dict = {}

story_data = {
    "HAND": ["OS 633/75", "OS 385/14", "OS 385/14", "OS 385/14", "OS 449/13", "OS 385/14", "OS 590/7", "OS 385/14", "OS 449/13"],
    "NUGAN": ["OS 109/72", "OS 633/75", ],
    "Baker": ["OS 161/15", "OS 615/53", "OS 449/13", "OS 161/15", "OS 786/95"],
    "REEDER": ["OS 633/75", "OS 802/84", "OS 161/15", "OS 802/84"],
    "Collins": ["OS 488/17"],
    "Martello": ["OS 267/4", "OS 536/84"],
}

# Store each person's references as a set, which drops the duplicates
for name, numbers in story_data.items():
    data_dict[name] = set(numbers)

with open('hand.json', 'r') as f:
    non_story_data = json.load(f)

for d in non_story_data:
    # d['refs'] is already a list of OS numbers, so add its items directly
    if d['surname'] in data_dict:
        data_dict[d['surname']].update(d['refs'])
    else:
        data_dict[d['surname']] = set(d['refs'])

# data_dict now contains all OS refs

Here's some simple Python to parse the data, create a graph, and calculate degree and eigenvector centrality, using the NetworkX library:

#!/usr/bin/env python3

import json
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
from bokeh.models import Range1d, Circle, ColumnDataSource, MultiLine, LabelSet, CustomJS
from bokeh.transform import linear_cmap
from bokeh import palettes
from bokeh.io import show, output_file
from bokeh.plotting import figure, from_networkx

def build_centrality_table(G):

    '''
    given a graph G, return a dataframe of closeness, degree, eigenvector and Katz centralities per node :)
    '''

    try:
        df = pd.DataFrame.from_dict(nx.closeness_centrality(G), orient='index', columns=['closeness_centrality'])
        df = pd.merge(df,pd.DataFrame.from_dict(nx.degree_centrality(G), orient='index', columns=['degree_centrality']), left_index=True, right_index=True)
        df = pd.merge(df,pd.DataFrame.from_dict(nx.eigenvector_centrality(G, max_iter=10000), orient='index', columns=['eigenvector_centrality']), left_index=True, right_index=True)
        df = pd.merge(df,pd.DataFrame.from_dict(nx.katz_centrality(G, max_iter=100000), orient='index', columns=['katz_centrality']), left_index=True, right_index=True)
    except Exception as e:
        print("Unable to calculate centralities")
        #raise e
        empty_centralities = {x:0 for x in G.nodes}
        df = pd.DataFrame.from_dict(empty_centralities, orient='index', columns=['closeness_centrality'])
        df = pd.merge(df,pd.DataFrame.from_dict(empty_centralities, orient='index', columns=['eigenvector_centrality']), left_index=True, right_index=True)
        df = pd.merge(df,pd.DataFrame.from_dict(empty_centralities, orient='index', columns=['katz_centrality']), left_index=True, right_index=True)
        df = pd.merge(df,pd.DataFrame.from_dict(empty_centralities, orient='index', columns=['degree_centrality']), left_index=True, right_index=True)

        # return pn.widgets.DataFrame(pd.DataFrame())

    print(df.sort_values('eigenvector_centrality', ascending=False))

    return df

def render_graph(G, centrality_table, show_labels=True, centrality_measure='degree_centrality'):
    output_file(f"hand_{centrality_measure}.html")
    network_graph = from_networkx(
        G, nx.spring_layout, scale=20, center=(0, 0), weight=1, seed=55)

    try:
        if centrality_measure == 'degree_centrality':

            adjusted_node_size = dict([(node, (degree)) for node, degree in nx.degree(G)])
        else:

            # LOG???
            size = 80
            adjusted_node_size = dict([(node, (value * size)) for node, value in centrality_table[centrality_measure].to_dict().items()])
    except Exception as e:
        print("Unable to calculate centralities")
        raise e


    HOVER_TOOLTIPS = [("Node", "@index")]
    plot = figure(tooltips=HOVER_TOOLTIPS,
                tools="pan,wheel_zoom,save,reset",
                active_scroll='wheel_zoom',
                title='Network',
                width=1000,
                height=700,
                # x_range=(-2, 2), y_range=(-2, 2),
                background_fill_color=None,
                background_fill_alpha=0,
                border_fill_color=None,
                border_fill_alpha=0,
                outline_line_color=None
            )
    plot.xgrid.grid_line_alpha = 0
    plot.ygrid.grid_line_alpha = 0
    if len(adjusted_node_size) > 0:

        nx.set_node_attributes(G, name='size', values=adjusted_node_size)

        size_by_this_attribute = 'adjusted_node_size'

        source = ColumnDataSource(pd.DataFrame.from_dict(
            {k: v for k, v in G.nodes(data=True)}, orient='index'))

        # VARY SIZE VAR BASED ON DIFF MEASURES
        network_graph.node_renderer.data_source = source


        # network_graph.node_renderer.glyph = Circle(radius='size', fill_color=linear_cmap('size', palettes.Spectral[8], min(adjusted_node_size.values()), max(adjusted_node_size.values())))
        network_graph.node_renderer.glyph = Circle(radius=0.3, fill_color=linear_cmap('size', palettes.Spectral[8], min(adjusted_node_size.values()), max(adjusted_node_size.values())))

        plot.renderers.append(network_graph)

        if show_labels:
            # Add Labels
            x, y = zip(*network_graph.layout_provider.graph_layout.values())
            node_labels = list(G.nodes())
            source = ColumnDataSource(
                {'x': x, 'y': y, 'cn': [node_labels[i] for i in range(len(x))]})
            labels = LabelSet(x='x', y='y', text='cn', source=source, text_font_size='12px')
            plot.renderers.append(labels)
            callback = CustomJS(args=dict(labels=labels, x_range=plot.x_range), code="""
                const span = x_range.end - x_range.start
                const base = 11
                const scaled = Math.min(30, Math.max(6, base / span * 11))
                labels.text_font_size = scaled + 'px'
            """)

            plot.x_range.js_on_change('start', callback)
            plot.x_range.js_on_change('end', callback)

    return plot

data_dict = {}

story_data = {
    "HAND": ["OS 633/75", "OS 385/14", "OS 385/14", "OS 385/14", "OS 449/13", "OS 385/14", "OS 590/7", "OS 385/14", "OS 449/13"],
    "NUGAN": ["OS 109/72", "OS 633/75", ],
    "WILSON": ["OS 161/15", "OS 615/53", "OS 449/13", "OS 161/15", "OS 786/95"],
    "HOUGHTON": ["OS 633/75", "OS 802/84", "OS 161/15", "OS 802/84"],
    "COLBY": ["OS 488/17"],
    "HELLIWELL": ["OS 267/4", "OS 536/84"],
}

for name,numbers in story_data.items():
    data_dict[name] = set(numbers)

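with open('hand.json', 'r') as f:
    non_story_data = json.load(f)

for d in non_story_data:
    # d['refs'] is already a list of OS numbers, so add its items directly
    # (wrapping it in another list would try to add an unhashable list to the set)
    if d['surname'] in data_dict:
        data_dict[d['surname']].update(d['refs'])
    else:
        data_dict[d['surname']] = set(d['refs'])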

# data_dict now contains all OS refs
print(data_dict)

inverse_data_dict = {}
for k,v in data_dict.items():
    for x in v:
        inverse_data_dict.setdefault(x, []).append(k)

G = nx.Graph()
for name,numbers in data_dict.items():
    G.add_node(name)
    for n in numbers:
        for oname in inverse_data_dict[n]:
            if oname != name:
                G.add_edge(name, oname)

# G = nx.fast_gnp_random_graph(1000, 0.01)

df = build_centrality_table(G)
eig_plt = render_graph(G, df, show_labels=True, centrality_measure='eigenvector_centrality')

show(eig_plt)

deg_plt = render_graph(G, df, show_labels=True, centrality_measure='degree_centrality')
show(deg_plt)

If we run this, with no node colourings, we see the shape of the relationships:

[Image: uncoloured graph]

This tells us some interesting things:

There is a single large, well connected cluster
There are multiple nodes with 0 or 1 edges, dotted around the edges
There are a small number of disconnected minor clusters of 4

If we were to apply our theories of Social Network Analysis here we'd say the data shows a single group of well connected associates, with lots of strange isolated people that don't fit in. This would be an unusual graph if it showed real people / relationships, as we'd typically expect more interconnectedness, and fewer isolated groups.
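Eyeballing clusters works, but the same observations can be counted. Below is a minimal sketch using a union-find over a made-up toy graph (the node letters and the `component_sizes` helper are assumptions for illustration); on the real graph, NetworkX's `nx.connected_components(G)` would give the same groupings.

```python
from collections import Counter


def component_sizes(nodes, edges):
    """Return the size of every connected component, largest first (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving keeps the trees shallow
            n = parent[n]
        return n

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two groups

    return sorted(Counter(find(n) for n in nodes).values(), reverse=True)


# Hypothetical toy graph: one cluster of five, one of three, and two loners.
nodes = list("abcdefghij")
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "e"),
         ("f", "g"), ("g", "h")]

sizes = component_sizes(nodes, edges)
print("component sizes:", sizes)
print("isolated nodes:", sizes.count(1))
```

A list dominated by one large number followed by a tail of small ones is the numeric version of "one big cluster with stragglers around the edges".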

Now let's refine our analysis by calculating centrality scores using DEGREE CENTRALITY, and using this to colour the graph. This should tell us the relative "importance" of certain nodes. Red indicates high centralities, blue low, according to the Bokeh "Spectral" palette: https://docs.bokeh.org/en/latest/docs/reference/palettes.html

Now we can see things more clearly, we can see there are a small number of nodes with a high degree centrality (red). We can also see that on the edges of the large cluster are distinct small clusters.

As we can see, this is a kind of confusing analysis: there are no clear winners with high centrality, instead lots of nodes appear to have high centrality. This is what makes degree centrality a less useful metric, especially in Social Network Analysis. What if we calculate eigenvector centrality instead, not degree centrality?

As discussed earlier, eigenvector centrality tends to pick out only the most connected nodes, and we can see this here, as one concentrated "centre" of the graph, coloured with red.

Analysing results

So originally I theorised that we could "solve" the game, by applying Social Network Analysis, and that by calculating centralities of all the characters in the game, we'd be able to find the most influential characters, and solve the mystery.

So did this work?

If the hypothesis was correct, then perhaps the game's main characters are the most well connected / influential people in our archives. If we print out the table, sorted by eigenvector centralities, we'd expect to find the "most influential" characters, and here are the results:

name              closeness_centrality  degree_centrality  eigenvector_centrality  katz_centrality
PAYNE             0.154276           0.028662            4.239510e-01           0.112430
PARK              0.144626           0.025478            3.768897e-01           0.101089
GEYER             0.146413           0.022293            3.678812e-01           0.098157
FULLER            0.148361           0.019108            3.514795e-01           0.092221
OWEN              0.141285           0.019108            3.501144e-01           0.091208
...                  ...                ...                   ...                  ...
NORMAN            0.000000           0.000000            3.576366e-59           0.036707
NYLUND            0.000000           0.000000            3.576366e-59           0.036707
OGILVIE           0.000000           0.000000            3.576366e-59           0.036707
OLIVER            0.000000           0.000000            3.576366e-59           0.036707
XIAO              0.000000           0.000000            3.576366e-59           0.036707

Okay... these aren't the game's main characters, so not the result we might have suspected. Turns out the data doesn't match our hypothesis.

Where are the main characters in our graph? If we look closely we can see they are concentrated in a small cluster by themselves. They're not connected to other parts of the graph, and as a result, not very "central". They don't register at all on our high-centrality list:

So what does this mean? It means that our characters are differentiated not by their connections, but by their lack of them. It seems an illogical conclusion: we'd assume that these individuals, who ran an international bank, smuggled arms and guns across multiple continents would be extremely "well connected". Why isn't this shown in our analysis?

In a sense the graphing has worked: we've identified the cluster of 4 people closely associated with Nugan, which was sort of our original objective. Except, apparently, this does not include one of the central characters: Collins. (I'm not sure why, but I think in the game you're supposed to make a leap of faith to work out how he's connected, as he doesn't appear directly related in any files.) Additionally, we can't say that this grouping is really special, as there are multiple other standalone cliques:

The MONROE-IGUINA-VOSS-LYNCH-TYSON-ESTES-JI cluster
The KNOX-ELLIOT-IRWIN-KAUFMAN cluster

The reason for this unexpected result is that this isn't a real dataset.

In fact, if we plot a histogram of the number of connections each node has, what do we see?

import numpy as np  # the remaining imports come from the script above

def plot_histogram(df, column='connections', bins=20, output='histogram.html'):
    data = df[column].dropna()

    # Bin the raw values; the bin edges are taken from the data's own range
    hist, edges = np.histogram(data, bins=bins)

    output_file(output)

    plot = figure(title=f'Distribution of {column}',
                  x_axis_label=column,
                  y_axis_label='Frequency',
                  x_range=(-2, 10),
                  width=800, height=400,
                  tools='pan,wheel_zoom,reset,save',
                  background_fill_color=None,
                  background_fill_alpha=0,
                  border_fill_color=None,
                  border_fill_alpha=0,
                  outline_line_color=None)

    # Draw one bar per bin
    plot.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
              fill_color='steelblue', line_color='black', alpha=0.7)

    show(plot)

df = pd.DataFrame({
    'node': list(G.nodes()),
    'connections': [G.degree(n) for n in G.nodes()]
})

plot_histogram(df, column='connections', bins=30)

But actually this has an interesting real world implication on the biases of social network analysis and other algorithms: data bias in the input will result in bias in the output. In the world of the game, we assume that this archive is a set of data collected independently and without bias. But let's say the collection of newspaper clippings in the game was collected by someone who was actually trying to investigate an unrelated case, would that be unbiased?

It wouldn't contain much Nugan-Hand material, or wouldn't link them to other events.

On the other hand, what if the people responsible for gathering data already knew of Nugan Hand's connection to the CIA and were obsessed with proving it? Wouldn't they have disproportionately picked out clippings that support that hypothesis? The resulting data would potentially assign higher centralities to the main cast: the only conclusion that could be drawn from this data is that Nugan Hand was connected to the CIA. This is an example of the confirmation/narrative biases that draw people into conspiracy theories every day: you look for a pattern in data, and lo and behold, you find one.

If we'd gathered a completely random set of newspaper clippings, or rather, all the newspaper articles in the world, maybe we'd be able to see other conclusions. This illustrates the real dangers of biased data, and in drawing real world conclusions from social network analysis: the algorithm can only analyse what it knows about, and you have to be very careful to avoid bias in datasets.

Summary

So while ultimately my hypothesis was wrong, and the data wasn't representative of reality, I think this ended up an interesting exploration of network analysis and its applications.

Real and Imagined Prague

2024-05-20T00:00:00-05:00

Prague is the setting for Deus Ex's fourth adventure: Deus Ex: Mankind Divided, set in 2029.

It's a beautiful city, and the game imagines an elegant and at times brutalist future for it, blending the old town with new structures and ideas. Glass, metal, concrete, and police in mech-suits crawl over the old gothic buildings, in a style the artists dubbed "techno-feudalism". This post is just a lighthearted comparison between the game and real life, seeing how they match up and if there are any interesting nods to the real city hidden within the game. At the end I've linked an interview with some of the game's designers that gives some insight into how they built the city.

Past

Prague has evolved and radiated out from the seat of power: Pražský hrad, on the hill west of the river. The city boomed due to its central position in Europe, situated on a river for trade, and with defensible terrain. In the 14th century Prague became the seat of the Holy Roman Empire, and the city prospered and grew. It's this period which created many of the elaborate gothic landmarks Prague is famous for today.

In this picture from 1607, much of the distinctive skyline remains visible today

2029

The slices of Prague seen in Deus Ex are only a small fraction of the city. Specifically, the map is within Staré Město: the old town on the east side of the river. (Other maps in the game, as well as the in-game compass, change the orientation so that west becomes north and east becomes south.) Some maps in-game are nice enough to overlay the game area on a real map, placing it just between Mánes Bridge and Charles Bridge. Convenient, as that's where all the landmarks are. As we can see, the street layouts don't match, and the game world isn't an attempt to accurately model the real Prague. The central train line that bisects the map doesn't exist in real life, but otherwise the game map is somewhat true to this placement, with the in-game architecture roughly correct for this area of the city, and it makes sense for the landmarks in the game such as Palisade Bank to be situated in this central part of the city.

A map found within a disused tourist office in-game

Here are the real streets that area relates to, courtesy of Google Earth:

The buildings are pretty much as they are now, and as they existed in the 20th century, but with some extra sculptures. There's now a metro though, and some ridiculously large, impossibly cantilevered futuristic structures loom overhead.

The spires on the hill are a fairly accurate rendition of Pražský hrad. The silhouette is the same as the old drawing further up, although flipped

Rooftops and the mysterious case of the towers

The town, especially the old town, is quite dense in the medieval European style, and as you play an elite cyborg ninja sniper with a recently discovered knack for teleportation, there's quite a few opportunities to admire the skybox and look out over the roofs of Prague. In fact this is one of the main ways you get a sense of the city's geography while playing, as otherwise you're walking through narrow alleys, tunnels, and vents.

One of Prague's distinguishing features is the gothic spires attached to the bridges, clock towers, and old town hall, so much so that it has been known as the city of 100 spires:

A nice touch in Mankind Divided is that you can occasionally glimpse these poking out of the skyline.

When I noticed this I tried to cross reference them, to see if the approximate in game positions made sense. I came to the conclusion they didn't match up, as in the game you can find more towers than there are in real life, and the relative directions don't quite make sense:

The Clocktower

One of the mentioned towers is Prague's astronomical clocktower, located in the old town by the open square and town hall. You can in fact find the clocktower in the game, recognisable by its distinctive shape. In 2029, the bottom of the tower is clad in brutalist concrete sculpture, obscuring the famous clock at its base. You can find a clip of the in-game clocktower during the daytime here, showing off the concrete structure: https://youtu.be/eMFL7ettk7Y?t=42.
Prague Astronomical Clock. The lower part was painted by Joseph Mánes, who gives his name to one of the bridges discussed earlier

Bridges and castles

Prague is famous for its bridges, especially Charles Bridge, which look beautiful at night and criss-cross the Vltava. Sadly, due to the small map, there isn't an opportunity to walk across one, but there are a couple of neat viewpoints. In real life, looking west across the river from the old town shows a view of Pražský hrad. But the game has flipped the view: confusingly, if you look west, the view you see is that of the east side. This is shown in the 2 images below, with the same distinctive rooftops highlighted in each:

Across the Vltava in the game, remembering the game's map places us on the east side of the river

The same view in reality

The real view west from the game's position on the east would show the castle on the hill to the north west:

The Theatre

The exterior of the "Dvali" theatre, the setting of the "Hunting Down the Final Clues" mission, shows off a domed building on a corner, with a canopy and tympanum. (For more 'What style is this building and what do I call this funny bit' information, check out Rice's Architectural Primer.) It's a little hard to make out in the image below, due to the large mech and pouring rain. R.U.R. on the billboard is a reference to Rossum's Universal Robots, a Czech play which first brought about the term "robot" and discussed ideas about the rights of androids, a central theme of the recent Deus Ex entries.

Exterior of un-named theatre in Deus Ex: Mankind Divided

This actually appears to be a scaled down model of the Art Nouveau Prague Municipal House, not really a theatre. You can see the same corner shape, the distinctive dome, narrow arched windows, as well as the canopy and tympanum.

Exterior of Prague municipal house in real life, inspiration for the in-game theatre

The Time Machine

"The Time Machine" is a bookshop encountered early during the Deus Ex story, with no single inspiration I could find. The closest reference I could find was that the quirky book-arched front door is reminiscent of the window of this shop, shown below. Prague's famous library, the Strahov Monastery Library, doesn't seem to be referenced, and is geographically outside of the game's area, situated west of the river close to the castle.

The naming of The Time Machine is another neat nod to the game's themes and classic science fiction. H.G. Wells' "The Time Machine" is a seminal sci-fi work, also looking forward and envisioning a more divided future, with vast inequality between rich and poor, in which the working classes have lost their humanity.

There the parallels end, since H.G. Wells' novel ends with the protagonist travelling forwards in time to when the world is filled with giant crabs and the sun becomes a red giant, and in the game you fly to London to fight the Illuminati, both equally plausible and compelling visions. (Another link to classic sci-fi can be found nearby: a jeweller's called "Vern's Jewels".)

Wrapping up

I hope that's been an interesting dive into the city, with a bit of architecture and sci-fi history sprinkled in. If you've not played it I can heartily recommend playing the game, and if you want to read more, the following links provide some insights from the game's level designers:

Blog design

2024-05-20T00:00:00-05:00

Design

The squarish, block-based design is supposed to imitate retro-futuristic computer interfaces such as those in Metal Gear Solid (V), Chaos Theory or Nier Automata:

MGSV UI

Splinter Cell UI

It uses hamilton as a base theme, using CSS from https://github.com/metakirby5/yorha to provide styling. There is no dark mode.

I'd read some blogs with neat sidenotes for links, and upon researching how to do this I came across references to Edward Tufte. Projects such as tufte-jekyll provided the base for this site, as well as some useful code snippets.

You can see a post highlighting the features and content styles here: Tufte style post, plus some custom Jekyll plugins I wrote for scrollable + dropdown code blocks. Overall each post should make good use of images of different widths, plus sidenotes, resulting in something like this:

This actually got me reading Tufte's books on design: Visual Explanations, Envisioning Information, and The Visual Display of Quantitative Information. These are beautifully produced, informative, and a joy to read.

Although I'm not much of a designer I'm proud how this blog's design has ended up. I do my drawings in Excalidraw, which provides a really intuitive interface that can produce "hand-drawing-y" images. Over time, especially since LLMs have come about, I've been trying to incorporate more interactive elements such as embedded HTML diagrams and matplotlib charts, to make it feel less dry.

After quite a lot of time I realised that one of the inspirations for this blog, and my excalidraw-y diagramming style, was this book I read as a kid: The Way Things Work. It explains mechanical and scientific concepts with fun, engaging illustrations, and got me interested in science and technology. The illustrations in there are infinitely better than mine, but I think I can trace back some of my wanting to understand and draw the shapes and interactions of systems to this. For example check out this amazing illustration of pin & tumbler locks:

The illustrations make frequent use of mammoths, a motif that hasn't made it over to my drawings, but I remember loving the inclusion of mammoths in that book, and the fun it added to diagrams. I found this good article about the books and their illustration that nicely summarises how the humour in the pictures helps readers:

The mammoth is us; a little bewildered but trying hard to make its way in a complex world. [...] Like the mammoth, the inventor seeks to humanise our attempts to understand technology and render it comprehensible; even if the answers gained are sometimes wrong. His descriptions are not always right, which on the surface could seem misleading. However, his story is always clearly indicated as a secondary and intentionally humorous tale
, https://www.christopherroosen.com/blog/2021/9/5/david-macaulay-neil-ardley-the-way-things-work

Similarly, I'd like to think my tales of failing to solve challenges, and frequent dead ends, provide a more approachable narrative in learning how things work. There's another great quote from that article later on that discusses the method of communicating complex topics with illustrations:

The goal of such explanation isn’t to understand every aspect of a machine or technology, but to grasp its core ‘zeitgeist,’ its operating principle. The goal is not for people to be able to build a complex device, like a gearbox or telephone, but to understand how, in principle, these these things work. Crucially, its mean to encourage people to be curious and interested, to want to know more. They need to be comfortable with ever increasing levels of complexity.
, https://www.christopherroosen.com/blog/2021/9/5/david-macaulay-neil-ardley-the-way-things-work

I think this is also what makes me draw all the crappy excalidraw images, wanting to break down the systems to their component parts, and put them back together.

Depending on the platform you're reading from, you may notice the site doesn't render well on mobile. This is because I hate every second I spend on CSS / HTML, and I often hardcode widths etc. as pixel counts in the source. Apologies.

Tufte-style Jekyll blog

2020-04-13T04:46:04-05:00

The Tufte Jekyll theme is an attempt to create a website design with the look and feel of Edward Tufte's books and handouts. Tufte’s style is known for its extensive use of sidenotes, tight integration of graphics with text, and well-set typography. The idea for this project is essentially cribbed wholesale from Tufte and R Markdown's Tufte Handout format. (See tufte-latex.github.io/tufte-latex/ and rmarkdown.rstudio.com/tufte_handout_format.) This page is an adaptation of the Tufte Handout PDF.

Custom stuff

dropdown:

Scroll:

function test() {
    console.log("highlighted JS");
}

Jekyll customizations

This Jekyll blog theme is based on the GitHub repository by Edward Tufte here, which was originally created by Dave Liepmann, but is now labeled under Edward Tufte's moniker. I borrowed freely from the Tufte-CSS repo and have transformed many of the typographic and page-structural features into a set of custom Liquid tags that make creating content using this style much easier than writing straight HTML. Essentially, if you know markdown, and mix in a few custom Liquid tags, you can be creating a website with this document style in short order.

The remainder of this sample post is a self-documenting survey of the features of the Tufte-Jekyll theme. I have taken almost all of the sample content from the Tufte-css repo and embedded it here to illustrate the parity in appearance between the two. The additional verbiage and commentary I have added is to document the custom Liquid markup tags and other features that are bundled with this theme.

side images

Some text.

The SASS settings file

I have taken much of the actual Tufte-css files and modified them as necessary to accommodate the needs inherent in creating a Jekyll theme that has additional writing aids such as the Liquid tags. I have also turned the CSS file into a SASS file (the .scss type). This means that you can alter things like font choices, text color, background color, and underlining style by changing values in this file. When the Jekyll site is built using jekyll build the settings in this file will be compiled into the customized CSS file that the site uses. If you don't use SCSS or SASS, you are missing out on a huge productivity tool.

This file looks like this:

/* This file contains all the constants for colors and font styles */

$body-font:   ETBembo, Palatino, "Palatino Linotype", "Palatino LT STD", "Book Antiqua", Georgia, serif;
// Note that Gill Sans is the top of the stack and corresponds to what is used in Tufte's books
// However, it is not a free font, so if it is not present on the computer that is viewing the webpage
// The free Google 'Lato' font is used instead. It is similar.
$sans-font:  "Gill Sans", "Gill Sans MT", "Lato", Calibri, sans-serif;
$code-font: Consolas, "Liberation Mono", Menlo, Courier, monospace;
$url-font: "Lucida Console", "Lucida Sans Typewriter", Monaco, "Bitstream Vera Sans Mono", monospace;
$text-color: #111;
$bg-color: #fffff8;
$contrast-color: #a00000;
$border-color: #333333;
$link-style: color; // choices are 'color' or 'underline'. Default is color using $contrast-color set above

Any of these values can be changed in the _sass/_settings.scss file before the site is built. The default values are the ones from tufte-css.

Fundamentals

Color

Although paper handouts obviously have a pure white background, the web is better served by the use of slightly off-white and off-black colors. I picked #fffff8 and #111111 because they are nearly indistinguishable from their 'pure' cousins, but dial down the harsh contrast. Tufte's books are a study in spare, minimalist design. In his book The Visual Display of Quantitative Information, he uses a red ink to add some visual punctuation to the buff colored paper and dark ink. In that spirit, links are styled using a similar red color.

Headings

Tufte CSS uses `<h1>` for the document title, `<p>` with class `code` for the document subtitle, `<h2>` for section headings, and `<h3>` for low-level headings. More specific headings are not encouraged. If you feel the urge to reach for a heading of level 4 or greater, consider redesigning your document:

[It is] notable that the Feynman lectures (3 volumes) write about all of physics in 1800 pages, using only 2 levels of hierarchical headings: chapters and A-level heads in the text. It also uses the methodology of sentences which then cumulate sequentially into paragraphs, rather than the grunts of bullet points. Undergraduate Caltech physics is very complicated material, but it didn’t require an elaborate hierarchy to organize.

http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000hB

As a bonus, this excerpt regarding the use of headings provides an example of using block quotes. Markdown does not have a native `<cite>` shorthand, but real html can be sprinkled in with the Markdown text. In the previous example, the `<cite>` was preceded with a single return after the quotation itself. The previous blockquote was written in Markdown thusly:

[It is] notable that the Feynman lectures (3 volumes) write about all of physics in 1800 pages, using only 2 levels of hierarchical headings: chapters and A-level heads in the text. It also uses the methodology of *sentences* which then cumulate sequentially into *paragraphs*, rather than the grunts of bullet points. Undergraduate Caltech physics is very complicated material, but it didn’t require an elaborate hierarchy to organize.
[http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000hB](http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0000hB)

In his later books (http://www.edwardtufte.com/tufte/books_be), Tufte starts each section with a bit of vertical space, a non-indented paragraph, and sets the first few words of the sentence in small caps. To accomplish this using this style, enclose the sentence fragment you want styled with small caps in a custom Liquid tag called 'newthought' like so:

{% newthought 'In his later books' %}

Text

In print, Tufte uses the proprietary Monotype Bembo font. (See Tufte’s comment in the Tufte book fonts thread.) A similar effect is achieved in digital formats with the now open-source ETBembo, which Tufte-Jekyll supplies with a @font-face reference to a .ttf file. Thanks to Linjie Ding, italicized text uses the ETBembo Italic font instead of mechanically skewing the characters. In case ETBembo somehow doesn’t work, Tufte CSS degrades gracefully to other serif fonts like Palatino and Georgia. Notice that Tufte CSS includes separate font files for bold (strong) and italic (emphasis), instead of relying on the browser to mechanically transform the text. This is typographic best practice. It’s also really important. Thus concludes my unnecessary use of em and strong for the purpose of example.

Code snippets ape GitHub's font selection using Microsoft's Consolas and the sans-serif font uses Tufte's choice of Gill Sans. Since this is not a free font, and some systems will not have it installed, the free Google font Lato is designated as a fallback.

Epigraphs

The English language . . . becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.
George Orwell, "Politics and the English Language"

For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled.
Richard P. Feynman, “What Do You Care What Other People Think?”

If you’d like to introduce your page or a section of your page with some quotes, use epigraphs. The two examples above show how they are styled. Epigraph elements are modeled after chapter epigraphs in Tufte’s books (particularly Beautiful Evidence). The Tufte-css GitHub repository has detailed instructions on how to achieve this using HTML elements. As an easier alternative, the Tufte-jekyll theme uses custom Liquid tag pairs that allow the writer to embed elements such as epigraphs in the middle of the regular Markdown text being edited.

In order to use an epigraph in a page or section, type this:

{% epigraph 'text of citation' 'author of citation' 'citation source' %}

to produce this:

Lists

Tufte points out that while lists have valid uses, they tend to promote ineffective writing habits due to their “lack of syntactic and intellectual discipline”. He is particularly critical of hierarchical and bullet-pointed lists. So before reaching for an HTML list element, ask yourself:

Does this list actually have to be represented using an HTML ul or ol element?
Would my idea be better expressed as sentences in paragraphs?
Is my message causally complex enough to warrant a flow diagram instead?

This is but a small subset of a proper overview of the topic of lists in communication. A better way to understand Tufte’s thoughts on lists would be to read “The Cognitive Style of PowerPoint: Pitching Out Corrupts Within,” a chapter in Tufte’s book Beautiful Evidence, excerpted at some length by Tufte himself on his website. The whole piece is information-dense and therefore difficult to summarize. He speaks to web design specifically, but in terms of examples and principles rather than as a set of simple do-this, don’t-do-that prescriptions. It is well worth reading in full for that reason alone.

For these reasons, Tufte CSS encourages caution before reaching for a list element, and by default removes the bullet points from unordered lists.

Figures

Margin Figures

F.J. Cole, “The History of Albrecht Dürer’s Rhinoceros in Zoological Literature,” Science, Medicine, and History: Essays on the Evolution of Scientific Thought and Medical Practice (London, 1953), ed. E. Ashworth Underwood, 337-356. From page 71 of Edward Tufte’s Visual Explanations.

Images and graphics play an integral role in Tufte’s work. To place figures in the margin, use the custom margin figure liquid tag included in the _plugins directory like so:

{% marginfigure 'mf-id-whatever' 'assets/img/tufte/rhino.png' 'F.J. Cole, “The History of Albrecht Dürer’s Rhinoceros in Zoological Literature,” *Science, Medicine, and History: Essays on the Evolution of Scientific Thought and Medical Practice* (London, 1953), ed. E. Ashworth Underwood, 337-356. From page 71 of Edward Tufte’s *Visual Explanations*.' %}.

Note that this tag has three parameters. The first is an arbitrary id. This parameter can be named anything as long as it is unique to this post. The second parameter is the path to the image. And the final parameter is whatever caption you want to be displayed with the figure. All parameters must be enclosed in quotes for this simple liquid tag to work!
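The quoting rule is easier to see with a small sketch. It assumes the margin figure tag splits its arguments the same way the fullwidth plugin listed in the Code section does, by calling `shellsplit` on the raw tag text; `parse_tag_args` is a hypothetical helper written for illustration, not part of the theme.

```ruby
require "shellwords"

# Hypothetical helper (not from the theme): split the raw text of a Liquid
# tag into its parameters using shell-style quoting rules.
def parse_tag_args(raw)
  raw.shellsplit
end

# Quoted parameters survive as whole strings, spaces included.
id, path, caption = parse_tag_args("'mf-id-1' 'assets/img/rhino.png' 'A rhinoceros woodcut'")
puts id
puts path
puts caption

# Without quotes, every space starts a new parameter.
puts parse_tag_args("mf-id-1 assets/img/rhino.png A rhinoceros woodcut").length

# A plain apostrophe inside a single-quoted caption leaves a quote unmatched.
begin
  parse_tag_args("'assets/img/march.png' 'Napoleon's March'")
rescue ArgumentError => e
  puts "parse error: #{e.message}"
end
```

Under that assumption, a caption containing an apostrophe is safer written with a typographic apostrophe (’), which the shell-style splitter treats as an ordinary character.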

In this example, the Liquid marginfigure tag was inserted before the paragraph so that it aligns with the beginning of the paragraph. On small screens, the image will collapse into a small symbol: ⊕ at the location it has been inserted in the manuscript. Clicking on it will open the image.

Full Width Figures

If you need a full-width image or figure, another custom liquid tag is available to use. Oddly enough, it is named 'fullwidth', and this markup:

{% fullwidth 'assets/img/tufte/napoleons-march.png' 'Napoleon’s March *(Edward Tufte’s English translation)*' %}

Yields this:

Napoleon’s March (Edward Tufte’s English translation)

Main Column Figures

Besides margin and full width figures, you can of course also include figures constrained to the main column. Yes, you guessed it, a custom liquid tag rides to the rescue once again:

{% maincolumn 'assets/img/tufte/export-imports.png' 'From Edward Tufte, *Visual Display of Quantitative Information*, page 92' %}

yields this:

From Edward Tufte, Visual Display of Quantitative Information, page 92

Sidenotes and Margin notes

One of the most prominent and distinctive features of Tufte's style is the extensive use of sidenotes and margin notes. Perhaps you have noticed their use in this document already. You are very astute.

There is a wide margin to provide ample room for sidenotes and small figures. There exists a slight semantic distinction between sidenotes and marginnotes.

Sidenotes

Sidenotes display a superscript. (This is a sidenote and displays a superscript.) The sidenote Liquid tag contains two components. The first is an identifier allowing the sidenote to be targeted by the twitchy index fingers of mobile device users so that all the yummy sidenote goodness is revealed when the superscript is tapped. The second component is the actual content of the sidenote. Both of these components should be enclosed in single quotes. Note that we are using the CSS 'counter' trick to automagically keep track of the number sequence on each page or post. On small screens, the sidenotes disappear and when the superscript is clicked, a side note will open below the content, which can then be closed with a similar click. Here is the markup for the sidenote at the beginning of this paragraph:

{% sidenote 'sn-id-whatever' 'This is a sidenote and *displays a superscript*'%}
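As a rough sketch of how the two components could end up in the page, the helper below builds the label, checkbox and span structure documented for tufte-css sidenotes. `sidenote_html` is a hypothetical function, and whether this theme's plugin emits exactly this markup is an assumption.

```ruby
# Hypothetical helper (not the theme's plugin): build tufte-css style
# sidenote markup from the tag's two components.
def sidenote_html(id, content)
  # The empty label is what the CSS counter numbers; its `for` attribute
  # points at the hidden checkbox, so tapping it toggles the note.
  "<label for='#{id}' class='margin-toggle sidenote-number'></label>" \
  "<input type='checkbox' id='#{id}' class='margin-toggle'/>" \
  "<span class='sidenote'>#{content}</span>"
end

puts sidenote_html("sn-id-whatever", "This is a sidenote and <em>displays a superscript</em>")
```

The identifier appears twice, once on the label and once on the checkbox, which is why it has to be unique within a post.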

Margin notes

Margin notes are similar to sidenotes, but do not display a superscript. (This is a margin note without a superscript.) The marginnote Liquid tag has the same two components as the sidenote tag. Anything can be placed in a margin note. Well, anything that is an inline element. Block level elements can make the Kramdown parser do strange things. On small screens, the margin notes disappear and this symbol: ⊕ pops up. When clicked, it will open the margin note below the content, which can then be closed with a similar click. The Markdown content has a similar sort of markup as a sidenote, but without a number involved:

{% marginnote 'mn-id-whatever' 'This is a margin note *without* a superscript' %}

Equations

The Markdown parser being used by this Jekyll theme is Kramdown, which contains some built-in Mathjax support. Both inline and block-level mathematical figures can be added to the content.

For instance, the following inline sequence:

When \$\$ a \ne 0 \$\$, there are two solutions to \$\$ ax^2 + bx + c = 0 \$\$

is written by enclosing a Mathjax expression within a matching pair of double dollar signs: $$:

When $$ a \ne 0 $$, there are two solutions to $$ ax^2 + bx + c = 0 $$

Similarly, this block-level Mathjax expression:

\$\$ x = {-b \pm \sqrt{b^2-4ac} \over 2a} \$\$

is written by enclosing the expression within a pair of $$ with an empty line above and below:

$$ x = {-b \pm \sqrt{b^2-4ac} \over 2a} $$

You can get pretty fancy, for instance, the wave equation's nabla is no big thing:

\$\$ \frac{\partial^2 u}{\partial t^2}= c^2\nabla^2u \$\$

All of the standard LaTeX equation markup is available to use inside these block tags.

Please note that the block-level Mathjax expressions must be on their own line, separated from content above and below the block by a blank line for the Kramdown parser and the Mathjax javascript to play nicely with one another.

The Mathjax integration is tricky, and some things such as the inline matrix notation simply do not work well unless allowances are made for using the notation for a small matrix. Bottom line: If you are using this to document mathematics, be super careful to isolate your LaTeX blocks by blank lines!
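The blank-line rule is easy to break by accident, so a small check can help before building the site. This is only a sketch: `unisolated_math_blocks` is a hypothetical helper, and it treats any line that starts and ends with `$$` as a block-level expression, which is a simplification.

```ruby
# Hypothetical lint helper: return the 1-based line numbers of block-level
# $$ ... $$ lines that lack a blank line above or below.
def unisolated_math_blocks(markdown)
  lines = markdown.lines.map(&:chomp)
  problems = []
  lines.each_with_index do |line, i|
    stripped = line.strip
    # Treat a line that starts and ends with $$ as a block-level expression.
    next unless stripped.length > 4 && stripped.start_with?("$$") && stripped.end_with?("$$")
    before_ok = i.zero? || lines[i - 1].strip.empty?
    after_ok = i == lines.length - 1 || lines[i + 1].strip.empty?
    problems << i + 1 unless before_ok && after_ok
  end
  problems
end

good = "Some text.\n\n$$ x = 1 $$\n\nMore text."
bad  = "Some text.\n$$ x = 1 $$\nMore text."
p unisolated_math_blocks(good)  # []
p unisolated_math_blocks(bad)   # [2]
```

Inline expressions in the middle of a sentence are skipped because their line does not begin with `$$`.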

Tables

Tables are, frankly, a pain in the ass to create. That said, they often are one of the best methods for presenting data. Tabular data are normally presented with right-aligned numbers, left-aligned text, and minimal grid lines.

Note that when writing Jekyll Markdown content, there will often be a need to get some dirt under your fingernails and stoop to writing a little honest-to-god html. Yes, all that hideous `<table>` .... nonsense. And you must wrap the unholy mess in a wrapper tag to ensure that the table stays centered in the main content column.

Tables are designed with an overflow:scroll property to create slider bars when the viewport is narrow. This is so that you do not collapse all your beautiful data into a jumble of letters and numbers when you view it on your smartphone.

Table 1: A table with default style formatting

Content and tone of front-page articles in 94 U.S. newspapers, October and November, 1974	Number of articles	Percent of articles with negative criticism of specific person or policy
Watergate: defendants and prosecutors, Ford’s pardon of Nixon	537	49%
Inflation, high cost of living	415	28%
Government competence: costs, quality, salaries of public employees	322	30%
Confidence in government: power of special interests, trust in political leaders, dishonesty in politics	266	52%
Government power: regulation of business, secrecy, control of CIA and FBI	154	42%
Crime	123	30%
Race	103	25%
Unemployment	100	13%
Shortages: energy, food	68	16%

This is not the One True Table. Such a style does not exist. One must craft each data table with custom care to the narrative one is telling with that specific data. So take this not as “the table style to use”, but rather as “a table style to start from”. From here, use principles to guide you: avoid chartjunk, optimize the data-ink ratio (“within reason”, as Tufte says), and “mobilize every graphical element, perhaps several times over, to show the data.” (Page 139, The Visual Display of Quantitative Information, Edward Tufte 2001.) Furthermore, one must know when to reach for more complex data presentation tools, like a custom graphic or a JavaScript charting library.

As an example of alternative table styles, academic publications written in LaTeX often rely on the booktabs package to produce clean, clear tables. Similar results can be achieved in Tufte CSS with the booktabs class, as demonstrated in this table:

Table 2: A table with booktabs style formatting

Items
Animal	Description	Price ($)
Gnat	per gram	13.65
	each	0.01
Gnu	stuffed	92.50
Emu	stuffed	33.33
Armadillo	frozen	8.99

The table above was written in HTML as follows:



          
            Items 
            Animal Description Price ($)
          
          
            Gnat      per gram 13.65
                     each     0.01
            Gnu       stuffed  92.50
            Emu       stuffed  33.33
            Armadillo frozen   8.99

I like this style of table, so I have made it the default for unstyled tables. This allows use of the Markdown Extra features built into the Kramdown parser. Here is a table created using the Markdown Extra table syntax to make a nice table which has the side benefit of being human readable in the raw Markdown file:

Table 3: a table created with Markdown Extra markup using only the default table styling

	mpg	cyl	disp	hp	drat	wt
Mazda RX4	21	6	160	110	3.90	2.62
Mazda RX4 Wag	21	6	160	110	3.90	2.88
Datsun 710	22.8	4	108	93	3.85	2.32
Hornet 4 Drive	21.4	6	258	110	3.08	3.21
Hornet Sportabout	18.7	8	360	175	3.15	3.44
Valiant	18.1	6	160	105	2.76	3.46

Using the following Markdown formatting:

|                 |mpg  | cyl  |  disp  |   hp   |  drat  | wt  |
|:----------------|----:|-----:|-------:|-------:|-------:|----:|
|Mazda RX4        |21   |6     |160     |110     |3.90    |2.62 |
|Mazda RX4 Wag    |21   |6     |160     |110     |3.90    |2.88 |
|Datsun 710       |22.8 |4     |108     |93      |3.85    |2.32 |
etc...

The following is a simpler table, showing the Markdown-style table markup. Remember to label the table with a marginnote Liquid tag, and you must separate the label from the table with a single blank line. This markup:

{% marginnote 'Table-ID4' 'Table 4: a simple table showing left, center, and right alignment of table headings and data'  %}

|**Left** |**Center**|**Right**|
|:--------|:--------:|--------:|
 Aardvarks|         1|$3.50
       Cat|   5      |$4.23
  Dogs    |3         |$5.29

Yields this table:

Table 4: a simple table showing left, center, and right alignment of table headings and data

Left	Center	Right
Aardvarks	1	\$3.50
Cat	5	\$4.23
Dogs	3	\$5.29
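If the data already lives in a script, the pipe-table markup can be generated rather than typed. The sketch below is one way to do it; `markdown_table` is a hypothetical helper, and the colon placement in the rule row follows the alignment syntax used in the tables above.

```ruby
# Hypothetical helper: build a Markdown Extra (pipe) table as a string.
# aligns holds one of :left, :center or :right for each column.
def markdown_table(headers, aligns, rows)
  rule = aligns.map do |a|
    case a
    when :left   then ":---"
    when :center then ":---:"
    when :right  then "---:"
    else "---"
    end
  end
  # Header row, alignment rule, then one line per data row.
  ([headers, rule] + rows).map { |cells| "|" + cells.join("|") + "|" }.join("\n")
end

puts markdown_table(
  %w[Left Center Right],
  %i[left center right],
  [["Aardvarks", 1, '$3.50'], ["Cat", 5, '$4.23'], ["Dogs", 3, '$5.29']]
)
```

The output is unpadded, which Kramdown should accept just as it accepts the ragged rows in the Table 4 markup; padding the cells would only help readability of the raw file.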

Code

Code samples use a monospace font using the 'code' class. The Kramdown parser has the 'GFM' option enabled, which stands for 'Github Flavored Markdown', and this means that both inline code such as #include and blocks of code can be delimited by surrounding them with 3 backticks:

(map tufte-style all-the-things)

is created by the following markup:

(map tufte-style all-the-things)

To get the code highlighted in the language of your choice like so:

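require "shellwords"

module Jekyll
  # Renders {% fullwidth 'path' 'caption' %} as a full-width image with a caption.
  class RenderFullWidthTag < Liquid::Tag
    def initialize(tag_name, text, tokens)
      super
      # shellsplit honours the single quotes, giving [path, caption]
      @text = text.shellsplit
    end

    def render(context)
      # Assumed markup: the exact wrapper elements and class names
      # may differ from what the theme emits.
      "<div><img class='fullwidth' src='#{@text[0]}'/></div> " +
      "<p><span class='marginnote'>#{@text[1]}</span></p>"
    end
  end
end

Liquid::Template.register_tag('fullwidth', Jekyll::RenderFullWidthTag)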

Enclose the code block in three backticks, followed by a space and then the language name, like this:

 ``` ruby
    module Jekyll
    blah, blah...
   ```

Jekyll also offers powerful support for code snippets:

def print_hi(name)
  puts "Hi, #{name}"
end
print_hi('Tom')
#=> prints 'Hi, Tom' to STDOUT.
