Getting started with GPT-2

It’s a bit tricky to see where to begin with OpenAI’s (in)famous GPT-2 model. This blog post is our first in a small series about NLP. We hope it helps!

Getting Python

Our preferred way of installing and managing Python, particularly for machine learning tasks, is to use the Anaconda Environment.

⚠️ Anaconda’s environments don’t quite work like virtualenv, or other Python environment systems that you might be familiar with. They don’t store things in the location you specify, they store things in the system-wide Anaconda directory (e.g. on macOS in “/Users/USER/anaconda3/envs/”). Just remember that once you activate them, all commands are inside the environment.

Anaconda bundles a package manager, an environment manager, and a variety of other tools that make using and managing Python environments easier.

➡️ Download the Anaconda installer for Windows or macOS

Once you’ve installed Anaconda, following these instructions to make an Anaconda Environment to use with GPT-2.

➡️ Create a new Anaconda Environment named GPT2 and running Python 3.x (the version of Python you need to be running to work with GPT-2 at the moment):

conda create -n GPT2 python=3

➡️ Activate the Conda environment:

conda activate GPT2

Getting and using GPT-2

➡️ Clone the GPT-2 repository to your computer:

git clone

➡️ While inside the activated Conda environment you’re using, install the requirements specified inside the GPT-2 repository:

pip install --upgrade -r requirements.txt

➡️ Use the model downloading script that comes with the GPT-2 repository to download a pre-trained model:

python3 774M

⚠️ You can specify which of the various GPT-2 models you want to download. Above, we’re requesting a download of the 774 million parameter model, which is about 3.1 gigabytes. You can also request the smaller 124M or 355M models, or the much larger 1.5B model.

You might need to wait a little while as the script downloads the model. You’ll see something like this:

Fetching 44%|███████ | 1.36G/3.10G [02:16<03:44, 7.77Mit/s]

➡️ You’ll then need to open the file src/ using your favourite programming editor, and update the model_name to the one that you downloaded:

def interact_model(

➡️ You can then execute the Python script:

python3 src/

This will fire up the model, allowing you to enter some text. You’ll eventually (likely after some warnings that you can ignore about your CPU and the version of TensorFlow) see something like this:

Model prompt >>>

Enter some text and press return (we recommend only a sentence or two), then wait a bit and see what the model generates in response. This could take a while, depending on your computer’s capabilities.

We’ll be back with a follow-up article, exploring how you can actually use GPT-2 for something useful. Stay tuned!

For more AI content, check out our latest book, Practical Artificial Intelligence with Swift! It covers using Swift for AI in iOS applications, using Apple’s CreateML, CoreML, and Turi Create. If you like filling your brain with words, why not fill them with ours?

Analysing Performance Problems with Xcode Performance Tools, Part 1

This is the first post in a two-part series. Stay tuned for the second part.

One of the projects that we’re most excited about is our ongoing work to bring Night in the Woods to iOS. This has been an enormous amount of technical effort, and has led to a bunch of very cool spin-off projects – for example, Yarn Spinner only exists because we built it for Night in the Woods, and the research we’ve been doing in sprite compression to fit the game into a tiny amount of memory has been a blast.

A screenshot of Night in the Woods.
Night in the Woods.

One of the things we’re doing right now is improving the performance of the game on the device. Just like when you’re building a game for PCs or consoles, you want your game to be running at a high, stable frame rate. For mobile devices, this is a little more complex – you also have to consider the impact that the game has on the phone’s battery, because nobody wants to play a game and then find that their phone’s about to die.

Another important element of performance for mobile games is thermal impact. When the phone’s running a game, it’s making the hardware do a lot of work, and this makes it heat up. Modern systems – that is, anything made in the last 40 years – detect when they’re under thermal stress, and reduce their performance to avoid damage. This means that you might be able to achieve a solid 60 frames per second when you start playing, but that might get laggy in a couple minutes (or seconds!) of play. (Thermal stress is also a consideration for PCs and consoles, but it’s more critical for phones due to the reduced amount of space inside the chassis, the lack of active cooling, and the fact that you’re holding it in your hands.)

Spotting the Problem

The first thing that we noticed was that the reported frames per second in the game was low.

(We use an FPS counter called Graphy to visualise FPS. Graphy’s great – it’s easy to set up, reliable, low-impact, and open source. You should use it!)

The FPS counter was reporting an FPS count of about 20 to 40 frames per second on my iPhone X, depending on the scene. The other thing that was spotted was that the frame rate was inconsistent to boot.

Inconsistent frame rates in the Towne Centre East scene. Note the slightly jerky camera movement.

What was going on? Night in the Woods, despite its gorgeous visual style, doesn’t actually have hugely complex scenes. When diagnosing a problem like this, the first thing to do is to figure out where the bottleneck is – on the CPU, or the GPU.

Finding the Bottleneck

Xcode has a very simple tool for checking which part of the system is under the heaviest load. When running the game from Xcode, you can just click on the Debug navigator, which will show you a summary of the app’s performance.

A screenshot of Xcode's debug view. It shows the CPU at 79%, memory usage at 859MB, energy usage as Very High, disk usage at 80 KB/s, network usage at 2.9MB/s, and a frames per second count of 37.

The CPU usage and energy impact here are pretty high. Not enormously high, considering it’s a game, but still high. As the game itself reports, it’s barely achieving 30FPS frame rate. When we select the FPS element, we get some more detail:

A screenshot of the GPU load level. It shows the tiler's load at 7%, renderer at 86%, device utilisation at 100%, and CPU and GPU both taking 28.4 milliseconds to render.

The FPS report is showing that the device is 100% utilised, and that the CPU and GPU are both taking a full 28 milliseconds to produce the frames. The renderer is under significantly more load than the tiler, which effectively means that most of the work is being done by the fragment shaders, and not by having to deal with a lot of geometry (something that we’d expect, given that the polygon count of Night in the Woods, like most other 2D games, is not terribly high.)

Even though the CPU and GPU times were the same, this doesn’t necessarily mean that they’re under the same degree of load. Switching over to the CPU report showed that the CPU wasn’t really at maximum load (though this can be deceiving, since it may have been the case that a single core was at max load.)

The CPU load indicator, showing 87%.

The analysis that we’d seen so far seemed to suggest that the GPU might be the best place to look at, so we pulled out one of the two biggest and baddest tools in the chest: Instruments.

The application icon for Instruments.

Instruments can show a huge amount of data about the performance of an app. Happily, recent versions of the app come with the Game Performance template, which sets up a number of recorders that relate to how a game performs.

The instruments template selector. The "game performance" template is selected.

So, we ran our scene while recording data, and got back a simply enormous amount of data. It’s actually rather pretty.

There’s a lot of charts here, but the key thing to look at is at the bottom of the window, where the GPU state and display information is shown. Let’s zoom in on that.

A close-up view showing performance data from Instruments.

The purple bar at the top shows when the GPU was busy doing something. The bar is entirely filled, which means that there was no point in the run when the GPU was allowed to be idle. That’s bad, because when the GPU is active, it’s pulling energy out of the battery, and heating up the phone. We already knew this, because Xcode’s report was showing us that the device utilisation was 100%, but it’s nice to see this in a little more detail.

Where we start to see more useful information is in the ‘Display’ instrument. The Display instrument shows four charts:

  1. The current frame shown on the screen
  2. When a new frame was delivered to the screen, ready to be shown
  3. When the screen vsync’ed (that is, when it attempted to swap to the next available frame)
  4. Any frame stutters that Instruments found. There aren’t any in this screenshot, but that doesn’t mean that the display isn’t stuttering.

Take a close look at the top row of the Displays instrument. It’s showing the duration of each frame on screen, and they’re not all the same. Some are 33 milliseconds, some are 16. That’s what’s causing the janky appearance.

What’s interesting about this is that the frames are taking the same amount of time to be produced! We know this because the second row, labelled ‘scaler’, shows that each frame is being submitted to the GPU after about the same amount of time. The reason why some frames list for 33 milliseconds and some last for 16 is due to the fixed rate at which the display is swapping to the next available frame.

Understanding Frame Judder

To figure this out, let’s look at the timing information for four frames.

A close-in screenshot of Instruments, showing the timing of four frames, coloured blue, green, orange and blue again.

Here’s what’s happening at each of these four steps.

  1. The blue frame is on screen. The next frame, shown in green, is submitted to the display.
  2. The display hits its next vsync point. The green frame is ready, so it’s shown to the user.
  3. The orange frame is submitted to the screen, but it’s just too late for vsync. The screen therefore has no new frame to show, so it keeps showing the green frame for another sync interval. The orange frame is eventually shown at the next vsync.
  4. Meanwhile, the blue frame was being drawn, and it’s ready to go right away. It’s submitted to the display, and drawn right after that. The orange frame was only on screen for a single sync interval.

The problem here is not that the frames aren’t taking too long to draw. The problem is that the game is trying to send them to the screen at too high a rate. Ironically, in order to make the game smoother, we need to reduce the frame rate.

This was a one line change:

Application.targetFrameRate = 30;

The result was this:

A screenshot of Instruments, showing frame timing data. The timing of the frames is more consistent.

There are two things to note here.

First, the purple area now has gaps in it, indicating periods of time when the GPU was not active. This is a good thing! Less energy being drawn from the battery, and less heat.

Secondly, all of the frames are on screen for a consistent amount of time. They’re never missing a vsync, because the game isn’t trying to get them onto the screen as fast as it can. The game can now focus on operating at a consistent 30 frames per second.

The result is lower power consumption, and a smoother gameplay experience.

We’re Not Done Yet

But this was only part of the problem. As I mentioned earlier, the scenes in Night in the Woods are not incredibly complex; why was it taking so long to produce the frames? To solve that question, we needed to take a much closer look at the internals of how the GPU was drawing each frame.

We’ll look at how we achieved even better results in the next post. In the meantime, I’ll tease you with this:

A screenshot of Instruments, showing significantly less GPU usage.

🧶 Yarn Spinner 1.0

The popular open source narrative game development framework, Yarn Spinner, which is maintained by Secret Lab and a fabulous community, has reached version 1.0. As part of our 1.0 release, we’ve debuted 5 exciting new features:

  1. Compiled Scripts — Yarn Spinner now compiles to a binary format.
  2. Automatic Compiling — In Unity, your Yarn scripts will automatically be compiled when they change.
  3. Line Tagging — You can automatically add unique tags to lines of dialogue, and generate a .csv file to send to translators with the click of a button.
  4. Code Extension — There’s a syntax highlighting extension, available from the marketplace, for Visual Studio Code.
  5. No more .yarn.txt — The file extension is now .yarn! It was time.

We want Yarn Spinner to be the best tool that it can be. As part of that, we’ve launched a Patreon page, and we’d love for you to help support its development!

We’ve got big plans, so please check out the website, follow us on Twitter, support the Patreon if you can, and join our Narrative Game Development Slack. And we’d really appreciate it if you shared the news!

📊 Replacing Your Gearbox at 100 MPH: How live games monitor and change with millions playing

At Velocity Berlin 2019, we were asked to give a talk to a crowd of largely non-game developers, on how games make gather and analyse telemetry in order to improve the experience of players. In this post, we’ll recap what we talked about, and provide links for further reading!

Jon and Paris on stage at Velocity Berlin 2019. Photo: Tim Nugent

This talk was largely a literature review of advice and techniques from game developers, and we’re tremendously grateful for their generous sharing of knowledge. In particular, three stand-out talks from GDC were hugely useful:

We’d also like to thank Tony Albrecht from Riot Games, whose advice helped us put this talk together.

In order to talk about how games use data, we’ll break the discussion down into four main topics: what data is gathered, how that data is gathered, how the data is analysed, and how the changes are deployed.

What Data is Gathered

Much of the data that games gather is not unique to games. Just about every product out there collects data on when it’s launched, how long the session lasts for, and how much interaction the user has with it; by gathering this data, it becomes possible to understand patterns of usage in terms of sessions.

There are three critical metrics that need to be gathered in order to gain a rough understanding of how a game is used:

  • Session duration: how much time elapses between the user starting and finishing a stretch of interaction.
  • Session interval: how long the user waits before starting another session.
  • Session depth: how many interactions the user has in each session.

It’s important to note that you can’t analyse these individually. If the average session interval is decreasing, that may indicate that people are coming back to play more and more, but it could also mean that players are opening the game, checking to see if there’s anything new, and then immediately leaving.

By gathering this data, it becomes possible to get a picture of the number of daily, weekly and monthly active users for the game. These figures represent how many unique users had a session in the game over the specified period; generally, you want this to be going up over time, though players tend to drop off a game over time. Daily active users counts tend to be quite spiky, because of players who hear about new content and updates and jump into the game, but don’t stick around for long. As a result, monthly active users tends to be the main reported figure, because it effectively smoothes out trends in player population; in the 2018 annual report for Activision Blizzard, one of the key figures that they highlighted was MAU across Activision, Blizzard and King.

The quarterly MAU figures for Activision Blizzard, from September 30 2017 to December 31 2018.

In addition to these standard metrics, there are also data points that are unique to games. These tend to vary based on the type of game, but generally include things like score, death, level number, and position. Position is a particularly interesting one, because it’s very easy to visualise and plot against other key events – we’ll come back to this in a moment.

However, these concrete measurements of player behaviour aren’t good at getting an understanding of whether the player enjoyed themselves in the game. To fix this gap, Call of Duty: World War II directly asks their players if they had fun, using a single yes/no question that’s designed to minimise the burden of answering. Interestingly, the development team reports an average non-skip rate of between 60-80%, even when the answer order randomisation places the option to skip as the default. This is significantly higher than they expect.

The Call of Duty: WW2 fun survey.

Finally, games typically record performance data on how smoothly the game is running. Games typically aim to play at 60 frames per second, which means that each frame has only 16.6 milliseconds to render. When you have only about a dozen milliseconds to render, every one of them counts.

As a result, League of Legends records two kinds of performance data: first, regular telemetry reports are sent during a game, giving an idea of the impact that the most recent patch has had on performance. Additionally, the game records performance data for each frame as it’s played, and then compresses and uploads the data at the end of a round.

Most games perform some kind of internal time profiling; a common tool that we’ve seen multiple game developers use is the Chromium tracing system. This system was originally built to profile the performance of the Chrome web browser, but it’s able to load arbitrary timing data for analysis. For example, Aras Pranckevičius has written about how Unity Technologies uses it to profile the performance of their build system, and Colt McAnlis has written about why using Chrome’s profiler is a better idea than building your own.

How the Data is Gathered

Data collection is the process of delivering the data to the developer for analysis. There are a few ways to do it, and a great talk from Tom Mathews from 343 on how they built their telemetry systems for Halo 5 is a great place to look at. Some highlights of his talk include the fact that they converted their logging system away from unstructured logging strings to a formal, schema-based format based on Microsoft’s Bond format.

Among a few other benefits, this logging system allowed for sub-second response to telemetry, which meant that the game itself is able to respond to the data. As a result, in-game elements like end-of-game reports and leaderboards are driven by the telemetry system, rather than having to build a separate system for this purpose.

How the Data is Analysed

The data received from games can be divided into two main categories: spatial, and non-spatial. Spatial game data is anything that’s related to the player’s position in the game, whereas non-spatial data is everything else, and includes data like in-game performance, skill, and time spent in the game.

Non-spatial data is generally used to get a picture of how players are enjoying the game, and to generate a predictive understanding of whether players are coming back to play more. The fun survey from Call of Duty: World War II is particularly interesting, because they gather data on whether the player enjoyed the game or not – the developers don’t have to infer this data from simpler things like the fact that players are coming back.

This allows them to link player fun to other variables, with some surprising results. As might be expected, in-game performance – that is, how many kills a player got versus how many deaths they had – is a strong predictor of how much fun the player had in the game, but it’s not the only important variable. Player tenure (the total duration of time spent in game, across all sessions) and player skill (total kills versus total deaths, across all sessions) were found to be strong predictors of players reporting that they didn’t have fun. The Call of Duty development team’s theory for this is that the longer a person plays a game, the more critical of it they become; additionally, they found that player fun reports drop off once a player’s skill level becomes greater than that of the median player.

A particularly interesting note made by the development team was that the margin by which a team won was less impactful on fun score than the margin by which the team lost by. That is, a player whose team won by 50 points was just as likely to say that they had fun as a player who won by 5 points. This wasn’t the case for the losing team, however – a player whose team lost by a large margin was much more likely to report that they didn’t have fun than a player who only lost by a few points. This is good empirical evidence for the widely held belief that games that end in close ties are better for everyone.

In the area of spatial data, an excellent paper by Anders Drachen and Matthias Schubert, “Spatial game analytics and visualization” (PDF) proposes four primary types of spatial data analysis: univariate/bivariate, multivariate, trajectory, and behavioural analysis.

Univariate/bivariate analysis focuses on either one or two variables, in which one of those variables is player position. This is usually seen in the form of heat maps for levels, which allow developers to get a good understanding of where players die in levels. For example, consider this heat map for the level de_dust2, from Counter-Strike. Red areas indicate places where lots of players die.

A heatmap showing player deaths in de_dust2, from Counter-Strike 1.6. Source: gameME

Some interesting observations from this heat map:

  • Corridors and doorways are hotspots for player death
  • Doorways exhibit diagonal lines of player death, indicating where players lie in wait
  • Certain long lines of player death indicate places where players have long sight lines
  • A grid pattern can be seen at the bottom and top areas; these are the player spawn locations, and represent players starting the game there, deciding they don’t like their team, and quitting.

An excellent discussion on using heat maps to analyse level flow and gameplay balance is Sean Houghton’s post Balance and Flow Maps, which discusses an analysis of gameplay balance in maps used in Transformers: War for Cybertron.

Multivariate analysis allows for some more complex and sophisticated analysis of level content. Georg Zoeller’s excellent 2011 talk at GDC about the analysis tools used in Star Wars: The Old Republic shows how they combine information about player death with the locations and levels of monsters in the level, which were used to figure out balance problems with the level progression curve.

Trajectory information can be used to plot the movement of individual players through the game’s space, and is useful for detecting outliers and unintended paths through the environment. Jonathan Dankoff’s post on using trajectory analysis in Assassin’s Creed Brotherhood highlights how players could bypass tutorial content by not following the expected path through the environment.

Player trajectory data in Assassin’s Creed: Brotherhood. Players are intended to jump off a tower and parachute towards the destination (green lines at top), but some players were instead finding a way down the walls (red and green lines in the lower right), bypassing tutorial content. Source: Jonathan Dankoff

Finally, spatial information can be used to derive data about player behaviour in the game. Mahlmann et al’s paper, “Predicting player behavior in Tomb Raider: Underworld” (PDF) was able to use information like how long players spent in the early parts of the game to predict how far through the game they’d get before they gave up – something that’s extremely useful in balancing game difficulty and production investment in the game’s content.

How Changes are Deployed

When a game has made changes, it’s time to get the updated version out to players. There are a variety of ways to do this, with varying levels of disruption to players.

The most straightforward way of doing it is to release a new version of the game via digital distribution – that is, via Steam, itch, and the various App Stores. This is conceptually simple, but has a few downsides: players may not be aware of the update, or may choose not to update, which fragments the installed player base. Additionally, the size of the updates may be large, which reduces the chance of all players updating.

The local patching model adopted by the Nintendo Switch is quite interesting: players who want to form a local network and play a game are able to compare their installed version, figure out who has the most recent update, and then distribute the patch locally, without relying on internet access. This feature is especially important when you consider that one of the main marketing points of the Switch is the ability pick it up and take it outside; the Switch has no cellular internet connectivity, so local wireless communication is all it has.

If a game is designed to be competitive, opt-in beta streams can be used. In this model, players choose to receive beta versions of patches, and play the game in a testing mode. Because game changes frequently change the balance of play, players who want to maintain a competitive edge have an incentive to play the beta stream, and accept the risk of bugs and data loss.

Game changes don’t necessarily require updates to the code or assets, and small hot fixes can be applied. Some notable games that do this include Borderlands and Fortnite, which download a small patch on every game load that tunes gameplay content. These patches are typically only kept in memory, and are lost when the game exits; hot fixes are generally rolled up into a permanent patch after some time.

In order to minimise downtime, a blue-green deployment model for server updates is frequently common. When a new patch becomes available, existing servers are kept online for as long as there are players connected. All new players connect to servers running the latest version, and older servers are shut down as players disconnect from them. This means that players aren’t required to leave the game when a new patch lands; however, this model only works in games where players are separated into discrete sessions, and doesn’t work in single, shared-world environments like massively multiplayer games. For example, Star Wars: The Old Republic shuts down every Tuesday night for a few hours for patch deployment, and all players are kicked from the server.

Wrapping Up

We had a great time presenting this to a room full of operations and deployment experts, and we feel that there’s a lot that games can bring to the wider world of operations management. The video recording of our session will be available soon, and we’ll add it to this post when it arrives.

🧠 First Steps with Swift for TensorFlow

We just finished presenting at the inaugural TensorFlow World conference, in Santa Clara, California. Mars, Tim, and Paris presented what might be the first 3-hour tutorial session on the brand new Swift for TensorFlow machine learning platform.

This post serves as both a follow-up to that session (which was recorded, and will be posted soon — we’ll update this post when that happens) and a standalone guide and tutorial to get started with Swift for TensorFlow.

We’ll be posting follow-up tutorials, which will get more advanced, over the coming weeks. (In the mean time, check out our new book on Practical Artificial Intelligence with Swift!)

Getting Swift for TensorFlow

There are two ways to get Swift for TensorFlow that we’d recommend right now. The first is to use Google’s Colaboratory (Colab), an online data science and experimentation platform, which means you use it via a browser and a Jupyter Notebooks-like environment.

The second is to install it locally, using Docker.

If you use Windows, we recommend using Google Colab, and if you use Linux or macOS, we recommend installing using the Docker image (it’s much easier than Docker’s reputation might suggest!)

Installing Swift for TensorFlow with Docker

➡️ First, make a folder on your local system in which to store your Swift Jupyter notebooks. For example, mine is located at /Users/parisba/S4TF/notebooks. You don’t need to put anything in there, just make sure you’ve created it.

➡️ Download and install Docker:

We’re not going to explain this process much, because once it’s done you don’t need to think about Docker or any of this process again. If you want to learn how Docker works, there are plenty of sources online.

➡️ Now, clone the following git repository:

git clone

➡️ Then, change directory into the cloned repository, and execute the following command:

docker build -f docker/Dockerfile -t swift-jupyter .

➡️ Then, to launch the Docker container and Jupyter notebooks, execute the following command:

docker run -p 8888:8888 --cap-add SYS_PTRACE -v /path/to/books:/notebooks swift-jupyter

⚠️ Note that you will need to replace the /path/to/books in the above with the path to folder on your local system that you created earlier.

➡️ Open the URL that is displayed in your terminal, similar to the following:

Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:

➡️ You should see something that looks like the following screenshot:

➡️ You’re ready to go!

Using Google Colaboratory

You don’t need to do much to use Google Colaboratory!

➡️ Make sure you have a Google Account, and then head to Google Colab’s blank Swift notebook.

➡️ That’s it! You’re done.

Training a Model

In this example, we assemble a multilayer peceptron network that can perform XOR.

It’s not very useful, but it showcases how you build up a model using layers, and how to execute training with that model. XOR was one of the first stumbling blocks of early work with artificial neural networks, which makes it a great example for the power of modern machine learning frameworks.

It’s simple enough that you know whether it’s correct… which is why we’re doing it!

➡️ Create a new notebook, and import the TensorFlow framework:

import TensorFlow

To represent our XOR neural network model, we need to create a struct, adhering to the Layer Protocol (which is part of Swift For TensorFlow’s API). Ours is called XORModel.

Inside the model, we want three layers:

  • an input layer, to take the input
  • a hidden layer
  • an output layer, to provide the output

All three layers should be a Dense layer (a densely-connected layer) that takes an inputSize and an outputSize.

The inputSize specifies that the input to the layer is of that many values. Likewise outputSize, for the out of the layer.

Each will have an activation using an activation function determines the output shape of each node in the layer. There are many available activations, but ReLU and Sigmoid are common.

For our three layers, we’ll use sigmoid.

We’ll also need to provide a definition of our @differentiable func, callAsFunction(). In this case, we want it to return the input sequenced through (passed through) the three layers.

Helpfully, the Differentiable protocol that comes with Swift for TensorFlow has a method, sequenced() that makes this trivial.

➡️ To do this, add the following code:

struct XORModel: Layer
  var inputLayer = Dense<Float>(inputSize: 2, outputSize: 2, activation: sigmoid)
  var hiddenLayer = Dense<Float>(inputSize: 2, outputSize: 2, activation: sigmoid)
  var outputLayer = Dense<Float>(inputSize: 2, outputSize: 1, activation: sigmoid)
  @differentiable func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float>
    return input.sequenced(through: inputLayer, hiddenLayer, outputLayer)

➡️ Then we need to create an instance of our XORModel Struct, which we defined above. This will be our model:

var model = XORModel()

Next, we need an optimiser, in this case we’re going to use stochastic gradient descent (SGD) optimiser, which we can get from the Swift for TensorFlow library.

➡️ Our optimiser is, obviously, for the model instance we defined a moment ago, and wants a learning rate of about 0.02:

let optimiser = SGD(for: model, learningRate: 0.02)

➡️ Now we need an array of type Tensor to hold our training data ([0, 0], [0, 1], [1, 0], [1, 1]):

let trainingData: Tensor<Float> = [[0, 0], [0, 1], [1, 0], [1, 1]]

➡️ And we need to label the training data so that we know the correct outputs:

let trainingLabels: Tensor<Float> = [[0], [1], [1], [0]]

➡️ To train, we’ll need a hyperparameter for epochs:

let epochs = 100_000

Then we need a training loop. We train the model by iterating through our epochs, and each time update the gradient (the 𝛁 symbol, nabla, is often used to represent gradient). Our gradient is of type TangentVector, and represents a differentiable value’s derivatives.

Each epoch, we set the predicted value to be our training data, and the expected value to be our training data, and calculate the loss using meanSquaredError().

Every so often we also want to print out the epoch we’re in, and the current loss, so we can watch the traning. We also need to return loss.

Finally, we need to use our optimizer to update the differentiable variables, along the gradient.

➡️ To do this, add the following code:

for epoch in 0..<epochs
    let 𝛁model = model.gradient { model -> Tensor<Float> in

        let ŷ = model(trainingData)

        let loss = meanSquaredError(predicted: ŷ, expected: trainingLabels)

        if epoch % 5000 == 0
          print("epoch: \(epoch) loss: \(loss)")
        return loss

    optimiser.update(&model, along: 𝛁model)

➡️ Run the notebook! You should see something resembling the following output:

epoch: 0 loss: 0.25470454
epoch: 5000 loss: 0.24981761
epoch: 10000 loss: 0.2496698
epoch: 95000 loss: 0.16970003

➡️ Test your (incredibly useful) XOR model by adding a cell to your notebook with the following code:

print(round(model.inferring(from: [[0, 0], [0, 1], [1, 0], [1, 1]])))

➡️ The output should be as follows:


➡️ Congratulations! You just trained a machine learning model that can, badly, perform XOR.

We’ll be posting more Swift for TensorFlow material in the coming weeks! 🚀

For more Swift AI content, check out our latest book, Practical Artificial Intelligence with Swift! It covers using Swift for AI in iOS applications, using Apple’s CreateML, CoreML, and Turi Create. If you like filling your brain with words, why not fill them with ours?

If you want to learn a little more about Swift for TensorFlow, we recommend this session from TensorFlow World as a great starting point:

Installing Unity ML-Agents

⚠️ At O’Reilly AI Conference San Jose, attending our tutorial? This is for you! Complete all the steps in this document to be ready for the tutorial.

Want to explore the Unity Machine Learning Agents Toolkit (“ML-Agents”)? Here’s the easiest way to get up and running on Windows or macOS.

Unity ML-Agents is a great way to explore machine learning, whether you’re interested in building AI for games, or simulating an environment to solve a broader ML problem, why not try Unity’s ML-Agents?

We’ll be posting a variety of guides and material covering various aspects of Unity’s ML-Agents, but we thought we’d start with an installation guide!

To use ML-Agents, you’ll need to install three things:

  1. Unity
  2. Python and ML-Agents (and associated environment and support)
  3. The ML-Agents Unity project


Installing Unity is the easiest bit. We recommend downloading and using the official Unity Hub to manage your installs of Unity.

➡️ Download the Unity Hub for Windows or macOS

The Unity Hub allows you to manage multiple installs of different versions of Unity, and lets you select which version of Unity you open and create projects with.

⚠️ We recommend installing Unity 2019.2.4f1 for the tutorial at O’Reilly AI Conference. If you install a different version, we might not be able to help you.

If you don’t want to use the Unity Hub, you can download different versions of Unity for your platform manually:

➡️ Download a specific version of Unity for Windows or macOS

We strongly recommend that you use the Unity Hub to manage your Unity installs, as it’s the easiest way to stick to a specific version of Windows, and manage your installs. It really makes things easier.

If you like using command line tools, you can also try the U3d tool to download and manage Unity install’s from the terminal.

Python and ML-Agents

Our preferred way of installing and managing Python, particularly for machine learning tasks, is to use the Anaconda Environment.

⚠️ Anaconda’s environments don’t quite work like virtualenv, or other Python environment systems that you might be familiar with. They don’t store things in the location you specify, they store things in the system-wide Anaconda directory (e.g. on macOS in “/Users/USER/anaconda3/envs/”). Just remember that once you activate them, all commands are inside the environment.

Anaconda bundles a package manager, an environment manager, and a variety of other tools that make using and managing Python environments easier.

➡️ Download the Anaconda installer for Windows or macOS

Once you’ve installed Anaconda, following these instructions to make an Anaconda Environment to use with Unity ML-Agents.

➡️ First, download 🔗 this yaml file, and execute the following command (pointing to the yaml file you just downloaded):

conda env create -f /path/to/unity_ml.yaml

➡️ Once the new Anaconda Environment (named UnityML) has been created, activate it using the following command in your terminal:

conda activate UnityML

The yaml file we provided specifies all the Python packages, from both Anaconda’s package manager, as well pip, the Python package manager, that you need to make an environment that will work with ML-Agents.

Doing it manually

You can also do this manually (instead of asking Anaconda to create an environment based on our environment file).

⚠️ You do not need to do this if you created the environment with the yaml file, as above. If you did that go straight to “Testing the environment”, below.

➡️ Create a new Anaconda Environment named UnityML and running Python 3.6 (the version of Python you need to be running to work with TensorFlow at the moment):

conda create -n UnityML python=3.6

➡️ Activate the Conda environment:

conda activate UnityML

➡️ Install TensorFlow 1.7.1 (the version of TensorFlow you need to be running to work with ML-Agents):

pip install tensorflow==1.7.1

➡️ Once TensorFlow is installed, installing the Unity ML-Agents:

pip install mlagents

Testing the environment

➡️ To check everything is installed properly, run the following command:

mlagents-learn --help

You should see something that looks like the following image. This shows that everything is installed properly.

If you’re coming to our conference tutorial, you’re now ready to go.

The ML-Agents Unity Project

The best way to start exploring ML-Agents is to use their provided Unity project. To get it, you’ll need a copy of the Unity ML-Agents repository.

⚠️ You do not need to do this bit if you’re coming to our tutorial at the O’Reilly AI Conference. We will provide a project on the day.

➡️ Clone the Unity ML-Agents repository to your system (see the note below if you’re coming to our tutorial!):

git clone

⚠️ If you’re coming to our O’Reilly AI Conference tutorial, we will provide a project on the day.

You should now have a directory called ml-agents. This directory contains the source code for ML-Agents, a whole of lot useful configuration files, as well starting point Unity projects for you to use.

➡️ You’re ready to go! If you’re coming to our tutorial, you’ll need a slightly different project which we’ll help you out with on the day!

We’ll have another article on getting started (now that you’ve got it installed) next week!

In San Jose? At O’Reilly AI Conference? Attend our tutorial!

Symbolic Analysis with Python and Z3

This is a text version of a talk that I gave at PyCon AU 2019.

Let’s say you’ve got this program:

pie_price = 3.14

num_pies = int(input("How many pies?"))

pie_owing = pie_price * num_pies

if pie_owing > 10:
    print("You're over the pie budget")

How do you test that the line that prints “you’re over the pie budget” can run? One way is to just run the program, type in a large number, and verify that you see it.

But what if you couldn’t ask for input? For example, maybe this part of the code is buried deep within a larger process, and reaching it is tricky; maybe the code under test is operating in a continuous integration environment, and no user input is available. What do you do then to ensure that this line is reachable?

Why, producing a formal proof, of course. It’s the only sensible way.

In this post, we’ll walk through the theory and practice of using symbolic execution, a static analysis technique, for bug discovery. In particular, we’ll focus on a specific type of bug: how can we prove that a line of code is, or is not, reachable?

How to solve it

Let’s start by reframing the question into something more formal:

For any given line of code, is there a set of inputs for the program that causes that code to be reached?

Or, to put it another way:

What are the constraints on the input that cause a line of code to run?

Let’s work the problem by doing it by hand. Here’s the code again:

pie_price = 3.14

num_pies = int(input("How many pies?"))

pie_owing = pie_price * num_pies

if pie_owing > 10:
    print("You're over the pie budget")

We know from the first line that pie_price is 3.14. However, we don’t know the value of num_pies, because it depends upon user input. In order for any of the rest of the code to work, though, we need to have a label for the value stored in num_pies.

This is where the symbol in symbolic execution comes in: we’ll introduce a symbolic value – let’s call it 🥧 – and declare that the variable num_pies contains 🥧. We don’t know anything about what’s stored in 🥧, but we do know some facts about it.

Specifically, we know a single fact about it right now: 🥧 is an integer, which means that it supports any operation that other integers support: addition, multiplication, comparison, and so on.

Our next line, pie_owing = pie_price * num_pies, has a similar problem: we can’t know the value of pie_owing, because it’s the result of a multiplication between a known (or concrete) value and the symbolic value 🥧. So, what do we store in pie_owing? We’ll store the entire expression pie_price * 🥧 in there.

The final line of code before the print statement is a conditional: if pie_owing > 10. If we proceed on to the next line, then it follows that the value of pie_owing – whatever it is – is greater than 10.

We now have enough information to put together a collection of logical assertions that must be true in order to reach the print statement. They are:

  • pie_price = 3.14
  • num_pies = 🥧
  • pie_owing = pie_price × num_pies
  • pie_owing > 10

Great. Our next question is: can we demonstrate that these equations can all be true at the same time?

Could we even do it… with a computer?

Proving it with Z3

The Z3 Theorem Prover is a library from Microsoft that’s capable of, among many other things, answering this problem. It also has bindings to lots of popular languages, including Python.

To answer our question, we’ll construct several equations that represent the constraints on the input that are in place when the print line is reached, and feed them into a solver; we can then ask the solver to check to see if they can be true at the same time.

from z3 import *

# Create the solver
s = Solver()

# Declare our variables: "pie_price", which we know the 
# value of, "num_pies", which we don't, and "pies_owing", which depends upon the values of the other two
pie_price = Real('pie_price')
num_pies = Int('num_pies')
pies_owing = pie_price * num_pies

# Assert that pie_price is equal to 3.14
s.add(pie_price == 3.14)

# Assert that pies_owing is greater than 10
s.add(pies_owing > 10)

# Ask if these these can be true at the same time
s.check() # returns "sat" - they can be!!

We’ve now demonstrated that in order to reach the line, print("You're over the pie budget"), a set of equations must be true at the same time; additionally, Z3 indicates that they can indeed be. Therefore, we’ve proved that the line is reachable, and we never needed to ask the user for input.

Incidentally, we can ask Z3 to produce a model of its solution, which means it will produce a value for all of the variable in question, including num_pies – the value we’d ordinarily get from the user. That is, Z3 can produce a value for num_pies that would result in the print statement to run.

s.model()[num_pies] # 4

Generating the Equations Automatically

In the previous example, we had to read through the code and manually produced the equations that are in place. Wouldn’t it be nicer, though, if we could have a system do this for us?

To do this, we’ll take advantage of the fact that Python is very easy to decompile into byte code. Using the dis module, we can take any Python function, and produce the byte code that represents it. Converting the code to byte code is important, because byte code is significantly simpler, and easier to analyse.

Once we have the byte code, we need to find a way to determine the possible paths through the code that execution can take, depending on the inputs given the program. We then need to determine the constraints on the variables that affect the path; if at any point the constraints are not compatible with each other, the path is impossible. If all paths that reach a line of code are impossible, then the line of code is unreachable under any circumstance.

For example, consider this snippet of code:

i = 1

if i == 0:

There is theoretically a path of execution that goes from line 1, through line 2, and ends at line 3, but if you think about it, it would require the variable i to be equal to 0 and also to 1. This is impossible, and as a result, the path is impossible; because this is the only path that reaches line 3, that line is unreachable.

This means that our next problem is: given a block of code, how do we calculate the possible paths through that code?

Basic Blocks and Control Flow Graphs

As before, let’s start with a chunk of code, which we’ll use as our example.

def test(a):
    x = 0

    if a > 0 and a < 5:
        x = 1

    b = a + 1

    if x == 1 and b > 6:

Our question for this code is: can the final line of code, print("Hello"), ever be reached? And can the process of discovering this be automated?

Let’s start by asking dis for the byte code.

import dis

This produces something like this (I’ve truncated it to the first few lines):

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (x)

  3           4 LOAD_FAST                0 (a)
              6 LOAD_CONST               1 (0)
              8 COMPARE_OP               4 (>)
             10 POP_JUMP_IF_FALSE       24
             12 LOAD_FAST                0 (a)
             14 LOAD_CONST               2 (5)
             16 COMPARE_OP               0 (<)
             18 POP_JUMP_IF_FALSE       24

  4          20 LOAD_CONST               3 (1)
             22 STORE_FAST               1 (x)

  5     >>   24 LOAD_FAST                0 (a)
             26 LOAD_CONST               3 (1)

Each one of these lines is a low-level instruction to the Python virtual machine. The Python VM is a stack machine, which means that the instructions work by pushing and popping values on a stack. For example, the LOAD_CONST and LOAD_FAST operations push values onto the stack (either a constant value or a value stored in a variable), while the COMPARE_OP operation pops two values off the stack, compares them, and pushes the result back onto the stack. Additionally, certain instructions are responsible for controlling the flow of execution: the POP_JUMP_IF_FALSE instruction pops a value off the stack, and if it evaluates to False, jumps to a numbered instruction; if it evaluates to True, it proceeds to the next instruction instead.

How, then, can we find the possible paths through the code? One popular approach is to decompose the stream of instructions into basic blocks: runs of instructions that are only ever entered at the start, and only ever exit at the end (that is, it is impossible for the program to jump to a point that’s in the middle of a basic block).

To determine these basic blocks, the instructions are scanned, and certain instructions are marked as leaders:

  • The first instruction is a leader.
  • Instructions that are the destination of a jump are leaders.
  • Instructions following a conditional jump are leaders.
  • Instructions following a ‘stop’ instruction are leaders.

Once you know the leaders, you can then group up the instructions according to the most recent leader.

Next up, you form the connections between the blocks. Blocks have successors (blocks they lead to), and predecessors (blocks that lead to them.)

  • Blocks that end in an unconditional jump have one successor – the target of the jump.
  • Blocks that end in a conditional jump have two successors – the target of the jump, and the next instruction.
  • Blocks that end in a ‘stop’ instruction have no successors.
  • All other blocks have a single successor: the following instruction’s block.

With these rules in mind, we can take the byte code for our example function, and figure out the blocks.

The basic blocks in the program.

Given these blocks and the way they link together, we can generate the control flow graph of the program. This graph shows how the blocks connect, and allows us to find the paths that execution can take through the program.

The control flow graph of the program.

We’re now ready to start testing the paths that lead to the print("Hello") function call, which is the second-to-last basic block (it’s the blue block, second from the right of the above image.) For the purposes of this article, we’ll select one of them arbitrarily, and prove that the path is valid or not; the same steps apply for testing any path.

A specific path through the control flow graph.

Finding impossible paths

In order to perform normal execution of the code, Python steps through each instruction, and performs them as normal – loading data into memory, requesting that the system get input, and so on. However, this only works when we’re running the entire program, which includes all of the work done to decide what parameters to use when calling the function test. When we perform symbolic execution, and are examining only portions of the program, we no longer have the ability to know concrete values for every variable.

Let’s take a closer look at the first basic block:

  2           0 LOAD_CONST               1 (0)
              2 STORE_FAST               1 (x)

  3           4 LOAD_FAST                0 (a)
              6 LOAD_CONST               1 (0)
              8 COMPARE_OP               4 (>)
             10 POP_JUMP_IF_FALSE       24

The third instruction in this disassembly loads the contents of the variable a, and pushes it onto the stack. However, a is a parameter to the function, which means it’s not possible to get a concrete value for the variable when considering the code in isolation.

This means that when we encounter the instruction LOAD_FAST a, we need to introduce a new symbolic value. That’s not the only symbolic value we need to track, though: on lines 1 and 2, we load the number 0, and store it in the variable x. This means that we need to declare to Z3 that the variable x exists, and assert that it is equal to 0.

Additionally, if we’re testing a specific path through the code, we already know whether the POP_JUMP_IF_FALSE will jump or not. In the case of our selected path, if we’re proceeding from the first block to the second, it means that we’re taking the path in which the value on the stack is True. This mean that we also assert that the result of comparing if a is greater than 1 is True.

In effect, setting a variable now means creating and recording an assertion that the variable contains a certain value, and when encountering a conditional jump, we assert that its condition is true (if we’re taking the true path), or false (if we’re not).

We continue this execution, creating additional constraints on the values as we encounter instructions that interact with them, and at the end of each block, we feed them into Z3 and ask if the assertions are compatible with each other. If they’re not, then the path is impossible, and we try again with a different path. If all of the paths that reach a block are impossible, then that block is unreachable under any circumstances.

In the specific case of our example, the line print("Hello") is unreachable. For it to be reached, it would require either the value of a to be both greater than 5 and less than 5 at the same time.


Symbolic execution is really fun and useful, but it isn’t without its drawbacks. In particular:

  • It’s vulnerable to an explosions in the number of paths, especially when looping (and especially if the code can potentially loop infinitely)
  • If the same region of memory is referred to by two variables, it can be challenging for the analyser to detect this condition
  • Elements in a collection require special handling; do you treat the collection itself as a value, or the values in the collection as individual values?
  • It’s a lot more challenging in dynamically typed situations, where you don’t necessarily know the operations that can be performed on the values that you receive.

Nonetheless, it’s a fascinating field to play in, and can be tremendously rewarding. The video of the talk that I gave at PyCon AU 2019 is embedded below.