Cell tracking using a flow-solver

Spread the love

In this tutorial, we are going to track cells, given a list of cell positions at each time point. This tutorial is quite long, since it is a complex topic. Still, I hope that this tutorial will get you to a point where you have a basic cell tracker that you can adapt to your needs.

We’re entering a more niche topic now, which shows as that there isn’t a popular Python library that solves this problem for us. Instead, we have to rely on smaller Python packages, which are updated less frequently. We’re going to use the “Dynamic Programming Cell Tracking” (DPCT) library, which is available here. The library was originally written by Carsten Haubold and is part of the Ilastik cell segmentation and tracking toolkit. The maintainers of Ilastik generously provided the piece of software as a stand-alone library. Because the library is used from within Ilastik, we can be reasonably sure that DPCT will get at least some maintenance. Still, keep in mind that it’s a package that’s essentially maintained by a single research group, so there’s always a chance that development will stop.

Terms and conditions

The licence of DPCT is the MIT Licence, which requires you to acknowledge the authors if you redistribute the software. There’s also a paper that you should cite. There are no other restrictions, so for example building a commercial cell tracker on top of DPCT is allowed.

Why DPCT?

We are using the tracking-by-detection paradigm, where cell tracking is split into two separate problems: (1) you first detect where the cells are in each time point, and (2) then you connect those cell detections together over time so that you can follow the cells. This approach has been very successfull, getting high scores in the Cell Tracking Challenge.

DPCT is a library for the second problem, where you connect cell detections of different time points. This is a difficult problem, as there are almost unlimited possibilities to connect two sets of cell detections from subsequent time points together. DPCT then takes on the additional challenge of performing optimization over all time points at once, and allowing cell trajectories to divide.

What makes DPCT stand out, is that it does not require any expensive integer programming solvers like Gurobi or CPLEX, it only uses open-source software that Anaconda will automatically install for you. It uses a so-called min-cost flow solver, which provides excellent results that are almost as good as the solutions that use the expensive integer programming techniques.

Installation

As always, we will use Anaconda to install the packages. If you’re unfamiliar with Anaconda, then please see this page. We will also use the “Intestinal organoid crypts” example dataset, which you can download from this page.

First, open up an Anaconda Prompt and create an environment like this:

conda create --name dpcttest --channel ilastik-forge --channel conda-forge dpct scipy

This creates DPCT into an environment named dpcttest (feel free to choose a different name). We specify both the ilastic-forge and the conda-forge channels. The ilastik-forge channel contains the DPCT package, while the conda-forge channel contains some of the dependencies of DPCT. In addition, we install the Scipy package, which we will need later on.

About the Python version. At the time of writing, this command resulted in an environment with Python 3.9, while the newest Python version was 3.11. DPCT is largely written in C++, which unfortunately means that it needs to be recompiled for every new Python release. The resulting package then needs to be uploaded to Anaconda, and only then you can use it on a newer Python version.

In practise, working with an older Python version is not such a problem. However, if you have an existing Anaconda environment that you would like to install DPCT in, it might not be possible because your Python version is too new. In that case, just install DPCT in a separate environment, like the instructions on this page say.

Doing a test with DPCT

The documentation of DPCT is a bit lacking, but the README file points to an example. The example is reproduced below (modified slightly):

import dpct

# Define the weights - links, detections, divisions, appearances, dissappearances
weights = {"weights": [1, 1, 1, 1, 1]}

graph = {
	"settings" : {
		"statesShareWeights" : True
	},

	"segmentationHypotheses" : [
		{ "id" : 2, "timestep" : [1,1], "features" : [[1.0], [0.0]], "divisionFeatures" : [[0], [-5]], "appearanceFeatures" : [[0], [0]], "disappearanceFeatures" : [[0], [50]]},
		{ "id" : 3, "timestep" : [1,1], "features" : [[1.0], [0.0]], "divisionFeatures" : [[0], [-5]], "appearanceFeatures" : [[0], [0]], "disappearanceFeatures" : [[0], [50]]},
		{ "id" : 4, "timestep" : [2,2], "features" : [[1.0], [0.0]], "appearanceFeatures" : [[0], [50]], "disappearanceFeatures" : [[0], [-2]]},
		{ "id" : 5, "timestep" : [2,2], "features" : [[1.0], [0.0]], "appearanceFeatures" : [[0], [50]], "disappearanceFeatures" : [[0], [-2]]},
		{ "id" : 6, "timestep" : [2,2], "features" : [[1.0], [0.0]], "appearanceFeatures" : [[0], [50]], "disappearanceFeatures" : [[0], [-4]]}
	],

	"linkingHypotheses" : [
		{ "src" : 2, "dest" : 4, "features" : [[0], [-4]]},
		{ "src" : 2, "dest" : 5, "features" : [[0], [-3]]},
		{ "src" : 3, "dest" : 5, "features" : [[0], [-1]]},
		{ "src" : 3, "dest" : 6, "features" : [[0], [-4]]}
	]
}

res = dpct.trackFlowBased(graph, weights)
print(res)

Ok, that’s quite a lot of code. On line 27, we call the function of DPCT that does the tracking for us. The function needs two parameters: a dictionary named graph and a dictionary named weights. The dictionary named graph needs to have three entries: "settings", "segmentationHypotheses" and "linkingHypotheses".

We will start at the segmentation hypotheses. This a list, with each entry being a dictionary that looks like this:

{
  "id" : 2,
  "timestep" : [1,1],
  "features" : [[1.0], [0.0]],
  "divisionFeatures" : [[0], [-5]],
  "appearanceFeatures" : [[0], [0]],
  "disappearanceFeatures" : [[0], [50]]
}

Each entry represents a single detected cell at a single time point. We need to give all cell detections to the DPCT linker, and then it will link them together. For administrative purposes, we need to give each cell and id. Then, for all other properties we have two values, one for the state “absent” and one for the state “present”. The values are “energies”, which means that the lower the number is, the more likely a cell is to be in that state. As a result, the value functions as a kind of scoring penalty. For example, the division features the numbers are 0 and -5 for this segmentation. So if the cell division is absent, the scoring penalty is 0, if the division is present, the scoring penalty is -5, corresponding to a bonus of 5.

As a sidenode, I’m not sure how exactly this data structure works. For example, the “timestep” value has two possible states. However, in their examples, I’ve never seen a different value for one state compared to the other, and I’m not sure what it would even mean. Would that mean that a cell detection can be in different time points, depending on the state?

You have to set values for the “features”, namely the score if the detected cell is not used, and the score if the detected cell is used in linking. If you are sure that every detected cell is a real cell, you can set the penalty of not using a detected cell very high, but if you’re not so sure you should make this penalty lower.

You can leave out “divisionFeatures” if your cell doesn’t divide, “appearancePenalty” if your cell cannot appear in this time point and “dissappearancePenalty” if your cell cannot disseappear after the time point. However, I would just keep “appearancePenalty” and “dissappearancePenalty” in, and just give them a high penalty. Otherwise, if you didn’t detect a cell at one time point, the entire trajectory cannot be constructed.

Ok, that was quite some information. Now we continue with the linking hypotheses. Here you have to list all possible links. The format looks like this:

{ "src" : 2, "dest" : 4, "features" : [[0], [-4]]},

"src" and "dest" point to ids that you used in the "segmentationHypotheses" list. Then there’s also "features". Like for the positions, each entry has a score that is used if the link is not in use, and a score that is used if the link is in use. For example, you could create links between all nearby cells, and then score them based on the distance.

We will work out a real example soon, but for now, please run the code block. One way to do this is to save the above code to a file named link_testing.py, and then opening an Anaconda Prompt in the ffolder with link_testing.py, activating the dpcttest environment and then running python link_testing.py. The output should look like this:

{'detectionResults': [{'id': 2, 'value': 1}, {'id': 3, 'value': 1}, {'id': 4, 'value': 1}, {'id': 5, 'value': 1}, {'id': 6, 'value': 1}],
 'linkingResults': [{'src': 2, 'dest': 4, 'value': 1}, {'src': 2, 'dest': 5, 'value': 1}, {'src': 3, 'dest': 5, 'value': 0}, {'src': 3, 'dest': 6, 'value': 1}],
 'divisionResults': [{'id': 2, 'value': True}, {'id': 3, 'value': False}, {'id': 4, 'value': False}, {'id': 5, 'value': False}, {'id': 6, 'value': False}]}

Loading actual data

Now that we have seen DPCT in action, it’s time to work out a minimal example. We will work with known cell positions, which we will store as a numpy array. The data structure for the positions will look like this:

import numpy

positions = numpy.array([
    # index, time point, x, y, z
    [0, 1, 29.2, 34.3, 5.1],
    [1, 1, 23.8, 3.2, 5.4],
    [2, 1, 34.4, 35.3, 5.1],
    [2, 1, 76.3, 13.8, 5.9],
    ...
    [78, 2, 20.1, 14.8, 5.1],
    [79, 2, 48.4, 74.3, 6.2],
    [80, 2, 25.2, 54.6, 5.6],
    ...
])

(You don’t need to run this code.) Each row of the positions array represents the position of a detected cell. Each row contains an index, the time point and the x, y and z coordinate.

To get an actual dataset, download the “Intestinal organoid crypts” example dataset and extract it somewhere. Then, using the following code we’ll load in the positions:

import numpy
import json

positions = []

# Load in a file
index = 0
dataset_file = r"path\to\Dataset intestinal organoid proliferation\Tracking Data\OrganoidTracker format\nd799xy08.aut"
with open(dataset_file, "r") as handle:
    positions_json = json.load(handle)["positions"]
    for time_point_str, positions_json_at_time_point in positions_json.items():
        time_point_int = int(time_point_str)  # Time point is stored as string, as JSON can only have strings as key. So convert it to an int
        for position_json in positions_json_at_time_point:
            positions.append([index, time_point_int, position_json[0], position_json[1], position_json[2]])
            index += 1

# Convert to numpy
positions = numpy.array(positions)
print("Shape:", positions.shape)  # Should print "Shape: (26937, 5)"

To run this code, you need to update the path on line 7 to point to where you downloaded the dataset. In this piece of code, we read a .aut file, which is a file containing tracking data. This file is a JSON-file, which allows us to easily read it from Python. On line 9, we read the entire file, and then take out the “positions” data structure. (The file also contains links and other data, but for this example we will ignore those. After all, we just want a set of positions to try out our linking algorithm on.)

The remainder of the code is typical Python glue code, used to convert one format (the .aut file) to the format we discussed above. Since this code is specific to the data format of the example, you’ll need to write different code anyways, so we will not discuss this piece of code further.

Please copy the above code to a file and run it, and verify that you get “Shape: (26937, 5)” as the output.

Creating a list of possible links

Ok, now we have our positions, the next step is to establish some basic links. We will give multiple possibilities, so that DPCT can pick the most likely one. Add the following code to the file that contained the previous code block:

import scipy.spatial

possible_links = []

# Loop over all time points
for t in set(positions[:, 1]):
    positions_at_time = positions[positions[:, 1] == t]
    positions_at_next_time = positions[positions[:, 1] == t + 1]
    if len(positions_at_next_time) == 0:
        continue  # At last time point, nothing to link to

    # Use column 2:5 of each row, which contain the x, y and z, for the distance calculations
    distance_matrix = scipy.spatial.distance_matrix(positions_at_time[:, 2:5], positions_at_next_time[:, 2:5])

    # For each position at time point t, find the closest one at time point t+1, and add links for up to 2 times this distance
    for i, position in enumerate(positions_at_time):
        all_distances_to_position = distance_matrix[i]  # Distances to all positions in the next time point
        minimum_distance = numpy.min(all_distances_to_position)
        possible_next_positions = positions_at_next_time[all_distances_to_position < 2 * minimum_distance]
        for possible_next_position in possible_next_positions:
            penalty = _calculate_penalty(position, possible_next_position)
            possible_links.append({
                "src": int(position[0]),
                "dest": int(possible_next_position[0]),
                "features": [[0], [float(penalty)]]
            })

This script sets up a list named possible_links. We consider every link to be possible if it is at most twice as far as the closest position. The list possible_links is in the format that DPCT expects, which was discussed above. We convert all numpy numbers to Python numbers using int(...) and float(...), as otherwise DPCT will not run. To run this code, one piece is missing: the function _calculate_penalty.

Given two positions, this function will need to calculate how likely the link is. We can make this as complex as we want. We will just use the distance - 10, so that links longer than 10px have a penalty attached to them. The function looks as follows:

def _calculate_penalty(position, next_position):
    position_xyz = position[2:5]
    next_position_xyz = next_position[2:5]
    distance = numpy.sqrt(numpy.sum((position_xyz - next_position_xyz) ** 2))
    return distance - 10

Insert this code block between the previous two. You can’t insert it afterwards, since then the function is called before it’s defined, and that’s not allowed in Python.

Defining the possible positions and divisions

Ok, now we need to estabish a positions list in the format that DPCT wants. Creating this list is less complex compared to the list of links, fortunately:

segmentation_hypotheses = []
for position in positions:
    segmentation_hypotheses.append({
        "id" : int(position[0]),
        "timestep" : [int(position[1]), int(position[1])],
        "features" : [[20], [0]],
        "divisionFeatures" : [[0], [50]],
        "appearanceFeatures" : [[0], [50]],
        "disappearanceFeatures" : [[0], [50]]
})

Here, we simply create a new list, with a single entry for each position. We use a fixed penalty for appearance, dissappearance and division. Something smarter would be better here, like some method to detect when a cell divides, or is about to dissappear. But for this tutorial we keep it simple, and give all of those a penalty of 50 if they’re used.

We also give a penalty of 20 if the position is not used at all during tracking. Since we used the positions of a previously tracked organoid, we know that all positions are correct, so every position that is ignored is a tracking mistake. In a more realistic scenario, some detected positions will be incorrect. Ideally, you would have some way to know how likely a position is to be correct, and you would adjust the penalty for that: obvious cells would get a high penalty if missed, and questionable cell detections a low penalty.

Add this code to the bottom of your file. In the next step, we will actually run DPCT.

Running DPCT on our actual data

With all the preparations above, there’s not much code needed to actually run DPCT:

import dpct

# Define the weights - links, detections, divisions, appearances, dissappearances
weights = {"weights": [1, 1, 1, 1, 1]}
graph = {
	"settings" : {
		"statesShareWeights" : True
	},
	"segmentationHypotheses" : segmentation_hypotheses,
	"linkingHypotheses" : possible_links
}
result = dpct.trackFlowBased(graph, weights)

Add this section to the file too, and run it. If everything went well, the output should now look like this:

Shape: (26937, 5)
        contains 26937 segmentation hypotheses
        contains 37747 linking hypotheses
Log: Initializing Residual Graph ...
Log:  done in 1.58383 secs
Log: Beginning tracking ...
Log: Tracking took 16.798 secs and 794 iterations
Log: Final energy: -590947

Did it work? Great! Now we need some way to save the data.

Saving the tracking data

Here, we’re going to save the tracking data to a CSV file. The file will follow our pre-established format, so in every row we get the id, time point, x, y and z of a cell. In addition, we will now also provide a “from_id”, which is the id of the cell detection in the previous time point. So for example detection number 4525 (from time point 20) can be linked to detection nmber 4452 (from time point 19).

You may be wondering why we link a cell to its detection in the previous time point, and not to its detection in the next. The reason is simple: if a cell divides, it would have two detection ids in the next time point. By providing the id of the detection in the previous time point, we will always have one id. In case of a cell division, two cells would just point to the same cell in the previous time point, namely the mother cell.

If a cell cannot be linked to any cell in the previous time point, we will set from_id to -1.

The code to create this table is as follows:

# Set up a positions array with an additional column, which references the source of the link
positions_with_links = numpy.full(fill_value=-1, shape=(positions.shape[0], positions.shape[1] + 1), dtype=positions.dtype)
positions_with_links[:, :-1] = positions

# Fill that extra column
for link in result["linkingResults"]:
    if link["value"] == 0:
        continue
    source_id = link["src"]
    destination_id = link["dest"]
    positions_with_links[destination_id, -1] = source_id

# Write the thing to file
with open("positions_with_links.csv", "w") as file_handle:
    file_handle.write("sep=,\n")  # Makes Excel realize that the separator is a comma
    file_handle.write("Id,Time,X,Y,Z,From_Id\n")  # The header
    numpy.savetxt(file_handle, positions_with_links, delimiter=",")

Add this code to your Python file, and run it. Afterwards, you should have a CSV file that you can open in Excel (or any other spreadsheet program).

You’ll notice that in the first time point, all “from_id”s are -1, as there is no previous detection. From the next time point on, most cells should be linked however. The linking data is far from perfect, since we didn’t give DPCT any clue which cells might be dividing and we’re doing everything purely based on distance, but it’s a start.

Closing words

If you followed the tutorial this far, congratulations! You now have a (quite minimal) cell linker. The input for DPCT can still be improved a lot:

Find some way to detect how good a cell detection is, and use that to adjust the penalty for ignoring that detection.
Find some way to detect cells that are about to divide, and adjust the division penalty for that.
Find some way to detect cells that are about to dissappear, and give those a lower appearance penalty.
- For example for dying cells, but also cells that are likely to leave the imaged area.
Find some way to detect cells that just appeared, and give those a lower appearance penalty.
Find some way to visualize the data.
Find better ways to decide what links are possible.
Find better ways to score links.

Some tips:

Not all cells will be detected. If you set the appearance and disspearance penalties too high, then one missed detection can already cause an entire cell trajectory to be missed. So keep that in mind.
Here, we ignored that the Z-resolution is lower than the XY-resolution. So the calculated distances in pixels are not that meaningful, since a 5px movement in Z is huge, but a 5px movement in X is tiny. You can solve this by having all coordinates in micrometers, instead of pixels.

There are many scientific papers that follow the approach used here, namely “Tracking by detection”. Some use a graph-based optimization method, some use something simpler. You should be able to reimplement parts of their method within this framework. Some examples include, in alphabetical order:

Bao, Z. et al. Automated cell lineage tracing in Caenorhabditis elegans. Proceedings of the National Academy of Sciences of the United States of America 103, 2707–12 (2006). doi: 10.1073/pnas.0511111103.
Kok, R.N.U. et al. OrganoidTracker: Efficient cell tracking using machine learning and manual error correction. PLoS ONE 15, e0240802 (2020). doi: 10.1371/journal.pone.0240802.
Malin-Mayor, C. et al. Automated Reconstruction of Whole-Embryo Cell Lineages by Learning from Sparse Annotations. Nature Biotechnology (2022) doi: 10.1038/s41587-022-01427-7.
Wen, C. et al. 3DeeCellTracker, a deep learning-based pipeline for segmenting and tracking cells in 3D time lapse images. eLife 10, (2021). doi: 10.7554/eLife.59187.
Ulicna, K., Vallardi, G., Charras, G. & Lowe, A. R. Automated deep lineage tree analysis using a Bayesian single cell tracking approach. Frontiers in Computer Science 3, 734559 (2021). doi: 10.1038/s41587-022-01427-7.
Löffler, K., Scherr, T. & Mikut, R. A graph-based cell tracking algorithm with few manually tunable parameters and automated segmentation error correction. PLoS ONE 16, e0249257 (2021). doi: 10.1371/journal.pone.0249257.
Scherr, T., Löffler, K., Böhland, M. & Mikut, R. Cell segmentation and tracking using CNN-based distance predictions and a graph-based matching strategy. PLoS ONE 15, 0243219 (2020). doi: 10.1371/journal.pone.0243219.
Sugawara, K., Çevrim, Ç. & Averof, M. Tracking cell lineages in 3D by incremental deep learning. eLife 11, 1–19 (2022). doi: 10.7554/elife.69380.