Python, the language of the Blender API, is permissive about data types. However, when working with large amounts of data, implicit type conversion can noticeably slow down your code. For example, the simplest foreach_get() call, which copies data from a collection of elements into a flat array, can be sped up significantly just by choosing the right data type for the destination array.
User Mysteryem in the Blender chat benchmarked this by reading the location of each particle in a particle system of 10,000 elements.
In the experiment, arrays of different types were created, and the particle locations were copied into each of them with foreach_get().
Arrays were created of the following types:
- regular array of floating point numbers (float)
- regular array of double precision floating point numbers (double)
- numpy array of floating point numbers (float)
- numpy array of double precision floating point numbers (double)
- ctypes array of floating point numbers (float)
- ctypes array of double precision floating point numbers (double)
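As a quick aside (a standalone sketch, not part of Mysteryem's benchmark): the float/double distinction in these buffer types can be verified directly in Python by checking their element sizes, 4 bytes for single precision and 8 bytes for double precision:

```python
import array
import ctypes
import numpy as np

n = 3 * 10  # room for 10 "particles" of 3 components each

array_f = array.array('f', [0, 0, 0]) * 10  # single precision
array_d = array.array('d', [0, 0, 0]) * 10  # double precision
ndarray_f = np.empty(n, dtype=np.single)
ndarray_d = np.empty(n, dtype=np.double)
ctypes_f = (ctypes.c_float * n)()
ctypes_d = (ctypes.c_double * n)()

# Element sizes in bytes: single precision = 4, double precision = 8
print(array_f.itemsize, array_d.itemsize)      # 4 8
print(ndarray_f.itemsize, ndarray_d.itemsize)  # 4 8
print(ctypes.sizeof(ctypes.c_float), ctypes.sizeof(ctypes.c_double))  # 4 8
```

Only the buffers whose element type matches the C type of the property (here, float) let foreach_get() fall back to a raw memory copy.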
Mysteryem’s code:
```python
import bpy
import timeit
import ctypes
import numpy as np
import array

obj = bpy.context.object
depsgraph = bpy.context.evaluated_depsgraph_get()
obj_eval = obj.evaluated_get(depsgraph)
psys = obj_eval.particle_systems[0]
parlen = len(psys.particles)

par_loc = [0, 0, 0] * parlen
array_f = array.array('f', [0, 0, 0]) * parlen
array_d = array.array('d', [0, 0, 0]) * parlen
ndarray_f = np.empty(3 * parlen, dtype=np.single)
ndarray_d = np.empty(3 * parlen, dtype=np.double)
ctypes_f = (ctypes.c_float * (3 * parlen))()
ctypes_d = (ctypes.c_double * (3 * parlen))()

# My test case was a default particle system with 1000 particles.
# If your particle system has more than this, you may want to reduce the number of `trials` below.
# Note that being able to use memcpy in foreach_get scales much better than iterating
# as the number of particles increases.
print(f"parlen: {parlen}")

trials = 1000

# foreach_get can't memcpy into lists, but lists are pretty fast to iterate compared to most other types
# ~0.026ms for 1000 particles
t_par_loc = timeit.timeit("psys.particles.foreach_get('location', par_loc)", globals=globals(), number=trials) / trials
print("t_par_loc:")
print(f"\t{t_par_loc * 1000:f}ms")

# foreach_get could perform a memcpy because the buffer type matched the property's C type!
# ~0.003ms for 1000 particles
t_array_f = timeit.timeit("psys.particles.foreach_get('location', array_f)", globals=globals(), number=trials) / trials
print("t_array_f:")
print(f"\t{t_array_f * 1000:f}ms")

# ~0.099ms for 1000 particles
t_array_d = timeit.timeit("psys.particles.foreach_get('location', array_d)", globals=globals(), number=trials) / trials
print("t_array_d:")
print(f"\t{t_array_d * 1000:f}ms")

# foreach_get could perform a memcpy because the buffer type matched the property's C type!
# ~0.003ms for 1000 particles
t_ndarray_f = timeit.timeit("psys.particles.foreach_get('location', ndarray_f)", globals=globals(), number=trials) / trials
print("t_ndarray_f:")
print(f"\t{t_ndarray_f * 1000:f}ms")

# ~0.113ms for 1000 particles
t_ndarray_d = timeit.timeit("psys.particles.foreach_get('location', ndarray_d)", globals=globals(), number=trials) / trials
print("t_ndarray_d:")
print(f"\t{t_ndarray_d * 1000:f}ms")

# foreach_get should have been able to perform a memcpy, but its C code has issues
# that prevent ctypes arrays from being supported
# ~0.374ms for 1000 particles
t_ctypes_f = timeit.timeit("psys.particles.foreach_get('location', ctypes_f)", globals=globals(), number=trials) / trials
print("t_ctypes_f:")
print(f"\t{t_ctypes_f * 1000:f}ms")

# ~0.374ms for 1000 particles
t_ctypes_d = timeit.timeit("psys.particles.foreach_get('location', ctypes_d)", globals=globals(), number=trials) / trials
print("t_ctypes_d:")
print(f"\t{t_ctypes_d * 1000:f}ms")
```
The result of executing the code is as follows:
```
parlen: 10000
t_par_loc:
	0.446990ms
t_array_f:
	0.033495ms
t_array_d:
	2.511545ms
t_ndarray_f:
	0.031628ms
t_ndarray_d:
	1.293199ms
t_ctypes_f:
	4.844052ms
t_ctypes_d:
	4.707416ms
```
Execution results may depend on the computer and operating system you are using.
Based on the results of executing the code, the following conclusions can be drawn:
The numpy array is the best.
The maximum speed was demonstrated by foreach_get() with a numpy array of type float, which matches the C type in which Blender stores particle locations (a regular array.array of type float was almost as fast).
Any mismatch with the source data type forces an element-by-element conversion, which can significantly slow the call: the very same numpy array becomes roughly 40 times slower in these results when typed as double.
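The cost of that conversion can be illustrated outside Blender as well. The sketch below (a standalone illustration, not part of the original benchmark) copies a float32 source, standing in for 10,000 particle locations, into destinations of matching and mismatched dtype; the mismatched copy must convert every element instead of performing a straight memory copy:

```python
import timeit
import numpy as np

src = np.random.rand(3 * 10000).astype(np.float32)  # stand-in for 10,000 particle locations

dst_f = np.empty(src.shape, dtype=np.float32)  # matching type: plain memory copy
dst_d = np.empty(src.shape, dtype=np.float64)  # mismatched type: per-element conversion

t_f = timeit.timeit(lambda: np.copyto(dst_f, src), number=1000) / 1000
t_d = timeit.timeit(lambda: np.copyto(dst_d, src), number=1000) / 1000
print(f"float32 -> float32: {t_f * 1000:f}ms")
print(f"float32 -> float64: {t_d * 1000:f}ms")
```

The same principle applies to any foreach_get()/foreach_set() destination: match the buffer's element type to the property's C type and the copy stays cheap.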