Row/Column Major Order Issues
RSTSR is currently one of the very few tensor libraries that support both row-major and column-major (col-major) storage layouts.
By default, RSTSR uses row-major order. However, advanced users may consider using column-major, or even mixing row-major and column-major layouts. This can be confusing for beginners learning RSTSR, or for users transitioning from NumPy workflows to Rust.
This documentation section will provide a detailed discussion of row/column-major issues in RSTSR. Generally, what RSTSR refers to as row/column-major corresponds to:
- Row-major: NumPy
- Column-major: Julia (with some extensions)
When row-major and column-major differ, even with identical input data and code, different results may be produced!
1. Default Row/Column Major Determined by Cargo Feature
In the `rstsr` or `rstsr-core` crates, the cargo features `row_major` and `col_major` determine the default storage layout for the entire tensor library. Users should also note:
- If neither `row_major` nor `col_major` is specified, RSTSR defaults to row-major;
- If both `row_major` and `col_major` are specified, a compilation error will occur.
Additionally, users should be aware that cargo features are designed to be additive. Therefore:
- If your program doesn't specify `row_major` or `col_major` (intending to use the default row-major), but an upstream/downstream dependency specifies `col_major`, your program will run in column-major mode, which may differ from your intention.
- Conversely, if your program specifies one layout while a dependency specifies the other, a compilation error will occur.
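As a concrete sketch, opting into column-major could look like the following in `Cargo.toml` (the feature name comes from the text above; the version requirement here is a placeholder, not taken from the source):

```toml
[dependencies]
# enable the col_major cargo feature; version requirement is a placeholder
rstsr = { version = "*", features = ["col_major"] }
```

Because cargo features are additive, any one crate in the dependency graph enabling `col_major` flips the default for every user of the library.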
2. Runtime Modification of Tensor Layout via Device Settings
RSTSR also allows changing row/column-major order at runtime through `device.set_default_order`:
```rust
let vec = vec![0., 1., 2., 3., 4., 5.];
let mut device = DeviceFaer::new(4);
device.set_default_order(RowMajor);
let a = rt::asarray((&vec, [2, 3], &device));
println!("{:?}", a);
// output:
// [[ 0.0 1.0 2.0]
//  [ 3.0 4.0 5.0]]
// 2-Dim (dyn), contiguous: Cc

device.set_default_order(ColMajor);
let b = rt::asarray((&vec, [2, 3], &device));
println!("{:?}", b);
// output:
// [[ 0.0 2.0 4.0]
//  [ 1.0 3.0 5.0]]
// 2-Dim (dyn), contiguous: Ff
```
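The stride arithmetic behind these two outputs can be sketched in plain std Rust (this is not the RSTSR API; `matrix_from` is a hypothetical helper for illustration): the same flat buffer is interpreted with different strides.

```rust
// Materialize a 2-D view of a flat buffer, given shape and strides.
fn matrix_from(data: &[f64], shape: [usize; 2], strides: [usize; 2]) -> Vec<Vec<f64>> {
    (0..shape[0])
        .map(|i| (0..shape[1]).map(|j| data[i * strides[0] + j * strides[1]]).collect())
        .collect()
}

fn main() {
    let vec = [0., 1., 2., 3., 4., 5.];
    // row-major strides for shape [2, 3]: the last axis is contiguous
    let row = matrix_from(&vec, [2, 3], [3, 1]);
    // column-major strides for shape [2, 3]: the first axis is contiguous
    let col = matrix_from(&vec, [2, 3], [1, 2]);
    assert_eq!(row, vec![vec![0., 1., 2.], vec![3., 4., 5.]]);
    assert_eq!(col, vec![vec![0., 2., 4.], vec![1., 3., 5.]]);
}
```

Same data, same shape, different strides: exactly the difference between the `Cc` and `Ff` tensors above.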
However, note that operations between row-major and column-major tensors are prohibited. This does not cause a compile-time error, but it triggers a runtime error.
One solution is to leverage device switching (using the `change_device` trait function). For CPU devices, `change_device` consumes variable `b` without explicitly copying data, making this device switch computationally cost-free.
```rust
// change_device will consume variable `b`
let b_row_major = b.change_device(a.device());
let c = &a + &b_row_major;
println!("{:?}", c);
// output:
// [[ 0.0 3.0 6.0]
//  [ 4.0 7.0 10.0]]
// 2-Dim (dyn), contiguous: Cc
```
3. Result Differences Caused by Row/Column Major
3.1 reshape and asarray
The behavior of `reshape` differs completely between row-major and column-major.
The previous section's code already demonstrates how `asarray` produces different results under different layouts.
When the `asarray` function receives parameters containing data and shape (note: not layout):

```rust
rt::asarray((data_vec, shape, &device))
```

it is functionally equivalent to `reshape` (or the similar `into_shape`):

```rust
rt::asarray((data_vec, &device)).into_shape(shape)
```

Thus, `asarray` can encounter the same situations as `reshape`.
While `reshape` produces computationally identical results for c-contiguous and f-contiguous tensors, the underlying processes may differ.
This can be confusing: row/column-major differs from c/f-contiguous. Row/column-major refers to iteration order, while c/f-contiguous refers to storage order.
For two tensors, as long as their iteration orders match, they are computationally equivalent regardless of storage order. `reshape` can be viewed as a computation; given consistent iteration order, any storage order will produce identical computational results.
However, identical results don't imply identical processes. Under row-major, a c-contiguous tensor can be reshaped without data copying, since the data before and after the reshape remains `[0, 1, 2, 3, 4, 5]`. An f-contiguous tensor, by contrast, requires an explicit data copy, since its data before the reshape is `[0, 3, 1, 4, 2, 5]`, which differs from the result.
The following code implements this example:
```rust
let mut device = DeviceFaer::new(4);
device.set_default_order(RowMajor);

let vec_c: Vec<f64> = vec![0., 1., 2., 3., 4., 5.];
let c = rt::asarray((&vec_c, [2, 3].c(), &device));
let vec_f: Vec<f64> = vec![0., 3., 1., 4., 2., 5.];
let f = rt::asarray((&vec_f, [2, 3].f(), &device));
let diff = (&c - &f).abs().sum();
println!("{:?}", diff);
// output: 0.0

let c_flatten = c.reshape(-1);
let f_flatten = f.reshape(-1);
println!("{:}", c_flatten);
// output: [ 0 1 2 3 4 5]
println!("{:}", f_flatten);
// output: [ 0 1 2 3 4 5]

use core::ptr::eq as ptr_eq;
// c_flatten shares same ptr to vec_c? true
println!("{:?}", ptr_eq(vec_c.as_ptr(), c_flatten.raw().as_ptr()));
// f_flatten shares same ptr to vec_f? false
println!("{:?}", ptr_eq(vec_f.as_ptr(), f_flatten.raw().as_ptr()));
```
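Why the f-contiguous tensor must copy can be shown in plain std Rust (again, not the RSTSR API, just a sketch of the stride arithmetic): reading the f-contiguous storage in row-major iteration order does not reproduce the storage itself.

```rust
fn main() {
    // f-contiguous storage of the matrix [[0, 1, 2], [3, 4, 5]]
    let storage_f = [0., 3., 1., 4., 2., 5.];
    let (rows, cols) = (2, 3);
    // row-major iteration over the logical matrix, using col-major strides [1, rows]
    let flattened: Vec<f64> = (0..rows)
        .flat_map(|i| (0..cols).map(move |j| storage_f[i + j * rows]))
        .collect();
    // the flattened result is the logical row-major order [0, 1, 2, 3, 4, 5] ...
    assert_eq!(flattened, vec![0., 1., 2., 3., 4., 5.]);
    // ... which differs from the raw storage, so a copy is unavoidable
    assert_ne!(flattened.as_slice(), storage_f.as_slice());
}
```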
3.2 broadcast
Row-major and column-major have completely opposite broadcast rules.
For row/column-major, the following scenarios produce computationally identical results despite differing c/f-contiguous outputs:
- Elementwise operations where tensors have matching dimensions (`ndim`), or where one operand is 0-dimensional (scalar);
- Matrix multiplications where all operands are matrices or vectors (≤ 2 dimensions).
Other cases may differ significantly.
For programs handling both layouts, it's recommended to align tensor dimensions (`ndim`) before binary operations requiring broadcasting. Basic indexing can achieve this: e.g., expanding a 1-D tensor `a` via `a.i((None, ..))` (row vector) or `a.i((.., None))` (column vector).
Consider these two elementwise problems:
The first problem works in row-major but fails in column-major:
The problem setting:

```rust
let vec_a = vec![1, 2, 3, 4, 5, 6];
let a = rt::asarray((&vec_a, [2, 3].c(), &DeviceFaer::default()));
let vec_b = vec![1, 0, -1];
let b = rt::asarray((&vec_b, [3], &DeviceFaer::default()));
```

In row-major:

```rust
let mut device = DeviceFaer::new(4);
device.set_default_order(RowMajor);
let a_row_major = a.to_device(&device);
let b_row_major = b.to_device(&device);
let c = a_row_major * b_row_major;
println!("{:3}", c);
// [[  1   0  -3]
//  [  4   0  -6]]
```
The second problem works in column-major but fails in row-major:
The problem setting:

```rust
let vec_a = vec![1, 2, 3, 4, 5, 6];
let a = rt::asarray((&vec_a, [2, 3].c(), &DeviceFaer::default()));
let vec_b = vec![1, -1];
let b = rt::asarray((&vec_b, [2], &DeviceFaer::default()));
```

In column-major:

```rust
let mut device = DeviceFaer::new(4);
device.set_default_order(ColMajor);
let a_col_major = a.to_device(&device);
let b_col_major = b.to_device(&device);
let c = a_col_major * b_col_major;
println!("{:3}", c);
// [[  1   2   3]
//  [ -4  -5  -6]]
```
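The two problems can be explained by the opposite alignment rules: row-major aligns shapes from the trailing axis (as in NumPy), while column-major aligns from the leading axis. A std-only sketch (assumption: the column-major rule is the exact mirror image of NumPy's, as the two problems above suggest; `broadcast` is a hypothetical helper, not RSTSR API):

```rust
// Compute the broadcast shape of two shapes, or None if incompatible.
fn broadcast(a: &[usize], b: &[usize], row_major: bool) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for k in 0..n {
        // pick the axis pair: aligned at the end for row-major,
        // aligned at the front for column-major; missing axes count as 1
        let (x, y) = if row_major {
            (a.len().checked_sub(n - k).map(|i| a[i]).unwrap_or(1),
             b.len().checked_sub(n - k).map(|i| b[i]).unwrap_or(1))
        } else {
            (*a.get(k).unwrap_or(&1), *b.get(k).unwrap_or(&1))
        };
        out[k] = match (x, y) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible axis lengths
        };
    }
    Some(out)
}

fn main() {
    // the first problem: [2, 3] * [3]
    assert_eq!(broadcast(&[2, 3], &[3], true), Some(vec![2, 3])); // row-major: ok
    assert_eq!(broadcast(&[2, 3], &[3], false), None);            // col-major: fails
    // the second problem: [2, 3] * [2]
    assert_eq!(broadcast(&[2, 3], &[2], true), None);             // row-major: fails
    assert_eq!(broadcast(&[2, 3], &[2], false), Some(vec![2, 3])); // col-major: ok
}
```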
Matrix multiplication behaves similarly. For example:
- Multiplying `(2, 3, 4)` with `(4, 5)` tensors yields `(2, 3, 5)` in row-major but fails in column-major;
- Multiplying `(2, 3, 4)` with `(5, 2)` tensors fails in row-major but yields `(5, 3, 4)` in column-major.
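One way to model these shape rules (an assumption consistent with the two examples above, not the RSTSR implementation): in row-major, the last two axes of the batched operand form the matrix and the leading axes are batch; column-major is the mirror image, obtained by reversing the shapes, applying the row-major rule, and reversing the result. `matmul_shape` below is a hypothetical helper covering a batched operand times a 2-D operand:

```rust
// Row-major rule for `a` (ndim >= 2, batched) times 2-D `b`:
// contract a's last axis with b's first axis.
fn matmul_shape_rm(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    if a.len() < 2 || b.len() != 2 || a[a.len() - 1] != b[0] {
        return None; // contraction axes must match
    }
    let mut out = a[..a.len() - 2].to_vec();
    out.push(a[a.len() - 2]);
    out.push(b[1]);
    Some(out)
}

// Column-major modeled as the mirror image of the row-major rule.
fn matmul_shape(a: &[usize], b: &[usize], row_major: bool) -> Option<Vec<usize>> {
    if row_major {
        matmul_shape_rm(a, b)
    } else {
        let ar: Vec<usize> = a.iter().rev().copied().collect();
        let br: Vec<usize> = b.iter().rev().copied().collect();
        matmul_shape_rm(&ar, &br).map(|mut s| { s.reverse(); s })
    }
}

fn main() {
    assert_eq!(matmul_shape(&[2, 3, 4], &[4, 5], true), Some(vec![2, 3, 5]));
    assert_eq!(matmul_shape(&[2, 3, 4], &[4, 5], false), None);
    assert_eq!(matmul_shape(&[2, 3, 4], &[5, 2], true), None);
    assert_eq!(matmul_shape(&[2, 3, 4], &[5, 2], false), Some(vec![5, 3, 4]));
}
```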
4. Performance Differences Between Row/Column Major?
In short: generally no.
This requires clarification.
First, efficiency depends on iteration order, which is a c/f-contiguous issue rather than a row/column-major one. Performance varies based on the algorithms and how they align with storage order.
Second, RSTSR internally uses column-major iteration for elementwise operations, transposing layouts as needed. This process is layout-agnostic, though c/f-contiguous tensors may undergo different transpositions. Generally, operations with matching contiguity (either c or f) achieve similar efficiency through internal column-major iteration.
Third, for BLAS-compatible (`f32`/`f64` or complex) matrix multiplications, efficiency remains consistent when all matrices are either c-prefer or f-prefer (where one stride is 1 and the other a positive integer). BLAS devices achieve this through transpose parameters and multiplication order.
Fourth, for other linear algebra operations, RSTSR currently follows LAPACKE - requiring row-major matrices to be copied and transposed, incurring acceptable overhead since LAPACK operations are typically O(N³) with computation time dwarfing memory operations. Future versions may improve this.
Fifth, for very small tensors where layout handling becomes significant, column-major may incur overhead as RSTSR internally transposes to row-major for broadcasting before transposing back. However, this overhead is negligible for typical scientific computing.
RSTSR developers maintain that with proper usage, row/column-major differences primarily reflect conventions rather than performance gaps.
5. Motivation: Why Support Both Layouts?
- Quantum chemistry programs like PySCF/Psi4 (using NumPy), tensor libraries like TiledArray, and ML frameworks like PyTorch/JAX typically use row-major.
- Fortran-based programs (GAMESS, Gaussian, FHI-Aims) and languages like Matlab/Julia use column-major.
- Libraries like Eigen support both layouts through generics, defaulting to column-major.
- Pure tensor contraction libraries like TBLIS only read layouts without specializing for either.
It's unclear which layout holds the majority.
The REST quantum chemistry program uses column-major, motivating RSTSR's column-major support. However, RSTSR also aims for NumPy-like Rust programming, requiring row-major support for compatibility. This dual need drove RSTSR's bidirectional layout implementation.