Row/Column Major Order Issues
RSTSR is currently one of the very few tensor libraries that support both row-major and column-major (col-major) storage layouts.
By default, RSTSR uses row-major order. However, advanced users may consider using column-major, or even mixing row-major and column-major layouts. This can be confusing for beginners learning RSTSR, or for users transitioning from NumPy workflows to Rust.
This documentation section will provide a detailed discussion of row/column-major issues in RSTSR. Generally, what RSTSR refers to as row/column-major corresponds to:
- Row-major: NumPy
- Column-major: Julia (with some extensions)
When row-major and column-major differ, even with identical input data and code, different results may be produced!
1. Default Row/Column Major Determined by Cargo Feature
In the `rstsr` or `rstsr-core` crates, the cargo features `row_major` and `col_major` determine the default storage layout for the entire tensor library. Users should also note:
- If neither `row_major` nor `col_major` is specified, RSTSR defaults to row-major;
- If both `row_major` and `col_major` are specified, a compilation error will occur.
Additionally, users should be aware that cargo features are designed to be additive. Therefore:
- If your program doesn't specify `row_major` or `col_major` (intending to use the default row-major), but an upstream/downstream dependency specifies `col_major`, your program will run in column-major mode, which may differ from your intention.
- Conversely, if your program specifies one layout while a dependency specifies the other, a compilation error will occur.
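As a concrete sketch, opting into column-major could look like the following in `Cargo.toml` (the feature name comes from the text above; the version requirement here is a placeholder, not taken from the source):

```toml
[dependencies]
# enable the col_major cargo feature; version requirement is a placeholder
rstsr = { version = "*", features = ["col_major"] }
```

Because cargo features are additive, any one crate in the dependency graph enabling `col_major` flips the default for every user of the library.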
2. Runtime Modification of Tensor Layout via Device Settings
RSTSR also allows changing row/column-major order at runtime through `device.set_default_order`:
```rust
let vec = vec![0., 1., 2., 3., 4., 5.];
let mut device = DeviceFaer::new(4);
device.set_default_order(RowMajor);
let a = rt::asarray((&vec, [2, 3], &device));
println!("{:?}", a);
// output:
// [[ 0.0 1.0 2.0]
//  [ 3.0 4.0 5.0]]
// 2-Dim (dyn), contiguous: Cc

device.set_default_order(ColMajor);
let b = rt::asarray((&vec, [2, 3], &device));
println!("{:?}", b);
// output:
// [[ 0.0 2.0 4.0]
//  [ 1.0 3.0 5.0]]
// 2-Dim (dyn), contiguous: Ff
```
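The stride arithmetic behind these two outputs can be sketched in plain std Rust (this is not the RSTSR API; `matrix_from` is a hypothetical helper for illustration): the same flat buffer is interpreted with different strides.

```rust
// Materialize a 2-D view of a flat buffer, given shape and strides.
fn matrix_from(data: &[f64], shape: [usize; 2], strides: [usize; 2]) -> Vec<Vec<f64>> {
    (0..shape[0])
        .map(|i| (0..shape[1]).map(|j| data[i * strides[0] + j * strides[1]]).collect())
        .collect()
}

fn main() {
    let vec = [0., 1., 2., 3., 4., 5.];
    // row-major strides for shape [2, 3]: the last axis is contiguous
    let row = matrix_from(&vec, [2, 3], [3, 1]);
    // column-major strides for shape [2, 3]: the first axis is contiguous
    let col = matrix_from(&vec, [2, 3], [1, 2]);
    assert_eq!(row, vec![vec![0., 1., 2.], vec![3., 4., 5.]]);
    assert_eq!(col, vec![vec![0., 2., 4.], vec![1., 3., 5.]]);
}
```

Same data, same shape, different strides: exactly the difference between the `Cc` and `Ff` tensors above.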
However, note that operations between row-major and column-major tensors are prohibited. This does not cause a compile-time error, but it triggers a runtime error.
One solution is to leverage device switching (using the `change_device` trait function). For CPU devices, `change_device` consumes variable `b` without explicitly copying data, making this device switch computationally cost-free.
```rust
// change_device will consume variable `b`
let b_row_major = b.change_device(a.device());
let c = &a + &b_row_major;
println!("{:?}", c);
// output:
// [[ 0.0 3.0 6.0]
//  [ 4.0 7.0 10.0]]
// 2-Dim (dyn), contiguous: Cc
```
3. Result Differences Caused by Row/Column Major
3.1 reshape and asarray
The behavior of `reshape` differs completely between row-major and column-major.
The previous section's code already demonstrates how `asarray` produces different results under different layouts.
When the `asarray` function receives parameters containing data and shape (note: not layout):

```rust
rt::asarray((data_vec, shape, &device))
```

it is functionally equivalent to `reshape` (or the similar `into_shape`):

```rust
rt::asarray((data_vec, &device)).into_shape(shape)
```

Thus, `asarray` can encounter the same situations as `reshape`.
While `reshape` produces computationally identical results for c-contiguous and f-contiguous tensors, the underlying processes may differ.
This can be confusing: row/column-major differs from c/f-contiguous. Row/column-major refers to iteration order, while c/f-contiguous refers to storage order.
For two tensors, as long as their iteration orders match, they are computationally equivalent regardless of storage order. `reshape` can be viewed as a computation; given consistent iteration order, any storage order will produce identical computational results.
However, identical results don't imply identical processes. Under row-major, a c-contiguous tensor can be reshaped without data copying, since the data before and after the reshape remains `[0, 1, 2, 3, 4, 5]`. An f-contiguous tensor, by contrast, requires an explicit data copy, since its data before the reshape is `[0, 3, 1, 4, 2, 5]`, which differs from the result.
The following code implements this example:
```rust
let mut device = DeviceFaer::new(4);
device.set_default_order(RowMajor);

let vec_c: Vec<f64> = vec![0., 1., 2., 3., 4., 5.];
let c = rt::asarray((&vec_c, [2, 3].c(), &device));
let vec_f: Vec<f64> = vec![0., 3., 1., 4., 2., 5.];
let f = rt::asarray((&vec_f, [2, 3].f(), &device));
let diff = (&c - &f).abs().sum();
println!("{:?}", diff);
// output: 0.0

let c_flatten = c.reshape(-1);
let f_flatten = f.reshape(-1);
println!("{:}", c_flatten);
// output: [ 0 1 2 3 4 5]
println!("{:}", f_flatten);
// output: [ 0 1 2 3 4 5]

use core::ptr::eq as ptr_eq;
// c_flatten shares same ptr to vec_c? true
println!("{:?}", ptr_eq(vec_c.as_ptr(), c_flatten.raw().as_ptr()));
// f_flatten shares same ptr to vec_f? false
println!("{:?}", ptr_eq(vec_f.as_ptr(), f_flatten.raw().as_ptr()));
```
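Why the f-contiguous tensor must copy can be shown in plain std Rust (again, not the RSTSR API, just a sketch of the stride arithmetic): reading the f-contiguous storage in row-major iteration order does not reproduce the storage itself.

```rust
fn main() {
    // f-contiguous storage of the matrix [[0, 1, 2], [3, 4, 5]]
    let storage_f = [0., 3., 1., 4., 2., 5.];
    let (rows, cols) = (2, 3);
    // row-major iteration over the logical matrix, using col-major strides [1, rows]
    let flattened: Vec<f64> = (0..rows)
        .flat_map(|i| (0..cols).map(move |j| storage_f[i + j * rows]))
        .collect();
    // the flattened result is the logical row-major order [0, 1, 2, 3, 4, 5] ...
    assert_eq!(flattened, vec![0., 1., 2., 3., 4., 5.]);
    // ... which differs from the raw storage, so a copy is unavoidable
    assert_ne!(flattened.as_slice(), storage_f.as_slice());
}
```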
3.2 broadcast
Row-major and column-major have completely opposite broadcast rules.
For row/column-major, the following scenarios produce computationally identical results despite differing c/f-contiguous outputs:
- Elementwise operations where tensors have matching dimensions (`ndim`), or where one operand is 0-dimensional (scalar);
- Matrix multiplications where all operands are matrices or vectors (≤ 2 dimensions).
Other cases may differ significantly.
For programs handling both layouts, it's recommended to align tensor dimensions (`ndim`) before binary operations requiring broadcasting. Basic indexing can achieve this: e.g., expanding a 1-D tensor `a` via `a.i((None, ..))` (row vector) or `a.i((.., None))` (column vector).
Consider these two elementwise problems:
The first problem works in row-major but fails in column-major:
The problem setting:

```rust
let vec_a = vec![1, 2, 3, 4, 5, 6];
let a = rt::asarray((&vec_a, [2, 3].c(), &DeviceFaer::default()));
let vec_b = vec![1, 0, -1];
let b = rt::asarray((&vec_b, [3], &DeviceFaer::default()));
```

In row-major:

```rust
let mut device = DeviceFaer::new(4);
device.set_default_order(RowMajor);
let a_row_major = a.to_device(&device);
let b_row_major = b.to_device(&device);
let c = a_row_major * b_row_major;
println!("{:3}", c);
// [[  1   0  -3]
//  [  4   0  -6]]
```
The second problem works in column-major but fails in row-major:
The problem setting:

```rust
let vec_a = vec![1, 2, 3, 4, 5, 6];
let a = rt::asarray((&vec_a, [2, 3].c(), &DeviceFaer::default()));
let vec_b = vec![1, -1];
let b = rt::asarray((&vec_b, [2], &DeviceFaer::default()));
```

In column-major:

```rust
let mut device = DeviceFaer::new(4);
device.set_default_order(ColMajor);
let a_col_major = a.to_device(&device);
let b_col_major = b.to_device(&device);
let c = a_col_major * b_col_major;
println!("{:3}", c);
// [[  1   2   3]
//  [ -4  -5  -6]]
```
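The two problems can be explained by the opposite alignment rules: row-major aligns shapes from the trailing axis (as in NumPy), while column-major aligns from the leading axis. A std-only sketch (assumption: the column-major rule is the exact mirror image of NumPy's, as the two problems above suggest; `broadcast` is a hypothetical helper, not RSTSR API):

```rust
// Compute the broadcast shape of two shapes, or None if incompatible.
fn broadcast(a: &[usize], b: &[usize], row_major: bool) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for k in 0..n {
        // pick the axis pair: aligned at the end for row-major,
        // aligned at the front for column-major; missing axes count as 1
        let (x, y) = if row_major {
            (a.len().checked_sub(n - k).map(|i| a[i]).unwrap_or(1),
             b.len().checked_sub(n - k).map(|i| b[i]).unwrap_or(1))
        } else {
            (*a.get(k).unwrap_or(&1), *b.get(k).unwrap_or(&1))
        };
        out[k] = match (x, y) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible axis lengths
        };
    }
    Some(out)
}

fn main() {
    // the first problem: [2, 3] * [3]
    assert_eq!(broadcast(&[2, 3], &[3], true), Some(vec![2, 3])); // row-major: ok
    assert_eq!(broadcast(&[2, 3], &[3], false), None);            // col-major: fails
    // the second problem: [2, 3] * [2]
    assert_eq!(broadcast(&[2, 3], &[2], true), None);             // row-major: fails
    assert_eq!(broadcast(&[2, 3], &[2], false), Some(vec![2, 3])); // col-major: ok
}
```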
Matrix multiplication behaves similarly. For example:
- Multiplying `(2, 3, 4)` with `(4, 5)` tensors yields `(2, 3, 5)` in row-major but fails in column-major;
- Multiplying `(2, 3, 4)` with `(5, 2)` tensors fails in row-major but yields `(5, 3, 4)` in column-major.
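One way to model these shape rules (an assumption consistent with the two examples above, not the RSTSR implementation): in row-major, the last two axes of the batched operand form the matrix and the leading axes are batch; column-major is the mirror image, obtained by reversing the shapes, applying the row-major rule, and reversing the result. `matmul_shape` below is a hypothetical helper covering a batched operand times a 2-D operand:

```rust
// Row-major rule for `a` (ndim >= 2, batched) times 2-D `b`:
// contract a's last axis with b's first axis.
fn matmul_shape_rm(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    if a.len() < 2 || b.len() != 2 || a[a.len() - 1] != b[0] {
        return None; // contraction axes must match
    }
    let mut out = a[..a.len() - 2].to_vec();
    out.push(a[a.len() - 2]);
    out.push(b[1]);
    Some(out)
}

// Column-major modeled as the mirror image of the row-major rule.
fn matmul_shape(a: &[usize], b: &[usize], row_major: bool) -> Option<Vec<usize>> {
    if row_major {
        matmul_shape_rm(a, b)
    } else {
        let ar: Vec<usize> = a.iter().rev().copied().collect();
        let br: Vec<usize> = b.iter().rev().copied().collect();
        matmul_shape_rm(&ar, &br).map(|mut s| { s.reverse(); s })
    }
}

fn main() {
    assert_eq!(matmul_shape(&[2, 3, 4], &[4, 5], true), Some(vec![2, 3, 5]));
    assert_eq!(matmul_shape(&[2, 3, 4], &[4, 5], false), None);
    assert_eq!(matmul_shape(&[2, 3, 4], &[5, 2], true), None);
    assert_eq!(matmul_shape(&[2, 3, 4], &[5, 2], false), Some(vec![5, 3, 4]));
}
```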
4. Performance Differences Between Row/Column Major?
In short: generally no.
This requires clarification.
First, efficiency depends on iteration order, which is a c/f-contiguous issue rather than a row/column-major one. Performance varies based on the algorithms and how they align with storage order.
Second, RSTSR internally uses column-major iteration for elementwise operations, transposing layouts as needed. This process is layout-agnostic, though c/f-contiguous tensors may undergo different transpositions. Generally, operations with matching contiguity (either c or f) achieve similar efficiency through internal column-major iteration.
Third, for BLAS-compatible (`f32`/`f64` or complex) matrix multiplications, efficiency remains consistent when all matrices are either c-prefer or f-prefer (where one stride is 1 and the other a positive integer). BLAS devices achieve this through transpose parameters and multiplication order.
Fourth, for other linear algebra operations, RSTSR currently follows LAPACKE - requiring row-major matrices to be copied and transposed, incurring acceptable overhead since LAPACK operations are typically O(N³) with computation time dwarfing memory operations. Future versions may improve this.
Fifth, for very small tensors where layout handling becomes significant, column-major may incur overhead as RSTSR internally transposes to row-major for broadcasting before transposing back. However, this overhead is negligible for typical scientific computing.
RSTSR developers maintain that with proper usage, row/column-major differences primarily reflect conventions rather than performance gaps.
5. Motivation: Why Support Both Layouts?
- Quantum chemistry programs like PySCF/Psi4 (using NumPy), tensor libraries like TiledArray, and ML frameworks like PyTorch/JAX typically use row-major.
- Fortran-based programs (GAMESS, Gaussian, FHI-Aims) and languages like Matlab/Julia use column-major.
- Libraries like Eigen support both layouts through generics, defaulting to column-major.
- Pure tensor contraction libraries like TBLIS only read layouts without specializing for either.
It's unclear which layout holds the majority.
The REST quantum chemistry program uses column-major, motivating RSTSR's column-major support. However, RSTSR also aims for NumPy-like Rust programming, requiring row-major support for compatibility. This dual need drove RSTSR's bidirectional layout implementation.