Sorting NumPy Arrays: A Comprehensive Guide

In this tutorial, we will discuss how to sort a NumPy array using different techniques.

Moving forward, we’ll look at how to sort a NumPy array in both ascending and descending orders, and how to handle multidimensional arrays, in-place sorting, indirect sorts, and common problems encountered when sorting.

 

 

Sort NumPy array with np.sort()

We can sort the given array by using the np.sort() function. By default, this function sorts the array in ascending order.

import numpy as np
arr = np.array([6, 4, 8, 1, 3])
sorted_arr = np.sort(arr)
print("Sorted Array:", sorted_arr)

Output:

Sorted Array: [1 3 4 6 8]

Here we’ve used np.sort(), which sorts the NumPy array in ascending order. We passed the original array as the argument, and it returns a sorted copy of the array, without modifying the original array.

 

Sort NumPy in Descending order

To sort a NumPy array in descending order (from largest to smallest), we will still use np.sort(), but we will reverse the order of the sorted array.

descending_arr = np.sort(arr)[::-1]
print("Array in Descending Order:", descending_arr)

Output:

Array in Descending Order: [8 6 4 3 1]

Here, we used a slicing technique [::-1] to reverse the sorted array. This returns the array in descending order.

 

Sort by Multiple Columns (Structured Array)

To sort a NumPy structured array by multiple columns, you can pass the column names you want to sort to the order parameter of the numpy.sort() function:

import numpy as np
data_type = np.dtype([('name', 'S15'), ('age', 'i4'), ('height', 'f4')])
data = np.array([('John', 20, 170.1),
                 ('Doe', 20, 180.8),
                 ('Alice', 22, 155.3),
                 ('Bob', 22, 175.2),
                 ('Charlie', 23, 165.7)],
                dtype=data_type)
data_sorted = np.sort(data, order=['age', 'height'])
print(data_sorted)

Output:

[(b'John', 20, 170.1)
(b'Doe', 20, 180.8)
(b'Alice', 22, 155.3)
(b'Bob', 22, 175.2)
(b'Charlie', 23, 165.7)]

In this example, we have a NumPy structured array with three fields – ‘name’, ‘age’, and ‘height’. We sort it by ‘age’ first and then ‘height’.

This means that if there are two people with the same age, the shorter one will come first.

To sort a 2D (or higher) array by multiple columns, you can use lexsort() as we’ll see later in this tutorial.

 

Sorting along an Axis (Multidimensional Array)

For a multidimensional nonstructured array, we can specify the axis along which to sort the elements.

Sort 2D NumPy array

# Create a 2-dimensional array
arr_2d = np.array([[2, 3, 1], [5, 0, 4]])
print("Original 2D Array:n", arr_2d)

# Sort along the first axis
sorted_arr_2d_axis0 = np.sort(arr_2d, axis=0)
print("Array sorted along the first axis:n", sorted_arr_2d_axis0)

Output:

Original 2D Array:
 [[2 3 1]
 [5 0 4]]
Array sorted along the first axis:
 [[2 0 1]
 [5 3 4]]

In this code, we first created a 2-dimensional array. To sort this array along the first axis (i.e., along the columns), we used the np.sort() function with axis=0.

This rearranges the elements in each column of the array separately, sorting from smallest to largest.

Sort 3D NumPy Array

Sorting a 3-dimensional (or n-dimensional) array involves the same principles as sorting a 2D array. You just need to specify the axis along which you want to sort.

# Create a 3-dimensional array
arr_3d = np.array([[[3, 2, 1], [6, 5, 4]], [[9, 8, 7], [12, 11, 10]]])
print("Original 3D Array:n", arr_3d)

# Sort along the last axis
sorted_arr_3d = np.sort(arr_3d, axis=-1)
print("3D Array sorted along the last axis:n", sorted_arr_3d)

Output:

Original 3D Array:
 [[[ 3  2  1]
  [ 6  5  4]]
 [[ 9  8  7]
  [12 11 10]]]
3D Array sorted along the last axis:
 [[[ 1  2  3]
  [ 4  5  6]]
 [[ 7  8  9]
  [10 11 12]]]

In this code, we first created a 3D array. To sort this array along the last axis (i.e., axis=-1), we used the np.sort() function with axis=-1. As a result, each 1D sub-array is sorted independently.

 

Sorting Algorithms

np.sort() uses a quicksort algorithm by default, but you can choose a different sorting algorithm with the kind parameter. The available algorithms include ‘quicksort’, ‘mergesort’, ‘heapsort’, and ‘stable’.

 

Quicksort (The fastest)

You can use the kind parameter to specify the quicksort algorithm like this:

sorted_arr = np.sort(arr, kind='quicksort')
print("Array sorted using quicksort:", sorted_arr)

Output:

Array sorted using quicksort: [1 3 4 6 8]

Mergesort

Mergesort is another sorting algorithm available in NumPy, known for its predictable O(n log n) time complexity.

sorted_arr = np.sort(arr, kind='mergesort')
print("Array sorted using mergesort:", sorted_arr)

Output:

Array sorted using mergesort: [1 3 4 6 8]

Heapsort (The slowest)

Heapsort is a comparison-based sorting algorithm with O(n log n) time complexity.

sorted_arr = np.sort(arr, kind='heapsort')
print("Array sorted using heapsort:", sorted_arr)

Output:

Array sorted using heapsort: [1 3 4 6 8]

Stable Sort

The ‘stable’ algorithm keeps equal elements in the same order as they were in the original array. This algorithm is based on radix sort or mergesort, depending on the data type.

sorted_arr = np.sort(arr, kind='stable')
print("Array sorted using stable sort:", sorted_arr)

Output:

Array sorted using stable sort: [1 3 4 6 8]

 

When to Use Each Algorithm?

Choosing the right sorting algorithm for your NumPy array can greatly impact the efficiency of your code. Here’s a general guide to choosing the right algorithm:

  • Quicksort: This is often the fastest algorithm for numerical data. It’s a good choice if you need to sort large arrays, and don’t need to worry about worst-case scenarios or stability. However, remember that the worst-case time complexity of quicksort is O(n^2).
  • Mergesort: If you need a stable sort (i.e., the order of equal elements is preserved), mergesort is a good choice. It always has a time complexity of O(n log n), but requires additional memory, which can be an issue for large arrays.
  • Heapsort: If memory is a concern, heapsort is a good option, and has a time complexity of O(n log n). However, it’s not stable.
  • Stable sort: If you need a stable sort and memory isn’t a concern, or you are working with non-numerical data, the stable sort algorithm is a good choice.

 

In-Place Sorting with np.ndarray.sort()

If you want to sort the original array without creating a copy, you can use the sort() method of the ndarray object. This performs an in-place sort.

# Original array
arr = np.array([4, 3, 1, 6, 8])
print("Original Array:", arr)

# In-place sort
arr.sort()
print("Sorted Array:", arr)

Output:

Original Array: [4 3 1 6 8]
Sorted Array: [1 3 4 6 8]

In this code, we called the sort() method on the NumPy array object, which sorts the array in-place. Unlike np.sort() which modifies the original array.

 

Indirect Sorts: argsort and lexsort

Indirect sorts, such as argsort() and lexsort(), do not sort the values of the array, but return an array of indices which, if applied to the array, would result in a sorted array.

argsort

# Using argsort
sorted_indices = np.argsort(arr)
print("Sorted Indices:", sorted_indices)

# Apply the sorted indices to arr
sorted_arr = arr[sorted_indices]
print("Sorted Array:", sorted_arr)

Output:

Sorted Indices: [2 1 0 3 4]
Sorted Array: [1 3 4 6 8]

Here, np.argsort() returns an array of indices that would sort the array. We then applied these indices to the array to obtain the sorted array.

lexsort (Sort non-structured array by Multiple Columns)

Let’s say we want to sort a 2D array based on the values in the first column, then by the values in the second column.

We can use the np.lexsort() function, which performs an indirect sort on multiple keys.

# Create a 2-dimensional array
arr_2d = np.array([[2, 3], [1, 5], [1, 4], [2, 2]])
print("Original 2D Array:n", arr_2d)

# Sequence of keys
keys = (arr_2d[:, 1], arr_2d[:, 0]) 

# Indices for sorted array
sorted_indices = np.lexsort(keys)

# Apply array indexing to obtain sorted array
sorted_arr_2d = arr_2d[sorted_indices]
print("Array sorted by multiple columns:n", sorted_arr_2d)

Output:

Original 2D Array:
 [[2 3]
 [1 5]
 [1 4]
 [2 2]]
Array sorted by multiple columns:
 [[1 4]
 [1 5]
 [2 2]
 [2 3]]

In this code, we created a 2D array and then a sequence of keys based on which we want to sort the array. We used np.lexsort() to get an array of indices that can be used to produce the sorted array.

Finally, we applied array indexing to obtain the sorted array.

 

Measuring Performance

It’s important to empirically measure the performance of different sorting algorithms, especially when dealing with larger datasets.

Here’s a small script I wrote to measure the execution time of each sorting algorithm on a large random array of size 100 million:

import numpy as np
import time

# Create a large numpy array
large_arr = np.random.randint(0, high=1000000, size=100000000)

algorithms = ['quicksort', 'mergesort', 'heapsort', 'stable']
for algorithm in algorithms:
    start = time.time()
    np.sort(large_arr, kind=algorithm)
    end = time.time()
    print(f"{algorithm.capitalize()} Time: {end - start} seconds")

# Test the in-place sort
start = time.time()
arr_copy = np.copy(large_arr)
arr_copy.sort()  # Uses quicksort by default
end = time.time()
print(f"In-place sort Time: {end - start} seconds")

Output:

Quicksort Time: 14.426326036453247 seconds
Mergesort Time: 18.75661039352417 seconds
Heapsort Time: 47.7882182598114 seconds
Stable Time: 18.685729265213013 seconds
In-place sort Time: 14.05978536605835 seconds

These results will vary based on the specific hardware and software configuration of your system. But generally, quicksort tends to be the fastest for numerical data, as expected.

 

Common Problems Encountered & Solutions

There can be several issues and considerations when sorting NumPy arrays:

Sorting NumPy array of different data types: NumPy arrays can contain different data types, which may cause problems when sorting.

The solution is to ensure that the array contains only one data type, or convert it to a type that can be sorted.

For example, if you have an array of strings and integers, you can convert the integers to strings before sorting.

# An array with different data types
arr = np.array(['banana', 2, 'apple'])
print("Original array:", arr)

# Convert all elements to strings
arr = arr.astype(str)
print("Converted array:", arr)

# Now we can sort the array
print("Sorted array:", np.sort(arr))

Output:

Original array: ['banana' '2' 'apple']
Converted array: ['banana' '2' 'apple']
Sorted array: ['2' 'apple' 'banana']

Sorting an array with missing values (NaN): NumPy considers NaN (Not a Number) to be greater than any other value, which can lead to unexpected sorting results.

You may want to handle NaN values (for example, by replacing or removing them) before sorting.

Memory usage with large arrays: Large arrays can consume a lot of memory. In these cases, you might want to use in-place sorting.

 

Further Readings

https://numpy.org/doc/stable/reference/generated/numpy.sort.html

https://numpy.org/doc/stable/reference/generated/numpy.ndarray.sort.html

Leave a Reply

Your email address will not be published. Required fields are marked *