This is a very short introduction to numpy, focused on its basic data structure, ndarray. Numpy is the most important scientific package in the Python ecosystem because it provides a common data structure on which many other packages build.
To make this tutorial work on both Python 2 and Python 3, let's import some future features into Python 2:
from __future__ import print_function, division
# np is the standard abbreviation for numpy in the code
# Even the numpy docs use it
import numpy as np
The ndarray is the biggest contribution of numpy. An ndarray is a multidimensional, homogeneous array: all of its elements have the same type.
We can build an array from Python lists:
arr = np.array([
[1.2, 2.3, 4.0],
[1.2, 3.4, 5.2],
[0.0, 1.0, 1.3],
[0.0, 1.0, 2e-1]])
print(arr)
print(arr.dtype)
print(arr.ndim)
print(arr.shape)
This array is of type float64 (at least on my computer, probably on yours too). It has 2 dimensions and its shape is (4, 3): 4 rows and 3 columns.
When constructing an array, we can explicitly specify the type:
iarr = np.array([1,2,3], np.uint8)
Arithmetic operations on the array respect the type and can include rounding and overflow!
arr *= 2.5
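# Recent numpy versions refuse `iarr *= 2.5` (a float result cannot be
# stored in a uint8 array implicitly), so we ask for the cast explicitly:
np.multiply(iarr, 2.5, out=iarr, casting='unsafe')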
print(arr)
print(iarr)
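The rounding and overflow mentioned above can be seen in isolation with a small sketch. The array names used here (`small`, `scaled`, `big`, `wrapped`) are illustrative only and are not part of the running example.

```python
import numpy as np

small = np.array([1, 2, 3], np.uint8)

# Multiplying by a float gives a floating-point result; converting it
# back to uint8 drops the fractional part.
scaled = (small * 2.5).astype(np.uint8)
print(scaled)  # [2 5 7]

# uint8 holds values from 0 to 255, so larger sums wrap around modulo 256.
big = np.array([200, 250], np.uint8)
wrapped = big + np.array([100, 100], np.uint8)
print(wrapped)  # [44 94]
```

Neither operation raises an error, which is why it pays to check the dtype before doing integer arithmetic.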
An important subset of operations with numpy arrays concerns using logical operators to build boolean arrays. For example:
is_greater_one = (arr >= 1.)
print(is_greater_one)
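As a further sketch, comparisons can be combined with the element-wise operators `&` (and), `|` (or) and `~` (not), and the resulting boolean array can be used to select elements. The array `values` below is made up for illustration.

```python
import numpy as np

values = np.array([0.5, 1.0, 2.5, 4.0])

# Comparisons work element by element and return boolean arrays.
in_range = (values >= 1.0) & (values < 4.0)  # element-wise "and"
outside = ~in_range                          # element-wise "not"
print(in_range)  # [False  True  True False]
print(outside)   # [ True False False  True]

# A boolean array can be used as an index to keep only matching elements.
print(values[in_range])  # [1.  2.5]
```

The parentheses around each comparison are needed because `&` binds more tightly than `>=` and `<`.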
We can use Python's [] operator to slice and dice the array:
print(arr[0,0]) # First row, first column
print(arr[1]) # The whole second row
print(arr[:,2]) # The third column
Slices share memory with the original array!
print("Before: {}".format(arr[1,0]))
view = arr[1]
view[0] += 100
print("After: {}".format(arr[1,0]))
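When shared memory is not what you want, an explicit copy avoids it. The following is a minimal sketch using a made-up array `grid`; `np.shares_memory` is used only to check which arrays overlap.

```python
import numpy as np

grid = np.zeros((2, 3))

row_view = grid[0]         # a view: shares memory with grid
row_copy = grid[0].copy()  # an independent copy

row_view[0] = 1.0  # also changes grid[0, 0]
row_copy[1] = 9.0  # leaves grid untouched

print(grid[0])                           # [1. 0. 0.]
print(np.shares_memory(grid, row_view))  # True
print(np.shares_memory(grid, row_copy))  # False
```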
a = np.array([
[ 0, 1, 2, 3, 4, 5],
[10, 11, 12, 13, 14, 15],
[20, 21, 22, 23, 24, 25],
[30, 31, 32, 33, 34, 35],
[40, 41, 42, 43, 44, 45],
[50, 51, 52, 53, 54, 55]])
This image is taken from scipy-lectures, a more complete tutorial on numpy than what we have here.
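The figure is not reproduced here. As a rough sketch of the kind of slicing it illustrates, the block below rebuilds the same 6x6 array and takes a few slices; the particular slices are chosen for illustration and may not match the figure exactly.

```python
import numpy as np

# Same values as the array `a` above: element (row, col) is 10*row + col.
a = np.array([[10 * row + col for col in range(6)] for row in range(6)])

print(a[0, 3:5])     # [3 4]
print(a[4:, 4:])     # [[44 45]
                     #  [54 55]]
print(a[:, 2])       # [ 2 12 22 32 42 52]
print(a[2::2, ::2])  # [[20 22 24]
                     #  [40 42 44]]
```

A slice is written `start:stop:step`, with any of the three parts optional, and one slice is given per dimension.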
arr.mean()
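Also available: max, min, sum, ptp (point-to-point, i.e., the difference between the maximum and minimum values). Note that in NumPy 2.0 and later, ptp is only available as the function np.ptp(arr), not as a method of the array.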
These functions can also work axis-wise:
arr.mean(axis=0)
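To make the meaning of the axis argument concrete, here is a small sketch on a made-up 2x3 array called `table`: `axis=0` collapses the rows (one result per column), while `axis=1` collapses the columns (one result per row).

```python
import numpy as np

table = np.array([[1.0, 2.0, 3.0],
                  [3.0, 4.0, 5.0]])

print(table.mean(axis=0))     # [2. 3. 4.]  one value per column
print(table.mean(axis=1))     # [2. 4.]     one value per row
print(table.max(axis=0))      # [3. 4. 5.]
print(np.ptp(table, axis=1))  # [2. 2.]     max - min within each row
```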
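An important trick is to combine logical operations with aggregate functions. For example, the mean of a boolean array is the fraction of its elements that are true: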
is_greater_one = (arr > 1)
print(is_greater_one.mean())
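The same idea extends to other aggregates. In the sketch below (with made-up data in `readings`), True counts as 1 and False as 0, so `sum` counts the matching elements and `mean` gives their fraction.

```python
import numpy as np

readings = np.array([0.2, 1.5, 3.0, 0.9, 2.1])
above = readings > 1

print(above.sum())               # 3    (how many are greater than 1)
print(above.mean())              # 0.6  (what fraction is greater than 1)
print(above.any(), above.all())  # True False
```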
You can often perform operations between arrays of different shapes; numpy then broadcasts the smaller array across the larger one:
print(arr)
print("Now adding [1,1,0] to *every row*")
print()
arr += np.array([1,1,0])
print(arr)
The exact rules of how broadcasting works are a bit complex to explain, but it generally works as expected. For example, if your rows are the samples and your columns are the different types of measurements, you can easily remove the mean of each measurement like this:
print(arr.mean(0))
arr -= arr.mean(0)
print(arr.mean(0))
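As a rough guide to the rules, shapes are compared starting from the last dimension, and two dimensions are compatible when they are equal or one of them is 1. The sketch below, using a made-up array `data`, removes column means and then row means; the `keepdims=True` argument keeps the row means as a column of shape (3, 1) so that they line up with the rows.

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])  # shape (3, 2)

# Column means have shape (2,), which matches the last dimension of data.
col_means = data.mean(axis=0)
centered_cols = data - col_means  # (3, 2) - (2,) -> (3, 2)
print(centered_cols.mean(axis=0))  # [0. 0.]

# Row means need shape (3, 1) to be broadcast across the columns.
row_means = data.mean(axis=1, keepdims=True)
centered_rows = data - row_means  # (3, 2) - (3, 1) -> (3, 2)
print(centered_rows.mean(axis=1))  # [0. 0. 0.]
```

Without `keepdims=True` the row means would have shape (3,), which does not match the last dimension of `data` (2), and the subtraction would fail.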
[homogeneous]: There is a loophole to get heterogeneous arrays, namely an array of type object, in which you can store any Python object. This comes at the cost of decreased computational efficiency (both in terms of processing time and memory usage).
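A minimal sketch of such an object array follows; the name `mixed` and its contents are made up for illustration.

```python
import numpy as np

# With dtype=object, each element is an ordinary Python object.
mixed = np.array([1, "two", 3.0, None], dtype=object)

print(mixed.dtype)                        # object
print([type(x).__name__ for x in mixed])  # ['int', 'str', 'float', 'NoneType']
```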