Module 1: NumPy
Python is attractive because of its intuitive syntax. But until recently, Python as an interpreted language was marred by poor performance. Creating and importing modules in Python is very simple, even for those written in C or Java. NumPy is a module implemented in high-performance languages for numerical applications. Initially, the implementers were mainly graduate students who often spent time creating NumPy against the advice of their advisors. Now it is relied on by large communities, including many data scientists. Its twin SciPy adds many mathematically important packages, such as Fourier transforms and more statistics.
If you want to make good use of the many modules that build on NumPy, you need to know the basics of NumPy itself. Unfortunately, the richness of NumPy does not come with the same elegance as "Vanilla Python", mainly because NumPy was written by Python enthusiasts (apparently all volunteers) who did not follow the sometimes excruciatingly slow process of deep discussion that governs changes to the language. This means that the learning curve for NumPy is steeper and longer. In my experience, getting used to NumPy involves many web searches and many readings of the NumPy manual. It is easy to feel overwhelmed, and it takes time for NumPy to settle in. The Python maxim that if you need it, there is a simple way, is still mostly valid. I strongly recommend following all of the examples in this module on your own computer, experimenting with the code by changing it slightly, and, most importantly, not despairing because there is so much material and so many intricacies. If you want to dive deeper, you can consult the NumPy user guide, currently at https://numpy.org/doc/1.20/numpy-user.pdf.
Installing NumPy
NumPy needs to be installed in your system. These days, we use pip3 from a console / terminal window by saying
pip3 install numpy
If there are different Python versions, you can give the complete version number. At the time of this writing on this machine, the Python version is 3.12. Therefore,
pip3.12 install numpy
will install NumPy for this version of Python. (Advanced users might use Python environments to deal with installing sets of modules that might have conflicts. By the time you need this advice, you will be capable of relying on the documentation.) We then import NumPy, traditionally with the abbreviation np.
import numpy as np
Why NumPy is Faster
Originally, Python lacked an array data structure and relied instead on lists. A Python list can be heterogeneous. Each element of a Python list is an independent object and hence carries information about its class. Thus, if we want to add a constant to each element of a list, then for each element the Python interpreter will look up the element's class, determine what method implements the addition, and then execute this method. The time for these look-ups adds up if the list is long. In addition, the per-element bookkeeping information adds to the storage needs of a list. Now, bringing data from memory to the CPU has become proportionally much slower than computation over time, so that some computer scientists speak of the Memory Wall. Thus, not only does a list cost us look-up times, it also costs time to bring it into the CPU. In contrast, an array is homogeneous; it consists of elements of the same type, such as 32-bit integers or 64-bit floating-point numbers. Because all elements have the same type, the Python interpreter needs to look up the addition operation only once. In addition, the type information is stored once for the complete array. Thus, arrays are smaller and faster to deal with.
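To put concrete numbers on the storage argument, we can compare the memory footprint of a million integers stored as a list and as a NumPy array. This is a sketch; the exact byte counts depend on the platform and Python version.

```python
import sys
import numpy as np

# A list stores a pointer table plus one full Python object per element.
nums = list(range(10**6))
list_bytes = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)

# The array stores the raw integers contiguously, plus one small header.
arr = np.arange(10**6)
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```

On a typical 64-bit system, the list needs several times the memory of the array.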
A Comparison of Methods
As a test, let's calculate the sum of squares
$$\sum_{i = 1}^{10^{6} - 1}i^{2}$$
in four different ways: using a while-loop, using a for-loop, using the built-in function sum, and using NumPy. Here is the code:
import numpy as np

def using_while(number):
    result = 0
    i = 0
    while i < number:
        result += i**2
        i += 1
    return result

def using_for(number):
    result = 0
    for i in range(number):
        result += i**2
    return result

def using_sum(number):
    return sum([i**2 for i in range(1, number)])

def using_np(number):
    np_array = np.arange(1, number)
    return np.sum(np_array**2)
The third function uses the built-in sum function applied to a list, which we generate using a list comprehension. The fourth function needs to import NumPy. The NumPy arange method (notice the spelling, which is a pun on a-range) creates a NumPy array of integers starting with 1 and ending with number-1. The return line first squares all of the elements of the array (one of the nice features of NumPy) and then applies NumPy's sum method. We also need to measure the time. We use the time module and, within it, the time method, which returns the wall-clock time in seconds since January 1, 1970, midnight UTC. It has higher precision than the Unix time function and is perfectly adequate for measuring execution times.
import time

def timeit(fun, args, trials):
    total_time = 0
    for _ in range(trials):
        start = time.time()
        fun(*args)
        total_time += time.time() - start
    return total_time / trials
We give our timeit function the name of the function to execute, the arguments in the form of a list, and the number of repetitions. Averaging over repetitions is necessary because execution times depend on many different factors, especially the load on the system; reporting the mean execution time gives a better basis for comparing implementations. For each execution, we call the function with the list of arguments as fun(*args). Before the execution, we load the clock value into the variable start, and after execution, we calculate the execution time as the difference between the current time and the start time. We add this execution time to the total time. We then report the average execution time as the total time divided by the number of function calls. Of course, the answer is not entirely exact, as getting the time itself takes time, but this is not a major component if the execution times are large. Finally, we report the execution times of the four different functions.
print('using_while', f'{timeit(using_while, [10**6], 30):6.3}')
print('using_for', f'{timeit(using_for, [10**6], 30):6.3}')
print('using_sum', f'{timeit(using_sum, [10**6], 30):6.3}')
print('using_np', f'{timeit(using_np, [10**6], 30):6.3}')
We use f-strings with formatting instructions to report the average execution times over 30 runs. The result is machine dependent. For example, one possible result is:
using_while 0.0516
using_for 0.0431
using_sum 0.0397
using_np 0.00108
This experiment is quite convincing. The more sophisticated versions in "Vanilla Python" shave off around 20% of the execution time of the function implemented with a while-loop, but the NumPy version is about 35 times faster than even the best of them. No wonder that NumPy is the go-to module when dealing with large numeric calculations. (Notice that the single argument needs to be given inside a list, because our timeit function expects a list of arguments.)
Figure 1: A NumPy array
1.2 NumPy arrays
A NumPy array consists of a header and a potentially large number of elements, see Fig. 1. The header contains all of the information that a typical Python object has. Because all elements in a NumPy array have the same data type, this information is stored centrally. When an element is extracted, it receives the header of a Python object. NumPy arrays have a dimension. Vectors are one-dimensional, matrices are two-dimensional, and higher-dimensional NumPy arrays are known as tensors. There are NumPy matrices with only one row and one column. Such a matrix is still different from a one-dimensional NumPy array holding the same, single element. You have access to information about the internal structure of a NumPy array through a number of properties:
ndarray.flags: A collection of information about the memory layout
ndarray.shape: A tuple consisting of the dimensions
ndarray.ndim: The number of dimensions, i.e. 1 for a vector and 2 for a matrix
ndarray.size: The number of elements
ndarray.itemsize: The length of a single element in bytes
ndarray.nbytes: The total number of bytes of all the elements in the array
ndarray.dtype: Describes the data type of the objects in the array.
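A quick illustration of these properties on a small array; the values shown assume the default 64-bit floats.

```python
import numpy as np

m = np.zeros((3, 4))   # default dtype is float64

print(m.shape)     # (3, 4)
print(m.ndim)      # 2
print(m.size)      # 12
print(m.itemsize)  # 8 bytes per 64-bit float
print(m.nbytes)    # 96 = 12 * 8
print(m.dtype)     # float64
```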
Unlike "Vanilla Python's" list construct, appending to a NumPy array will create a new, expanded array. NumPy achieves its better performance by vectorizing operations on NumPy arrays, which is possible because all elements are of the same data type. There is a small loophole in this requirement, since one can create arrays of Python objects, and everything in Python is an object.
An ever-growing list of Python modules use NumPy arrays. They often allow the programmer to pass Python sequences such as lists as input. As a general rule, these modules will transform these input sequences into NumPy arrays and often return NumPy arrays. To avoid costly copying, optimal use of these modules might require the programmers themselves to handle NumPy arrays. This is the reason that even Python programmers not interested in learning how to do numerical calculations efficiently should still learn something about NumPy.
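One way to see the cost of conversion is np.asarray, which is a common conversion route in such modules: an existing array of a matching type is passed through without copying, while a list is always converted. A small sketch:

```python
import numpy as np

a_list = [1.0, 2.0, 3.0]
arr = np.asarray(a_list)   # a list must be converted: a new array is created
same = np.asarray(arr)     # an array of matching dtype is passed through as-is

print(same is arr)         # True: no copy was made
```

This is why passing NumPy arrays directly, rather than lists, avoids repeated conversions when calling such modules in a loop.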
1.2.1 NumPy Array Creation
There are a number of ways to create arrays. One possibility is to use Python lists (possibly nested for more than one dimension). The types of the elements need to be the same or convertible into a common one. Thus
import numpy as np
my_list = [1,5,4.0,2]
my_vec = np.array(my_list)
my_list = [[1,2],[4,3.0]]
my_mat = np.array(my_list)
my_mat2 = np.array(my_list, dtype = np.int32)
will create a NumPy vector and two matrices. You can change the type of the elements by setting the dtype parameter. Most of the types you will use start with int (for integer), uint (for unsigned integer), or float (for floating-point), followed by the length in bits, such as 8, 16, 32, 64, or 128. Because the two lists contain a floating-point number, the NumPy vector and the first matrix will contain floating-point numbers. For the second matrix, we have specified that the data type is a 32-bit integer. If you are using small data types, truncation errors can occur. NumPy also has a number of ways to generate arrays from values. Often, you have to specify the shape (the sizes of the dimensions) as a tuple. You really should try this out, since it will take some time to get used to. We have zeros (spelled like this) and ones for a NumPy array where all elements are either zero or one, and full for an array where all elements take a given value:
print(np.zeros((3,5)))
print(np.zeros((3,5), dtype = np.int8))
print(np.ones((2,5)))
print(np.full((2,2), 3.141))
In the second example, we specify a different data type than the default, namely an 8-bit integer. This gives us
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[[0 0 0 0 0]
[0 0 0 0 0]
[0 0 0 0 0]]
[[1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1.]]
[[3.141 3.141]
[3.141 3.141]]
We also have methods for matrices: eye (pronounced 'I') for the identity matrix, and diag to generate diagonal matrices (or to return the diagonal elements if the argument is a matrix). For example
print(np.diag([1,2,3,4,5]))
print(np.eye(4))
gives
[[1 0 0 0 0]
[0 2 0 0 0]
[0 0 3 0 0]
[0 0 0 4 0]
[0 0 0 0 5]]
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
NumPy's arange method (sic) is a generalization of Vanilla Python's range function. It has a start, a stop, and a step value. It is best practice to give all three parameters. NumPy will use integer as the default data type, unless you give it a floating-point number as one or more of the arguments or set the data type with dtype. The normal Python data types int, float, and complex can be used, as can special NumPy data types such as np.float32, np.int16, or np.uint8.
print(np.arange(1,10, 1))
print(np.arange(1,10, dtype=np.float32))
print(np.arange(1,5,0.3))
results in
[1 2 3 4 5 6 7 8 9]
[1. 2. 3. 4. 5. 6. 7. 8. 9.]
[1. 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7 4. 4.3 4.6 4.9]
A similar method is linspace, which takes a beginning, an end, and the total number of points evenly distributed over the interval. linspace will include both the beginning and the end value. (logspace is used to generate points distributed logarithmically.)
print(np.linspace(5,7.5, 6))
print(np.logspace(-10, -3, 8))
gives
[5. 5.5 6. 6.5 7. 7.5]
[1.e-10 1.e-09 1.e-08 1.e-07 1.e-06 1.e-05 1.e-04 1.e-03]
Finally, we can use the random submodule of numpy to generate arrays with random numbers. For the normal (Gaussian) distribution, loc gives the mean and scale the standard deviation.
print( np.random.random((3,2)) )
print( np.random.normal(loc=2, scale=0.5, size=(2,3)) )
gives for me, but your experience will differ,
[[0.75776749 0.21084972]
[0.99431444 0.11249181]
[0.09565248 0.678935 ]]
[[1.77588646 1.35296546 1.83915351]
[2.01124033 1.82448666 1.89579561]]
We can create arrays using functions as in the following example
x = np.fromfunction(lambda i,j:(i**2+j**2)//2, (4,5), dtype=int)
which yields
array([[ 0,  0,  2,  4,  8],
       [ 0,  1,  2,  5,  8],
       [ 2,  2,  4,  6, 10],
       [ 4,  5,  6,  9, 12]])
1.2.2 Importing data sets
Data science courses traditionally used the Iris data set, used by the British statistician Ronald Fisher in a 1936 paper as an application of linear discriminant analysis. Besides making outstanding contributions to Statistics and Biology, Fisher was beyond doubt a racist and a eugenics enthusiast who wanted to improve the human race through selective breeding. These attitudes are not just coincidental to his work but provide a framework for it. We choose to still use the data set because it is not really Fisher's, but Edgar Anderson's, who published it in a 1935 paper, "The irises of the Gaspé Peninsula", Bulletin of the American Iris Society 59: 2--5. You can find the data set in many versions on the internet, such as on Kaggle. We will also use an alternative data set, developed precisely as an alternative to the Iris data set. The data was collected by Gorman in 2007--2009 by measuring adult penguins in the Palmer archipelago in Antarctica. It was then preprocessed by Allison Horst into a drop-in replacement for the Iris data set. You can find both the csv and the original data on Kaggle.
We input from a csv-file using np.loadtxt. This is a function with many different named arguments. More advanced modules like Pandas make this easier, so we include examples here mainly for completeness.
We first take a look at the file "Iris.csv", which we assume is in the same directory as our program. The first four lines look like this:
Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
The first line is a header of column titles. For example, we notice that the dimensions are measured in centimeters. The last column contains the name of the species as a string. This immediately raises the problem of encodings. We can either use binary strings or we can use the "utf-8" or "latin-1" encoding. We will also need to define a converter so that our resulting NumPy array consists only of floating-point numbers. Furthermore, the first column is an ID, but we can already use the row indices for that purpose, so we need to suppress this column. We also need to specify the delimiter, because this is literally a comma-separated file, with commas between the column values in a row. After perusing the manual entry for np.loadtxt, we come up with the following solution:
def toint(astring):
    if 'setosa' in astring:
        return 0
    elif 'versicolor' in astring:
        return 1
    else:
        return 2

def get_iris():
    return np.loadtxt('Iris.csv',
                      usecols=(1,2,3,4,5),
                      delimiter = ',',
                      converters = {5: toint},
                      skiprows=1,
                      encoding='utf-8')

my_iris = get_iris()
print(my_iris)
The loadtxt function first gets as a positional parameter the name of the file. We then specify that we want to use all but Column 0. The converters argument is a dictionary that gives the function that converts each file entry. Here, we only need to take care of the last column. (In fact, you could argue that we should only use the four values per row and remember that the row index tells us the species.) The skiprows argument tells us to skip over the header. We select 'utf-8' for the encoding. Alternatively, we can just deal with byte-strings. Finally, for such a small data set, it is often easier to use Vanilla-Python, read the file line-by-line, and create a NumPy array that way. If there are missing values, then NumPy provides genfromtxt, with unfortunately slightly different conventions.
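For a small file, the line-by-line approach can indeed be simpler. The following sketch parses lines in the format shown above; the species names Iris-versicolor and Iris-virginica are assumptions based on the standard data set, and the sample lines are fed in directly rather than read from a file.

```python
import numpy as np

def iris_rows(lines):
    """Parse Iris csv lines (header included) into a NumPy array,
    dropping the Id column and encoding the species as 0, 1, or 2."""
    species = {'Iris-setosa': 0.0, 'Iris-versicolor': 1.0, 'Iris-virginica': 2.0}
    rows = []
    for line in lines[1:]:                 # skip the header line
        fields = line.strip().split(',')
        rows.append([float(v) for v in fields[1:5]] + [species[fields[5]]])
    return np.array(rows)

sample = ['Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species',
          '1,5.1,3.5,1.4,0.2,Iris-setosa',
          '2,4.9,3.0,1.4,0.2,Iris-setosa']
print(iris_rows(sample))
```

With a real file, the lines would come from open('Iris.csv').readlines() instead.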
If you have to deal with missing values, you will be using the NumPy Not-A-Number value (np.nan). We first look at the penguin data set:
species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181,3750,MALE
Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
Adelie,Torgersen,40.3,18,195,3250,FEMALE
Adelie,Torgersen,NA,NA,NA,NA,NA
The first line explains the columns. (The culmen is the upper ridge of a bird's bill.) We give up on the island column, since there is not enough data to distinguish the Adelie population between the different islands. We need two different encodings into floats, one for the species and one for the sex. We need to give the set of 'bad value' markers and the datum we want to use instead, as the missing_values and filling_values arguments, respectively. Without specifying the encoding, we would have to deal with byte strings.
import numpy as np

def encode_species(astring):
    astring = astring.strip()
    if astring == 'Adelie':
        return 0.
    elif astring == 'Chinstrap':
        return 1.
    elif astring == 'Gentoo':
        return 2.

def encode_gender(astring):
    if astring == 'MALE':
        return 0
    elif astring == 'FEMALE':
        return 1

da = np.genfromtxt('penguins_size.csv', delimiter=',', skip_header=1,
                   converters={0: encode_species, 6: encode_gender},
                   missing_values={'NA'},
                   filling_values=np.nan,
                   usecols=(0,2,3,4,5,6),
                   encoding='utf-8')
1.2.3 Accessing NumPy Arrays
We can use normal Python syntax to access (read and write) NumPy array elements. We can also use a short-cut notation for more than one index. If we create a tensor (3-dimensional array) using
tensor = np.random.normal(100, 10, (3,4,5))
we obtain a $3 \times 4 \times 5$ array, as we can check with tensor.shape. Notice that we give and obtain the dimensions as a tuple. A handy convention of NumPy avoids brackets in order to access coefficients of multi-dimensional arrays. We use one bracket and a tuple of the coordinates. For example, we can then abbreviate a call to tensor[2][2][3] as tensor[2,2,3].
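A quick check of this convention; the random values themselves do not matter here.

```python
import numpy as np

tensor = np.random.normal(100, 10, (3, 4, 5))
print(tensor.shape)                        # (3, 4, 5)

# Chained brackets and the comma notation address the same element.
print(tensor[2][2][3] == tensor[2, 2, 3])  # True
```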
Just as before, we can also use slices. There is, however, a large difference in how slicing is handled. In Vanilla Python, a slice creates a new list. NumPy is built for dealing with large amounts of data, and copying data can be a major contributor to run-times. Therefore, a slice of a NumPy array is just a short-cut to parts of the original array. It is often referred to as a view. Changing data in a slice changes data in the original. In Vanilla Python, a simple way to make a copy of a list is to use slicing; in NumPy, we need to use the copy method on the NumPy array.
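A minimal sketch of the difference between a view and a copy:

```python
import numpy as np

original = np.arange(10)
view = original[2:5]        # a view into the same data...
view[0] = 99                # ...so this writes into original
print(original[2])          # 99

independent = original[2:5].copy()  # an actual copy
independent[0] = -1
print(original[2])          # still 99: the copy is detached
```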
Multi-dimensional slicing is also used with the comma notation. You put a colon for every dimension where you do not want to slice. We see examples in the following section.
1.2.4 Example: Simple Image Manipulation
One way to experiment with slicing in NumPy is to manipulate an image that is internally stored as a NumPy array. The magic is in the matplotlib library, which you have to install first using pip3. Please use your favorite picture. The one we are using here was taken by one of us on the grounds of the Konark Sun Temple in Odisha State. We open it with mpimg and display it with imshow. The result is a three-dimensional array. The first two dimensions give the pixel coordinates. Along the third dimension, we get a color picture from three two-dimensional arrays, one each for the red, the green, and the blue color.
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
im = mpimg.imread('suntemple.jpg')
print(im.shape)
plt.imshow(im)
plt.show()
The result is the picture (with coordinates added, which you can remove once you learn about matplotlib.)
This is a true rendering of the photograph. We can also look at the individual color components, one for red, one for green, and one for blue. But when we render a two-dimensional array alone, the Matplotlib module will translate the magnitudes into colors, and we need to specify how this translation is handled. The simplest choice is the predefined colormap 'gray', which translates numbers into shades of gray. To look at the red component, for instance, we specify
plt.imshow(im[:,:,0], cmap='gray')
If we replace the third coordinates with 1 and 2, we get the green and blue components of the picture. If you do not like strings like ":,:,", you can also use an ellipsis once, as in plt.imshow(im[...,0], cmap='gray').
We can swap the x-coordinates using the second component:
plt.imshow(im[:,::-1,:])
We can swap the y-coordinates with plt.imshow(im[::-1,:,:]). We can restrict the image with
plt.imshow(im[1000:3000,1000:3000,:])
If we reverse the third coordinate, we exchange the red and blue color channels.
plt.imshow(im[1000:3000,1000:3000,::-1])
gives
We can set each red, green, and blue color value to either 255 or 0 using NumPy's where function, here with a threshold of 100:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

im = mpimg.imread('konarc.jpg')
print(im.shape)
image = np.where(im > 100, 255, 0)
plt.imshow(image)
plt.show()
The result looks a bit like pop-art:
1.2.6 Example: The Sieve of Eratosthenes
One method to calculate prime numbers starts with a list of all numbers
and then eliminates successively composite numbers, namely the multiples
of already recognized primes. These Sieve Methods are still one of the
best ways to efficiently calculate lists of prime numbers. We consider
here the simplest incarnation of a sieve method, named after
Eratosthenes of Cyrene (c. 276--194 BC), who is also credited with
calculating the earth's circumference. The idea is simple. We write down
True for all numbers between 0 and N. (We start with 0 because this is
how Python indexes lists.) Starting with 2, we "cross out" all true
multiples of 2 by setting the corresponding array value to False. Then
we move on to 3, crossing out all true multiples of 3. In fact, we only
need to start with $3^2$, since smaller multiples of 3, namely 6 as
the only one, have already been crossed out. We can skip over 4, since
we already know from the array that 4 is not a prime. (We only ever
change the array contents from True to False, so we can trust that a
False-value sticks.) This leaves us now with 5. Starting with $5^2$,
we cross out all multiples of 5. We can stop just before we reach
$\sqrt{N}$. We now translate this algorithm directly to NumPy:
import numpy as np

is_prime = np.ones((1001,), dtype=bool)
is_prime[:2] = False
n_max = int(np.sqrt(len(is_prime)))
for index in range(2, n_max + 1):
    if is_prime[index]:
        is_prime[index**2::index] = False
print(is_prime.nonzero())

In the line creating is_prime, we exploit the convention that all non-zero values, including 1 of course, translate to True; we set the data type of the coefficients to bool with the dtype parameter. Since 0 and 1 are not prime, we immediately mark them as False. We then set the maximum index to $\left\lfloor \sqrt{N} \right\rfloor$; note that the loop must include this value, which is why we iterate up to n_max + 1. Starting from 2, we run the crossing-out process. Only if the current index is not known to be a composite number do we cross out its multiples, starting with the square of index and stepping up by index. We create a slice of the is_prime array corresponding to this task, namely

is_prime[index**2::index]

The start value is index**2, the end value is the default, and the step is index. Changing a slice in NumPy changes the original array. We thus set the slice to False. This is possible because of broadcasting, where NumPy interprets the scalar False as an array matching the size of the slice on the left. In the last line, we use the nonzero method, which returns all indices whose value is non-zero, i.e., those still True: exactly the primes.
1.2.7 Indexing with Arrays
To select elements from an array, we can also use an array of indices, which needs to have one of the integer types defined in NumPy. Here, we first create an array my_nums of one hundred 8-bit integers and then select elements 5, 10, 15, 20, and 25 of this array:
my_nums = np.random.randint(0, 10, 100, dtype=np.int8)
index = np.arange(5, 26, 5, dtype=np.int8)
print(my_nums[index])
1.2.8 Conditional Indexing
We can create a Boolean NumPy array out of a NumPy array by posing a condition inside the square brackets. For example,
x = np.random.randint(0,10, (3,5))
creates a $3 \times 5$ matrix of random numbers between 0 (included) and 10 (excluded). I got
array([[7, 9, 4, 0, 6],
[1, 5, 0, 9, 3],
[0, 0, 5, 7, 6]])
To select the even numbers, we write the condition, substituting the array name for the coefficient, as in x%2==0. We then place this condition into the brackets in lieu of the index:
x[x%2==0]
The result is an array of all even numbers in the array, but as a vector.
array([4, 0, 6, 0, 0, 0, 6])
To understand the workings, first create a Boolean array out of the condition:
bool_array = x%2==0
(The comparison operator binds more strongly than the assignment.) The assignment gives
array([[False, False, True, True, True],
[False, False, True, False, False],
[ True, True, False, False, True]])
Such a Boolean array can then be used as an index.
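Such a Boolean index can also appear on the left side of an assignment. A small sketch, reusing the matrix from above, that overwrites every even entry in place:

```python
import numpy as np

x = np.array([[7, 9, 4, 0, 6],
              [1, 5, 0, 9, 3],
              [0, 0, 5, 7, 6]])

x[x % 2 == 0] = -1    # all seven even entries become -1
print(x)
```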
NumPy has sophisticated logic functions that allow you to filter out certain rows or columns. If you look at the penguin example, you see that there are some entries with np.nan values. To check whether a NumPy value is np.nan, we use the np.isnan() function. To filter out any rows that contain one or more np.nan values, we can use np.any(), where the axis parameter lets us decide whether we aggregate over rows (axis=1) or columns (axis=0). np.any is the Boolean Or of the values, interpreted as Boolean values, in the same row or column, whereas np.all returns the Boolean And. Finally, we want the rows that do not contain an np.nan value, so we need to negate. Boolean operators in NumPy almost always use symbols, and the symbol for not is the tilde ~.
Putting all of these together, we can define now a new penguins array using
penguins2 = da[~np.isnan(da).any(axis=1)]
Much more can be said about these types of data-cleaning operations, but in practice, this part of NumPy has been superseded by the Pandas module, which has the dropna method.
1.2.9 NumPy Array Properties
Every NumPy array has a number of properties, just as a Python list has a length. Access to properties is often given through methods, just as len does for a Python list. The most important property is probably the shape of a numpy.array. We can change the shape of an array with the reshape method. Here, we first make a vector into a $5 \times 2$ matrix and then extract the number of dimensions and the shape. The length of a NumPy array is the size of its first dimension. If you deal with massive data, you might be interested in the memory use of your arrays: size gives the number of coefficients and nbytes the memory footprint of the data part of your NumPy array:
a = np.arange(1,11).reshape(5,2)
a.ndim # is 2
a.shape # is (5,2)
len(a) # is 5
a.size # 10=2x5
a.nbytes # 80
An array might have only a single coefficient in each dimension. For example,
b = np.array([[[[1]]]])
b.ndim # is 4
b.shape # is (1, 1, 1, 1)
b.size # is 1
b.nbytes # is 8 (in bytes)
NumPy data types also have properties. If you have a NumPy array, you can find out the size of each coefficient with itemsize. We can specify the data type during array creation with dtype. Here we are using 8-bit integers (ranging from -128 to 127) and short floats.
a = np.array([1,2,3], dtype = np.int8)
b = np.array([1,3,2], dtype = np.float16)
a.itemsize # gives 1
b.itemsize # gives 2
1.2.10 Changing the shape of NumPy arrays
We can flatten a NumPy matrix or tensor using ravel. The last dimension is un-raveled first.
a = np.random.randint(-5,4,(3,4,5))
gives for me
array([[[ 1, -2, 1, -2, -1],
[-5, -1, -3, 2, 0],
[ 0, 0, 3, -2, -2],
[-1, 1, 3, -3, 3]],
[[ 2, -4, 0, 1, -4],
[-5, -2, -1, -4, -1],
[ 3, 2, -4, 3, 3],
[-1, 3, 2, 1, 3]],
[[ 2, -1, -2, -5, -4],
[ 3, -5, 0, -1, -5],
[-2, 1, 3, -3, -1],
[ 0, -4, -4, -1, 0]]])
Unraveling it with a.ravel() gives the first matrix flattened, followed by the second matrix, etc.
array([ 1, -2, 1, -2, -1, -5, -1, -3, 2, 0, 0, 0, 3,
-2, -2, -1, 1, 3, -3, 3, 2, -4, 0, 1, -4, -5,
-2, -1, -4, -1, 3, 2, -4, 3, 3, -1, 3, 2, 1,
3, 2, -1, -2, -5, -4, 3, -5, 0, -1, -5, -2, 1,
3, -3, -1, 0, -4, -4, -1, 0])
The opposite operation is reshaping. To obtain the original array back, we just say
a.ravel().reshape(3,4,5)
reshape will return a view of the original whenever the data can be reinterpreted without copying; otherwise a copy is made. While ravel gives a view when possible, the similar flatten method always gives a copy. There are some subtleties, as the order of elements in a matrix differs between C and Fortran.
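A short sketch of the two element orders, and of ravel returning a view:

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)
print(m.ravel())             # row-major (C) order: [1 2 3 4 5 6]
print(m.ravel(order='F'))    # column-major (Fortran) order: [1 4 2 5 3 6]

# ravel gives a view when it can, so writing to it changes m...
v = m.ravel()
v[0] = 99
print(m[0, 0])               # 99

# ...while flatten always copies.
f = m.flatten()
f[0] = 0
print(m[0, 0])               # still 99
```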
1.3 NumPy Vectorization
NumPy excels at efficient calculations on arrays. Many operations act directly on entire NumPy arrays. This is called vectorization.
1.3.1 Operations between Vectors
There are a number of operations between NumPy arrays, such as addition, subtraction, multiplication, division, and floor-division, all of which act element-wise. There are also a number of binary operations. The operations are implemented by universal functions: np.absolute, np.add, np.subtract, np.mod, np.negative, np.multiply, np.power, np.divide, and np.floor_divide.
If your operation encounters an error, no error is raised but a warning is given.
x = np.random.randint(-9,10, (2,3))
gives
array([[ 9, 9, -5],
[ 0, 1, -5]])
and
y = np.random.randint(-9,10, (2,3))
gives
array([[-8, 7, 9],
[-4, -4, -7]])
When we multiply, divide, or floor-divide, the operation is done coefficient by coefficient. When we execute y//x with these numbers, we get a warning and a result that sets 0//0 to zero. If we do the (normal) division y/x, the entry -4/0 becomes -np.inf.
array([[-0.88888889, 0.77777778, -1.8 ],
[ -inf, -4. , 1.4 ]])
There are also unary operations, negation and the transpose; most frequently we use y.T to obtain the transpose of a matrix. Since Python 3.5, the "commercial at" symbol @ is used for matrix multiplication. Thus, we can use x@y.T to multiply x with the transpose of y. Following MATLAB usage, we can also apply a matrix to a one-dimensional vector, which is then treated as a column vector. Thus
x@np.array([2,4,8])
yields
array([ 14, -36]).
The dot product of two vectors can be calculated as a matrix operation, but the built-in NumPy functions np.dot and np.vdot are more direct. To calculate the dot product of two vectors, we can use
np.dot(vector1, vector2)
np.vdot(vector1, vector2)
vector1 @ vector2
interchangeably. There are differences for matrices, though. To calculate the inverse of a matrix, we need to use the linalg sub-package which has a function for the inverse.
z = np.random.randint(-24,25,(3,3))
zinv = np.linalg.inv(z)
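A random matrix like z can, in rare cases, be singular, in which case inv raises an error. With a fixed, invertible matrix (the values here are one possible choice), we can verify the inverse by multiplying back:

```python
import numpy as np

z = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
zinv = np.linalg.inv(z)

# The product with the inverse should be the identity, up to rounding.
print(np.allclose(z @ zinv, np.eye(3)))  # True
```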
We can compare two arrays (of the same shape) and obtain a Boolean array. For example
a = np.array([[1,5],[3,2]])
b = np.array([[3,3],[4,4]])
(a < b).flatten()
results in an array [ True, False, True, True]. Some operations are in the linalg submodule. For example
$$\begin{pmatrix}
1 & 2 \\
1 & - 1 \\
\end{pmatrix}^{10}$$
can be calculated by
np.linalg.matrix_power(np.array([[1,2],[1,-1]]),10)
and we can solve a system of linear equations such as $x + 2y = 2; x - y = 3$, with solution $x = \frac{8}{3}; y = -\frac{1}{3}$, by
np.linalg.solve( np.array([[1,2],[1,-1]]) ,np.array([2,3]) )
1.3.3. Broadcasting
Typical NumPy operations are between arrays of the same shape. Broadcasting allows operations between arrays of different sizes. A simple example is scalar multiplication. We multiply all elements of a vector (or matrix as below) with the scalar. For example,
a = np.array( [[1,2], [3,4]])
5*a
yields the array
array([[ 5, 10],
[15, 20]])
NumPy broadcasting extends scalar multiplication. If we create a vector and then a matrix with the same number of columns, we can add them. First, we create a $2 \times 5$ matrix by reshaping the vector $(1,2,3,4,5,6,7,8,9,10)$. We then create a vector of size 5 using full. We then add them. Broadcasting in this case replicates the vector's row to match the $2 \times 5$ shape of the matrix.
matrix = np.arange(1,11).reshape((2,5))
vector = np.full(5,1.25)
At this point, matrix is
array([[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10]])
and vector is
array([1.25, 1.25, 1.25, 1.25, 1.25])
Before the addition, broadcasting stacks a second, identical row onto the vector. Thus,
vector + matrix
becomes
array([[ 2.25, 3.25, 4.25, 5.25, 6.25],
[ 7.25, 8.25, 9.25, 10.25, 11.25]])
Broadcasting expands a dimension of size one to match the size of the corresponding dimension of the other operand.
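The code for the next example does not appear in the text; a plausible sketch that produces the stated result is:

```python
import numpy as np

v = np.array([1, 2, 3])   # shape (3,)
s = np.array([4])         # shape (1,), promoted to (3,) by broadcasting

print(v + s)              # [5 6 7]
```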
In this example, we have a vector of shape (3,) and a one-element array, which is promoted to shape (3,) as well by repeating its value three times. The result is
array([5, 6, 7]).
This also happens with column vectors. First, we create a $3 \times 3$ matrix by reshaping to (3,3) and a $3 \times 1$ matrix (that is, a column vector) with shape (3,1), both from the np.arange function. When we add them, the missing elements in the second component of the shape are provided by copying:
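The code itself is not shown here; a sketch under these assumptions could look like:

```python
import numpy as np

m = np.arange(9).reshape(3, 3)   # 3x3 matrix
c = np.arange(3).reshape(3, 1)   # 3x1 column vector

# c's single column is copied across to form a 3x3 operand.
print(m + c)
```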
The result is a $3 \times 3$ matrix. If we have a row and a column vector, the row vector receives additional rows and the column vector receives additional columns. We can also do this when both operands have to change.
On the left, we have an array with shape (3,1) and on the right, one with shape (1,4). The first dimension becomes 3 and the second dimension becomes 4, so that the result has shape (3,4). These examples should suffice to explain broadcasting in practice. Broadcasting follows two simple rules based on the shape of the operands: Working from the rightmost dimension, the two operands' dimensions are compatible when
- they are equal, or
- one of them is equal to 1 (in which case the array is replicated along that dimension).
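For instance, the (3,1)-plus-(1,4) case described above can be sketched as:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4).reshape(1, 4)   # shape (1, 4)

# Both operands are expanded: col to (3, 4), row to (3, 4).
result = col + row
print(result.shape)                # (3, 4)
```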
1.3.4. Ufunctions
NumPy knows of a large number of transcendental functions that can be applied directly to NumPy arrays, whatever their shape. For example, with the matrix a = np.array([[1,5],[3,2]]) from above, np.exp(a) is
array([[ 2.71828183, 148.4131591 ],
[ 20.08553692, 7.3890561 ]])
The family of ufunctions is fairly complete. There are the exponents and logarithms, there are trigonometric functions, there are binary operations like addition and subtraction, rounding functions like floor, ceil, and around. There is also a complete set of descriptive statistics allowing the user to calculate percentiles, medians, averages, variance, standard deviation and correlations. A complete list is in the NumPy online manual.
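A few of the descriptive statistics mentioned, in a minimal sketch:

```python
import numpy as np

data = np.array([1., 2., 2., 3., 7.])

print(np.mean(data))            # 3.0
print(np.median(data))          # 2.0
print(np.std(data))             # population standard deviation
print(np.percentile(data, 75))  # 3.0
```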
1.3.5. Broadcasting Example with NumPy and PyPlot
PyPlot is the standard graphics package, which we will consider at greater length in Module 3. One of its facilities converts a matrix of values, such as floating-point numbers, into a color-coded image. PyPlot has a number of different color schemes, and you are encouraged to consult the PyPlot manual for your experimentation. You might have to install matplotlib.pyplot via the pip utility in your Python installation. Our simple code creates a two-dimensional matrix by adding a row and a column vector. We create them using the linspace function.
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,5,101)
y = np.linspace(0,5,101).reshape(101,1)
z = np.sin(x)**5+np.cos(10+y)
plt.imshow(z,origin='lower',extent=[0,5,0,5],cmap='magma')
plt.show()
You can experiment with other functions and of course, you can print out the matrix z.
1.3.6. Binary UFunction Methods
Binary ufunctions such as np.multiply come with four additional methods: reduce, accumulate, reduceat, and outer; here we use the first two. Reducing lowers the dimension of a tensor by one. For example, we can create a $4 \times 5$ matrix by
x = np.linspace(1,20,20).reshape(4,5)
which gives us
array([[ 1., 2., 3., 4., 5.],
[ 6., 7., 8., 9., 10.],
[11., 12., 13., 14., 15.],
[16., 17., 18., 19., 20.]])
We then can reduce this matrix by one dimension using multiplication. If we use
np.prod(x)
we get the product of all elements. If we use np.multiply.reduce(x), we get
array([ 1056., 2856., 5616., 9576., 15000.])
which is the product of the coefficients in each column. We can also choose the direction: if we want the product of the coefficients in each row, we need to specify the axis.
np.multiply.reduce(x, axis=1)
gives us
array([1.20000e+02, 3.02400e+04, 3.60360e+05, 1.86048e+06])
By the way, we can also use negative numbers to specify the axis.
np.multiply.reduce(x, axis=-1)
reduces along the last axis, which happens to be axis 1.
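A short sketch confirming the axis behavior, using the same matrix x as above:

```python
import numpy as np

x = np.linspace(1, 20, 20).reshape(4, 5)

# axis=-1 means the last axis, which for a matrix is axis 1.
assert np.array_equal(np.multiply.reduce(x, axis=-1),
                      np.multiply.reduce(x, axis=1))

# np.add.reduce is the ufunc counterpart of np.sum:
print(np.add.reduce(x, axis=0))   # column sums: [34. 38. 42. 46. 50.]
```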
We can also accumulate along an axis with the accumulate method. The accumulator combines the running result with the next coefficient along the selected axis using the binary operation. For example, np.cumsum is an abbreviation of np.add.accumulate and, similarly, np.cumprod is an abbreviation of np.multiply.accumulate. With the same matrix x as before, we can accumulate by adding columns to each other:
np.cumsum(x, axis=1)
yields
array([[ 1., 3., 6., 10., 15.],
[ 6., 13., 21., 30., 40.],
[11., 23., 36., 50., 65.],
[16., 33., 51., 70., 90.]])
where the second column is the sum of the first two original columns, the third column the sum of the first three, etc. By changing the axis to 0, we obtain running sums of the rows. If we do not specify an axis, then cumsum flattens the matrix and accumulates along the flattened array.
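The two remaining variants, sketched with the same matrix:

```python
import numpy as np

x = np.arange(1, 21, dtype=float).reshape(4, 5)

print(np.cumsum(x, axis=0))   # running sums down each column
print(np.cumsum(x))           # flattened first; result has shape (20,)
```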
Figure 2: Graph of share prices of a mythical stock.
As an example, assume that you are writing a machine learning algorithm that decides the best time to buy and sell a share. You want to compare the results of your algorithm with the optimal solution over a given time interval. In Figure 2, we have a graph of the share price of a mythical stock. An optimal strategy consists of buying stock on one day and selling it on a later day. If $(i,j)$ is an optimal strategy, where on day $i$ we buy at price $p_{i}$ and on day $j$ we sell at price $p_{j}$, then $i \leq j$ and the buying price is the minimum share price up to day $j$, i.e. $p_{i} = \min(\{ p_{\nu} \mid \nu \leq j \})$. Otherwise, we could have bought at a cheaper price and sold for the same one. NumPy has no cumulative minimum function, so we need to use np.minimum.accumulate. The difference between the prices array and the accumulated minimum, $p_{j} - \min(\{ p_{\nu} \mid \nu \leq j \})$, is the profit made when selling on day $j$. We then can use np.argmax to find the day on which we ought to sell.
import numpy as np
prices = np.array([100, 102, 110, 120, 130, 134, 125, 113, 101,
                   99, 98, 97, 90, 88, 92, 103, 106, 110, 102,
                   100, 105, 115, 123, 129, 126, 125, 125,
                   128, 118, 108, 87])
print(np.argmax(prices - np.minimum.accumulate(prices)))
In this solution, we traverse the array thrice. First, to calculate the cumulative minimum, second to find the profit for selling on a specific day, and finally to find the maximum. Thus, this solution can still be improved. However, the runtime is proportional to the length of the array, meaning that this is still a linear algorithm. Since we have to look at all of the prices, any better algorithm would still be linear in the length of the array.
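As a small, optional extension (the names buy and sell are ours, not from the text), the buying day can be recovered from the selling day: it is the day of the cheapest price up to and including the selling day.

```python
import numpy as np

prices = np.array([100, 102, 110, 120, 130, 134, 125, 113, 101,
                   99, 98, 97, 90, 88, 92, 103, 106, 110, 102,
                   100, 105, 115, 123, 129, 126, 125, 125,
                   128, 118, 108, 87])

profit = prices - np.minimum.accumulate(prices)
sell = int(np.argmax(profit))            # best day to sell
buy = int(np.argmin(prices[:sell + 1]))  # cheapest day up to then

print(buy, sell, profit[sell])           # 13 23 41
```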
1.3.7. Another broadcasting example: Finding the nearest point
We have looked at the Penguin dataset, which gives four measurements (not counting gender) for three different species of penguins. One simple way of classifying unknown entities (Antarctic flightless birds in our case) is by extracting the same measurements and then finding the specimen with the closest set of measurements among those that are already classified. We will see a simple, but quite effective machine learning tool in the next module that refines this idea.
We pick three penguins from the penguin data set (namely Penguins 1, 200, and 300). We then take all penguin measurements and compare them to the measurements of these three penguins. We then "guess" the species of the bird under consideration based on which penguin, 1, 200, or 300, has the closest measurements. As the distance, we use the square root of the sum of the squared differences, i.e., the Euclidean distance.
samples = da[[1, 200, 300], 2:5]
for x in da:
    diff = x[2:5] - samples
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    print(x[0], np.argmin(dist))
In our code, we first obtain the data as seen previously. Our sample is a $3 \times 3$ matrix, made up of three rows of the data set, each holding the measurements of one penguin, one per species. Our data points x in da are vectors of six coefficients. We first select coefficients two to four and then subtract samples. For this, the data point's slice is broadcast into a $3 \times 3$ matrix. We then sum the squares of the coefficients along each row (axis=-1) and take the square root, which gives a vector of three distances. We then use argmin to select the index of the minimum distance. This is our guess for the species of the penguin in question, and we print it for every bird.
Now, as we will see shortly, while the result is not horrible, it is certainly open to improvement. Some of the three values we are looking at are inherently larger than others (body size vs. beak length, for example) and will contribute disproportionately to the distance. Therefore, when we use this method, we will normalize the measurements by subtracting their mean and dividing by their standard deviation. With this procedure, we eliminate this particular type of error. The second, immediately obvious improvement is to compare against more than one member of each species. For example, if we categorize animals with sexual dimorphism, then Mr. Specimen might on average be much larger or smaller than Ms. Specimen. In the Penguin data set, you will also find some dependence of the measurements on the island the penguins inhabit.
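The normalization step can be sketched as follows; the numbers are invented and merely stand in for three measurement columns:

```python
import numpy as np

# Hypothetical stand-in for three measurement columns of the data set.
X = np.array([[39.1, 18.7, 181.],
              [39.5, 17.4, 186.],
              [40.3, 18.0, 195.]])

# Broadcasting subtracts the column means and divides by the
# column standard deviations in a single expression.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

print(Xn.mean(axis=0))   # approximately zero in every column
print(Xn.std(axis=0))    # one in every column
```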
1.3.8. The out-parameter
Involved computations benefit greatly from avoiding unnecessary copying. If you use NumPy to calculate in the same manner as you would with scalars in "Vanilla Python", then you might generate unnecessary performance overhead by creating temporary arrays. If your arrays are small, this will not be visible, but if your arrays are large, then you might be interested in avoiding this. The primary manner of doing so is the out parameter, which lets many ufunctions write their result into an existing array instead of allocating a new one. For binary operations like addition or subtraction, the same effect is achieved with the augmented assignment statements +=, -=, etc.
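A minimal sketch of the out parameter and its in-place equivalent:

```python
import numpy as np

a = np.ones(5)
b = np.arange(5.0)

np.add(a, b, out=a)   # writes the sum into a; no temporary array
a += b                # augmented assignment has the same effect

print(a)              # [1. 3. 5. 7. 9.]
```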
Activities:
(1) Create a one-dimensional vector with ten non-zero elements. Define a slice using the even elements. Change all elements in the slice to zero. Check that the original vector has also been altered. Repeat for a Python list of ten non-zero elements.
(2) Import your favorite photograph from your phone to your computer as a jpeg or jpg file. Then manipulate the photo.
(3) Download the Abalone data set from the UC Irvine archive at https://archive.ics.uci.edu/dataset/1/abalone.
The explanation of the columns is in the .name file or in the meta-data on the webpage cited. Import the data file into NumPy and then use NumPy's mean function to find the average and standard deviation of the features for Immature, Male, and Female abalone.
Resources
- Jake VanderPlas: Python Data Science Handbook: Essential Tools for Working with Data: https://jakevdp.github.io
- Wes McKinney: Python for Data Analysis: https://github.com/wesm/pydata-book
- NumPy Manual: https://numpy.org/doc/stable/user/index.html
- SciPy Manual: https://docs.scipy.org/