The learning objectives of this section are:
NumPy is a library for scientific computing and data analysis. The name stands for Numerical Python.
The most basic object in NumPy is the ndarray, or simply an array, which is an n-dimensional, homogeneous array. By homogeneous, we mean that all the elements in a NumPy array have to be of the same data type, which is commonly numeric (float or integer).
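To see what homogeneity means in practice, the short sketch below (not part of the original notes; the variable names are illustrative) passes mixed-type lists to np.array and inspects the resulting dtype. NumPy picks a single type that can hold every element.

```python
import numpy as np

# Integers mixed with a float are all stored as floats
mixed_numeric = np.array([1, 2, 3.5])
print(mixed_numeric, mixed_numeric.dtype)

# Numbers mixed with a string are all stored as strings
mixed_text = np.array([1, 2, "three"])
print(mixed_text, mixed_text.dtype)
```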
NumPy arrays can be created from Python iterables such as:
list
tuple
range
iterator
Notice that not all iterables can be used to create a NumPy array directly; set and dict are two examples.
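As a rough illustration of the set and dict caveat (an added sketch, with made-up variable names), passing a set straight to np.array does not raise an error but wraps the whole set in a single 0-dimensional object array. Converting to a list first gives the array one would normally expect.

```python
import numpy as np

values = {3, 1, 2}

# Passing the set directly wraps the whole set in a 0-dimensional object array
wrapped = np.array(values)
print(wrapped.shape, wrapped.dtype)

# Converting to a list first gives the expected 1-D numeric array
arr = np.array(sorted(values))
print(arr)

# For a dict, convert the keys or the values explicitly
prices = {"a": 1.5, "b": 2.5}
price_arr = np.array(list(prices.values()))
print(price_arr)
```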
# np is simply an alias; you may use any other name, though np is the standard convention
import numpy as np
# Creating a 1-D array using a list
arr = np.array([1,2,3,4,5])
print(arr)
print(type(arr))
# Creating a 1-D array using a tuple
arr = np.array((1,2,3,4,5))
print(arr)
arr = np.array(range(10))
print(arr)
arr = np.array([[1,2,3], [4,5,6]], dtype='int')
print(arr)
print('Data Type:',arr.dtype)
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)
np.arange() can be used in place of np.array(range()).
# np.arange(start, stop, step)
arr = np.arange(0, 20, 2)
print(arr)
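The following sketch (added for illustration) checks two properties of np.arange that are easy to overlook: the stop value is excluded, as with range(), and the step may be a non-integer.

```python
import numpy as np

# The stop value is exclusive, just as with range()
evens = np.arange(0, 20, 2)
print(evens)            # [ 0  2  4  6  8 10 12 14 16 18]

# Same result as converting a range object
same = np.array(range(0, 20, 2))
print(np.array_equal(evens, same))   # True

# Unlike range(), arange() also accepts a non-integer step
halves = np.arange(0, 2, 0.5)
print(halves)           # [0.  0.5 1.  1.5]
```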
The other common way is to initialise arrays. You do this when you know the size of the array beforehand.
The following ways are commonly used:
np.linspace(): Create array of fixed length
np.random.rand(): Create array of random values in the range [0,1)
np.ones(): Create array of 1s
np.zeros(): Create array of 0s
np.random.random(): Create array of random numbers
np.arange(): Create array with increments of a fixed step size
np.linspace(start, stop, num_of_elements, endpoint=True, retstep=False) has 5 parameters:
start: start number (inclusive)
stop: end number (inclusive unless endpoint is set to False)
num_of_elements: number of elements contained in the array
endpoint: boolean value representing whether the stop number is inclusive or not
retstep: boolean value representing whether to return the step size
arr, step_size = np.linspace(0, 5, 8, endpoint=False, retstep=True)
print(arr)
print('The step size is ' + str(step_size))
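To make the effect of endpoint concrete, here is a small added sketch that builds the same five-point grid on the interval from 0 to 1 both ways. The variable names are illustrative only.

```python
import numpy as np

# endpoint=True (the default): 5 points from 0 to 1, both ends included
closed, closed_step = np.linspace(0, 1, 5, retstep=True)
print(closed, closed_step)

# endpoint=False: 5 points with the stop value excluded, so the step shrinks to 1/5
half_open, open_step = np.linspace(0, 1, 5, endpoint=False, retstep=True)
print(half_open, open_step)
```

With the endpoint included the step is (stop - start) / (n - 1); with it excluded the step is (stop - start) / n.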
np.random.rand() returns values in the range [0,1).
np.random.rand()
np.random.rand(5)
arr = np.random.rand(3, 3)
print(arr)
np.random.rand(3,3)
# Create a 4 x 4 random array of integers ranging from 0 to 99 (the upper bound 100 is excluded)
np.random.randint(0, 100, (4,4))
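Random arrays differ on every run, which can make results hard to reproduce. One common option, shown here as an added sketch rather than something the original notes cover, is NumPy's seeded Generator interface obtained from np.random.default_rng. The seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator gives the same "random" numbers on every run
rng = np.random.default_rng(seed=42)

floats = rng.random((3, 3))               # uniform floats in [0, 1)
ints = rng.integers(0, 10, size=(4, 4))   # integers from 0 to 9 (10 is excluded)

print(floats)
print(ints)
```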
np.zeros(): create array of all zeros in given shape
np.zeros_like(): create array of all zeros with the same shape and data type as the given input array
zeros = np.zeros((2,3))
print(zeros)
arr = np.array([[1,2], [3,4],[5,6]])
arr
zeros = np.zeros_like(arr)
print(zeros)
print('Data Type:',zeros.dtype)
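The difference in data type between the two functions is worth checking explicitly. In this added sketch, np.zeros falls back to float64 unless told otherwise, while np.zeros_like inherits the dtype of its template array (the int16 template is chosen only to make the inheritance visible).

```python
import numpy as np

# np.zeros() defaults to float64 unless a dtype is given
default_zeros = np.zeros((2, 3))
int_zeros = np.zeros((2, 3), dtype=np.int32)
print(default_zeros.dtype, int_zeros.dtype)   # float64 int32

# np.zeros_like() copies both shape and dtype from its argument
template = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int16)
like_zeros = np.zeros_like(template)
print(like_zeros.shape, like_zeros.dtype)     # (3, 2) int16
```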
np.ones(): create array of all ones in given shape
np.ones_like(): create array of all ones with the same shape and data type as the given input array
ones = np.ones((3,2))
print(ones)
arr = [[1,2,3], [4,5,6]]
ones = np.ones_like(arr)
print(ones)
print('Data Type: ' + str(ones.dtype))
np.empty(): create array of uninitialised values in given shape
np.empty_like(): create array of uninitialised values with the same shape and data type as the given input array
Notice that the initial values are not necessarily set to zeroes. They are whatever values happen to be in the allocated memory, so every element should be assigned before it is read.
empty = np.empty((5,5))
print(empty)
arr = np.array([[1,2,3], [4,5,6]], dtype=np.int64)
empty = np.empty_like(arr)
print(empty)
print('Data Type: ' + str(empty.dtype))
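A typical use of np.empty is to allocate a buffer and then fill every element yourself. The sketch below is a minimal added example of that pattern; the loop is only there to make the fill step explicit.

```python
import numpy as np

n = 5
squares = np.empty(n, dtype=np.int64)   # contents are undefined at this point

# Every slot is written before it is read
for i in range(n):
    squares[i] = i * i

print(squares)   # [ 0  1  4  9 16]
```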
np.full(): create array of constant values in given shape
np.full_like(): create array of constant values with the same shape and data type as the given input array
full = np.full((4,4), 5)
print(full)
arr = np.array([[1,2], [3,4]], dtype=np.float64)
full = np.full_like(arr, 5)
print(full)
print('Data Type: ' + str(full.dtype))
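Because np.full_like inherits the data type of its input, the fill value is converted to that type. The added sketch below shows how a fractional fill value is truncated when the template holds integers, and how an explicit dtype avoids that.

```python
import numpy as np

int_template = np.array([[1, 2], [3, 4]])                      # integer dtype
float_template = np.array([[1, 2], [3, 4]], dtype=np.float64)

# The fill value is cast to the template's dtype, so 5.7 is truncated to 5
truncated = np.full_like(int_template, 5.7)
print(truncated)

# With a float template the fill value is kept as is
kept = np.full_like(float_template, 5.7)
print(kept)

# Passing dtype explicitly overrides the template's dtype
overridden = np.full_like(int_template, 5.7, dtype=np.float64)
print(overridden)
```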
np.repeat(iterable, reps, axis=None): repeat each element n times
iterable: input array
reps: number of repetitions
axis: which axis to repeat along; the default is None, which flattens the input array and then repeats
np.tile(): repeat the whole array n times
iterable: input array
reps: number of repetitions; it can be a tuple to represent repetitions along each axis
# No axis specified, then flatten the input array first and repeat
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3))
# An example of repeating along axis 0 (each row is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=0))
# An example of repeating along axis 1 (each column is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=1))
# Repeat the whole array by a specified number of times
arr = [0, 1, 2]
print(np.tile(arr, 3))
# Repeat along specified axes
print(np.tile(arr, (2,2)))
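The difference between np.repeat and np.tile is easiest to see from the output shapes and element order. This added sketch prints both for a small input.

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5]])   # shape (2, 3)

print(np.repeat(arr, 3).shape)           # (18,)  flattened, then repeated
print(np.repeat(arr, 3, axis=0).shape)   # (6, 3) each row repeated 3 times
print(np.repeat(arr, 3, axis=1).shape)   # (2, 9) each column repeated 3 times

row = np.array([0, 1, 2])
print(np.repeat(row, 2))                 # [0 0 1 1 2 2]  element by element
print(np.tile(row, 2))                   # [0 1 2 0 1 2]  whole array at once
print(np.tile(row, (2, 2)).shape)        # (2, 6)
```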
np.eye(size, k=0): create an identity matrix of given size
size: the size of the identity matrix
k: the diagonal offset
np.identity(): same as np.eye() but does not take the diagonal offset k
identity_matrix = np.eye(5)
print(identity_matrix)
# An example of diagonal offset
identity_matrix = np.eye(5, k=-1)
print(identity_matrix)
identity_matrix = np.identity(5)
print(identity_matrix)
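As a quick added check of how the two functions relate, the sketch below confirms that np.eye and np.identity agree for a square matrix, and shows two things only np.eye offers: a diagonal offset and a non-square shape (via its optional second size argument).

```python
import numpy as np

# np.eye(n) and np.identity(n) produce the same square matrix
same = np.array_equal(np.eye(3), np.identity(3))
print(same)   # True

# k > 0 shifts the ones above the main diagonal, k < 0 below it
shifted = np.eye(3, k=1)
print(shifted)

# np.eye() also accepts a second size argument for a non-square result
rectangular = np.eye(2, 3)
print(rectangular)
```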
arr = np.random.rand(5,5)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))
# The input does not have to be a square matrix
arr = np.random.rand(10,3)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))
# Create a matrix given values on the diagonal
# All non-diagonal values set to zeros
arr = np.diag([1,2,3,4,5])
print(arr)
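np.diag works in both directions: given a 2-D array it extracts a diagonal, and given a 1-D array it builds a matrix around it. The added sketch below uses a small deterministic matrix (instead of random values) so the results are easy to verify, and also shows the optional offset k.

```python
import numpy as np

matrix = np.arange(9).reshape(3, 3)
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]

main_diag = np.diag(matrix)          # [0 4 8]
upper_diag = np.diag(matrix, k=1)    # [1 5]
lower_diag = np.diag(matrix, k=-1)   # [3 7]
print(main_diag, upper_diag, lower_diag)

# 2-D in gives 1-D out; 1-D in gives 2-D out
rebuilt = np.diag(main_diag)
print(rebuilt)
```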
What is the advantage of arrays over lists, specifically for data analysis? Put crudely, it is convenience and speed:
Vectorised code typically does not contain explicit looping and indexing etc. (all of this happens behind the scenes, in precompiled C-code), and thus it is much more concise.
Let's see an example of convenience; we'll see one for speed later.
Say you have two lists of numbers and want to calculate the element-wise product. The standard Python list way would need you to map a lambda function (or worse, write a for loop), whereas with NumPy, you simply multiply the arrays.
list_1 = [3, 6, 7, 5]
list_2 = [4, 5, 1, 7]
# the list way to do it: map a function to the two lists
product_list = list(map(lambda x, y: x*y, list_1, list_2))
print(product_list)
# The numpy array way to do it: simply multiply the two arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)
array_3 = array_1*array_2
print(array_3)
print(type(array_3))
As you can see, the NumPy way is clearly more concise.
Even simple mathematical operations on lists require for loops, unlike with arrays. For example, to calculate the square of every number in a list:
# Square a list
list_squared = [i**2 for i in list_1]
# Square a numpy array
array_squared = array_1**2
print(list_squared)
print(array_squared)
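One reason the list versions need a loop or a map is that arithmetic operators mean something different for lists. The added sketch below contrasts the same operators on a list and on an array built from it.

```python
import numpy as np

numbers = [3, 6, 7, 5]
array = np.array(numbers)

print(numbers * 2)   # [3, 6, 7, 5, 3, 6, 7, 5]  (the list is repeated)
print(array * 2)     # [ 6 12 14 10]             (each element is doubled)

print(numbers + numbers)   # [3, 6, 7, 5, 3, 6, 7, 5]  (concatenation)
print(array + array)       # [ 6 12 14 10]             (element-wise sum)

# Multiplying two lists is not defined at all
try:
    numbers * numbers
except TypeError as err:
    print("TypeError:", err)
```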
We mentioned that the key advantages of NumPy are convenience and speed of computation.
You'll often work with extremely large datasets, so it is important to understand how much computation time (and memory) you can save by using NumPy instead of standard Python lists.
Let's compare the computation times of arrays and lists for a simple task of calculating the element-wise product of numbers.
list_1 = [i for i in range(10000000)]
list_2 = [j**2 for j in range(10000000)]
import time
# store start time, time after computation, and take the difference
t0 = time.time()
product_list = list(map(lambda x, y: x*y, list_1, list_2))
t1 = time.time()
list_time = t1 - t0
print("Time Taken:",t1-t0)
array_1 = np.array(list_1)
array_2 = np.array(list_2)
t0 = time.time()
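# Note: the largest products here exceed the int64 range and wrap around silently; only the timing matters
array_3 = array_1*array_2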
t1 = time.time()
numpy_time = t1 - t0
print("Time Taken:",t1-t0)
print("The ratio of time taken is {}".format(list_time/numpy_time))
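A single time.time() measurement can be noisy. As an alternative sketch (an addition, not the method used above), the standard library's timeit module repeats each measurement and lets you take the best run. The array size n and the helper names list_product and array_product are hypothetical choices made for this example; the size is kept small so that it runs quickly and the products stay within the int64 range.

```python
import timeit
import numpy as np

n = 100_000   # hypothetical size, kept small so the sketch runs quickly
list_a = list(range(n))
list_b = list(range(n))
array_a = np.array(list_a, dtype=np.int64)
array_b = np.array(list_b, dtype=np.int64)

def list_product():
    # Element-wise product with plain Python lists
    return [x * y for x, y in zip(list_a, list_b)]

def array_product():
    # Element-wise product with NumPy arrays
    return array_a * array_b

# The best of several repeats is less sensitive to background activity
list_best = min(timeit.repeat(list_product, number=10, repeat=5))
array_best = min(timeit.repeat(array_product, number=10, repeat=5))

print("list :", list_best)
print("numpy:", array_best)
print("ratio:", list_best / array_best)
```

The printed ratio will vary from machine to machine, so no expected value is given here.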
In this case, NumPy is about an order of magnitude faster than lists, though the exact ratio varies from machine to machine. This is with arrays of around ten million elements, but you may work with much larger arrays, with sizes in the order of billions, where the absolute time saved is larger still.
Some reasons for such difference in speed are:
The following discussions demonstrate the differences in speeds of NumPy and standard python: