Create a NumPy Array¶

The learning objectives of this section are:

Understand the advantages of vectorised NumPy code over standard Python approaches
Create NumPy arrays
- Convert lists and tuples to NumPy arrays
- Create (initialise) arrays
Compare computation times in NumPy and standard Python lists

NumPy Basics¶

NumPy is a library written for scientific computing and data analysis. The name stands for Numerical Python.

The most basic object in NumPy is the ndarray, or simply an array, which is an n-dimensional, homogeneous array. By homogeneous, we mean that all the elements in a NumPy array have to be of the same data type, which is commonly numeric (float or integer).
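To make the homogeneity rule concrete, the short sketch below shows how NumPy settles on a single data type when a list mixes integers and floats. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# All integers: NumPy picks an integer dtype (the exact width depends on the platform)
ints = np.array([1, 2, 3])
print(ints.dtype)

# Mixing an integer with floats: every element is upcast to float64
mixed = np.array([1, 2.5, 3])
print(mixed)        # [1.  2.5 3. ]
print(mixed.dtype)  # float64
```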

Create an array From an Iterable¶

Such as

list
tuple
range iterator

Notice that not every iterable works as expected: passing a set or a dict to np.array() does not produce a numeric array of its elements. Convert it to a list first.
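As a rough illustration of the note about set and dict, the sketch below shows what happens when a set is passed directly and one possible workaround. The names values and prices are made up for this example.

```python
import numpy as np

values = {3, 1, 2}

# Passing the set directly wraps the whole set in a 0-dimensional object array
wrapped = np.array(values)
print(wrapped.shape, wrapped.dtype)  # () object

# Converting to a list first (sorted here for a predictable order) gives a numeric array
from_set = np.array(sorted(values))
print(from_set)  # [1 2 3]

# For a dict, choose the keys or the values explicitly
prices = {"a": 10, "b": 20}
from_dict = np.array(list(prices.values()))
print(from_dict)  # [10 20]
```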

# np is simply an alias; any other alias works, though np is the standard convention
import numpy as np

Create a 1D Array¶

# Creating a 1-D array using a list
arr = np.array([1,2,3,4,5])
print(arr)

[1 2 3 4 5]

print(type(arr))

<class 'numpy.ndarray'>

# Creating a 1-D array using a tuple
arr = np.array((1,2,3,4,5))
print(arr)

[1 2 3 4 5]

arr = np.array(range(10))
print(arr)

[0 1 2 3 4 5 6 7 8 9]

Create a 2D Array with Specified Data Type¶

arr = np.array([[1,2,3], [4,5,6]], dtype='int')
print(arr)
print('Data Type:',arr.dtype)

[[1 2 3]
 [4 5 6]]
Data Type: int32

Create a 3D Array¶

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr)

[[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]]
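When working with nested lists like the one above, it can help to check the resulting dimensions. The sketch below inspects the same 3D array using the ndim, shape and size attributes.

```python
import numpy as np

arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr.ndim)   # 3: number of dimensions
print(arr.shape)  # (2, 2, 3): 2 blocks, each with 2 rows and 3 columns
print(arr.size)   # 12: total number of elements
```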

Create an array within a specified range¶

np.arange() can be used in place of np.array(range()); it builds the array directly, and the stop value is excluded.

# np.arange(start, stop, step)
arr = np.arange(0, 20, 2)  
print(arr)

[ 0  2  4  6  8 10 12 14 16 18]
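A couple of further np.arange() calls may help clarify the defaults and the exclusive stop value. This is a small sketch with illustrative variable names.

```python
import numpy as np

# Only stop given: start defaults to 0 and step defaults to 1
first_five = np.arange(5)
print(first_five)  # [0 1 2 3 4]

# A negative step counts down; the stop value (0) is still excluded
countdown = np.arange(10, 0, -2)
print(countdown)  # [10  8  6  4  2]
```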

The other common way to create arrays is to initialise them directly. This is useful when you know the size of the array beforehand.

The following functions are commonly used:

np.linspace(): Create array of fixed length
np.random.rand(): Create array of random values in the range [0, 1)
np.ones(): Create array of 1s
np.zeros(): Create array of 0s
np.random.random(): Create array of random numbers in the range [0, 1), with the shape given as a tuple
np.arange(): Create array with increments of a fixed step size

Create an array of evenly spaced numbers within specified range¶

np.linspace(start, stop, num=50, endpoint=True, retstep=False) generates num evenly spaced values. Its most commonly used parameters are:

start: first value of the sequence (inclusive)
stop: last value of the sequence (inclusive unless endpoint is set to False)
num: number of elements in the array (default 50)
endpoint: boolean indicating whether stop is included
retstep: boolean indicating whether to also return the step size

arr, step_size = np.linspace(0, 5, 8, endpoint=False, retstep=True)
print(arr)
print('The step size is ' + str(step_size))

[0.    0.625 1.25  1.875 2.5   3.125 3.75  4.375]
The step size is 0.625
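To see the effect of endpoint on the spacing, the sketch below compares the two settings over the same interval. With endpoint=True the step is (stop - start) / (num - 1); with endpoint=False it is (stop - start) / num.

```python
import numpy as np

# endpoint=True (default): 6 points from 0 to 5 inclusive, step = 5 / (6 - 1)
closed, closed_step = np.linspace(0, 5, 6, retstep=True)
print(closed)       # [0. 1. 2. 3. 4. 5.]
print(closed_step)  # 1.0

# endpoint=False: 5 points, stop excluded, step = 5 / 5
half_open, open_step = np.linspace(0, 5, 5, endpoint=False, retstep=True)
print(half_open)  # [0. 1. 2. 3. 4.]
print(open_step)  # 1.0
```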

Create an array of random values of given shape¶

The np.random.rand() function returns values drawn uniformly from the range [0, 1)

np.random.rand()

0.6544221693621494

np.random.rand(5)

array([0.01068095, 0.44318069, 0.87650624, 0.80425891, 0.67379408])

arr = np.random.rand(3, 3)
print(arr)

[[0.47263949 0.83856703 0.44455054]
 [0.32544708 0.40413686 0.95260151]
 [0.33851787 0.53030022 0.52063924]]

np.random.rand(3,3)

array([[0.40314645, 0.01681125, 0.37738118],
       [0.85356548, 0.50143152, 0.37164291],
       [0.66888267, 0.49091088, 0.11092338]])

# Create a 4 x 4 random array of integers ranging from 0 to 99 (the upper bound 100 is excluded)
np.random.randint(0, 100, (4,4))

array([[75, 47, 54, 45],
       [14, 83, 72,  6],
       [57, 97, 32, 22],
       [69, 75, 84, 12]])
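The random outputs above change on every run. If reproducible results are needed, one option is NumPy's newer Generator interface with a fixed seed. The sketch below is an alternative to the np.random.rand() and np.random.randint() calls used in this section, not a description of them; the seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

# 3 x 3 array of floats in the range [0, 1)
floats = rng.random((3, 3))
print(floats)

# 4 x 4 array of integers from 0 to 9 (the upper bound 10 is excluded)
ints = rng.integers(0, 10, size=(4, 4))
print(ints)
```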

Create an array of zeros of given shape¶

np.zeros(): create array of all zeros in given shape
np.zeros_like(): create array of all zeros with the same shape and data type as the given input array

zeros = np.zeros((2,3))
print(zeros)

[[0. 0. 0.]
 [0. 0. 0.]]

np.zeros_like()¶

arr = np.array([[1,2], [3,4],[5,6]])
arr

array([[1, 2],
       [3, 4],
       [5, 6]])

zeros = np.zeros_like(arr)
print(zeros)
print('Data Type:',zeros.dtype)

[[0 0]
 [0 0]
 [0 0]]
Data Type: int32

Create an array of ones of given shape¶

np.ones(): create array of all ones in given shape
np.ones_like(): create array of all ones with the same shape and data type as the given input array

ones = np.ones((3,2))
print(ones)

[[1. 1.]
 [1. 1.]
 [1. 1.]]

arr = [[1,2,3], [4,5,6]]
ones = np.ones_like(arr)
print(ones)
print('Data Type: ' + str(ones.dtype))

[[1 1 1]
 [1 1 1]]
Data Type: int32

Create an empty array of given shape¶

np.empty(): create an uninitialised array of the given shape
np.empty_like(): create an uninitialised array with the same shape and data type as the given input array

Notice that the initial values are not necessarily zeros.

They are whatever happened to be in the allocated memory, so always assign values before reading from the array.

empty = np.empty((5,5))
print(empty)

[[1.64872267e-315 2.11259150e-316 0.00000000e+000 0.00000000e+000
  1.33360289e+241]
 [1.16095484e-028 8.76381537e+252 1.28153217e-152 7.35167805e+223
  7.56455764e-096]
 [1.31072267e+179 1.16138473e-012 5.39223091e+241 6.28814250e+097
  1.75300433e+243]
 [4.90910702e-109 8.29655075e-114 5.30352517e+180 2.45129535e+198
  8.76380613e+252]
 [5.49548265e+247 4.50603886e-144 4.82412328e+228 1.04718130e-142
  1.03769118e-314]]

arr = np.array([[1,2,3], [4,5,6]], dtype=np.int64)
empty = np.empty_like(arr)
print(empty)
print('Data Type: ' + str(empty.dtype))

[[4607182418800017408 4607182418800017408 4607182418800017408]
 [4607182418800017408 4607182418800017408 4607182418800017408]]
Data Type: int64
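A typical use of np.empty() is to allocate a result array and then overwrite every element. The sketch below, with an illustrative squares array, fills each slot before anything is read from it.

```python
import numpy as np

# Allocate without initialising; the contents are arbitrary at this point
squares = np.empty(5, dtype=np.int64)

# Overwrite every element before reading any of them
for i in range(squares.size):
    squares[i] = i ** 2

print(squares)  # [ 0  1  4  9 16]
```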

Create an array of constant values of given shape¶

np.full(): create array of constant values in given shape
np.full_like(): create array of constant values with the same shape and data type as the given input array

full = np.full((4,4), 5)
print(full)

[[5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]]

arr = np.array([[1,2], [3,4]], dtype=np.float64)
full = np.full_like(arr, 5)
print(full)
print('Data Type: ' + str(full.dtype))

[[5. 5.]
 [5. 5.]]
Data Type: float64
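Because np.full_like() keeps the data type of the input array, a fractional fill value is truncated when the template holds integers. The sketch below illustrates this and one way around it; int_template is a made-up name.

```python
import numpy as np

int_template = np.array([[1, 2], [3, 4]])

# The fill value 2.7 is cast to the template's integer dtype, giving 2
truncated = np.full_like(int_template, 2.7)
print(truncated)
# [[2 2]
#  [2 2]]

# Passing dtype explicitly keeps the fractional part
as_float = np.full_like(int_template, 2.7, dtype=np.float64)
print(as_float)
# [[2.7 2.7]
#  [2.7 2.7]]
```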

Create an array in a repetitive manner¶

np.repeat(a, repeats, axis=None): repeat each element of an array
- a: input array (or array-like, such as a nested list)
- repeats: number of repetitions for each element
- axis: axis to repeat along; the default, None, flattens the input array and then repeats
np.tile(A, reps): repeat the whole array
- A: input array (or array-like)
- reps: number of repetitions; it can be a tuple giving the repetitions along each axis, e.g. (2, 2) for twice along the rows and twice along the columns

# No axis specified, then flatten the input array first and repeat
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3))

[0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5]

# An example of repeating along axis 0 (each row is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=0))

[[0 1 2]
 [0 1 2]
 [0 1 2]
 [3 4 5]
 [3 4 5]
 [3 4 5]]

# An example of repeating along axis 1 (each column is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=1))

[[0 0 0 1 1 1 2 2 2]
 [3 3 3 4 4 4 5 5 5]]

# Repeat the whole array by a specified number of times
arr = [0, 1, 2]
print(np.tile(arr, 3))

[0 1 2 0 1 2 0 1 2]

# Tile with a tuple: 2 copies along the rows and 2 copies along the columns
print(np.tile(arr, (2,2)))

[[0 1 2 0 1 2]
 [0 1 2 0 1 2]]
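The repeat count passed to np.repeat() does not have to be a single number. As a small sketch, a list of counts repeats each element a different number of times, and the shape of a tiled result can be checked directly.

```python
import numpy as np

arr = np.array([0, 1, 2])

# One count per element: 0 once, 1 twice, 2 three times
uneven = np.repeat(arr, [1, 2, 3])
print(uneven)  # [0 1 1 2 2 2]

# Tiling a length-3 array with (2, 2) gives 2 rows of 6 values
tiled = np.tile(arr, (2, 2))
print(tiled.shape)  # (2, 6)
```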

Create an identity matrix of given size¶

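np.eye(N, M=None, k=0): create a 2D array with ones on a diagonal and zeros elsewhere
- N: number of rows
- M: number of columns (defaults to N)
- k: the diagonal offset (0 is the main diagonal, positive values are above it, negative values below)
np.identity(n): create a square n x n identity matrix; unlike np.eye(), it does not take M or k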

identity_matrix = np.eye(5)
print(identity_matrix)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]

# An example of diagonal offset
identity_matrix = np.eye(5, k=-1)
print(identity_matrix)

[[0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]]

identity_matrix = np.identity(5)
print(identity_matrix)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
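The sketch below highlights the practical difference between the two functions: np.eye() can also build non-square arrays, while for a square matrix on the main diagonal both give the same result.

```python
import numpy as np

# np.eye accepts a separate column count and a dtype
rect = np.eye(2, 3, dtype=int)
print(rect)
# [[1 0 0]
#  [0 1 0]]

# For a square matrix with no offset, the two functions agree
same = np.array_equal(np.eye(4), np.identity(4))
print(same)  # True
```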

Create an array with given values on the diagonal¶

arr = np.random.rand(5,5)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))

[[0.26266218 0.08789392 0.41876102 0.96403012 0.85398597]
 [0.16299527 0.80539869 0.73643274 0.64080738 0.64197196]
 [0.69999506 0.17747126 0.56864491 0.11785487 0.28684403]
 [0.55266274 0.05616572 0.6226652  0.2449638  0.15015444]
 [0.94221876 0.07278349 0.6031527  0.2678986  0.84390092]]
Values on the diagonal: [0.26266218 0.80539869 0.56864491 0.2449638  0.84390092]

# The input does not have to be a square matrix
arr = np.random.rand(10,3)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))

[[0.58347526 0.06371201 0.55105406]
 [0.54159295 0.58019729 0.91736232]
 [0.93554576 0.47499904 0.5781882 ]
 [0.41176853 0.17215819 0.00149476]
 [0.75224412 0.55996666 0.49031755]
 [0.8315763  0.3990014  0.58779411]
 [0.62559893 0.7009128  0.61726245]
 [0.29840661 0.82313107 0.10956081]
 [0.9800065  0.43790639 0.34349066]
 [0.76585686 0.08173312 0.44109314]]
Values on the diagonal: [0.58347526 0.58019729 0.5781882 ]

# Create a matrix given values on the diagonal
# All non-diagonal values set to zeros
arr = np.diag([1,2,3,4,5])
print(arr)

[[1 0 0 0 0]
 [0 2 0 0 0]
 [0 0 3 0 0]
 [0 0 0 4 0]
 [0 0 0 0 5]]
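np.diag() also accepts an offset k, in the same spirit as np.eye(). The sketch below places values on the first diagonal above the main one and then extracts them again.

```python
import numpy as np

# Place the values on the diagonal one step above the main diagonal
upper = np.diag([1, 2, 3], k=1)
print(upper)
# [[0 1 0 0]
#  [0 0 2 0]
#  [0 0 0 3]
#  [0 0 0 0]]

# Extract the same diagonal back from the matrix
recovered = np.diag(upper, k=1)
print(recovered)  # [1 2 3]
```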

Advantages of NumPy¶

Why use arrays over lists, specifically for data analysis? Put crudely, the reasons are convenience and speed:

You can write vectorised code on NumPy arrays, but not on lists; such code is concise and convenient to read and write.
NumPy is much faster than standard Python approaches to numerical computation.

Vectorised code typically contains no explicit looping or indexing (all of this happens behind the scenes, in precompiled C code), which makes it much more concise.

Let's see an example of convenience now; an example of speed follows later.

Say you have two lists of numbers, and want to calculate the element-wise product. The standard python list way would need you to map a lambda function (or worse - write a for loop), whereas with NumPy, you simply multiply the arrays.

list_1 = [3, 6, 7, 5]
list_2 = [4, 5, 1, 7]

# the list way to do it: map a function to the two lists
product_list = list(map(lambda x, y: x*y, list_1, list_2))
print(product_list)

[12, 30, 7, 35]

Using NumPy array¶

# The numpy array way to do it: simply multiply the two arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)

array_3 = array_1*array_2
print(array_3)
print(type(array_3))

[12 30  7 35]
<class 'numpy.ndarray'>

As you can see, the NumPy way is clearly more concise.

Even simple mathematical operations on lists require an explicit loop or comprehension, unlike with arrays. For example, to calculate the square of every number in a list:

# Square a list
list_squared = [i**2 for i in list_1]

# Square a numpy array
array_squared = array_1**2

print(list_squared)
print(array_squared)

[9, 36, 49, 25]
[ 9 36 49 25]
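The same vectorised style extends to other operations. As a brief sketch using the same numbers as above, addition, scaling by a constant, comparison and summation all work on whole arrays without an explicit loop.

```python
import numpy as np

array_1 = np.array([3, 6, 7, 5])
array_2 = np.array([4, 5, 1, 7])

print(array_1 + array_2)  # [ 7 11  8 12]  element-wise sum
print(array_1 * 2)        # [ 6 12 14 10]  every element scaled by a constant
print(array_1 > 4)        # [False  True  True  True]  element-wise comparison
print(array_1.sum())      # 21  sum of all elements
```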

Compare Computation Times in NumPy and Standard Python Lists¶

We mentioned that the key advantages of NumPy are convenience and speed of computation.

You'll often work with extremely large datasets, so it is important to understand how much computation time (and memory) you can save by using NumPy instead of standard Python lists.

Let's compare the computation times of arrays and lists for a simple task of calculating the element-wise product of numbers.

list_1 = [i for i in range(10000000)]
list_2 = [j**2 for j in range(10000000)]

import time
# store start time, time after computation, and take the difference
t0 = time.time()
product_list = list(map(lambda x, y: x*y, list_1, list_2))
t1 = time.time()
list_time = t1 - t0 
print("Time Taken:",t1-t0)

Time Taken: 3.0225095748901367

Using NumPy array¶

array_1 = np.array(list_1)
array_2 = np.array(list_2)

t0 = time.time()
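# Note: for large indices the products exceed the int64 range, so NumPy wraps them
# around silently; the timing is still representative, but those values are not correct
array_3 = array_1*array_2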
t1 = time.time()
numpy_time = t1 - t0

print("Time Taken:",t1-t0)

Time Taken: 0.06000089645385742

print("The ratio of time taken is {}".format(list_time/numpy_time))

The ratio of time taken is 50.374406942645294

In this run, NumPy is roughly 50 times faster than lists; the exact ratio varies from machine to machine. This is with arrays of about ten million elements, but you may work with much larger arrays, with sizes in the order of billions. The absolute time saved then grows accordingly.
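A single time.time() measurement can be noisy. One possible refinement, sketched below, is to use the standard-library timeit module and take the best of several repeats. The array size n, the repeat counts and the use of a list comprehension are choices made for this sketch, not part of the original benchmark, and the printed times depend on the machine.

```python
import timeit
import numpy as np

n = 100_000  # smaller than the benchmark above so the sketch runs quickly
list_1 = list(range(n))
list_2 = list(range(n))
array_1 = np.arange(n, dtype=np.int64)
array_2 = np.arange(n, dtype=np.int64)

# Each measurement runs the statement 10 times; keep the fastest of 3 measurements
list_time = min(timeit.repeat(
    lambda: [x * y for x, y in zip(list_1, list_2)], number=10, repeat=3))
numpy_time = min(timeit.repeat(
    lambda: array_1 * array_2, number=10, repeat=3))

print("List time: ", list_time)
print("NumPy time:", numpy_time)
print("Ratio:     ", list_time / numpy_time)
```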

Some reasons for this difference in speed are:

NumPy's core routines are written in C, so the element-wise loop runs in compiled code behind the scenes rather than in the Python interpreter
NumPy arrays are more compact than lists, i.e. they take much less storage space
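The storage claim can be checked roughly with the sketch below. It assumes CPython, where a list stores pointers to separate integer objects, and it sums the size of the list and of each integer as an approximation; the exact list figure varies between Python versions.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Approximate list footprint: the list object itself plus every int object it points to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The array stores raw 8-byte integers in one contiguous buffer
array_bytes = arr.nbytes

print("List: ", list_bytes, "bytes (approximate)")
print("Array:", array_bytes, "bytes")  # 8000
```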

The following discussions demonstrate the differences in speeds of NumPy and standard python: