Tuesday, March 19, 2024
Creating NumPy Array

2 – Create an Array

Create a NumPy Array

The learning objectives of this section are:

  • Understand the advantages of vectorised NumPy code over standard Python approaches
  • Create NumPy arrays
    • Convert lists and tuples to NumPy arrays
    • Create (initialise) arrays
  • Compare computation times in NumPy and standard Python lists

NumPy Basics

NumPy is a library written for scientific computing and data analysis. The name stands for Numerical Python.

The most basic object in NumPy is the ndarray, or simply an array, which is an n-dimensional, homogeneous array. By homogeneous, we mean that all the elements in a NumPy array have to be of the same data type, which is commonly numeric (float or integer).
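To illustrate homogeneity, here is a minimal sketch (the variable names are illustrative and not part of the original notebook) of how np.array() settles on one common data type when the input values are mixed:

```python
import numpy as np

# Mixing integers and a float: every element is stored as a float
mixed = np.array([1, 2, 3.5])
print(mixed, mixed.dtype)

# Mixing a boolean and integers: every element is stored as an integer
flags = np.array([True, 2, 3])
print(flags, flags.dtype)
```

The exact integer type chosen in the second case depends on the platform and NumPy version, so it is printed rather than assumed.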

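Create an Array from an Iterable

Suitable iterables include:

  • list
  • tuple
  • range iterator

Notice that not all iterables behave this way. Passing a set or a dict to np.array() does not produce an array of its elements; it produces a 0-dimensional array of type object that wraps the whole container.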
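A minimal sketch of what happens with a set and a dict, and one way around it (converting to a list first). The variable names are illustrative only:

```python
import numpy as np

values = {3, 1, 2}

# Passing the set directly wraps the whole set in a 0-dimensional object array
wrapped = np.array(values)
print(wrapped.shape, wrapped.dtype)

# Converting to a list first (sorted here, since sets are unordered) gives a numeric array
converted = np.array(sorted(values))
print(converted)

# For a dict, choose explicitly what to convert, for example its values
prices = {"a": 1.5, "b": 2.5}
price_array = np.array(list(prices.values()))
print(price_array)
```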

In [2]:
#np is simply an alias, you may use any other alias, though np is quite standard
import numpy as np

Create a 1D Array

In [3]:
# Creating a 1-D array using a list
arr = np.array([1,2,3,4,5])
print(arr)
[1 2 3 4 5]
In [4]:
print(type(arr))
<class 'numpy.ndarray'>
In [5]:
# Creating a 1-D array using a tuple
arr = np.array((1,2,3,4,5))
print(arr)
[1 2 3 4 5]
In [6]:
arr = np.array(range(10))
print(arr)
[0 1 2 3 4 5 6 7 8 9]

Create a 2D Array with Specified Data Type

In [7]:
arr = np.array([[1,2,3], [4,5,6]], dtype='int')
print(arr)
print('Data Type:',arr.dtype)
[[1 2 3]
 [4 5 6]]
Data Type: int32
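The string 'int' maps to the default integer type, which depends on the platform and NumPy version (int32 in the output above, int64 on many other setups). As a small sketch, assuming you want the same result everywhere, you can name the type explicitly and inspect the array's basic attributes:

```python
import numpy as np

# An explicit 64-bit integer type avoids platform-dependent defaults
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

print(arr.dtype)   # int64
print(arr.ndim)    # 2 (number of dimensions)
print(arr.shape)   # (2, 3): 2 rows, 3 columns
print(arr.size)    # 6 (total number of elements)
```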

Create a 3D Array

In [8]:
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(arr) 
[[[1 2 3]
  [4 5 6]]
 [[1 2 3]
  [4 5 6]]]

Create an array within a specified range

np.arange() can be used in place of np.array(range())

In [9]:
# np.arange(start, stop, step)
arr = np.arange(0, 20, 2)  
print(arr)
[ 0  2  4  6  8 10 12 14 16 18]
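Like range(), np.arange() can also be called with only a stop value or with a negative step. A short sketch with illustrative variable names:

```python
import numpy as np

# Only a stop value: starts at 0 with a step of 1
first_five = np.arange(5)        # 0, 1, 2, 3, 4

# Start, stop and step: the stop value itself is excluded
every_third = np.arange(2, 10, 3)  # 2, 5, 8

# A negative step counts downwards
countdown = np.arange(10, 0, -2)   # 10, 8, 6, 4, 2

print(first_five, every_third, countdown)
```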

The other common way to create an array is to initialise it with placeholder or generated values. You do this when you know the size of the array beforehand.

The following ways are commonly used:

  • np.linspace(): create an array of fixed length with evenly spaced values
  • np.random.rand(): create an array of random numbers in the range [0,1)
  • np.ones(): create an array of 1s
  • np.zeros(): create an array of 0s
  • np.random.random(): create an array of random numbers in the range [0,1), with the shape given as a single tuple
  • np.arange(): create an array with increments of a fixed step size

Create an array of evenly spaced numbers within a specified range

np.linspace(start, stop, num=50, endpoint=True, retstep=False) is most often used with these 5 parameters:

  • start: start number (inclusive)
  • stop: end number (inclusive unless endpoint is set to False)
  • num: number of elements contained in the array (50 by default)
  • endpoint: boolean value representing whether the stop number is inclusive or not
  • retstep: boolean value representing whether to also return the step size
In [10]:
arr, step_size = np.linspace(0, 5, 8, endpoint=False, retstep=True)
print(arr)
print('The step size is ' + str(step_size))
[0.    0.625 1.25  1.875 2.5   3.125 3.75  4.375]
The step size is 0.625
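The effect of endpoint on the step size can be sketched as follows. With the stop value included, the step is (stop - start) / (num - 1); with it excluded, the step is (stop - start) / num. The variable names here are illustrative:

```python
import numpy as np

start, stop, num = 0, 1, 5

# Stop value included (the default)
closed, closed_step = np.linspace(start, stop, num, retstep=True)

# Stop value excluded
half_open, open_step = np.linspace(start, stop, num, endpoint=False, retstep=True)

print(closed, closed_step)      # step is (1 - 0) / (5 - 1) = 0.25
print(half_open, open_step)     # step is (1 - 0) / 5 = 0.2
```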

Create an array of random values of given shape

np.random.rand() method returns values in the range [0,1)

In [11]:
np.random.rand()
Out[11]:
0.6544221693621494
In [12]:
np.random.rand(5)
Out[12]:
array([0.01068095, 0.44318069, 0.87650624, 0.80425891, 0.67379408])
In [13]:
arr = np.random.rand(3, 3)
print(arr)
[[0.47263949 0.83856703 0.44455054]
 [0.32544708 0.40413686 0.95260151]
 [0.33851787 0.53030022 0.52063924]]
In [14]:
np.random.rand(3,3)
Out[14]:
array([[0.40314645, 0.01681125, 0.37738118],
       [0.85356548, 0.50143152, 0.37164291],
       [0.66888267, 0.49091088, 0.11092338]])
In [15]:
# Create a 4 x 4 random array of integers ranging from 0 to 99 (the upper bound 100 is excluded)
np.random.randint(0, 100, (4,4))
Out[15]:
array([[75, 47, 54, 45],
       [14, 83, 72,  6],
       [57, 97, 32, 22],
       [69, 75, 84, 12]])
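The random values above change on every run. If you need reproducible results, one option (not used in the original notebook) is NumPy's Generator interface created with np.random.default_rng() and a fixed seed. A minimal sketch, where the seed 42 is an arbitrary choice:

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

floats_a = rng_a.random((3, 3))   # uniform values in [0, 1)
floats_b = rng_b.random((3, 3))
print(np.array_equal(floats_a, floats_b))

# Random integers from 0 to 9 (the upper bound 10 is excluded)
ints = rng_a.integers(0, 10, size=(4, 4))
print(ints)
```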

Create an array of zeros of given shape

  • np.zeros(): create array of all zeros in given shape
  • np.zeros_like(): create array of all zeros with the same shape and data type as the given input array
In [16]:
zeros = np.zeros((2,3))
print(zeros)
[[0. 0. 0.]
 [0. 0. 0.]]

np.zeros_like()

In [17]:
arr = np.array([[1,2], [3,4],[5,6]])
arr
Out[17]:
array([[1, 2],
       [3, 4],
       [5, 6]])
In [18]:
zeros = np.zeros_like(arr)
print(zeros)
print('Data Type:',zeros.dtype)
[[0 0]
 [0 0]
 [0 0]]
Data Type: int32
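Both functions accept a dtype argument, so the floating-point default of np.zeros() and the inherited type of np.zeros_like() can be overridden. A short sketch with illustrative names:

```python
import numpy as np

# np.zeros() defaults to float64; request integers explicitly
int_zeros = np.zeros((2, 3), dtype=np.int64)

# np.zeros_like() copies the shape, but the data type can be overridden
template = np.array([[1, 2], [3, 4], [5, 6]])
float_zeros = np.zeros_like(template, dtype=np.float64)

print(int_zeros.dtype, float_zeros.dtype, float_zeros.shape)
```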

Create an array of ones of given shape

  • np.ones(): create array of all ones in given shape
  • np.ones_like(): create array of all ones with the same shape and data type as the given input array
In [19]:
ones = np.ones((3,2))
print(ones)
[[1. 1.]
 [1. 1.]
 [1. 1.]]
In [20]:
arr = [[1,2,3], [4,5,6]]
ones = np.ones_like(arr)
print(ones)
print('Data Type: ' + str(ones.dtype))
[[1 1 1]
 [1 1 1]]
Data Type: int32

Create an empty array of given shape

  • np.empty(): create an array of the given shape without initialising its values
  • np.empty_like(): create an uninitialised array with the same shape and data type as the given input array

Notice that the initial values are not necessarily set to zeroes.

They are whatever values happened to be in the allocated memory, so they should be treated as garbage until you overwrite them.

In [21]:
empty = np.empty((5,5))
print(empty)
[[1.64872267e-315 2.11259150e-316 0.00000000e+000 0.00000000e+000
  1.33360289e+241]
 [1.16095484e-028 8.76381537e+252 1.28153217e-152 7.35167805e+223
  7.56455764e-096]
 [1.31072267e+179 1.16138473e-012 5.39223091e+241 6.28814250e+097
  1.75300433e+243]
 [4.90910702e-109 8.29655075e-114 5.30352517e+180 2.45129535e+198
  8.76380613e+252]
 [5.49548265e+247 4.50603886e-144 4.82412328e+228 1.04718130e-142
  1.03769118e-314]]
In [22]:
arr = np.array([[1,2,3], [4,5,6]], dtype=np.int64)
empty = np.empty_like(arr)
print(empty)
print('Data Type: ' + str(empty.dtype))
[[4607182418800017408 4607182418800017408 4607182418800017408]
 [4607182418800017408 4607182418800017408 4607182418800017408]]
Data Type: int64
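Because the contents are arbitrary, np.empty() is typically used only when every element will be overwritten before it is read. A minimal sketch of that pattern (the loop is for illustration only):

```python
import numpy as np

n = 5

# Allocate without initialising, then assign every position
squares = np.empty(n, dtype=np.int64)
for i in range(n):
    squares[i] = i ** 2

print(squares)
```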

Create an array of constant values of given shape

  • np.full(): create array of constant values in given shape
  • np.full_like(): create array of constant values with the same shape and data type as the given input array
In [23]:
full = np.full((4,4), 5)
print(full)
[[5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]]
In [24]:
arr = np.array([[1,2], [3,4]], dtype=np.float64)
full = np.full_like(arr, 5)
print(full)
print('Data Type: ' + str(full.dtype))
[[5. 5.]
 [5. 5.]]
Data Type: float64
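One thing to watch for: np.full_like() keeps the data type of the input array, so a fractional fill value is cast to an integer when the template holds integers. A small sketch with illustrative names:

```python
import numpy as np

int_template = np.array([[1, 2], [3, 4]])

# The fill value 2.7 is cast to the template's integer type and becomes 2
truncated = np.full_like(int_template, 2.7)
print(truncated)

# Passing dtype keeps the fractional value
kept = np.full_like(int_template, 2.7, dtype=np.float64)
print(kept)

# np.full() infers the data type from the fill value itself
halves = np.full((2, 2), 0.5)
print(halves.dtype)
```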

Create an array in a repetitive manner

  • np.repeat(a, repeats, axis=None): repeat each element a given number of times
    • a: input array (or array-like, such as a list)
    • repeats: number of repetitions for each element
    • axis: the axis to repeat along; the default, None, flattens the input array first and then repeats
  • np.tile(A, reps): repeat the whole array a given number of times
    • A: input array (or array-like)
    • reps: number of repetitions; it can be a tuple giving the number of repetitions along each axis
In [25]:
# No axis specified, then flatten the input array first and repeat
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3)) 
[0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5]
In [26]:
# An example of repeating along axis 0 (each row is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=0)) 
[[0 1 2]
 [0 1 2]
 [0 1 2]
 [3 4 5]
 [3 4 5]
 [3 4 5]]
In [27]:
# An example of repeating along axis 1 (each element is repeated within its row)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=1))    
[[0 0 0 1 1 1 2 2 2]
 [3 3 3 4 4 4 5 5 5]]
In [28]:
# Repeat the whole array by a specified number of times
arr = [0, 1, 2]
print(np.tile(arr, 3))
[0 1 2 0 1 2 0 1 2]
In [29]:
# Repeat along specified axes
print(np.tile(arr, (2,2)))
[[0 1 2 0 1 2]
 [0 1 2 0 1 2]]
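Two further variations, sketched with illustrative names: np.repeat() accepts a separate count for each element, and the reps tuple of np.tile() controls each axis independently.

```python
import numpy as np

# A different number of repetitions for each element
per_element = np.repeat([1, 2, 3], [1, 2, 3])   # 1, 2, 2, 3, 3, 3

# Repeat twice along axis 0 and once along axis 1: two identical rows
stacked = np.tile([0, 1, 2], (2, 1))

# Repeat once along axis 0 and twice along axis 1: one longer row
widened = np.tile([0, 1, 2], (1, 2))

print(per_element)
print(stacked)
print(widened)
```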

Create an identity matrix of given size

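  • np.eye(N, k=0): create an N x N matrix with ones on one diagonal and zeros elsewhere
    • N: the size of the matrix
    • k: the diagonal offset; the default 0 is the main diagonal and gives an identity matrix
  • np.identity(n): always returns the n x n identity matrix; unlike np.eye(), it has no diagonal offset parameter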
In [30]:
identity_matrix = np.eye(5)
print(identity_matrix)
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
In [31]:
# An example of diagonal offset
identity_matrix = np.eye(5, k=-1)
print(identity_matrix)
[[0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]]
In [32]:
identity_matrix = np.identity(5)
print(identity_matrix)
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
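np.eye() also takes an optional second argument M for the number of columns, and a dtype. A brief sketch with illustrative names:

```python
import numpy as np

# 2 rows and 3 columns, ones on the main diagonal
rectangular = np.eye(2, 3)

# Ones on the diagonal just above the main one, stored as integers
upper = np.eye(3, k=1, dtype=np.int64)

# For the plain square case, both functions agree
same = np.array_equal(np.eye(4), np.identity(4))

print(rectangular)
print(upper)
print(same)
```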

Create an array with given values on the diagonal

In [33]:
arr = np.random.rand(5,5)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))
[[0.26266218 0.08789392 0.41876102 0.96403012 0.85398597]
 [0.16299527 0.80539869 0.73643274 0.64080738 0.64197196]
 [0.69999506 0.17747126 0.56864491 0.11785487 0.28684403]
 [0.55266274 0.05616572 0.6226652  0.2449638  0.15015444]
 [0.94221876 0.07278349 0.6031527  0.2678986  0.84390092]]
Values on the diagonal: [0.26266218 0.80539869 0.56864491 0.2449638  0.84390092]
In [34]:
# The input does not have to be a square matrix
arr = np.random.rand(10,3)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))
[[0.58347526 0.06371201 0.55105406]
 [0.54159295 0.58019729 0.91736232]
 [0.93554576 0.47499904 0.5781882 ]
 [0.41176853 0.17215819 0.00149476]
 [0.75224412 0.55996666 0.49031755]
 [0.8315763  0.3990014  0.58779411]
 [0.62559893 0.7009128  0.61726245]
 [0.29840661 0.82313107 0.10956081]
 [0.9800065  0.43790639 0.34349066]
 [0.76585686 0.08173312 0.44109314]]
Values on the diagonal: [0.58347526 0.58019729 0.5781882 ]
In [35]:
# Create a matrix given values on the diagonal
# All non-diagonal values set to zeros
arr = np.diag([1,2,3,4,5])
print(arr)
[[1 0 0 0 0]
 [0 2 0 0 0]
 [0 0 3 0 0]
 [0 0 0 4 0]
 [0 0 0 0 5]]
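np.diag() also accepts a diagonal offset k, both when building a matrix and when extracting values. A short sketch with illustrative names:

```python
import numpy as np

# Build a matrix with the given values one diagonal above the main one
shifted = np.diag([1, 2], k=1)
print(shifted)

matrix = np.array([[0, 1, 2],
                   [3, 4, 5],
                   [6, 7, 8]])

# Extract the diagonals just above and just below the main diagonal
above = np.diag(matrix, k=1)    # 1, 5
below = np.diag(matrix, k=-1)   # 3, 7
print(above, below)
```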

Advantages of NumPy

What is the use of arrays over lists, specifically for data analysis? Put crudely, it is convenience and speed:

  1. You can write vectorised code on NumPy arrays, but not on lists. Vectorised code is concise and convenient to read and write.
  2. NumPy is much faster than the standard Python ways of doing computations.

Vectorised code typically does not contain explicit looping and indexing (all of this happens behind the scenes, in precompiled C code), and thus it is much more concise.

Let’s see an example of convenience first; we’ll see one for speed later.

Say you have two lists of numbers and want to calculate the element-wise product. The standard Python list way requires you to map a lambda function (or, worse, write a for loop), whereas with NumPy you simply multiply the arrays.

In [36]:
list_1 = [3, 6, 7, 5]
list_2 = [4, 5, 1, 7]
# the list way to do it: map a function to the two lists
product_list = list(map(lambda x, y: x*y, list_1, list_2))
print(product_list)
[12, 30, 7, 35]

Using NumPy arrays

In [37]:
# The numpy array way to do it: simply multiply the two arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)
array_3 = array_1*array_2
print(array_3)
print(type(array_3))
[12 30  7 35]
<class 'numpy.ndarray'>

As you can see, the NumPy way is clearly more concise.

Even simple mathematical operations on lists require a loop or a list comprehension, unlike with arrays. For example, to calculate the square of every number in a list:

In [38]:
# Square a list
list_squared = [i**2 for i in list_1]
# Square a numpy array
array_squared = array_1**2
print(list_squared)
print(array_squared)
[9, 36, 49, 25]
[ 9 36 49 25]
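The same idea extends to other operations. The following sketch (with made-up prices and quantities) combines element-wise arithmetic, a scalar applied to every element, a mathematical function and an aggregate, all without an explicit loop:

```python
import numpy as np

prices = np.array([3.0, 6.0, 7.0, 5.0])
quantities = np.array([4, 5, 1, 7])

totals = prices * quantities      # element-wise product
discounted = totals * 0.9         # the scalar is applied to every element
roots = np.sqrt(prices)           # function applied element-wise
grand_total = totals.sum()        # aggregate over the whole array

print(totals)
print(discounted)
print(roots)
print(grand_total)
```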

Compare Computation Times in NumPy and Standard Python Lists

We mentioned that the key advantages of NumPy are convenience and speed of computation.

You’ll often work with extremely large datasets, so it is important to understand how much computation time (and memory) you can save by using NumPy instead of standard Python lists.

Let’s compare the computation times of arrays and lists for the simple task of calculating the element-wise product of numbers.

In [39]:
import time
list_1 = list(range(10000000))
list_2 = [j**2 for j in range(10000000)]
# store start time, time after computation, and take the difference
# (perf_counter is a high-resolution clock intended for measuring durations)
t0 = time.perf_counter()
product_list = list(map(lambda x, y: x*y, list_1, list_2))
t1 = time.perf_counter()
list_time = t1 - t0
print("Time Taken:", list_time)
Time Taken: 3.0225095748901367

Using NumPy arrays

In [40]:
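# convert the lists before starting the clock, so only the multiplication is timed
array_1 = np.array(list_1)
array_2 = np.array(list_2)
t0 = time.perf_counter()
array_3 = array_1*array_2
t1 = time.perf_counter()
numpy_time = t1 - t0
print("Time Taken:", numpy_time)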
Time Taken: 0.06000089645385742
In [41]:
print("The ratio of time taken is {}".format(list_time/numpy_time))
The ratio of time taken is 50.374406942645294

In this case, NumPy is roughly 50 times faster than lists, which is more than an order of magnitude. This is with arrays of around ten million elements, but you may work on much larger arrays, with sizes in the order of billions, where the absolute time saved is larger still.
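A single time measurement can be noisy, so a common alternative is the standard-library timeit module, which repeats the measurement. The sketch below is one possible setup, not the original benchmark: it uses a smaller size so that it runs quickly and so that every product fits in a 64-bit integer. With ten million elements the largest products (around 10^21) exceed the 64-bit range, and NumPy integer arithmetic wraps around silently, whereas Python integers do not overflow. The function names are illustrative.

```python
import timeit
import numpy as np

n = 100_000  # small enough that every product fits in int64
list_1 = list(range(n))
list_2 = [j ** 2 for j in range(n)]
array_1 = np.array(list_1, dtype=np.int64)
array_2 = np.array(list_2, dtype=np.int64)

def list_product():
    # Pure Python: loop over both lists in step
    return [x * y for x, y in zip(list_1, list_2)]

def numpy_product():
    # NumPy: one vectorised multiplication
    return array_1 * array_2

# Run each function several times and keep the best time
repeats = 5
list_time = min(timeit.repeat(list_product, number=1, repeat=repeats))
numpy_time = min(timeit.repeat(numpy_product, number=1, repeat=repeats))
print(f"list: {list_time:.6f} s, numpy: {numpy_time:.6f} s")

# Both approaches give the same values when nothing overflows
results_match = numpy_product().tolist() == list_product()
print(results_match)
```

The measured times depend on the machine and the NumPy version, so no particular ratio should be expected.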

Some reasons for this difference in speed are:

  • NumPy's core routines are written in C, so the loops run as compiled code behind the scenes
  • NumPy arrays are more compact than lists, i.e. they take much less storage space than lists
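The difference in storage can be estimated roughly as follows. This sketch assumes CPython, where a list stores pointers to separate integer objects, and uses sys.getsizeof() for an approximate count; exact figures vary between Python versions.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# Rough estimate for the list: its pointer table plus each integer object it refers to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The array stores the raw 8-byte values contiguously
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
```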

The following discussions explain the differences in speed between NumPy and standard Python:

  1. https://stackoverflow.com/questions/8385602/why-are-numpy-arrays-so-fast
  2. https://stackoverflow.com/questions/993984/why-numpy-instead-of-python-lists
