Introduction to NumPy (Part-II)

Sweta Barnwal
7 min read · Jan 6, 2021

NumPy is a Python library for working with arrays. It also provides functions for linear algebra, Fourier transforms, and matrices. That is what you will hear from most people. Although NumPy is an essential package for numerical computation, many people are still unsure how to use it well. This blog aims to clarify how you can make the most of NumPy, continuing from the previous post.

Today we will extend our discussion to:

  • Memory layout of ndarray
  • Views and copies
  • Vectorized operations
  • Universal functions
  • Broadcasting
  • Boolean mask
  • Dates and time in NumPy

Memory Layout of ndarray

As the official documentation puts it, array attributes reflect information that is intrinsic to the array itself. Generally, accessing an array through its attributes allows you to get and sometimes set intrinsic properties of the array without creating a new array. The exposed attributes are the core parts of an array and only some of them can be reset meaningfully without creating a new array. Some of these attributes are described below.

numpy.ndarray object has an interesting attribute, flags. The flags attribute holds information about the memory layout of the array.

>>> import numpy as np
>>> arr=np.array([1,2,3,4,5,6,7,8,9,10])
>>> arr
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> arr.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

The C_CONTIGUOUS field in the output indicates whether the array is stored as a single C-style contiguous block of memory. For 2D arrays this is also called row-major order: the elements of each row sit next to each other in memory, so when moving through memory the column (last) index changes fastest and the row index changes slowest. F_CONTIGUOUS is the Fortran-style, column-major counterpart. A one-dimensional array like the one above is trivially both, which is why both flags are True.
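To see the two flags disagree, a 2D array is needed. The sketch below is only an illustration (the names c_arr and f_arr are made up for this example): it stores the same values once in C order and once in Fortran order, then reads them back in the order they sit in memory.

```python
import numpy as np

# np.array uses C (row-major) order by default.
c_arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)
# np.asfortranarray holds the same values in column-major order.
f_arr = np.asfortranarray(c_arr)

print(c_arr.flags['C_CONTIGUOUS'], c_arr.flags['F_CONTIGUOUS'])  # True False
print(f_arr.flags['C_CONTIGUOUS'], f_arr.flags['F_CONTIGUOUS'])  # False True

# order='K' reads the elements in the order they occur in memory.
print(c_arr.ravel(order='K'))  # [1 2 3 4 5 6]
print(f_arr.ravel(order='K'))  # [1 4 2 5 3 6]
```

Both arrays compare equal element by element; only the memory layout differs.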

Array flags provide information about how the memory area used for the array is to be interpreted. There are 7 Boolean flags in use, only four of which can be changed by the user: WRITEBACKIFCOPY, UPDATEIFCOPY, WRITEABLE, and ALIGNED, via direct assignment to the attribute or dictionary entry, or by calling ndarray.setflags. Note that UPDATEIFCOPY is deprecated in favor of WRITEBACKIFCOPY, so newer NumPy releases may not list it.

>>> arr.setflags(write=0)
>>> arr.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : False
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
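What the WRITEABLE flag means in practice can be sketched as follows. This is a small illustration, not part of the session above; the variable error_message is introduced here only to capture the exception text.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arr.setflags(write=False)  # mark the array read-only

error_message = None
try:
    arr[0] = 99  # writing to a read-only array raises ValueError
except ValueError as exc:
    error_message = str(exc)
    print("Write rejected:", error_message)

# This array owns its data, so the flag can be switched back on.
arr.setflags(write=True)
arr[0] = 99
print(arr)  # [99  2  3  4  5]
```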

ndarray.shape returns the shape of your ndarray as a tuple.

>>> arr.shape
(10,)

ndarray.reshape is a method that returns the array's data in a new shape without changing the data itself. The original array keeps its shape.

>>> arr.reshape(2,5)
array([[ 1, 2, 3, 4, 5],
       [ 6, 7, 8, 9, 10]])

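You may have wondered why ndarray.shape sometimes returns (m,) and sometimes (m,1). The two are different shapes: (m,) is a one-dimensional array, while (m,1) is a two-dimensional array with a single column. The difference matters in matrix multiplication, where an explicit reshape is often needed to get the orientation you intend.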

>>> arr.reshape(10,)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> arr.reshape(10,1)
array([[ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

(m,) means that the array is indexed from 0 to m-1.

(m,1) means that the array is indexed by two indices, the first of which runs from 0 to m-1, and the second index is always 0.
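A short sketch of how the two shapes behave differently, using a small made-up vector (the names v, col, and row are illustrative only):

```python
import numpy as np

v = np.arange(1, 4)      # shape (3,): one index
col = v.reshape(3, 1)    # shape (3, 1): a column, two indices
row = v.reshape(1, 3)    # shape (1, 3): a row, two indices

print(v[2])              # 3
print(col[2, 0])         # 3

inner = v @ v            # 1-D @ 1-D gives the scalar dot product: 14
outer = col @ row        # (3, 1) @ (1, 3) gives a 3x3 matrix
print(inner, outer.shape)

# Transposing a 1-D array changes nothing; a column becomes a row.
print(v.T.shape, col.T.shape)  # (3,) (1, 3)
```

This is why an explicit reshape is needed when a true column or row vector is expected.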

ndarray.strides tells us how many bytes we have to skip in memory to move to the next position along each axis.

>>> x = np.array([1,2,3,4,5,6,7,8,9], dtype='int32')
>>> x.strides
(4,)
>>> x = np.array([1,2,3,4,5,6,7,8,9], dtype='float')
>>> x.strides
(8,)
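Strides become more informative with more than one axis. As a rough illustration with a 2x5 int32 array (4 bytes per element):

```python
import numpy as np

x = np.arange(10, dtype=np.int32).reshape(2, 5)

# Next row: skip 5 elements * 4 bytes = 20 bytes. Next column: 4 bytes.
print(x.strides)          # (20, 4)

# The transpose reuses the same memory and just swaps the strides.
print(x.T.strides)        # (4, 20)

# Taking every second column doubles the column stride.
print(x[:, ::2].strides)  # (20, 8)
```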

Many more attributes describe the memory layout of an array, such as ndarray.ndim, which returns the number of dimensions of an array.

View and Copy

Every operation on an array returns either a copy or a view of that array. A copy is a new array with its own data, whereas a view is a new array object that looks at the same data as the original.

>>> import numpy as np
>>> arr=np.array([1,2,3,4,5,6,7,8,9,10])
>>> arr
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) #ORIGINAL ARRAY

Plain assignment does not create a new array. Both names refer to the same object, so they have the same ID and the same shape, and changes made through one name are visible through the other.

>>> arr2=arr
>>> print("ID of original array:",id(arr))
ID of original array: 2084401527824
>>> print("ID of assigned array:",id(arr2))
ID of assigned array: 2084401527824

A view is also known as a shallow copy in NumPy. Just like window shopping, a view only looks at the data of the original array without owning it. The two array objects have different IDs, but they share the same data, so changes made through the view affect the original array.

>>> arr3=arr.view()
>>> arr3
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print("ID of original array:",id(arr))
ID of original array: 2084401527824
>>> print("ID of viewed array:",id(arr3))
ID of viewed array: 2084401527744
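The different IDs can be misleading, so here is a minimal sketch showing that a view still shares its data with the original (a fresh small array is used to keep the output short):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
view = arr.view()

view[0] = 100                       # write through the view
print(arr)                          # [100   2   3   4   5]
print(view.base is arr)             # True: the view borrows arr's data
print(np.shares_memory(arr, view))  # True
```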

A deep copy, created with copy(), generates a new array with its own data. Changes made to the new array do not affect the original array, which remains unchanged.

>>> arr4=arr.copy()
>>> arr4
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> print("ID of copied array:",id(arr4))
ID of copied array: 2084652207520
>>> arr4[4]=20
>>> arr4
array([ 1, 2, 3, 4, 20, 6, 7, 8, 9, 10])
>>> arr
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])#ORIGINAL ARRAY
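Views and copies also appear without calling view() or copy() explicitly. As a hedged illustration: basic slicing returns a view, while indexing with a list of integers (fancy indexing) returns a copy. The names sliced and picked are made up for this example.

```python
import numpy as np

arr = np.arange(1, 11)
sliced = arr[2:5]         # basic slicing: a view
picked = arr[[2, 3, 4]]   # integer-list (fancy) indexing: a copy

sliced[0] = -1            # reaches arr, because sliced is a view
picked[1] = -2            # stays local, because picked is a copy

print(arr)  # the third element is now -1; the -2 never reaches arr
print(np.shares_memory(arr, sliced), np.shares_memory(arr, picked))  # True False
```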

Vectorized operations

Vectorized operations follow the idea of Single Instruction, Multiple Data (SIMD): one operation is applied to many elements at once. An explicit Python for loop is slow because the interpreter handles every element separately; the algorithmic complexity is the same, but the per-element overhead is much higher. Vectorization moves the loop out of Python and into NumPy's compiled code, which can cut the running time substantially.

Let's look at an example.

First, import random to build two plain Python lists and time their element-wise product:

>>> import random
>>> a = [random.randint(1, 100) for _ in range(1000000)]
>>> b = [random.randint(1, 100) for _ in range(1000000)]
>>> %timeit res = [x * y for x, y in zip(a, b)]
63.7 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit is an IPython magic command that measures execution time; it is available in IPython and Jupyter sessions, not in the plain Python interpreter.

>>> import numpy as np 
>>> a = np.random.randint(1, 100, 1000000)
>>> b = np.random.randint(1, 100, 1000000)
>>> %timeit a * b
1.88 ms ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Notice that the execution time drops to 1.88 ms, roughly 3% of the time taken by the pure Python code. Vectorization both shortens the code and speeds up execution.

In general, vectorizing pays off most in large-scale calculations, where an inefficient loop can mean a long wait.
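The same comparison can be reproduced outside IPython with the standard timeit module. The sketch below uses a smaller array size (an arbitrary choice to keep it quick), and the timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

n = 100_000  # smaller than in the session above, just to run quickly
a_list = [random.randint(1, 100) for _ in range(n)]
b_list = [random.randint(1, 100) for _ in range(n)]
a_arr = np.array(a_list)
b_arr = np.array(b_list)

# Time 10 repetitions of each approach.
loop_time = timeit.timeit(
    lambda: [x * y for x, y in zip(a_list, b_list)], number=10
)
vec_time = timeit.timeit(lambda: a_arr * b_arr, number=10)
print(f"list comprehension: {loop_time:.4f} s, NumPy: {vec_time:.4f} s")

# Both approaches compute the same products.
same = np.array_equal(a_arr * b_arr, [x * y for x, y in zip(a_list, b_list)])
print(same)  # True
```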

Universal Functions

NumPy provides universal functions (ufuncs) to eliminate as many explicit loops as we can and optimize our code. In NumPy, universal functions are instances of the numpy.ufunc class. Many of the built-in functions are implemented in compiled C code, but ufunc instances can also be produced using the frompyfunc factory function.

NumPy ships a large number of mathematical ufuncs, which make code easier to write and faster to run.
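A few ufuncs in action, as a minimal sketch. The helper clip_to_five is a hypothetical function written only for this example; note that a ufunc built with frompyfunc still calls Python for every element and returns an object array, so it offers convenience rather than C speed.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])

print(isinstance(np.sqrt, np.ufunc))  # True
print(np.sqrt(x))                     # [1. 2. 3.]
print(np.add(x, 1))                   # [ 2.  5. 10.]
total = np.add.reduce(x)              # ufuncs have methods such as reduce
print(total)                          # 14.0

def clip_to_five(value):
    """Hypothetical helper: cap a single value at 5."""
    return min(value, 5)

# frompyfunc(func, number_of_inputs, number_of_outputs)
clip = np.frompyfunc(clip_to_five, 1, 1)
clipped = clip(x)
print(clipped.dtype)                  # object
```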

Broadcasting

Each universal function takes array inputs and produces array outputs by performing the core function element-wise on the input. Standard broadcasting rules are applied so that inputs not sharing exactly the same shapes can still be usefully operated on.

Broadcasting is used throughout NumPy to decide how to handle disparately shaped arrays; for example, all arithmetic operations (+, -, *, …) between ndarrays broadcast the arrays before an operation.
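The broadcasting rules can be illustrated with a small sketch. Shapes are compared from the last axis backwards, and two sizes are compatible when they are equal or one of them is 1. The array names below are made up for this example.

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
col = np.array([[100], [200]])        # shape (2, 1)

print(matrix + row)        # row is added to every row of matrix
print(matrix + col)        # col is added to every column of matrix
print((row + col).shape)   # (2, 3): both operands are stretched

# Shapes (2, 3) and (2,) do not line up on the last axis, so this fails.
broadcast_error = None
try:
    matrix + np.array([1, 2])
except ValueError as exc:
    broadcast_error = str(exc)
    print("Cannot broadcast:", broadcast_error)
```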

Boolean Mask

Masking, in Python and in data science, means selecting or modifying the items of a collection that meet some criterion. The criterion evaluates to true or false for each item, hence the Boolean part. Boolean masking is often the most concise and efficient way to pick out a sub-collection. In NumPy, a Boolean array used as a mask can select subsets of the data that plain indexing and slicing cannot express, such as all elements greater than a threshold.

>>> import numpy as np
>>> arr=np.array([1,2,3,4,5,6,7,8,9])
>>> arr
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> arr>8
array([False, False, False, False, False, False, False, False, True])
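The comparison above only builds the mask. A short sketch of what can then be done with it (the variable names are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

mask = arr > 8
print(arr[mask])                         # [9]

evens = arr[arr % 2 == 0]                # select by condition
print(evens)                             # [2 4 6 8]

# Combine conditions with & and |, with parentheses around each one.
between = arr[(arr > 2) & (arr < 6)]
print(between)                           # [3 4 5]

# True counts as 1, so counting matches is straightforward.
count = np.count_nonzero(arr > 5)
print(count)                             # 4

# A mask can also be used for assignment (on a copy, to keep arr intact).
capped = arr.copy()
capped[capped > 5] = 5
print(capped)                            # [1 2 3 4 5 5 5 5 5]
```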

Dates and Time in NumPy

The most basic way to create datetimes is from strings in ISO 8601 date or datetime format. The unit for internal storage is automatically selected from the form of the string and can be either a date unit or a time unit. The date units are years (‘Y’), months (‘M’), weeks (‘W’), and days (‘D’), while the time units are hours (‘h’), minutes (‘m’), seconds (‘s’), milliseconds (‘ms’), and some additional SI-prefix seconds-based units. The datetime64 data type also accepts the string “NAT”, in any combination of lowercase/uppercase letters, for a “Not A Time” value.

>>> import numpy as np
>>> yesterday = np.datetime64('today', 'D') - np.timedelta64(1, 'D')
>>> print("Yesterday:", yesterday)
Yesterday: 2021-01-05
>>> today = np.datetime64('today', 'D')
>>> print("Today:", today)
Today: 2021-01-06
>>> tomorrow = np.datetime64('today', 'D') + np.timedelta64(1, 'D')
>>> print("Tomorrow:", tomorrow)
Tomorrow: 2021-01-07
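The automatic unit selection from ISO 8601 strings can be sketched with fixed dates, so the output does not depend on when the code is run. The dates are arbitrary examples.

```python
import numpy as np

day = np.datetime64('2021-01-06')             # unit inferred as days
month = np.datetime64('2021-01')              # unit inferred as months
minute = np.datetime64('2021-01-06T09:30')    # unit inferred as minutes
print(day.dtype, month.dtype, minute.dtype)   # datetime64[D] datetime64[M] datetime64[m]

# Date arithmetic with timedelta64
print(day + np.timedelta64(1, 'D'))           # 2021-01-07
gap = np.datetime64('2021-02-01') - day
print(gap)                                    # 26 days

# A range of dates; the end date is excluded
week = np.arange('2021-01-04', '2021-01-11', dtype='datetime64[D]')
print(len(week))                              # 7

# "Not a Time" is the missing-value marker for dates
print(np.isnat(np.datetime64('NaT')))         # True
```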

That’s all in this blog. Thank you for spending your time reading it. :)
