Python Prerequisites for Data Science Part I : Python Data Structures
This article is Part I of the Python Prerequisites series, which covers what you should know before starting your Data Science journey. Here I briefly cover the Python data structures most commonly used in Data Science and how to perform common operations on them. I recommend that readers have some familiarity with the syntax and structure of the Python language before starting. First, let's take a look at some common datatypes used in Python:
- Integer : Represented by the type ‘int’
- Float : Represented by the type ‘float’
- String : Represented by the type ‘str’
- Boolean : Represented by the type ‘bool’
- Null : Represented by the value ‘None’, whose type is ‘NoneType’
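As a quick check of the list above, the built-in type() function reports the type of any value. The sample values and the names `samples` and `type_names` below are made up for illustration only.

```python
# Sample values, one per datatype listed above (values are illustrative)
samples = [42, 3.14, "Hello", True, None]

# type(value).__name__ gives the name of the value's type as a string
type_names = [type(value).__name__ for value in samples]
print(type_names)  # ['int', 'float', 'str', 'bool', 'NoneType']
```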
To work with values of these datatypes, Python provides a variety of objects that can store them (some, such as strings, hold only one kind of value, while others support heterogeneity). These data structures fall into two categories:
Mutable Data Structures:
Mutable Data Structures are those whose contents can be altered after they have been created. Commonly used mutable data structures include Lists, Dictionaries and Sets.
Immutable Data Structures:
Immutable Data Structures are those whose contents cannot be altered after they have been created. Commonly used immutable data structures include Strings, Tuples and Frozensets.
One thing to note before moving on: since these data structures are Python objects, they can also store other data structures within themselves. Like other Python objects, they also come with their own built-in methods.
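The difference between the two categories, and the point about nesting, can be sketched in a few lines. The variable names (`numbers`, `point`, `record`) are hypothetical and chosen only for this example.

```python
# A list is mutable: item assignment changes it in place
numbers = [1, 2, 3]
numbers[0] = 10

# A tuple is immutable: item assignment raises TypeError
point = (1, 2, 3)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# Data structures can be nested: a dictionary holding a list and a tuple
record = {"scores": [90, 85], "shape": (3, 3)}
record["scores"].append(70)  # the inner list is still mutable

print(numbers)        # [10, 2, 3]
print(tuple_changed)  # False
print(record)         # {'scores': [90, 85, 70], 'shape': (3, 3)}
```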
Lists:
Lists are mutable data structures that can contain heterogenous data values. A list is written with square brackets [], with its values separated by commas. List indices start at 0, so the first element of a list is at index 0, not index 1.
Initialization and Definition:
# First Method to Initialize List
l = []
# Second Method to Initialize List
l = list()
# Creating a list of heterogenous datatypes
l = [3, 4.5, 5.67, "Hello", None, True, 0.45, 'False']
Indexing and Slicing:
List indexing and slicing follow these general forms:
- l[i] : Gives the element at index i
- l[-i] : Gives the ith element counting from the end of the list
- l[a:b] : Gives a sublist from index a up to index b-1
- l[a:] : Gives a sublist from index a (inclusive) to the end of the list
- l[:b] : Gives a sublist from the start of the list up to index b (exclusive)
print(l[0]) #Gives output 3
print(l[-1]) #Gives output 'False'
print(l[2:7]) #Gives output [5.67, "Hello", None, True, 0.45]
print(l[4:]) #Gives output [None, True, 0.45, 'False']
print(l[:5]) #Gives output [3, 4.5, 5.67, "Hello", None]
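The indexing rules above imply a few relationships that are easy to verify. The sketch below reuses the same list; the helper names `n`, `a` and `b` are assumptions made for this example.

```python
l = [3, 4.5, 5.67, "Hello", None, True, 0.45, 'False']
n = len(l)

# l[-i] refers to the same element as l[n - i]
print(l[-1] == l[n - 1])  # True
print(l[-3] == l[n - 3])  # True

# l[a:b] holds b - a elements (for 0 <= a <= b <= n)
a, b = 2, 7
print(len(l[a:b]))  # 5

# Splitting at any index b and joining the two halves rebuilds the list
print(l[:b] + l[b:] == l)  # True
```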
Object Assignment in Python:
Before we move on, there is one thing to note about object assignment in Python: a variable is a name that refers to an object stored in memory. If we assign x = [2, 4, 5, 7] and then y = x, both x and y refer to the same list object, so modifying the list through x also changes what y shows, and vice versa. This only matters for mutable objects: with x = 2 and y = x, reassigning x simply makes x refer to a different object and leaves y unchanged.
#Assigning two different variables with same list
x = y = [2, 4, 5, 7]
#Changing an element from the list variable x
x[2] = 8 #Changes the 3rd element of the list to 8
#Viewing the contents of list variable y
print(y) #Gives output: [2, 4, 8, 7]
To solve this problem, we can make use of two different strategies:
#First Method: (Using slicing to create a new list)
y = x[:] #Copies all content of x and creates a new list
#Second Method: (Using built-in copy() method for objects)
y = x.copy() #Creates a list by copying the contents of x
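A short sketch of the difference between sharing and copying a list. One caveat worth knowing: both x[:] and x.copy() make shallow copies, so nested lists are still shared; the standard library's copy.deepcopy() is one way to copy those as well. The names `alias`, `clone`, `nested`, `shallow` and `deep` are illustrative.

```python
import copy

x = [2, 4, 5, 7]
alias = x          # same object, no copy is made
clone = x.copy()   # new list with the same elements

x[2] = 8
print(alias)       # [2, 4, 8, 7]
print(clone)       # [2, 4, 5, 7]
print(alias is x)  # True
print(clone is x)  # False

# Both x[:] and x.copy() are shallow: nested lists are still shared
nested = [[1, 2], [3, 4]]
shallow = nested.copy()
deep = copy.deepcopy(nested)

nested[0][0] = 99
print(shallow[0][0])  # 99
print(deep[0][0])     # 1
```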
Built-In Methods:
For clarity, I will demonstrate some basic built-in functions and methods that work with lists. I recommend checking the Python Documentation for details on these and the others available.
#Creating a list of values to apply operations
l = [2.5, 7.8, 12.4, 0.45, 12.0, 5.67, 0.45]
print(min(l)) #Gives output 0.45
print(max(l)) #Gives output 12.4
del l[2] #Removes the 3rd element (12.4); del is a statement and returns nothing
print(l) #Gives output [2.5, 7.8, 0.45, 12.0, 5.67, 0.45]
print(l.index(7.8)) #Gives output 1
print(l.count(0.45)) #Gives output 2
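A few more list methods that tend to come up often are sketched below. This is a small selection rather than a complete list, and the values are made up for the example.

```python
l = [2.5, 7.8, 12.4]

l.append(0.45)          # add one element at the end
l.extend([12.0, 5.67])  # add several elements at the end
l.insert(0, 1.0)        # insert 1.0 at index 0
print(l)     # [1.0, 2.5, 7.8, 12.4, 0.45, 12.0, 5.67]

l.remove(12.4)          # remove the first occurrence of 12.4
last = l.pop()          # remove and return the last element
print(last)  # 5.67

l.sort()                # sort in place, in ascending order
print(l)     # [0.45, 1.0, 2.5, 7.8, 12.0]
```

Note that these methods modify the list in place, which is possible only because lists are mutable.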
Strings:
Strings are another data structure in Python. A string is an immutable sequence of characters: like lists, strings support indexing and slicing of individual characters, but they cannot be modified once they have been created.
Initialization and Definition:
#Creating a string (First Method)
s = "Hello"
#Creating a string (Second Method)
s = 'Hello'
Trying out String Manipulation:
As mentioned earlier, strings are immutable, i.e. they cannot be modified in place once they have been created. Let's check this property:
#Trying to change the 3rd character of the string to 'a'
s[2] = 'a' #Raises TypeError: 'str' object does not support item assignment
Built-In Methods:
Just like Lists and other objects, Strings have their own built-in methods. For details and other possible methods available, check out Python Documentation.
#Creating a string to check different built-in methods
s = "Hello World*"
print(s.lower()) #Outputs "hello world*"
print(s.upper()) #Outputs "HELLO WORLD*"
print(s.capitalize()) #Outputs "Hello world*"
print(s.split()) #Outputs ["Hello", "World*"]
print(s.strip("*")) #Outputs "Hello World"
print(s.replace("*","")) #Outputs "Hello World"
One may ask how these methods appear to modify a string when strings are immutable. The answer is that they take the original string as input and return a new string as output, leaving the original unchanged.
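This behaviour is easy to confirm: after calling a method, the original string still holds its old value. The name `cleaned` is an assumption made for this sketch.

```python
s = "Hello World*"
cleaned = s.replace("*", "")

print(cleaned)  # Hello World
print(s)        # Hello World*  (the original is unchanged)

# To "modify" a string, rebind the name to the newly returned string
s = s.replace("*", "").upper()
print(s)        # HELLO WORLD
```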
Tuples:
Tuples are immutable data structures that can contain multiple values, just like lists. The differences are that tuples are written with round brackets () and cannot be modified after creation.
Initialization and Definition:
#First Method to Initialize
t = ()
#Second Method to Initialize
t = tuple()
#Creating a tuple
t = (3, 4.5, "Hello")
Unpacking Values:
Tuples support an operation known as unpacking, which assigns the values of a tuple, in order, to the same number of variables. Unpacking also works with lists and other iterables, but it is most commonly seen with tuples.
#Assigning values of tuples to three different variables
x, y, z = t
print(x) #Gives output 3
print(y) #Gives output 4.5
print(z) #Gives output "Hello"
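Unpacking has a few variations that are worth a quick look. The sketch below shows starred unpacking, swapping two variables, and what happens when the counts do not match; the names `first`, `rest`, `a`, `b`, `p`, `q` and `mismatch` are illustrative.

```python
t = (3, 4.5, "Hello")

# Starred unpacking collects the leftover values into a list
first, *rest = t
print(first)  # 3
print(rest)   # [4.5, 'Hello']

# Unpacking also gives a compact way to swap two variables
a, b = 1, 2
a, b = b, a
print(a, b)   # 2 1

# The number of names must match the number of values,
# otherwise Python raises a ValueError
try:
    p, q = t
    mismatch = False
except ValueError:
    mismatch = True
print(mismatch)  # True
```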
Indexing and Slicing:
Indexing and Slicing in tuples is the same as in lists, as indicated here:
print(t[0]) #Gives output 3
print(t[:2]) #Gives output (3, 4.5)
Sets:
Sets are mutable data structures that store unordered, unique, heterogenous values, i.e. every value can occur only once. Sets use the {} notation, but an empty set cannot be created with {} (unlike [] for lists and () for tuples), because {} creates an empty dictionary; use set() instead.
Initialization and Definition:
#Initializing a set
s = set()
#Creating a set of values
s = {1, 2, 3, 3, 5, 6, 7, 8, 8, 8, 9}
print(s) #Gives output {1,2,3,5,6,7,8,9}
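Two practical consequences are sketched below: empty braces give a dictionary rather than a set, and converting a list to a set is a simple way to drop duplicates. Since sets are unordered, the example sorts the result before printing. All variable names are made up for this example.

```python
empty_braces = {}
empty_set = set()
print(type(empty_braces).__name__)  # dict
print(type(empty_set).__name__)     # set

# Removing duplicates from a list by converting it to a set
values = [3, 1, 3, 2, 1]
unique = set(values)
print(len(unique))     # 3
print(sorted(unique))  # [1, 2, 3]

# Sets are mutable: elements can be added and removed
unique.add(4)
unique.discard(1)
print(sorted(unique))  # [2, 3, 4]
```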
Set Operations:
Set Operations in Python are the same as in mathematical sets, as indicated below:
#Creating two sets to perform set operations
A = {1,2,3,4,5}
B = {3,4,5,6,7}
print(A.union(B)) #Gives output {1,2,3,4,5,6,7}
print(A.intersection(B)) #Gives output {3,4,5}
print(A.difference(B)) #Gives output {1,2}
print(A.symmetric_difference(B)) #Gives output {1,2,6,7}
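Each of these methods also has an operator form, which is often shorter to write. The sketch below checks that the two forms agree and adds a few membership tests.

```python
A = {1, 2, 3, 4, 5}
B = {3, 4, 5, 6, 7}

# Operator forms of the four set operations
print((A | B) == A.union(B))                 # True
print((A & B) == A.intersection(B))          # True
print((A - B) == A.difference(B))            # True
print((A ^ B) == A.symmetric_difference(B))  # True

# Membership, subset and disjointness tests
print(3 in A)                # True
print({1, 2}.issubset(A))    # True
print(A.isdisjoint({8, 9}))  # True
```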
Dictionaries:
Dictionaries are mutable data structures that store key-value pairs. Each key must be an immutable (more precisely, hashable) object and is used to look up its associated value, whereas the value itself can be any object and can be reassigned. A dictionary is written using the {} notation (just like sets), but each entry has the form key:value.
#Initializing a dictionary (First Method)
d = {}
#Initializing a dictionary (Second Method)
d = dict()
#Creating a dictionary
d = {'a':2, 'b':3, 'c':4}
Key Immutability:
As mentioned, the keys of a dictionary must be immutable, so a key can be an immutable object such as a string or a tuple, but not a mutable object such as a list or a set.
#Creating a dictionary with string and tuple as keys
d = {"gridsize":(3,3), (2,3):7} #Works perfectly fine
#Creating a dictionary with list as key
d = {"gridsize":(3,3), [2,3]:7} #Gives TypeError: unhashable type: 'list'
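The precise requirement behind the TypeError above is that a key must be hashable, which can be tested with the built-in hash() function. The helper `is_hashable` below is a hypothetical function written for this sketch, not part of Python.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key (hypothetical helper)."""
    try:
        hash(obj)
        return True
    except TypeError:
        return False

print(is_hashable("gridsize"))         # True
print(is_hashable((2, 3)))             # True
print(is_hashable(frozenset({2, 3})))  # True
print(is_hashable([2, 3]))             # False
print(is_hashable({2, 3}))             # False
print(is_hashable((2, [3])))           # False: a tuple that holds a list
```

The last line shows that a tuple is only usable as a key when everything inside it is hashable too.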
Indexing and Mutability:
Values in a dictionary are accessed by indexing with their keys. The same indexing can be used to assign a new value to a key (since the value, unlike the key itself, can be changed).
#Creating a dictionary
d = {'a':3, 'b':4, 'c':5}
#Indexing value of 'b' in dictionary
print(d['b']) #Gives output 4
#Performing assignment via indexing
d['b'] = 6
print(d) #Gives output {'a':3, 'b':6, 'c':5}
Built-In Methods:
Dictionaries also support built-in methods like the rest of the data structures. Some of them are shown below:
d.update({'d':7, 'e':8})
print(d) #Gives output {'a':3, 'b':6, 'c':5, 'd':7, 'e':8}
print(d.keys()) #Gives output dict_keys(['a', 'b', 'c', 'd', 'e'])
print(d.items()) #Gives output dict_items([('a', 3), ('b', 6), ('c', 5), ('d', 7), ('e', 8)])
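A few other dictionary operations that are commonly needed are sketched below: safe lookup with get(), membership tests, looping over items(), and removing a key with pop(). The dictionary contents and the names `total` and `removed` are made up for the example.

```python
d = {'a': 3, 'b': 6, 'c': 5}

# get() returns a default instead of raising KeyError for a missing key
print(d.get('b'))     # 6
print(d.get('z', 0))  # 0

# Membership tests check the keys, not the values
print('a' in d)       # True

# Iterating over key-value pairs
total = 0
for key, value in d.items():
    total += value
print(total)          # 14

# pop() removes a key and returns its value
removed = d.pop('c')
print(removed)        # 5
print(d)              # {'a': 3, 'b': 6}
```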
That’s it for this Part
With the data structures covered, the next topics are conditionals, looping constructs and functions in Python, which play an essential part in the Data Science journey. In the meantime, I will try to put all this content, as well as the upcoming parts, in a repository so that it is easily accessible. Looking forward to writing the next article!
For Python Documentation on Data Structures: https://docs.python.org/3/tutorial/datastructures.html
For any queries, contact me on LinkedIn: https://www.linkedin.com/in/hassan-farid-5a17541a2/