Algorithms and Data Structures



August, September 2020


I’m very excited to begin year two of studying data science. Year one was a fantastic challenge and I’m appreciative of being pushed hard by the material. This year promises more of a single focus on data science skills, so it will be interesting to compare notes with fellow students who also have similar career ambitions. I’ve had a period of valuable work, continuing tasks in data visualization and working more with data, albeit in Javascript. Hopefully this helps my progression in the course.

LECTURE 1 – 17th August 

Lecture reflections

A topic I’m just impatient to gain some skill in! I am a member of “The Daily Interview Question” email which is a real challenge and makes me aware how much I’m lacking the ability to solve some algorithmic challenges and to understand the time implications of chosen methods.

I note that hashmaps were mentioned in the lecture, something I’ve heard multiple times as a preferred abstract structure for fast lookups in apps. I need to expand my knowledge further on other types.

Data structures optimise for easiness, speed or space.

The Facebook lookup efficiency was stunning. I was naively unaware the impact could be so severe! Some light experience with functional programming recently highlighted how often unintended side effects can appear in a codebase with too many dependencies. The lecture highlighted just how much time and efficiency are affected in this way also. There are so many factors to manage.


I’m very happy to be coding extensively on the first day – from what I’ve read on blogs and interviews, life as a data scientist is just so much easier with advanced programming skills. 

I’m very familiar with typical loops in Javascript now, so it took me a little time to get comfortable with the Python syntax again. Eventually I was able to settle on a simplified version that didn’t involve so many steps. 

I struggled with the details of the time function at first. The documentation is quite dense implying how deep the topic is. As highlighted in the lecture, when milliseconds make such a big difference it is vital to be able to optimise every line in a big application architecture. I wasn’t quite aware how big an impact system resources made even on simple functions. I’ve come across ‘threading’ in some of my reading and the automation of ‘garbage collection’ in the memory but I’m looking forward to taking a deeper dive during the course.

Finally the fibonacci sequence generator was a nice challenge that I was able to solve without looking up any resources. That’s a nice feeling on day one! The request to test and time the procedure was beneficial as it relates very well to the ‘real world’ and test driven systems that I’m sure we will be working with in the future.

LECTURE 2 – 18th August

Time complexity of algorithms

Lecture reflections

As mentioned yesterday, a lot of the coding challenges I have been attempting over the summer describe optimal solutions in relation to time. I’ve never dived into the topic, having been a little intimidated by the apparent complexity and also having been preoccupied with just getting my code to run!

With this in mind, I value the chance today to dig deeper into the theory. I felt like we were given a comprehensive overview and I definitely have a better understanding but it could take a while before it becomes intuitive. I expect it takes even longer to be able to design algorithms efficiently from the outset but on the other hand there is probably an extensive suite of measuring tools to help the process of optimizing each step.

As this topic is clearly very mathematical I decided to look up Khan Academy which has always been my preferred resource for math. This page ( was brilliant in giving me a couple of other ways to visualise and attempt to comprehend the topic. This sentence and the appended graph stuck out for me, 

“We use big-O notation for asymptotic upper bounds, since it bounds the growth of the running time from above for large enough input sizes.”

 I’m hoping I will be able to claim greater fluency in the topic come the end of the course. 

Image 1 – Khan Academy


I’d never heard of this terminology before today despite wandering as far as some random quantum theory blogs since starting my studies! But it frames specific problems in a very clear way and has even clearer implications in security and artificial intelligence. Before I attempt a summary of the topic below I would like to quote a closing comment from the author which highlights how dense the topic is:

“None of us truly understands the P versus NP problem, we have only begun to peel the layers around this increasingly complex question.”

Firstly it’s worth defining that P is for “Polynomial Time” and NP is for “Nondeterministic Polynomial-Time. NP “can be seen as a graph where every element is connected to every other element”. The problem is framed by the author as “whether every algorithmic problem with efficiently verifiable solutions have efficiently computable solutions.” 

Early on the article is very clear in stating that despite so much change in computational power over recent decades, little has changed with regards to this problem. The scale of the challenge leads him to state, “we have little reason to believe we will see a proof separating P from NP in the near future”.

One example of a particular problem stated that with 40 pairs in the problem, 300 billion trillion possible alternative pairings existed. Despite the intimidating scale of the problem, it remains high in the collective scientific consciousness as a breakthrough would have amazing implications for changing the face of humanity. It is also stated that this problem has in fact been ‘inspirational’ for computer science in determining its progress.

I’ve often come across the assumption that once quantum computing goes mainstream, “it will solve every problem” yet research highlighted in this article strongly implies the opposite. 

The mention of complexity theorists throughout led me to seek a definition of their work versus that of chaos theorists.

“Chaos theory seeks an understanding of simple systems that may change in a sudden, unexpected, or irregular way. Complexity theory focuses on complex systems involving numerous interacting parts, which often give rise to unexpected order”. 

Current approaches to showing P ≠ nP

  • Diagonalization – requires simulation and we don’t know how a fixed NP machine can simulate an arbitrary P machine.
  • Circuit complexity – we haven’t seen any significantly new circuit lower bounds in the past 20 years.
  • Proof complexity – if one could prove there are no short proofs of tautology that would imply P ≠ NP.


A new approach  dubbed “Geometric Complexity Theory” shows promise. The following tasks remain:

1. Prove that the LP relaxation solves the integer programming problem for Pn in polynomial time;

2. Find an efficient, simple combinatorial algorithm for the integer programming problem for Pn.

Unfortunately the theory’s author believes it will take about 100 years to carry out this program!

LECTURE 3 – 19th August

Searching, arrays and lists

Notes and reflections

  • Big O notation – algorithm time complexity is defined such that we always consider the worst case scenario.
  • Lists are compound data types – compound data types are those that are made up of smaller parts – ie an integer is not whilst a list is because it could be constructed by an integer. The optional ambiguity of treating a list as a single thing or accessing its parts is useful (link).
  • Lists manage sequences – a generic term for an ordered set.
  • Lists are powerful data types – in summary they are extremely flexible enabling a massive number of usage cases. They may contain many different types of data and are able to take on minimal characteristics of a container / storage like property.

List methods

Whilst working extensively in Javascript this summer, I note how often I used the .push method in my functions when working with data using a functional approach, attempting to avoid ‘in place’ modification. The Python list function, sorted(), has this behavior of returning a new object.

The conversion function list() could be useful as a code check / test to prevent errors in bigger projects where type safety is key.

The enumerate function isn’t something I have much experience with but it looks like it could really simplify certain functions where the indexes are essential. I enjoyed looking at the source code of the function in this article (link and image below).

Image 2 – enumerate() function source code

Lists versus arrays

On the face of it they are very similar but in Python the use of an array is possible, but a declaration of type is required. As lists are more versatile they carry a bit of ‘overhead’ in terms of memory management at the level of the interpreter and so an array is considered preferable with large data sets.


Below is the code from today’s work. Angesh mentioned in the lecture that the origin of these algorithms was in the early days of computing when operations were far simpler. Despite this fact, I see the value in doing it to learn the lesson early on of memory management and efficiency in a direct way. 

Although perhaps considered elementary by some, I found the work challenging to code in an effective style. The logic is fairly simple to grasp but this challenge is a stark reminder about the level of detail and precision required to code up logic. 

I’ve included my code in an attached file so as not to make fellow students scroll hundreds of lines. I duplicated a lot of the setup when using the time function to ensure I was getting a precise reading of the time. It took longer than I anticipated in completing this exercise as there are a few ways to measure the wrong thing – I found myself having to run each section a few times just to be sure after misinterpreting the number of runs parameter and its effect on output.  

Binary search is a lot more efficient as the number of guesses is the log2 of the number of items. So for a 1000 items, binary search would require a maximum of 10 guesses. For 10000000 items, the maximum would be 24 guesses. The worst case is the first or last item as the algorithm starts in the middle and slices from that point towards its goal. If the list is sorted, linear search execution time scales predictability with the number of items required.

Image 3 – binary search function from Jupyter Notebook

As a final visual summary, Khan Academy once again was a great resource for today’s topic (link) with the graph below, quite illuminative.

Image 4 – comparison of the algorithms 

LECTURE 4 – 20th August

Sorting algorithms

Today’s lecture was an inspection of the bubble sort algorithm. This paper (link) from Duke University was quite informative and my reflection contains quotes from this source. Angesh made it clear how impractical the algorithm is yet we look at it, partly to understand the inner workings of it and the constraints with which early computers applied. But mainly so as to be comfortable working with code in this way of measuring efficiency.

Bubble sort history

Despite Knuth’s premise that “bubble sort seems to have nothing to recommend it”, it is very popular in part since it was first used in the late 1950’s. One positive is “the bubble sort has the additional virtue that it requires almost no space in addition to that used to contain the input vector.”

Core features 

“Nearly every description of bubble sort describes how to terminate the sort early if the vector becomes sorted. This optimization requires checking if any swaps are made and terminating if no swaps are made after j iterations of the inner loop”.


I have surfaced three potential optimizations over the typical implementation.

  1. alternate the direction of bubbling each time the inner loop iterates – shaker sort. (link).
  2. a boolean flag and check if any swaps were made in the previous iteration (link).
  3. Count iterations- in the k-th iteration, only need to look at the first n – k + 1 elements (link).

Bubble sort diagrams and charts

Image 5 – The logic behind the algorithm (link)

Image 6 – Comparison with other algorithms -Utah University (link)

Image 7 – Utah University – Linear performance on sorted lists


The tutorial played a very helpful role for me again today. This topic is something I feel like I would assume I understand or have gained the skill in if I had just watched / read the topic. But the forcing function of running a lot of different time tests and inserting measurements into the code was valuable.

I am slightly disappointed at my attempt to write the algorithm. I was able to write the reassignment using the temporary variable without referencing any material but I struggled to ensure the correct recursive function. 

Whilst researching optimised versions it became really clear where I was going wrong by not implementing the inner loop. In addition an alternate method of swapping seemed to be very readable and removed the need for the temporary variable. I liked the use of boolean flags to manage the control flow also.

To summarise the results image 7 above visually demonstrates the results I achieved below. 

Image 8 – Tutorial results

LECTURE 5 – 24th August 


Lecture Notes


  • each identified by an array index or key 
  • random access is possible
  • can have as many dimensions as we want
  • Applications: lookup tables / hashtables, heaps
  • Very fast data structure
  • Reduced memory usage when working with numbers
  • Array must contain same types

Time complexity

Add new item: O(1) 

Insert item to a given index: O(N)

Removing last item: O(1) 

Removing f.e. middle item: O(N)

Image 9 – visualisation of contagious v non-contagious memory (link)

Linked lists

  • composed of nodes and references
  • A reference is a pointer pointing from one node to the other
  • The last reference in a linked list always points to a NULL
  • Each node is composed of a data element and a reference/link to the next node in the sequence
  • Used for stacks and queues
  • Can allocate memory
  • No random access – implies checking most elements – requires sequential access
  • Linked lists are dynamic
  • Singly linked lists are extremely difficult to navigate backwards

Time complexity

Insert at the beginning O(1) 

Inserting at the end O(N)

Remove items at the beginning: O(1)

Remove items at given positions: O(N) generally

Reflection and tutorial

This resource (link) at Real Python was very helpful in addition to the lecture. I learnt a further real world use case of graphs for linked lists and “lifecycle management for an operating system application”. They provide superior performance over that over a typical python list.

Image 10 – use of linked list in a graph (link)

I enjoyed the lecture topic very much as it’s not a data structure I’ve worked with before. I’ve heard on podcasts the mention of the use of pointers in languages like C++ but assumed it was a very manual process. Fantastic to learn about the practical use cases. So much is abstracted in Python which is what makes it very readable and easy to get started with so I feel like today was valuable in getting a closer look at the processes ‘behind the scenes’.

In the summer I worked with Javascript in a functional way and so I needed a refresher on working with classes to complete the work today. Inheritance is such an efficient way to program so I’m hopeful of becoming more proficient in working with code in this way.

The linked list would be less effective as there would be no lookup by index as implemented in optimised versions of the algorithm where. In addition the swapping function would require a change to the pointers as well as the node to re-order the list.

Attached is the notebook with the node and linked list class. Although the logic was fairly simple in the lecture, the actual code took a while longer to get in place. Managing the two pointers in the removing method was tricky. This article (link) had a step by step explanation of the logic which was a real help.

LECTURE 6 – 25th August 

Algorithmic problem solving

Lecture Notes

Image 11 – time complexity chart

The chart (image 11) was present again in today’s slides. It’s a great way to visualise the definitions of complexity and the effect of size. 

Divide and Conquer – this method is a vast improvement over bubble sort as it reduces the number of swaps from n2 to n. Essentially it creates and solves a set of subproblems to complete the overall solution. It’s a process that’s part of a binary search solution.

  • 1. Divide: Break the given problem into subproblems of the same type. 
  • 2. Conquer: Recursively solve these subproblems
  • 3. Combine: Appropriately combine the answers

Selection sort – Find the position of the largest item, at the end of the list SWAP once only.

Tutorial reading material – An Anagram Detection Example

Solution 1: Checking Off

Time complexity efficiency – increment loop so relative to input size – total length of two strings. For large value n2 will dominate thus O(n2).

Code differences – convert the second string to a list. A check is made for the existence of each character. Recursively resets the boolean and list 2 variable to false and 0 begin looking for the second string.

Solution 2: Sort and Compare

Time complexity efficiency – much more efficient than solution 1 as on average the false outcomes will be identified quicker with less comparisons. Calling sort() twice however causes an outcome usually of  O(n log n) or O(n2).

Code differences –  use of built in sort() function. Assumption is that once sorted the two strings should be similar. If not, at the first difference can we stop the algorithm and return false.

Solution 3: Brute Force

Time complexity efficiency – a factorial approach, O(n!), is by far the worst choice in this analysis. 20 characters equals  77,146,816,596 years to go through the entire list!!

Code differences – generate a list of all possible strings using the characters from s1 and then see if s2 occurs. 

Solution 4: Count and Compare

Time complexity efficiency – O(n) = Best performance in this problem set. There are no nested loops which creates the efficiency. This algorithm sacrificed space in order to gain time as only 26 possible characters, not millions.

Code differences – count character frequency (using two counter lists of each character) and finally compare the counts. Any differences in the totals leads to a false outcome. 

Tutorial coding

I’m still feeling a little rusty in Python after a whole summer of Javascript. I felt like I had the logic really clear when I opened up the Jupyter Notebook but again the recursive element caused me a few headaches. I reached the error message “exceeded maximum call stack” a few times when I forgot to set the boolean flag or when I placed the return call wrongly.

My first attempt included a temporary list that included copies of the largest values which had been placed correctly. That list would be the reference for what is allowed to be compared with. But I figured pretty quickly that simply having a reference for the position in the list, that has been sorted.

The final issue I had was similar to the previous tutorial where I tried to do everything in one loop rather than simply using an inner and outer loop. Hopefully I don’t repeat these mistakes tomorrow!

Image 12 – my familiar error!

I was able to use a solution which made 100 less comparisons than the provided selection sort algorithm, perhaps by removing the need for the secondary sort function that gets called. The number of swaps remains equal to the input size in both so ‘big O’ is similar.

Variants of bubble sort

As discussed in a previous reflection, blog entry, the two main optimization options are present in today’s notebook. BubbleSort2 uses the i variable to reduce the comparisons almost in half that need to be made. BubbleSort3 isn’t a massive change in efficiency but it further reduces the number of swaps needed by using a boolean flag and a while condition loop to exit the algorithm quicker in addition to a counter method.

Notebook data:

BubbleSort versus SelectionSort

These were my selection sort timings from the attached notebook:

SelectionSort – Sorted list with 100 items: 0.0005432363100044313

SelectionSort -Reversed list with 100 items: 0.0005639374400197994

SelectionSort -Random list with 100 items: 0.0006676772599894321

SelectionSort -Random list with 5000 items: 1.4295125554002879

BubbleSort – Random list with 5000 items: 3.0147043420001864

So in my code, selection sort is more than twice as quick as bubble sort with this particular set of numbers. Below is an external reference also on a set of 5000 items.

Image 13 – external timing data (link). I was able to achieve a similar result, 1.4295125554002879 seconds with similar inputs of 5000 items. But it looks like my algorithm could be optimised a little further.

Interestingly selection sort outputs similar comparisons and swaps, no matter the order of the input list as shown from the test data in our notebook. At first I thought I’d made a mistake but after testing other solutions and inputs and after discussion with a fellow student I see that’s the way it works. It’s performance is linear to the input size. Below is a table summarizing my results.

Still curious though about the inefficiency I ran some code (source) which finds the minimum number of swaps in a general sense of an array as seen below. This means we can say that selection sort is a clear improvement over bubble sort; its main disadvantage is that it has no best case scenario as summarized in image 16 below.

Image 15 – from my notebook (attached)

Image 16 – No best case for selection sort (source)

LECTURE 7 – 26th August 


Lecture Notes

Stack == ADT, interface

Last In First Out (LIFO) Structure

Supports – push() – pop()- peek() – (time complexity 0(1))

Multiple working methods

Python list / dynamic array / linked lists can implement a stack

Use cases:

  • Web navigation
  • stack-oriented programming languages
  • Graph algorithms (depth-first search)
  • Recursive Algorithms

Call stack manages programme flow control and associated variables in a specially defined and limited area of the memory. Allow for fast access.

Heap memory is general, larger than a stack,  managed manually with pointers.

Stack oriented

Programming languages that are stack oriented include Forth, RPL, PostScript, BibTeX (source). The article states that “some stack-oriented languages operate in postfix or Reverse Polish notation”. I learnt a little about a LISP dialect, Clojure during the summer which uses Reverse Polish notation in some areas and can be defined as a language that employs a functional approach. Here’s a nice article with some math examples (link).

Compare both the implementations in a qualitative sense:

Python list– allows for random access or via an index. This enables a much faster lookup time on average for most use cases. Memory is pre-allocated. Less code required to provide node and pointer functionality found in linked lists.

Linked list – Requires sequential access and is dynamic with memory allocation, there is no limit to using the stack theoretically.

As a programmer which of the implementation method would you prefer and why?

A list is a far more familiar data structure for me to work with and so and my skill level today I would choose this. It’s very flexible and handles nearly all use cases as far as I’m aware of. However in specific use cases such as a graph algorithm, this method could be more efficient in managing memory and space. Although Python is described as a high level language, there are so many layers deeper one can dive. In researching this question I discovered that manual garbage collection is possible by using gc.collect() to ensure against memory leakage bugs.


Making the direct comparison with code in today’s tutorial was ideal for me as ‘learning by doing’ is a preferable approach rather than taking in only the abstract concept. Writing the code and thinking through the logic forces an extra appreciation of the topic. 

My tutorial speed tests below seem to back up the general claim of improved speed with the Python list. However, a week working with the timeit function makes me appreciate the fickle nature of this test due to the potential myriad of factors.

Looking forward to a continuation of the topic in the next lecture with Angesh.

Image 17 – push() and peek() method speed test

Image 18 – push() and pop() method speed test

LECTURE 8 – 27th August 


Lecture Notes

Definition – “a method of problem-solving in which a function calls itself from within its own code

Recursive calls use the stack” from suggested extra reading (link)

Depth First Search, traversing a binary search tree equals recursion

Part of a problem that can be broken down to repetitive small incremental and inductive steps

Define a base case and an indicative case

An incorrect recursive function creates stack overflow by making too many calls to the stack.

Stack overflow prevention happens by transforming the recursion into a loop and storing the function arguments in an explicit stack


Another worthwhile set of activities today. I’m familiar with the concept in code from our work in year 1 with the sudoku assignment but I learned a lot more about the subject today. It took a bit of time to code my solutions, understanding a few extra dimensions of the return statement along the way. 

Image 19 – Q1

Image 20 – Q2 Fibonacci

Q3 – Other stack overflow factors / causes


This answer (link) explains well with what I found to be true when coding my own solution. Namely that the print statement when called excessively can be a cause in itself when otherwise the function may complete without error.

CPU limitations

Hardware defines how deep a very deep recursion can go in some cases. Using the in built sys function sys.getrecursionlimit() returns an integer limit but this seems arbitrary and non specific to my machine. For example machine learning algorithms are extremely demanding of the computer architecture and cannot be performed locally with the majority of computers without causing stack overflow.

Memory leak

If a poorly designed function keeps doing something unintentionally it could cause a leak and crash the programme. For example a faulty math statement could cause a list to grow uncontrollably.

Creating stack overflow

Objective achieved! I didn’t have to wait too long in creating the following error messages when writing my code. Afterwards, intentionally causing this output was done by faulty math and adding a print statement for every recursion of a large number. The full code is in the attached notebook.

Image 21 – my stack overflow outputs in the notebook

Further research and reflection


This code snippet was an interesting addition to my reading on today’s topic. It was linked to as a suggestion for optimizing a recursive statement by avoiding repeat calls. Using the @ decorator is a clean way to apply this to a function.

Image 22 – memoization example (link)

Tail-call optimization

A topic suggested by Angesh to investigate, I uncovered some new insights (for me!) in the recursion process in Python. It’s not specific to Python and indeed not native to the language.

It can be considered as a form of optimization at the compiler level, as it prevents further, unnecessary frames being added to the stack. Some define it as a change to looping with the added benefit of increased readability. Here is a Github link to a module for this in Python (link).

However, Python’s author, Guido van Rossum is not a proponent of this feature, something in which he elaborated on in two blog posts (link1, link2). His main arguments are that:

  • “the elimination of stack frames doesn’t do anything for the algorithmic complexity of the code”
  • “it does make debugging harder – there’s no stack frame left to use to print a traceback when something goes wrong later.” 
  • “I don’t believe in recursion as the basis of all programming”.
  • “Most of Python’s library is written with sequences and iterators as fundamental building blocks (and dictionaries, of course), not linked lists, so you’d be locking yourself out of a lot of pre-defined functionality by not using lists or sequences.”

Image 21  -using  inspect.stack(context=1), Anaconda responsible for the sizable output of a stack inspection

LECTURE 9 – 31st August 

Queues and Binary search trees

Lecture Reflection

Interesting to learn a little more general computer science today in the form of queues. As Angesh mentioned, it’s very similar to the stack data structure. The architecture and functionality of trees were something that made a lasting impression on me last year during our discrete mathematics class using algorithms such as Djikstras. 

New today was the point that unbalanced trees are sub-optimal in terms of speed and complexity. The slide was a good visual for me on how this can basically become a linked list. Also despite a clear explanation I assume deleting nodes with two children could have some tricky possibilities in a complex graph.

The traversal of a BST is a common interview question from what I’ve read. The videos suggested to us were initially helpful (images22, 23 below) providing a concise and abstract view of the code we need to write.

Image 22 

Image 23


For the popular computer / data science algorithms that we are tackling in this course it is simple to find quality open source scripts as Angesh has mentioned. But I know from last year how important it is to try things yourself first. For me I often get lulled into a false sense of security, scanning a script and assuming I can see the logic and the methods being used.

But today was a real wake up call for me in object orientation methodology. I need to spend a lot more time in this style of code so that I can read it more easily and appreciate the background process and inheritance that occurs.

I feel very comfortable with the BST logic yet it took me a long time to get the script together. I’m hoping I can look back at this awkwardness with a smile in a few weeks time! Attached is my current work in progress although I hope to finish it later. I am still having a problem with the predecessor and successor methods as shown with my error in image 24. There is a child node in my test call so I must be passing the node data incorrectly.

Image 24 – my error

Before this error though I was able to get some basic functionality going with some help from this script (provided here). It helped me define the recursive element of insertion.

Image 25 – starting code (link)

I found further assistance in two other areas. I reference them in case they can be of assistance to others in our class or in case they spark a conversation about optimal methods. Firstly this blog (link) provided a very clear step by step logic although the classes are structured slightly differently than was asked of us in the tutorial. He links to the full code in Github including a very sophisticated series of tests. This is something I’ve never coded before but something I appreciate is necessary at the professional level. Secondly this stack overflow topic (link) was insightful as the debate between users helps a novice like me appreciate the finer details and requirements of the code for BST to function properly and handle edge cases.

Angesh mentioned that we would have a secondary lecture on this topic where we add more code to our class and BST functionality. Hopefully by then I have debugged my issue!

LECTURE 10 – 1st September

Balanced trees and Decision trees

Lecture Reflection

AVL tree

Continuing on with the concept of efficiency, AVL trees have a distinctive characteristic that the “height of the two child subtrees of any node differ by at most one“ ensuring O(log N) performance. Operating systems are a prime use case. The core process to enable this is a check on every insert that the tree remains balanced. When one structure change is made upon an insert, checks must continue, looking for knock on or side affects for ‘balance’.


Knowing that ML is a major part of data science I’m sure most of us in the class were very happy to be working in a notebook with scikit-learn. Normal software engineering entails multiple layers of abstraction but my impression is that ML takes this a step further! The following is my take on the two main areas we were tasked with – decision tree structure and the value of numpy and pandas in the Ml ecosystem.

How are decision trees used to build a classifier?

“A classifier is any algorithm that sorts data into labeled classes, or categories of information” (source link). To look at their usage in examples such as spam filters on incoming email we should first define the underlying structure of a decision tree. This source (link) states that it will “go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves)”.  Furthermore this structure is “among the most popular machine learning algorithms given their intelligibility and simplicity.” 

A point that occurs to me is that it is a predictive process, so quality historical data is of value despite knowing that phrase could be applied to most ML problems! A key point here is the extent  of how “supervised” the learning is. To understand how this particular predictive model works we can look at the scikit-learn module ( 

Image 26 – methods for scikit-learn implementation of a decision tree classifier

Above we see the extent to which the algorithm can be tuned or explored. Having just looked in more detail at recursive methods it makes sense that we tackle this particular topic as the recursion used in the algorithm acts as an updating of possible predictions upon a visit to each node in the tree. It can be defined as “binary recursive partitioning” (link). A type of “divide and conquer” strategy the point at which enough stability or pattern is found in the data can be tuned in the algorithm parameters. I like the terminology of a stopping point of iteration on “each child node until the leaves are pure” (which means class likeness in the leaves). The term used is “purity” and it is used in the scikit-learn algorithm below.

Image 27 – algorithm parameters that tweak the recursion process

A final structural comment regards the process of making the decisions around the features of the data and the way in which the algorithm can be most effective by keeping the tree structure minimal. “Information gain” is a process which can be described as selection of minimal entropy (“lack of order or predictability”). This is core to deciding on how to split the data.

The apparent simplicity and design which enables tracking of the process step by step does come at one major cost – the algorithm is prone to large fluctuation in output with a small change in input data.

Details on how NumPy and Pandas are used to solve interesting problems in DS?

Numpy and Pandas are foundational to the current ecosystem of data science. An example of this can be found on the scikit-learn (a Python machine learning library) website pictured below. It highlights the structural value they provide in enabling other systems and libraries to build applications on top of them. Secondly it highlights the effect of accessibility and discovery in running data science experiments and operations.

Image 28 – clipped from the scikit-learn website

In this post (link)  I amusingly learnt that Pandas actually meant “Panel Data” contrary to everyone’s animal assumption! The author makes a point that I have also found to be true, “pandas is closely related to NumPy and we will often see the use of NumPy when we are working with Pandas.”

Image 29 – the homepage descriptor

To quote the open source project’s webpage directly, “Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.” In personal experience so far the package enables us to extract data from a data set easily, manipulate with minimal code and equally make use of its built in features one of which is outputting a much better visual style on the data.

NumPy can be succinctly summarised as providing the computing power to the very complex calculations required by some programmes and is specific to Python. To quote their website they can power all of science and they provide a visual which I’ve included below in image 30. Researching this question has really helped in understanding why these packages are so fundamental and beneficial.

Image 30 – NumPy applications – a phenomenal distribution

Tutorial reflection

I took a little extra time on this tutorial exploring the documentation for the associated methods such as values, creating numerical values for category topic mapping and the way in which a training set is implemented. Overfitting is a term I’m very aware of from our studio project last year and it appeared in the documentation of scikit-learn stating “it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set”. This can be seen in image 31 below from the cross-validation feature docs.

Image 31- model process overview (link)

In summary it was a productive and engaging process, one where a few general ideas clicked into place in a more practical way. Finally I’ll suggest this piece (link) as being of valuable assistance in producing my final visualization of the tutorials work, below which importantly includes the class names (drug type). 

Image 32 – my tutorial output

LECTURE 10 – 1st September

Balanced trees and Decision trees

Lecture Reflection

AVL tree

Notes for course summary

End of course

Valuable time it info – demo on breaking down script per object

Also profiling

Leave a Reply