Python Data Structures and Algorithms



Python Objects, Types, and Expressions

Python is the language of choice for many advanced data tasks for a very good reason: it is one of the easiest advanced programming languages to learn. Its intuitive structures and semantics mean that for people who are not computer scientists, perhaps biologists, statisticians, or the directors of a start-up, Python offers a straightforward way to perform a wide variety of data tasks. It is not just a scripting language, but a full-featured, object-oriented programming language. 

In Python, there are many useful data structures and algorithms built into the language. Also, because Python is an object-based language, it is relatively easy to create custom data objects. In this book, we will examine Python's internal libraries and some of the external libraries, and we will learn how to build your own data objects from first principles. 

 

Understanding data structures and algorithms 

  

Algorithms and data structures are the most fundamental concepts in computing. They are the building blocks from which complex software is built. An understanding of these foundational concepts is hugely important in software design, and it involves the following three characteristics: 

  • How algorithms manipulate information contained within data structures 
  • How data is arranged in memory 
  • What the performance characteristics of particular data structures are 

In this book, we will examine this topic from several perspectives. Firstly, we will look at the fundamentals of the Python programming language from the perspective of data structures and algorithms. Secondly, it is important that we have the correct mathematical tools. We need to understand some fundamental concepts of computer science, and for this we need mathematics. By taking a heuristic approach and developing some guiding principles, we find that, in general, we do not need more than high-school mathematics to understand these key ideas. 

Another important aspect is evaluation. Measuring an algorithm's performance involves understanding how each increase in data size affects operations on that data. When we are working on large datasets or real-time applications, it is essential that our algorithms and structures are as efficient as they can be. 
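To make this concrete, here is a minimal sketch using the standard timeit module to time a membership test on lists of increasing size; the sizes and repeat count are arbitrary choices for illustration:

    import timeit

    # Time a linear scan (the `in` operator on a list) at increasing input
    # sizes to see how running time grows with the size of the data.
    for n in (1_000, 10_000, 100_000):
        setup = f"data = list(range({n}))"
        stmt = f"{n - 1} in data"  # worst case: the target is at the end
        seconds = timeit.timeit(stmt, setup=setup, number=1_000)
        print(f"n={n:>7}: {seconds:.4f} s for 1,000 lookups")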

Finally, we need a sound experimental design strategy. Being able to conceptually translate a real-world problem into the algorithms and data structures of a programming language involves being able to understand the important elements of a problem and a methodology for mapping these elements to programming structures. 

To give us some insight into algorithmic thinking, let's consider a real-world example. Imagine we are at an unfamiliar market and we are given the task of purchasing a list of items. We assume that the market is laid out randomly, that each vendor sells a random subset of items, and that some of these may be on our list. Our aim is to minimize the price we pay for each item as well as to minimize the time spent at the market. One way to approach this is to write an algorithm like the following (a Python sketch of this strategy appears after the list):

Repeat for each vendor: 

  • Does the vendor have any items on my list, and is the cost less than my predicted cost for the item? 
  • If yes, buy the item and remove it from my list; if no, move on to the next vendor. 
  • If there are no more vendors, end. 
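
The following is a minimal Python sketch of this strategy. The data layout (each vendor as a dictionary mapping items to prices) and the predicted_costs mapping are illustrative assumptions rather than a fixed interface:

    def go_shopping(vendors, shopping_list, predicted_costs):
        purchases = {}                # item -> price actually paid
        remaining = set(shopping_list)
        for stall in vendors:         # repeat for each vendor
            for item in list(remaining):
                price = stall.get(item)
                # Buy only if the vendor stocks the item at or below our estimate.
                if price is not None and price <= predicted_costs[item]:
                    purchases[item] = price
                    remaining.remove(item)
            if not remaining:         # everything bought: stop early
                break
        return purchases, remaining

    vendors = [{"apples": 2.0, "bread": 3.5}, {"bread": 2.8, "cheese": 6.0}]
    wanted = ["apples", "bread", "cheese"]
    estimates = {"apples": 2.5, "bread": 3.0, "cheese": 5.0}
    print(go_shopping(vendors, wanted, estimates))
    # ({'apples': 2.0, 'bread': 2.8}, {'cheese'})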

Python for data 

Python has several built-in data structures, including lists, dictionaries, and sets, that we use to build customized objects. In addition, there are a number of internal libraries, such as collections and the math module, which allow us to create more advanced structures as well as perform calculations on those structures. Finally, there are external libraries such as those found in the SciPy packages. These allow us to perform a range of advanced data tasks such as logistic and linear regression, visualization, and mathematical calculations such as operations on matrices and vectors. External libraries can be very useful for an out-of-the-box solution. However, we must also be aware that there is often a performance penalty compared to building customized objects from the ground up. By learning how to code these objects ourselves, we can target them to specific tasks, making them more efficient. This is not to exclude the role of external libraries, and we will look at this in Chapter 12, Design Techniques and Strategies. 
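As a brief illustration of these built-in structures and standard-library helpers, the following sketch uses a list, a dictionary, a set, collections.Counter, and the math module; the data values are invented for the example:

    from collections import Counter
    import math

    prices = [4.5, 2.0, 4.5, 3.0]            # a list: ordered, allows duplicates
    stock = {"apples": 12, "bread": 4}       # a dictionary: key -> value lookup
    categories = {"fruit", "bakery"}         # a set: unique, unordered members

    # collections.Counter builds a frequency table from any iterable.
    print(Counter(prices))                   # Counter({4.5: 2, 2.0: 1, 3.0: 1})

    # The math module supplies numeric routines we can apply to these values.
    print(math.fsum(prices), math.sqrt(stock["apples"]))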

To begin, we will take an overview of some of the key language features that make Python such a great choice for data programming. 

The Python environment 

A key feature of the Python environment is its interactive console, which allows you to use Python both as a desktop programmable calculator and as an environment in which to write and test snippets of code. The read-evaluate-print loop (REPL) of the console is a very convenient way to interact with a larger code base, for example to run functions and methods or to create instances of classes. This is one of the major advantages of Python over compiled languages such as C/C++ or Java, where the write-compile-test-recompile cycle can increase development time considerably compared to Python's read-evaluate-print loop. Being able to type in expressions and get an immediate response can greatly speed up data science tasks. 
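A short interactive session illustrates this; the expressions themselves are arbitrary examples:

    >>> 2 ** 10
    1024
    >>> words = ["data", "structures", "and", "algorithms"]
    >>> [len(w) for w in words]
    [4, 10, 3, 10]
    >>> sum(len(w) for w in words)
    27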

There are some excellent distributions of Python apart from the official CPython version. Two of the most popular are Anaconda (https://www.continuum.io/downloads) and Canopy (https://www.enthought.com/products/canopy/). Most distributions come with their own development environment and editor, and both Canopy and Anaconda include libraries for scientific computing, machine learning, and other data applications. 

There are also a number of implementations of the Python console, apart from the CPython version. Most notable amongst these is the IPython/Jupyter platform, which includes a web-based computational environment.