Essential Libraries For Science
A core part of programming is having a rough idea of what exists out there in terms of functionality, tools, libraries, and so on. You do not need to know how to do a task, only that the task is possible using this or that library. In this chapter, we provide a somewhat opinionated list of tools and libraries that you should know about when writing code for science. We will not teach how to use those libraries, as there are plenty of resources available for that. Instead, we only provide a short introduction to what each library does.
We have divided the list of libraries into five categories: (1) core libraries, i.e., libraries you cannot write scientific code without; (2) libraries and tools you should not work without; (3) visualization and plotting libraries; (4) things that will help you speed up your code; (5) domain-specific libraries.
Core libraries: numpy, scipy, pandas
Almost every piece of scientific Python code starts with lines like the following:
import numpy as np
from scipy import stats   # or linalg, signal, ndimage, ... depending on your needs
numpy's core functionality is to provide an easy-to-use and efficient n-dimensional array data structure called ndarray. Any numerical analysis code in Python relies heavily on the ndarray data structure. The main difference with Python's list is that arrays are almost always homogeneously typed (all elements of the array are of the same type). numpy also provides fast and easy operations such as sums, element-wise products, matrix products, indexing, slicing, and broadcasting. Note that we have a full chapter on how to use numpy (see [🚧 Numpy 🚧] for details)!
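To give a flavor of those operations, here is a minimal sketch using only standard numpy calls:

import numpy as np

a = np.arange(12).reshape(3, 4)   # a 3x4 array containing the integers 0..11
a.sum(axis=0)                     # column-wise sums: array([12, 15, 18, 21])
a * 2                             # element-wise product
a[:, 1:3]                         # slicing: columns 1 and 2
a + np.ones(4)                    # broadcasting: the row vector is added to every row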
scipy is the core toolbox for scientific and technical computing in Python. It contains many submodules, some related to complex data structures (sparse matrices, for example) or operations on ndarray (linear algebra, Fourier transforms, I/O), others domain-specific (signal processing, image processing, clustering). View scipy as the Swiss army knife of scientific computing in Python.
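For instance, solving a linear system or running a statistical test is a single call each. A minimal sketch, with made-up numbers:

import numpy as np
from scipy import linalg, stats

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
linalg.solve(A, b)                # solves Ax = b: array([2., 3.])

rng = np.random.default_rng(0)
stats.ttest_ind(rng.normal(0, 1, 100), rng.normal(0.5, 1, 100))   # two-sample t-test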
pandas is a software library for the manipulation of data frames (an object that stores tabular data, as you might find in Excel or relational databases). A data frame differs from a NumPy array in being 'column-oriented': the values in each column have to be of the same type, but one row can contain many different types of objects. Just as NumPy does for arrays, pandas provides fast and easy operations on data frames, such as pivoting, merging, joining, data filtering, and column-wise operations. We also have a full chapter on how to use pandas (see [🚧 Pandas 🚧] for details)!
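As a small taste, row filtering and group-wise aggregation look like this (the toy data is ours):

import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
})
df[df["sepal_length"] > 5.0]      # row filtering with a boolean mask
df.groupby("species").mean()      # column-wise means, one row per species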
Things you cannot (or should not) work without
A good text editor or an IDE (see [🚧 Editors And Ides 🚧] for details)!
ipython is a command shell for Python. Think python, but a hundred times better! It has syntax highlighting, auto-completion, and many magic functions (%debug, %run, etc.) to ease the process of programming, data exploration, and debugging.
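For example, a typical session might look like this (my_analysis.py is a hypothetical script name; the magics themselves are standard IPython):

In [1]: %run my_analysis.py            # run a script, keeping its variables in the session
In [2]: %timeit sum(range(1000))       # quick micro-benchmark of an expression
In [3]: %debug                         # post-mortem debugging after an exception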
pdb, ipdb, pdb++, and pudb are four Python debuggers. Knowing how to use a Python debugger is a must! If you don't know how to use one, read [🚧 Debugging 🚧] without further delay!
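The quickest way to try one is the built-in breakpoint() function, which drops you into pdb (a minimal sketch):

def normalize(values):
    total = sum(values)
    breakpoint()          # execution pauses here; inspect variables, step, continue
    return [v / total for v in values]

normalize([1, 2, 3])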
Whether you use an IDE or a text editor, configure it with a static linter to check your code as you write it. There are different options available. A widely used static linter is flake8: it combines three tools into one (pyflakes, pycodestyle, and mccabe). Another popular one is pylint.
Visualization libraries
Matplotlib is the most widely used visualization library in Python. While the API takes a while to get used to, there is nothing better for creating high-quality scientific figures for publication! Literally everything can be tweaked, tuned, and modified. seaborn is a thin layer on top of Matplotlib: you get the flexibility of Matplotlib with an easy-to-use interface.
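A minimal sketch of the object-oriented Matplotlib API (the output file name is our choice):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
fig.savefig("sine.png", dpi=300)   # every element above can be further tweaked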
bokeh, plotly, and altair are three excellent choices of plotting libraries with interactive outputs. All of these play very well with notebooks, such as MyST-NB or Jupyter notebooks. Note that plotly can be used to create interactive tables as well. Because we can't resist showing off cool features, here's a small code snippet we shamelessly stole from the MyST-NB documentation on creating an interactive table and plot.
Here's an interactive table with plotly:
import plotly.express as px

data = px.data.iris()   # the classic iris dataset, bundled with plotly
data.head()
|   | sepal_length | sepal_width | petal_length | petal_width | species | species_id |
|---|--------------|-------------|--------------|-------------|---------|------------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |
And now a small graph of the same data with Altair:
import altair as alt

alt.Chart(data=data).mark_point().encode(
    x="sepal_width",
    y="sepal_length",
    color="species",
    size="sepal_length",
)
Speed-ups
Once you have a working piece of code, the time may come when you have to speed it up. Keep in mind that "premature optimization is the root of all evil," so don't look too closely at the list of packages in this section unless you are sure that (1) your code works and (2) you really need to speed things up.
The first step of speeding up your code is to profile it: there is no need to spend time optimizing the data analysis part if 90% of the time is spent on loading the data. There are many profilers out there, but our favorite ones are line_profiler and kernprof. line_profiler lets you do line-by-line profiling of functions, while kernprof is a convenient script for running a line profiler (either line_profiler or cProfile, which is part of the Python standard library). If you are interested in profiling memory, we recommend memory_profiler. We invite you to read [🚧 Profiling 🚧] if you are at this step of the speed-up process!
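The typical line_profiler workflow looks like this (my_script.py is a hypothetical file name; the @profile decorator is injected by kernprof at run time, so no import is needed):

# my_script.py
@profile                      # made available by kernprof when run with -l
def slow_sum(n):
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total

if __name__ == "__main__":
    slow_sum(1_000_000)

Running kernprof -l -v my_script.py then prints the time spent on each line of slow_sum.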
cython is a programming language. It is a superset of Python, which means that any valid Python is also valid Cython. However, it supports additional syntax that brings it closer to the C language. Cython thus sits between Python and C and serves two purposes: (1) easily interfacing Python code with C code; (2) compiling Python/Cython code to C. The goal of Cython is to give C-like performance with the readability and flexibility of Python. See [🚧 Cython 🚧] for more details!
numba is a JIT compiler for Python, as well as a parallelization toolkit. It uses LLVM to generate machine code from Python code. Specifically designed for scientific purposes, it can yield 100x speed-ups of a function by simply adding a decorator to it. We give more details about using it in [🚧 Numba 🚧].
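Here is what that decorator looks like in practice (a minimal sketch; the speed-up you get depends heavily on your workload):

import numpy as np
from numba import njit

@njit                          # compile this function to machine code on its first call
def pairwise_sum(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
    return total

pairwise_sum(np.random.rand(1_000_000))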
joblib provides a set of tools to create lightweight pipelining in Python. Precisely, it has three main features: (1) fast disk-caching capabilities; (2) embarrassingly parallel helper functions; (3) fast compressed persistence for large data. The power of this library comes from how lightweight and non-intrusive it is: you can use many of joblib's features with barely any changes to the original code.
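A minimal sketch of the first two features (the cache directory name is our choice):

from joblib import Memory, Parallel, delayed

memory = Memory("./joblib_cache", verbose=0)

@memory.cache                  # disk caching: recomputed only when the input changes
def expensive(x):
    return x ** 2

expensive(3)                   # computed and written to disk
expensive(3)                   # loaded back from the cache

results = Parallel(n_jobs=2)(delayed(pow)(i, 2) for i in range(10))   # parallel loop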
dask is a very powerful parallelization library for data analytics. It integrates very nicely with existing projects, such as numpy, pandas, and scikit-learn. We won't go into much detail here, as we once again have a full chapter on this library! See [🚧 Parallel Python 🚧] for more details.
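To give an idea of the flavor, dask arrays mirror the numpy API but build a lazy task graph (a minimal sketch):

import dask.array as da

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))   # 100 lazy blocks
result = (x + x.T).mean(axis=0)    # nothing is computed yet
result.compute()                   # the task graph runs, in parallel, here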
Domain-specific libraries
The problem when it comes to domain-specific libraries is that, well… they are domain specific. As a result, only specialists of each domain know which library to use when. Yet, unlike in some communities (such as R), there are still widely used and well-maintained packages for some domains. For example, the go-to library for astronomy data analysis in Python is astropy: none of us, authors, are astronomers, yet we know that it is the go-to library (but, please, don't ask us more about astropy…).
Here, we provide a list of packages, each with a small description and, sometimes, a bit more detail about when to use which package; a short statsmodels example follows the list.
- statsmodels: if you need to do standard statistics in Python and scipy.stats doesn't cover it, there's a good chance it can be done with statsmodels. It covers generalized least squares, quantile regression, and linear mixed-effects models (no, you do not have to use R for those!).
- scikit-learn is the go-to library for machine learning that isn't deep: SVMs, regression models, random forests, cross-validation, and many metrics are included in this package. But if you need a p-value out of your model, you should check out statsmodels, not scikit-learn. If you want to do deep learning, see the next item in the list.
- tensorflow is the go-to library for doing deep learning and neural networks in Python.
- scikit-image complements scipy.ndimage for image processing algorithms.
- networkx is meant for network analysis in Python.
- biopython provides support for biological data analysis in Python (sequencing, alignment, population genetics algorithms, and structural bioinformatics).[1]
- astropy is, as mentioned, a toolbox for astronomy data analysis.
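As promised, here is what a basic statsmodels fit looks like (a minimal sketch on synthetic data of our own making; OLS and add_constant are standard statsmodels API):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

X = sm.add_constant(x)             # add an intercept column
model = sm.OLS(y, X).fit()
print(model.summary())             # coefficients, p-values, confidence intervals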
[1] Unfortunately, if you are looking for a tool to do differential expression analysis, nothing much exists and the standard is still to use R and Bioconductor…