Useful Libraries in Python

pip3 install --user \
     requests lxml cssselect click dataset ipy networkx

February 11, 2015

Ole Martin Bjørndalen (ole.martin.bjorndalen@uit.no)

Installing Packages

pip3 install --user requests

This installs the package under ~/.local/lib/python<VERSION>/site-packages/.

You can also use virtual environments.

Requests: HTTP for Humans

>>> import requests
>>> response = requests.get('http://utviklerlunsj.uit.no/')
>>> response.status_code
200
>>> response.encoding
'UTF-8'
>>> response.text.count('og')
8

http://docs.python-requests.org/

Requests: Web Services

Fetching JSON:

response = requests.get(url, auth=('username', 'password'))
data = response.json()

Uploading a file (multipart MIME):

requests.post(url, auth=('username', 'password'),
              files={'upfile': open('data.xml', 'rb')})
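
To see what auth= and params= put into a request without contacting a server, the request can be prepared and inspected locally. A minimal sketch; the URL and the query parameter are placeholders, not from the talk:

```python
import base64

import requests

# Placeholder endpoint and credentials, for illustration only.
request = requests.Request(
    'GET',
    'http://example.com/api/items',
    params={'page': 2},
    auth=('username', 'password'),
)
prepared = request.prepare()  # builds the request, sends nothing

print(prepared.method, prepared.url)
print(prepared.headers['Authorization'])

# HTTP Basic auth is the base64 encoding of "username:password".
expected = 'Basic ' + base64.b64encode(b'username:password').decode('ascii')
```

Basic auth is only an encoding, not encryption, so it is normally paired with https:// URLs.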

Requests Features

lxml: XML and HTML with Python

http://lxml.de/

Great XML library, but I mostly use it for HTML scraping.

lxml: Movie Showtimes

http://fokus.aurorakino.no/billetter-og-program/

<ul class="programList">
  <li class="showing">
    <div class="movieDescr">
       <a class="movieTitle">Sauen Shaun</a>
    </div>
    <div class="outerMovieDescription">
      <div>
        <div class="outerProgramTicketSale">
          <button type="button">16:30 (2D)</button>
    ...
  </li>
  <li class="showing">
16:30 (2D)  Sauen Shaun
19:00       Staying alive
21:30       The Imitation Game

lxml: cssselect()

>>> for title in document.cssselect('.movieTitle'):
...     print(title, title.text)
<Element a at 0x7f5ddad5e890>  Sauen Shaun
<Element a at 0x7f5ddad5e628>  Staying Alive
<Element a at 0x7f5ddad5e680>  The Imitation Game
canvas = element.cssselect('#drawing')[0]
table = element.cssselect('table[summary="Group Block"]')[0]

lxml: Scraping Movie Showtimes

import requests
from lxml import html

def get_movies(url):
    response = requests.get(url)
    doc = html.fromstring(response.text)
    for movie in doc.cssselect('.showing'):
        time = movie.cssselect('button')[0].text.strip()
        title = movie.cssselect('.movieTitle')[0].text
        yield (time, title)

URL = 'http://fokus.aurorakino.no/billetter-og-program/'

for time, title in sorted(get_movies(URL)):
    print('{:10}  {}'.format(time, title))
16:30 (2D)  Sauen Shaun
19:00       Staying alive
21:30       The Imitation Game

Detour: CSS selectors in Javascript

titles = element.querySelectorAll('.movieTitle')
canvas = element.querySelector('#drawing')
table = element.querySelector('table[summary="Group Block"]')

Supported in all modern browsers.

Click (command line library)

import click

@click.command()
def main():
    print('I has command line parsing!')

main()

http://click.pocoo.org/

Click: Minimal Curl

$ python3 curl.py http://uit.no/
<!DOCTYPE HTML>
<html>
<head>
...
$ python3 curl.py --help
Usage: curl.py [OPTIONS] URL

Fetch document and print to stdout.

Options:
  --help  Show this message and exit.

Click: Minimal Curl

import requests
import click

@click.command()
@click.argument('url')
def main(url):
    """Fetch document and print to stdout."""
    print(requests.get(url).text)

main()

Click: Command Groups

$ python3 calc.py add 1 2
3.0
$ python3 calc.py add --negate 1 2
-3.0
$ python3 calc.py square 2
4.0
$ python3 calc.py square NaN-NaN-NaN-NaN-Batman
Usage: calc.py square [OPTIONS] NUMBER

Error: Invalid value for "number": NaN-NaN-NaN-NaN-Batman
       is not a valid floating point value

Click: Command Groups

$ python3 calc.py
Usage: calc.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  add     Add two numbers.
  square  Print square of number.
$ python3 calc.py add --help
Usage: calc.py add [OPTIONS] [NUMBERS]...

  Add two numbers.

Options:
  --negate
  --help    Show this message and exit.

Click: Command Groups

import click

@click.group()
def cli():
    pass

@cli.command()
@click.argument('numbers', type=float, nargs=-1)
@click.option('--negate', is_flag=True)
def add(numbers, negate):
    """Add two numbers."""
    if negate:
        print(-sum(numbers))
    else:
        print(sum(numbers))

@cli.command()
@click.argument('number', type=float)
def square(number):
    """Print square of number."""
    print(number ** 2)

cli()
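
Click commands can also be exercised from Python, without a shell, using the CliRunner helper in click.testing. A small sketch assuming --negate is declared as a boolean flag (is_flag=True), so that it takes no value; only the add command is reproduced here:

```python
import click
from click.testing import CliRunner


@click.group()
def cli():
    pass


@cli.command()
@click.argument('numbers', type=float, nargs=-1)
@click.option('--negate', is_flag=True, help='Negate the sum.')
def add(numbers, negate):
    """Add numbers."""
    total = sum(numbers)
    click.echo(-total if negate else total)


runner = CliRunner()

# Equivalent to: python3 calc.py add --negate 1 2
result = runner.invoke(cli, ['add', '--negate', '1', '2'])
print(result.exit_code, result.output)

# Invalid input is reported as a usage error with a non-zero exit code.
bad = runner.invoke(cli, ['add', 'not-a-number'])
print(bad.exit_code)
```

click.echo is used instead of print here; both are captured by the runner.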

dataset: databases for lazy people

"Because managing databases in Python should be as simple as reading and writing JSON files."

http://dataset.readthedocs.org/

dataset: Example

import dataset

db = dataset.connect('sqlite:///:memory:')

table = db['sometable']
table.insert(dict(name='John Doe', age=37))
table.insert(dict(name='Jane Doe', age=34, gender='female'))

john = table.find_one(name='John Doe')
print(john)
OrderedDict([('id', 1), ('age', 37), ('name', 'John Doe'),
             ('gender', None)])

Similar code without dataset.
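
For comparison, a rough sketch of the same steps using only the standard library's sqlite3 module. This is an illustration written for this text, not the code the slide linked to; the point is that the schema has to be created and altered by hand:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row  # rows that can be converted to dicts

# The table must be declared up front.
conn.execute('CREATE TABLE sometable ('
             'id INTEGER PRIMARY KEY, name TEXT, age INTEGER)')
conn.execute('INSERT INTO sometable (name, age) VALUES (?, ?)',
             ('John Doe', 37))

# A new field means an explicit schema change before inserting.
conn.execute('ALTER TABLE sometable ADD COLUMN gender TEXT')
conn.execute('INSERT INTO sometable (name, age, gender) VALUES (?, ?, ?)',
             ('Jane Doe', 34, 'female'))

row = conn.execute('SELECT * FROM sometable WHERE name = ?',
                   ('John Doe',)).fetchone()
john = dict(row)
print(john)
conn.close()
```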

dataset Features

dataset: Syntax Examples

users = db['user'].all()
for user in db['user']:
    print(user['age'])
# All users from China
chinese_users = table.find(country='China')
result = db.query('SELECT country, COUNT(*) c '
                  'FROM user GROUP BY country')
for row in result:
    print(row['country'], row['c'])

IPy: IPv4/6 Addresses and Networks

>>> from IPy import IP

>>> local = IP('127.0.0.0/30')
>>> '127.0.0.1' in local
True
>>> '192.168.0.1' in local
False
>>> list(local)
[IP('127.0.0.0'), IP('127.0.0.1'),
 IP('127.0.0.2'), IP('127.0.0.3')]
>>> IP('127.0.0.0/30') == IP('0x7f000000/30')
True
>>> print(IP('127.0.0.0-127.255.255.255'))
127.0.0.0/8

https://pypi.python.org/pypi/IPy/

NetworkX: Graph Analysis

https://networkx.github.io/

Tutorial

NetworkX: Example Graph

import networkx as nx

g = nx.Graph()

g.add_nodes_from([
     'Tromsø', 'Oslo', 'Stockholm',
     'Helsinki', 'London', 'Amsterdam', 'Paris',
     'Tokyo', 'Taipei', 'Wellington'])

g.add_edges_from([
    ('Tromsø', 'London'), ('Tromsø', 'Oslo'),
    ('Oslo', 'London'), ('London', 'Amsterdam'),
    ('Amsterdam', 'Paris'), ('Amsterdam', 'Stockholm'),
    ('Stockholm', 'Helsinki'), ('Oslo', 'Stockholm'),
    ('Tokyo', 'Taipei'), ('Tokyo', 'Wellington'),
    ('Amsterdam', 'Wellington'), ('Tromsø', 'Tokyo')])

NetworkX: Shortest Path

images/graph.png
>>> nx.shortest_path(g, 'Tromsø', 'Wellington')
['Tromsø', 'Tokyo', 'Wellington']
>>> nx.shortest_path(g, 'Tromsø', 'Helsinki')
['Tromsø', 'Oslo', 'Stockholm', 'Helsinki']

(Hand drawn map for illustration. (svg, drawn in Inkscape.))
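
Beyond the path itself, the same graph answers related questions such as hop counts and reachability. A self-contained sketch that rebuilds the example graph from its edge list (nodes are created implicitly by add_edges_from):

```python
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ('Tromsø', 'London'), ('Tromsø', 'Oslo'),
    ('Oslo', 'London'), ('London', 'Amsterdam'),
    ('Amsterdam', 'Paris'), ('Amsterdam', 'Stockholm'),
    ('Stockholm', 'Helsinki'), ('Oslo', 'Stockholm'),
    ('Tokyo', 'Taipei'), ('Tokyo', 'Wellington'),
    ('Amsterdam', 'Wellington'), ('Tromsø', 'Tokyo')])

# Number of edges on the shortest route.
hops = nx.shortest_path_length(g, 'Tromsø', 'Helsinki')

# Is there any route at all between two cities?
reachable = nx.has_path(g, 'Paris', 'Taipei')

# Direct connections from one city.
neighbors = sorted(g.neighbors('Amsterdam'))

print(hops, reachable, neighbors)
```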

NetworkX Content

Algorithms:
Approximation, Assortativity, Bipartite, Blockmodeling, Boundary, Centrality, Chordal, Clique, Clustering, Communities, Components, Connectivity, Cores, Cycles, Directed Acyclic Graphs, Distance Measures, Distance-Regular Graphs, Dominating Sets, Eulerian, Flows, Graphical degree sequence, Hierarchy, Isolates, Isomorphism, Link Analysis, Link Prediction, Matching, Maximal independent set, Minimum Spanning Tree, Operators, Rich Club, Shortest Paths, Simple Paths, Swap, Traversal, Tree, Vitality
Graph generators:
Atlas, Classic, Small, Random Graphs, Degree Sequence, Random Clustered, Directed, Geometric, Hybrid, Bipartite, Line Graph, Ego Graph, Stochastic, Intersection, Social Networks
Linear algebra:
Graph Matrix, Laplacian Matrix, Spectrum, Algebraic Connectivity, Attribute Matrices

https://networkx.github.io/documentation/latest/reference/

End of Talk