The post Beautiful Soup 4 Python appeared first on PythonForBeginners.com.
This article is an introduction to BeautifulSoup 4 in Python. If you want to know more, I recommend reading the official documentation found here.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Beautiful Soup 3 has been replaced by Beautiful Soup 4. Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.
If you run Debian or Ubuntu, you can install Beautiful Soup with the system package manager:
apt-get install python-bs4
Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system package manager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.
easy_install beautifulsoup4
pip install beautifulsoup4
If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py:
python setup.py install
Right after the installation you can start using BeautifulSoup. At the beginning of your Python script, import the library. Then you have to pass something to BeautifulSoup to create a soup object: that can be a document or a URL. BeautifulSoup does not fetch the web page for you; you have to do that yourself. That’s why I use urllib2 in combination with the BeautifulSoup library.
There are several different filters you can use with the search API. Below I will show some examples of how you can pass those filters into methods such as find_all. You can filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the ‘b’ tags in the document (you can replace b with any tag you want to find):
soup.find_all('b')
If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its match() method. This code finds all the tags whose names start with the letter “b”, in this case, the ‘body’ tag and the ‘b’ tag:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
This code finds all the tags whose names contain the letter “t”:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the ‘a’ tags and all the ‘b’ tags:
print soup.find_all(["a", "b"])
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:
for tag in soup.find_all(True):
    print(tag.name)
If none of the other matches work for you, define a function that takes an element as its only argument. Please see the official documentation for the details.
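As a quick sketch of what such a function can look like (assuming bs4 is installed; the HTML snippet is made up for illustration): the function receives each tag in turn and returns True when the tag should be included in the results.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>one</p><p>two</p><b>three</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A filter function takes a single tag and returns True if it matches.
# This one matches <p> tags that carry a "class" attribute.
def has_class(tag):
    return tag.name == "p" and tag.has_attr("class")

print([t.get_text() for t in soup.find_all(has_class)])  # ['one']
```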
As an example, we’ll use the very website you are currently on (https://www.pythonforbeginners.com). To parse the data from the content, we simply create a BeautifulSoup object for it. That will create a soup object of the content of the URL we passed in. From this point on, we can use the Beautiful Soup methods on that soup object. For example, we can use the prettify method to turn a BeautifulSoup parse tree into a nicely formatted Unicode string.
The find_all method is one of the most common methods in BeautifulSoup. It looks through a tag’s descendants and retrieves all descendants that match your filters.
soup.find_all("title")
soup.find_all("p", "title")
soup.find_all("a")
soup.find_all(id="link2")
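To see what those four calls return, here is a sketch that runs them against a small, made-up document (assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

# A small hypothetical document to exercise the calls above.
doc = """<html><head><title>Demo</title></head><body>
<p class="title">Intro</p>
<a id="link1" href="/one">one</a>
<a id="link2" href="/two">two</a>
</body></html>"""
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all("title"))       # the <title> tag
print(soup.find_all("p", "title"))  # <p> tags whose CSS class is "title"
print(soup.find_all("a"))           # both <a> tags
print(soup.find_all(id="link2"))    # the tag whose id is "link2"
```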
Let’s see some examples of how to use Beautiful Soup 4:
from bs4 import BeautifulSoup
import urllib2
url = "https://www.pythonforbeginners.com"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
print soup.title
>>> <title>Python For Beginners</title>
print soup.title.string
>>> Python For Beginners
print soup.p
print soup.a
If you want to know how to navigate the tree, please see the official documentation. There you can read about the following things:
Going down
Going up
Going sideways
Going back and forth
One common task is extracting all the URLs found within a page’s ‘a’ tags. Using the find_all method gives us a whole list of elements with the tag “a”:
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
https://www.pythonforbeginners.com
https://www.pythonforbeginners.com/python-overview-start-here/
https://www.pythonforbeginners.com/dictionary/
https://www.pythonforbeginners.com/python-functions-cheat-sheet/
https://www.pythonforbeginners.com/lists/python-lists-cheat-sheet/
https://www.pythonforbeginners.com/loops/
https://www.pythonforbeginners.com/python-modules/
https://www.pythonforbeginners.com/strings/
https://www.pythonforbeginners.com/sitemap/
...
...
Another common task is extracting all the text from a page:
print(soup.get_text())
Output:
Python For Beginners
Python Basics
Dictionary
Functions
Lists
Loops
Modules
Strings
Sitemap
...
...
As a last example, let’s grab all the links from Reddit:
from bs4 import BeautifulSoup
import urllib2
redditFile = urllib2.urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
#content
http://www.reddit.com/r/AdviceAnimals/
http://www.reddit.com/r/announcements/
http://www.reddit.com/r/AskReddit/
http://www.reddit.com/r/atheism/
http://www.reddit.com/r/aww/
http://www.reddit.com/r/bestof/
http://www.reddit.com/r/blog/
http://www.reddit.com/r/funny/
http://www.reddit.com/r/gaming/
http://www.reddit.com/r/IAmA/
http://www.reddit.com/r/movies/
http://www.reddit.com/r/Music/
http://www.reddit.com/r/pics/
http://www.reddit.com/r/politics/
...
For more information, please see the official documentation.
The post Using Feedparser in Python appeared first on PythonForBeginners.com.
In this post we will take a look at how we can download and parse syndicated
feeds with Python.
The Python module we will use for that is “Feedparser”.
The complete documentation can be found here.
RSS stands for Rich Site Summary and uses standard web feed formats to publish
frequently updated information: blog entries, news headlines, audio, video.
An RSS document (called “feed”, “web feed”, or “channel”) includes full or
summarized text, and metadata, like publishing date and author’s name. [source]
Feedparser is a Python library that parses feeds in all known formats, including
Atom, RSS, and RDF. It runs on Python 2.4 all the way up to 3.3. [source]
Before we install the feedparser module and start to code, let’s take a look
at some of the available RSS elements.
The most commonly used elements in RSS feeds are “title”, “link”, “description”,
“publication date”, and “entry ID”.
The less commonly used elements are “image”, “categories”, “enclosures”
and “cloud”.
To install feedparser on your computer, open your terminal and install it using
“pip” (a tool for installing and managing Python packages):
sudo pip install feedparser
To verify that feedparser is installed, you can run a “pip list”.
You can of course also enter the interactive mode, and import the feedparser
module there.
If you see an output like below, you can be sure it’s installed.
>>> import feedparser
>>>
Now that we have installed the feedparser module, we can go ahead and begin
to work with it.
You can use any RSS feed that you want. Since I like to read Reddit, I will use
that for my example.
Reddit is made up of many sub-reddits; the one I am particularly interested in for
now is the “Python” sub-reddit.
The way to get the RSS feed, is just to look up the URL to that sub-reddit and
add a “.rss” to it.
The RSS feed that we need for the python sub-reddit would be:
http://www.reddit.com/r/python/.rss
You start your program with importing the feedparser module.
import feedparser
Create the feed. Put in the RSS feed that you want.
d = feedparser.parse('http://www.reddit.com/r/python/.rss')
The channel elements are available in d.feed (Remember the “RSS Elements” above)
The items are available in d.entries, which is a list.
You access items in the list in the same order in which they appear in the
original feed, so the first item is available in d.entries[0].
Print the title of the feed
print d['feed']['title']
>>> Python
Resolves relative links
print d['feed']['link']
>>> http://www.reddit.com/r/Python/
Parse escaped HTML
print d.feed.subtitle
>>> news about the dynamic, interpreted, interactive, object-oriented, extensible
programming language Python
See number of entries
print len(d['entries'])
>>> 25
Each entry in the feed is a dictionary. Use [0] to print the first entry.
print d['entries'][0]['title']
>>> Functional Python made easy with a new library: Funcy
Print the first entry and its link
print d.entries[0]['link']
>>> http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_
library/
Use a for loop to print all posts and their links.
for post in d.entries:
    print post.title + ": " + post.link + "\n"
>>>
Functional Python made easy with a new library: Funcy: http://www.reddit.com/r/Python/
comments/1oej74/functional_python_made_easy_with_a_new_
library/
Python Packages Open Sourced: http://www.reddit.com/r/Python/comments/1od7nn/
python_packages_open_sourced/
PyEDA 0.15.0 Released: http://www.reddit.com/r/Python/comments/1oet5m/
pyeda_0150_released/
PyMongo 2.6.3 Released: http://www.reddit.com/r/Python/comments/1ocryg/
pymongo_263_released/
.....
.......
........
Reports the feed type and version
print d.version
>>> rss20
Full access to all HTTP headers
print d.headers
>>>
{'content-length': '5393', 'content-encoding': 'gzip', 'vary': 'accept-encoding', 'server':
"'; DROP TABLE servertypes; --", 'connection': 'close', 'date': 'Mon, 14 Oct 2013 09:13:34
GMT', 'content-type': 'text/xml; charset=UTF-8'}
Just get the content-type from the header
print d.headers.get('content-type')
>>> text/xml; charset=UTF-8
Using the feedparser is an easy and fun way to parse RSS feeds.
http://www.slideshare.net/LindseySmith1/feedparser
http://code.google.com/p/feedparser/
The post Using the Requests Library in Python appeared first on PythonForBeginners.com.
Requests is an Apache2-licensed HTTP library, written in Python. It is designed to make interacting with HTTP services easy for humans. This means you don’t have to manually add query strings to URLs, or form-encode your POST data. Don’t worry if that made no sense to you. It will in due time.
What can Requests do?
Requests will allow you to send HTTP/1.1 requests using Python. With it, you can add content like headers, form data, multipart files, and parameters to your requests. It also allows you to access the response data in the same simple way.
In programming, a library is a collection or pre-configured selection of routines, functions, and operations that a program can use. These elements are often referred to as modules, and stored in object format.
Libraries are important because you can load a module and take advantage of everything it offers without re-writing that code in every program that relies on it. They are truly standalone: you can build your own programs with them, and yet they remain separate from other programs.
Think of modules as a sort of code template.
To reiterate, Requests is a Python library.
The good news is that there are a few ways to install the Requests library. To see the full list of options at your disposal, you can view the official install documentation for Requests here.
You can make use of pip, easy_install, or tarball.
If you’d rather work with source code, you can get that on GitHub, as well.
For the purpose of this guide, we are going to use pip to install the library.
In your terminal, type the following:
pip install requests
To work with the Requests library in Python, you must import the appropriate module. You can do this simply by adding the following code at the beginning of your script:
import requests
Of course, to do any of this – installing the library included – you need to download the necessary package first and have it accessible to the interpreter.
When you ping a website or portal for information this is called making a request. That is exactly what the Requests library has been designed to do.
To get a webpage you would do something like the following:
r = requests.get('https://github.com/timeline.json')
Before you can do anything with a website or URL in Python, it’s a good idea to check the current status code of said portal. You can do this with the dictionary look-up object.
r = requests.get('https://github.com/timeline.json')
r.status_code
>>> 200
r.status_code == requests.codes.ok
>>> True
requests.codes['temporary_redirect']
>>> 307
requests.codes.teapot
>>> 418
requests.codes['\o/']
>>> 200
After a web server returns a response, you can collect the content you need. This is also done using the requests.get function.
import requests
r = requests.get('https://github.com/timeline.json')
print r.text
# The Requests library also comes with a built-in JSON decoder,
# just in case you have to deal with JSON data
import requests
r = requests.get('https://github.com/timeline.json')
print r.json()
By utilizing a Python dictionary, you can access and view a server’s response headers. Thanks to how Requests works, you can access the headers using any capitalization you’d like.
If you perform this function but a header doesn’t exist in the response, the value will default to None.
r.headers
{
'status': '200 OK',
'content-encoding': 'gzip',
'transfer-encoding': 'chunked',
'connection': 'close',
'server': 'nginx/1.0.4',
'x-runtime': '148ms',
'etag': '"e1ca502697e5c9317743dc078f67693f"',
'content-type': 'application/json; charset=utf-8'
}
r.headers['Content-Type']
>>>'application/json; charset=utf-8'
r.headers.get('content-type')
>>>'application/json; charset=utf-8'
r.headers.get('X-Random')
>>>None
# Get the headers of a given URL
resp = requests.head("http://www.google.com")
print resp.status_code, resp.text, resp.headers
Requests will automatically decode any content pulled from a server, and most Unicode character sets are decoded seamlessly.
When you make a request to a server, the Requests library makes an educated guess about the encoding of the response, based on the HTTP headers. The guessed encoding is used when you access r.text. Through the r.encoding property you can find out what encoding Requests is using, and change it if need be. If you change the encoding value, Requests will use the new encoding whenever you call r.text.
print r.encoding
>>> utf-8
r.encoding = 'ISO-8859-1'
If you want to add custom HTTP headers to a request, you must pass them through a dictionary to the headers parameter.
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
r = requests.post(url, data=json.dumps(payload), headers=headers)
Requests will automatically perform a location redirection when you use the GET and OPTIONS verbs in Python.
GitHub will redirect all HTTP requests to HTTPS automatically. This keeps things secure and encrypted.
You can use the history method of the response object to track redirection status.
r = requests.get('http://github.com')
r.url
>>> 'https://github.com/'
r.status_code
>>> 200
r.history
>>> []
You can also handle post requests using the Requests library.
r = requests.post("http://httpbin.org/post")
But you can also rely on other HTTP requests too, like PUT, DELETE, HEAD, and OPTIONS.
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
You can use these methods to accomplish a great many things. For instance, using a Python script to create a GitHub repo.
import requests, json
github_url = "https://api.github.com/user/repos"
data = json.dumps({'name':'test', 'description':'some test repo'})
r = requests.post(github_url, data, auth=('user', '*****'))
print r.json()
There are a number of exceptions and error codes you need to be familiar with when using the Requests library in Python.
Any exceptions that Requests raises will be inherited from the requests.exceptions.RequestException object.
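As a minimal sketch of catching them (no network access needed: a URL with no scheme makes Requests raise MissingSchema, which is a subclass of RequestException):

```python
import requests

try:
    # A malformed URL fails before any network I/O happens.
    requests.get('not-a-valid-url')
except requests.exceptions.RequestException as err:
    caught = type(err).__name__
    print('Request failed:', caught)
```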
You can read more about the Requests library at the links below.
http://docs.python-requests.org/en/latest/api/
http://pypi.python.org/pypi/requests
http://docs.python-requests.org/en/latest/user/quickstart/
http://isbullsh.it/2012/06/Rest-api-in-python/#requests
The post How to use the Pexpect Module appeared first on PythonForBeginners.com.
The reason I started to use Pexpect was that I was looking for a module that could take care of some of my automation needs (mostly with ssh and ftp).
You can use other modules such as subprocess, but I find this module easier to use.
Note that this post is not aimed at the complete Python beginner, but hey, it’s always fun to learn new things.
Pexpect is a pure Python module that makes Python a better tool for controlling
and automating other programs.
Pexpect is basically a pattern matching system. It runs program and watches
output.
When output matches a given pattern Pexpect can respond as if a human were
typing responses.
Pexpect can be used for automation, testing, and screen scraping.
Pexpect can be used for automating interactive console applications such as
ssh, ftp, passwd, telnet, etc.
It can also be used to control web applications via `lynx`, `w3m`, or some
other text-based web browser.
The latest version of Pexpect can be found here
wget http://pexpect.sourceforge.net/pexpect-2.3.tar.gz
tar xzf pexpect-2.3.tar.gz
cd pexpect-2.3
sudo python ./setup.py install
# If your system supports yum or apt-get, you might be able to use the
# commands below to install the pexpect package.
sudo yum install pexpect.noarch
# or
sudo apt-get install python-pexpect
There are two important methods in Pexpect: expect() and send() (or sendline()
which is like send() with a linefeed).
expect() waits for the child application to return a given string.
The string you specify is a regular expression, so you can match complicated
patterns.
Remember that any time you try to match a pattern that needs look-ahead,
you will always get a minimal match.
The following will always return just one character:
child.expect('.+')
Specify the text you expect back as precisely as you can; you can add ‘.*’
to the beginning or to the end of the text you’re expecting to make sure
you’re catching any unexpected characters.
This example will match successfully, but will always return no characters:
child.expect('.*')
Generally any star * expression will match as little as possible.
The pattern given to expect() may also be a list of regular expressions,
this allows you to match multiple optional responses.
(example if you get various responses from the server)
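When given a list, expect() returns the index of the pattern that matched, so you can branch on the response. A small sketch (assuming pexpect is installed, on a POSIX system; a trivial echo command is spawned so no server is needed):

```python
import pexpect

# Spawn a trivial command so the demo needs no network or server.
child = pexpect.spawn('echo hello world')

# expect() returns the index of whichever pattern in the list
# matched the child's output first.
index = child.expect(['goodbye', 'hello', pexpect.EOF])
print(index)  # 1, because 'hello' matched
```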
send() writes a string to the child application.
From the child’s point of view it looks just like someone typed the text from
a terminal.
After each call to expect() the before and after properties will be set to the
text printed by child application.
The before property will contain all text up to the expected string pattern.
You can use child.before to print the output from the other side of the connection.
The after string will contain the text that was matched by the expected pattern.
The match property is set to the re MatchObject.
This connects to the openbsd ftp site and downloads the recursive directory
listing.
You can use this technique with any application.
This is especially handy if you are writing automated test tools.
Again, this example is copied from here
import pexpect
child = pexpect.spawn ('ftp ftp.openbsd.org')
child.expect ('Name .*: ')
child.sendline ('anonymous')
child.expect ('Password:')
child.sendline ('[email protected]')
child.expect ('ftp> ')
child.sendline ('cd pub')
child.expect('ftp> ')
child.sendline ('get ls-lR.gz')
child.expect('ftp> ')
child.sendline ('bye')
In the second example, we can see how to get back control from Pexpect.
This example uses ftp to login to the OpenBSD site (just as above),
list files in a directory and then pass interactive control of the ftp session
to the human user.
import pexpect
child = pexpect.spawn ('ftp ftp.openbsd.org')
child.expect ('Name .*: ')
child.sendline ('anonymous')
child.expect ('Password:')
child.sendline ('[email protected]')
child.expect ('ftp> ')
child.sendline ('ls /pub/OpenBSD/')
child.expect ('ftp> ')
print child.before # Print the result of the ls command.
child.interact() # Give control of the child to the user.
There are special patterns to match the End Of File or a Timeout condition.
I will not cover them in detail in this article, but refer to the official
documentation, because it is good to know how they work.
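As a brief sketch (same assumptions as the earlier examples: pexpect installed, POSIX system), pexpect.EOF and pexpect.TIMEOUT can be placed in the pattern list so those conditions are reported as an index instead of being raised as exceptions:

```python
import pexpect

child = pexpect.spawn('echo done')

# 'never-printed' will not match, so expect() falls through to EOF
# once the child exits; TIMEOUT would match if nothing arrived in time.
index = child.expect(['never-printed', pexpect.EOF, pexpect.TIMEOUT], timeout=5)
if index == 1:
    print('child exited (EOF)')
elif index == 2:
    print('timed out')
```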
The post Python Collections Counter appeared first on PythonForBeginners.com.
The Collections module implements high-performance container datatypes (beyond
the built-in types list, dict, and tuple) and contains many useful data structures
that you can use to store information in memory.
This article will be about the Counter object.
A Counter is a container that tracks how many times equivalent values are added.
It can be used to implement the same algorithms for which other languages commonly
use bag or multiset data structures.
Writing import collections makes everything in the module available as
collections.something:
import collections
Since we are only going to use the Counter, we can simply do this:
from collections import Counter
Counter supports three forms of initialization.
Its constructor can be called with a sequence of items (an iterable), with a
dictionary containing keys and counts (a mapping), or with keyword arguments
mapping string names to counts (keyword args).
import collections
print collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
print collections.Counter({'a':2, 'b':3, 'c':1})
print collections.Counter(a=2, b=3, c=1)
The results of all three forms of initialization are the same.
$ python collections_counter_init.py
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
An empty Counter can be constructed with no arguments and populated via the
update() method.
import collections
c = collections.Counter()
print 'Initial :', c
c.update('abcdaab')
print 'Sequence:', c
c.update({'a':1, 'd':5})
print 'Dict :', c
The count values are increased based on the new data, rather than replaced.
In this example, the count for a goes from 3 to 4.
$ python collections_counter_update.py
Initial : Counter()
Sequence: Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
Dict : Counter({'d': 6, 'a': 4, 'b': 2, 'c': 1})
Once a Counter is populated, its values can be retrieved using the dictionary API.
import collections
c = collections.Counter('abcdaab')
for letter in 'abcde':
    print '%s : %d' % (letter, c[letter])
Counter does not raise KeyError for unknown items.
If a value has not been seen in the input (as with e in this example),
its count is 0.
$ python collections_counter_get_values.py
a : 3
b : 2
c : 1
d : 1
e : 0
The elements() method returns an iterator over elements repeating each as many
times as its count.
Elements are returned in arbitrary order.
import collections
c = collections.Counter('extremely')
c['z'] = 0
print c
print list(c.elements())
The order of elements is not guaranteed, and items with counts less than zero are
not included.
$ python collections_counter_elements.py
Counter({'e': 3, 'm': 1, 'l': 1, 'r': 1, 't': 1, 'y': 1, 'x': 1, 'z': 0})
['e', 'e', 'e', 'm', 'l', 'r', 't', 'y', 'x']
Use most_common() to produce a sequence of the n most frequently encountered
input values and their respective counts.
import collections
c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
    for line in f:
        c.update(line.rstrip().lower())
print 'Most common:'
for letter, count in c.most_common(3):
    print '%s: %7d' % (letter, count)
This example counts the letters appearing in all of the words in the system
dictionary to produce a frequency distribution, then prints the three most common
letters.
Leaving out the argument to most_common() produces a list of all the items,
in order of frequency.
$ python collections_counter_most_common.py
Most common:
e: 234803
i: 200613
a: 198938
Counter instances support arithmetic and set operations for aggregating results.
import collections
c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
c2 = collections.Counter('alphabet')
print 'C1:', c1
print 'C2:', c2
print '\nCombined counts:'
print c1 + c2
print '\nSubtraction:'
print c1 - c2
print '\nIntersection (taking positive minimums):'
print c1 & c2
print '\nUnion (taking maximums):'
print c1 | c2
Each time a new Counter is produced through an operation, any items with zero or
negative counts are discarded.
The count for a is the same in c1 and c2, so subtraction leaves it at zero.
$ python collections_counter_arithmetic.py
C1: Counter({'b': 3, 'a': 2, 'c': 1})
C2: Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Combined counts:
Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Subtraction:
Counter({'b': 2, 'c': 1})
Intersection (taking positive minimums):
Counter({'a': 2, 'b': 1})
Union (taking maximums):
Counter({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
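Related to the '-' operator above, Counter also has an in-place subtract() method. Unlike the operator, it keeps items whose counts drop to zero or below. A quick sketch:

```python
from collections import Counter

c = Counter(a=4, b=2, c=0)
c.subtract(Counter(a=1, b=3))

# subtract() mutates c in place and, unlike c1 - c2,
# keeps zero and negative counts.
print(c['a'])  # 3
print(c['b'])  # -1
print(c['c'])  # 0
```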
Tally occurrences of words in a list.
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
The counter takes an iterable and could also be written like this:
mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
cnt = Counter(mywords)
print cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
Find the ten most common words in Hamlet
import re
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
print Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
Please don’t forget to read the links below for more information.
http://www.doughellmann.com/PyMOTW/collections/
http://docs.python.org/2/library/collections.html#collections.Counter
The post OS.Walk and Fnmatch in Python appeared first on PythonForBeginners.com.
In an earlier post, OS.walk in Python, I described how to use os.walk and showed some examples of how to use it in scripts.
In this article, I will show how to use the os.walk() module function to walk a directory tree, and the fnmatch module for matching file names.
os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up.
For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
dirpath # is a string, the path to the directory.
dirnames # is a list of the names of the subdirectories in dirpath (excluding ‘.’ and ‘..’).
filenames # is a list of the names of the non-directory files in dirpath.
Note that the names in the lists contain no path components.
To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name). For more information, please see the Python Docs.
The fnmatch module compares file names against glob-style patterns such as used by Unix shells.
These are not the same as the more sophisticated regular expression rules. It’s purely a string matching operation.
If you find it more convenient to use a different pattern style, for example regular expressions, then simply use regex operations to match your filenames. http://www.doughellmann.com/PyMOTW/fnmatch/
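If you do want regular expressions, fnmatch can even do the conversion for you: fnmatch.translate() turns a glob-style pattern into an equivalent regex string. A small sketch:

```python
import fnmatch
import re

# translate() converts a glob-style pattern into a regular
# expression string that the re module understands.
pattern = fnmatch.translate('*.mp3')
regex = re.compile(pattern, re.IGNORECASE)

print(bool(regex.match('song.mp3')))  # True
print(bool(regex.match('song.wav')))  # False
```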
The fnmatch module is used for the wild-card pattern matching.
Simple Matching
fnmatch() compares a single file name against a pattern and returns a boolean indicating whether or not they match. The comparison is case-sensitive when the operating system uses a case-sensitive file system.
Filtering
To test a sequence of filenames, you can use filter(). It returns a list of the names that match the pattern argument.
This script will search for *.mp3 files from the rootPath (“/”)
import fnmatch
import os
rootPath = '/'
pattern = '*.mp3'
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))
This script uses ‘os.walk’ and ‘fnmatch’ with filters to search the hard-drive for all image files
import fnmatch
import os
images = ['*.jpg', '*.jpeg', '*.png', '*.tif', '*.tiff']
matches = []
for root, dirnames, filenames in os.walk("C:\\"):
    for extension in images:
        for filename in fnmatch.filter(filenames, extension):
            matches.append(os.path.join(root, filename))
There are many other (and faster) ways to do this, but now you understand the basics of it.
http://rosettacode.org/wiki/Walk_a_directory/Recursively#Python
Stackoverflow oswalk with fnmatch
The post Python Docstrings appeared first on PythonForBeginners.com.
Python documentation strings (or docstrings) provide a convenient way of associating documentation with Python modules, functions, classes, and methods.
An object’s docstring is defined by including a string constant as the first statement in the object’s definition. It’s specified in the source code and is used, like a comment, to document a specific segment of code.
Unlike conventional source code comments, the docstring should describe what the function does, not how.
All functions should have a docstring. This allows the program to inspect these comments at run time, for instance as an interactive help system, or as metadata.
Docstrings can be accessed by the __doc__ attribute on objects.
The docstring line should begin with a capital letter and end with a period. The first line should be a short description.
Don’t write the name of the object. If there are more lines in the documentation string, the second line should be blank, visually separating the summary from the rest of the description.
The following lines should be one or more paragraphs describing the object’s calling conventions, its side effects, etc.
Let’s look at an example of a multi-line docstring:
def my_function():
    """Do nothing, but document it.

    No, really, it doesn't do anything.
    """
    pass
Let’s see how this looks when we print it:
>>> print my_function.__doc__
Do nothing, but document it.
No, really, it doesn't do anything.
The following Python file shows the declaration of docstrings within a python source file:
"""
Assuming this is file mymodule.py, then this string, being the
first statement in the file, will become the "mymodule" module's
docstring when the file is imported.
"""
class MyClass(object):
    """The class's docstring"""

    def my_method(self):
        """The method's docstring"""

def my_function():
    """The function's docstring"""
The following is an interactive session showing how the docstrings may be accessed:
>>> import mymodule
>>> help(mymodule)
Assuming this is file mymodule.py, then this string, being the first statement in the file, will become the "mymodule" module's docstring when the file is imported.
>>> help(mymodule.MyClass)
The class's docstring
>>> help(mymodule.MyClass.my_method)
The method's docstring
>>> help(mymodule.my_function)
The function's docstring
The post How to import modules in Python appeared first on PythonForBeginners.com.
Python modules make programming a lot easier.
A module is basically a file that consists of already written code.
When Python imports a module, it first checks the module registry (sys.modules)
to see if the module is already imported.
If that’s the case, Python uses the existing module object as is.
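You can observe that cache directly. A small sketch, using the standard json module as the example import:

```python
import sys
import json

# After the first import, the module object is cached in sys.modules.
cached = sys.modules['json']

# A second import does not re-execute the module's code;
# it simply returns the cached object.
import json
print(json is cached)  # True
```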
There are different ways to import a module:
import sys
# access module, after this you can use sys.name to refer to
# things defined in module sys.
from sys import stdout
# access module without qualifying name.
# This reads >> from the module "sys" import "stdout", so that
# we would be able to refer "stdout" in our program.
from sys import *
# access all functions/classes in the sys module.