The post Beautiful Soup 4 Python appeared first on PythonForBeginners.com.
This article is an introduction to BeautifulSoup 4 in Python. If you want to know more, I recommend reading the official documentation found here.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Beautiful Soup 3 has been replaced by Beautiful Soup 4. Beautiful Soup 3 only works on Python 2.x, but Beautiful Soup 4 also works on Python 3.x. Beautiful Soup 4 is faster, has more features, and works with third-party parsers like lxml and html5lib. You should use Beautiful Soup 4 for all new projects.
If you run Debian or Ubuntu, you can install Beautiful Soup with the system package manager:
apt-get install python-bs4
Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system package manager, you can install it with easy_install or pip. The package name is beautifulsoup4, and the same package works on Python 2 and Python 3.
easy_install beautifulsoup4
pip install beautifulsoup4
If you don’t have easy_install or pip installed, you can download the Beautiful Soup 4 source tarball and install it with setup.py:
python setup.py install
Right after the installation you can start using BeautifulSoup. At the beginning of your Python script, import the library. Then you have to pass something to BeautifulSoup to create a soup object: that can be a document or a URL. BeautifulSoup does not fetch the web page for you; you have to do that yourself. That’s why I use urllib2 in combination with the BeautifulSoup library.
There are several different filters you can use with the search API. Below I will show some examples of how you can pass those filters into methods such as find_all. You can filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string. This code finds all the ‘b’ tags in the document (you can replace b with any tag you want to find):
soup.find_all('b')
If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.
If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its match() method. This code finds all the tags whose names start with the letter “b”, in this case, the ‘body’ tag and the ‘b’ tag:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
This code finds all the tags whose names contain the letter “t”:
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
If you pass in a list, Beautiful Soup will allow a string match against any item in that list. This code finds all the ‘a’ tags and all the ‘b’ tags:
print soup.find_all(["a", "b"])
The value True matches everything it can. This code finds all the tags in the document, but none of the text strings:
for tag in soup.find_all(True):
    print(tag.name)
If none of the other matches work for you, define a function that takes an element as its only argument. Please see the official documentation for the details.
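As a quick sketch of what such a function can look like (assuming bs4 is installed; the HTML snippet is made up for illustration): the function receives each tag in turn and returns True when the tag should be included in the results.

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='intro'>one</p><p>two</p><b>three</b></body></html>"
soup = BeautifulSoup(html, "html.parser")

# A filter function takes a single tag and returns True if it matches.
# This one matches <p> tags that carry a "class" attribute.
def has_class(tag):
    return tag.name == "p" and tag.has_attr("class")

print([t.get_text() for t in soup.find_all(has_class)])  # ['one']
```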
As an example, we’ll use the very website you are currently on (https://www.pythonforbeginners.com). To parse the data from the content, we simply create a BeautifulSoup object for it. That will create a soup object of the content of the URL we passed in. From this point on, we can use the Beautiful Soup methods on that soup object. For example, we can use the prettify method to turn a BeautifulSoup parse tree into a nicely formatted Unicode string.
The find_all method is one of the most common methods in BeautifulSoup. It looks through a tag’s descendants and retrieves all descendants that match your filters.
soup.find_all("title")
soup.find_all("p", "title")
soup.find_all("a")
soup.find_all(id="link2")
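To see what those four calls return, here is a sketch that runs them against a small, made-up document (assuming bs4 is installed):

```python
from bs4 import BeautifulSoup

# A small hypothetical document to exercise the calls above.
doc = """<html><head><title>Demo</title></head><body>
<p class="title">Intro</p>
<a id="link1" href="/one">one</a>
<a id="link2" href="/two">two</a>
</body></html>"""
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all("title"))       # the <title> tag
print(soup.find_all("p", "title"))  # <p> tags whose CSS class is "title"
print(soup.find_all("a"))           # both <a> tags
print(soup.find_all(id="link2"))    # the tag whose id is "link2"
```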
Let’s see some examples of how to use Beautiful Soup 4:
from bs4 import BeautifulSoup
import urllib2
url = "https://www.pythonforbeginners.com"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)
print soup.prettify()
print soup.title
>>> <title>Python For Beginners</title>
print soup.title.string
>>> Python For Beginners
print soup.p
print soup.a
If you want to know how to navigate the tree, please see the official documentation. There you can read about the following things:
Going down
Going up
Going sideways
Going back and forth
One common task is extracting all the URLs found within a page’s ‘a’ tags. Using the find_all method gives us a whole list of elements with the tag “a”:
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
https://www.pythonforbeginners.com
https://www.pythonforbeginners.com/python-overview-start-here/
https://www.pythonforbeginners.com/dictionary/
https://www.pythonforbeginners.com/python-functions-cheat-sheet/
https://www.pythonforbeginners.com/lists/python-lists-cheat-sheet/
https://www.pythonforbeginners.com/loops/
https://www.pythonforbeginners.com/python-modules/
https://www.pythonforbeginners.com/strings/
https://www.pythonforbeginners.com/sitemap/
...
...
Another common task is extracting all the text from a page:
print(soup.get_text())
Output:
Python For Beginners
Python Basics
Dictionary
Functions
Lists
Loops
Modules
Strings
Sitemap
...
...
As a last example, let’s grab all the links from Reddit:
from bs4 import BeautifulSoup
import urllib2
redditFile = urllib2.urlopen("http://www.reddit.com")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
#content
http://www.reddit.com/r/AdviceAnimals/
http://www.reddit.com/r/announcements/
http://www.reddit.com/r/AskReddit/
http://www.reddit.com/r/atheism/
http://www.reddit.com/r/aww/
http://www.reddit.com/r/bestof/
http://www.reddit.com/r/blog/
http://www.reddit.com/r/funny/
http://www.reddit.com/r/gaming/
http://www.reddit.com/r/IAmA/
http://www.reddit.com/r/movies/
http://www.reddit.com/r/Music/
http://www.reddit.com/r/pics/
http://www.reddit.com/r/politics/
...
For more information, please see the official documentation.
The post Using Feedparser in Python appeared first on PythonForBeginners.com.
In this post we will take a look at how we can download and parse syndicated
feeds with Python.
The Python module we will use for that is “Feedparser”.
The complete documentation can be found here.
RSS stands for Rich Site Summary and uses standard web feed formats to publish
frequently updated information: blog entries, news headlines, audio, video.
An RSS document (called “feed”, “web feed”, or “channel”) includes full or
summarized text, and metadata, like publishing date and author’s name. [source]
Feedparser is a Python library that parses feeds in all known formats, including
Atom, RSS, and RDF. It runs on Python 2.4 all the way up to 3.3. [source]
Before we install the feedparser module and start to code, let’s take a look
at some of the available RSS elements.
The most commonly used elements in RSS feeds are “title”, “link”, “description”,
“publication date”, and “entry ID”.
The less commonly used elements are “image”, “categories”, “enclosures”
and “cloud”.
To install feedparser on your computer, open your terminal and install it using
“pip” (a tool for installing and managing Python packages):
sudo pip install feedparser
To verify that feedparser is installed, you can run a “pip list”.
You can of course also enter the interactive mode, and import the feedparser
module there.
If you see an output like below, you can be sure it’s installed.
>>> import feedparser
>>>
Now that we have installed the feedparser module, we can go ahead and begin
to work with it.
You can use any RSS feed that you want. Since I like to read Reddit, I will use
that for my example.
Reddit is made up of many sub-reddits; the one I am particularly interested in for
now is the “Python” sub-reddit.
The way to get the RSS feed, is just to look up the URL to that sub-reddit and
add a “.rss” to it.
The RSS feed that we need for the python sub-reddit would be:
http://www.reddit.com/r/python/.rss
You start your program with importing the feedparser module.
import feedparser
Create the feed. Put in the RSS feed that you want.
d = feedparser.parse('http://www.reddit.com/r/python/.rss')
The channel elements are available in d.feed (Remember the “RSS Elements” above)
The items are available in d.entries, which is a list.
You access items in the list in the same order in which they appear in the
original feed, so the first item is available in d.entries[0].
Print the title of the feed
print d['feed']['title']
>>> Python
Resolves relative links
print d['feed']['link']
>>> http://www.reddit.com/r/Python/
Parse escaped HTML
print d.feed.subtitle
>>> news about the dynamic, interpreted, interactive, object-oriented, extensible
programming language Python
See number of entries
print len(d['entries'])
>>> 25
Each entry in the feed is a dictionary. Use [0] to print the first entry.
print d['entries'][0]['title']
>>> Functional Python made easy with a new library: Funcy
Print the first entry and its link
print d.entries[0]['link']
>>> http://www.reddit.com/r/Python/comments/1oej74/functional_python_made_easy_with_a_new_
library/
Use a for loop to print all posts and their links.
for post in d.entries:
    print post.title + ": " + post.link + "\n"
>>>
Functional Python made easy with a new library: Funcy: http://www.reddit.com/r/Python/
comments/1oej74/functional_python_made_easy_with_a_new_
library/
Python Packages Open Sourced: http://www.reddit.com/r/Python/comments/1od7nn/
python_packages_open_sourced/
PyEDA 0.15.0 Released: http://www.reddit.com/r/Python/comments/1oet5m/
pyeda_0150_released/
PyMongo 2.6.3 Released: http://www.reddit.com/r/Python/comments/1ocryg/
pymongo_263_released/
.....
.......
........
Reports the feed type and version
print d.version
>>> rss20
Full access to all HTTP headers
print d.headers
>>>
{'content-length': '5393', 'content-encoding': 'gzip', 'vary': 'accept-encoding', 'server':
"'; DROP TABLE servertypes; --", 'connection': 'close', 'date': 'Mon, 14 Oct 2013 09:13:34
GMT', 'content-type': 'text/xml; charset=UTF-8'}
Just get the content-type from the header
print d.headers.get('content-type')
>>> text/xml; charset=UTF-8
Using the feedparser is an easy and fun way to parse RSS feeds.
http://www.slideshare.net/LindseySmith1/feedparser
http://code.google.com/p/feedparser/
The post Using the Requests Library in Python appeared first on PythonForBeginners.com.
Requests is an Apache2-licensed HTTP library, written in Python. It is designed to make interacting with HTTP services easy for humans. This means you don’t have to manually add query strings to URLs, or form-encode your POST data. Don’t worry if that made no sense to you. It will in due time.
What can Requests do?
Requests will allow you to send HTTP/1.1 requests using Python. With it, you can add content like headers, form data, multipart files, and parameters to your requests. It also allows you to access the response data in the same simple way.
In programming, a library is a collection or pre-configured selection of routines, functions, and operations that a program can use. These elements are often referred to as modules, and stored in object format.
Libraries are important because you can load a module and take advantage of everything it offers without re-writing that code in every program that relies on it. They are truly standalone: you can build your own programs with them, and yet they remain separate from other programs.
Think of modules as a sort of code template.
To reiterate, Requests is a Python library.
The good news is that there are a few ways to install the Requests library. To see the full list of options at your disposal, you can view the official install documentation for Requests here.
You can make use of pip, easy_install, or tarball.
If you’d rather work with source code, you can get that on GitHub, as well.
For the purpose of this guide, we are going to use pip to install the library.
In your terminal, type the following:
pip install requests
To work with the Requests library in Python, you must import the appropriate module. You can do this simply by adding the following code at the beginning of your script:
import requests
Of course, to do any of this – installing the library included – you need to download the necessary package first and have it accessible to the interpreter.
When you ping a website or portal for information this is called making a request. That is exactly what the Requests library has been designed to do.
To get a webpage you would do something like the following:
r = requests.get('https://github.com/timeline.json')
Before you can do anything with a website or URL in Python, it’s a good idea to check the current status code of said portal. You can do this with the dictionary look-up object.
r = requests.get('https://github.com/timeline.json')
r.status_code
>>> 200
r.status_code == requests.codes.ok
>>> True
requests.codes['temporary_redirect']
>>> 307
requests.codes.teapot
>>> 418
requests.codes['\o/']
>>> 200
After a web server returns a response, you can collect the content you need. This is also done using the requests.get function.
import requests
r = requests.get('https://github.com/timeline.json')
print r.text
# The Requests library also comes with a built-in JSON decoder,
# just in case you have to deal with JSON data
import requests
r = requests.get('https://github.com/timeline.json')
print r.json()
By utilizing a Python dictionary, you can access and view a server’s response headers. Thanks to how Requests works, you can access the headers using any capitalization you’d like.
If you perform this function but a header doesn’t exist in the response, the value will default to None.
r.headers
{
'status': '200 OK',
'content-encoding': 'gzip',
'transfer-encoding': 'chunked',
'connection': 'close',
'server': 'nginx/1.0.4',
'x-runtime': '148ms',
'etag': '"e1ca502697e5c9317743dc078f67693f"',
'content-type': 'application/json; charset=utf-8'
}
r.headers['Content-Type']
>>>'application/json; charset=utf-8'
r.headers.get('content-type')
>>>'application/json; charset=utf-8'
r.headers.get('X-Random')
>>>None
# Get the headers of a given URL
resp = requests.head("http://www.google.com")
print resp.status_code, resp.text, resp.headers
Requests will automatically decode any content pulled from a server, and most Unicode character sets are decoded seamlessly.
When you make a request to a server, the Requests library makes an educated guess about the encoding of the response, based on the HTTP headers. The guessed encoding is used when you access r.text. Through the r.encoding property you can find out what encoding Requests is using, and change it if need be. If you change the encoding value, Requests will use the new encoding whenever you call r.text.
print r.encoding
>>> utf-8
r.encoding = 'ISO-8859-1'
If you want to add custom HTTP headers to a request, you must pass them through a dictionary to the headers parameter.
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}
r = requests.post(url, data=json.dumps(payload), headers=headers)
Requests will automatically perform a location redirection when you use the GET and OPTIONS verbs in Python.
GitHub will redirect all HTTP requests to HTTPS automatically. This keeps things secure and encrypted.
You can use the history method of the response object to track redirection status.
r = requests.get('http://github.com')
r.url
>>> 'https://github.com/'
r.status_code
>>> 200
r.history
>>> []
You can also handle post requests using the Requests library.
r = requests.post("http://httpbin.org/post")
But you can also rely on other HTTP requests too, like PUT, DELETE, HEAD, and OPTIONS.
r = requests.put("http://httpbin.org/put")
r = requests.delete("http://httpbin.org/delete")
r = requests.head("http://httpbin.org/get")
r = requests.options("http://httpbin.org/get")
You can use these methods to accomplish a great many things. For instance, using a Python script to create a GitHub repo.
import requests, json
github_url = "https://api.github.com/user/repos"
data = json.dumps({'name':'test', 'description':'some test repo'})
r = requests.post(github_url, data, auth=('user', '*****'))
print r.json()
There are a number of exceptions and error codes you need to be familiar with when using the Requests library in Python.
Any exceptions that Requests raises will be inherited from the requests.exceptions.RequestException object.
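As a minimal sketch of catching them (no network access needed: a URL with no scheme makes Requests raise MissingSchema, which is a subclass of RequestException):

```python
import requests

try:
    # A malformed URL fails before any network I/O happens.
    requests.get('not-a-valid-url')
except requests.exceptions.RequestException as err:
    caught = type(err).__name__
    print('Request failed:', caught)
```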
You can read more about the Requests library at the links below.
http://docs.python-requests.org/en/latest/api/
http://pypi.python.org/pypi/requests
http://docs.python-requests.org/en/latest/user/quickstart/
http://isbullsh.it/2012/06/Rest-api-in-python/#requests
The post How to use the Pexpect Module appeared first on PythonForBeginners.com.
The reason I started to use Pexpect was that I was looking for a module that could take care of some of my automation needs (mostly with ssh and ftp).
You can use other modules such as subprocess, but I find this module easier to use.
Note that this post is not aimed at the complete Python beginner, but hey, it’s always fun to learn new things.
Pexpect is a pure Python module that makes Python a better tool for controlling
and automating other programs.
Pexpect is basically a pattern matching system. It runs program and watches
output.
When output matches a given pattern Pexpect can respond as if a human were
typing responses.
Pexpect can be used for automation, testing, and screen scraping.
Pexpect can be used for automating interactive console applications such as
ssh, ftp, passwd, telnet, etc.
It can also be used to control web applications via `lynx`, `w3m`, or some
other text-based web browser.
The latest version of Pexpect can be found here
wget http://pexpect.sourceforge.net/pexpect-2.3.tar.gz
tar xzf pexpect-2.3.tar.gz
cd pexpect-2.3
sudo python ./setup.py install
# If your system supports yum or apt-get, you might be able to use the
# commands below to install the pexpect package.
sudo yum install pexpect.noarch
# or
sudo apt-get install python-pexpect
There are two important methods in Pexpect: expect() and send() (or sendline()
which is like send() with a linefeed).
expect() waits for the child application to return a given string.
The string you specify is a regular expression, so you can match complicated
patterns.
Remember that any time you try to match a pattern that needs look-ahead,
you will always get a minimal match.
The following will always return just one character:
child.expect('.+')
Specify the text you expect back as precisely as you can; you can add ‘.*’
to the beginning or to the end of the text you’re expecting to make sure
you’re catching any unexpected characters.
This example will match successfully, but will always return no characters:
child.expect('.*')
Generally any star * expression will match as little as possible.
The pattern given to expect() may also be a list of regular expressions,
this allows you to match multiple optional responses.
(example if you get various responses from the server)
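When given a list, expect() returns the index of the pattern that matched, so you can branch on the response. A small sketch (assuming pexpect is installed, on a POSIX system; a trivial echo command is spawned so no server is needed):

```python
import pexpect

# Spawn a trivial command so the demo needs no network or server.
child = pexpect.spawn('echo hello world')

# expect() returns the index of whichever pattern in the list
# matched the child's output first.
index = child.expect(['goodbye', 'hello', pexpect.EOF])
print(index)  # 1, because 'hello' matched
```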
send() writes a string to the child application.
From the child’s point of view it looks just like someone typed the text from
a terminal.
After each call to expect() the before and after properties will be set to the
text printed by child application.
The before property will contain all text up to the expected string pattern.
You can use child.before to print the output from the other side of the connection.
The after string will contain the text that was matched by the expected pattern.
The match property is set to the re MatchObject.
This connects to the openbsd ftp site and downloads the recursive directory
listing.
You can use this technique with any application.
This is especially handy if you are writing automated test tools.
Again, this example is copied from here
import pexpect
child = pexpect.spawn ('ftp ftp.openbsd.org')
child.expect ('Name .*: ')
child.sendline ('anonymous')
child.expect ('Password:')
child.sendline ('[email protected]')
child.expect ('ftp> ')
child.sendline ('cd pub')
child.expect('ftp> ')
child.sendline ('get ls-lR.gz')
child.expect('ftp> ')
child.sendline ('bye')
In the second example, we can see how to get back control from Pexpect.
This example uses ftp to login to the OpenBSD site (just as above),
list files in a directory and then pass interactive control of the ftp session
to the human user.
import pexpect
child = pexpect.spawn ('ftp ftp.openbsd.org')
child.expect ('Name .*: ')
child.sendline ('anonymous')
child.expect ('Password:')
child.sendline ('[email protected]')
child.expect ('ftp> ')
child.sendline ('ls /pub/OpenBSD/')
child.expect ('ftp> ')
print child.before # Print the result of the ls command.
child.interact() # Give control of the child to the user.
There are special patterns to match the End Of File or a Timeout condition.
I will not cover them in detail in this article, but refer to the official
documentation, because it is good to know how they work.
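As a brief sketch (same assumptions as the earlier examples: pexpect installed, POSIX system), pexpect.EOF and pexpect.TIMEOUT can be placed in the pattern list so those conditions are reported as an index instead of being raised as exceptions:

```python
import pexpect

child = pexpect.spawn('echo done')

# 'never-printed' will not match, so expect() falls through to EOF
# once the child exits; TIMEOUT would match if nothing arrived in time.
index = child.expect(['never-printed', pexpect.EOF, pexpect.TIMEOUT], timeout=5)
if index == 1:
    print('child exited (EOF)')
elif index == 2:
    print('timed out')
```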
The post Python Collections Counter appeared first on PythonForBeginners.com.
The Collections module implements high-performance container datatypes (beyond
the built-in types list, dict, and tuple) and contains many useful data structures
that you can use to store information in memory.
This article will be about the Counter object.
A Counter is a container that tracks how many times equivalent values are added.
It can be used to implement the same algorithms for which other languages commonly
use bag or multiset data structures.
Writing import collections makes everything in the module available as
collections.something:
import collections
Since we are only going to use the Counter, we can simply do this:
from collections import Counter
Counter supports three forms of initialization.
Its constructor can be called with a sequence of items (an iterable), with a
dictionary containing keys and counts (a mapping), or with keyword arguments
mapping string names to counts (keyword args).
import collections
print collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
print collections.Counter({'a':2, 'b':3, 'c':1})
print collections.Counter(a=2, b=3, c=1)
The results of all three forms of initialization are the same.
$ python collections_counter_init.py
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
An empty Counter can be constructed with no arguments and populated via the
update() method.
import collections
c = collections.Counter()
print 'Initial :', c
c.update('abcdaab')
print 'Sequence:', c
c.update({'a':1, 'd':5})
print 'Dict :', c
The count values are increased based on the new data, rather than replaced.
In this example, the count for a goes from 3 to 4.
$ python collections_counter_update.py
Initial : Counter()
Sequence: Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
Dict : Counter({'d': 6, 'a': 4, 'b': 2, 'c': 1})
Once a Counter is populated, its values can be retrieved using the dictionary API.
import collections
c = collections.Counter('abcdaab')
for letter in 'abcde':
    print '%s : %d' % (letter, c[letter])
Counter does not raise KeyError for unknown items.
If a value has not been seen in the input (as with e in this example),
its count is 0.
$ python collections_counter_get_values.py
a : 3
b : 2
c : 1
d : 1
e : 0
The elements() method returns an iterator over elements repeating each as many
times as its count.
Elements are returned in arbitrary order.
import collections
c = collections.Counter('extremely')
c['z'] = 0
print c
print list(c.elements())
The order of elements is not guaranteed, and items with counts less than zero are
not included.
$ python collections_counter_elements.py
Counter({'e': 3, 'm': 1, 'l': 1, 'r': 1, 't': 1, 'y': 1, 'x': 1, 'z': 0})
['e', 'e', 'e', 'm', 'l', 'r', 't', 'y', 'x']
Use most_common() to produce a sequence of the n most frequently encountered
input values and their respective counts.
import collections
c = collections.Counter()
with open('/usr/share/dict/words', 'rt') as f:
    for line in f:
        c.update(line.rstrip().lower())
print 'Most common:'
for letter, count in c.most_common(3):
    print '%s: %7d' % (letter, count)
This example counts the letters appearing in all of the words in the system
dictionary to produce a frequency distribution, then prints the three most common
letters.
Leaving out the argument to most_common() produces a list of all the items,
in order of frequency.
$ python collections_counter_most_common.py
Most common:
e: 234803
i: 200613
a: 198938
Counter instances support arithmetic and set operations for aggregating results.
import collections
c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
c2 = collections.Counter('alphabet')
print 'C1:', c1
print 'C2:', c2
print '\nCombined counts:'
print c1 + c2
print '\nSubtraction:'
print c1 - c2
print '\nIntersection (taking positive minimums):'
print c1 & c2
print '\nUnion (taking maximums):'
print c1 | c2
Each time a new Counter is produced through an operation, any items with zero or
negative counts are discarded.
The count for a is the same in c1 and c2, so subtraction leaves it at zero.
$ python collections_counter_arithmetic.py
C1: Counter({'b': 3, 'a': 2, 'c': 1})
C2: Counter({'a': 2, 'b': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Combined counts:
Counter({'a': 4, 'b': 4, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
Subtraction:
Counter({'b': 2, 'c': 1})
Intersection (taking positive minimums):
Counter({'a': 2, 'b': 1})
Union (taking maximums):
Counter({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'h': 1, 'l': 1, 'p': 1, 't': 1})
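Related to the '-' operator above, Counter also has an in-place subtract() method. Unlike the operator, it keeps items whose counts drop to zero or below. A quick sketch:

```python
from collections import Counter

c = Counter(a=4, b=2, c=0)
c.subtract(Counter(a=1, b=3))

# subtract() mutates c in place and, unlike c1 - c2,
# keeps zero and negative counts.
print(c['a'])  # 3
print(c['b'])  # -1
print(c['c'])  # 0
```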
Tally occurrences of words in a list.
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
The counter takes an iterable and could also be written like this:
mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
cnt = Counter(mywords)
print cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
Find the ten most common words in Hamlet
import re
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
print Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
('you', 554), ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]
Please don’t forget to read the links below for more information.
http://www.doughellmann.com/PyMOTW/collections/
http://docs.python.org/2/library/collections.html#collections.Counter
The post OS.Walk and Fnmatch in Python appeared first on PythonForBeginners.com.
In an earlier post, OS.walk in Python, I described how to use os.walk and showed some examples of how to use it in scripts.
In this article, I will show how to use the os.walk() module function to walk a directory tree, and the fnmatch module for matching file names.
os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up.
For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
dirpath # is a string, the path to the directory.
dirnames # is a list of the names of the subdirectories in dirpath (excluding ‘.’ and ‘..’).
filenames # is a list of the names of the non-directory files in dirpath.
Note that the names in the lists contain no path components.
To get a full path (which begins with top) to a file or directory in dirpath, do os.path.join(dirpath, name). For more information, please see the Python Docs.
The fnmatch module compares file names against glob-style patterns such as used by Unix shells.
These are not the same as the more sophisticated regular expression rules. It’s purely a string matching operation.
If you find it more convenient to use a different pattern style, for example regular expressions, then simply use regex operations to match your filenames. http://www.doughellmann.com/PyMOTW/fnmatch/
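If you do want regular expressions, fnmatch can even do the conversion for you: fnmatch.translate() turns a glob-style pattern into an equivalent regex string. A small sketch:

```python
import fnmatch
import re

# translate() converts a glob-style pattern into a regular
# expression string that the re module understands.
pattern = fnmatch.translate('*.mp3')
regex = re.compile(pattern, re.IGNORECASE)

print(bool(regex.match('song.mp3')))  # True
print(bool(regex.match('song.wav')))  # False
```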
The fnmatch module is used for the wild-card pattern matching.
Simple Matching
fnmatch() compares a single file name against a pattern and returns a boolean indicating whether or not they match. The comparison is case-sensitive when the operating system uses a case-sensitive file system.
Filtering
To test a sequence of filenames, you can use filter(). It returns a list of the names that match the pattern argument.
This script will search for *.mp3 files from the rootPath (“/”)
import fnmatch
import os
rootPath = '/'
pattern = '*.mp3'
for root, dirs, files in os.walk(rootPath):
    for filename in fnmatch.filter(files, pattern):
        print(os.path.join(root, filename))
This script uses ‘os.walk’ and ‘fnmatch’ with filters to search the hard-drive for all image files
import fnmatch
import os
images = ['*.jpg', '*.jpeg', '*.png', '*.tif', '*.tiff']
matches = []
for root, dirnames, filenames in os.walk("C:\\"):
    for extension in images:
        for filename in fnmatch.filter(filenames, extension):
            matches.append(os.path.join(root, filename))
There are many other (and faster) ways to do this, but now you understand the basics of it.
http://rosettacode.org/wiki/Walk_a_directory/Recursively#Python
Stackoverflow oswalk with fnmatch
The post Python Docstrings appeared first on PythonForBeginners.com.
Python documentation strings (or docstrings) provide a convenient way of associating documentation with Python modules, functions, classes, and methods.
An object’s docstring is defined by including a string constant as the first statement in the object’s definition. It’s specified in the source code and is used, like a comment, to document a specific segment of code.
Unlike conventional source code comments, the docstring should describe what the function does, not how.
All functions should have a docstring. This allows the program to inspect these comments at run time, for instance as an interactive help system, or as metadata.
Docstrings can be accessed by the __doc__ attribute on objects.
The docstring line should begin with a capital letter and end with a period. The first line should be a short description.
Don’t write the name of the object. If there are more lines in the documentation string, the second line should be blank, visually separating the summary from the rest of the description.
The following lines should be one or more paragraphs describing the object’s calling conventions, its side effects, etc.
Let’s look at an example of a multi-line docstring:
def my_function():
    """Do nothing, but document it.

    No, really, it doesn't do anything.
    """
    pass
Let’s see how this looks when we print it:
>>> print my_function.__doc__
Do nothing, but document it.
No, really, it doesn't do anything.
The following Python file shows the declaration of docstrings within a python source file:
"""
Assuming this is file mymodule.py, then this string, being the
first statement in the file, will become the "mymodule" module's
docstring when the file is imported.
"""
class MyClass(object):
    """The class's docstring"""

    def my_method(self):
        """The method's docstring"""

def my_function():
    """The function's docstring"""
The following is an interactive session showing how the docstrings may be accessed:
>>> import mymodule
>>> help(mymodule)
Assuming this is file mymodule.py, then this string, being the first statement in the file, will become the "mymodule" module's docstring when the file is imported.
>>> help(mymodule.MyClass)
The class's docstring
>>> help(mymodule.MyClass.my_method)
The method's docstring
>>> help(mymodule.my_function)
The function's docstring
The post How to import modules in Python appeared first on PythonForBeginners.com.
Python modules make programming a lot easier.
A module is basically a file that consists of already written code.
When Python imports a module, it first checks the module registry (sys.modules)
to see if the module is already imported.
If that’s the case, Python uses the existing module object as is.
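You can observe that cache directly. A small sketch, using the standard json module as the example import:

```python
import sys
import json

# After the first import, the module object is cached in sys.modules.
cached = sys.modules['json']

# A second import does not re-execute the module's code;
# it simply returns the cached object.
import json
print(json is cached)  # True
```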
There are different ways to import a module:
import sys
# access module, after this you can use sys.name to refer to
# things defined in module sys.
from sys import stdout
# access module without qualifying name.
# This reads >> from the module "sys" import "stdout", so that
# we would be able to refer "stdout" in our program.
from sys import *
# access all functions/classes in the sys module.