<?php if ( file_exists("../booktop.php") ) {

require_once "../booktop.php";

ob_start();

}?>

<!DOCTYPE html>

<html>

<head>

<style>

code{white-space: pre-wrap;}

span.smallcaps{font-variant: small-caps;}

span.underline{text-decoration: underline;}

div.column{display: inline-block; vertical-align: top; width: 50%;}

div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}

ul.task-list{list-style: none;}

</style>

<!--[if lt IE 9]>

<![endif]-->

</head>

<body>

<h1 id="files">Files</h1>

<h2 id="persistence">Persistence</h2>

So far, we have learned how to write programs and communicate our intentions to the Central Processing Unit using conditional execution, functions, and iterations. We have learned how to create and use data structures in the Main Memory. The CPU and memory are where our software works and runs. They are where all of the “thinking” happens.

But if you recall from our hardware architecture discussions, once the power is turned off, anything stored in either the CPU or main memory is erased. So up to now, our programs have just been transient fun exercises to learn Python.

<figure>

<img src="../images/arch.svg" alt="" /><figcaption>Secondary Memory</figcaption>

</figure>

In this chapter, we start to work with Secondary Memory (or files). Secondary memory is not erased when the power is turned off. In the case of a USB flash drive, the data we write from our programs can even be removed from the system and transported to another system.

We will primarily focus on reading and writing text files such as those we create in a text editor. Later we will see how to work with database files which are binary files, specifically designed to be read and written through database software.

<h2 id="opening-files">Opening files</h2>

When we want to read or write a file (say on your hard drive), we first must open the file. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file, you are asking the operating system to find the file by name and make sure the file exists. In this example, we open the file mbox.txt, which should be stored in the same folder that you are in when you start Python. You can download this file from <a href="http://www.py4e.com/code3/mbox.txt">www.py4e.com/code3/mbox.txt</a>

<pre class="python"><code>>>> fhand = open('mbox.txt')

>>> print(fhand)

&lt;_io.TextIOWrapper name='mbox.txt' mode='r' encoding='cp1252'&gt;</code></pre>

If the <code>open</code> is successful, the operating system returns us a file handle. The file handle is not the actual data contained in the file, but instead it is a “handle” that we can use to read the data. You are given a handle if the requested file exists and you have the proper permissions to read the file.

<figure>

<img src="../images/handle.svg" alt="" /><figcaption>A File Handle</figcaption>

</figure>

If the file does not exist, <code>open</code> will fail with a traceback and you will not get a handle to access the contents of the file:

<pre class="python"><code>>>> fhand = open('stuff.txt')

Traceback (most recent call last):

File "&lt;stdin&gt;", line 1, in &lt;module&gt;

FileNotFoundError: [Errno 2] No such file or directory: 'stuff.txt'</code></pre>

Later we will use <code>try</code> and <code>except</code> to deal more gracefully with the situation where we attempt to open a file that does not exist.

<h2 id="text-files-and-lines">Text files and lines</h2>

A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. For example, this is a sample of a text file which records mail activity from various individuals in an open source project development team:

<pre><code>From [email protected] Sat Jan 5 09:14:16 2008

Return-Path: &lt;[email protected]&gt;

Date: Sat, 5 Jan 2008 09:12:18 -0500

To: [email protected]

From: [email protected]

Subject: [sakai] svn commit: r39772 - content/branches/

Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772

...</code></pre>

The entire file of mail interactions is available from <a href="http://www.py4e.com/code3/mbox.txt">www.py4e.com/code3/mbox.txt</a> and a shortened version of the file is available from <a href="http://www.py4e.com/code3/mbox-short.txt">www.py4e.com/code3/mbox-short.txt</a>

These files are in a standard format for a file containing multiple mail messages. The lines which start with “From” separate the messages and the lines which start with “From:” are part of the messages. For more information about the mbox format, see <a href="https://en.wikipedia.org/wiki/Mbox" class="uri">https://en.wikipedia.org/wiki/Mbox</a>.

To break the file into lines, there is a special character that represents the “end of the line” called the newline character.

In Python, we represent the newline character as a backslash-n in string constants. Even though this looks like two characters, it is actually a single character. When we look at the variable by entering “stuff” in the interpreter, it shows us the <code>\n</code> in the string, but when we use <code>print</code> to show the string, we see the string broken into two lines by the newline character.

<pre class="python"><code>>>> stuff = 'Hello\nWorld!'

>>> stuff

'Hello\nWorld!'

>>> print(stuff)

Hello

World!

>>> stuff = 'X\nY'

>>> print(stuff)

X

Y

>>> len(stuff)

3</code></pre>

You can also see that the length of the string <code>X\nY</code> is three characters because the newline character is a single character.

So when we look at the lines in a file, we need to imagine that there is a special invisible character called the newline at the end of each line that marks the end of the line.

So the newline character separates the characters in the file into lines.
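As a minimal sketch of this idea, the string below stands in for the contents of a small file. The sample text is made up for illustration. Counting the newline characters gives the number of lines, and the string method <code>splitlines</code> breaks the text apart at each newline:

```python
# A small in-memory "file": three lines, each ended by a newline character.
# The contents are invented for this example.
text = 'From: alice\nTo: bob\nSubject: hello\n'

# Each newline character marks the end of one line.
newline_count = text.count('\n')

# splitlines() breaks the string at the newlines and drops them.
lines = text.splitlines()

print(newline_count)
print(lines)
```

The newline characters themselves do not appear in the pieces returned by <code>splitlines</code>; they only mark where one line stops and the next begins.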

<h2 id="reading-files">Reading files</h2>

While the file handle does not contain the data for the file, it is quite easy to construct a <code>for</code> loop to read through and count each of the lines in a file:

<pre class="python"><code>fhand = open('mbox-short.txt')

count = 0

for line in fhand:

    count = count + 1

print('Line Count:', count)

# Code: http://www.py4e.com/code3/open.py</code></pre>

We can use the file handle as the sequence in our <code>for</code> loop. Our <code>for</code> loop simply counts the number of lines in the file, and the program prints the total once the loop finishes. The rough translation of the <code>for</code> loop into English is, “for each line in the file represented by the file handle, add one to the <code>count</code> variable.”

The reason that the <code>open</code> function does not read the entire file is that the file might be quite large with many gigabytes of data. The <code>open</code> call takes the same amount of time regardless of the size of the file. The <code>for</code> loop actually causes the data to be read from the file.

When the file is read using a <code>for</code> loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the <code>line</code> variable for each iteration of the <code>for</code> loop.

Because the <code>for</code> loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory to store the data. The above program can count the lines in any size file using very little memory since each line is read, counted, and then discarded.
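To see the trailing newline on each line, here is a small sketch that first writes its own two-line sample file (the file name <code>sample.txt</code> and its contents are assumptions made for this example) and then loops over it. The built-in <code>repr</code> function, described in the debugging section of this chapter, makes the newline visible:

```python
import os
import tempfile

# Create a tiny sample file so the sketch is self-contained.
# The file name and contents are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('first line\nsecond line\n')
fout.close()

count = 0
ends_with_newline = 0
fhand = open(path)
for line in fhand:
    # repr() shows the newline at the end of the string as \n
    print(repr(line))
    count = count + 1
    if line.endswith('\n'):
        ends_with_newline = ends_with_newline + 1
fhand.close()

print('Lines:', count, 'ending in a newline:', ends_with_newline)
```

Only one line is held in the <code>line</code> variable at a time, which is why this pattern uses so little memory.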

If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the <code>read</code> method on the file handle.

<pre class="python"><code>>>> fhand = open('mbox-short.txt')

>>> inp = fhand.read()

>>> print(len(inp))

94626

>>> print(inp[:20])

From stephen.marquar</code></pre>

In this example, the entire contents (all 94,626 characters) of the file mbox-short.txt are read directly into the variable <code>inp</code>. We use string slicing to print out the first 20 characters of the string data stored in <code>inp</code>.

When the file is read in this manner, all the characters including all of the lines and newline characters are one big string in the variable <code>inp</code>. It is a good idea to store the output of <code>read</code> as a variable because each call to <code>read</code> exhausts the resource:

<pre class="python"><code>>>> fhand = open('mbox-short.txt')

>>> print(len(fhand.read()))

94626

>>> print(len(fhand.read()))

0</code></pre>

Remember that reading the whole file at once with the <code>read</code> method should only be done if the file data will fit comfortably in the main memory of your computer. If the file is too large to fit in main memory, you should write your program to read the file in chunks using a <code>for</code> or <code>while</code> loop.
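One way to read in chunks is to pass a size to <code>read</code>, which then returns at most that many characters and returns an empty string at the end of the file. The following is a sketch under stated assumptions rather than a definitive recipe: the helper name <code>count_chars</code>, the chunk size, and the sample file are all made up for illustration.

```python
import os
import tempfile

def count_chars(path, chunk_size=1024):
    """Count the characters in a text file without loading it all at once."""
    total = 0
    fhand = open(path)
    while True:
        chunk = fhand.read(chunk_size)   # at most chunk_size characters
        if chunk == '':                  # an empty string means end of file
            break
        total = total + len(chunk)
    fhand.close()
    return total

# Build a small sample file so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fout = open(path, 'w')
fout.write('abc\n' * 1000)
fout.close()

print(count_chars(path, chunk_size=256))
```

At any moment the program holds only one chunk in memory, so the same loop works for files far larger than main memory.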

<h2 id="searching-through-a-file">Searching through a file</h2>

When you are searching through data in a file, it is a very common pattern to read through a file, ignoring most of the lines and only processing lines which meet a particular condition. We can combine the pattern for reading a file with string methods to build simple search mechanisms.

For example, if we wanted to read a file and only print out lines which started with the prefix “From:”, we could use the string method startswith to select only those lines with the desired prefix:

<pre class="python"><code>fhand = open('mbox-short.txt')

count = 0

for line in fhand:

    if line.startswith('From:'):

        print(line)

# Code: http://www.py4e.com/code3/search1.py</code></pre>

When this program runs, we get the following output:

<pre><code>From: [email protected]

From: [email protected]

...</code></pre>

The output looks great since the only lines we are seeing are those which start with “From:”, but why are we seeing the extra blank lines? This is due to that invisible newline character. Each of the lines ends with a newline, so the <code>print</code> function prints the string in the variable <code>line</code>, which includes a newline, and then <code>print</code> adds another newline, resulting in the double spacing effect we see.

We could use string slicing to print all but the last character, but a simpler approach is to use the <code>rstrip</code> method, which strips whitespace from the right side of a string, as follows:

<pre class="python"><code>fhand = open('mbox-short.txt')

for line in fhand:

    line = line.rstrip()

    if line.startswith('From:'):

        print(line)

# Code: http://www.py4e.com/code3/search2.py</code></pre>

When this program runs, we get the following output:

<pre><code>From: [email protected]

From: [email protected]

...</code></pre>
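For comparison, here is a short sketch of the slicing approach next to <code>rstrip</code>. The sample strings are invented for this example. Slicing always removes exactly one character, so it removes a real character when a line (such as the last line of some files) has no trailing newline:

```python
# Made-up sample lines: one ends with a newline, one does not.
line = 'From: someone@example.com\n'
last = 'no newline here'

print(repr(line[:-1]))      # slicing drops exactly one character
print(repr(line.rstrip()))  # rstrip drops trailing whitespace only

print(repr(last[:-1]))      # slicing chops off a real character
print(repr(last.rstrip()))  # rstrip leaves the string unchanged
```

This is one reason <code>rstrip</code> tends to be the safer choice when cleaning up lines read from a file.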

As your file processing programs get more complicated, you may want to structure your search loops using <code>continue</code>. The basic idea of the search loop is that you are looking for “interesting” lines and effectively skipping “uninteresting” lines. And then when we find an interesting line, we do something with that line.

We can structure the loop to follow the pattern of skipping uninteresting lines as follows:

<pre class="python"><code>fhand = open('mbox-short.txt')

for line in fhand:

    line = line.rstrip()

    # Skip 'uninteresting lines'

    if not line.startswith('From:'):

        continue

    # Process our 'interesting' line

    print(line)

# Code: http://www.py4e.com/code3/search3.py</code></pre>

The output of the program is the same. In English, the uninteresting lines are those which do not start with “From:”, which we skip using <code>continue</code>. For the “interesting” lines (i.e., those that start with “From:”) we perform the processing on those lines.

We can use the <code>find</code> string method to simulate a text editor search that finds lines where the search string is anywhere in the line. Since <code>find</code> looks for an occurrence of a string within another string and either returns the position of the string or -1 if the string was not found, we can write the following loop to show lines which contain the string “@uct.ac.za” (i.e., they come from the University of Cape Town in South Africa):

<pre class="python"><code>fhand = open('mbox-short.txt')

for line in fhand:

    line = line.rstrip()

    if line.find('@uct.ac.za') == -1: continue

    print(line)

# Code: http://www.py4e.com/code3/search4.py</code></pre>

Which produces the following output:

<pre><code>From [email protected] Sat Jan 5 09:14:16 2008

X-Authentication-Warning: set sender to [email protected] using -f

From: [email protected]

Author: [email protected]

From [email protected] Fri Jan 4 07:02:32 2008

X-Authentication-Warning: set sender to [email protected] using -f

From: [email protected]

Author: [email protected]

...</code></pre>

Here we also use the contracted form of the <code>if</code> statement where we put the <code>continue</code> on the same line as the <code>if</code>. This contracted form of the <code>if</code> functions the same as if the <code>continue</code> were on the next line and indented.
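To make the return values of <code>find</code> concrete, here is a small sketch using a made-up line (the address is hypothetical, not taken from the mbox file). It also shows the <code>in</code> operator, which answers the same “is it anywhere in the line?” question with <code>True</code> or <code>False</code>:

```python
# A made-up sample line; the address is hypothetical.
line = 'From: someone@uct.ac.za'

found_at = line.find('@uct.ac.za')     # index where the match starts
missing = line.find('@example.com')    # -1 because there is no match
is_present = '@uct.ac.za' in line      # True or False

print(found_at, missing, is_present)
```

Either form works in the search loop; <code>find</code> is useful when the position of the match matters as well.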

<h2 id="letting-the-user-choose-the-file-name">Letting the user choose the file name</h2>

We really do not want to have to edit our Python code every time we want to process a different file. It would be more usable to ask the user to enter the file name string each time the program runs so they can use our program on different files without changing the Python code.

This is quite simple to do by reading the file name from the user using <code>input</code> as follows:

<pre class="python"><code>fname = input('Enter the file name: ')

fhand = open(fname)

count = 0

for line in fhand:

    if line.startswith('Subject:'):

        count = count + 1

print('There were', count, 'subject lines in', fname)

# Code: http://www.py4e.com/code3/search6.py</code></pre>

We read the file name from the user and place it in a variable named <code>fname</code> and open that file. Now we can run the program repeatedly on different files.

<pre><code>python search6.py

Enter the file name: mbox.txt

There were 1797 subject lines in mbox.txt

python search6.py

Enter the file name: mbox-short.txt

There were 27 subject lines in mbox-short.txt</code></pre>

Before peeking at the next section, take a look at the above program and ask yourself, “What could possibly go wrong here?” or “What might our friendly user do that would cause our nice little program to ungracefully exit with a traceback, making us look not-so-cool in the eyes of our users?”

<h2 id="using-try-except-and-open">Using <code>try, except,</code> and <code>open</code></h2>

I told you not to peek. This is your last chance.

What if our user types something that is not a file name?

<pre><code>python search6.py

Enter the file name: missing.txt

Traceback (most recent call last):

File "search6.py", line 2, in &lt;module&gt;

fhand = open(fname)

FileNotFoundError: [Errno 2] No such file or directory: 'missing.txt'

python search6.py

Enter the file name: na na boo boo

Traceback (most recent call last):

File "search6.py", line 2, in &lt;module&gt;

fhand = open(fname)

FileNotFoundError: [Errno 2] No such file or directory: 'na na boo boo'</code></pre>

Do not laugh. Users will eventually do every possible thing they can do to break your programs, either by accident or with malicious intent. As a matter of fact, an important part of any software development team is a person or group called Quality Assurance (or QA for short) whose very job it is to do the craziest things possible in an attempt to break the software that the programmer has created.

The QA team is responsible for finding the flaws in programs before we have delivered the program to the end users who may be purchasing the software or paying our salary to write the software. So the QA team is the programmer’s best friend.

So now that we see the flaw in the program, we can elegantly fix it using the <code>try</code>/<code>except</code> structure. We need to assume that the <code>open</code> call might fail and add recovery code when the <code>open</code> fails as follows:

<pre class="python"><code>fname = input('Enter the file name: ')

try:

    fhand = open(fname)

except:

    print('File cannot be opened:', fname)

    exit()

count = 0

for line in fhand:

    if line.startswith('Subject:'):

        count = count + 1

print('There were', count, 'subject lines in', fname)

# Code: http://www.py4e.com/code3/search7.py</code></pre>

The <code>exit</code> function terminates the program. It is a function that we call that never returns. Now when our user (or QA team) types in silliness or bad file names, we “catch” them and recover gracefully:

<pre><code>python search7.py

Enter the file name: mbox.txt

There were 1797 subject lines in mbox.txt

python search7.py

Enter the file name: na na boo boo

File cannot be opened: na na boo boo</code></pre>

Protecting the <code>open</code> call is a good example of the proper use of <code>try</code> and <code>except</code> in a Python program. We use the term “Pythonic” when we are doing something the “Python way”. We might say that the above example is the Pythonic way to open a file.

Once you become more skilled in Python, you can engage in repartee with other Python programmers to decide which of two equivalent solutions to a problem is “more Pythonic”. The goal to be “more Pythonic” captures the notion that programming is part engineering and part art. We are not always interested in just making something work, we also want our solution to be elegant and to be appreciated as elegant by our peers.
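As one possible variation on the program above, and not the version used in this chapter, the <code>except</code> clause can name the kind of error it expects. The sketch below wraps the counting logic in a function so it is easy to try; the function name <code>count_subject_lines</code>, the sample file, and the choice to return <code>None</code> on failure are all assumptions made for illustration. <code>OSError</code> is the built-in exception family that includes <code>FileNotFoundError</code>.

```python
import os
import tempfile

def count_subject_lines(fname):
    """Return the number of lines starting with 'Subject:',
    or None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except OSError:
        # Covers a missing file, a permission problem, and similar errors.
        return None
    count = 0
    for line in fhand:
        if line.startswith('Subject:'):
            count = count + 1
    fhand.close()
    return count

# Build a small sample file so the sketch can run on its own.
folder = tempfile.mkdtemp()
sample = os.path.join(folder, 'sample.txt')
fout = open(sample, 'w')
fout.write('Subject: one\nFrom: someone\nSubject: two\n')
fout.close()

print(count_subject_lines(sample))
print(count_subject_lines(os.path.join(folder, 'na na boo boo')))
```

Naming the exception means that unrelated mistakes, such as a misspelled variable inside the <code>try</code> block, still produce a traceback instead of being reported as a file problem.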

<h2 id="writing-files">Writing files</h2>

To write a file, you have to open it with mode “w” as a second parameter:

<pre class="python"><code>>>> fout = open('output.txt', 'w')

>>> print(fout)

&lt;_io.TextIOWrapper name='output.txt' mode='w' encoding='cp1252'&gt;</code></pre>

If the file already exists, opening it in write mode clears out the old data and starts fresh, so be careful! If the file doesn’t exist, a new one is created.

The <code>write</code> method of the file handle object puts data into the file, returning the number of characters written. The default write mode is text for writing (and reading) strings.

<pre class="python"><code>>>> line1 = "This here's the wattle,\n"

>>> fout.write(line1)

24</code></pre>

Again, the file object keeps track of where it is, so if you call <code>write</code> again, it adds the new data to the end.

We must make sure to manage the ends of lines as we write to the file by explicitly inserting the newline character when we want to end a line. The <code>print</code> function automatically appends a newline, but the <code>write</code> method does not add the newline automatically.

<pre class="python"><code>>>> line2 = 'the emblem of our land.\n'

>>> fout.write(line2)

24</code></pre>

When you are done writing, you have to close the file to make sure that the last bit of data is physically written to the disk so it will not be lost if the power goes off.

<pre class="python"><code>>>> fout.close()</code></pre>

We could close the files which we open for read as well, but we can be a little sloppy if we are only opening a few files since Python makes sure that all open files are closed when the program ends. When we are writing files, we want to explicitly close the files so as to leave nothing to chance.
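Python also provides the <code>with</code> statement, which the examples in this chapter do not use, to close a file automatically when a block of code finishes. The following sketch writes the same two lines and reads them back; the temporary folder is an assumption made so the example does not overwrite any of your own files.

```python
import os
import tempfile

# Write into a temporary folder so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')

# The file is closed automatically when the with block ends,
# even if an error occurs inside the block.
with open(path, 'w') as fout:
    fout.write("This here's the wattle,\n")
    fout.write('the emblem of our land.\n')

print(fout.closed)

# Read the data back to confirm that it reached the file.
with open(path) as fhand:
    contents = fhand.read()

print(contents)
```

With this form there is no separate <code>close</code> call to forget, which can be convenient in longer programs.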

<h2 id="debugging">Debugging</h2>

When you are reading and writing files, you might run into problems with whitespace. These errors can be hard to debug because spaces, tabs, and newlines are normally invisible:

<pre class="python"><code>>>> s = '1 2\t 3\n 4'

>>> print(s)

1 2 3

4</code></pre>

The built-in function <code>repr</code> can help. It takes any object as an argument and returns a string representation of the object. For strings, it represents whitespace characters with backslash sequences:

<pre class="python"><code>>>> print(repr(s))

'1 2\t 3\n 4'</code></pre>

This can be helpful for debugging.

One other problem you might run into is that different systems use different characters to indicate the end of a line. Some systems use a newline, represented <code>\n</code>. Others use a return character, represented <code>\r</code>. Some use both. If you move files between different systems, these inconsistencies might cause problems.

For most systems, there are applications to convert from one format to another. You can find them (and read more about this issue) at <a href="https://en.wikipedia.org/wiki/Newline">wikipedia.org/wiki/Newline</a>. Or, of course, you could write one yourself.
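As a rough sketch of what such a converter might look like, the function below works on a string that has already been read into memory. The function name <code>to_unix_newlines</code> and the sample text are made up for illustration, and a real tool would also need to handle reading and writing the files themselves.

```python
def to_unix_newlines(text):
    """Convert return+newline pairs and lone returns to a single newline."""
    # Replace the two-character ending first so it is not converted twice.
    text = text.replace('\r\n', '\n')
    text = text.replace('\r', '\n')
    return text

# A made-up sample that mixes all three line-ending styles.
sample = 'one\r\ntwo\rthree\n'
converted = to_unix_newlines(sample)

print(repr(sample))
print(repr(converted))
```

The order of the two replacements matters: converting the lone return characters first would turn each return-plus-newline pair into two newlines.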

<h2 id="glossary">Glossary</h2>

<dl>

<dt>catch</dt>

<dd>To prevent an exception from terminating a program using the <code>try</code> and <code>except</code> statements.

</dd>

<dt>newline</dt>

<dd>A special character used in files and strings to indicate the end of a line.

</dd>

<dt>Pythonic</dt>

<dd>A technique that works elegantly in Python. “Using try and except is the Pythonic way to recover from missing files”.

</dd>

<dt>Quality Assurance</dt>

<dd>A person or team focused on ensuring the overall quality of a software product. QA is often involved in testing a product and identifying problems before the product is released.

</dd>

<dt>text file</dt>

<dd>A sequence of characters stored in permanent storage like a hard drive.

</dd>

</dl>

<h2 id="exercises">Exercises</h2>

Exercise 1: Write a program to read through a file and print the contents of the file (line by line) all in upper case. Executing the program will look as follows:

<pre><code>python shout.py

Enter a file name: mbox-short.txt

FROM [email protected] SAT JAN 5 09:14:16 2008

RETURN-PATH: &lt;[email protected]&gt;

RECEIVED: FROM MURDER (MAIL.UMICH.EDU [141.211.14.90])

BY FRANKENSTEIN.MAIL.UMICH.EDU (CYRUS V2.3.8) WITH LMTPA;

SAT, 05 JAN 2008 09:14:16 -0500</code></pre>

You can download the file from <a href="http://www.py4e.com/code3/mbox-short.txt">www.py4e.com/code3/mbox-short.txt</a>

Exercise 2: Write a program to prompt for a file name, and then read through the file and look for lines of the form:

<pre><code>X-DSPAM-Confidence: 0.8475</code></pre>

When you encounter a line that starts with “X-DSPAM-Confidence:” pull apart the line to extract the floating-point number on the line. Count these lines and then compute the total of the spam confidence values from these lines. When you reach the end of the file, print out the average spam confidence.

<pre><code>Enter the file name: mbox.txt

Average spam confidence: 0.894128046745

Enter the file name: mbox-short.txt

Average spam confidence: 0.750718518519</code></pre>

Test your program on the mbox.txt and mbox-short.txt files.

Exercise 3: Sometimes when programmers get bored or want to have a bit of fun, they add a harmless Easter Egg to their program. Modify the program that prompts the user for the file name so that it prints a funny message when the user types in the exact file name “na na boo boo”. The program should behave normally for all other files which exist and don’t exist. Here is a sample execution of the program:

<pre><code>python egg.py

Enter the file name: mbox.txt

There were 1797 subject lines in mbox.txt

python egg.py

Enter the file name: missing.tyxt

File cannot be opened: missing.tyxt

python egg.py

Enter the file name: na na boo boo

NA NA BOO BOO TO YOU - You have been punk'd!</code></pre>

We are not encouraging you to put Easter Eggs in your programs; this is just an exercise.

</body>

</html>

<?php if ( file_exists("../bookfoot.php") ) {

$HTML_FILE = basename(__FILE__);

$HTML = ob_get_contents();

ob_end_clean();

require_once "../bookfoot.php";

}?>
