HTML Cleaner

Motivation

Eventually, we want to crawl the web and parse the words on each web page. To do that, we must be able to remove all of the HTML tags from a page. This assignment is a step towards that functionality.

Background

You can assume the HTML being parsed is valid. As a result, the less than < and greater than > symbols will only appear as part of HTML tags. Where they occur as literal symbols in the text, they will be written as the entities &lt; or &gt; instead.
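Because of that assumption, every < in the input opens a tag and the next > closes it, so tags can be matched with a simple pattern. The following is a minimal sketch of that idea, not the required solution: the class name `TagStripSketch`, its method names, and the choice to replace tags with an empty string are all assumptions made for illustration. It does not handle comments, script or style contents, or entities other than &lt; and &gt;.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not one of the assignment's required files.
public class TagStripSketch {
    // Matches "<", then any run of characters other than ">", then ">".
    // The class [^>] also matches newlines, so tags split across lines are covered.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Removes every tag. Replacing with "" is an assumption; a real cleaner
    // might insert a space so adjacent words do not run together.
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    // Converts the two entities named in the text back into literal symbols.
    public static String decodeAngleEntities(String text) {
        return text.replace("&lt;", "<").replace("&gt;", ">");
    }

    public static void main(String[] args) {
        String html = "<p>1 &lt; 2 and <b>bold</b> text</p>";
        String stripped = stripTags(html);
        System.out.println(stripped);                      // 1 &lt; 2 and bold text
        System.out.println(decodeAngleEntities(stripped)); // 1 < 2 and bold text
    }
}
```

The order matters in this sketch: tags are stripped before entities are decoded, so a decoded < is never mistaken for the start of a tag.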

Files

The following files are required for this assignment:

Hints

The following example Java classes from the Sockets lecture may be helpful for this assignment:

The tests expect you to download two URLs via a socket connection and parse the results. These test cases are:

You can use the "View Source" functionality in any browser to see the HTML being parsed by these tests.
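Downloading a page over a socket amounts to opening a TCP connection, writing an HTTP GET request, and reading the response past the headers. The following is a rough sketch of those steps under several assumptions: the class `SocketFetchSketch` and its methods are hypothetical and are not the lecture examples, only plain HTTP is handled (port 80 by default), and redirects, chunked transfer encoding, and HTTPS are ignored. The URL passed on the command line is up to the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.Socket;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not one of the lecture examples.
public class SocketFetchSketch {
    // Builds a minimal GET request. "Connection: close" asks the server to
    // close the socket after responding, so reading until end of stream works.
    public static String buildRequest(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + uri.getHost() + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    // Sends the request and returns everything after the blank line that
    // ends the response headers. Assumes UTF-8 text for simplicity.
    public static String fetch(URI uri) throws IOException {
        int port = uri.getPort() < 0 ? 80 : uri.getPort();
        try (Socket socket = new Socket(uri.getHost(), port);
             PrintWriter writer = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
            writer.print(buildRequest(uri));
            writer.flush();

            StringBuilder body = new StringBuilder();
            boolean inBody = false;
            String line;
            while ((line = reader.readLine()) != null) {
                if (inBody) {
                    body.append(line).append('\n');
                } else if (line.isEmpty()) {
                    inBody = true; // blank line separates headers from body
                }
            }
            return body.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java SocketFetchSketch <http-url>");
            return;
        }
        System.out.print(fetch(URI.create(args[0])));
    }
}
```

The returned HTML can then be passed to whatever tag-removal code you write, and compared against what "View Source" shows for the same URL.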