Eventually, we want to be able to crawl the web and parse web pages. As such, we must be able to find the links embedded in the HTML code of the web pages. This assignment is a step towards this functionality.
See the Javadoc comments in HTMLLinkParser and HTMLLinkParserExtraTest for additional details.
We do not explicitly cover HTML in this class. However, this markup language should be easy to pick up for any programmer. Some resources include:
You will need to be familiar with the anchor tag <a> for this assignment. This is the tag used to create links on web pages. For example:
    <a href="http://www.cs.usfca.edu/">USF CS</a>

The above code will generate the link USF CS, where the link text is USF CS and the link destination is http://www.cs.usfca.edu/. The link will always be placed in the href attribute of the a tag, but not all a tags will have this attribute.
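To make the parts of an anchor tag concrete, here is a deliberately simplified sketch that pulls the href value out of a tags using java.util.regex. The class name AnchorSketch, the method simpleLinks, and the pattern itself are assumptions made for illustration; they are not taken from HTMLLinkParser, and the pattern only handles double-quoted values with no spaces around the equals sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorSketch {
    // Hypothetical, intentionally naive pattern: an opening a tag, any other
    // attributes, then href="..." with the value captured in group 1.
    private static final Pattern SIMPLE =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"");

    // Returns the href values of all a tags the naive pattern can match.
    public static List<String> simpleLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = SIMPLE.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.cs.usfca.edu/\">USF CS</a> and "
                + "<a name=\"top\">no link here</a>";
        // The second tag has no href attribute, so it is skipped.
        System.out.println(simpleLinks(html));
    }
}
```

This sketch ignores uppercase tags, whitespace or line breaks inside the tag, and attribute names that merely end in href, so treat it as a starting point for thinking about the problem rather than a solution.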
The following files are required for this project:
Please download the above files and add them to your Java project in Eclipse to get started. See the Homework README for details on how to download individual files or subdirectories from this repository.
Below are some hints that may help with this homework assignment:
- Most people will only need to modify the REGEX in HTMLLinkParser. If you use an unusual regular expression, you may have to change the GROUP as well.
- You will likely want to use one or more flags in your regular expression.
- You can write the unit tests for HTMLLinkParserExtraTest BEFORE completing the regular expression in HTMLLinkParser.
You are not required to use these hints in your solution. There may be multiple approaches to solving this homework.
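As a rough illustration of how flags and capturing groups interact in java.util.regex, the sketch below matches a title tag rather than an anchor tag. The class FlagSketch and the values given to REGEX and GROUP are hypothetical; only the constant names echo the hints, and the actual contents of HTMLLinkParser may look different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlagSketch {
    // Hypothetical values for illustration only.
    public static final String REGEX = "<title>(.*?)</title>";

    // Index of the capturing group that holds the text we want.
    public static final int GROUP = 1;

    // Returns the trimmed text of the first title tag, or null if none is found.
    public static String firstTitle(String html) {
        // CASE_INSENSITIVE lets <TITLE> match <title>;
        // DOTALL lets . match line breaks inside the tag.
        Pattern pattern = Pattern.compile(REGEX,
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(GROUP).trim() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstTitle("<TITLE>\n  My Page\n</TITLE>"));
    }
}
```

Without either flag this input would not match at all. The same flags can also be written inline at the start of the expression as (?is), and adding or reordering parentheses in the expression changes which group index holds the value, which is why GROUP may need to change along with REGEX.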