dbennett455/DetectHtml

DetectHtml

A Java static method to detect text that has been marked up with HTML tags or entities.

I needed to detect self-contained HTML tags or entities in user-supplied data in order to make formatting determinations. Searching the Internet turned up a few regular expressions, but most of them failed my initial test cases. In particular, they didn't handle text that has no tags but does contain HTML entity escape codes.

I kept refining the regular expression until I had a good meta expression that handles:

  • Start and end tag combinations in single-line or multi-line text values.
  • Text marked up with self-closing tags such as <br/> or <hr/>.
  • Text marked up with HTML entity escape sequences like &lt; or &frac12;.

I also wanted to make sure that it didn't match common text that could be mistaken for HTML:

  • Logic expressions such as: "If A<B then B>A"
  • Ampersand usage: AT&T, D&B, etc.
  • Malformed or partial HTML such as: </body></html>
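
As a rough illustration of the cases listed above, here is a minimal sketch of one way such a check could be put together. It is not the expression DetectHtml actually uses. The class name `HtmlDetectSketch`, the method `looksLikeHtml`, and the three sub-patterns are all assumptions made for this example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, NOT the DetectHtml implementation.
final class HtmlDetectSketch {

    // A start tag followed later by its matching end tag, e.g. <a ...> ... </a>.
    // The back-reference \1 requires the same tag name on both sides.
    private static final String TAG_PAIR =
            "<([a-z][a-z0-9]*)\\b[^>]*>.*?</\\1\\s*>";

    // A self-closing tag such as <br/> or <hr />.
    private static final String SELF_CLOSING =
            "<[a-z][a-z0-9]*\\b[^>]*/>";

    // A named or numeric entity such as &lt; &frac12; &#169; &#xA9;
    // Named entities need at least two characters, so "AT&T;" is not a match.
    private static final String ENTITY =
            "&(?:[a-z][a-z0-9]{1,31}|#[0-9]{1,7}|#x[0-9a-f]{1,6});";

    // DOTALL lets ".*?" cross line breaks for multi-line values.
    private static final Pattern HTML = Pattern.compile(
            TAG_PAIR + "|" + SELF_CLOSING + "|" + ENTITY,
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private HtmlDetectSketch() { }

    // Returns true if any of the three sub-patterns occurs in the text.
    static boolean looksLikeHtml(String text) {
        return text != null && HTML.matcher(text).find();
    }
}
```

With these sub-patterns, `<br/>` and `&frac12;` are reported as HTML, while `If A<B then B>A`, `AT&T`, and `</body></html>` are not. A production expression would likely need more care than this, for example around attribute values that contain `>`.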

Sample Usage

    // A start tag, multi-line content, and a matching end tag
    String htmlContent = "<a href=\"http://www.example.com/\">\nclick here\n</a>";
    if (DetectHtml.isHtml(htmlContent)) {
      System.out.println("htmlContent is HTML");
    }
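
The motivation mentioned earlier, using the check to make formatting determinations, might look something like the sketch below. Everything in it is an assumption for illustration: the class `DisplayFormatSketch`, its two methods, and the choice to escape plain text and turn newlines into `<br/>`. The detector is passed in as a `Predicate<String>` so the example compiles on its own.

```java
import java.util.function.Predicate;

// Hypothetical sketch of a formatting decision based on an HTML check.
final class DisplayFormatSketch {

    private DisplayFormatSketch() { }

    // Escapes plain text for display in an HTML page and keeps line breaks.
    // The ampersand is replaced first so later replacements are not re-escaped.
    static String escapePlainText(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("\n", "<br/>\n");
    }

    // Leaves text that is already marked up alone; escapes everything else.
    // Note: markup that passes through here has NOT been sanitized.
    static String toDisplayHtml(String text, Predicate<String> isHtml) {
        if (isHtml.test(text)) {
            return text;
        }
        return escapePlainText(text);
    }
}
```

In a project that already includes the class from this repository, the call could be written as `DisplayFormatSketch.toDisplayHtml(userText, DetectHtml::isHtml)`.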

Please Note:

This does not check user-provided HTML for safety in any way. You still need to sanitize your HTML, and I recommend OWASP for that.


No dependencies are required. Just copy the class into your project and you're done.

--Dave
