Project Documentation

An intelligent browsing system that takes topic keywords as input, scrapes the web for relevant documents, and generates an inverted index and visualizations of the most frequent relevant words.

Team Members

David Ho - davidsh3
Ben Yang - bhyang2
Nicholas Truong - ntruong3 (Captain)
Jun Zhong - jmzhong2

Purpose

When embarking on a new research project, a researcher may have only general knowledge of the topic and be familiar with only a limited set of keywords. Currently, such a researcher queries the keywords they already know, browses the results, and pulls out less-familiar keywords related to the project. The researcher then combines the familiar and less-familiar keywords to create more effective queries. This intelligent browsing project generates statistical visualizations relevant to a limited-keyword query, helping the researcher discover, more quickly and easily, the keywords that lead to an effective query. The program takes known keywords as input, scrapes the documents returned for that query, builds an inverted index of the most frequent relevant words in those documents, and then generates statistical visualizations of those words.

System Requirements

Languages and Modules

-Python 3.7: requests, urllib, sys, string, pandas, requests_html (HTML, HTMLSession), BeautifulSoup, MetaPy, numpy, glob, pathlib, io, os, shutil, seaborn, matplotlib, flask

-JavaScript

Installation Documentation

Download the folder from the GitHub page: github.com/nstruong13/CourseProject
Open the Extensions page of the Chrome browser (chrome://extensions/) and enable Developer mode in the top right corner.
Click "Load unpacked" in the top left corner, navigate to the CourseProject-main folder, and select the chromeExtension folder. Enable the ScraperProject extension. (The modules and packages that must be installed, most importantly Flask, are listed above under Languages and Modules.)
Open a terminal, cd to the local CourseProject-main folder, and enter the command "python apiServer.py". If the command succeeds, it prints a line reading Serving Flask app "apiServer", and the last line printed is "Running on [address]".
Initiate the extension by clicking on its icon in the extensions bar (or drop-down list) at the top right corner of the browser.
Enter a query consisting of the term(s) for which you want to find related terms. (The query the presenter used was Black Scholes.) The extension displays a heatmap showing the most related words by document and term frequencies. The heatmap is followed by gradient-colored postings tables, one for each of the 10 related terms, that you can scroll through. (Since this extension is in its early stages, scraping can take anywhere from a second to more than a few seconds. If nothing is displayed, check the terminal for a missing module or another error.)
To compare two queries, enter another query; its heatmap and accompanying postings tables will appear below the output of the first query. For a fresh output, click off the extension and then click its logo again.
The terminal shows all the vocabulary scraped from the relevant sites, in case the 10 most related terms are still unfamiliar to the user. The terminal also lists the sites that could not be scraped.
For an offline copy of the heatmap, a .png file is generated in the CourseProject-main folder after every query. Save a copy elsewhere if you want to keep it, because it is overwritten by the next query.
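
Because the heatmap file is replaced on every query, one way to keep earlier results is to copy it under a timestamped name after each run. The following is a minimal sketch, not part of the project: the function name archive_heatmap and the file name heatmap.png are hypothetical, since this README does not state the actual name of the generated .png.

```python
import shutil
import time
from pathlib import Path


def archive_heatmap(png_path, archive_dir):
    """Copy the latest heatmap into archive_dir under a timestamped name.

    Hypothetical helper: the real output file name is not given in the README.
    """
    src = Path(png_path)
    dest_dir = Path(archive_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Timestamp to the second, so a copy made after each query gets its own name.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, dest)
    return dest
```

For example, archive_heatmap("heatmap.png", "saved_heatmaps") would be run from the CourseProject-main folder after a query, with the file name adjusted to whatever the program actually writes.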

Code Sections and Team Contributions

Web Scraper: Developed by David Ho

The web scraper appends the user query to a Google search URL and returns a list of links from the first page of results. It removes unwanted sites, such as YouTube and further Google pages, as well as duplicate sites. It then scrapes each site's body text using Beautiful Soup, removes punctuation and non-text content, and writes the results to a .dat file for further processing with MetaPy.
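
The link-filtering and text-cleaning steps described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the code in ScrapeMineViz.py: the function names, the exact search URL, and the blocked-domain list are hypothetical, and the actual fetching and Beautiful Soup parsing are left out.

```python
import string
from urllib.parse import quote_plus, urlparse

# Assumption: the README names YouTube and Google only as examples of unwanted sites.
BLOCKED_DOMAINS = ("youtube.com", "google.com")


def build_search_url(query):
    """Append the URL-encoded query to a Google search URL (assumed URL form)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


def filter_links(links):
    """Drop links on blocked domains and exact duplicates, preserving order."""
    seen = set()
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
            continue
        if link in seen:
            continue
        seen.add(link)
        kept.append(link)
    return kept


def clean_text(text):
    """Replace ASCII punctuation with spaces and collapse runs of whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())
```

Replacing punctuation with spaces, rather than deleting it, keeps hyphenated or slash-separated words from being fused into a single token before tokenization.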

Text Analysis: Developed by Ben Yang

Using MetaPy, the text output by the web scraper is tokenized into unigrams, which are lowercased, filtered for stop words, and stemmed. The unigrams are then analyzed by building an inverted index, with a lexicon and postings, to obtain each term's document frequency, total term frequency, and term frequency per document. The terms can then be sorted so that the most frequent terms associated with the search query appear at the top of the list. This analysis feeds the visualizations shown to the user.
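
To make the terms "postings", "document frequency", and "term frequency" concrete, here is a small inverted index in plain Python. It is a sketch for illustration only and does not use MetaPy: the stop-word set is a tiny made-up sample, stemming is omitted, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: a tiny illustrative stop-word set, not the project's actual list.
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to its postings: {doc_id: occurrences of the term in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def term_stats(index):
    """Return (term, document frequency, total term frequency), most frequent first."""
    rows = [(term, len(postings), sum(postings.values()))
            for term, postings in index.items()]
    # Sort by total frequency, then document frequency, then alphabetically.
    return sorted(rows, key=lambda r: (-r[2], -r[1], r[0]))


docs = [
    "the option price model",
    "option volatility and option price",
    "volatility of the market",
]
index = build_inverted_index(docs)
print(index["option"])       # postings for "option": {0: 1, 1: 2}
print(term_stats(index)[0])  # ('option', 2, 3)
```

Here "option" appears in two documents (document frequency 2) a total of three times (term frequency 3), and its postings record one occurrence in document 0 and two in document 1.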

Data Visualizations: Developed by Jun Zhong

The content of the visualizations is based on the ten most related words generated by the text analysis. Built with Seaborn, the visualizations are designed to be friendly to a user without a background in text analysis and mining, helping such a researcher find the terms most relevant to the terms in their query. The primary visualization is a Seaborn heatmap, a fitting choice for data indexed by two categorical variables, here the frequency type and the term. It uses a perceptually uniform color spectrum to guide users to the terms with the highest document frequency. The visualization also includes clarifications for users unfamiliar with terminology that we know from this class: document frequency and term frequency. The secondary visualization is a set of ten sub-tables describing the postings of each term listed in the primary visualization, identifying the document(s) in which the term appears and the number of occurrences in each. This helps the researcher discover which related terms are overly concentrated in a few documents, which makes their high total frequencies less meaningful than those of a well-distributed term.
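
The idea of a term being "overly concentrated" can be illustrated numerically. The project conveys this visually through the postings tables; the sketch below is only one possible way to quantify it, assuming postings are stored as {doc_id: count} dictionaries. The function names and the 0.8 threshold are hypothetical choices, not part of the project.

```python
def concentration(postings):
    """Share of a term's total occurrences found in its single heaviest document."""
    total = sum(postings.values())
    return max(postings.values()) / total if total else 0.0


def flag_concentrated(index, threshold=0.8):
    """Return terms whose heaviest document holds at least `threshold` of all occurrences."""
    return sorted(term for term, postings in index.items()
                  if concentration(postings) >= threshold)


# Made-up postings for illustration: {term: {doc_id: count}}.
example_index = {
    "gamma": {0: 9, 1: 1},          # 90% of occurrences in one document
    "option": {0: 3, 1: 3, 2: 4},   # spread across three documents
}
print(flag_concentrated(example_index))  # ['gamma']
```

Both example terms occur ten times in total, but "gamma" owes nine of those to a single document, so its total frequency says less about the topic as a whole than the same total for "option".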

Chrome Extension: Developed by Nicholas Truong

The Chrome extension can be deployed to the Chrome Web Store so that anyone can download it; however, to run it locally, navigate to chrome://extensions/ in your Chrome browser. On the top right, enable Developer mode. Then click the button on the left that says “Load unpacked” and select the chromeExtension folder from the project directory. This automatically loads the extension into your browser. The extension needs to communicate with an API server to make the request for the query text. If this project were to be scaled out, the server would need to be hosted by a cloud provider such as AWS. For local development and testing, go into the project directory and run python3 apiServer.py. The API server will spin up automatically, and you can then test the Chrome extension. Note: the extension can only handle one query request per session. If you are trying multiple queries, please re-click on the extension.

