View these docs here if you like pretty colors.
Clone the ClipboardApp repository into your preferred directory with Git Bash on Windows or a normal terminal otherwise: git clone https://github.com/ClipboardProject/ClipboardApp.git
For Windows Home, download from here. Documentation is here.
For Windows Professional or Enterprise, download from here. Documentation is here.
For Mac, download from here. Documentation is here.
For Linux, download from your package manager. Documentation is here (Other distros have links on the left side of the page).
Make sure you follow any OS and distro-specific instructions for setting up Docker. It may be helpful to go through the getting started guide here.
These steps are for Ubuntu. Arch Linux has Docker available in pacman without any manual steps required. Other distros may require different steps.
If you're new to Docker or you're recovering from a failed installation attempt, it's best to start by uninstalling older versions of Docker: sudo apt-get remove docker docker-engine docker.io
Run: sudo apt-get update
Install the following packages:
sudo apt-get install apt-transport-https
sudo apt-get install ca-certificates
sudo apt-get install curl
sudo apt-get install software-properties-common
These allow apt to use a repository over HTTPS
Add Docker's official GNU Privacy Guard (GPG) key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
This should print "OK" to the terminal.
Run: sudo apt-key fingerprint 0EBFCD88
Verify that the Key Fingerprint line shows: 9DC8 5822 9FC7 DD38 854A E2D8 8D81 803C 0EBF CD88
Set up the stable Docker repository:
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
Run: sudo apt-get update again.
Install the latest version of Docker CE: sudo apt-get install docker-ce
If there were problems during the installation, try removing docker and starting over.
sudo apt-get purge docker-ce
sudo rm -rf /var/lib/docker
Run: sudo curl -L https://github.com/docker/compose/releases/download/1.21.2/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
Add executable permissions to the docker-compose binary: sudo chmod +x /usr/local/bin/docker-compose
Run docker-compose --version to verify it installed correctly. It should show a version and build number similar to:
"docker-compose version 1.21.2, build 1719ceb"
If the docker-compose command doesn't work, add the following line to your ~/.bashrc file so that the directory containing the binary is on your PATH:
export PATH="/usr/local/bin:$PATH"
Close and reopen your terminal(s) to apply the changes.
I was unable to get Kitematic to work on Docker Toolbox, so I would recommend skipping that. Make sure virtualization is enabled in the BIOS. If you need to change virtualization settings, do a full reboot cycle; otherwise, Windows may not report that the settings have changed. If you're running Windows 10 Professional, you'll need to make sure Hyper-V is enabled in the "Turn Windows Features On or Off" dialog. If you're using Docker Toolbox on Windows Home edition, you'll want to start the VirtualBox instance manually every time before starting Docker, or Docker will complain about not having an IP address.
If you are using Linux, all of the subsequent Docker commands in this guide might have to be run with sudo.
If you would like to be able to use Docker without sudo, look through the answers here.
If you're using Docker Toolbox on Windows Home, replace localhost with 192.168.99.100 wherever it appears in the rest of this guide. This is because the Docker engine can't bind to localhost when using Docker Toolbox.
Verify that Docker installed correctly with: docker run hello-world. You should see, "Hello from Docker!"
The startup process runs a Python script to check whether any Docker images are out of date, so you will need Python 3.6 or newer installed for that script to run properly. If you do not have it, the script will throw an error, but this won't affect the rest of the process, so this part can be skipped if you want. If you're on Windows and do not have Python set up, you should install Anaconda from here. During setup, check the box that says to add Python to your system path; if you do not, you'll need to add it to your path later. Without Python being accessible in the system path, Python commands won't be visible to terminals like the Docker Terminal or Git Bash. On Mac or Linux, you can install Python directly from your system's package manager. If your Python version is too old, I recommend using pyenv to manage your Python versions.
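If you're not sure whether your interpreter is new enough, a quick check like the following can help. This is a minimal sketch, not part of the repo; the function name `meets_minimum` is made up for illustration, and the only fact taken from this guide is the 3.6 minimum.

```python
import sys

# Minimum version stated in this guide.
MINIMUM_PYTHON = (3, 6)


def meets_minimum(version_info=sys.version_info, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if meets_minimum():
    print("Python version is new enough.")
else:
    print("Python %d.%d is too old; 3.6 or newer is needed." % tuple(sys.version_info[:2]))
```

Run it with the same python command your terminal uses, since that is the interpreter the startup script will pick up.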
Once Python is set up, run scripts/install.sh from the root of the GitHub repo to install the necessary Python dependencies.
Open a Docker terminal on Windows Home, Git Bash or some kind of bash emulator on Windows Professional, or a normal terminal otherwise, and cd into the Git repo. Run ./start.bash.
If you get a permissions error, you may need to run chmod +x start.bash. This will grant execution permissions to the file.
If all goes well, the database will be created, the scrapers will start running, and the website will start up. This process will take some time.
Eventually, you should start seeing messages about events being saved. Once a message says Data retrieved successfully, the code is done running.
Several components should be visible now:
- localhost and localhost:3000 will show the site
- localhost:5000/docs will show a frontend for viewing the data and testing the API
- localhost:9000 will show a frontend for managing the Docker containers. Create whatever username and password you want.
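Because the host differs between a normal Docker install and Docker Toolbox, it can be handy to generate the addresses rather than remember them. The sketch below is illustrative only: the dictionary keys and the function name are invented here, and the `http://` scheme is an assumption; the ports, paths, and the Docker Toolbox IP come from this guide.

```python
# IP used instead of localhost with Docker Toolbox, per this guide.
DOCKER_TOOLBOX_HOST = "192.168.99.100"

# Ports and paths from the list above; the key names are hypothetical labels.
SERVICES = {
    "site": (3000, "/"),
    "api_docs": (5000, "/docs"),
    "container_manager": (9000, "/"),
}


def service_urls(host="localhost"):
    """Build one URL per service for the given host (http scheme is assumed)."""
    return {
        name: "http://%s:%d%s" % (host, port, path)
        for name, (port, path) in SERVICES.items()
    }


for name, url in service_urls().items():
    print(name, url)
```

On Windows Home with Docker Toolbox, call `service_urls(DOCKER_TOOLBOX_HOST)` instead.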
If you want to see more details about the data in the database, download NoSQL Booster from here. You can use another MongoDB client if you'd prefer. Create a connection to localhost:27017 and
you should see the data show up.
For debugging, we've set up configurations to allow for remote debugging in Docker using VS Code. This allows you to set breakpoints and step through code remotely while it's running in Docker.
You can use another editor if you'd like, but you'll have to set up remote debugging yourself. Whenever you open VS Code, it creates a directory called .vscode which stores local configurations.
This repo contains all of the components needed to run the system in separate folders:
- clipboardapp/in2it_site contains all code pertaining to the site itself
- clipboardapp/event_processor contains the web scrapers
- clipboardapp/event_service contains the API that the site and the event processor both call to interface with the database
When you're developing, you'll want to think of those folders as separate projects and open a separate instance of VS Code in each of those subdirectories. This is important because the remote debugger requires
the folder structure of the remote and local repository to match. To do so, you can launch VS Code, then choose File -> Open Folder or open it from the command line like this: code ./in2it_site.
Once you have VS Code open, you should see a bug icon on the left panel. This contains the debugger settings. If you click the gear icon near the top right of the submenu, it will open a prompt to choose an environment.
It doesn't matter which one you choose because we'll overwrite this file in a minute. In this repo, there is a folder called sample_vscode_config with one config per component. Replace the entire launch.json file with
whatever config matches your current folder. As the comments in the files explain, you will need to replace localhost with 192.168.99.100 for Docker Toolbox.
Once you have the configuration saved, you'll be able to select it from the debug menu. When you have the code running in Docker, click the green arrow to attach to the running process.
All of the code is running through a program called nodemon which allows you to use hot reloading while debugging. Hot reloading means that any time you change the source code in your editor, nodemon will detect the change and automatically restart the attached process. This way, testing your changes requires no manual intervention.
The following parameters can be passed to start.bash to change its runtime behavior.
- -d or --processor-debug: This parameter is needed when using the debugger with the event processor. When passed in, runner.py will pause at the start of execution until you connect to it from the VS Code debugger. This isn't needed by the other components because you can attach to a Node process without any special configuration.
- -v or --verbose-output: This tells Scrapy to send verbose output to the logs. Otherwise, only errors will be displayed. Scrapy generates a lot of output, so this is only useful when debugging odd behavior.
- -s or --run-scheduler: This tells the event processor to run the scrapers on a schedule (currently once a minute in dev and once every two hours in prod). It's easier to test without this flag since it will run them all at once when this is not passed in.
The following settings are defined in event_processor/config.py:
- enable_api_cache: If True, any API calls made will be cached to a local file. This is useful to speed up development and to prevent hitting sites repeatedly.
- api_cache_expiration: Time in seconds that API data will be cached for.
- api_delay_seconds: The amount of time between API calls. This is used by calling ApiBase.wait(). This is necessary when making large amounts of API calls in quick succession so as not to overrun the server.
- enable_scrapy_cache: If True, any Scrapy calls made will be cached using Scrapy's builtin cache system. This is useful to speed up development and to prevent hitting sites repeatedly.
- scrapy_cache_expiration: Time in seconds that Scrapy data will be cached for.
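To make the meaning of the expiration and delay settings concrete, here is a rough sketch of how such values could be used. It is not the code in event_processor: the values are placeholders, and `is_cache_fresh` and `DelayedCaller` are hypothetical stand-ins (how ApiBase.wait() is really implemented is not shown in this guide).

```python
import time

# Placeholder values; the real ones are defined in event_processor/config.py.
api_cache_expiration = 3600  # seconds
api_delay_seconds = 2.0      # seconds


def is_cache_fresh(saved_at, now, expiration_seconds):
    """A cached entry is usable while it is younger than the expiration time."""
    return (now - saved_at) < expiration_seconds


class DelayedCaller:
    """Hypothetical helper that spaces calls out, in the spirit of ApiBase.wait()."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self._delay = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep until delay_seconds have passed since the previous call.

        Returns the number of seconds slept.
        """
        slept = 0.0
        if self._last_call is not None:
            remaining = self._delay - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept
```

The clock and sleep functions are injectable only so the behavior can be checked without actually waiting.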
Our current development tasks and bugs are kept in the issues list here.
The easiest way to learn the code base and get started contributing is to add a new scraper as defined in this issue.
The issue contains instructions on how to pick a specific site.
This project consists of four parts:
- Event Processor: This is the heart of the application. It asynchronously scrapes websites and pulls in data from APIs, cleans and formats the data, then sends it to the MongoDB client.
- Event Service: This is a standalone service that receives data from the event processor for insertion into MongoDB and processes requests from the clipboard site to display data to the user. Any time data is received from a website, the old data from that site is deleted and refreshed with the new data.
- MongoDB Instance: This holds a single collection of all data from the sites. Only the database client interacts with the database.
- In2It Site: The website that displays the aggregated data. Interacts with the database via the database client.
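The delete-and-refresh behavior of the Event Service can be pictured with a small in-memory sketch. This is not the service's code: the function name, the `site` key, and the use of a plain list in place of the MongoDB collection are all assumptions made for illustration.

```python
def refresh_site_events(stored_events, site, new_events):
    """Return the stored events with everything from `site` replaced by `new_events`."""
    # Drop the old data for this site (the "site" key is hypothetical) ...
    kept = [event for event in stored_events if event.get("site") != site]
    # ... then add the freshly received data, tagged with the same site.
    return kept + [dict(event, site=site) for event in new_events]


stored = [
    {"site": "library", "title": "Old story time"},
    {"site": "museum", "title": "Gallery tour"},
]
stored = refresh_site_events(stored, "library", [{"title": "New story time"}])
```

Data from other sites is left alone; only the site that just reported is replaced.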
As stated previously, adding a scraper is the best way to start contributing. If you're not familiar with web scraping, this gives a decent overview of what web scraping is. We're using Scrapy for this project, which is a complex and sophisticated web scraping framework. If you'd like to start with a tutorial that will help you learn more about how to write a scraper without worrying about the complexities of Scrapy, take a look at this guide, which uses a library called BeautifulSoup. If you're comfortable with the concepts used in web scraping, take a look at this tutorial. Ignore the installation instructions because you should have installed Scrapy earlier in this guide.
Scrapy uses the CssSelect module to implement CSS selectors. Docs can be found here. CssSelect defines its selectors according to the W3C specification here, with a few exceptions that are listed in CssSelect's documentation.
Most websites that we're dealing with will need to be scraped because the data on them is statically loaded from the server as HTML. However, some sites use APIs to dynamically load data. We should use these whenever possible because scrapers are fragile and need to be changed any time the content on the page changes. APIs are more stable and are less likely to have breaking changes introduced often.
Here is an example of how to detect if a site has an API we can use.
- Go to https://chipublib.bibliocommons.com/events/search/index in Google Chrome
- Open the developer tools using F12 on Windows/Linux and Command+Option+I on Mac
- Click on the "Network" tab at the top of the toolbox
- Reload the page. The grid should be populated with data.
- Click on the "Name" column for any of the requests. A detailed view should appear and the "Headers" tab should be selected.
- Click on the "Response" tab. This view can contain a variety of data. For resource requests like images, it will say there is no data available; JavaScript files will show the JavaScript code, CSS files will show the stylesheet, etc. The only response data we care about right now is JSON.
- Look for a request name that starts with "search?". Looking through the response, you should see a JSON object.
- Click on the "Headers" tab. The Request URL is what was requested by your browser to retrieve the json data. We can use that same url to get that data in our application.
- If you keep clicking through more requests, you should see several more that also returned json data.
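Once you have copied a Request URL from the "Headers" tab, you can try fetching it outside the browser. The following is a minimal standard-library sketch, not the project's API client; `fetch_json` is a made-up helper, and the URL shown is a placeholder that you would replace with the one you copied.

```python
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen):
    """Request `url` and decode the response body as JSON.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without network access.
    """
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder; paste the Request URL you found in the developer tools:
# data = fetch_json("https://example.invalid/search?page=1")
# print(type(data))
```

If the call raises a JSON decoding error, the URL most likely returned HTML rather than JSON, which suggests it is not the API request you were looking for. Some sites may also reject requests that lack browser-like headers.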
This is the code that was used to create an API client for that site.
You can use this as a guide if you need to create your own API client. Some sites have APIs that are well-documented and designed for external use. These should be used if they are available.
Some sites may provide an iCalendar feed. Try to use the iCal reader if it is possible to do so.
Some sites may also provide an RSS feed. This is an example of how to use the feedparser module to parse a feed.
All new scrapers should inherit from one of the classes listed here. API clients should inherit from ApiSpider, and scrapers should inherit from ScraperSpider or ScraperCrawlSpider, depending on whether the spider needs to visit multiple URLs.
The end goal of all scrapers and API clients is to transform the raw data into event objects that conform to the Event class in this file.
For each item, you'll want to parse out the following data (as much as is available).
- organization: The name of the organization that's putting on the event
- title: The name of the event
- description: Detailed description of the event
- address: Location of the event (okay if exact address is not known)
- url: Link to the event. A link to the specific event is preferred, but a link to a page containing general event listings is okay.
- price: Cost to attend, if provided
- category: Category of event, as defined here. (Work in progress. We'll flesh out categories more eventually)
- Start/End Time and Date: Dates and times can be supplied with several parameters. Choose one date format and one time format. Eventually, all dates and times will be converted into Unix timestamps.
  - time: Use if only one time is supplied for the event (not a time range)
  - start_Time and End_Time: Use if the site supplies distinct data for these two values
  - time_Range: Use if the start and end time are supplied in a single string, e.g. 6:00-8:00 PM
  - date: Use if the event could be one day or multiple days but it is contained in a single string. This is done this way because some sites have data that could be single days or multiple days.
  - start_date and end_date: Use if the site supplies distinct data for these two values
  - start_timestamp and end_timestamp: Use if the data is formatted like a Unix timestamp (unlikely for scrapers but possible for an API)
Once you've decided how to find these fields for your site, look at the existing examples to see what methods to use to extract the data.
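To see what a single time-range string such as "6:00-8:00 PM" actually encodes, the sketch below splits one into a start and an end time. It is purely illustrative: `split_time_range` is a made-up helper, the project's own conversion code is not shown in this guide, and the rule that a start time without AM/PM inherits it from the end time is an assumption.

```python
from datetime import datetime, time


def split_time_range(time_range):
    """Split a string like '6:00-8:00 PM' into (start, end) datetime.time values."""
    start_text, end_text = [part.strip() for part in time_range.split("-", 1)]
    # Assumption: if only the end carries AM/PM, the start shares it.
    if not start_text.upper().endswith(("AM", "PM")):
        start_text = "%s %s" % (start_text, end_text[-2:])

    def parse(text):
        return datetime.strptime(text.upper(), "%I:%M %p").time()

    return parse(start_text), parse(end_text)


start, end = split_time_range("6:00-8:00 PM")
print(start, end)
```

Strings in other layouts (for example without minutes) would need a different format string, which is one reason to check how the existing spiders supply their values first.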