
feat: add direct remote access over s3 and https via warcio >= 1.8.0 #25

Draft
handecelikkanat wants to merge 1 commit into main from
feat/s3-access-via-warcio1.8



@handecelikkanat handecelikkanat commented Apr 9, 2026

From https://github.com/commoncrawl/issues/issues/684

This PR adds direct remote access (over s3 and https) to WARC/WET/WAT files stored in S3 buckets, using warcio.

warcio has supported direct remote file access over s3 and https since version 1.8.0; see the changelog: https://github.com/webrecorder/warcio/blob/master/CHANGELIST.rst

This PR adds:

  • An fsspec_open call from warcio.utils, replacing the local open call in warcio-iterator.py
  • A new make target for direct remote access over s3: make iterate-remote-s3
  • [Pending moving the files to a public bucket] A new make target for direct remote access over https: make iterate-remote-https
  • A new section in README.md on running these targets: Task 2-i: Iterating over "Remote" Files
  • A new section on using warcio index remotely
  • A new requirement: warcio[s3]>=1.8.0
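
For reviewers who want a feel for what remote iteration looks like, here is a minimal sketch. It is not the code in this PR: it calls fsspec.open directly rather than the fsspec_open helper from warcio.utils, whose exact signature is not shown here. The function names and the example URL in the comments are hypothetical, and the s3 case assumes s3fs is installed and credentials are configured.

```python
from urllib.parse import urlparse

# Schemes treated as remote in this sketch (an assumption, not the PR's logic).
REMOTE_SCHEMES = {"s3", "http", "https"}


def is_remote(path):
    """Return True if the path uses one of the remote URL schemes above."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def iter_target_uris(path):
    """Yield the WARC-Target-URI of each response record in a WARC file.

    The path may be a local file or an s3:// or https:// URL.
    """
    # Imported here so that is_remote() stays usable without these packages.
    import fsspec
    from warcio.archiveiterator import ArchiveIterator

    # fsspec.open gives a binary file-like object for local and remote paths;
    # ArchiveIterator handles the gzip decompression of the records itself.
    with fsspec.open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                yield record.rec_headers.get_header("WARC-Target-URI")


# Usage (hypothetical URL, replace with a real file):
# for uri in iter_target_uris("s3://example-bucket/path/to/file.warc.gz"):
#     print(uri)
```

Because the same function accepts a local path, the existing local workflow and the remote one can share a single code path.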

Note

  • I am keeping the task/target that iterates over local files from the GitHub repo (make iterate).
  • I think it is a gentle start.
  • It may also encourage people to keep the files locally, so they can inspect them up close.
