Add storage.cloud documentation and example scripts #241
Open
Sazwanismail wants to merge 1 commit into Visual-Studio-Code:main from
Conversation
Added documentation and example scripts for Google Cloud Storage usage, including quickstart guides, data merging, and CORS configuration.

````markdown name=README.md
# storage.cloud — Google Cloud Storage (GCS) docs & quickstart
storage.cloud is a compact documentation and example repo with copy-pastable commands and small scripts for common Google Cloud Storage tasks:
- Authentication (gcloud ADC, service accounts, access tokens)
- Browser access vs programmatic access (storage.cloud.google.com vs API)
- Signed URLs and sharing
- CORS configuration for browser clients
- Practical patterns to merge many objects (CSV/text) in GCS
- Load patterns for BigQuery and recommendations for large datasets
Repository layout
- index.html — simple landing page / site overview
- docs/
  - quickstart.md — commands & short how-tos
  - merge-data.md — quick guide for combining files (originally written in Malay)
  - signed-urls.md — signing approaches and examples
- examples/
  - merge_csv_gcs.py — small Python script to merge CSVs under a GCS prefix
  - cors.json — CORS example
- LICENSE — suggested license (MIT)
Quickstart highlights
- Authenticate locally (ADC):
  gcloud auth application-default login
- Get an access token for the Authorization header:
  gcloud auth print-access-token
- List buckets:
  gsutil ls gs://
- Copy to/from GCS:
  gsutil cp ./local-file gs://BUCKET/OBJECT
  gsutil cp gs://BUCKET/OBJECT ./local-file
- Create a signed URL:
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT
Security reminders
- Use service accounts with least privilege in production.
- Signed URLs grant access to anyone with the URL until expiry.
- Avoid making sensitive objects public; prefer signed URLs or IAM roles.
Contributing
- Send PRs or issues with improvements.
- Keep examples minimal and targeted.
- Include tests or usage examples when adding scripts.
Possible follow-ups
- Expand any doc into a full step-by-step tutorial (with screenshots).
- Add a shell helper to compose >32 objects safely (tree compose).
- Add a Dataflow/Beam starter pipeline for large-scale merges.
````
```html name=index.html
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>storage.cloud — Google Cloud Storage quickstart</title>
<meta name="viewport" content="width=device-width,initial-scale=1"/>
<style>
body { font-family: system-ui, -apple-system, Roboto, "Segoe UI", Arial; max-width: 980px; margin: 36px auto; line-height: 1.6; padding: 0 20px; color:#111 }
pre{ background:#f6f8fa; padding:12px; overflow:auto; border-radius:6px; }
header {border-bottom:1px solid #eee; margin-bottom:20px; padding-bottom:12px}
h1{margin:0;padding:0}
nav a{margin-right:12px;color:#1a73e8;text-decoration:none}
small { color:#666 }
</style>
</head>
<body>
<header>
<h1>storage.cloud</h1>
<p><small>Quick reference and starter content for Google Cloud Storage (GCS)</small></p>
<nav>
<a href="docs/quickstart.md">Quickstart</a>
<a href="docs/merge-data.md">Combine data</a>
<a href="docs/signed-urls.md">Signed URLs</a>
<a href="examples/merge_csv_gcs.py">Examples</a>
</nav>
</header>
<main>
<section>
<h2>Overview</h2>
<p>storage.cloud collects short guides, command snippets, and small reusable scripts for common GCS workflows:</p>
<ul>
<li>Authenticated access (gcloud, service accounts, access tokens)</li>
<li>Signed URLs for temporary public access</li>
<li>CORS configuration for browser-based apps</li>
<li>Merging objects (CSV/text) and loading to BigQuery</li>
</ul>
</section>
<section>
<h2>Helpful links</h2>
<ul>
<li><a href="https://storage.cloud.google.com/">storage.cloud.google.com</a> — web UI (requires Google sign-in)</li>
<li><a href="https://console.cloud.google.com/storage/browser">Cloud Console — Storage browser</a></li>
<li><a href="https://cloud.google.com/storage/docs">GCS documentation</a></li>
</ul>
</section>
<section>
<h2>Get started</h2>
<p>Open <code>docs/quickstart.md</code> for commands and brief examples you can run locally. For combining files, see <code>docs/merge-data.md</code> and <code>examples/merge_csv_gcs.py</code>.</p>
</section>
</main>
</body>
</html>
```
````markdown name=docs/quickstart.md
# Quickstart — Google Cloud Storage (GCS)
This page collects the most-used commands and short examples for getting started with GCS.
Prerequisites
- Install Google Cloud SDK (gcloud, gsutil)
- For programmatic examples, install the relevant client libraries (Python: google-cloud-storage)
Authentication
- Local developer (Application Default Credentials - ADC):
  gcloud auth application-default login
- Service account (recommended for server apps):
  1. Create:
     gcloud iam service-accounts create my-sa --display-name="My SA"
  2. Grant roles (example: objectViewer):
     gcloud projects add-iam-policy-binding PROJECT_ID \
       --member="serviceAccount:my-sa@PROJECT_ID.iam.gserviceaccount.com" \
       --role="roles/storage.objectViewer"
  3. (Optional) Create a key for local use:
     gcloud iam service-accounts keys create key.json \
       --iam-account=my-sa@PROJECT_ID.iam.gserviceaccount.com
Access tokens
- Print an access token for HTTP Authorization:
  gcloud auth print-access-token
- Use as a header: Authorization: Bearer <ACCESS_TOKEN>
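As an illustration, here is how such a token can be attached to a GCS JSON API request. This is a minimal sketch using only the Python standard library; the bucket, object, and token values are placeholders:

```python
# Sketch: building an authenticated GCS JSON API request with a bearer
# token. Bucket/object names are placeholders; obtain a real token with
# `gcloud auth print-access-token`.
import urllib.parse
import urllib.request

def build_gcs_download_request(bucket: str, obj: str, token: str) -> urllib.request.Request:
    # Object names must be URL-encoded in the path; alt=media requests the raw bytes.
    encoded = urllib.parse.quote(obj, safe="")
    url = f"https://storage.googleapis.com/storage/v1/b/{bucket}/o/{encoded}?alt=media"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})

req = build_gcs_download_request("my-bucket", "data/file.csv", "ACCESS_TOKEN")
print(req.full_url)
# Send with urllib.request.urlopen(req) once a real token is in place.
```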
Common gsutil commands
- List buckets:
  gsutil ls gs://
- List objects in a bucket/prefix:
  gsutil ls gs://BUCKET/PREFIX/
- Download an object:
  gsutil cp gs://BUCKET/OBJECT ./local-file
- Upload a file:
  gsutil cp ./local-file gs://BUCKET/OBJECT
- Make an object public (use sparingly):
  gsutil acl ch -u AllUsers:R gs://BUCKET/OBJECT
Signed URLs
- Quick way to create an expiring URL using a service account key:
  gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT
- Notes:
  - V4 signed URLs are supported; the maximum expiry is 7 days.
  - Signed URLs allow access without a Google account.
CORS (for browser clients)
- Example file: cors.json (in this repo)
- Apply it:
  gsutil cors set cors.json gs://BUCKET
BigQuery ingestion
- BigQuery accepts wildcards — you can load many CSVs without pre-merging:
  bq load --autodetect --source_format=CSV dataset.table gs://BUCKET/PATH/*.csv
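The same load can also be done from Python with the BigQuery client library. A minimal sketch, assuming google-cloud-bigquery is installed and credentials are configured; the bucket path and `dataset.table` name are placeholders:

```python
# Sketch: wildcard CSV load into BigQuery (placeholder names; requires
# google-cloud-bigquery and Application Default Credentials).
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # infer the schema, like the bq --autodetect flag
)
load_job = client.load_table_from_uri(
    "gs://BUCKET/PATH/*.csv",
    "dataset.table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```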
Troubleshooting
- Permission denied: check IAM roles (roles/storage.objectViewer or a custom role).
- Invalid credentials: re-run `gcloud auth application-default login` or refresh service account keys.
- CORS errors: ensure bucket CORS includes your domain and required methods/headers.
Security recommendations
- Use service accounts with least privilege.
- Prefer IAM + uniform bucket-level access over ACLs.
- Do not embed long-lived keys in client-side code; use signed URLs for browser access.
Further reading
- GCS docs: https://cloud.google.com/storage/docs
- Signed URLs: https://cloud.google.com/storage/docs/access-control/signed-urls
- gsutil reference: https://cloud.google.com/storage/docs/gsutil
````
````markdown name=docs/merge-data.md
# Combine All Data (Gabungkan Semua Data)
A short guide (translated from Bahasa Melayu) to combining files stored in Google Cloud Storage.
Before you start
- Make sure you have access to the bucket (roles/storage.objectViewer, or storage.objectAdmin for writes).
- For large datasets (GB/TB), consider Dataflow/Dataproc or loading directly into BigQuery.
Common options
1) Quick merge (small/medium files)
- If the data is small enough to fit in memory:
  gsutil cat gs://BUCKET/PATH/*.csv | gsutil cp - gs://BUCKET/PATH/combined.csv
- Risks: memory and network pressure. Use only for small sets.
2) gsutil compose (combine objects inside GCS without downloading)
- Combines up to 32 objects per operation:
  gsutil compose gs://BUCKET/part1.csv gs://BUCKET/part2.csv gs://BUCKET/combined.csv
- For more than 32 objects, compose in stages (tree compose).
- Note: compose concatenates raw bytes; make sure each object ends with a suitable newline, and avoid concatenating duplicate headers.
3) Load directly into BigQuery (recommended for analytics)
- BigQuery can read wildcard CSV paths:
  bq load --autodetect --source_format=CSV dataset.table gs://BUCKET/PATH/*.csv
- Advantages: scale, no pre-merging, schema handling.
4) Custom script (Python example)
- Advantage: full control (drop duplicate headers, normalize data).
- See `examples/merge_csv_gcs.py` for an example.
Example strategy for >32 objects using gsutil compose (pattern)
- Split the object list into groups of 32, compose each group into a temporary object, then compose the temporary results (repeat until a single final object remains).
- Or use a Dataflow pipeline for streaming and rewriting.
Best practices
- For output that will be analyzed: prefer writing to BigQuery or Parquet (columnar) for cost and performance.
- Enable logging and object versioning if the data is important.
- Use customer-managed encryption keys (CMEK) if required.
Requesting a tailored script
- If you would like one, provide:
  - The bucket name (example: gs://my-bucket/data/)
  - The prefix and file type (.csv)
  - The approximate size (GB/TB)
A suitable automated script (bash or Python) can then be generated for you.
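The staged (tree) compose pattern above can be sketched as a small planning function. This is illustrative only: the `_tmp/round-N-M` naming for intermediate objects is a hypothetical convention, and each planned step would then be executed with `gsutil compose` or the client library.

```python
# Sketch: plan staged (tree) compose rounds for more than 32 source
# objects. Intermediate names (_tmp/round-N-M) are a hypothetical convention.
def plan_compose_rounds(objects, batch=32):
    """Return a list of rounds; each round is a list of (sources, target) steps.

    After executing all rounds in order, the target of the last step in
    the last round holds the fully combined object. Requires batch >= 2.
    """
    rounds = []
    level = list(objects)
    depth = 0
    while len(level) > 1:
        steps = []
        next_level = []
        for i in range(0, len(level), batch):
            group = level[i:i + batch]
            if len(group) == 1:
                # A lone leftover object passes through to the next round.
                next_level.append(group[0])
                continue
            target = f"_tmp/round-{depth}-{i // batch}"
            steps.append((group, target))
            next_level.append(target)
        rounds.append(steps)
        level = next_level
        depth += 1
    return rounds

# Example: 70 parts need two rounds (32 + 32 + 6, then 3 intermediates).
parts = [f"part-{i:03d}.csv" for i in range(70)]
rounds = plan_compose_rounds(parts)
print(len(rounds), [len(r) for r in rounds])  # → 2 [3, 1]
```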
````
````markdown name=docs/signed-urls.md
# Signed URLs — Quick reference
Signed URLs grant temporary access to an object without requiring the requester to have a Google account.
Key points
- Signed URLs include authentication in query parameters.
- Anyone with the URL can access the object until expiry.
- Maximum expiry for V4 signed URLs is 7 days.
Create with gsutil (using service account key file)
gsutil signurl -d 1h /path/to/key.json gs://BUCKET/OBJECT
Create programmatically (Python example sketch)
- Use google-auth library to sign and generate a V4 URL, or use google-cloud-storage's blob.generate_signed_url method.
- Prefer service account credentials with least privilege.
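A minimal Python sketch of the `generate_signed_url` approach. It assumes the google-cloud-storage library is installed; the key path, bucket, and object names are placeholders:

```python
# Sketch: V4 signed URL via google-cloud-storage (requires a service
# account; key path, bucket, and object names are placeholders).
from datetime import timedelta

from google.cloud import storage

client = storage.Client.from_service_account_json("/path/to/key.json")
blob = client.bucket("BUCKET").blob("OBJECT")

url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(hours=1),  # maximum is 7 days for V4
    method="GET",
)
print(url)  # anyone with this URL can GET the object until it expires
```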
Security tips
- Monitor signed-URL usage where possible (Cloud Storage logs).
- Shorter expiry reduces risk.
- Do not embed long-lived private keys in public repositories.
When to use
- Temporary downloads for users without Google accounts.
- Browser uploads (PUT) when combined with appropriate CORS settings.
````
```python name=examples/merge_csv_gcs.py
#!/usr/bin/env python3
"""
examples/merge_csv_gcs.py
Merge CSV files in a GCS prefix into one CSV while keeping only the first header.
Requirements:
pip install google-cloud-storage
Usage:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
python3 examples/merge_csv_gcs.py my-bucket data/prefix/ output/combined.csv
"""
import sys
import csv
from io import StringIO
from google.cloud import storage
def merge_csvs(bucket_name, prefix, output_blob_name):
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Blob objects are not orderable; sort by object name for deterministic output.
    blobs = sorted(
        (b for b in bucket.list_blobs(prefix=prefix) if b.name.endswith('.csv')),
        key=lambda b: b.name,
    )
    if not blobs:
        print("No CSV files found with prefix:", prefix)
        return
    out_buf = StringIO()
    writer = csv.writer(out_buf)
    first = True
    for blob in blobs:
        print("Reading:", blob.name)
        # Use a streaming download if files are large: download_as_text()
        # loads the entire object into memory.
        data = blob.download_as_text()
        reader = csv.reader(StringIO(data))
        header = next(reader, None)
        if first and header:
            writer.writerow(header)
        first = False
        # The header of each subsequent file was consumed above, so only
        # data rows are copied here.
        for row in reader:
            writer.writerow(row)
    out_blob = bucket.blob(output_blob_name)
    out_blob.upload_from_string(out_buf.getvalue(), content_type='text/csv')
    print(f'Uploaded gs://{bucket_name}/{output_blob_name}')


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: merge_csv_gcs.py BUCKET PREFIX OUTPUT_BLOB")
        print("Example: merge_csv_gcs.py my-bucket data/ output/combined.csv")
        sys.exit(1)
    merge_csvs(sys.argv[1], sys.argv[2], sys.argv[3])
```
```json name=cors.json
[
{
"origin": ["https://example.com"],
"method": ["GET", "HEAD", "PUT", "POST"],
"responseHeader": ["Content-Type", "x-goog-meta-custom"],
"maxAgeSeconds": 3600
}
]
```
```text name=LICENSE
MIT License
Copyright (c) 2025
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...
(You can replace with your full MIT text or preferred license.)
```
Would you like me to:
- generate a shell helper to compose >32 objects safely (tree compose), or
- produce a Dataflow/Beam starter for large merges, or
- create a translated Malay full-site version?
Tell me which follow-up and I will produce the exact script or pipeline.