Professional Python project: working with data files for analytics.
This project illustrates ETL data pipelines processing raw data with the following types:
- CSV (comma separated values)
- JSON (structured data commonly used to exchange information over the web)
- Text (excerpt from Romeo and Juliet)
- Excel file (using the openpyxl package added to pyproject.toml)
The working example illustrates a complete pipeline. Use the working example and your resources to create your own processing pipelines.
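As a starting point for your own pipeline, here is a minimal sketch of the extract, transform, load stages for a CSV file. It is not the project's working example: the function names (extract_scores, transform_scores, load_summary, run_demo), the score column, and the sample rows are hypothetical and exist only for illustration. It uses only the Python standard library and writes to a temporary folder.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def extract_scores(*, file_path: Path, column_name: str) -> list[float]:
    """E: read one numeric column from a CSV file, skipping unusable values."""
    scores: list[float] = []
    with file_path.open("r", newline="", encoding="utf-8") as file:
        for row in csv.DictReader(file):
            try:
                scores.append(float(row[column_name]))
            except (KeyError, TypeError, ValueError):
                continue  # missing column, short row, or non-numeric text
    return scores


def transform_scores(*, scores: list[float]) -> dict[str, float]:
    """T: reduce the raw values to a few summary statistics."""
    if not scores:
        raise ValueError("No numeric values found.")
    return {
        "count": len(scores),
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
    }


def load_summary(*, stats: dict[str, float], out_path: Path) -> None:
    """L: write the summary to a text file, creating the folder if needed."""
    out_path.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"{key}: {value:.2f}" for key, value in stats.items()]
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def run_demo() -> dict[str, float]:
    """Run all three stages on a tiny made-up CSV in a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = Path(tmp) / "raw.csv"
        # Made-up sample data; one row is deliberately not numeric.
        raw_path.write_text("name,score\nA,80\nB,not_a_number\nC,90\n", encoding="utf-8")
        scores = extract_scores(file_path=raw_path, column_name="score")
        stats = transform_scores(scores=scores)
        load_summary(stats=stats, out_path=Path(tmp) / "processed" / "summary.txt")
        return stats


if __name__ == "__main__":
    print(run_demo())
```

Keeping each stage in its own function makes it easier to swap the extract step for a JSON, text, or Excel reader while leaving the other stages alone.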
Think about some raw data you would like to process.
- What format is the data? Choose from csv, json, text, or xlsx.
- Choose static data (e.g., in files), rather than data in motion (e.g., social media streams)
- Being able to read and process a wide variety of data files is critical in professional analytics.
- Python is popular partly because it makes building data pipelines relatively easy.
We've turned off some PyRight type checks since we are working with raw data pipelines.
- WHY: We don't know what types things are until after we read them.
- See pyproject.toml and the [tool.pyright] section for details.
We use keyword-only function arguments when defining our ETL functions.
- In our functions, you'll see a *, in the parameter list.
- The asterisk can appear anywhere in the list of parameters.
- EVERY argument AFTER the asterisk must be passed using the named keyword argument (also called kwarg), rather than by position.
- WHY: Requiring named arguments prevents argument-order mistakes.
- It also makes our function calls self-documenting, which can be especially helpful in data-processing pipelines.
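The rule above can be sketched with a small, hypothetical function (summarize_column is not one of the project's ETL functions). Parameters before the asterisk may still be passed by position; parameters after it must be named, and Python raises a TypeError otherwise.

```python
def summarize_column(file_name: str, *, column_name: str, skip_header: bool = True) -> str:
    """Hypothetical helper: everything after the * is keyword-only."""
    return f"{file_name}: column={column_name}, skip_header={skip_header}"


# OK: file_name is positional; column_name is passed by name.
ok = summarize_column("data.csv", column_name="score")
print(ok)

# Not OK: passing column_name by position raises TypeError at call time.
try:
    summarize_column("data.csv", "score")
    positional_call_failed = False
except TypeError:
    positional_call_failed = True

print(positional_call_failed)
```

A type checker will usually flag the second call before the code runs, which is part of the benefit.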
This repo includes a 2.2 MB Excel data file. We have increased the size of the "large file" check in our pre-commit hooks.
There are three workflows for analytics projects.
- 01: Set Up Machine (Once Per Machine)
- 02: Set Up Project (Once Per Project)
- 03: Daily Workflow (Working With Python Project Code)
Follow the detailed instructions at: 01. Set Up Your Machine
🛑 All steps must be completed and verified successfully.
- Get Repository: Sign in to GitHub, open this repository in your browser, and click Copy this template to get a copy in YOURACCOUNT.
- Configure Repository Settings:
  - Select your repository Settings (the gear icon way on the right).
  - Go to Pages tab / Enable GitHub Pages / Build and deployment / set Source to GitHub Actions
  - Go to Advanced Security tab / Dependabot / Dependabot security updates / Enable
  - Go to Advanced Security tab / Dependabot / Grouped security updates / Enable
- Clone to local: Open a machine terminal in your Repos folder and clone your new repo.
git clone https://github.com/YOURACCOUNT/datafun-03-analytics
- Open project in VS Code: Change directory into the repo and open the project in VS Code by running code . ("code dot"):
cd datafun-03-analytics
code .
- Install recommended extensions.
  - When VS Code opens, accept the Extension Recommendations (click Install All or similar when asked).
- Set up a project Python environment (managed by uv) and align VS Code with it.
  - Use VS Code menu option Terminal / New Terminal to open a VS Code terminal in the root project folder.
  - Run the following commands, one at a time, hitting ENTER after each:
uv self update
uv python pin 3.14
uv sync --extra dev --extra docs --upgrade
If asked: "We noticed a new environment has been created. Do you want to select it for the workspace folder?" Click "Yes".
If successful, you'll see a new .venv folder appear in the root project folder.
Optional (recommended): install and run pre-commit checks:
uvx pre-commit install
git add -A
uvx pre-commit run --all-files
For more detailed instructions and troubleshooting, see the pro guide at: 02. Set Up Your Project
🛑 Do not continue until all REQUIRED steps are complete and verified.
Follow the detailed instructions at: 03. Daily Workflow
Commands are provided below to:
- Git pull
- Run and check the Python files
- Build and serve docs
- Save progress with Git add-commit-push
- Update project files
VS Code should have only this project (datafun-03-analytics) open.
Use VS Code menu option Terminal / New Terminal and run the following commands:
git pull
In the same VS Code terminal, run any Python source files:
uv run python src/datafun_03_analytics/app_case.py
uv run python src/datafun_03_analytics/app_yourname.py
OR: Run them as modules (preferred):
uv run python -m datafun_03_analytics.app_case
uv run python -m datafun_03_analytics.app_yourname
For more, see: Running Python Reliably.
If a command fails, verify:
- Only this project is open in VS Code.
- The terminal is open in the project root folder.
- The uv sync --extra dev --extra docs --upgrade command completed successfully.
Hint: if you run ls in the terminal, you should see files including pyproject.toml, README.md, and uv.lock.
Run Python checks and tests (as available):
uv run ruff format .
uv run ruff check . --fix
uv run pytest --cov=src --cov-report=term-missing
Build and serve docs (hit CTRL+c in the VS Code terminal to quit serving):
uv run mkdocs build --strict
uv run mkdocs serve
While editing project code and docs, repeat the commands above to run files, check them, and rebuild docs as needed.
Save progress frequently (some tools may make changes; you may need to re-run git add and commit to ensure everything gets committed before pushing):
git add -A
git commit -m "update"
git push -u origin main
Additional details and troubleshooting are available in the Pro-Analytics-02 Documentation.
Open mkdocs.yaml.
This file configures the associated project documentation website (powered by MkDocs).
Use CTRL+f to find each occurrence of the source GitHub account (e.g. denisecase).
Change each occurrence to point to your GitHub account instead (spacing and capitalization MUST match the URL of your GitHub account exactly.)
- Rename app_yourname.py to reflect your name or alias.
  - Find the file in the VS Code Explorer window (top icon on the left).
  - Right-click / Rename.
  - Follow conventions: name Python files in lower_snake_case, with words joined by underscores, and use the .py extension.
- Edit this README.md file to change the run command to call your file instead. Use CTRL+f to search for app_yourname.py and replace all occurrences exactly.
- Preview this README.md to make sure it still appears correctly.
  - Find README.md in the VS Code Explorer window (top icon on the left).
  - Right-click / Preview.
  - Fix any issues.
- Run the updated command to execute your Python script.
- Read the example code carefully before starting.
- Open your file. Search for "TODO" items. VS Code has icons down the left. Use either TODO Tree (tree, at the bottom) or Search (second from top).
- Complete each TODO carefully, one at a time.
- After implementing a TODO, paste your run command in the terminal and hit Enter to re-run it.
- When it runs without errors, delete the associated TODO comment.
- Keep working through each TODO.
- When you finish, there should be zero TODO occurrences in your project.
Save often: After making any useful progress, follow the steps to git add-commit-push.
- You do not need to add to or modify tests/. They are provided for example only.
- You do not need to view or modify any of the supporting config files.
- Many of the repo files are silent helpers. Explore as you like, but nothing is required.
- You do NOT need to understand everything. Understanding builds naturally over time.
- Use the UP ARROW and DOWN ARROW in the terminal to scroll through past commands.
- Use CTRL+f to find (and replace) within a file.
If you see something like this in your terminal: >>> or ...
You accidentally started Python interactive mode.
It happens.
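To leave it, type exit() and press Enter, or press Ctrl+Z then Enter on Windows (Ctrl+D on macOS/Linux). Ctrl+c only cancels the current input; it does not exit interactive mode.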
- Pro-Analytics-02 - guide to professional Python
- ANNOTATIONS.md - REQ/WHY/OBS annotations used
- INSTRUCTORS.md - guidance and notes for instructors and maintainers
- POLICIES.md - project rules and expectations that apply to all contributors
- SKILLS.md - skills, concepts, and professional practices (there are many)
- SE_MANIFEST.toml - project intent, scope, and role
- CITATION.cff - TODO: update author and repository fields to reflect your creative work