24 changes: 24 additions & 0 deletions 02_activities/assignments/DC_Cohort/Assignment2.md
@@ -47,8 +47,11 @@ There are several tools online you can use, I'd recommend [Draw.io](https://www.

**HINT:** You do not need to create any data for this prompt. This is a conceptual model only.

![ERD](./images/assignment_2-1.drawio.png)

#### Prompt 2
We want to create employee shifts, splitting up the day into morning and evening. Add this to the ERD.
![ERD](./images/assignment_2-2.drawio.png)

#### Prompt 3
The store wants to keep customer addresses. Propose two architectures for the CUSTOMER_ADDRESS table, one that will retain changes, and another that will overwrite. Which is type 1, which is type 2?
@@ -57,6 +60,11 @@ The store wants to keep customer addresses. Propose two architectures for the CU

```
Your answer...

The first architecture retains changes to customer addresses when customers change their address; this is Type 2. It would include: customer_address_id (a unique ID for each customer-address pairing), customer_id (the identifier for each customer), date_id (a reference to the date, to help distinguish recent from old addresses), and address (the actual address). Each address a customer has held would be a separate row, so customers would be duplicated in the table; the date and customer_address_id help identify the most recent location.

The second architecture overwrites the customer's address information each time it changes; this is Type 1. It would include: customer_id (a unique identifier, one-to-one with the customer table), date_id (the date of the last update), and address (the current known address). Each customer gets a single row in the table. The address could also be added to the customer table as a new "address" column, since each customer is unique there; every time the address changes, it would simply be updated in the customer table.

```
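The two architectures can be sketched as actual tables. A minimal runnable sketch in SQLite via Python's `sqlite3`; the table and column names and the sample addresses are assumptions for illustration only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Type 2: retain history, one row per customer/address pairing
cur.execute("""
CREATE TABLE customer_address_type2 (
    customer_address_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    date_id TEXT NOT NULL,   -- when this address became known
    address TEXT NOT NULL
)""")

# Type 1: overwrite in place, one row per customer
cur.execute("""
CREATE TABLE customer_address_type1 (
    customer_id INTEGER PRIMARY KEY,  -- one-to-one with the customer table
    date_id TEXT NOT NULL,            -- date of the last update
    address TEXT NOT NULL
)""")

# Customer 1 moves: Type 2 appends a row, Type 1 overwrites the existing row
cur.execute("INSERT INTO customer_address_type2 (customer_id, date_id, address) "
            "VALUES (1, '2022-01-01', '12 Old St')")
cur.execute("INSERT INTO customer_address_type2 (customer_id, date_id, address) "
            "VALUES (1, '2022-06-01', '34 New Ave')")

cur.execute("INSERT INTO customer_address_type1 VALUES (1, '2022-01-01', '12 Old St')")
cur.execute("""
INSERT INTO customer_address_type1 VALUES (1, '2022-06-01', '34 New Ave')
ON CONFLICT(customer_id) DO UPDATE SET date_id = excluded.date_id, address = excluded.address
""")

type2_rows = cur.execute(
    "SELECT COUNT(*) FROM customer_address_type2 WHERE customer_id = 1").fetchone()[0]
type1_rows = cur.execute(
    "SELECT address FROM customer_address_type1 WHERE customer_id = 1").fetchall()
print(type2_rows)  # Type 2 keeps both rows
print(type1_rows)  # Type 1 keeps only the latest address
```

The Type 2 table accumulates one row per address change, while the Type 1 upsert keeps exactly one row per customer.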

***
@@ -192,4 +200,20 @@ Consider, for example, concepts of labour, bias, LLM proliferation, moderating c

```
Your thoughts...

What stood out to me is that people were required to label the images/words, and the researchers always seemed to optimize for the most efficient (i.e. high return for low cost) option. This is a problem in that it can lead to a form of abuse: each party in the chain tries to extract more value than it pays, and that pressure gets pushed down to the bottom level (the government tries to extract value from the institution, the institution from the head scientist, the head scientist from the graduate student, and the scientists as a group from the data labellers).

If we consider the value of AI now and how much value has been extracted, the workers at the bottom doing the labelling get none of the return on their work. The scientists at least get professional recognition, but the workers get nothing beyond their initial payment.

I think this somewhat reflects the capitalistic culture and system the scientists were working within. The scientists are likely required to do as much research as possible with limited funds, which obviously affects how they perform the task. This is pervasive, as the workers are also operating in the same system and would like the highest return for the lowest time/effort.

The same attitude applies to how companies now view LLMs. LLMs are seen as able to produce a viable output with minimal cost and time. However, from my understanding, companies have not yet felt the true cost of these LLMs, since the LLM companies often offer a 'free' version and do not necessarily charge the full cost of use. This is a business strategy of giving a 'sample' or an introductory discount. Many of the LLM companies are not yet profitable and continually require more funding from investors until customers have replaced their workers with LLMs. Once the lowest-level workers are replaced, companies might see the true cost of LLM use, as they will have fewer options for workers.

Aside from the ethical issue of workers not getting value from their work, I would imagine the data collected would reflect the minimally correct answer. Since the workers who answered the most questions got the most payment, they would likely receive more questions over time and therefore have more influence over the data and how it is labelled. That might result in less robust labelling, where certain words or images were tagged not with the widest or most descriptive labels but with the most common one. Ethically, this might also narrow the labels to only the most common ones.

Using these workers also runs into the issue of representative sampling, which is common in psychology studies: the sample being used is not representative of the overall population. If we only used Princeton students from the 1960s as our initial labellers, their inherent biases would propagate through the future datasets built on that data. The data can be updated over time, but that is something researchers have to actively think about, and the initial LLMs did have biases that took time to remove.




```
163 changes: 151 additions & 12 deletions 02_activities/assignments/DC_Cohort/assignment2.sql
@@ -23,7 +23,12 @@ Edit the appropriate columns -- you're making two edits -- and the NULL rows wil
All the other rows will remain the same. */
--QUERY 1


SELECT
product_name || ', ' || -- keeps the product name the same
COALESCE(product_size, '') || -- replaces NULL in product_size with a blank ''
' (' || COALESCE(product_qty_type, 'unit') || ')' -- replaces NULL in product_qty_type with 'unit'
as Product_List -- changes name of list to Product_list
FROM product;


--END QUERY
@@ -42,7 +47,14 @@ Filter the visits to dates before April 29, 2022. */
--QUERY 2



SELECT * FROM (
SELECT
*,
DENSE_RANK() OVER(PARTITION BY customer_id ORDER BY market_date ASC)
AS customer_visits
FROM customer_purchases
WHERE market_date < '2022-04-29'
) AS x;

--END QUERY

@@ -53,8 +65,14 @@ only the customer’s most recent visit.
HINT: Do not use the previous visit dates filter. */
--QUERY 3



SELECT * FROM (
SELECT
*,
DENSE_RANK() OVER(PARTITION BY customer_id ORDER BY market_date DESC, transaction_time DESC)
AS customer_visits
FROM customer_purchases
) AS x
WHERE x.customer_visits = 1;

--END QUERY

@@ -66,7 +84,21 @@ You can make this a running count by including an ORDER BY within the PARTITION
Filter the visits to dates before April 29, 2022. */
--QUERY 4


SELECT
*
,COUNT(product_id) OVER(
PARTITION BY customer_id, product_id)
AS times_product_in_order
-- this shows the number of orders the product was included in, but not the actual amount purchased.
-- It does show the different prices the product_id was sold at
,SUM(quantity) OVER(
PARTITION BY customer_id, product_id)
AS total_quantity_of_product_purchased
-- this shows the actual amount purchased for each customer
-- it does not show whether the product had a different price each time, just the total quantity the customer purchased
FROM customer_purchases
WHERE market_date < '2022-04-29'
ORDER BY total_quantity_of_product_purchased DESC; -- shows us the customer who bought the most;


--END QUERY
@@ -85,7 +117,13 @@ Remove any trailing or leading whitespaces. Don't just use a case statement for
Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR will help split the column. */
--QUERY 5


SELECT product_name,
CASE
WHEN INSTR(product_name, '-') > 0 -- only performs the SUBSTR on names containing a hyphen
THEN RTRIM(LTRIM(SUBSTR(product_name, INSTR(product_name, '-') + 1))) -- extracts text after the first hyphen and trims whitespace
ELSE NULL -- fills in the rest as NULL
END AS description -- names the column description
FROM product;


--END QUERY
@@ -94,7 +132,15 @@ Hint: you might need to use INSTR(product_name,'-') to find the hyphens. INSTR w
/* 2. Filter the query to show any product_size value that contain a number with REGEXP. */
--QUERY 6

SELECT product_name,
CASE
WHEN INSTR(product_name, '-') > 0 -- only performs the SUBSTR on names containing a hyphen
THEN TRIM(SUBSTR(product_name, INSTR(product_name, '-') + 1)) -- extracts text after the first hyphen and trims whitespace
ELSE NULL -- fills in the rest as NULL
END AS description -- names the column description
FROM product
WHERE product_size REGEXP '[0-9]'; -- '[0-9]' is more portable than '\d' across REGEXP implementations


--END QUERY
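A note on REGEXP: SQLite parses the operator but ships no implementation, so the query above raises "no such function: REGEXP" unless one is registered by the client. A minimal sketch of registering it from Python's `sqlite3`; the sample rows are made up for illustration:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite rewrites `X REGEXP Y` as a call to regexp(Y, X),
# so the registered function takes (pattern, value).
conn.create_function(
    "regexp", 2,
    lambda pattern, value: value is not None and re.search(pattern, value) is not None,
)

cur = conn.cursor()
cur.execute("CREATE TABLE product (product_name TEXT, product_size TEXT)")
cur.executemany("INSERT INTO product VALUES (?, ?)",
                [("Habanero Peppers - Organic", "medium"),
                 ("Sweet Potatoes", "1 lb")])

rows = cur.execute(
    "SELECT product_name FROM product WHERE product_size REGEXP '[0-9]'").fetchall()
print(rows)  # only the row whose product_size contains a digit
```

GUI clients such as DB Browser for SQLite register their own REGEXP, which is why the query can work there but fail from a bare `sqlite3` connection.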
@@ -111,12 +157,56 @@ HINT: There are possibly a few ways to do this query, but if you're struggling
with a UNION binding them. */
--QUERY 7

--------- 1. Create first temp table with total sales and each day
--- remove table if it exists to start fresh
DROP TABLE IF EXISTS temp.new_customer_purchases;

--- create table
CREATE TABLE temp.new_customer_purchases AS

--- definition of table
SELECT
market_date,
SUM(quantity * cost_to_customer_per_qty) AS total_sales
FROM customer_purchases
GROUP BY market_date;

--------- 2. Query the table for the top and bottom days

DROP TABLE IF EXISTS temp.top_table;

--- create table
CREATE TABLE temp.top_table AS

--- query table for top
SELECT
market_date,
total_sales,
'best' AS day
FROM temp.new_customer_purchases
ORDER BY total_sales DESC
LIMIT 1;


--- query table for bottom
DROP TABLE IF EXISTS temp.bottom_table;

--- create table
CREATE TABLE temp.bottom_table AS
SELECT
market_date,
total_sales,
'worst' AS day
FROM temp.new_customer_purchases
ORDER BY total_sales ASC
LIMIT 1;


--- combine them together
SELECT *
FROM temp.top_table
UNION -- combines them
SELECT *
FROM temp.bottom_table;

--END QUERY

/* SECTION 3 */

@@ -132,21 +222,55 @@ How many customers are there (y).
Before your final group by you should have the product of those two queries (x*y). */
--QUERY 8


--- find total number of customers
DROP TABLE IF EXISTS temp.num_of_customers;

CREATE TABLE temp.num_of_customers AS
SELECT
count(customer_id) AS total_customer_count
FROM customer;
-- should return 26 customers

--- find total number of unique products per vendor
DROP TABLE IF EXISTS temp.unique_products_per_vendor;

CREATE TABLE temp.unique_products_per_vendor AS
SELECT DISTINCT
v.vendor_name,
p.product_name,
vi.original_price,
5 AS quantity_per_customer
FROM vendor_inventory AS vi
JOIN vendor AS v
ON vi.vendor_id = v.vendor_id
JOIN product AS p
ON vi.product_id = p.product_id
ORDER BY v.vendor_name, p.product_name; -- should return 3 vendors, with 8 items (3, 1, 4 distinct products each)

--- now cross join the customers
SELECT
up.*,
(up.original_price * up.quantity_per_customer * nc.total_customer_count) AS possible_revenue
FROM temp.unique_products_per_vendor AS up
CROSS JOIN temp.num_of_customers AS nc;


--END QUERY


-- INSERT
/*1. Create a new table "product_units".
This table will contain only products where the `product_qty_type = 'unit'`.
It should use all of the columns from the product table, as well as a new column for the `CURRENT_TIMESTAMP`.
Name the timestamp column `snapshot_timestamp`. */
--QUERY 9

DROP TABLE IF EXISTS product_units; -- drops table in case there is an error and I need to start again


CREATE TABLE product_units AS -- creates a table called product_units
SELECT *, -- selects all of the columns and adds the new column
CURRENT_TIMESTAMP AS snapshot_timestamp
FROM product -- uses product table as the base
WHERE product_qty_type = 'unit'; -- filters it to rows with product_qty_type of 'unit' (standard SQL uses '=' rather than '==')

--END QUERY

@@ -155,7 +279,8 @@ Name the timestamp column `snapshot_timestamp`. */
This can be any product you desire (e.g. add another record for Apple Pie). */
--QUERY 10


INSERT INTO product_units -- inserts into the product unit TABLE
VALUES(25,'Strawberry Rhubarb Pie','10"',3,'unit',CURRENT_TIMESTAMP); -- manually inputted values


--END QUERY
@@ -167,8 +292,11 @@ This can be any product you desire (e.g. add another record for Apple Pie). */
HINT: If you don't specify a WHERE clause, you are going to have a bad time.*/
--QUERY 11

-- SELECT * FROM product_units -- testing so that I can see what is going to be selected for deletion
DELETE FROM product_units
WHERE product_id = 25;


SELECT * FROM product_units; -- testing to see that it is gone

--END QUERY

@@ -191,8 +319,19 @@ Finally, make sure you have a WHERE statement to update the right row,
When you have all of these components, you can run the update statement. */
--QUERY 12

SELECT * FROM product_units; -- testing to see updates

ALTER TABLE product_units
ADD current_quantity INT;


UPDATE product_units
SET current_quantity = COALESCE(
	(SELECT vi.quantity -- last quantity (most recent)
	FROM vendor_inventory AS vi
	WHERE vi.product_id = product_units.product_id
	ORDER BY vi.market_date DESC
	LIMIT 1), 0); -- COALESCE wraps the subquery: with no matching inventory rows the subquery yields NULL, so a COALESCE inside it would never fire

--END QUERY
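The correlated-subquery UPDATE pattern used here can be isolated in a minimal runnable sketch (via Python's `sqlite3`, with made-up rows). It shows why COALESCE has to wrap the whole subquery: for a product with no inventory rows, the subquery itself yields NULL, so a COALESCE placed inside the subquery never fires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product_units (product_id INTEGER, current_quantity INTEGER)")
cur.execute("CREATE TABLE vendor_inventory (product_id INTEGER, market_date TEXT, quantity INTEGER)")

cur.executemany("INSERT INTO product_units (product_id) VALUES (?)", [(1,), (2,)])
cur.executemany("INSERT INTO vendor_inventory VALUES (?, ?, ?)",
                [(1, "2022-04-01", 10),
                 (1, "2022-04-08", 4)])  # product 2 has no inventory rows at all

# The subquery is re-evaluated per row of product_units; ORDER BY ... LIMIT 1
# picks the most recent quantity, and the outer COALESCE supplies 0 when the
# subquery finds no row and returns NULL.
cur.execute("""
UPDATE product_units
SET current_quantity = COALESCE(
    (SELECT vi.quantity
     FROM vendor_inventory AS vi
     WHERE vi.product_id = product_units.product_id
     ORDER BY vi.market_date DESC
     LIMIT 1), 0)
""")

quantities = dict(cur.execute(
    "SELECT product_id, current_quantity FROM product_units").fetchall())
print(quantities)  # latest quantity for product 1, fallback 0 for product 2
```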

Binary file modified 05_src/sql/farmersmarket.db