-
Notifications
You must be signed in to change notification settings - Fork 0
Data Preparation 2014
In the source code below, the data supplied by the Utah State Office of Education (USOE) is cleaned and processed using a two step process:
1a. In the first step the data is cleaned using the .sql script below wherein valid student records are selected based upon business rules specified in the sql code itself. The result is an exportable .csv file where each VALID_CASE is unique by CONTENT_AREA, YEAR and ID.
1b. In the second step the clean data from step 1a is read into R and modified slightly with regard to variable types/classes. The result is an Rdata file that is suitable for analysis with the SGP Package.
------ Columns are in order as listed in sgpData_LONG documentation:
------ http://cran.r-project.org/web/packages/SGPdata/SGPdata.pdf
-- FINALIZE DATA SET
SELECT sub2.*
FROM (
-- CONSTRUCT TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
ROW_NUMBER() OVER(
PARTITION BY school_year, student_id, CONTENT_AREA
ORDER BY scale_score DESC, CASE WHEN grade_level = grade_high THEN 1 ELSE 0 END DESC
-- Prefer the higher score, if the scores are the same, prefer the test taken at the student's grade level
) AS row_nbr
FROM (
-- CONSTRUCT RAW DATA SET
SELECT t.student_id AS ID,
m.last_name AS LAST_NAME,
m.first_name AS FIRST_NAME,
-- Test Score
CASE t.test_subject_id -- Course-based tests
WHEN 313 THEN 'SEC_MATH_I' -- New for 2014
WHEN 314 THEN 'SEC_MATH_II' -- New for 2014
WHEN 315 THEN 'SEC_MATH_III' -- New for 2014
WHEN 505 THEN 'EARTH_SCIENCE'
WHEN 506 THEN 'BIOLOGY'
WHEN 507 THEN 'CHEMISTRY'
WHEN 508 THEN 'PHYSICS'
ELSE (CASE -- Grade-based tests
WHEN t.subject_area = 'L'
AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'ELA'
WHEN t.subject_area = 'M'
AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'MATHEMATICS'
WHEN t.subject_area = 'S'
AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'SCIENCE'
END) END AS CONTENT_AREA,
2014 AS YEAR,
CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('313','314','315','505','506','507','508') THEN 'EOCT'
WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
ELSE CAST(e.grade_level AS VARCHAR)
END AS GRADE,
t.scale_score AS SCALE_SCORE,
t.proficiency_level AS ACHIEVEMENT_LEVEL, -- Transform to factor data type in Utah_SGP_2014.R
-- Demographics
m.gender AS GENDER,
CASE e.race_merged
WHEN 'A' THEN 'Asian'
WHEN 'B' THEN 'Black'
WHEN 'C' THEN 'White'
WHEN 'H' THEN 'Hispanic'
WHEN 'I' THEN 'American Indian'
WHEN 'M' THEN 'More Than One Race'
WHEN 'P' THEN 'Pacific Islander'
WHEN 'U' THEN 'Unknown'
END AS ETHNICITY,
CASE WHEN e.low_income = 1 THEN 'Free Reduced Lunch: Yes' ELSE 'Free Reduced Lunch: No'
END AS FRL_STATUS,
CASE WHEN e.ell_status IN ('F','M','O','Y') THEN 'ELL: Yes' ELSE 'ELL: No'
END AS ELL_STATUS,
CASE WHEN e.special_ed = 'Y' THEN 'IEP: Yes' ELSE 'IEP: Yes'
END AS IEP_STATUS,
NULL AS GT_STATUS,
-- School
e.school_id AS SCHOOL_NUMBER,
i2.institution_name AS SCHOOL_NAME,
CASE s.grade_level_summary
WHEN 'ELEM' THEN 'E'
WHEN 'MID' THEN 'M'
WHEN 'HIGH' THEN 'H'
WHEN 'K12' THEN 'H'
ELSE NULL
END AS EMH_LEVEL,
e.district_id AS DISTRICT_NUMBER,
i1.institution_name AS DISTRICT_NAME,
-- Already limited to school FAY:
'Enrolled School: Yes' AS SCHOOL_ENROLLMENT_STATUS,
'Enrolled District: Yes' AS DISTRICT_ENROLLMENT_STATUS,
'Enrolled State: Yes' AS STATE_ENROLLMENT_STATUS,
'VALID_CASE' AS VALID_CASE, -- Only valid cases are loaded
-- Additional Variables
cts.test_subject, -- Combines CONTENT_AREA and GRADE information
-- Primary Key (for matching SGP 2014 output to SAGE reportable test data)
t.school_year,
t.student_id,
t.test_subject_id,
e.grade_level,
p.grade_high
FROM sage.sage_reportable_tests AS t
JOIN ct_test_subjects AS cts
ON t.test_subject_id = cts.test_subject_id
JOIN test_program p
ON t.test_subject_id = p.test_subject_id AND p.school_year = 2014
-- sage_reportable_tests attributes wrong test_prog_id to tests based only on subject area match
JOIN student_enrollment AS e
ON t.school_year = e.school_year
AND t.district_id = e.district_id
AND t.school_id = e.school_id
AND t.school_number = e.school_number
AND t.student_id = e.student_id
AND t.school_entry_date = e.entry_date
JOIN student_master AS m
ON e.student_id = m.student_id
JOIN v_be_institution AS i1
ON e.district_id = i1.institution_id
JOIN v_be_institution AS i2
ON e.school_id = i2.institution_id
JOIN v_be_school AS s
ON t.district_id = s.district_id
AND t.school_id = s.school_id
AND t.school_number = s.school_nbr
WHERE t.school_year = 2014
AND t.test_participation_code IN (200,201,300) -- Standard participation, accommodated
AND t.scale_score IS NOT NULL
AND e.full_academic_year = 'Y' -- School level
) AS sub1
) AS sub2
WHERE row_nbr = 1
Connect to data warehouse and extract input data:
# Note: This input data should consist of only the new 2014 data to be added to the data extracted in 2013.
require(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server}; server=acsdbstage; database=be_upass;')
Utah_Data_LONG_2014 <- sqlQuery(dbhandle, 'SELECT * FROM v_sgp_longfile')
Utah_Data_LONG_2014 <- data.frame(Utah_Data_LONG_2014) # This step may be redundant?
write.csv(Utah_Data_LONG_2014, file='SGP_Longfile_2.csv', row.names=FALSE)
After text data has been cleaned using the above SQL script, a comma delimited text file is exported and needs to be read into
R in order to create an R based data set to be used for SGP calculations. The R script below
reads the SGP_Longfile_2.csv data (located in the relative folder Data/Base_Files) into R and tidies up the data variable by
variable to ensure that it matches previously established data naming conventions for subsequent analysis in the SGP package.
##########################################################
####
#### Code for preparation of Utah LONG data
####
##########################################################
library(data.table)
### Set working directory
### Read in USOE .csv file
Utah_Data_LONG_2014 <- fread("Data/Base_Files/SGP_Longfile_2.csv",
colClasses=rep("character", 27))
### Final cleaning needed for File Randy uploaded 11/09/14
# TRANSFORM PROFICIENCY LEVEL VARIABLES TO R FACTORS
Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL <- factor(
Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL, levels=0:4,
labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)
Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL_FULL<-Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL
levels(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")
table(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL, exclude=NULL)
table(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL_FULL, exclude=NULL)
# Make all values of GT_STATUS NA (character values 'NULL' in data provided)
Utah_Data_LONG_2014$GT_STATUS <- NA
### Save data object.
save(Utah_Data_LONG_2014, file="Data/Utah_Data_LONG_2014.Rdata")
### Clean up the Demographic variable labels in the 2013 SGP object:
## ETHNICITY
# Existing levels: "African American", "American Indian", "Asian", "Hispanic/Latino",
# "Multiple Races", "Pacific Islander", "Unknown", "White"
# Provided levels: "American Indian", "Asian", "Black", "Hispanic",
# "More Than One Race", "Pacific Islander", "White" (No Unknown)
levels(Utah_SGP@Data$ETHNICITY) <- c("Black", "American Indian", "Asian",
"Hispanic", "More Than One Race", "Pacific Islander", "Unknown", "White")
levels(Utah_SGP@Data$FRL_STATUS) <- c("Free Reduced Lunch: No", "Free Reduced Lunch: Yes")
levels(Utah_SGP@Data$IEP_STATUS) <- c("IEP: No", "IEP: Yes")
levels(Utah_SGP@Data$GT_STATUS) <- c("GT: No", "GT: Yes")
### Save data object.
save(Utah_SGP, file="Data/Utah_SGP.Rdata")Note that the naming conventions associated with Utah data provided by USOE perfectly conform with the naming conventions used by the SGP Package.
Although the SGP Package accommodates different state data naming conventions through the use of a meta-data lookup table embedded
within the SGPstateData Rdata object in the package, this is not necessary for Utah because the
naming conventions have been established in the SQL code in data prep Step 1a above. The variable name lookup table
for Utah is shown below with names.provided giving the state specific names, names.sgp the SGP Package specific names, names.type giving the
type associated with the variable which is used within summarizeSGP to
construct group summaries, names.info providing meta-data associated with the variable, and names.output a Boolean indicator of whether variable
should be used with summarizeSGP.
| names.provided | names.sgp | names.type | names.info | names.output |
|---|---|---|---|---|
| ID | ID | individual | Unique student identifier | |
| LAST_NAME | LAST_NAME | label | Student last name | |
| FIRST_NAME | FIRST_NAME | label | Student first name | |
| CONTENT_AREA | CONTENT_AREA | content | Content area associated with record | |
| YEAR | YEAR | time | Year associated with record | |
| GRADE | GRADE | institution_level | Grade/Test level associated with record | |
| SCALE_SCORE | SCALE_SCORE | measure | Scale score | |
| VALID_CASE | VALID_CASE | individual_inclusion | Valid case indicator | |
| TEST_NAME | TEST_NAME | misc | Test name | |
| ACHIEVEMENT_LEVEL | ACHIEVEMENT_LEVEL | measure | Achievement level | |
| SCHOOL_LEVEL | SCHOOL_LEVEL | institution_type | Elementary/Middle/High School level flag | |
| SCHOOL_NAME | SCHOOL_NAME | label | School name | |
| SCHOOL_NUMBER | SCHOOL_NUMBER | institution | School number | |
| DISTRICT_NAME | DISTRICT_NAME | label | School name | |
| DISTRICT_NUMBER | DISTRICT_NUMBER | institution | District number | |
| STATE_ENROLLMENT_STATUS | STATE_ENROLLMENT_STATUS | institution_inclusion | State inclusion/accountability indicator | |
| DISTRICT_ENROLLMENT_STATUS | DISTRICT_ENROLLMENT_STATUS | institution_inclusion | District inclusion/accountability indicator | |
| SCHOOL_ENROLLMENT_STATUS | SCHOOL_ENROLLMENT_STATUS | institution_inclusion | School inclusion/accountability indicator | |
| SUBGROUP | SUBGROUP | demographic | Student super subgroup status | TRUE |
| GENDER | GENDER | demographic | Student gender | TRUE |
| ETHNICITY | ETHNICITY | demographic | Student ethnicity | TRUE |
| FRL_STATUS | FRL_STATUS | demographic | Student FRL status | TRUE |
| IEP_STATUS | IEP_STATUS | demographic | Student IEP status | TRUE |
| ELL_STATUS | ELL_STATUS | demographic | Student ELL status | TRUE |
| GT_STATUS | GT_STATUS | demographic | Student Gifted Talented Status | TRUE |
| TEST_MEDIUM | TEST_MEDIUM | misc | Test medium | |
| HIGH_NEED_STATUS | HIGH_NEED_STATUS | demographic | High need status flag | TRUE |
SGP - Student Growth Percentiles SGP Blog | SGP GitHub Repo | SGP on CRAN