Skip to content

Data Preparation 2014

Adam VanIwaarden edited this page Oct 16, 2015 · 7 revisions

Data Preparation of 2014 SAGE Data

Student level data preparation

In the source code below, the data supplied by the Utah State Office of Education (USOE) is cleaned and processed using a two step process:

1a. In the first step the data is cleaned using the .sql script below wherein valid student records are selected based upon business rules specified in the sql code itself. The result is an exportable .csv file where each VALID_CASE is unique by CONTENT_AREA, YEAR and ID.

1b. In the second step the clean data from step 1a is read into R and modified slightly with regard to variable types/classes. The result is an Rdata file that is suitable for analysis with the SGP Package.

Step 1a: Cleaning data with SQL

------ Columns are in order as listed in sgpData_LONG documentation: 
------ http://cran.r-project.org/web/packages/SGPdata/SGPdata.pdf

-- FINALIZE DATA SET
SELECT sub2.*
  FROM (
-- CONSTRUCT TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER(
       PARTITION BY school_year, student_id, CONTENT_AREA
       ORDER BY scale_score DESC, CASE WHEN grade_level = grade_high THEN 1 ELSE 0 END DESC
       -- Prefer the higher score, if the scores are the same, prefer the test taken at the student's grade level
  ) AS row_nbr
  FROM (
-- CONSTRUCT RAW DATA SET
SELECT t.student_id AS ID,
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
    -- Test Score
  CASE t.test_subject_id -- Course-based tests
       WHEN 313 THEN 'SEC_MATH_I' -- New for 2014
       WHEN 314 THEN 'SEC_MATH_II' -- New for 2014
       WHEN 315 THEN 'SEC_MATH_III' -- New for 2014
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (CASE -- Grade-based tests
       WHEN t.subject_area = 'L'
       AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'ELA'
       WHEN t.subject_area = 'M'
       AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'MATHEMATICS'
       WHEN t.subject_area = 'S'
       AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'SCIENCE'
  END) END AS CONTENT_AREA,
       2014 AS YEAR,
  CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('313','314','315','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE CAST(e.grade_level AS VARCHAR)
   END AS GRADE,
       t.scale_score AS SCALE_SCORE,
       t.proficiency_level AS ACHIEVEMENT_LEVEL, -- Transform to factor data type in Utah_SGP_2014.R
    -- Demographics
       m.gender AS GENDER,
  CASE e.race_merged 
       WHEN 'A' THEN 'Asian'
       WHEN 'B' THEN 'Black'
       WHEN 'C' THEN 'White'
       WHEN 'H' THEN 'Hispanic'
       WHEN 'I' THEN 'American Indian'
       WHEN 'M' THEN 'More Than One Race'
       WHEN 'P' THEN 'Pacific Islander'
       WHEN 'U' THEN 'Unknown'
   END AS ETHNICITY,
  CASE WHEN e.low_income = 1 THEN 'Free Reduced Lunch: Yes' ELSE 'Free Reduced Lunch: No'
   END AS FRL_STATUS,
  CASE WHEN e.ell_status IN ('F','M','O','Y') THEN 'ELL: Yes' ELSE 'ELL: No'
   END AS ELL_STATUS,
  CASE WHEN e.special_ed = 'Y' THEN 'IEP: Yes' ELSE 'IEP: Yes'
   END AS IEP_STATUS,
       NULL AS GT_STATUS,
    -- School
       e.school_id AS SCHOOL_NUMBER,
       i2.institution_name AS SCHOOL_NAME, 
  CASE s.grade_level_summary
       WHEN 'ELEM' THEN 'E'
       WHEN 'MID'  THEN 'M'
       WHEN 'HIGH' THEN 'H'
       WHEN 'K12'  THEN 'H'
       ELSE NULL
   END AS EMH_LEVEL,
       e.district_id AS DISTRICT_NUMBER,
       i1.institution_name AS DISTRICT_NAME,
    -- Already limited to school FAY:
      'Enrolled School: Yes' AS SCHOOL_ENROLLMENT_STATUS,
      'Enrolled District: Yes' AS DISTRICT_ENROLLMENT_STATUS,
      'Enrolled State: Yes' AS STATE_ENROLLMENT_STATUS,
      'VALID_CASE' AS VALID_CASE, -- Only valid cases are loaded
    -- Additional Variables
       cts.test_subject, -- Combines CONTENT_AREA and GRADE information
    -- Primary Key (for matching SGP 2014 output to SAGE reportable test data)
       t.school_year,
       t.student_id,
       t.test_subject_id,
       e.grade_level,
       p.grade_high
  FROM sage.sage_reportable_tests AS t
  JOIN ct_test_subjects AS cts
    ON t.test_subject_id = cts.test_subject_id
  JOIN test_program p
    ON t.test_subject_id = p.test_subject_id AND p.school_year = 2014
       -- sage_reportable_tests attributes wrong test_prog_id to tests based only on subject area match
  JOIN student_enrollment AS e
    ON t.school_year = e.school_year
       AND t.district_id = e.district_id
       AND t.school_id = e.school_id
       AND t.school_number = e.school_number
       AND t.student_id = e.student_id
       AND t.school_entry_date = e.entry_date
  JOIN student_master AS m
    ON e.student_id = m.student_id
  JOIN v_be_institution AS i1
    ON e.district_id = i1.institution_id
  JOIN v_be_institution AS i2
    ON e.school_id = i2.institution_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
 WHERE t.school_year = 2014
   AND t.test_participation_code IN (200,201,300) -- Standard participation, accommodated
   AND t.scale_score IS NOT NULL
   AND e.full_academic_year = 'Y' -- School level
  ) AS sub1
  ) AS sub2
 WHERE row_nbr = 1

Connect to data warehouse and extract input data:

#    Note: This input data should consist of only the new 2014 data to be added to the data extracted in 2013.
require(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server}; server=acsdbstage; database=be_upass;')
Utah_Data_LONG_2014 <- sqlQuery(dbhandle, 'SELECT * FROM v_sgp_longfile')
Utah_Data_LONG_2014 <- data.frame(Utah_Data_LONG_2014) # This step may be redundant?
write.csv(Utah_Data_LONG_2014, file='SGP_Longfile_2.csv', row.names=FALSE)

Step 1b: Creating an R version of the cleaned data

After text data has been cleaned using the above SQL script, a comma delimited text file is exported and needs to be read into R in order to create an R based data set to be used for SGP calculations. The R script below reads the SGP_Longfile_2.csv data (located in the relative folder Data/Base_Files) into R and tidies up the data variable by variable to ensure that it matches previously established data naming conventions for subsequent analysis in the SGP package.

##########################################################
####
#### Code for preparation of Utah LONG data
####
##########################################################

library(data.table)

### Set working directory

### Read in USOE .csv file
Utah_Data_LONG_2014 <- fread("Data/Base_Files/SGP_Longfile_2.csv", 
							colClasses=rep("character", 27))

### Final cleaning needed for File Randy uploaded 11/09/14

# TRANSFORM PROFICIENCY LEVEL VARIABLES TO R FACTORS
Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL <- factor(
	Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL, levels=0:4, 
	labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)

Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL_FULL<-Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL

levels(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")

table(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL, exclude=NULL)
table(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL_FULL, exclude=NULL)

# Make all values of GT_STATUS NA (character values 'NULL' in data provided)
Utah_Data_LONG_2014$GT_STATUS <- NA

###  Save data object.
save(Utah_Data_LONG_2014, file="Data/Utah_Data_LONG_2014.Rdata")


###  Clean up the Demographic variable labels in the 2013 SGP object:

##  ETHNICITY
#  Existing levels: "African American", "American Indian", "Asian", "Hispanic/Latino",
#  		 "Multiple Races", "Pacific Islander", "Unknown", "White"
#  Provided levels: "American Indian", "Asian", "Black", "Hispanic", 
#		"More Than One Race", "Pacific Islander", "White" (No Unknown)

levels(Utah_SGP@Data$ETHNICITY) <- c("Black", "American Indian", "Asian", 
	"Hispanic", "More Than One Race", "Pacific Islander", "Unknown", "White")

levels(Utah_SGP@Data$FRL_STATUS) <- c("Free Reduced Lunch: No", "Free Reduced Lunch: Yes")
levels(Utah_SGP@Data$IEP_STATUS) <- c("IEP: No", "IEP: Yes")
levels(Utah_SGP@Data$GT_STATUS) <- c("GT: No", "GT: Yes")


###  Save data object.
save(Utah_SGP, file="Data/Utah_SGP.Rdata")

Note that the naming conventions associated with Utah data provided by USOE perfectly conform with the naming conventions used by the SGP Package. Although the SGP Package accommodates different state data naming conventions through the use of a meta-data lookup table embedded within the SGPstateData Rdata object in the package, this is not necessary for Utah because the naming conventions have been established in the SQL code in data prep Step 1a above. The variable name lookup table for Utah is shown below with names.provided giving the state specific names, names.sgp the SGP Package specific names, names.type giving the type associated with the variable which is used within summarizeSGP to construct group summaries, names.info providing meta-data associated with the variable, and names.output a Boolean indicator of whether variable should be used with summarizeSGP.

names.provided names.sgp names.type names.info names.output
ID ID individual Unique student identifier
LAST_NAME LAST_NAME label Student last name
FIRST_NAME FIRST_NAME label Student first name
CONTENT_AREA CONTENT_AREA content Content area associated with record
YEAR YEAR time Year associated with record
GRADE GRADE institution_level Grade/Test level associated with record
SCALE_SCORE SCALE_SCORE measure Scale score
VALID_CASE VALID_CASE individual_inclusion Valid case indicator
TEST_NAME TEST_NAME misc Test name
ACHIEVEMENT_LEVEL ACHIEVEMENT_LEVEL measure Achievement level
SCHOOL_LEVEL SCHOOL_LEVEL institution_type Elementary/Middle/High School level flag
SCHOOL_NAME SCHOOL_NAME label School name
SCHOOL_NUMBER SCHOOL_NUMBER institution School number
DISTRICT_NAME DISTRICT_NAME label School name
DISTRICT_NUMBER DISTRICT_NUMBER institution District number
STATE_ENROLLMENT_STATUS STATE_ENROLLMENT_STATUS institution_inclusion State inclusion/accountability indicator
DISTRICT_ENROLLMENT_STATUS DISTRICT_ENROLLMENT_STATUS institution_inclusion District inclusion/accountability indicator
SCHOOL_ENROLLMENT_STATUS SCHOOL_ENROLLMENT_STATUS institution_inclusion School inclusion/accountability indicator
SUBGROUP SUBGROUP demographic Student super subgroup status TRUE
GENDER GENDER demographic Student gender TRUE
ETHNICITY ETHNICITY demographic Student ethnicity TRUE
FRL_STATUS FRL_STATUS demographic Student FRL status TRUE
IEP_STATUS IEP_STATUS demographic Student IEP status TRUE
ELL_STATUS ELL_STATUS demographic Student ELL status TRUE
GT_STATUS GT_STATUS demographic Student Gifted Talented Status TRUE
TEST_MEDIUM TEST_MEDIUM misc Test medium
HIGH_NEED_STATUS HIGH_NEED_STATUS demographic High need status flag TRUE

Clone this wiki locally