Data Preparation 2014

Data Preparation of 2014 SAGE Data

Student level data preparation

In the source code below, the data supplied by the Utah State Office of Education (USOE) is cleaned and processed using a two step process:

1a. In the first step the data is cleaned using the .sql script below wherein valid student records are selected based upon business rules specified in the sql code itself. The result is an exportable .csv file where each VALID_CASE is unique by CONTENT_AREA, YEAR and ID.

1b. In the second step the clean data from step 1a is read into R and modified slightly with regard to variable types/classes. The result is an Rdata file that is suitable for analysis with the SGP Package.

Step 1a: Cleaning data with SQL

------ Columns are in order as listed in sgpData_LONG documentation: 
------ http://cran.r-project.org/web/packages/SGPdata/SGPdata.pdf

-- FINALIZE DATA SET
SELECT sub2.*
  FROM (
-- CONSTRUCT TIE BREAKER TO RESOLVE DUPLICATES
SELECT sub1.*,
   ROW_NUMBER() OVER(
       PARTITION BY school_year, student_id, CONTENT_AREA
       ORDER BY scale_score DESC, CASE WHEN grade_level = grade_high THEN 1 ELSE 0 END DESC
       -- Prefer the higher score, if the scores are the same, prefer the test taken at the student's grade level
  ) AS row_nbr
  FROM (
-- CONSTRUCT RAW DATA SET
SELECT t.student_id AS ID,
       m.last_name AS LAST_NAME,
       m.first_name AS FIRST_NAME,
    -- Test Score
  CASE t.test_subject_id -- Course-based tests
       WHEN 313 THEN 'SEC_MATH_I' -- New for 2014
       WHEN 314 THEN 'SEC_MATH_II' -- New for 2014
       WHEN 315 THEN 'SEC_MATH_III' -- New for 2014
       WHEN 505 THEN 'EARTH_SCIENCE'
       WHEN 506 THEN 'BIOLOGY'
       WHEN 507 THEN 'CHEMISTRY'
       WHEN 508 THEN 'PHYSICS'
       ELSE (CASE -- Grade-based tests
       WHEN t.subject_area = 'L'
       AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'ELA'
       WHEN t.subject_area = 'M'
       AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'MATHEMATICS'
       WHEN t.subject_area = 'S'
       AND t.test_subject_id NOT IN ('313','314','315','505','506','507','508') THEN 'SCIENCE'
  END) END AS CONTENT_AREA,
       2014 AS YEAR,
  CASE WHEN CAST(t.test_subject_id AS VARCHAR) IN ('313','314','315','505','506','507','508') THEN 'EOCT'
       WHEN p.grade_low = p.grade_high THEN CAST(p.grade_low AS VARCHAR)
       ELSE CAST(e.grade_level AS VARCHAR)
   END AS GRADE,
       t.scale_score AS SCALE_SCORE,
       t.proficiency_level AS ACHIEVEMENT_LEVEL, -- Transform to factor data type in Utah_SGP_2014.R
    -- Demographics
       m.gender AS GENDER,
  CASE e.race_merged 
       WHEN 'A' THEN 'Asian'
       WHEN 'B' THEN 'Black'
       WHEN 'C' THEN 'White'
       WHEN 'H' THEN 'Hispanic'
       WHEN 'I' THEN 'American Indian'
       WHEN 'M' THEN 'More Than One Race'
       WHEN 'P' THEN 'Pacific Islander'
       WHEN 'U' THEN 'Unknown'
   END AS ETHNICITY,
  CASE WHEN e.low_income = 1 THEN 'Free Reduced Lunch: Yes' ELSE 'Free Reduced Lunch: No'
   END AS FRL_STATUS,
  CASE WHEN e.ell_status IN ('F','M','O','Y') THEN 'ELL: Yes' ELSE 'ELL: No'
   END AS ELL_STATUS,
  CASE WHEN e.special_ed = 'Y' THEN 'IEP: Yes' ELSE 'IEP: Yes'
   END AS IEP_STATUS,
       NULL AS GT_STATUS,
    -- School
       e.school_id AS SCHOOL_NUMBER,
       i2.institution_name AS SCHOOL_NAME, 
  CASE s.grade_level_summary
       WHEN 'ELEM' THEN 'E'
       WHEN 'MID'  THEN 'M'
       WHEN 'HIGH' THEN 'H'
       WHEN 'K12'  THEN 'H'
       ELSE NULL
   END AS EMH_LEVEL,
       e.district_id AS DISTRICT_NUMBER,
       i1.institution_name AS DISTRICT_NAME,
    -- Already limited to school FAY:
      'Enrolled School: Yes' AS SCHOOL_ENROLLMENT_STATUS,
      'Enrolled District: Yes' AS DISTRICT_ENROLLMENT_STATUS,
      'Enrolled State: Yes' AS STATE_ENROLLMENT_STATUS,
      'VALID_CASE' AS VALID_CASE, -- Only valid cases are loaded
    -- Additional Variables
       cts.test_subject, -- Combines CONTENT_AREA and GRADE information
    -- Primary Key (for matching SGP 2014 output to SAGE reportable test data)
       t.school_year,
       t.student_id,
       t.test_subject_id,
       e.grade_level,
       p.grade_high
  FROM sage.sage_reportable_tests AS t
  JOIN ct_test_subjects AS cts
    ON t.test_subject_id = cts.test_subject_id
  JOIN test_program p
    ON t.test_subject_id = p.test_subject_id AND p.school_year = 2014
       -- sage_reportable_tests attributes wrong test_prog_id to tests based only on subject area match
  JOIN student_enrollment AS e
    ON t.school_year = e.school_year
       AND t.district_id = e.district_id
       AND t.school_id = e.school_id
       AND t.school_number = e.school_number
       AND t.student_id = e.student_id
       AND t.school_entry_date = e.entry_date
  JOIN student_master AS m
    ON e.student_id = m.student_id
  JOIN v_be_institution AS i1
    ON e.district_id = i1.institution_id
  JOIN v_be_institution AS i2
    ON e.school_id = i2.institution_id
  JOIN v_be_school AS s
    ON t.district_id = s.district_id
       AND t.school_id = s.school_id
       AND t.school_number = s.school_nbr
 WHERE t.school_year = 2014
   AND t.test_participation_code IN (200,201,300) -- Standard participation, accommodated
   AND t.scale_score IS NOT NULL
   AND e.full_academic_year = 'Y' -- School level
  ) AS sub1
  ) AS sub2
 WHERE row_nbr = 1

Connect to data warehouse and extract input data:

#    Note: This input data should consist of only the new 2014 data to be added to the data extracted in 2013.
require(RODBC)
dbhandle <- odbcDriverConnect('driver={SQL Server}; server=acsdbstage; database=be_upass;')
Utah_Data_LONG_2014 <- sqlQuery(dbhandle, 'SELECT * FROM v_sgp_longfile')
Utah_Data_LONG_2014 <- data.frame(Utah_Data_LONG_2014) # This step may be redundant?
write.csv(Utah_Data_LONG_2014, file='SGP_Longfile_2.csv', row.names=FALSE)

Step 1b: Creating an R version of the cleaned data

After text data has been cleaned using the above SQL script, a comma delimited text file is exported and needs to be read into R in order to create an R based data set to be used for SGP calculations. The R script below reads the SGP_Longfile_2.csv data (located in the relative folder Data/Base_Files) into R and tidies up the data variable by variable to ensure that it matches previously established data naming conventions for subsequent analysis in the SGP package.

##########################################################
####
#### Code for preparation of Utah LONG data
####
##########################################################

library(data.table)

### Set working directory

### Read in USOE .csv file
Utah_Data_LONG_2014 <- fread("Data/Base_Files/SGP_Longfile_2.csv", 
							colClasses=rep("character", 27))

### Final cleaning needed for File Randy uploaded 11/09/14

# TRANSFORM PROFICIENCY LEVEL VARIABLES TO R FACTORS
Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL <- factor(
	Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL, levels=0:4, 
	labels=c(NA, "Level 1", "Level 2", "Level 3", "Level 4"), ordered=TRUE)

Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL_FULL<-Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL

levels(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL) <- c(NA, "BP", "BP", "P", "A")

table(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL, exclude=NULL)
table(Utah_Data_LONG_2014$ACHIEVEMENT_LEVEL_FULL, exclude=NULL)

# Make all values of GT_STATUS NA (character values 'NULL' in data provided)
Utah_Data_LONG_2014$GT_STATUS <- NA

###  Save data object.
save(Utah_Data_LONG_2014, file="Data/Utah_Data_LONG_2014.Rdata")


###  Clean up the Demographic variable labels in the 2013 SGP object:

##  ETHNICITY
#  Existing levels: "African American", "American Indian", "Asian", "Hispanic/Latino",
#  		 "Multiple Races", "Pacific Islander", "Unknown", "White"
#  Provided levels: "American Indian", "Asian", "Black", "Hispanic", 
#		"More Than One Race", "Pacific Islander", "White" (No Unknown)

levels(Utah_SGP@Data$ETHNICITY) <- c("Black", "American Indian", "Asian", 
	"Hispanic", "More Than One Race", "Pacific Islander", "Unknown", "White")

levels(Utah_SGP@Data$FRL_STATUS) <- c("Free Reduced Lunch: No", "Free Reduced Lunch: Yes")
levels(Utah_SGP@Data$IEP_STATUS) <- c("IEP: No", "IEP: Yes")
levels(Utah_SGP@Data$GT_STATUS) <- c("GT: No", "GT: Yes")


###  Save data object.
save(Utah_SGP, file="Data/Utah_SGP.Rdata")

Note that the naming conventions associated with Utah data provided by USOE perfectly conform with the naming conventions used by the SGP Package. Although the SGP Package accommodates different state data naming conventions through the use of a meta-data lookup table embedded within the SGPstateData Rdata object in the package, this is not necessary for Utah because the naming conventions have been established in the SQL code in data prep Step 1a above. The variable name lookup table for Utah is shown below with names.provided giving the state specific names, names.sgp the SGP Package specific names, names.type giving the type associated with the variable which is used within summarizeSGP to construct group summaries, names.info providing meta-data associated with the variable, and names.output a Boolean indicator of whether variable should be used with summarizeSGP.

names.provided	names.sgp	names.type	names.info	names.output
ID	ID	individual	Unique student identifier
LAST_NAME	LAST_NAME	label	Student last name
FIRST_NAME	FIRST_NAME	label	Student first name
CONTENT_AREA	CONTENT_AREA	content	Content area associated with record
YEAR	YEAR	time	Year associated with record
GRADE	GRADE	institution_level	Grade/Test level associated with record
SCALE_SCORE	SCALE_SCORE	measure	Scale score
VALID_CASE	VALID_CASE	individual_inclusion	Valid case indicator
TEST_NAME	TEST_NAME	misc	Test name
ACHIEVEMENT_LEVEL	ACHIEVEMENT_LEVEL	measure	Achievement level
SCHOOL_LEVEL	SCHOOL_LEVEL	institution_type	Elementary/Middle/High School level flag
SCHOOL_NAME	SCHOOL_NAME	label	School name
SCHOOL_NUMBER	SCHOOL_NUMBER	institution	School number
DISTRICT_NAME	DISTRICT_NAME	label	School name
DISTRICT_NUMBER	DISTRICT_NUMBER	institution	District number
STATE_ENROLLMENT_STATUS	STATE_ENROLLMENT_STATUS	institution_inclusion	State inclusion/accountability indicator
DISTRICT_ENROLLMENT_STATUS	DISTRICT_ENROLLMENT_STATUS	institution_inclusion	District inclusion/accountability indicator
SCHOOL_ENROLLMENT_STATUS	SCHOOL_ENROLLMENT_STATUS	institution_inclusion	School inclusion/accountability indicator
SUBGROUP	SUBGROUP	demographic	Student super subgroup status	TRUE
GENDER	GENDER	demographic	Student gender	TRUE
ETHNICITY	ETHNICITY	demographic	Student ethnicity	TRUE
FRL_STATUS	FRL_STATUS	demographic	Student FRL status	TRUE
IEP_STATUS	IEP_STATUS	demographic	Student IEP status	TRUE
ELL_STATUS	ELL_STATUS	demographic	Student ELL status	TRUE
GT_STATUS	GT_STATUS	demographic	Student Gifted Talented Status	TRUE
TEST_MEDIUM	TEST_MEDIUM	misc	Test medium
HIGH_NEED_STATUS	HIGH_NEED_STATUS	demographic	High need status flag	TRUE

Next Section: Data Analysis

SGP - Student Growth Percentiles SGP Blog | SGP GitHub Repo | SGP on CRAN

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Preparation 2014

Data Preparation of 2014 SAGE Data

Student level data preparation

Step 1a: Cleaning data with SQL

Step 1b: Creating an R version of the cleaned data

Next Section: Data Analysis

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally