Cultivating Character through Formal Assessment

Chapter 9 – Guide: Program Evaluation

Karen K. Melton

Program evaluation can be a beneficial tool for organizations to make informed decisions about their programs. While the term program evaluation may seem overwhelming, it’s essentially a straightforward process of collecting, analyzing, and interpreting data to demonstrate a program’s effectiveness and efficiency. This process is designed to inform decisions about future programming, determine the worth of a program or its components, and assess its merit.

There are various types of evaluations, each providing unique data or information for organizations. Which type you use should reflect the information you seek, the resources you can invest (e.g., time, personnel), and where the program is in its lifecycle. Among these types, the retrospective post-then-pre method [1] stands out for its effectiveness in small-scale program evaluation. This approach, which collects data only at the end of the program, is particularly beneficial because it helps reduce response-shift bias. Response shift refers to a change in the respondent’s frame of reference or evaluation standard during an intervention, and the retrospective pre-test method is designed to mitigate this shift.[2]

By using the method outlined in this guide, organizations can obtain a numerical score that provides insight into how effective their programs are at cultivating a specific virtue. This data can be further enriched by combining the retrospective pre-test method with a focus group interview, which provides a deeper understanding of the numerical data. The guide’s systematic approach and specific steps for collecting and analyzing data empower organizations to make informed decisions about future programming, enhance program effectiveness and efficiency, and demonstrate the value of their program. Essentially, this guide is a practical tool to help organizations understand whether they are achieving their intended outcomes.

 

Step-by-Step Guide

This guide is intended to be used by organizations that have intentionally designed programs to cultivate virtues. It provides instructions on using a post-then-pre survey to conduct a program evaluation that addresses the following question:

Is {PROGRAM} effective at increasing {VIRTUE}?

The guide provides five steps: (1) Establish a program description, (2) Select the virtue measures and method, (3) Facilitate data collection, (4) Analyze and interpret data, and (5) Disseminate results via a report.

Note: This guide will be of limited value to you if you are seeking to address other evaluation questions or are interested in additional outcomes. At the end of the chapter, we provide additional resources on program evaluation.

 

Step 1. Establish a Program Description

When embarking on a program evaluation, it is crucial to establish a program description. The program description should outline the program’s purpose, including its goals and objectives, and provide a detailed understanding of how the program is intended to function. It should also reflect the resources the program requires and adequately describe its expected benefits or outcomes. While there are various approaches to crafting a program description, such as writing a program narrative (i.e., written text), the most effective format is a logic model. A logic model provides a condensed, one-page overview of the program’s resources, activities, deliverables, and impact, offering a concise yet comprehensive representation of the program’s design. Such a model gives evaluators a clear roadmap of the program’s intended operations, which helps them better identify areas of success and areas needing improvement.
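As a minimal, hypothetical illustration of the four columns, a small mentoring-style program’s logic model might read: Resources: staff time, meeting space, curriculum materials; Activities: weekly small-group sessions; Deliverables: number of sessions delivered and students served; Impact: growth in the targeted virtue. Your own logic model will, of course, reflect your program’s specific design.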

Additional Resource: At the end of the chapter, we provide additional resources with further instruction on program descriptions, the program lifecycle, and logic models.

 

Step 2. Select the Virtue Measure(s) & Methods

Virtue Measures. The program description should help you quickly identify the virtues your program intends to impact. Chapter 11 – Tool: Virtue Scales provides psychological scales (i.e., measures) that have been reformatted using the retrospective pre-test approach for the virtues listed below. Select the virtue(s) that align with your program goals (i.e., the virtues you expect the program to impact).

Additional Measures: In this chapter, we focus on evaluating specific virtues listed in the box above. However, you can also use this method with the Character Index. You may also choose to find another validated scale for other virtues.

 

Method to Administer Survey. There are two options for administering the survey: paper-and-pencil or online. Both have pros and cons, and group size and setting are two critical factors to consider when deciding. Paper-and-pencil requires more administrative work at the end (data entry), whereas online requires more administration at the beginning (creating the survey online). You are more likely to collect data from everyone in the program if you administer a paper-and-pencil survey during the program. We have created worksheets and instructions for the paper-and-pencil method; you can easily modify these instructions if you choose to use an online platform.

 

Step 3. Facilitate Data Collection

At the designated time and place, administer the surveys to students. It is recommended that you collect data on the last day of the program. The evaluation can be built into a facilitated reflection on the program (e.g., its purpose, content covered, and activities). It is vital to share with students that you are collecting data to determine the program’s effectiveness and that data will be collected anonymously so that they can be honest in their responses. Read the instructions aloud, then provide time to complete the worksheet. Time will vary based on the length of the worksheet, but on average each survey will take about 5-10 minutes to complete. A facilitated reflection and evaluation could take 30-60 minutes.

Additional Resource: At the end of the chapter, we provide an example facilitator script for administering the survey in an in-person setting.

 

Demographics. To ensure quality data collection, it is recommended that you do not collect demographic information at the same time as the program evaluation survey. For program evaluations, students should not provide information that would give away their identity; we commonly think of this as name, birth date, or student ID. However, in program evaluations of small groups, reporting race, biological sex, or other demographic variables may provide enough detail to identify a specific student. Collecting demographic data will likely leave students feeling that they cannot provide honest feedback, thereby reducing the quality of the data.

Demographic information can be collected on a different day or separately to describe the program participants. We provide an example demographic survey here: Demographic Survey.

 

Step 4. Data Analysis & Interpretation

In this step, the guide provides simple instructions, accompanied by videos, to assist you in entering your data, completing simple analyses (means, change scores, and effect sizes), and interpreting the results into meaningful information you can share with others.

Scoring. The surveys provided in Chapter 11 – Tool: Virtue Scales have scoring instructions on the second page. Each survey has different scoring instructions, so read them carefully. For scoring the surveys, consider these options:

    • If you use an online data collection system, you can download your data and use functions to calculate the scores.
    • If you use paper-and-pencil, you (or a staff member/student worker) can score the surveys yourself. If you choose this option, you may want to print only page 1 for data collection with students.
    • If you use paper-and-pencil, you can also have students score their own surveys before turning them in. In this case, be sure to print page 2 of the worksheet and ask students to follow the instructions. It is best to have someone double-check the students’ work.

Reverse Coding. Some scales include items that require reverse coding. Reverse coding is used in surveys to reduce response bias; it involves changing the wording or scoring of some items so that they are opposite or negative relative to the others. For example, a positive statement like “I am a patient person” can be changed to “I am not a patient person.” If an item is reverse coded, its scoring is reversed, meaning a 1 becomes a 5 and a 5 becomes a 1 on a 1-5 scale. Tables 9.1 and 9.2 below illustrate reverse coding for 1-5 and 1-7 scales, and a short scoring sketch follows the tables. These tables have also been embedded into the worksheets to assist in scoring.

Table 9.1. Reverse Coding for a 1 to 5 Scale

Original Score 1 2 3 4 5
New Score 5 4 3 2 1

Table 9.2. Reverse Coding for a 1 to 7 Scale

Original Score 1 2 3 4 5 6 7
New Score 7 6 5 4 3 2 1
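If you score responses in a spreadsheet or a short script rather than by hand, reverse coding reduces to a single formula: subtract the original score from one more than the scale maximum. Below is a minimal sketch in Python; the reverse_code helper name and the item values are ours for illustration.

```python
# Reverse code an item: new_score = (scale_max + 1) - original_score
def reverse_code(score, scale_max=5):
    """Return the reverse-coded value for an item rated from 1 to scale_max."""
    return (scale_max + 1) - score

# Hypothetical responses to a reverse-coded item on a 1-5 scale
responses = [1, 2, 3, 4, 5]
print([reverse_code(r) for r in responses])  # -> [5, 4, 3, 2, 1]

# The same formula works for a 1-7 scale
print(reverse_code(2, scale_max=7))  # -> 6
```

In Google Sheets, the same logic for a 1-5 scale is a formula such as =6-A2, where A2 holds the original score.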

 

Data entry. Data should be entered into Excel or Google Sheets, as these programs allow you to conduct simple data analyses. Additionally, storing data in this manner enables you to keep track of data year after year and make comparisons over time. An example layout is sketched below.

{Video: How to set up and enter your data in Google Sheets}
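As one possible layout (the scores below are hypothetical), use one row per student and one column each for the pre-test and post-test scale scores; the same hypothetical scores are reused in the worked sketches that follow.

Student   Pre-Test Score   Post-Test Score
1         3.4              4.2
2         2.8              3.6
3         3.0              3.8
4         3.6              4.0
5         2.7              3.4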

Calculate the grand mean. The ‘grand mean’ provides a simple summary of the group’s average score. First, calculate the mean for the pre-test, then repeat for the post-test. The mean is the average of a set of values: the sum of all values divided by the number of values. In other words, divide the sum of all the students’ scores by the number of students. A worked sketch follows below.

Link to Mean & Standard Deviation calculator

{Forthcoming: Video on calculating grand mean in Google Sheets}
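As a worked sketch using the hypothetical scores from the layout above, the short Python snippet below computes both grand means; it mirrors what the AVERAGE function does in Google Sheets. The variable names and values are ours for illustration.

```python
# Hypothetical retrospective pre-test and post-test scale scores for five students
pre_scores = [3.4, 2.8, 3.0, 3.6, 2.7]
post_scores = [4.2, 3.6, 3.8, 4.0, 3.4]

# Grand mean = sum of all students' scores divided by the number of students
grand_mean_pre = sum(pre_scores) / len(pre_scores)
grand_mean_post = sum(post_scores) / len(post_scores)

print(round(grand_mean_pre, 2))   # -> 3.1
print(round(grand_mean_post, 2))  # -> 3.8
```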

Calculate the change score. ‘Change scores’ provide a simple summary of the average change in a variable between two time points. To calculate the change score, subtract the pre-test grand mean from the post-test grand mean; a worked example follows below.

{Forthcoming: Video on calculating the change score in Google Sheets}
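For example, using the hypothetical grand means from the sketch above: change score = 3.8 (post) - 3.1 (pre) = 0.7. A positive change score indicates growth on the virtue; a negative change score indicates a decline.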

Calculate effect size. The change score indicates whether a change occurred and the magnitude of that change. However, the value alone makes it difficult to judge whether the difference between the pre-test and post-test is meaningful. In many assessments, t-tests are commonly used to determine whether the difference between the pre- and post-test is statistically significant. However, we will stay away from this analysis because there are many problems with significance testing, especially for small groups[3].

Instead, we will examine practical significance, which refers to the real-world relevance or importance of the findings. We will do this by determining the effect size[4]. Effect size is the magnitude of the difference between the pre-test and post-test. Specifically, we will calculate Cohen’s d. Many free online calculators can compute effect size, and a worked sketch is provided below.

Link to Effect Size calculator

{Forthcoming: Video on calculating effect size}
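There is more than one way to compute Cohen’s d for a pre/post design. A common choice, sketched below with the same hypothetical scores, divides the change score by the pooled standard deviation of the pre-test and post-test scores; some evaluators instead divide the mean change by the standard deviation of the students’ change scores, and either is reasonable as long as you report which you used. By Cohen’s widely cited conventions, a d of roughly 0.2 is small, 0.5 is medium, and 0.8 is large.

```python
import statistics
from math import sqrt

# Hypothetical retrospective pre-test and post-test scale scores for five students
pre_scores = [3.4, 2.8, 3.0, 3.6, 2.7]
post_scores = [4.2, 3.6, 3.8, 4.0, 3.4]

mean_pre, mean_post = statistics.mean(pre_scores), statistics.mean(post_scores)
sd_pre, sd_post = statistics.stdev(pre_scores), statistics.stdev(post_scores)

# Pooled standard deviation (one common choice for pre/post comparisons)
pooled_sd = sqrt((sd_pre ** 2 + sd_post ** 2) / 2)

# Cohen's d = change score / pooled standard deviation
cohens_d = (mean_post - mean_pre) / pooled_sd
print(round(cohens_d, 2))  # -> 1.98 with these hypothetical scores (a large effect)
```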

Interpretation. Next, you will make meaning of the data for yourself and others. Below is a template paragraph for making sense of the data from the analyses. This template can be used in your evaluation report. As you interpret these data, it is important to remember that the simple methods that we used for this evaluation do not permit us to generalize the information beyond the one program that was evaluated. There is a possibility that the data could be influenced by cohort, instructors, semester, and other factors.

Results

On {DATE}, an evaluation was conducted of {PROGRAM NAME} to determine if the program was effective in cultivating {VIRTUE}. The program had ## participants, and ## participants completed the program evaluation (##%). The evaluation was provided to students on the last day of the program using a post-then-pre survey. {VIRTUE} was measured using {NAME SCALE} (Citation/Footnote/Endnote). The virtue was measured using a scale of 1 (strongly disagree) to 5 (strongly agree) {Double check this sentence to ensure the scaling (1-5 or 1-7) and words (strongly agree, etc) are correct—this will change per survey}.

On average, students rated themselves on {VIRTUE} before the program at #.## {GRAND MEAN PRE} and after the program at #.## {GRAND MEAN POST}, a difference of #.## {CHANGE SCORE}. The analysis demonstrates that the observed change between pre-test and post-test scores {is/is not} practically significant, with a {small, medium, large} effect size (Cohen’s d = #.##). This finding suggests that {PROGRAM NAME} {has/does not have} a meaningful and impactful effect in cultivating {VIRTUE}.

{Forthcoming: Video- filling in template with your results}

 

Step 5. Disseminate Results via a Report

After completing the data analysis and interpretation, it is important to create a simple report (1-2 pages) of your findings. The report will provide an overview of your evaluation and interpret the results for stakeholders. The following outline is recommended.

Program Evaluation Report Outline

  • Program Name
  • Program Description
  • Purpose of the Evaluation
  • Evaluation Methods & Results
  • Implications/Recommendations

Implications and recommendations draw on both your understanding of the program’s current context and the weight of the data in helping you make informed decisions. In essence, based on what you now know about the program, what recommendations would you make to senior leadership? There are many different implications and recommendations that you might provide to your stakeholders.

Examples of Recommendations

  • Data suggests that the program is having the intended impact. Continue the program without any changes. Monitor every 3 years.
  • Continue monitoring without any changes to the program for two more rounds of program implementation to determine a baseline.
  • Make the following changes to the program: …
  • Although the program did not have an impact on X virtue, we recommend evaluating the program for Y or Z virtues in the future.
  • We have conducted 3 evaluations of the program, and the program does not have the intended impact on student development. Therefore, we recommend sunsetting the program and using the resources (time, finances, staff) for other effective programs.

 

ADDITIONAL RESOURCES

Program Description & Logic Model

Forthcoming…

 

Evaluation

Forthcoming…

 

FACILITATOR SCRIPT

The facilitator script is meant to be used for in-person administration of the evaluation. In this script, the facilitator begins by thanking participants and explaining that a program evaluation is being conducted. An optional reflection on the program with students is then provided, followed by administration of the survey. The facilitator explains the post-then-pre method to students and reminds them that the program (not the student) is being evaluated and that the evaluation is anonymous. An optional request for students to assist in scoring the survey is included. Finally, instructions are provided for concluding the evaluation.

 

INTRODUCTION

As we wrap up our final meeting, I want to say “thank you” for your involvement and participation…{expand on this}. The last thing I would like to do is have you take a survey to provide {ORG NAME} with information about how this program has impacted your life. As a reminder, the purpose of this program was to {INSERT}. And we have tried to do this as we covered {PROGRAM CONTENT} and engaged in {PROGRAM ACTIVITIES}. Can you believe it? We have been doing this for # weeks.

REFLECTION ACTIVITY ON PROGRAM IMPACT (OPTIONAL)

I wonder: what have been your top 3 lessons learned or favorite moments from our time together? {HAVE STUDENTS WRITE THESE DOWN. Then allow students to pair and share and/or have a few students share with the larger group.} Thank you so much for sharing these with our group.

ADMINISTERING THE SURVEY

{Pass out survey}.

I am going to have you take this survey. The purpose of this survey is to help us improve our programs. We want to learn more about how well the program is helping our students grow—we will ask you questions about {VIRTUE}. Your input is very valuable to us when making future decisions about this program.

This survey is slightly different from regular research surveys; it is specifically meant for evaluating programs like this one. Therefore, you will be asked to respond twice for each item. Let’s look at the first item of the survey together. The item states: {READ ITEM}. As you respond, there are two things to consider. First, right now, after completing {PROGRAM}, how would you rate yourself? Second, how would you have rated yourself before being part of this program? This will help us understand how much you think you have grown from being a part of {PROGRAM}.

I ask that you be as honest as possible—and remember we are evaluating the program (not you). Also, we are not asking for any demographic information with this survey, so we cannot identify any one person’s response.

SCORING (OPTIONAL)

When you are done with the survey, please help us with scoring. You will find the instructions for how to score on the second page. {Note any special instructions of which students should be mindful.}

CONNECTION & SUPPORT

Thank you for taking the time to complete this survey. I realize that some emotions and feelings may have surfaced while answering these questions. Please feel free to reach out to {Name} at {Email} if you would like to have someone to talk with about what surfaced and provide you with some spiritual or other guidance.

GOODBYE

When you are completely done, please create a stack here. I will be outside the room to say “Goodbye” as you leave. {Or provide other instructions for when students are done.}

 

 


  1. “Retrospective pretests consistently indicate larger program effects than direct comparisons of true pretest and posttest scores (e.g., Hill & Betz, 2005; Hoogstraten, 1985; Pratt et al., 2000; Sibthorp, Paisley, Gookin, & Ward, 2007)” from Geldhof, G. J., Warner, D. A., Finders, J. K., Thogmartin, A. A., Clark, A., & Longway, K. A. (2018). Revisiting the utility of retrospective pre-post designs: The need for mixed-method pilot data. Evaluation and Program Planning, 70, 83-89.
  2. Lamb, T. (2005). The retrospective pretest: An imperfect but useful tool. The Evaluation Exchange, 11(2), 18; Howard, G. S. (1980). Response-shift bias: A problem in evaluating interventions with pre/post self-reports. Evaluation Review, 4(1), 93-106.
  3. “Here is the major problem. Statistical significance has nothing to do with the main definition of significance – the quality of being important. It is a technical term more accurately defined to be ‘when the probability that the observed data differs from the assumed model is very small [smaller than some arbitrary subjectively-selected cutpoint like 0.05]’.” Bangdiwala, S. I. (2016). Understanding significance and p-values. Nepal Journal of Epidemiology, 6(1), 522.
  4. Sullivan, G. M., & Feinn, R. (2012). Using effect size—or why the P value is not enough. Journal of Graduate Medical Education, 4(3), 279-282.

License


Guide: Program Evaluation Copyright © 2024 by Karen K. Melton is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, except where otherwise noted.
