Glossary


Alpha Level (α)

The criterion against which researchers compare the p-value (see below) to determine whether an estimated program impact is statistically significant: an impact is considered statistically significant when its p-value is below the alpha level. The p-value is the probability that an estimated impact that large, or larger, in magnitude could have occurred by chance, even if the program had no actual impact. The alpha level should be specified by the researcher before outcome data collection begins. Many researchers set alpha equal to 0.05, by convention, but under certain circumstances a larger value (such as an alpha level of 0.10) or a smaller value (such as an alpha level of 0.01) may be preferable.

Baseline Differences

For between-group designs, the difference between the average characteristic of one group versus the average characteristic of another, prior to program (or intervention) delivery. A statistical hypothesis test is typically applied to evaluate whether this difference is due to chance.

Between-Group Designs

Designs that compare the average outcome from each of at least two groups.


Bias

In the context of program evaluation, this refers to the extent to which the program impact estimated using the study sample departs from the true impact in the population, on average across many replications. When an estimate is biased, it will be systematically higher or lower than the true impact.


Blocking

This approach is used during the assignment phase of a study to improve precision of the estimated program impact, to balance the groups on certain characteristics, or both. This is accomplished by choosing a characteristic (such as locale), then grouping study units by levels of that characteristic (e.g., urban, suburban, and rural). Within each level, study units are assigned to groups (using random assignment or matching).
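As a rough illustration of the procedure described above, the following Python sketch groups study units by a characteristic and then randomly assigns units to groups within each level. The function name `blocked_assignment`, the site IDs, and the fixed seed are hypothetical choices made for this example, and random assignment is only one of the two within-level options the definition mentions.

```python
import random
from collections import defaultdict


def blocked_assignment(units, seed=0):
    """Randomly assign units to groups within each level of a characteristic.

    `units` is a list of (unit_id, level) pairs. Returns a dict mapping
    each unit_id to "intervention" or "control".
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible

    # Step 1: group study units by level of the chosen characteristic.
    levels = defaultdict(list)
    for unit_id, level in units:
        levels[level].append(unit_id)

    # Step 2: within each level, shuffle the units and split them in half.
    # With an odd number of units, the extra unit goes to the control group.
    assignment = {}
    for level in sorted(levels):
        members = levels[level]
        rng.shuffle(members)
        half = len(members) // 2
        for unit_id in members[:half]:
            assignment[unit_id] = "intervention"
        for unit_id in members[half:]:
            assignment[unit_id] = "control"
    return assignment


# Hypothetical program sites, grouped by locale.
sites = [
    ("A", "urban"), ("B", "urban"), ("C", "urban"), ("D", "urban"),
    ("E", "suburban"), ("F", "suburban"),
    ("G", "rural"), ("H", "rural"),
]
result = blocked_assignment(sites)
```

Because the split happens inside each locale, every locale contributes equally to both groups, which is the balancing property the definition describes.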

Comparison Group

A group of study participants who do not receive program services, usually formed through methods other than random assignment. This group serves as the counterfactual relative to the program (or intervention) group. Because this group is formed by methods other than random assignment, it is considered a “weaker” comparative group than a control group formed by random assignment.

Confirmatory Research Question

The primary research question that the study is statistically powered to address and the answer to which can be used to inform policy.

Control Group

A group of study participants, formed by random assignment, who do not receive program services and who are assessed in comparison to the group receiving services (or the intervention). A randomly assigned control group of participants, statistically, should be similar in both known and unknown ways to the group of participants receiving services. It is considered the strongest possible group to compare to the intervention group.


Counterfactual

A term used in evaluation to denote a hypothetical condition representing what would have happened to the intervention group if it had not received the intervention. The counterfactual cannot be directly observed, so it is usually approximated by observing some group that is “nearly identical,” but did not receive the intervention. In random assignment studies, the “control group” formed by random assignment serves as the counterfactual, because it is equivalent to the intervention group in every way, on average, except for receiving the intervention.


Covariates

A statistical term for characteristics of study participants that are, typically, correlated with the outcome. These characteristics could explain the differences seen between program participants and the control or comparison group. As such, these variables are often used as statistical controls in models used to estimate the impact of the intervention on study participants’ outcomes.

Effect Size

A way of statistically describing how much a program affects outcomes of interest. Effect size is the difference between the average outcomes of the intervention and control groups, expressed in standard deviation units. It is derived by dividing the difference by a standard deviation of the outcome (for example, the standard deviation pooled across both groups).
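The calculation can be sketched in a few lines of Python. This example assumes the pooled standard deviation of the two groups is used as the divisor, which is one common choice rather than the only one; the function name and the outcome scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_effect_size(intervention, control):
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(intervention), len(control)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(intervention) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(intervention) - mean(control)) / sqrt(pooled_var)


# Hypothetical outcome scores for each group.
intervention_scores = [12, 14, 16, 18, 20]
control_scores = [10, 12, 14, 16, 18]
effect = standardized_effect_size(intervention_scores, control_scores)
# The means differ by 2 points and the pooled variance is 10,
# so the effect size is 2 / sqrt(10), roughly 0.63 standard deviations.
```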

Evidence Base

The body of research and evaluation studies that support a program or components of a program’s intervention.

Experimental Design

A research design in which the effects of a program, intervention, or treatment are examined by comparing individuals who receive it with a comparable group who do not. In this type of research, individuals are randomly assigned to the two groups to try to ensure that, prior to taking part in the program, each group is statistically similar in both observable ways (e.g., race, gender, or years of education) and unobservable ways (e.g., levels of motivation, belief systems, or disposition towards program participation). Experimental designs differ from quasi-experimental designs in how individuals are assigned to program participation or not. In a quasi-experimental design, non-random assignment is used; because group assignment is then usually based on observable characteristics, evaluators cannot be confident that unobservable characteristics are also similar in each group.

Exploratory Research Question

In contrast to a Confirmatory Research Question, an exploratory research question is posed and then addressed to inform future research rather than to inform policy. This question type includes questions that examine, for example, which specific subgroups respond best to an intervention; such questions are less likely to be answered with strong statistical certainty, but may be helpful for program implementation and future evaluation. If a question arises as a result of analyzing the data and was not originally posed as a fundamental impact of the program before data were collected, it is categorized as exploratory.

External Validity

The extent to which evaluation results, statistically, are applicable to groups other than those in the research. More technically, it refers to how well the results obtained from analyzing a sample of study participants from a population can be generalized to that population. The strongest basis for applying results obtained from a sample to a population is when the sample is randomly selected from that population. Otherwise, this generalization must be made on extra-statistical grounds, that is, on a non-statistical basis.


Idiographic

Research that focuses on the outcomes for individuals rather than for groups. This is in contrast to research that is nomothetic, which is research that focuses on outcomes at the group level. For between-group designs, the strength of the causal attribution depends on how the control or comparison group was formed (random assignment, matching, non-random assignment).

Impact Evaluation

An evaluation designed to determine if the outcomes observed among program participants are due to having received program services or the intervention.

Implementation Fidelity (Fidelity of Intervention Implementation)

The extent to which the program or intervention was implemented as intended. The intention usually is expressed prior to intervention delivery and, when available, in intervention developer documents such as the program theory and logic model.

Informed Consent

A dimension of Human Subjects Protection that requires researchers to make sure that potential study participants (both program participants and control or comparison group members) are fully informed of the potential risks or benefits, if any, and conditions of study participation.

Intent-to-Treat (ITT)

An approach for analyzing data from between-group designs in which study participants are analyzed in the group to which they were assigned at the start of the study, regardless of (1) the group they end up in at the end of the study and (2) for individuals in the intervention group, whether or not they participate in the intervention. In intent-to-treat analysis, the aim is to estimate the impact of the “offer” of the intervention regardless of whether it is received, as opposed to focusing on how participants’ experience of program participation affects an outcome.

Internal Validity

For a given design, the extent to which the observed difference in the average group outcomes (usually program participants versus control or comparison group members) can be causally attributed to the intervention or program. Randomized controlled trials allow for high causal attribution because of their ability to rule out alternative explanations (usually unobserved characteristics) other than the intervention as the reason for the observed effect.


Intervention

A term used to describe the services or activities a program provides to achieve its stated outcomes, goals, or desired results.

Intervention Level

The level (e.g., at the individual, group, community, or structural level of society) at which a specific program offers treatment or services to address a particular problem or issue.

Level of Evidence

The quality of findings, based on empirical data, from an evaluation or research study. Although there is no consensus within the evaluation field concerning what constitutes a particular level of evidence, the SIF program divides evidence into three categories: preliminary, moderate, and strong. These divisions are based on how well a particular evaluation is able to address concerns about internal and external validity, with evaluations that do a better job generating strong or moderate levels and those that are less able to do so generating preliminary levels of evidence.


Matching

A technique used to pair participants in an evaluation based on observed characteristics that are correlated with the outcome of interest. The pairing is then used to create intervention and control groups that are similar based on these characteristics.

Minimum Detectable Effect Size (MDES)

The smallest effect size (usually, the comparative difference measured in an outcome between program participants and control or comparison group members) that can be detected for a given design and under certain assumptions with a specified probability (typically .80). Typically, increasing the sample size leads to a smaller MDES (that is, enables the study to detect a smaller impact).
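To make the relationship between sample size and MDES concrete, the sketch below uses a widely used normal-approximation formula, MDES ≈ (z for 1 − α/2 + z for power) × sqrt(1 / (P(1 − P)N)), where N is the total sample size and P is the proportion assigned to the intervention group. It assumes a simple design with individual-level random assignment, a two-tailed test, and no covariates or clustering; these assumptions, and the function name `mdes`, are illustrative and are not taken from this glossary.

```python
from math import sqrt
from statistics import NormalDist


def mdes(n_total, prop_treated=0.5, alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size in standard deviation units.

    Assumes individual-level random assignment, a two-tailed test,
    and no covariates or clustering (normal approximation).
    """
    z = NormalDist()
    # Roughly 1.96 + 0.84 = 2.80 for alpha = 0.05 and power = 0.80.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sqrt(1 / (prop_treated * (1 - prop_treated) * n_total))


mdes_400 = mdes(400)    # roughly 0.28 standard deviations
mdes_1600 = mdes(1600)  # roughly 0.14 standard deviations
```

Under these assumptions, quadrupling the sample size halves the MDES, which illustrates why larger samples can detect smaller impacts.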

Multiple Comparisons

When between-group designs are used, there are opportunities to compare multiple groups on the same outcome or two groups (program participants versus control or comparison group members) on multiple outcomes. Making many such comparisons inflates the overall chance of a false positive beyond the nominal alpha level, and may require the researcher to adjust the alpha level used for each test downward. That is, if many outcomes are addressed in a study, it is possible that some will be erroneously viewed as statistically significant even though they are in reality due to chance.
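The size of the problem can be illustrated with a short calculation. The sketch below assumes the tests are independent, and it uses the Bonferroni correction as one example of a downward adjustment; other adjustment methods exist, and nothing here implies a particular method is required.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha that keeps the overall error rate at or below alpha."""
    return alpha / n_tests


# With 10 outcomes each tested at alpha = 0.05, the chance of at least
# one spurious "significant" finding is roughly 0.40, not 0.05.
unadjusted_rate = familywise_error_rate(0.05, 10)

# Testing each outcome at 0.05 / 10 = 0.005 brings the overall rate
# back to just under 0.05.
adjusted_alpha = bonferroni_alpha(0.05, 10)
adjusted_rate = familywise_error_rate(adjusted_alpha, 10)
```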


Nomothetic

In contrast to idiographic research, nomothetic research focuses on group outcomes, typically based on the average.

Post Hoc

Means “after the fact.” In the context of evaluation, the term refers to analyses that were not specified before the data were analyzed.

Propensity Score

A score, calculated using logistic regression techniques from known characteristics of an individual or group, that predicts the probability of group membership (e.g., intervention or program participation group versus comparison group).

Propensity Score Matching

The use of Propensity Scores to identify participants for inclusion in the comparison group. Propensity Score Matching can decrease pre-treatment differences between the treatment and comparison groups, thereby reducing selection bias, which constitutes a key threat to internal study validity.
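One simple way to carry out the matching step is nearest-neighbor matching without replacement, sketched below. The example assumes the propensity scores have already been estimated (for instance, with a logistic regression), and the participant IDs and scores are invented for illustration; real studies often add refinements such as a maximum allowable score distance.

```python
def nearest_neighbor_match(treated, candidates):
    """Pair each treated unit with the closest-scoring unused candidate.

    `treated` and `candidates` map unit IDs to propensity scores.
    Returns a dict mapping each treated ID to its matched candidate ID.
    """
    available = dict(candidates)  # copy, so the caller's dict is unchanged
    matches = {}
    for treated_id, treated_score in treated.items():
        if not available:
            break  # no candidates left to match
        best = min(available, key=lambda c: abs(available[c] - treated_score))
        matches[treated_id] = best
        del available[best]  # without replacement: use each candidate once
    return matches


# Hypothetical propensity scores.
program_group = {"T1": 0.62, "T2": 0.35}
candidate_pool = {"C1": 0.30, "C2": 0.60, "C3": 0.90, "C4": 0.10}
pairs = nearest_neighbor_match(program_group, candidate_pool)
# T1 (0.62) is closest to C2 (0.60); T2 (0.35) is closest to C1 (0.30).
```

The matched candidates (here C1 and C2) would form the comparison group, while poorly matching candidates such as C3 and C4 are left out.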


p-value

In the context of an impact evaluation, a statistical term used to describe the probability that an impact at least as large in magnitude as the one observed in the sample could have come from a population in which there is no impact.

Quasi-Experimental Design

A design that includes a comparison group formed using a method other than random assignment, or a design that controls for threats to validity using other counterfactual situations, such as groups which serve as their own control group based on trends created by multiple pre/post measures. Quasi-experimental design, therefore, controls for fewer threats to validity than an Experimental Design.

Random Assignment

A process that uses randomly generated numbers (or other method of randomizing study participants) to assign study units (people, program sites, etc.) to either the program participant or control group. The use, or lack of use, of this process differentiates experimental designs from non-experimental designs.


Regression

A statistical model used to examine the influence of one or more factors or characteristics on another factor or characteristic (referred to as variables). This model specifies the impact of a one-unit change in the independent variable or variables (sometimes referred to as the predictor variable or variables) on the dependent variable (sometimes referred to as the outcome variable). Regression models can take a variety of forms (ordinary least squares, weighted least squares, logistic, etc.) and require that the data meet certain requirements (or be adjusted, post hoc, to meet these requirements). Because regression models can include several predictor variables, they allow researchers to examine the impact of one variable on an outcome while taking into account other variables’ influence.

Regression Discontinuity Design

A form of research design used in program evaluation to create a stronger comparison group (i.e., reduce threats to internal validity) in a quasi-experimental design evaluation study. The intervention and control groups are formed using a well-defined cutoff score. The group below the cutoff score receives the intervention and the group above does not, or vice versa. For example, if students are selected for a program based on test scores, those students just above the cutoff score and those students just below it are expected to be very similar except for participation in the program, and can be compared with each other to determine the program’s impact.

Selection Bias

When study participants are assigned to groups such that the groups differ in either (or both) observed or unobserved characteristics, resulting in group differences prior to delivery of the intervention. If not adjusted for during analysis, these differences can bias the estimate of program impacts on the outcome.

Standard Error

In the context of an impact evaluation, this is the standard deviation of the sampling distribution of the estimate of the program impact. The impact estimate is divided by its standard error to obtain the test statistic and associated p-value, which are used to judge whether the impact is real or due to chance (i.e., sampling error).
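The following Python sketch walks through that calculation. It assumes a large-sample (normal) approximation and a two-tailed test; with small samples a t distribution would typically be used instead. The impact estimate and standard error are hypothetical numbers, and `significance_test` is a name chosen for this example.

```python
from statistics import NormalDist


def significance_test(impact_estimate, standard_error, alpha=0.05):
    """Return the test statistic, two-tailed p-value, and significance flag."""
    test_statistic = impact_estimate / standard_error
    # Probability of a statistic at least this large in magnitude
    # if the true impact were zero (normal approximation).
    p_value = 2 * (1 - NormalDist().cdf(abs(test_statistic)))
    return test_statistic, p_value, p_value < alpha


# Hypothetical estimate: a 3.0-point impact with a standard error of 1.2.
statistic, p_value, significant = significance_test(3.0, 1.2)
# The test statistic is 2.5 and the p-value is roughly 0.012, so the
# impact is statistically significant at alpha = 0.05 but not at 0.01.
```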

Statistical Equivalence

In research, this term refers to situations in which two groups appear to differ, but in truth are not statistically different from one another based on statistical levels of confidence. In a sample, two groups may have what appears to be an average difference on a baseline characteristic. However, when this difference is assessed relative to the population from which this group was drawn (using a statistical hypothesis test), the conclusion is that this difference is ‘what would be expected’, due to sampling from the population, and there is really no difference, statistically, between the groups.

Statistical Power

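A gauge of the sensitivity of a statistical test. That is, it describes the ability of a statistical test to detect effects of a specific size, given the particular variances and sample sizes in a study. It is expressed as the probability that the test will detect an effect of that size when the effect truly exists.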

Theory of Change

The underlying principles that generate a logic model. A theory of change clearly expresses the relationship between the population/context the program targets, the strategies used, and the outcomes (and/or impact) sought.

Treatment-on-Treated (TOT)

In contrast to Intent-to-Treat (ITT), Treatment-on-Treated (TOT) is a type of analysis that measures the program impact by going beyond just the “offer” of the program to consider the level of program uptake. TOT is typically thought of as a measure of the impact on those who actually got the treatment, rather than those who were offered it.
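One common way to move from an ITT estimate to a TOT estimate is to divide the ITT impact by the share of the intervention group that actually participated. The sketch below shows that adjustment; it rests on two assumptions that are not stated in this glossary: no control group members receive the program, and the offer has no effect on people who do not participate. The function name and the figures are hypothetical.

```python
def tot_from_itt(itt_impact, participation_rate):
    """Scale an ITT impact by program uptake to approximate a TOT impact.

    Assumes no control group members receive the program and that the
    offer alone has no effect on those who do not participate.
    """
    if not 0 < participation_rate <= 1:
        raise ValueError("participation_rate must be in (0, 1]")
    return itt_impact / participation_rate


# Hypothetical: the offer raised outcomes by 2.0 points on average,
# and 80% of those offered the program actually took part.
tot_impact = tot_from_itt(2.0, 0.80)
# The implied impact on participants is 2.0 / 0.80 = 2.5 points.
```

When everyone offered the program participates, the two estimates coincide; the lower the uptake, the larger the gap between them.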

Unit of Analysis

Study participants that comprise the sample that will be used to produce study results; this may or may not be individuals, as sometimes studies compare program sites, groups of participants and non-participants at the aggregate level, or states, for example.

