"Improving students’ long-term knowledge retention through
personalized review", Lindsey et al 2013
http://laplab.ucsd.edu/articles/LindseyShroyerPashlerMozer2013.pdf
> Human memory is imperfect; thus, periodic review is required for the long-term preservation of knowledge and skills. However, students at every educational level are challenged by an ever-growing amount of material to review and an ongoing imperative to master new material. We developed a method for efficient, systematic, personalized review that combines statistical techniques for inferring individual differences with a psychological theory of memory. The method was integrated into a semester-long middle school language course via retrieval-practice software. In a cumulative exam administered after the semester’s end that compared time-matched review strategies, personalized review yielded a 16.5% boost in course retention over current educational practice (massed study) and a 10.0% improvement over a one-size-fits-all strategy for spaced study.
> ...We incorporated systematic, temporally distributed review into third-semester Spanish foreign language instruction using a web-based flashcard tutoring system, the Colorado Optimized Language Tutor or colt. Throughout the semester, 179 students used colt to drill on ten chapters of material. colt presented vocabulary words and short sentences in English and required students to type the Spanish translation, after which corrective feedback was provided.
>
> ...A generic-spaced scheduler selected one previous chapter to review at a spacing deemed to be optimal for a range of students and a variety of material according to both empirical studies (Cepeda et al., 2006; Cepeda, Vul, Rohrer, Wixted, & Pashler, 2008) and computational models (Khajah, Lindsey, & Mozer, 2013; Mozer, Pashler, Cepeda, Lindsey, & Vul, 2009). On the time frame of a semester—where material must be retained for 1-3 months—a one-week lag between initial study and review obtains near-peak performance for a range of declarative materials. To achieve this lag, the generic-spaced scheduler selected review items from the previous chapter, giving priority to the least recently studied (Figure 1).
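
(The generic-spaced scheduler is simple enough to sketch in a few lines of Python; the item-dict fields here are my invention for illustration, not the paper's actual data format:)

```python
from datetime import datetime, timedelta

def generic_spaced_select(items, n_review):
    """Pick review items from the previous chapter, least recently
    studied first - a sketch of the generic-spaced scheduler's
    priority rule, under assumed data structures."""
    return sorted(items, key=lambda it: it['last_studied'])[:n_review]

# With one item studied per day, the two oldest items are selected:
items = [{'id': i, 'last_studied': datetime(2013, 1, 1) + timedelta(days=i)}
         for i in range(5)]
print([it['id'] for it in generic_spaced_select(items, 2)])  # → [0, 1]
```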
> A personalized-spaced scheduler used a latent-state Bayesian model to predict what specific material a particular student would most benefit from reviewing. This model infers the instantaneous memory strength of each item the student has studied. The inference problem is difficult because past observations of a particular student studying a particular item provide only a weak source of evidence concerning memory strength. To illustrate, suppose that the student had practiced an item twice, having failed to translate it once 15 days ago but having succeeded 9 days ago. Based on these sparse observations, it would seem that one cannot reliably predict the student’s current ability to translate the item. However, data from the population of students studying the population of items over time can provide constraints helpful in characterizing the performance of a specific student for a specific item at a given moment. Our model-based approach is related to that used by e-commerce sites that leverage their entire database of past purchases to make individualized recommendations, even when customers have sparse purchase histories. Our model defines memory strength as being jointly dependent on factors relating to (1) an item’s latent difficulty, (2) a student’s latent ability, and (3) the amount, timing, and outcome of past study. We refer to the model with the acronym dash summarizing the three factors (difficulty, ability, and study history). By incorporating psychological theories of memory into a data-driven modeling approach, dash characterizes both individual differences and the temporal dynamics of learning and forgetting. The Appendix describes dash in detail.
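
(To give a flavor of the three-factor structure: something like the following, where recall probability is a logistic function of student ability, item difficulty, and log-counts of past study. The names and exact feature form are my guesses, not the paper's actual parameterization - see their Appendix for the real model:)

```python
import math

def dash_predict(ability, difficulty, windows, weights):
    """Sketch of a DASH-style prediction: logistic in student ability,
    item difficulty, and log-counts of (attempts n, successes c) of
    past study in expanding time windows. Illustrative only; the
    parameterization is assumed, not taken from the paper."""
    history = sum(w_c * math.log(1 + c) + w_n * math.log(1 + n)
                  for (n, c), (w_n, w_c) in zip(windows, weights))
    return 1.0 / (1.0 + math.exp(-(ability - difficulty + history)))

# 2 attempts, 1 correct, in a single window; a stronger student
# (higher ability) gets a higher predicted recall probability:
p_avg = dash_predict(0.0, 0.0, [(2, 1)], [(0.3, 0.5)])
p_strong = dash_predict(1.0, 0.0, [(2, 1)], [(0.3, 0.5)])
assert 0.0 < p_avg < p_strong < 1.0
```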
> The scheduler was varied within participant by randomly assigning one third of a chapter’s items to each scheduler, counterbalanced across participants. During review, the schedulers alternated in selecting items for retrieval practice. Each selected from among the items assigned to it, ensuring that all items had equal opportunity and that all schedulers administered an equal number of review trials. Figure 1 and Table 1 present student-item statistics for each scheduler over the time course of the experiment.
>
> ...To evaluate the quality of dash’s predictions, we compared dash against alternative models by dividing the 597,990 retrieval practice trials recorded over the semester into 100 temporally contiguous disjoint sets, and the data for each set was predicted given the preceding sets. The accumulative prediction error (Wagenmakers, Grünwald, & Steyvers, 2006) was computed using the mean deviation between the model’s predicted recall probability and the actual binary outcome, normalized such that each student is weighted equally. Figure 4 compares dash against five alternatives: a baseline model that predicts a student’s future performance to be the proportion of correct responses the student has made in the past, a Bayesian form of item-response theory (irt) (De Boeck & Wilson, 2004), a model of spacing effects based on the memory component of act-r (Pavlik & Anderson, 2005), and two variants of dash that incorporate alternative representations of study history motivated by models of spacing effects (act-r, mcm). Details of the alternatives and the evaluation are described in the Supplemental Online Material. The three variants of dash perform better than the alternatives. Each variant has two key components: (1) a dynamical representation of study history that can characterize learning and forgetting, and (2) a Bayesian approach to inferring latent difficulty and ability factors. Models that omit the first component (baseline and irt) or the second (baseline and act-r) do not fare as well. The dash variants all perform similarly.
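
(Their evaluation protocol - predict each temporal chunk of trials from all preceding data - would look roughly like this; `fit` and `predict` are hypothetical model callbacks, and the trial fields are assumed:)

```python
def accumulative_prediction_error(trials, n_sets, fit, predict):
    """Sketch of the accumulative-prediction-error evaluation: split
    time-ordered trials into contiguous sets, predict each set from
    the preceding data, and average |p - outcome| so that each
    student is weighted equally. Data format and callbacks are
    illustrative assumptions, not the paper's code."""
    trials = sorted(trials, key=lambda t: t['time'])
    size = max(1, len(trials) // n_sets)
    per_student = {}  # student id -> list of absolute errors
    # The first set is training only; each later set is predicted
    # from everything before it.
    for i in range(size, len(trials), size):
        model = fit(trials[:i])
        for t in trials[i:i + size]:
            p = predict(model, t)
            per_student.setdefault(t['student'], []).append(abs(p - t['correct']))
    means = [sum(errs) / len(errs) for errs in per_student.values()]
    return sum(means) / len(means)

# Trivial baseline: predict every student's past proportion correct.
trials = [{'time': t, 'student': t % 2, 'correct': t % 2} for t in range(10)]
fit = lambda data: sum(d['correct'] for d in data) / len(data)
predict = lambda model, t: model
print(accumulative_prediction_error(trials, 5, fit, predict))  # → 0.5
```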
DASH is defined on pg. 11. Unfortunately, they don't compare it directly
against any of the SuperMemo algorithms, so I'm not sure how useful it
would be.
--
gwern
http://www.gwern.net