Recently, it has been suggested that parameter estimates of computational models can be used to understand individual differences at the process level. One area of research in which this approach, called computational phenotyping, has taken hold is computational psychiatry, but it is also used to understand differences related to age and personality. One requirement for successful computational phenotyping is that behavior and parameters are stable over time. Surprisingly, the test-retest reliability of behavior and model parameters remains unknown for most experimental tasks and models. The present study seeks to close this gap by investigating the test-retest reliability of canonical reinforcement learning models in the context of two widely used learning paradigms: a two-armed bandit and a reversal learning task. We tested independent cohorts for the two tasks (N=142 and N=154) via an online testing platform with a between-test interval of five weeks. Whereas reliability was high for personality and cognitive measures, it was generally poor for the parameter estimates of the reinforcement learning models. Given that simulations indicated that our procedures could detect high test-retest reliability, this suggests that a significant proportion of the variability must be ascribed to the participants themselves. In support of that hypothesis, we show that mood (stress and happiness) can partly explain within-subject variability. Taken together, these results have critical implications for current practices in computational phenotyping and suggest that within-individual variability should be taken into account in the future development of the field.