Reaction time (RT) data are often pre-processed before analysis by rejecting outliers and errors and aggregating the data. In stimulus-response compatibility paradigms such as the Approach-Avoidance Task (AAT), researchers often decide how to pre-process the data without an empirical basis, leading to the use of methods that may hurt rather than help data quality. To provide this empirical basis, we investigated how different pre-processing methods affect the reliability and validity of this task. Our literature review revealed 108 different pre-processing pipelines among 163 examined studies. Using simulated and real datasets, we found that validity and reliability were negatively affected by retaining error trials, by replacing error RTs with the mean RT plus a penalty, by retaining outliers, and by removing the highest and lowest sample-wide RT percentiles as outliers. We recommend removing error trials and rejecting RTs deviating more than 2 or 3 SDs from the participant mean. Bias scores were more reliable but not more valid if computed with means or D-scores rather than with medians. Bias scores were less accurate if based on averaging multiple conditions together, as with compatibility scores, rather than being based on separate averages per condition, as with double-difference scores. We call upon the field to drop these suboptimal practices to improve the psychometric properties of the AAT. We also call for similar investigations in related RT-based cognitive bias measures such as the implicit association task, as their commonly accepted pre-processing practices currently involve many of the aforementioned discouraged methods.
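
To make the recommended pipeline concrete, the following is a minimal Python sketch for a single participant, not the analysis code used in the study. The trial format (dictionaries with `rt`, `correct`, `stimulus`, and `movement` keys), the function names, and the toy data are all hypothetical. The sketch assumes that the SD cutoff is applied to the participant's correct trials pooled across conditions, that the double-difference score is (avoid target − approach target) − (avoid control − approach control), and that the compatibility score is the mean of pooled incompatible trials minus the mean of pooled compatible trials; individual studies may define these quantities differently.

```python
from statistics import mean, stdev


def clean_trials(trials, sd_cutoff=3.0):
    """Remove error trials, then remove RTs deviating more than
    sd_cutoff SDs from this participant's mean RT (assumed: pooled
    over all of the participant's correct trials)."""
    correct = [t for t in trials if t["correct"]]
    rts = [t["rt"] for t in correct]
    if len(rts) < 2:
        return correct
    m, s = mean(rts), stdev(rts)
    if s == 0:
        return correct
    return [t for t in correct if abs(t["rt"] - m) <= sd_cutoff * s]


def cell_rts(trials, stimulus, movement):
    """RTs belonging to one stimulus-by-movement condition."""
    return [t["rt"] for t in trials
            if t["stimulus"] == stimulus and t["movement"] == movement]


def double_difference(trials):
    """Average each condition separately, then combine the four means."""
    target_bias = (mean(cell_rts(trials, "target", "avoid"))
                   - mean(cell_rts(trials, "target", "approach")))
    control_bias = (mean(cell_rts(trials, "control", "avoid"))
                    - mean(cell_rts(trials, "control", "approach")))
    return target_bias - control_bias


def compatibility_score(trials):
    """Pool conditions first: incompatible minus compatible trials."""
    incompatible = (cell_rts(trials, "target", "avoid")
                    + cell_rts(trials, "control", "approach"))
    compatible = (cell_rts(trials, "target", "approach")
                  + cell_rts(trials, "control", "avoid"))
    return mean(incompatible) - mean(compatible)


def make_trial(rt, stimulus, movement, correct=True):
    return {"rt": rt, "stimulus": stimulus,
            "movement": movement, "correct": correct}


# Hypothetical toy data: eight regular trials, one error, one slow outlier.
trials = [
    make_trial(500, "target", "approach"), make_trial(520, "target", "approach"),
    make_trial(600, "target", "avoid"), make_trial(620, "target", "avoid"),
    make_trial(550, "control", "approach"), make_trial(570, "control", "approach"),
    make_trial(560, "control", "avoid"), make_trial(580, "control", "avoid"),
    make_trial(300, "target", "approach", correct=False),  # error trial
    make_trial(5000, "control", "avoid"),                  # slow outlier
]

# A 2 SD cutoff is used here because the toy dataset is very small.
cleaned = clean_trials(trials, sd_cutoff=2.0)
print(double_difference(cleaned))    # 90
print(compatibility_score(cleaned))  # 45
```

In this balanced toy example the compatibility score is exactly half the double-difference score. The two scores diverge once trial counts differ between conditions, for instance after error and outlier removal, because pooling weights each condition by its number of remaining trials.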