A Powered by Osteons reader, Judy Barr, alerted Kristina Killgrove to a new article in the journal Science and Justice called “Cognitive bias in forensic anthropology: Visual assessment of skeletal remains is susceptible to confirmation bias” by Nakhaeizadeha, Dror & Morgan. Kristina then posted it to BioAnthropology News (if you’re not a member, click on the link and join!), and got into a back-and-forth with fellow bioarchaeologist Alison Atkin, who also writes the Deathsplanation blog. We decided to try something new: opening up our ideas by cross-posting our conversation on our respective blogs.
The article is open for anyone to read (click link above), but boils down to the idea of priming: that a person’s response to a stimulus can be affected by another stimulus. An example from marketing: if you go to the grocery store and see that breakfast cereal was originally $2.99 but is now marked down to $2.49, you’re more likely to buy it than if the store simply labeled it as $2.49 to begin with. In this article, the authors “prime” a subject pool by giving them true, false, and no additional context about a skeleton to see if the extraneous context biases their forensic opinion about the demographics (age, sex, and ancestry).
Kristina Killgrove: I’ll kick this off with my initial interest in the article, prior to fully reading it. This is not a journal I regularly read, or was even aware of, but my first impression was that the article is overly critical of osteological methods of assessing sex. Given a complete, well-preserved skeleton, we can be about 95-99% confident of our sex assessment, but given incomplete remains, particularly from an archaeological context, that accuracy drops. Due to the article’s placement in this journal, though, it is clear that the authors want to raise awareness of the shortcomings of osteological demographic analysis because of its relevance to forensic cases and possible (in)admissibility in court. There is a lot of misunderstanding about forensic anthropology, due to factors like TV portrayal of the field in Bones or CSI and due to the widespread ignorance of science in the U.S., so my interest was piqued by the article’s title and the abstract. What was your first reaction to the article, Alison?
Alison Atkin: I think my first reaction to the article was quite similar to yours – although perhaps a bit more unprofessional. If I recall correctly, it was something along the lines of, “How dare ‘they’ question the reliability of this science to which I dedicate my life.” The osteology and biological/forensic anthropology community over here (in the UK) isn’t enormous. So, given the topic of this study, I was surprised that I wasn’t familiar with the authors of the study – however, after fully reading the paper, it became clear why non-(osteology)-specialists were interested in this area of research.
My initial skim-read of the article produced a lot of questions on various aspects of the study. While I feel very strongly about osteology being a fantastic scientific discipline, I am aware that it has its fair share of methodological problems. However, this paper felt really unbalanced (and not in our favour). Although my initial reactions were probably in defence of my discipline, after looking more in-depth into the issues I had with this study, I think most of them were justified.
KK: Well, I’ll admit that I also had some knee-jerk reactions upon skimming. But I moved on to the methodology section, because I was genuinely curious about the way the authors set up their study. Unfortunately, I can’t fairly assess the methodology because the authors don’t give enough relevant information. My main issues/questions about the study are:
- They don’t fully discuss which portions of the skeleton were available for study. Was the pubic symphysis — by far the best portion of the skeleton for accurately estimating both age and sex — present or did the participants rely on less accurate techniques?
- The skeleton was archaeological, not forensic, and the participants didn’t know this. This context — bioarchaeological or forensic — is very relevant for completing a full demographic profile because some methods may be affected by secular or cross-cultural differences in populations.
- The participants were largely students. Students outnumbered professionals (using the terms of the authors) 2 to 1. In the control group C, the students outnumbered the professionals 3 to 1. This makes it impossible to control for experience in assessing bias. (I have nothing against students, of course, since I was one just a few short years ago. But I am also certain that I am better at my job now than I was then.)
This is not to say that the topic of the paper is bad; in fact, I am actually worried that context may bias osteological analysis. This isn’t a big deal with archaeological remains generally (if you use grave goods to bias your opinion about sex, it’s not the end of the world), but it gets more complicated when you are talking about forensic evidence admissible in a court of law. I’ve worked a few forensic cases as an assistant, and for most of them, I knew the context of the remains; that is, I knew the demographic details of the person the police thought they’d found. I do worry about cognitive bias, as the authors call it, but they did not design a robust study to test for it because it’s impossible to separate correlation and causation in their methodology.
Did you have similar concerns?
AA: I did. I was really surprised that the exact methods used weren’t stated in the paper. As someone who is used to controlling for bias in my own research, I am all too aware that certain methodologies are either more or less reliable than others and some can be difficult to apply if you are unfamiliar with them (even if you have handouts – such as those provided to participants in this study). If you tried to publish a similar paper in a more bioarchaeology-related journal, without stating which methods you used, I am pretty sure it wouldn’t pass review. It would be very difficult to replicate this study without knowing this information (although you could of course conduct a similar one).
In addition to the points you’ve mentioned there were some other aspects of the study methodology I am curious about:
- Participants were from different fields. In the UK, osteology and forensic anthropology are different disciplines. We learn the same information, but the aims of our investigations are very different. I wonder if all of the participants would have been considered qualified to stand as an expert-witness in court, given the main aim of this study?
- DNA analysis of sex is a unique biasing factor. I was curious about the decision to include DNA results with sex-specific markers (the authors used the word gender, but we won’t get into that here). It is very rare that you see DNA used to determine sex in archaeological material – and it would always be done after the skeletal assessment. Using both precise methods (like DNA analysis) and less precise methods (like skeletal markers) to form the false contexts would likely have provided different levels of bias for different aspects of this study.
- Could participants answer probable male/female? The archaeological skeleton in this study was recorded in the database as ‘probable female’. The authors stated they are especially concerned about bias in ambiguous cases – however, as far as I can tell, participants only had the option to state at the end of their assessment whether the individual was: male, female, undetermined – and not probable male/female.
Given the issues I had with the methodology of this study, I really wasn’t surprised to see many of the results. But, like you, I’m not sure the interpretations drawn from them are necessarily that robust. Is it worrying that papers like this, which are interesting but appear to be flawed, may have an impact on the admissibility of biological anthropology in court cases?
Speaking of the results, shall we give a quick run-down of what they found?
KK: There’s a whole mess of chi-squaring and some terrible bar graphs going on, but sure. Here goes, keeping in mind the skeleton is archaeological (from England, so most like today’s Caucasian population), probably female, and late 30s to early 40s:
- Sex (Probably Female)
- Group A (male context) – 71% said male, while 29% said female
- Group B (female context) – 100% said female
- Group C (no context) – 30% said male, while 70% said female
- Age-at-Death (36-45)
- Group A (25-30 context) – 0 at 18-25, 11 at 26-35, 3 at 36-45, 0 at over 46
- Group B (50-55 context) – 0 at 18-25, 1 at 26-35, 7 at 36-45, and 5 at over 46
- Group C (no context) – 1 assessed 18-25, 6 at 26-35, 5 at 36-45, and 1 at over 46
- Ancestry (Caucasian)
- Group A (Caucasian context) – 100% got it right
- Group B (Asian context) – 50% got it right, 29% said Asian, and 21% didn’t know
- Group C (no context) – 100% got it right
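For anyone who wants to poke at the chi-squaring themselves, here is a minimal sketch in Python that runs a chi-square test of independence on the age-at-death counts exactly as listed above. It is an illustration only: the paper does not say precisely how the authors set up their tests, so the table layout and the function name `chi_square_independence` are assumptions for this example, not a reproduction of their analysis.

```python
# Age-at-death counts as listed above.
# Rows: groups A, B, C. Columns: 18-25, 26-35, 36-45, over 46.
observed = {
    "A (25-30 context)": [0, 11, 3, 0],
    "B (50-55 context)": [0, 1, 7, 5],
    "C (no context)": [1, 6, 5, 1],
}


def chi_square_independence(table):
    """Return (statistic, degrees of freedom, smallest expected count).

    Hypothetical helper for this post: a plain Pearson chi-square test of
    independence, with no continuity correction.
    """
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    smallest_expected = float("inf")
    for r, row in enumerate(rows):
        for c, count in enumerate(row):
            # Expected count if group and age category were unrelated.
            expected = row_totals[r] * col_totals[c] / grand_total
            smallest_expected = min(smallest_expected, expected)
            statistic += (count - expected) ** 2 / expected

    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof, smallest_expected


statistic, dof, smallest_expected = chi_square_independence(observed)
print(f"chi-square = {statistic:.2f}, degrees of freedom = {dof}")
print(f"smallest expected count = {smallest_expected:.3f}")
```

With only about 40 participants spread over 12 cells, several expected counts fall well below 5, which is the usual rule of thumb for trusting the chi-square approximation. Any p-value read off this table should therefore be treated with caution.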
On my initial read, I wondered why ancestry had such high numbers; the vast majority of the people in the fake-context Group B either got the ancestry right or said it was indeterminable. But when Alison pointed out that the fake sex context included DNA, it started to make more sense. As for the age-at-death, the authors kind of punt on explaining this, choosing instead to refer the reader to a terrible graph (people: if you’re going to do color graphs, pleeeeease change the texture of each color so that people who print them out or who are colorblind can tell them apart! Also, spell-check your axes…). I think the weirdness here may have to do with the non-standard (for the US anyway) age ranges, the odd precision of the ranges, and the context given. But I’m more interested in hearing what you have to say, Alison, since you know way more about age-at-death methodologies and their biases than I do.
AA: Ha, so I’m not the only one with an editing hat who caught that graph-error. Blatant.
With regards to ageing, it’s really difficult to say much without knowing which methods were actually applied (have I said that enough yet?). The paper says that the participants were given access to visual aids for ‘the majority of all non-metric assessments available for each stage [meaning age, sex, ancestry]’, which were taken from Museum of London (MOL) documents – however, does this mean the participants were ageing using the pubic symphysis, cranial sutures, the auricular surface, tooth wear, maybe rib ends…? Many of these methods are accurate (and many are not), but they all have their issues.
When it comes to the age categories, the ones presented are pretty narrow (at 10-year intervals) but this is not unusual. There are no standard age categories – different institutions will use different age categories. This study says theirs were ‘adapted’ from the MOL ones. Personally, I’d be okay with using narrower categories if they were divided a bit differently, like: 15-19, 20-24 (because younger adults can be aged a bit more accurately) and then maybe 25-34, 35-49, 50+. But sometimes you see it as broad as: Young Adult (20-34), Middle Adult (35-49), Old Adult (50+).
That older individuals are consistently under-aged in osteology is a well-known fact, and it is something osteologists have been trying to solve for a very long time, so I’m not surprised this paper found results indicating this was the case.
I have to stop and ask at this point – am I the only one wishing they’d published the results of how confident the participants were with their final assessments? I mean they bothered to ask this as a part of their ‘ruse’ (perhaps not the most fair word, but I’m using it), so why not share the results?
KK: No, you’re not alone. I was especially interested because the authors gave a forensic scenario but didn’t specifically note if this was a “court of law” type of scenario. If I were assessing demographics for an archaeological case, I actually think I would consider myself more confident than if I were assessing a forensic case for a murder trial. The methods and their shortcomings are the same, but forensic anthropology is a much more practical, applied use of the techniques–and can literally be a matter of life and death.
And that makes me wonder about the broader implications of this article. If osteologists can be biased by (false) context, will we see a change in the admissibility of forensic evidence in court? In the U.S., we already have the Daubert Standard (and facial reconstruction is one of the forensic techniques that does not meet it), but I think what counts as “scientific knowledge” and “scientific method” could be open to interpretation, especially if osteological techniques can be biased by external factors. Thoughts?
AA: I definitely think that it is something that needs to be considered – especially in forensic cases. The authors’ final conclusion, that context can bias results, is almost certainly true (at least in part). But I do think that there are steps professionals can take to mitigate these effects – first and foremost, of course, is being aware of this risk (something students are routinely taught). The authors state that more research is needed before recommendations can be made – but surely they could have at least made some suggestions.* Without them, the study appears to have very damning conclusions for the use of osteology in courts.
I am worried that this study and others like it could potentially be used in a legal scenario to cast doubt on the results of a skeletal assessment – when, as we’ve just discussed, there are some pretty big issues with the study. Provided the expert witness in court gives details of the methods they use in their assessment, provides evidence to support their reliability, and presents their results to the degree that their confidence allows, then I can’t see a problem with continuing to use them.** Every science has its issues, but being aware of them can make all the difference.
I think if more studies of this kind are conducted and recommendations put in place to account for that dreaded ‘cognitive bias’, then it will only make our discipline stronger. It will also benefit archaeology – imagine no more females with swords being mistaken for their male relatives!
KK: My final thoughts on this study are similar: the authors seem to think they’re the first to have considered this. And perhaps they are, from the specific angle of cognitive bias, but osteological specialists have been doing replicability studies for decades. To cite but one recent example, Ashley Smith and Amelia Boaks have a forthcoming article in the Journal of Forensic Sciences about validating postcranial landmark locations. Why something as seemingly mundane as the locations from which we measure bone? Because measurements are used in stature estimations, sex estimations (e.g., femoral head diameter), and ancestry estimations (e.g., FORDISC). Smith and Boaks found high consistency in some measurements, but those percentages still ranged only from 55-62.3%, with differences of tens of millimeters! They discuss these findings in light of Daubert and the need for better standardization.
There are dozens of other references like these, which all lead to osteologists creating better handbooks and better standardization. The revolution in digitization, I think, should obviate some of the inter-observer errors, but this is slow in coming to anthropology. My point is that osteologists themselves are doing this kind of analysis, and it’s odd for outsiders to devise a poorly-created study to do it. Not that there isn’t a reason for outsiders to try, especially when it’s related to cognitive bias, but these authors just didn’t succeed.
Alison, any final comments? I’ll leave the last word to you.
AA: Well, I was going to sum up with this: We absolutely need to address biases in the application of various methodologies in skeletal assessments. In order to do this, we must first understand these biases – and determine the factors that impact their effect on results. While I am not convinced this paper has done a great deal to assist us in doing either – understanding or addressing these biases – I will be very interested in future studies that do. So let’s hop to it, osteologists…
But instead I think I’ll go with: Please make sure you include all of the details of your study methodology in your published papers. Sincerely, an academic osteologist.
Buikstra, J. and D. Ubelaker. 1994. Standards for Data Collection from Human Skeletal Remains. Arkansas Archeological Society.
Nakhaeizadeha, S., I.E. Dror, and R.M. Morgan. 2014. Cognitive bias in forensic anthropology: Visual assessment of skeletal remains is susceptible to confirmation bias. Science and Justice 54(3): 208-214. DOI: 10.1016/j.scijus.2013.11.003.
Smith, A.C. and A. Boaks. In press. How “standardized” is standardized? A validation of postcranial landmark locations. Journal of Forensic Sciences.
Thanks to Ashley Smith for sharing a pre-publication version of the JFS article.
*For example: Conducting skeletal assessments ‘blind’ (not being provided with any contextual information apart from the period the remains date from – as this can influence the reliability of certain methods).
**For example: Based on the non-metric analysis of the os coxae, which showed good preservation allowing for a full assessment, these methods of scoring these skeletal indicators (sub-pubic angle, sciatic notch, etc.) were used. Peer-reviewed studies have shown these methods to be up to 98% accurate in correctly identifying the biological sex of an individual (provide relevant citations). I estimate, based on the results of these methods, that this individual is a probable female. This suggests, given the different categories of classification of biological sex within an osteological framework, that it is most likely these skeletal remains are from a female individual. BAM!