Child Abuse, AI, and the Forensic Interview
- Show Notes
- Transcript
In this episode of ‘One in Ten,’ host Teresa Huizar speaks with Liisa Järvilehto, a psychologist and Ph.D. candidate at the University of Helsinki, about the positive uses of AI in child abuse investigations and forensic interviews. The conversation addresses the common misuse of AI and explores its potential in assisting professionals by proposing hypotheses, generating question sets, and more. The discussion delves into the application of large language models (LLMs) in generating alternative hypotheses and the nuances of using these tools to avoid confirmation bias in interviews. Huizar and Järvilehto also touch on the practical implications for current practitioners and future research directions.
Time Stamps:
00:00 Introduction to the Episode
00:22 Exploring AI in Child Abuse Investigations
01:06 Introducing Liisa Järvilehto and Her Research
01:48 Challenges in Child Abuse Investigations
04:24 The Role of Large Language Models
06:28 Addressing Bias in Investigations
09:13 Hypothesis Testing in Forensic Interviews
12:18 Study Design and Findings
25:54 Implications for Practitioners
33:41 Future Research Directions
36:49 Conclusion and Final Thoughts
Resources:
Teresa Huizar: Hi, I’m Teresa Huizar, your host of One in Ten. In today’s episode, Child Abuse, AI, and the Forensic Interview, I speak with researcher Liisa Järvilehto, psychologist and PhD candidate at the University of Helsinki. Now, when we think of AI in our work, unfortunately our first thoughts often go to its terrible misuse, the creation of child sexual abuse materials, or chatbots run amok.
But what if AI could actually help us in our work? What if it could help us explore alternate theories of the case or create more effective hypothesis testing, or even perhaps create better question sets for forensic interviewers? Does that sound a little sci-fi to you? Well, not anymore. As you’ll hear, researchers set out to explore the outer bounds of the positive use of AI in child sexual abuse cases. I know you’ll be as intrigued as I was by what they found. Please take a listen.
Hi Liisa, welcome to One in Ten.
Liisa Järvilehto: Thank you so much. I’m so glad to be part of this interesting conversation.
TH: It is gonna be interesting, because I was delighted to see the piece of research that we’re here to talk about today. It was really intriguing to me because it was the first time I had started thinking much about the intersection between AI in any form and investigative interviewing, or what we here in the States would call forensic interviewing. And so I’m just curious about how you came to that work. How did you come to the topic and start looking at that? And then we’ll kind of dive into the particular branch of it, in terms of large language models, that you chose to investigate further.
LJ: Well, I guess the idea came from the fact that in child abuse investigations, there’s often not so much evidence to go by at the beginning, or at least it’s more ambiguous perhaps, than in some other forms of crime. And the child’s own narrative about events is often a very crucial part of the evidence.
So we need to try to basically collect all relevant information and find out what has happened in the child’s life. And we need to try to avoid bias that could affect perhaps the way we end up collecting the evidence or the basically inferences that we make based on that. And to achieve that, you usually need a team.
Most recommendations around good practice in interviewing more and more stress the fact that you need some multi-professional team around the investigator to make sure that everything is being considered before and after the investigation. However, that’s not always possible.
I guess lack of resources is something that happens in many places and often the investigators do work, if not alone, then without the support of this kind of ideal team. And we were thinking like if perhaps large language models and AI could help to maybe be some kind of a tool that would, in some situations even replace this kind of brainstorming that a multi-professional team could offer.
And that’s basically how the idea came about.
TH: Well, and I, you know, just for our listeners who haven’t read the paper, and we will link your research paper on the One in Ten website to this podcast episode, and I encourage people to go and read it in depth. But I think that one of the things that we should just say upfront is that the paper isn’t positing that somehow we use AI to replace interviewers. To the contrary, what you’re really talking about, and I appreciate the way that you explained it, is that a big part of the work of a multidisciplinary team is around brainstorming, not just solutions to problems, but even what could this be that we’re looking at?
And so if you don’t have access to that team, what might help in its place. And so thinking about AI as kind of a brainstorming engine. You’ve used the term large language model, and again, just to kind of level set for our listeners who may not know all the ins and outs of the AI and the differences, talk a little bit for a moment if you would, about when you are using the term large language model, what are you describing?
LJ: So a large language model, LLM is basically the abbreviation, is essentially an AI system, artificial intelligence, that’s been trained on massive amounts of text. So it’s books, websites, articles, even conversations that people are having on forums, to learn patterns in language and how language works. And these models then, when you give them a prompt or a question, they generate a response by basically predicting what words should come next, based on all the training they have around how language works and how words usually go together. So basically, one could even think of it as some kind of an extremely well-read assistant that can write, summarize, and brainstorm language-based content for you.
And many people who say they haven’t heard about large language models, they know, for example, ChatGPT, which is one interface that somebody could use to access GPT-4, which is a large language model. That’s basically the large language model of OpenAI, which is one company that has these language models.
TH: It’s a, it’s a good point that sometimes people hear a term like that and they don’t re really recognize that they’re already using one, for example, because I think that we’re seeing more and more integration with things like Chat GPT, Microsoft Copilot, and all the sort of other versions of these, right?
Okay, so let’s talk a little bit about the bias issue, because you kind of alluded to that, and I really appreciated the lit review in the paper where you talked about this more, about the different types of bias that can arise even when people are not trying to be biased, right? When they’re actively working against it. This is something that can arise in any situation, in any helping profession really, but also in investigations specifically.
Can you talk a little bit about, first of all, what are some common forms of bias? Just giving some examples about how this could arise and secondly, and again, you went into this in the paper somewhat, what are strategies that are well known for trying to avoid bias in investigative interviews?
LJ: Yeah. There are many biases that can affect interviewers and investigators and all of us really. But I think the most important one that we try to avoid as best we can is confirmation bias, which is basically a very human tendency: once we have some kind of an idea of what might have happened, we tend to interpret further information that we get through that assumption. So we tend to focus on information that confirms our beliefs and maybe disregard or reinterpret other types of information. So in the context of a child abuse investigation, that could mean that if we have some idea that this is something that might have happened to this child, we’ll focus more on evidence that supports our initial idea and perhaps not even pay attention.
Or somehow misinterpret other types of information. And we see this happen in many of these highly publicized cases where there are, for example, wrongful convictions. There’s usually some form of confirmation bias going on. So the investigation starts to go in a certain direction too early, without considering other plausible explanations, just because we have some kind of idea in our heads about what might’ve happened.
TH: So there’s strategies that forensic interviewers employ to try to avoid this kind of confirmation bias, because as you say, it’s something that we already know from the literature to look out for. And to pay attention to the way that this can arise specifically when interviewing kids around child abuse, can you talk a little bit about what some of those strategies are?
Because one of them in particular really ties to your use of AI.
LJ: So we know that different forms of structured decision making can be helpful. So just to make sure some kind of a checklist to make sure we’ve considered, for example, everything that we know is relevant so we don’t forget and don’t start to think about only certain aspects, for example.
Then one kind of important idea is that you should have an open mind. So you should go into an interview with an open mind and see what the child has to say. But this is, I think, more easily said than done when it comes to actually formulating questions and follow-ups to the child. I suppose this is why many recommendations exist; for example, last year the European Association of Psychology and Law published this white paper, which is basically a set of recommendations around investigative interviewing. And they advocate for something that’s called the hypothesis testing approach, which is an interesting way to try to avoid confirmation bias, because it’s both easier than it seems to some, but also, I would say, more complicated than it sounds to others. So it’s both quite easy, but also quite tricky to do in practice. The idea is basically, well, I would say that hypothesis testing is kind of like a disciplined curiosity, and it’s basically making sure that you can see all plausible, sensible explanations there.
Sometimes it’s called considering different lines of investigation; in medicine, the same idea is known as differential diagnosis.
TH: It’s not exclusive to forensic interviewing and investigative interviewing, the general idea, but sort of embodying it in practice I think is something that is common. It’s interesting to hear you talk about the way in which guidelines or guidance have been set forth in Europe about this, because in the US, for forensic interviewers, embedded in many, many protocols, I would say any that are nationally recognized, is this idea of hypothesis testing. And to your point, it’s really about, when a child comes in, being able to consider that one alternative is that they’ve been abused, but another might be that they were confused, that they were coerced, that they were coached, or that it never happened at all.
You know, really looking at the range of possible explanations. To your point though, and I appreciate you pointing this out. Even if you hold that in your mind, it is challenging sometimes to formulate questions that really, again, cover this full range of the hypotheses that you’re holding in your head and to do that on the fly, it’s really a challenge. So let’s talk a little bit for a moment then about what large language models and what AI in general bring to bear on that question, because that’s really your study, your research design was all really around that. And so can you just talk a little bit about how you set it up and what you were trying to get it to do?
LJ: Mm-hmm. Yeah, so basically the idea is that before you start an interview, you should plan it. You know, and probably many of your listeners know, that we use these kinds of protocols that have some structure, but you’re supposed to plan each interview on a case-by-case basis. And part of that planning, preferably, is that before the interview you should be asking yourself, well, like you said, what different stories could basically explain what we are seeing? What could explain the fact that somebody is concerned that this child might have been abused? And you need that before you start collecting the evidence, which is also what the interview is about.
And basically you, either alone or hopefully with the help of a team, need to explore all these options that you then cover in your interview. And we wanted to see if large language models could actually produce good alternative explanations to child abuse vignettes, which are basically these small descriptions of cases where somebody, for some reason, is concerned about the child, and then the investigator, before they actually start planning their questions, has to come up with different ways of explaining these concerns about that particular child. And we wanted to compare large language models to actual well-trained investigators and then just some lay people, basically, and see if these models could actually produce hypotheses that would be of good quality, that could then be helpful in actually planning these interviews. But the tricky thing is that we have some information on the importance of hypothesis testing, but we don’t really have that much guidance around what a good hypothesis looks like. So when you are trying to brainstorm these alternative explanations, how do you know that the things that you come up with are actually helpful and good? So we also tried to propose one set of criteria that could then be used to judge the hypotheses that the large language models and our experts and lay people are proposing.
And that’s, I think, one interesting part of the paper, to actually try to hopefully invite other researchers to think about what it actually is that we want to achieve when we do this pre-interview hypothesis testing, or not testing, but formulation, and part of the testing is then the interview. But not only that; we also often have some alternative hypotheses that we cannot directly test in the child’s interview. But there’s just so much there.
Like, there are also so many misconceptions about hypothesis testing and what it actually looks like in an actual interview. And I think that’s the kind of next step that we all need to take together: to think more about, when we have these alternative explanations laid down, how do we transform them into actual interview questions that we ask, so that we have the best possible evidence while still maintaining rapport and really listening to the child.
TH: So my follow up question to that is, when you were thinking about this and planning the study, what were your hypotheses going into it? What were you expecting to find, and then what did you actually find?
LJ: Well, we were thinking that experts might be best at this perhaps, or that the language models actually might have quite a lot of knowledge about these child abuse phenomena based on everything they have read online.
So they could be good too. But we were expecting just regular psychologists and just lay people, so adults recruited online, to be significantly worse and maybe more prone to mostly consider hypotheses related to the fact that if somebody suspects that there’s been child abuse, there probably was some abuse and kind of focus maybe on that issue.
Even though we were giving them some guidelines that you always need to consider the alternatives too, and it’s not about not believing the child, it’s just making sure that you gather good quality evidence basically and make sure that everyone’s rights are being considered during the investigation. So we were expecting experts and large language models to be quite good at this and everyone else to do a bit worse.
And basically what we saw is that GPT-4, which is one of the large language models that we used, outperformed all the human groups, including experts, on quantity of hypotheses, specificity, and comprehensiveness, suggesting to us that AI could indeed be a valuable tool for helping investigators consider all these alternatives that they might otherwise miss if they were just thinking on their own.
And we also looked at the follow up questions that participants proposed to actually explore these hypotheses in the child’s interview, and our experts focused more on eliciting verifiable details about the incident and just really focusing on getting as much information from the child as possible about the incident that the child was describing.
Which is what the research literature also recommends. Whereas the LLMs and the naive participants tended to focus more on the mental states of the child and the offender, and family dynamics and things like that, which are harder to assess and more speculative, and not so much the topics that are most recommended.
So there’s an interesting kind of difference in moving from having these alternative explanations to actually, how do we try to either falsify these different explanations or confirm them in a child’s interview? So there’s a difference in strategies clearly, but when it comes to actually formulating these hypotheses, the large language models seem to do quite well.
TH: So if I’m remembering right from your paper, one of the interesting distinctions was that while large language models were excellent at coming up with many hypotheses, and comprehensive ones, specific ones, all of that, frankly, by the sheer number of hypotheses they produced, they also generated a lot of untestable ones.
Which I think is probably not surprising. Right. You know? In that just, if you were to have a brainstorming session, even among experts, the longer you went on, eventually you would get to a number of hypotheses that really are untestable, right? So some of it’s just that it’s done at greater speed with a large language model.
But I think it does point out kind of a cautionary note here, which is that it may be an excellent analytical tool, but as with anything with AI, you have to pay attention to what it’s generating, right? And use common sense about, you know, the usefulness of it, so that you’re not dealing with what in the US we call AI slop, just things that are produced by it that are unhelpful. But the other thing that I was wondering about, and I’m curious about, this is not in your paper, but I’m curious what you think about this. Right now, ChatGPT or any other large language model has very little access actually to forensic interviews, you know, the content of them.
We don’t necessarily want it to have a lot of access to them for that matter, or even to case files for CPS. You can imagine a world in which there’s an enclosed system in which a large language model is applied, one that would not have distributed outflow into the general public in the way that ChatGPT and other commercially available large language models do, but really could investigate a trove of interviews and case files.
And I’m curious about what you think that might generate that might be different in terms of, I just wonder if one of the reasons you found that while many hypotheses were generated by large language models, frankly the quality of those was mixed. Like what would be the ways to improve the quality of those so that you’re not getting flooded with a bunch of untestable hypotheses along with those that are actually valuable?
LJ: Well, that is a good question, and I’m not sure what it is that they should be reading or trained on.
TH: I know it’s tricky, right?
LJ: Yeah, it’s probably, because you’re right, one of the issues that large language models have when it comes to investigative interviewing is that indeed they don’t actually, even though they’ve read a lot and they’ve seen a lot of texts, they haven’t seen actual interviews, and we’ve actually encountered this problem with our other studies where we’ve tried to see how large language models actually do when it comes to question formulation in this specific context. And they did well on this hypothesis testing, but when it comes to formulating case specific good questions in this investigative interview context, they struggle a lot more.
In that skill, they might improve by reading interviews, for example, or interview transcripts. But when it comes to hypothesis testing, I suppose it’s some kind of, maybe combination of reading actual case files combined with some human feedback, because that’s one way that large language models are actually trained, that they come up with a response.
So for example, they come up with a set of hypotheses, and then if we would have investigators or these teams actually give them feedback on how relevant, how helpful these hypotheses actually were, then we would see how it works. But one thing that I really wanted to mention here, just for the sake of clarity, is that we don’t actually yet have solid proof that the hypothesis testing approach during interviews improves interview quality. We just don’t have studies that consider this question. I’d say that it makes sense to assume it would be helpful, but we haven’t actually looked at it. We don’t have studies with cases where hypotheses have been formulated, with the help of large language models or otherwise, where we would assess the quality of those hypotheses and then see if that leads to more open questions, more information generated by the child, or some other outcome that we would like to see happen. So it’s an assumption that it would be helpful, because if you don’t have this kind of idea in your head of these different explanations when you start interviewing, you might just not even ask those questions.
It makes sense to assume it’s helpful, but we don’t actually know. So I think that’s kind of the next thing that somebody should look at, if this truly is impactful.
TH: It’s interesting because I do think that you’re right that there are so many future research implications and opportunities that flow from this more introductory, really, and groundbreaking, I think, look at this issue. I’m curious about what you see in the nearer term as the implications of this for practitioners, for folks who are out there right this minute. You know, conducting investigations, interviewing kids. Is there anything that you would have a suggestion about at this point, or do you feel like it’s still too nascent to even have that?
LJ: I wouldn’t necessarily yet jump into applying AI in real cases, or at least that wouldn’t be my first step. And if one considers using AI, they should always consider their organization’s policies on the use of AI, and maybe use locally running models that don’t actually connect to the internet, and that kind of stuff.
But what I would first at least consider is maybe using these models in training. So when we are training investigative interviewers, or even maybe like child protective workers who talk to children when abuse is being suspected, then perhaps these models could be used to really try to kind of practice this brainstorming of thinking when you have an abuse suspicion to really use them to first maybe brainstorm some explanations on your own and then see if large language model can suggest something that would make sense, but that you haven’t thought about, and then maybe see how that would affect your interview planning. So how would you ask in an open way about all of these explanations that you and perhaps together with the large language model have actually produced.
And there are some misconceptions around this hypothesis testing approach. One is that you somehow don’t believe what the child is telling you, or that you should somehow confront the child directly with these alternative explanations that you have in your mind, and I don’t think that’s what you’re meant to do at all.
No, it’s more about really asking open-ended questions that create space for any of these alternative explanations to emerge based on the child’s responses. And you also have to see how the interview goes. So of course, if the child tells you explicitly about an event that’s been abusive, you don’t then ask them about some alternative explanation related to some kind of misunderstanding or something like that. Then you focus more on that. But as long as there’s ambiguity and you don’t have that much to go on, you’re asking open questions that give space for all of these possibilities. And even having them in your mind, I think, is already helpful.
So when the child tells you something, you don’t interpret their response and formulate a follow up question with only one possible explanation in your mind, basically. And I think we could maybe practice this with these large language models when we are in more of a like a training mode. So this is, I think, the kind of low hanging fruit.
TH: I think it’s an interesting thing that you’re bringing up, because often in training new forensic interviewers, not only are you working on trying to sort of embody this hypothesis testing approach and help people practice that, all that, but even deriving the scenarios for them to do that with.
The case scenarios themselves. That’s another place where, you know, your research really documents that large language models can be helpful in generating those scenarios. So I think that there are, as you’re saying, uses for training purposes, with all the caveats, and I appreciated you providing them, about making sure you’re following agency policy and that you’re in closed systems, so that you’re not putting things out on the internet.
But I think it’s interesting how that might not only help practice, but really help speed training along, so that people are able to do their work well, faster than perhaps, you know, with some of the current training methodologies, the current ways that that’s done. I would love to see this applied to brainstorming around corroborating evidence, because this, I think, is where teams get stuck too sometimes, or at least individual investigators. Yes, often you primarily have a child’s statement, but it’s been interesting to me, having been on teams in the past myself, that team members often can bring up ideas, and do bring up ideas, about even small things that are possible things to look at for corroboration.
So not thinking about, like, there’s another witness out there, although there could be, but things like: did the kid describe a spot on the ceiling they stared at while it was happening to them? Does that spot actually exist in that room? Lots of other things like that. And I’m curious about whether large language models, in sifting through the kinds of narratives that they have access to, which are often media reports and other things that do sometimes have, at least in the US, a shocking amount of detail in them about cases.
I wonder if they could make suggestions about possible strategies and possible corroborating evidence that might be available to look for.
LJ: I think that is an excellent question and since we didn’t look at that, I of course cannot answer. Apart from saying that, my guess would be based on what I’ve seen, that they are good at that.
I think they could actually help, again, brainstorm possibilities and ideas, to really, yes, exactly, look at the evidence collected so far and the child’s narrative, and see what we could actually check one way or another. And another thing I am thinking is that what we humans don’t find very natural is trying to falsify hypotheses.
So we try to look for ways of confirming one or many of our alternative explanations, but we are not as good at coming up with ways of how we can disconfirm or exclude one of these ideas that we have. So what type of information could we collect to rule out one or more of these alternative hypotheses, so that we are left with maybe just one, or fewer at least? That is another thing that they could perhaps help us do. So these things that we find not very easy, intuitively, to do, especially when we are in a hurry. Yeah, they could perhaps be helpful.
TH: It’s interesting because as you were describing that, I was thinking this is differential diagnosis, right? What you were actually describing, and that’s really interesting to think about that has lots of application in the type of work in child abuse cases and thinking about the way that it can produce.
I mean, again, you know, we’re not suggesting this replaces humans, but as an analytical aid essentially. Which I think is how you described it in your paper. So where does the research take you next? I’m just curious about what’s next on your research agenda.
LJ: Well, I’ve been doing investigative interviews myself since 2009, so quite a while now, but I only started my PhD a few years ago, and I’m gonna be done next year. But the overarching goal, what we are trying to do, is to see if we could actually at some point build a tool that could help investigators during actual live investigations, so interviews. So we are testing different skills that large language models would need to have in order to be a useful tool in this context.
So we tested hypothesis testing. We’ve tested question formulation. We’ve tested some skills related to rapport, and we’ll see, because so far hypothesis testing is actually where they perform best. But now we are trying to see if these other skills can be enhanced. And I would really like at some point to continue to explore this hypothesis theme, because I think we really need this, because most of the literature on hypotheses now comes from a place where everything, like the evidence, has already been collected and the narrative exists, and then we try to evaluate that.
But we really need more information on how to do it before the interview. And the NICHD protocol, for example, I think is written mostly for certain cases. Like, if you follow it as it is, it works best for quite a recent case. So basically something just happened, and then you can indeed walk in with just an open mind and ask, well, tell me what happened last weekend.
But most of the cases, at least in Finland, and I’m afraid worldwide too, are older kind of historic cases in a sense that something happened a few weeks ago, a few months ago, sometimes even a few years ago. So you have to guide the conversation. So I would like to, for example, next, explore these transitional questions.
So how do we approach the substantive phase? Like when we go from kind of introductions, ground rules, everything like that. How do we start talking about the suspicion in a way that we have all of these hypotheses in mind, that we don’t kind of start to focus only on one too early in the investigation. So that’s also something I would like to maybe explore, but there’s just so much to do.
So we’ll see. And I hope others join us.
TH: Oh, I do too. And I hope that you know, you never know, one of our listeners may well be a researcher who would love to partner and further explore this with you. And I also just really appreciate the fact that this is work that you have done yourself for a long time.
Because I think that when practice is informing research, it’s just even better. It’s critically important. So lemme ask you as we kind of close this out, is there anything else I should have asked you and didn’t, or anything else that you wanted to make sure that we talked about today?
LJ: I do think that if there is somebody listening who is interested in this research, I think we should focus, for example, on really looking at the five criteria that we proposed for assessing these pre-interview explanations. That list of criteria, which was basically the testability, consistency with evidence, believability, specificity, and objectivity of each hypothesis, plus the comprehensiveness of the entire hypothesis set, I think those really need some work.
So I hope that people will be interested in continuing.
TH: Wonderful. Well, as you continue to publish, and congratulations on nearly being done with your PhD work, that is always challenging, please do feel free to come back anytime. We’re just so grateful to you for sharing this fascinating study.
I just loved it the minute I saw it. I was like, oh, yes, we’ve gotta talk.
LJ: Thank you so much for inviting me. Yeah, it’s been really interesting talking to you. You have such wonderful insights, like I can hear it from your questions that you really can see the benefits.
TH: Come again, anytime, Liisa, thank you again.
Thanks for listening to One in Ten. If you like this episode, please share it with a friend or colleague. And for more information about this episode or any of our others, please visit our podcast website at oneintenpodcast.org.