LLMs and autism diagnosis

It's been many years since I saw a neuroimaging paper that told me anything clinically useful or that really improved my understanding of any typical psychiatric/psychological disorder that I can think of. Same goes for genetics for the most part, sad to say. Better analyses and more interesting designs tapping into behaviors seem vastly more informative and more promising for changing my understanding of the structure of psychopathology.

However. This paper, while quite technical and not directly measuring anything behavioral (well, apart from the written verbal behavior of clinicians, I suppose), is genuinely interesting and I think sheds actual light on the conceptual structures underpinning ASD as they are instantiated in clinical practice:


I am not an ASD specialist by any stretch of the imagination, of course, so I eagerly await the reactions of those of y'all who are on this board. Even (or especially) if you think this is rubbish and useless.

Also, while LLMs are definitely not going to replace clinicians in the near future, when someone asks "what good is AI to us mental health clinicians?", this is a paper I feel I can point to and say "that, it's good for doing that."
 
Kundu, S., et al. (2024). "Discovering the gene-brain-behavior link in autism via generative machine learning." Science Advances 10(24): eadl5307.
 
Here are unorganized thoughts after skim-reading the intro and discussion and actually reading the methods and results (it's a slow work day).

I don't know about you, but I really want to see some examples of these expert reports. I feel like much of clinical documentation in many large settings skews from fine down to absolute garbage. Even with best-estimate data, the amount and quality of the data the diagnostic team has to work with can be quite variable (also, how did the diagnostic team handle missing data?).

Not an ML expert by any means, but it looks like the high AUC for the semantic associations to DSM criteria was achieved by combining the sample overall. Looking at the supplement, though, model performance (AUC) drops when stratified by age, sex, and comorbidity. What would the model performance be with multiple comorbidities and health conditions in a specific child of a specific age and sex? That is the level of granularity I think is necessary for me to care about it as a clinician. Also, the fact that random sentences still generated a better-than-chance, albeit much lower, AUC makes me wonder how the model performs against random sentences at greater levels of granularity.
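To make concrete what I mean by stratified performance, here's a toy sketch (entirely made-up labels, scores, and column names; nothing from the paper, just scikit-learn's roc_auc_score computed overall and then within subgroups, which is roughly what the supplement reports):

```python
# Toy illustration (not the paper's code): overall vs. stratified AUC.
# Hypothetical DataFrame columns:
#   'asd_dx'   - true label (1 = evaluation ended in an ASD diagnosis)
#   'score'    - model's predicted probability for that evaluation
#   'age_band', 'sex', 'comorbidity' - demographic/clinical strata
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.DataFrame({
    "asd_dx":      [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    "score":       [0.91, 0.22, 0.78, 0.66, 0.35, 0.10, 0.55, 0.48, 0.83, 0.30, 0.61, 0.52],
    "age_band":    ["<6", "<6", "<6", "6-12", "6-12", "6-12", ">12", ">12", ">12", ">12", "<6", "6-12"],
    "sex":         ["M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F"],
    "comorbidity": ["none", "ADHD", "none", "none", "ADHD", "none", "ADHD", "ADHD", "none", "none", "ADHD", "none"],
})

# Pooled AUC over the whole sample
print("overall AUC:", roc_auc_score(df["asd_dx"], df["score"]))

# AUC within each stratum; small strata give noisy (and often lower) estimates
for col in ["age_band", "sex", "comorbidity"]:
    for level, grp in df.groupby(col):
        if grp["asd_dx"].nunique() == 2:  # AUC needs both classes present
            print(f"{col}={level}: AUC = {roc_auc_score(grp['asd_dx'], grp['score']):.2f}")
```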

They talked about this in the discussion, but diagnostic heterogeneity in autism is an ongoing issue as is the debate around severity vs. trait expressions of autism. I think connecting words/sentences to diagnostic criteria oversimplifies this problem.

I could see something like this eventually improving our diagnostic screening methods if the model is replicated with larger, more diverse samples. I personally can't stand the SRS-2 and would welcome any improvement.
 
Although not as granular in my analysis as the above post, I do see this type of study as being potentially helpful in another way. I am not sure exactly how to articulate it, but it has something to do with how language shapes our perception of diagnoses, and it is not easy to step outside of that. I have always emphasized the importance of social deficits in diagnosing "autism", but this finding makes me question that. One thought is that I might actually be picking up on other cues first and then looking for confirmation in an assessment of social deficits or reciprocal thinking. I think of one young adult in particular who had sensory sensitivities and repetitive behaviors that I did not think met criteria for autism because she was so socially adept and easily understood others' perspectives. Love to keep challenging my own clinical thinking.
 
Here are unorganized thoughts after skim-reading the intro and discussion and actually reading the methods and results (it's a slow work day).

I don't know about you, but I really want to see some examples of these expert reports. I feel like much of clinical documentation in many large settings skews from fine down to absolute garbage. Even with best-estimate data, the amount and quality of the data the diagnostic team has to work with can be quite variable (also, how did the diagnostic team handle missing data?).

These are reasonable points. However, I think the aim of this paper was to examine "how are these diagnoses made in actual clinical practice? Can we capture something about the clinical intuitions driving these decisions in actual existing evaluations rather than ideal evaluations?" The LLMs were not being trained to diagnose autism. They were being trained to predict whether or not a particular evaluation was going to end with an autism diagnosis.

Not an ML expert by any means, but it looks like the high AUC for the semantic associations to DSM criteria was achieved by combining the sample overall. Looking at the supplement, though, model performance (AUC) drops when stratified by age, sex, and comorbidity. What would the model performance be with multiple comorbidities and health conditions in a specific child of a specific age and sex? That is the level of granularity I think is necessary for me to care about it as a clinician. Also, the fact that random sentences still generated a better-than-chance, albeit much lower, AUC makes me wonder how the model performs against random sentences at greater levels of granularity.


I thought it was actually interesting that the sentences appearing in evaluations that did seem to have strong predictive value for an autism diagnosis were often not particularly close in semantic space to most of the DSM criteria. What I took away from the supplemental materials is that the random sentences were not that different in performance from a model attending to the cosine similarities between those most-attended sentences and the DSM criteria. That is, a model paying attention to the most-attended sentences in their successful classifier was NOT just recapitulating DSM criteria or finding sentences that were as close to DSM criteria in semantic space as possible.
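For anyone who hasn't lived in embedding-land, "close in semantic space" operationally means something like the sketch below. This assumes the sentence-transformers library and an arbitrary off-the-shelf embedding model (not necessarily what the authors used), with invented sentences: each report sentence and each DSM-style criterion becomes a vector, and "closeness" is the cosine similarity between those vectors.

```python
# Toy illustration (not the paper's pipeline): embed report sentences and
# paraphrased DSM-style criterion text, then compare them by cosine similarity.
# Assumes the sentence-transformers package; the model choice here is arbitrary.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical highly-attended report sentences and paraphrased criteria
report_sentences = [
    "He lined up his toy cars for most of the visit and became upset when they were moved.",
    "Mother reports he covers his ears and cries whenever the vacuum is running.",
]
dsm_criteria = [
    "Deficits in social-emotional reciprocity.",
    "Restricted, repetitive patterns of behavior, interests, or activities.",
]

sent_emb = model.encode(report_sentences)   # shape: (n_sentences, dim)
crit_emb = model.encode(dsm_criteria)       # shape: (n_criteria, dim)

# Cosine similarity = dot product of L2-normalized vectors
def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sims = normalize(sent_emb) @ normalize(crit_emb).T  # (n_sentences, n_criteria)
for sent, row in zip(report_sentences, sims):
    print(sent)
    for crit, s in zip(dsm_criteria, row):
        print(f"  cos_sim to '{crit}': {s:.2f}")
```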


I'm genuinely not seeing the performance drops you're mentioning in the supplemental materials. I do see a negative control they ran by including age, which suggested there was a difference in the semantic space of the most-attended sentences across the various reports, but one that seemed orthogonal to autism vs. non-autism.

Again, not a model designed to make a diagnosis. A model designed to try and figure out what clinicians are paying attention to (or at least, what they are writing down) when they make a diagnosis.

 
I thought it was actually interesting that the sentences appearing in evaluations that did seem to have strong predictive value for an autism diagnosis were often not particularly close in semantic space to most of the DSM criteria. What I took away from the supplemental materials is that the random sentences were not that different in performance from a model attending to the cosine similarities between those most-attended sentences and the DSM criteria. That is, a model paying attention to the most-attended sentences in their successful classifier was NOT just recapitulating DSM criteria or finding sentences that were as close to DSM criteria in semantic space as possible.

I'm not entirely sure what I wrote that gave you the impression that I understood the goal of the LLM to be diagnosing autism, aside from maybe thinking about it from the clinical perspective of what the potential practice utility of something like this could be. I personally wasn't blown away by the findings because it seems reasonable that clinicians have developed an extra-DSM shorthand for symptoms that can be used as justification for those symptoms. For instance, a clinician might write "helplessness and hopelessness", "avolition", and "anhedonia" in a list describing characteristics of a major depressive episode; those actual words do not appear in the criteria, but they are likely to signal to other clinicians that these are indicators of depression because they are in the nomological network of the depression construct. In other words, it's effectively generating a checklist of extra-DSM phenomenology to which these clinicians attended. In clinical work, I see it all the time, and I'm going to guess that I'm not alone. So these clinicians did this, but would other clinicians do the same? And what if there were multiple comorbidities, like, say, a comorbid tic disorder or ADHD? Would 'flapping' have the same meaning?


I'm genuinely not seeing the performance drops you're mentioning in the supplemental materials. I do see a negative control they ran by including age, which suggested there was a difference in the semantic space of the most-attended sentences across the various reports, but one that seemed orthogonal to autism vs. non-autism.

Tables S5-S7. They reference them in the text as well. I think they undersell the 'slightly lower' model performance, but I suppose it depends on the goal. Most statisticians I'm aware of consider an AUC >= .90 the standard for clinical practice. Granted, in the ML world, these numbers are still pretty good.

"See Table S3 for a more detailed breakdown of sex and age groups included in our cohort, and see Table S4 for a tabulation of secondary diagnoses present. Tables S5-S7 provide additional accuracy results for our model, stratified on the basis of these variables. For some demographic groups we see slightly degraded model prediction performance. This is likely to be attributable to a smaller number of training examples present in these strata, the inherent diagnostic complexity of these groups, or a combination of the two. For instance, this combination of factors is evidently present for the age bracket consisting of subjects over 12 years of age, resulting in slightly lower classification accuracy compared to other more populated age brackets in our cohort."
 
I'm not entirely sure what I wrote that gave you the impression that I understood the goal of the LLM to be diagnosing autism, aside from maybe thinking about it from the clinical perspective of what the potential practice utility of something like this could be. I personally wasn't blown away by the findings because it seems reasonable that clinicians have developed an extra-DSM shorthand for symptoms that can be used as justification for those symptoms. For instance, a clinician might write "helplessness and hopelessness", "avolition", and "anhedonia" in a list describing characteristics of a major depressive episode; those actual words do not appear in the criteria, but they are likely to signal to other clinicians that these are indicators of depression because they are in the nomological network of the depression construct. In other words, it's effectively generating a checklist of extra-DSM phenomenology to which these clinicians attended. In clinical work, I see it all the time, and I'm going to guess that I'm not alone. So these clinicians did this, but would other clinicians do the same? And what if there were multiple comorbidities, like, say, a comorbid tic disorder or ADHD? Would 'flapping' have the same meaning?


These are all excellent questions. I would be fascinated to see what happens if you were to take their trained model and just throw a corpus of evaluation reports from a different specialist referral center at it without a new training run. On the one hand, it makes sense that it's not generating anything that you're not already aware of from your clinical work. On the other hand, it is interesting that the extra-DSM phenomenology that clinicians are attending to can be derived directly from the reports themselves without any pre-labeling or built-in structure.
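To make the kind of experiment I'm picturing concrete, a purely illustrative sketch (invented report snippets and a generic TF-IDF + logistic regression stand-in, not the authors' actual model): train on one center's reports, then score a second center's reports with the model frozen.

```python
# Toy, self-contained sketch of the "different referral center" idea (not the
# authors' model): train a simple text classifier on reports from one center,
# then evaluate it, frozen, on reports from a second center. All report text
# and labels below are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

center_a_reports = [
    "Lines up toy cars; distressed by changes in routine; limited eye contact.",
    "Engaged in reciprocal pretend play; points to share interest with examiner.",
    "Echolalia noted; covers ears to household sounds; rigid insistence on sameness.",
    "Age-appropriate back-and-forth conversation; flexible play observed.",
]
center_a_dx = [1, 0, 1, 0]  # 1 = evaluation ended in an ASD diagnosis

center_b_reports = [
    "Repetitive hand movements when excited; restricted interest in train schedules.",
    "Warm reciprocal interaction; responds readily to name and shared attention.",
]
center_b_dx = [1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(center_a_reports, center_a_dx)                 # train only on center A

scores_b = clf.predict_proba(center_b_reports)[:, 1]   # frozen model, center B
print("Center-B AUC:", roc_auc_score(center_b_dx, scores_b))
```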


Tables S5-S7. They reference them in the text as well. I think they undersell the 'slightly lower' model performance, but I suppose it depends on the goal. Most statisticians I'm aware of consider an AUC >= .90 the standard for clinical practice. Granted, in the ML world, these numbers are still pretty good.


The pdf I was working off of only goes to S3, which explains why I had no idea what you were talking about. I'll have to look this over and digest it a bit. Thank you for clarifying!
 
Finally got a chance to read the article. Some thoughts:

-Of course the reports contain language specific to ASD criteria. Concerns for ASD are typically why these children are in an environment where things will be written about them! It's cool that an LLM can pick up on this, but my guess is that nobody actually involved with these children would be surprised by that. We suspect ASD, thus we are more likely to document specific symptoms of ASD in our clinical records.
-It's not terribly shocking that language related to restricted/repetitive behaviors occurs more in medical records. Hand flapping and lining up toy vehicles are clearly recognizable (and pretty much all children have hands to flap and toy vehicles to line up), and we have specific, limited, common-use language to describe these behaviors. Contrast that with the many different ways to describe "deficits with joint attention" in our colloquial language. The specific terminology (i.e., "deficits in joint attention") is clinical jargon relatively specific to developmental/clinical psych, thus less likely to occur in general medical records than various descriptions of the phenomenon.
-I have always conceptualized some of the restricted interests and repetitive behaviors as an extension of social communication deficits. If the child is not able to focus on the social aspects of the environment (e.g., reciprocal interactions and language; pretend play), then they spend that time focusing on the physical properties of toys. In regards to repetitive behaviors and "stimming," we know that people diagnosed with ASD will often (and often detrimentally so) try to hide/mask these behaviors due to social disapproval. It follows that there are likely some social reasons that individuals without ASD refrain from repetitive behaviors in social contexts.
-I think they are using the term "clinical intuition" overly broadly. This discounts the training medical professionals receive in relationship to ASD, and the formal and informal criterion-based measurement that is used to make the DX.
-What about differential or comorbid diagnoses? When I assess, the question isn't "autism or no autism," but rather "what best accounts for the noted delays and symptoms, and what should we do about it?"
-I'm pretty sure that I do better than 79.4% accuracy, but that's tough to measure!
-It'd be a really cool thing to have an accurate, quick, accessible, and affordable way to diagnose ASD early. Empirical evidence points to it being a difference in brain structures and functioning that is present at birth, and identifying it and providing supports and services to clients and families as early as possible should be the goal. There's frequently something new that comes along (saliva sample testing; eyeball anatomy; imaging), but these things turn out to be unreliable for some reason or other.
-I remember doing a project for my Artificial Neural Networks (ANN) course in grad school where I reviewed the "promising prospect" of using ANNs to diagnose Alzheimer's. That was ~1998-99. AFAIK, neuropsychs and neurologists have not been replaced in dementia dx.
-My takeaway from this is that if you're concerned enough about potential symptoms of ASD to have it end up on a clinical medical record somewhere, get that child assessed by someone who uses more than "clinical intuition" or an "LLM" to make an accurate differential diagnosis.
 
I have always conceptualized some of the restricted interests and repetitive behaviors as an extension of social communication deficits.
I've long said that I think the main difference between hyper-intense special interests in ASD and "normal" passionate interests is actually just knowing when to shut up about them and being able to do it, versus not being able to recognize and respond to the social norms and cues that situation X isn't an appropriate time to talk about Y or that person Z clearly doesn't care about Y, so you should switch the topic.
 
I've long said that I think the main difference between hyper-intense special interests in ASD and "normal" passionate interests is actually just knowing when to shut up about them and being able to do it, versus not being able to recognize and respond to the social norms and cues that situation X isn't an appropriate time to talk about Y or that person Z clearly doesn't care about Y, so you should switch the topic.
How would you differentiate "interests" from "repetitive thoughts"?

For example: Let's say I reminded you that Eiffel 65 produced a song called "Blue (Da Ba Dee)". Let's say you listened up, and got that song stuck in your head. You're not interested in that song, you don't particularly enjoy it, but you can't stop thinking about it. Is that a special interest?
 
For example: Let's say I reminded you that Eiffel 65 produced a song called "Blue (Da Ba Dee)". Let's say you listened up, and got that song stuck in your head. You're not interested in that song, you don't particularly enjoy it, but you can't stop thinking about it. Is that a special interest?

You've done research on Blue where you know its entire backstory and have trouble engaging socially with anyone who doesn't know or care to know what you know about Blue.
 
You've done research on Blue where you know its entire backstory and have trouble engaging socially with anyone who doesn't know or care to know what you know about Blue.
You've resorted to ad hominems instead of engaging in professional discourse. Don't worry about what that says about you.
 
I think they were giving an example of a special interest vs. a repetitive thought.
If that’s the case, I apologize.

However, the response still does not explain the difference between enjoyment of the subject matter combined with an impaired theory of mind vs. ego-dystonic repetition of thoughts combined with an impaired theory of mind.
 
How would you differentiate "interests" from "repetitive thoughts"?

For example: Let's say I reminded you that Eiffel 65 produced a song called "Blue (Da Ba Dee)". Let's say you listened up, and got that song stuck in your head. You're not interested in that song, you don't particularly enjoy it, but you can't stop thinking about it. Is that a special interest?
See the bolded parts. You may have answered your own question (though I do believe you aren't looking for a rational answer, but rather a debate/argument). In your later post, you talk about "enjoyment" of the subject matter of the intense interest and the "ego dystonia" of the inescapable Blue song; again, you have answered your initial question (unless you are referring to a person who finds thoughts that are unacceptable or repugnant to their ego to be reinforcing, and thus intentionally sings Blue in their head to access the reinforcing repugnancy AND excessively shares their thoughts about this song with others who are, in fact, not very interested in the topic; in that case, they are the same thing).

In your later post, you also introduce the concept of "impaired theory of mind," which you did not mention in your original question. In the case of the repetitive Blue song, I'd posit that that's a purely private event that does not really bring up theory of mind deficits (unless you're saying that the person believes that all others also have that song on repeat in their heads). In the case of the restricted interest, the lack of understanding that others have thoughts different from one's own (i.e., a theory of mind deficit) leads to the conclusion that all others are just as interested in the tank numbers of British steam trains as you are. In turn, this leads to excessive animated discussion of said tank numbers with people who are, in fact, not really interested.
 
Finally got a chance to read the article. Some thoughts:

-Of course the reports contain language specific to ASD criteria. Concerns for ASD are typically why these children are in an environment where things will be written about them! It's cool that an LLM can pick up on this, but my guess is that nobody actually involved with these children would be surprised by that. We suspect ASD, thus we are more likely to document specific symptoms of ASD in our clinical records.

Interestingly, the sentences that do seem to predict whether a diagnosis will happen don't simply recapitulate the diagnostic criteria. I agree with you that that would be boring.

-It's not terribly shocking that language related to restricted/repetitive behaviors occurs more in medical records. Hand flapping and lining up toy vehicles are clearly recognizable (and pretty much all children have hands to flap and toy vehicles to line up), and we have specific, limited, common-use language to describe these behaviors. Contrast that with the many different ways to describe "deficits with joint attention" in our colloquial language. The specific terminology (i.e., "deficits in joint attention") is clinical jargon relatively specific to developmental/clinical psych, thus less likely to occur in general medical records than various descriptions of the phenomenon.

I will say that the medical records in question are specifically the evaluation and/or referral reports that contain a substantial amount of observational description specifically evaluating the question of ASD, not general medical records. I am not sure "well, of course jargon specific to clinical psych won't occur" does that much lifting here.

-I think they are using the term "clinical intuition" overly broadly. This discounts the training medical professionals receive in relationship to ASD, and the formal and informal criterion-based measurement that is used to make the DX.
-What about differential or comorbid diagnoses? When I assess, the question isn't "autism or no autism," but rather "what best accounts for the noted delays and symptoms, and what should we do about it?"
-I'm pretty sure that I do better than 79.4% accuracy, but that's tough to measure!

Again, the model isn't really attempting to diagnose autism. It's trying to see whether it can predict whether or not the evaluators are going to diagnose autism based on their reports. This is a different question. Would replacing "clinical intuition" with "clinical judgement" or "clinical expertise" (as is done in the paper in several instances) feel more felicitous to you?

For what it's worth, as a philosophical question, I think "clinical intuition" fairly accurately captures what is actually happening when these diagnoses are made, guided of course by other forms of measurement and sources of information, but ultimately there is not a calculus for weighing the various factors an experienced clinician takes into account and coming to a decision. It is judgement/expertise/intuition at the end of the day, even if a very informed and well-trained one.

-It'd be a really cool thing to have an accurate, quick, accessible, and affordable way to diagnose ASD early. Empirical evidence points to it being a difference in brain structures and functioning that is present at birth, and identifying it and providing supports and services to clients and families as early as possible should be the goal. There's frequently something new that comes along (saliva sample testing; eyeball anatomy; imaging), but these things turn out to be unreliable for some reason or other.
The paper actually makes the point that these attempts to come up with clinically useful information from these other modalities have not borne much fruit, so why not look directly at what clinicians are doing and what information they are actually weighing?
 