The integration of large language models in Norwegian healthcare

The Norwegian healthcare sector views Artificial Intelligence (AI) as a key element in ensuring sustainable development, aiming to maintain service quality, reduce waiting times, and meet the demands of an aging population. The national strategy, guided by the Joint AI Plan, prioritizes the medically appropriate, ethical, and safe use of these technologies to build trust among both healthcare professionals and patients. Large language models (LLMs) are poised to significantly impact clinical practices, particularly through the vast amounts of textual data found in patient records, clinical guidelines, and communications.   

LLMs in Norwegian healthcare: Patient journal data and speech-to-text

Applications in patient journal data

The high volume of unstructured medical text within electronic patient journals (EPJs) provides fertile ground for LLM applications, which promise to enhance diagnostic accuracy, improve patient–provider communication, and streamline complex clinical workflows.

Specific applications include:

  • Clinical documentation: Automated generation of clinical notes from patient conversations, standardization of existing medical notes for improved Natural Language Processing (NLP), and summarizing complex medical texts (a minimal sketch follows this list).   
  • Data utility: Organization of clinical data to improve accessibility and utility for research and practice.   
  • Administrative efficiency: Assistance with coding, billing processes, and reducing the cognitive load related to detailed classification systems like ICD-11.   
  • Reducing burnout: By streamlining documentation processes, LLMs can help reduce clinician burnout, allowing providers to dedicate more time to direct patient interaction.   
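
To make the documentation use case concrete, the following is a minimal sketch of drafting a journal note from a consultation transcript. It assumes access to an OpenAI-compatible chat API; the model name and prompt are illustrative assumptions, and any real deployment would additionally have to satisfy GDPR and Norwegian health-data requirements.

```python
# Minimal sketch: drafting a journal note from a consultation transcript.
# The model name and prompt are illustrative assumptions, not a prescribed
# configuration; outputs are drafts for clinician review, never final records.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def draft_journal_note(transcript: str) -> str:
    """Summarize a physician-patient transcript into a draft journal note."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        temperature=0,   # deterministic output eases review and reproducibility
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a clinical documentation assistant. Summarize the "
                    "consultation below as a concise journal note in Norwegian. "
                    "Do not add information that is not in the transcript."
                ),
            },
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```

Fixing the temperature at zero and forbidding additions in the prompt only partially addresses reproducibility and hallucination risk; sign-off by the treating clinician remains mandatory.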

Speech-to-text integration in Norway

Norway has been recognized as a leader in the adoption of speech recognition (SR) technology within its healthcare infrastructure. This technology is essential for generating the initial textual data from physician-patient conversations, which then feeds into LLM processing pipelines.   
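
As an illustration of where such transcripts can originate, the sketch below transcribes a Norwegian recording with the open-source Whisper model. The tool choice is an assumption made purely for illustration, not a reference to the certified dictation systems actually deployed in Norwegian hospitals.

```python
# Illustrative speech-to-text step: transcribe a dictation or consultation
# recording to Norwegian text before it enters an LLM pipeline. Whisper is
# used only as a readily available open-source example (an assumption);
# clinical use requires a certified medical speech recognition system.
import whisper  # pip install openai-whisper

model = whisper.load_model("medium")  # illustrative size/accuracy trade-off
result = model.transcribe("consultation.wav", language="no")
print(result["text"])  # transcript fed into downstream LLM processing
```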

The successful integration of speech-based data capture has yielded measurable results, including:

  • Decreased document turnaround times: One hospital reported delivering 90% of all medical reports to referring physicians within seven days after introducing SR, significantly exceeding the national target.   
  • Cost savings: Speech recognition has been cited as an IT solution that generated measurable cost savings within the hospital sector.   

Furthermore, specialized AI transcription tools demonstrate industry-leading accuracy for Norwegian medical dictation. Some models report a word error rate (WER) as low as 3.1% on recognized language benchmarks, with other providers achieving a WER of 8.4%. This high level of accuracy is critical for ensuring the fidelity of the patient journal data used by LLMs.   
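
For context, WER counts the word-level substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of reference words, so a WER of 3.1% corresponds to roughly three erroneous words per hundred. A self-contained sketch (the example sentences are placeholders):

```python
# Word error rate (WER): substitutions + deletions + insertions needed to
# turn the hypothesis into the reference, divided by the reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("pasienten har diabetes type to",
          "pasienten har diabetes type tre"))  # 1 error / 5 words = 0.2
```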

The non-English challenge: Norwegian language performance

A key concern in deploying LLMs in the Norwegian health sector is the performance and reliability of models on a non-English, domain-specific language.

However, recent studies have shown promising results for advanced models like GPT-4, which demonstrated a «robust handling of the Norwegian medical language» when evaluated on full-scale medical multiple-choice exams and comprehensive student patient cases. In these educational assessments, the LLM’s performance aligned closely with the judgments of human experts. This indicates a strong potential for LLMs to handle complex Norwegian clinical vocabulary, although systematic research on state-of-the-art NLP methods for Scandinavian clinical text is still ongoing.   
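
The kind of exam-based evaluation described above can be sketched as follows. The ask_model() helper is hypothetical, standing in for whichever chat API is used, and the item format is an assumption; the studies themselves used full-scale Norwegian exam material.

```python
# Hedged sketch of exam-style LLM evaluation: accuracy on multiple-choice
# items against an answer key. ask_model() is a hypothetical stand-in for
# the chat API in use; items are dicts with a question, lettered options,
# and the correct letter.
from typing import Callable

def mcq_accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    correct = 0
    for item in items:
        prompt = (
            item["question"] + "\n"
            + "\n".join(f"{k}) {v}" for k, v in item["options"].items())
            + "\nSvar kun med bokstaven for riktig alternativ."  # "Answer with the letter only."
        )
        reply = ask_model(prompt).strip().upper()
        correct += reply[:1] == item["answer"]  # compare first letter to the key
    return correct / len(items)
```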

Comprehensive challenges and risks

The clinical integration of LLMs presents several critical risks that must be addressed to ensure patient safety and quality of care.

Technical and design limitations

  • Opacity and ‘black box’ issues: LLMs are highly complex, generating responses based on statistical patterns rather than actual understanding. This opacity makes it difficult to interpret and trust model outputs, a major barrier to adoption among clinicians.   
  • Hallucinations and incorrectness: Generated outputs can be non-reproducible, non-comprehensive, or outright incorrect or unsafe (hallucinations), with potentially severe consequences for patient safety.   
  • Data quality and bias: LLMs require enormous amounts of data. Poor data quality—due to incomplete, deficient, or non-systematically collected data in patient records—can affect performance. Misinformation in training data, even in small amounts, can lead to misleading medical responses. Data distortions can perpetuate bias, preventing patients from being treated equitably.   
  • Enterprise data challenges: Clinical enterprise data (EPJs) present unique challenges compared to general public datasets, often requiring enterprise-specific knowledge for integration and navigation.   

Regulatory, ethical, and privacy concerns

  • Legal framework: The deployment of AI systems must align with the Norwegian Medical Devices Act, the Health Personnel Act, and forthcoming regulations like the EU AI Act. The legal basis for using AI as a decision-support tool rests on the EU’s General Data Protection Regulation (GDPR) and supplementary Norwegian healthcare legislation.   
  • Data protection and secrecy: The processing of health data, which falls under special category personal data, is highly regulated. Using this data to train LLMs, especially third-party or ‘off-the-shelf’ models, carries a high risk of exposure if not formally agreed upon and authorized. Patients must receive general information that their data is being used to develop AI tools.   
  • Ethical and regulatory gaps: A significant gap remains in addressing the ethical, regulatory, and patient safety implications of clinically integrating LLMs.   

Implementation and governance barriers

Experience has shown that challenges often arise during the implementation phase of AI in healthcare rather than solely from the technology’s limitations. Successful implementation requires establishing new management systems, quality assurance mechanisms, and a robust organizational structure. Key barriers include:   

  • Trust and explainability: Clinicians must be able to interpret and trust model outputs, especially in high-stakes medical decision-making. Transparent and interpretable models are essential for fostering trust and ensuring accountability.   
  • Competence and resources: A lack of AI competence, together with concerns about security and cost, was a major worry ahead of the first public sector AI experiments. Successful implementation requires investment in the workforce and in data infrastructure.   

Quality validation and managing model dynamics

Validating LLM quality in clinical settings

Validating the quality of LLMs is complex due to the subjective nature of evaluating performance and the lack of standardization in model reporting. A robust validation framework must be established:   

  1. Standardized Benchmarks: Developing a standardized benchmark for testing LLMs in medicine is essential for comparing different models and versions over time. This requires standardization in evaluation methodologies and reporting requirements, demanding detailed documentation of training data and performance limitations.   
  2. Metrics: Evaluation should integrate automated metrics with human judgment (a combined sketch follows this list).
    • String-based metrics (e.g., BLEU, ROUGE) assess n-gram overlap between the output and reference text.   
    • Semantic similarity metrics (e.g., BERTScore) evaluate deeper semantic alignment, capturing paraphrased expressions.   
  3. Human Oversight: Ultimately, continuous refinement and human oversight remain crucial to ensure the effective and responsible integration of LLMs. Clinician involvement in comparing and validating LLM responses against human expert judgments is key to establishing reliability.   
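
As an illustration of how the string-based and semantic metrics above can be combined, here is a minimal sketch using the Hugging Face evaluate library; the tooling choice and the example sentences are assumptions made for illustration.

```python
# Minimal sketch: scoring a model output against a clinician-written
# reference with both n-gram and semantic metrics. Library choice and
# texts are illustrative assumptions.
import evaluate

prediction = ["Pasienten har fått påvist diabetes type 2."]
reference = ["Det er påvist diabetes mellitus type 2 hos pasienten."]

bleu = evaluate.load("bleu").compute(predictions=prediction,
                                     references=[reference])
rouge = evaluate.load("rouge").compute(predictions=prediction,
                                       references=reference)
bert = evaluate.load("bertscore").compute(predictions=prediction,
                                          references=reference,
                                          lang="no")  # multilingual model

print(bleu["bleu"], rouge["rougeL"], bert["f1"][0])
```

For a paraphrase pair like this, n-gram overlap is low while BERTScore stays high, which is precisely the gap between the two metric families that motivates reporting both alongside human judgment.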

Addressing model drift and dynamic updates

LLMs are highly dynamic, leading to the risk of «model degradation over time». Two related phenomena drive this: data drift, where the distribution of input data shifts, and concept drift, where the relationship between inputs and correct outputs changes; either causes the model’s performance to degrade on new, real-world data. This dynamic nature, driven by social changes or domain-specific updates, poses significant safety concerns in healthcare.   

To manage model dynamics and drift, a strategy of continuous monitoring is required:

  • Continuous Monitoring: Organizations must continuously monitor LLM drift, hallucination risk, and output latency in real time. This iterative process, integral to Machine Learning Operations (MLOps), involves comparing production data and model predictions against the original training data to quickly detect deviations (a minimal sketch follows this list).   
  • Regulatory Adaptation: Future regulatory frameworks are expected to incorporate mandatory requirements for continuous monitoring, performance thresholds for specific clinical applications, and mandatory reporting of adverse events.   
  • Mitigation: If drift is detected, strategies include immediate retraining, dynamic adaptation of the model, and maintaining a «human in the loop» for critical decisions.
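
The comparison described under Continuous Monitoring can be sketched as a simple two-sample distribution test; the monitored feature (here, note length), the synthetic data, and the alert threshold are illustrative assumptions, and production MLOps pipelines would track many such signals.

```python
# Minimal drift-monitoring sketch: compare a production feature distribution
# (here, journal note length) against the training baseline with a two-sample
# Kolmogorov-Smirnov test. Feature, data, and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_lengths = rng.normal(loc=220, scale=40, size=5_000)    # baseline
production_lengths = rng.normal(loc=250, scale=40, size=1_000)  # drifted

stat, p_value = ks_2samp(training_lengths, production_lengths)
if p_value < 0.01:  # illustrative alert threshold
    print(f"Drift alert: KS={stat:.3f}, p={p_value:.2e} -> "
          "trigger review or retraining with a human in the loop")
```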