
Patient data creating better artificial intelligence models

Medical research will not progress without data and its re-use. Collected data can be used to create artificial intelligence models that speed up diagnoses and support care decisions. New data analysis technologies emerge all the time, but how can data be made available to all researchers?


One of the strengths of the Genome Center, which is being established in Finland, consists of biobank databases. The Center would be in charge of developing a genome data register, that is, a centralised system for storing and managing genetic data. The aim is to create a high-quality database that describes genetic variation in Finland. Auria Biobank’s Director Lila Kallio believes that good cooperation between the biobanks and the Genome Center can lead to significant results in the screening of gene variants.

“Once the Genome Center has been established and starts operating, genome data created during research can also be stored in the Genome Center. The Genome Center could re-analyse any genome data against all accumulating reference genome data. This would enable, for example, the screening of newly identified, clinically significant variants by using previously produced and stored data,” says Lila Kallio.

The Biobank Act was enacted in Finland in 2013, enabling the establishment of biobanks. There are currently 11 biobanks in Finland. In 2020, the biobank network was joined by Arctic Biobank, which stores extensive population data collected by the University of Oulu in the north of Finland. Researchers in Finland can utilise material from all biobanks through the Fingenious online service. Fingenious is a digital tool through which a researcher can send a request for material to be made available to them. This service is provided by the Finnish Biobank Cooperative, FINBB.

“Biobanks store data about samples securely. Data about biobank samples is available to all researchers. A researcher must present a research plan for approval by the biobank steering groups or ethical committee. Biobanks have a process in place for the research use of samples and their related data.”

Finland has exceptionally comprehensive and high-quality health care data resources. The Act on the Secondary Use of Health and Social Data (552/2019) came into effect in 2019 in Finland. Secondary use of data means that customer and register data within social care and healthcare are used for a purpose other than the original one. This act on secondary use has also created pressure to amend the Biobank Act from 2013. The significance of data in biomedical research is increasing, and the legislation should create the conditions for both research and appropriate data security.

Secondary use obviously requires that data collected about people is managed securely. Identifier data of human samples stored in the biobanks is carefully protected.

“The biobanks remove all personal identifiers and replace them with pseudonym codes. When samples are handed over for research purposes, the pseudonyms are replaced with another code, specific to that particular research. The code key is stored in the biobank. If you need to access the original sample owing to, for example, some clinically significant detail, this can only be done with the code key,” says Kallio.

The use of a code key enables the data to be re-used for subsequent research purposes.

“If the sample was anonymised, that is, made impossible to identify, it would be impossible to access it after any findings in biobank research, and no sample-specific data could be added to it later.”
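The two-level coding Kallio describes can be illustrated with a minimal sketch. It is not the biobanks' actual implementation: the class `Biobank`, its method names, the study labels and the identifier `person-001` are all hypothetical, and random tokens stand in for whatever coding scheme the biobanks really use. The point it shows is that each study receives its own code, and only the holder of the code key can trace a code back to the sample.

```python
import secrets


class Biobank:
    """Toy model of two-level pseudonymisation with a code key kept in the biobank."""

    def __init__(self):
        # Code key: pseudonym -> personal identifier. It never leaves the biobank.
        self._code_key = {}
        # Study-specific codes: (study_id, research_code) -> pseudonym.
        self._study_codes = {}

    def register_sample(self, personal_id):
        """Replace the personal identifier with a random pseudonym."""
        pseudonym = secrets.token_hex(8)
        self._code_key[pseudonym] = personal_id
        return pseudonym

    def release_for_study(self, study_id, pseudonym):
        """Issue a fresh code that is specific to one study."""
        if pseudonym not in self._code_key:
            raise KeyError("unknown pseudonym")
        research_code = secrets.token_hex(8)
        self._study_codes[(study_id, research_code)] = pseudonym
        return research_code

    def resolve(self, study_id, research_code):
        """Trace a research code back to the sample; possible only with the code key."""
        pseudonym = self._study_codes[(study_id, research_code)]
        return pseudonym, self._code_key[pseudonym]


biobank = Biobank()
pseudonym = biobank.register_sample("person-001")  # made-up identifier
code_a = biobank.release_for_study("study-A", pseudonym)
code_b = biobank.release_for_study("study-B", pseudonym)
```

Researchers in the two hypothetical studies see only `code_a` or `code_b`, which cannot be linked to each other without the biobank. If the mappings were deleted, the sample would be anonymised in the sense of the quote above: `resolve` could no longer work, and no results could be appended to the sample.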

According to Lila Kallio, the real value of samples is in the data created from them.

“Data is created during diagnostics and treatment. Research also results in analysed data, which must be returned to the biobank in possession of the sample, to be appended to it. Biobanks manage not only identifier data but also clinical data and data that has been produced during research.”



Plans for various protection levels for data use


The act on the secondary use of health and social data concentrated the management of the permit process in Findata, a new legal authority. Lengthy processing times for permit applications have become a problem. All applicants are treated equally, regardless of the size of the material they are applying for.

Auria’s chief data officer (CDO) and adjunct professor of medical mathematics Arho Virkki points out that material can be used in a number of ways, and that’s why there should be different protection levels based on the purpose of use. According to Virkki, the initially planned data security improvement leap for the secondary use of health data is too great to be taken in one go.

“Extreme protection weakens data availability, which means that data security won’t reach an optimal level. To me, optimal data security means that material is truly available for scientific research, planning of new treatments and controlling treatment processes. In the optimal situation, data is available and at the same time it is adequately protected. The protection level should be set on the basis of the risks involved.”

Because data management is part of the work of doctors and nurses, Virkki says a balance should be found between material availability and its protection. At the moment, the system is lopsided.

“For example, exploring clinical data is part of medical students’ curriculum. One part of their training is to use operational systems to find data in order to learn.”

According to Virkki, the isolated data architecture has been the culprit for a long time. Owing to a defensive approach and regulation in medicine and healthcare, information architecture is more traditional compared to, for example, logistics and the financial sector. This is why various information systems are poorly integrated.

Virkki does admit, however, that hospitals are more complex places than logistics centres, for example. In logistics systems, postal packages follow a pre-defined route and their travel is easily recorded by the system, whereas when a patient arrives at a hospital, the following steps are typically more complex and involve a great number of different options.

The act on secondary use of data, however, makes the assumption that one type of secure data processing environment fits all uses. Virkki says that the legal entity issuing the research permit could provide a range of user environments based on the researchers’ needs.

“There could be a basic environment sufficient for simple data analysis that consists of spreadsheet software and the usual range of statistical programming languages.”

Then again, if researchers need a custom environment, they should be given the exact data security requirements and give assurances that they will comply with them.

“This way, the authorities would set the requirements for data security, but the researchers should be accountable for it, as has been the case up till now. At the end of the day, it is the researchers’ responsibility to ensure that their results are correct, honest, scientific and anonymous.”

According to Virkki, people in the medical field in Finland take great pride in their work and have always striven to process medical material in a proper fashion. Virkki says that data security can be ensured through licensing and training. Data security issues should be part of medical training. Virkki regularly lectures on data platforms and data security in an introductory course on the basics of clinical research at the University of Turku.



Secondary use of data lays foundation for AI use in medicine


According to Virkki, amendments are in progress for the act on secondary use of data. If the provisions can be made more flexible and the permit processes faster, there will be many opportunities for artificial intelligence research.

“Now that the reform of social care and healthcare has been passed in Finland, there is a good opportunity to combine the patient data of basic and specialised health care, that is, to view patient data as a single entity. This in turn will enable the development of new AI applications for the clinical side.”

The algorithms of AI models can perform text-based analyses, make medical records or learn to identify details in images that can be used in diagnoses.

“In fact, artificial intelligence is just modern statistics, a refined branch of mathematical statistics. AI models make use of complex statistical methods. When you talk of machine learning, you actually mean statistical learning. Today you can calculate predictions with such precision that it almost feels like magic.”

Virkki has been intrigued by AI models for a long time. In his doctoral thesis, he created an AI model of the human respiratory system during sleep. Recently he has been developing a prediction model for pulmonary embolism. The model is used as a tool for decision-making. Pulmonary embolism occurs when a blood clot gets wedged into an artery in the lungs. The most common symptom is sudden shortness of breath. In serious cases of pulmonary embolism, the clot is diluted by injecting an anticoagulant into a vein.

“If there is reason to suspect that a patient in an emergency room has pulmonary embolism, you have to act fast. A machine can quickly go through a set of scanned images and tell the radiologist where to focus in each image. After that, the decision is made whether to start diluting or not. If not, another treatment is chosen. You should be able to do all of the following in less than 10 minutes: lung imaging, diagnosis and starting the treatment.”

According to Virkki, the pulmonary embolism model was the first scientific test trying to solve a difficult problem with very little data. However, a more comprehensive and more accurate AI model is under development. Scientific publications and dissertations will be published on the subject.

“If realised, the model will speed up decision-making in case situations, but also assist in quality control. For example, we can screen afterwards whether we detected any smaller cases of pulmonary embolism.”

The development of artificial intelligence models requires a lot of data with which algorithms are taught, and plenty of computing power.

The hospital district of Finland Proper uses the ePouta cloud service of Finland’s ELIXIR centre’s CSC, with a dedicated 10 GiB connection. Virkki hopes researchers could have better access to the ELIXIR network.

“It would be great if researchers were given capacity from the ELIXIR infrastructure for their work. The data resource would be made available directly in the ELIXIR environment, and ELIXIR would ensure there was enough computing capacity.”

The ELIXIR Node in Finland (ELIXIR-FI) is hosted at the CSC – IT Center for Science Ltd. CSC operates resources and services that are part of ELIXIR, such as the pan-European ELIXIR identity and access infrastructure. In ELIXIR, the data needs to be managed as a federation, in which data providers work as a single infrastructure and provide mechanisms that let researchers bring their analysis to where the data is located. The ELIXIR Compute Platform infrastructure will allow life scientists to easily access, share and analyse data from different sources across Europe. The objective is to combine all components of the ELIXIR Compute services into a seamless workflow. A researcher may use the ELIXIR Authorisation and Authentication services to securely create a scientific software analysis environment, and use the environment to access large biological data resources stored in a cloud.


Text-based AI model



Text written or dictated by a doctor can be utilised in artificial intelligence models that support current care guidelines and diagnoses. Statements and sentences can be structured into data and used to teach the algorithm to make deductions. In a project involving Auria Biobank, Turku University Hospital and the University of Turku, artificial intelligence was taught to extract data about smoking from about 30,000 patient records. In the project, headed by researcher Antti Karlsson, a language model called ULMFiT was used. The model was trained using the analysis computers of Finland Proper Hospital District, making use of texts from Wikipedia. After this, the model was trained to become a classifier by means of some 5,000 manually annotated sentences. These days there are also more sophisticated, pre-trained models in Finnish, most notably perhaps FinBERT, based on the Google BERT model. This was created by a University of Turku research group led by Filip Ginter, using computing power from Finland’s ELIXIR centre CSC.
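The idea of turning annotated sentences into a classifier can be sketched in a few lines. The example below is not the project's method: it swaps ULMFiT for a much simpler bag-of-words naive Bayes classifier, and the English training sentences, the three labels and all function names are invented for illustration. It only shows the general pattern of learning smoking status from manually labelled sentences and then applying it to new text.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lower-case the sentence and split it on whitespace."""
    return sentence.lower().split()


def train(labelled_sentences):
    """Count words per label from (sentence, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for sentence, label in labelled_sentences:
        label_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(model, sentence):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a label.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(sentence):
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented stand-ins for manually annotated sentences from patient records.
annotated = [
    ("patient smokes ten cigarettes daily", "smoker"),
    ("smokes a pack a day", "smoker"),
    ("quit smoking five years ago", "ex-smoker"),
    ("stopped smoking in 2010", "ex-smoker"),
    ("has never smoked", "non-smoker"),
    ("never smoked tobacco", "non-smoker"),
]
model = train(annotated)
print(classify(model, "quit smoking last year"))
```

A pre-trained language model such as ULMFiT or FinBERT replaces the word counts with representations learned from large amounts of general text, which is why a few thousand annotated sentences can be enough for the final classification step.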

Based on the data collected by the artificial intelligence model, the study showed that quitting smoking, even once a cancer diagnosis has been made, may extend the patient’s life expectancy considerably.

“I’m sure that future patient record systems will not be formal, with items picked from drop-down menus, but written more in prose, with systems that automatically structure them,” says Karlsson.

“This also improves work efficiency. I don’t even want to imagine how difficult it must be for a busy doctor to enter complex matters into the systems.”

When you mine a large data mass, you save an awful lot of time and money. The artificial intelligence model trained by Antti Karlsson analysed patient records, searching for smoking-related issues. In the study, the model analysed text data obtained from the records of 30,000 patients. According to Karlsson, models like this can produce analyses that are more than 90% accurate, in hours or even minutes. It’s quite different from manually reading through the texts of 30,000 patients and entering variables into a table.

“In the best-case scenario, such models could be readily available in a data lake, structuring this tobacco data automatically for research purposes,” says Karlsson.

The model does not produce treatment instructions for an individual patient, but creates a good overall picture.

“I believe that, at least initially, the automated systems of the future will collect data relevant to reporting and research, while the really important things, such as dosages and allergies, must still be checked and filled in manually by experts.”

Ari Turunen




For more information:

Karlsson et al. (2021): Impact of deep learning-determined smoking status on mortality of cancer patients: never too late to quit. Esmo Open Cancer Horizons. Vol 3. Issue 3.




CSC – IT Center for Science

is a non-profit, state-owned company administered by the Ministry of Education and Culture. CSC maintains and develops the state-owned, centralised IT infrastructure.





ELIXIR

builds infrastructure in support of the biological sector. It brings together the leading organisations of 21 European countries and the EMBL European Molecular Biology Laboratory to form a common infrastructure for biological information. CSC – IT Center for Science is the Finnish centre within this infrastructure.