
LLM-Train: Large Language Model Training and Fine-Tuning Environment


Short Summary

 

The LLM Training and Fine-Tuning Environment provides dedicated computational resources for developing, training, and deploying large language models locally, without relying on external cloud APIs. Built on NVIDIA H100 GPUs with extensive memory capacity, this testbed enables researchers and industry partners to customize open-source language models (Llama, Mistral, BERT variants) for domain-specific applications, including technical documentation, customer service automation, and specialized knowledge bases. The environment supports the full LLM lifecycle: pre-training on custom corpora, fine-tuning with techniques such as LoRA and QLoRA, and inference optimization for production deployment.
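As a rough illustration of why LoRA-style fine-tuning is lighter than full fine-tuning, the sketch below counts trainable parameters for a single weight matrix. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not properties of any model hosted on the testbed.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix W (d_out x d_in) is trained."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA update W + B @ A, with B (d_out x rank) and A (rank x d_in)."""
    return d_out * rank + rank * d_in


# Hypothetical layer size and rank, chosen only for illustration.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_params(D_OUT, D_IN)
lora = lora_params(D_OUT, D_IN, RANK)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {RANK}): {lora:,} trainable parameters")
print(f"ratio: {lora / full:.4%}")
```

QLoRA trains the same kind of low-rank adapters on top of a 4-bit quantized base model, so the adapter count above is unchanged while the frozen base weights take less memory.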

 

Keywords: Large Language Models; Fine-Tuning; Generative AI; GPU Computing; Domain Adaptation; Data Sovereignty

Deeptech Area

  • Artificial Intelligence
  • Human Machine Interfaces

Hosting Institution and PI Info

 

Name of Host Organization

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa

Department or Lab

MagIC (Information Management Research Center) - the NOVA IMS research and development center

Name of Building

Manuel Vilares Building

Physical Address

Campus de Campolide, 1070-312 Lisboa

Website Links

https://www.novaims.unl.pt/

Institutional contact name

Cristina Oliveira

Institutional contact email

magic@novaims.unl.pt

Principal Investigator Name

Professor Ian James Scott

Position / institutional role

Assistant Professor

ORCID

0000-0001-9699-4473

Email

iscott@novaims.unl.pt

TestBed Responsible Name (if different from PI)

 

Funding source(s) for TestBed's acquisition

This testbed benefits from the resources of the NOVA Data & Analytics Hub (NOVA DAH), hosted at NOVA Information Management School (NOVA IMS) of Universidade NOVA de Lisboa. The work is supported by national funds through FCT (Fundação para a Ciência e a Tecnologia) under project UID/04152/2025 (https://doi.org/10.54499/UID/04152/2025) (Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS); by the Plano de Recuperação e Resiliência (PRR) under projects UID/PRR/04152/2025 (https://doi.org/10.54499/UID/PRR/04152/2025) and EQUIPAR+2: UID/PRR2/04152/2025 (https://doi.org/10.54499/UID/PRR2/04152/2025); and by LISBOA2030 under project LISBOA2030-FEDER-01317500.

Application Domain

  • Manufacturing
  • Healthcare
  • Logistics
  • Agriculture
  • Maintenance & Inspection

Application Cases

 

Application case:

Short description:

Fine-tuning LLMs

Fine-tune open-source multilingual models (e.g., BLOOM, Llama) on a curated corpus of Portuguese legal documents.

RAG Legal Document Retrieval

Implement Retrieval-Augmented Generation (RAG) to ensure accurate citation of legal precedents.

Test / chat interfaces

Deploy a custom chat interface for lawyers to query case law and generate legal briefs.
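The retrieval step of the RAG application case above can be sketched with nothing but the standard library. This is a toy example under stated assumptions: it ranks passages by bag-of-words cosine similarity instead of the learned embeddings a real system would likely use, and the corpus entries, ids, and prompt wording are placeholders invented for illustration, not real legal texts.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k passages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc_id: cosine(q, tokenize(corpus[doc_id])), reverse=True)
    return ranked[:k]


def build_prompt(query: str, corpus: dict[str, str], k: int = 2) -> str:
    """Assemble a prompt that asks the model to answer only from cited passages."""
    ids = retrieve(query, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in ids)
    return (
        "Answer using only the passages below and cite their ids.\n"
        f"{context}\n"
        f"Question: {query}"
    )


# Placeholder passages, invented for illustration; not real legal texts.
CORPUS = {
    "doc-1": "Notice period required to terminate a residential lease contract.",
    "doc-2": "Rules on employee dismissal and severance compensation.",
    "doc-3": "Deadlines for appealing a tax assessment decision.",
}

prompt = build_prompt("What notice period applies to lease termination?", CORPUS, k=1)
print(prompt)
```

Passing the passage ids into the prompt is what lets the generated answer cite its sources; the language model call itself is left out of this sketch.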

Potential Stakeholders

 

Non-academic stakeholders

Industrial partners, SMEs, Startups, Government bodies, Professional associations, Public agencies and municipalities

Academic stakeholders

MSc students, PhD students, Researchers, Visiting researchers, Seconded researchers

Other types of stakeholders

R&I support professionals, R&I infrastructure operators, Innovation intermediaries, Technology transfer actors

Possible TRL and Exploitation Scenarios

 

TRL application range

4

Internal academic research

Yes

Collaborative research with external academic partners

Yes

Contract research / Proof-of-Concept for industry

Yes

Pilot / DeepTech Deployment in operational environment

No

Training services (courses, workshops, certification)

Yes

Service provision (testing, benchmarking, validation)

Yes

Open access for walk-in users (e.g. open days / hackathons)

No

Other (Secondments / sponsored access for visiting researchers under project-based or institutionally approved arrangements)

Yes

Formal access conditions and prerequisites

 

Type of contractual relationship

No contract (direct access)

Academic partner: No; Industrial partner: No

Direct contract between parties (e.g., research agreement)

Academic partner: Yes (See Note 1); Industrial partner: Yes (See Note 1)

Indirect contract between parties (e.g., project framework)

Academic partner: Yes (See Note 1); Industrial partner: Yes (See Note 1)

 

Note 1: All access is subject to terms and conditions.

 

 

Type of prerequisites

Description of prerequisites

 

Agreements

                                                                    

Confidentiality agreement for proprietary algorithms

In some cases (See Note 2)  

Data sharing agreement for datasets generated

In some cases (See Note 2)

IP agreements

In some cases (See Note 2)

Other 

In some cases (See Note 3)

Insurance

Users must have appropriate liability coverage through their home institution

Yes

 

Note 2: Intellectual property, confidentiality, and exploitation conditions are governed by the applicable NOVA regulations, the CITADELS consortium framework, and any project- or service-specific agreements. Background IP remains with the original rightsholders. Foreground generated through collaborative or service activities will be managed according to the applicable contractual framework, including provisions on ownership, access rights, confidentiality, dissemination, and exploitation. Additional NDAs, data-processing agreements, or specific IP clauses may apply depending on the nature of the data, software, models, or other assets involved.

Note 3: Access is granted on a project-based or institutionally approved basis, subject to feasibility assessment, resource availability, compliance with data protection and security requirements, and acceptance of the applicable terms and conditions. Special arrangements may apply for CITADELS secondments and other approved visiting researcher schemes. Where sensitive, proprietary, or regulated assets are involved, additional safeguards may be required before access is enabled.

 

Training and Safety

 

Mandatory technical training

N/A

Recommended technical training

Training on cluster operation, job submission, and queuing with SLURM is recommended.

Mandatory safety requirements

No physical safety requirements apply. Users must follow institutional cybersecurity, data protection, GDPR and responsible AI rules where relevant.
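To make the recommended SLURM training more concrete, the sketch below renders a batch script that runs a command inside an Apptainer image. The job name, image file, training command, and resource numbers are hypothetical; the partitions, GPU resource names, and time limits actually configured on the cluster are not given here and should be confirmed with the testbed operators.

```python
def render_sbatch(job_name: str, gpus: int, cpus: int, hours: int, image: str, command: str) -> str:
    """Render a SLURM batch script that runs a command inside an Apptainer image."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",           # number of GPUs requested
        f"#SBATCH --cpus-per-task={cpus}",      # CPU cores for the task
        f"#SBATCH --time={hours:02d}:00:00",    # wall-clock limit, HH:MM:SS
        "#SBATCH --output=%x-%j.out",           # %x = job name, %j = job id
        "",
        # --nv makes the host NVIDIA driver and GPUs visible inside the container
        f"apptainer exec --nv {image} {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
script = render_sbatch(
    job_name="lora-finetune",
    gpus=2,
    cpus=16,
    hours=12,
    image="llm-train.sif",
    command="python train.py",
)
print(script)
```

Saved to a file such as `job.sh`, a script like this would typically be submitted with `sbatch job.sh` and monitored with `squeue`.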

 

Technical Components for the Testbed

 

Components:

 

Description:

 

Hardware

(physical equipment available in this TestBed)

1) NOVA DAH01 System Specifications:

a) CPU: 32-Core CPU - This processor provides a significant amount of processing power, enabling users to run multiple demanding tasks simultaneously, such as simulations, data processing, and other compute-intensive workloads.

b) GPU: 2 x NVIDIA RTX 6000 Ada - These high-performance GPUs are designed to accelerate AI, HPC, and other GPU-accelerated workloads. With two RTX 6000 Ada GPUs, users can leverage massive parallel processing capabilities, handling large amounts of data and providing a substantial boost to performance.

c) Storage: 7 TB NVMe Storage - This high-capacity storage solution provides rapid data access and transfer speeds, ideal for applications that require high-performance storage, such as data analytics and scientific simulations.

2) NOVA DAH02 System Specifications:

a) CPU: 112 CPU cores, providing a substantial amount of processing power for compute-intensive tasks. This will enable users to run multiple simulations, data processing, and other tasks concurrently.

b) GPU: 2 x NVIDIA H100 NVL GPUs, which offer significant performance gains for AI, HPC, and other GPU-accelerated workloads. The H100 NVL GPUs are designed to handle massive amounts of data and provide high-performance computing capabilities.

3) NOVA DAH WS includes:

a) 16 Lenovo ThinkStation P5 units, each equipped with an Intel(R) Xeon(R) W3-2423 processor, 32 GB DDR5-4800 MHz ECC memory, an NVIDIA RTX(R) 2000 Ada Generation GPU with 16 GB GDDR6, and a 1 TB PCIe Gen4 SSD.

b) Operating system: Windows 11 Education.

c) Broad range of licensed and open-source software for data science, analytics, modelling, and visualisation, including but not limited to Python, R, Power BI, Tableau, SPSS, SAS, QGIS, ArcGIS, Docker, Anaconda, Visual Studio Code, and Zotero.

d) Storage: 500 TB of storage, providing ample space for large datasets, applications, and other data. This capacity enables users to work with big data and store the results of their computations.

4) Others
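When matching a model to the GPUs listed above, a back-of-envelope check of weight memory is a useful first step. The sketch below is only that: it counts model weights and ignores activations, optimizer state, and the KV cache. Per-GPU memory is not stated in this description, so `GPU_MEMORY_GB = 94.0` (the figure NVIDIA publishes for one H100 NVL) and the 7-billion-parameter model size are assumptions to replace with the real values.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(n_params: float, dtype: str, gpu_memory_gb: float, n_gpus: int = 1) -> bool:
    """True if the weights fit in the combined memory of n_gpus devices."""
    return weight_memory_gb(n_params, dtype) <= gpu_memory_gb * n_gpus


# Assumed values, not taken from this description.
N_PARAMS = 7e9          # hypothetical 7-billion-parameter model
GPU_MEMORY_GB = 94.0    # assumed per-GPU memory of an H100 NVL

for dtype in BYTES_PER_PARAM:
    gb = weight_memory_gb(N_PARAMS, dtype)
    ok = fits(N_PARAMS, dtype, GPU_MEMORY_GB, n_gpus=2)
    print(f"{dtype}: {gb:.1f} GB of weights, fits on 2 GPUs: {ok}")
```

Training needs considerably more memory than this estimate, which is one reason parameter-efficient methods such as LoRA and QLoRA are attractive on a two-GPU node.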

 

Software

(needed to run the TestBed)

1) SSH client

2) File transfer tools (recommended)

3) Apptainer runtime, for testing containers locally

 

Standards and regulations (relevant for the safe and compliant operation of this TestBed)

N/A

Ethical and Societal Aspects

 

Ethical and societal aspect:

Short description:

Data Sovereignty and Privacy Protection

The testbed enables organizations to train and deploy LLMs on-premise, ensuring sensitive corporate data, personal information, and proprietary knowledge never leave the institution's infrastructure. This is critical for healthcare providers, government agencies, and financial institutions that cannot use cloud-based AI services due to data residency requirements and privacy regulations.

Democratization of AI for European Languages

By providing accessible infrastructure for training European language models, the testbed addresses the underrepresentation of these languages in mainstream AI systems. This enables the development of culturally aware AI assistants that understand idioms, legal terminology, and regional variations, supporting equal access to AI benefits for speakers of European languages.

Digital Literacy and AI Education

The testbed serves as a hands-on learning platform for students and professionals to understand LLM technology, promoting AI literacy beyond technical audiences.

Support for Regional SMEs and Startups

By offering subsidized access, the testbed enables SMEs to compete with tech giants in developing specialized AI applications.


More info

(TBD)