Hybrid LLM-enhanced topic modelling for large-scale thematic analysis of dairy cattle health literature
Project description
This dataset contains the outputs of the BERTopic algorithm, an unsupervised clustering tool used to group large bodies of unstructured text. In this research project, the texts are scientific abstracts related to a specific research domain: dairy cattle health. BERTopic outputs were enhanced using a large language model (LLaMA 3.1 Instruct 8B) deployed locally on the UBELIX HPC cluster, which was used to generate refined topic labels and structured summaries. The dataset includes the algorithmically derived topic titles (Name), topic identifiers (Topic), and cluster sizes (Count), sorted in descending order by cluster size. The Representation column lists the main keywords extracted by BERTopic via c-TF-IDF, which serve as the basis for the algorithmic topic title. All remaining columns were generated by LLaMA to provide structured descriptions of each topic and its thematic content.
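As a rough illustration of the column layout described above, the following sketch separates the BERTopic columns (Topic, Count, Name, Representation) from the remaining LLaMA-generated ones and checks the descending sort by cluster size. It assumes the table has been loaded into a pandas DataFrame. The helper names, the `llm_label` column, and all row values are hypothetical placeholders and are not taken from the dataset.

```python
import pandas as pd

# Columns named in the project description; every other column is
# assumed to be one of the LLaMA-generated fields.
BERTOPIC_COLUMNS = ["Topic", "Count", "Name", "Representation"]


def split_columns(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (BERTopic columns, remaining LLaMA-generated columns)."""
    missing = [c for c in BERTOPIC_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    llm_columns = [c for c in df.columns if c not in BERTOPIC_COLUMNS]
    return BERTOPIC_COLUMNS, llm_columns


def is_sorted_by_count(df: pd.DataFrame) -> bool:
    """Check that rows are in descending order of cluster size."""
    return bool(df["Count"].is_monotonic_decreasing)


# Placeholder rows for illustration only; not real dataset content.
example = pd.DataFrame(
    {
        "Topic": [0, 1, 2],
        "Count": [300, 120, 45],
        "Name": ["0_term_a_term_b", "1_term_c_term_d", "2_term_e_term_f"],
        "Representation": [
            ["term_a", "term_b"],
            ["term_c", "term_d"],
            ["term_e", "term_f"],
        ],
        "llm_label": ["placeholder A", "placeholder B", "placeholder C"],  # hypothetical column
    }
)

bertopic_cols, llm_cols = split_columns(example)
```

With the real file, `example` would be replaced by the loaded table (for instance via `pd.read_csv`, if the dataset is distributed as CSV, which is an assumption here).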
Data Availability
Open
Contributors
Zahri, Reda
Languages
en
Keyword(s)
Topic Modeling • LLM • BERTopic • Unsupervised Learning • Machine Learning • Natural Language Processing • NLP • ML • Veterinary Public Health • Public Health • LLaMA • Embedding • Clustering • Topic • Exploratory Analysis • Dairy • Cattle • Health • Dairy Cattle Health • Bibliometrics • Veterinary Epidemiology
Rights URI
Boris Publication