[Seminar] Clustering high-dimensional count data through a mixture of multinomial PCA

Published on October 13, 2020 Updated on October 13, 2020
Dates

November 12, 2020

Location
Seminar held remotely via Microsoft Teams

Seminar given by Nicolas Jouvin (Université Paris 1) on November 12, 2020 at 10:00

Speaker: Nicolas Jouvin (PhD student at Université Paris 1, SAMM Laboratory).

Abstract: Count data arise in many scientific fields as frequency counts, for instance the occurrences of distinct words in a bag-of-words model for text analysis, or read counts in genomics. This presentation addresses the problem of clustering count data with a mixture model. The model builds on latent Dirichlet allocation, also known as multinomial PCA, and integrates clustering and dimension reduction in order to handle high-dimensional datasets. We present a new variational EM algorithm for this model, combined with a greedy heuristic. We illustrate the qualitative interest of the proposed methodology on a real-world application, the clustering of anatomopathological medical reports, carried out in partnership with expert practitioners from the Institut Curie hospital.

Due to the current pandemic, this seminar will be held remotely via Microsoft Teams.

To register, please send an email to