Machine learning-based identification of separating features in molecular fragments
Identifikace separujících vlastností molekulárních fragmentů pomocí strojového učení
bachelor thesis (DEFENDED)
Permanent link
http://hdl.handle.net/20.500.11956/2093
Identifiers
Study Information System: 172518
Collections
- Kvalifikační práce [11242]
Author
Advisor
Referee
Škoda, Petr
Faculty / Institute
Faculty of Mathematics and Physics
Discipline
General Computer Science
Department
Department of Software Engineering
Date of defense
31. 1. 2017
Publisher
Univerzita Karlova, Matematicko-fyzikální fakulta
Language
English
Grade
Excellent
Keywords (Czech)
cheminformatika, strojové učení, molekulární reprezentace
Keywords (English)
cheminformatics, machine learning, molecular representation

The chosen molecular representation is one of the key parameters of virtual screening campaigns, in which one searches in silico for molecules that are active with respect to a given macromolecular target. Most campaigns employ a representation in which a molecule is described by the presence or absence of a predefined set of topological fragments. Often, this information is enriched by physicochemical features of these fragments; that is, the representation distinguishes fragments with identical topology but different features. For a given molecular representation, however, most approaches use the same static set of features irrespective of the specific target. The goal of this thesis is, given a set of molecules known to be active or inactive with respect to a target, to study the possibilities of parameterizing a fragment-based molecular representation with feature weights that depend on that target. In this setting, we are given a very general molecular representation, and targets are represented by sets of known active and inactive molecules. We then propose a machine-learning approach that identifies which of the features are relevant for the given target. This is done using a multi-stage pipeline that includes data preprocessing using statistical imputation and dimensionality...
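
As a minimal illustration of the presence/absence representation described in the abstract, the following Python sketch contrasts a topology-only fingerprint with one enriched by fragment features. It is not the thesis's implementation: the fragment names, the feature labels, and the functions `topology_fingerprint` and `enriched_fingerprint` are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# A fragment is modelled as (topology, feature label).
# Both the topologies and the labels below are made-up placeholders.
Fragment = Tuple[str, str]

VOCABULARY: List[Fragment] = [
    ("C-C", "hydrophobic"),
    ("C-O", "acceptor"),
    ("C-O", "donor"),
]


def topology_fingerprint(molecule: Set[Fragment]) -> List[int]:
    """Presence/absence over distinct topologies only (features ignored)."""
    topologies = sorted({topo for topo, _ in VOCABULARY})
    present = {topo for topo, _ in molecule}
    return [int(t in present) for t in topologies]


def enriched_fingerprint(molecule: Set[Fragment]) -> List[int]:
    """Presence/absence over (topology, feature) pairs."""
    return [int(frag in molecule) for frag in VOCABULARY]


# Two hypothetical molecules sharing topologies but differing in one feature.
mol_a = {("C-C", "hydrophobic"), ("C-O", "acceptor")}
mol_b = {("C-C", "hydrophobic"), ("C-O", "donor")}

print(topology_fingerprint(mol_a), topology_fingerprint(mol_b))
print(enriched_fingerprint(mol_a), enriched_fingerprint(mol_b))
```

In this toy setup the two molecules are indistinguishable by topology alone, while the feature-enriched fingerprint separates them, which is the distinction the abstract draws between fragments with identical topology but different features.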