- Paper: Methods of integrating data to uncover genotype–phenotype interactions
- This review paper was used in our field's summer seminar (August 2015). The seminar site is here. You can log in with both the username and the password set to "guestsan"; feel free to take a look.
- Understanding the specialized terms of the field is important, but a precise understanding of ordinary English words is also beneficial.
- The following words appear in the review above, and understanding them should be helpful.
- If you can explain them in simple words to somebody unfamiliar with biology, informatics, or statistics, your understanding is fine!
- The words range from junior-high-school level to graduate-school level. Some words appear more than once.
------
- TITLE, ABSTRACT and INTRODUCTION
- integrate (integrating data)
- genotype
- phenotype
- identification (identification of effective model)
- predict (predict phenotypic traits)
- trait (phenotypic traits)
- elucidate (elucidating important biomarkers)
- biomarker
- insight (generate important insights)
- heritability
- complex trait
- harness (harness the utility of ...)
- emerge (emerging approaches for data integration)
- meta (meta-dimensional)
- dimension (meta-dimensional analyses)
- multi (multi-staged)
- stage (multi-staged analyses)
- system (biological systems)
- translational (translational bioinformatics)
- complementary (complementary analysis)
- family-based, population-based (family-based data and population-based data)
- architecture (genetic architecture)
- pathway (biological pathways)
- aetiology (genetic aetiology)
- interrogation (interrogation of genotype-phenotype associations)
- compensate (compensate for missing or unreliable information)
- principle (principles of meta-dimensional analysis and multi-staged analysis)
- quantitative, categorical (quantitative or categorical outcome)
- outcome (quantitative or categorical outcome)
- challenge (analytical challenges)
- perspective (provide our perspective on how such systems genomic analyses might develop in the future)
- WHY INTEGRATE DATA?
- predictor (predictor variables)
- variable (predictor variables)
- comprehensive (comprehensive modelling)
- elaborate (the result of an elaborate interplay)
- interplay (the result of an elaborate interplay)
- informative (more informative model)
- bridge (bridge the gap)
- reflect (reflecting the complexity)
- complexity (reflecting the complexity)
- primary (primary motivation)
- explain, predict (explain or predict disease risk)
- risk (disease risk)
- modest (The success ... has been modest)
- limited (limited exploration)
- exploration (limited exploration)
- power (improved power)
- mechanism (understanding of the mechanism)
- causal (causal relationship)
- stepwise (stepwise or hierarchical analysis)
- hierarchical (hierarchical analysis)
- refer (refers to the concept)
- concept (the concept of integrating multiple different data types)
- build (build a multivariate model)
- multivariate (multivariate model)
- given (a given outcome)
- scientific (new scientific questions)
- assemble (assembling all of these data types together)
- diversity (diversity in the size of data sets)
- size (diversity in the size)
- pattern (patterns of missing data)
- noise (noise across the different data types)
- across (noise across the different data types)
- correspondence (correspondence between measurements from different technologies)
- measurements (correspondence between measurements from different technologies)
- substantial (create substantial challenges)
- single (no single analysis approach)
- optimal (be optimal for all studies)
- comprehensive (a comprehensive analysis toolbox)
- expanded (an expanded analysis toolbox)
- CHALLENGES WITH INDIVIDUAL DATA SETS
- individual (individual data sets)
- unique (unique challenges)
- implement (before implementing multi-staged analyses)
- quality (data quality)
- scale (data scale)
- dimensionality
- potential (potential confounding of the data)
- confounding (potential confounding of the data)
- issue (these issues are not dealt with)
- each (each individual data type)
- downstream (avoid downstream problems)
- storage (computational power and storage capabilities)
- capability (storage capabilities)
- system (computing systems)
- open-source (open-source to commercial packages)
- commercial (commercial packages)
- packages (commercial packages)
- store (store these data)
- track (track these data)
- assurance (quality assurance)
- control (quality control)
- assay (low-throughput assays)
- cluster (genotype clusters)
- sample (any samples that did not cluster well)
- rest (with the rest of the data set)
- nature (large-scale nature of high-throughput data)
- feasible (examining data individually is not feasible)
- summary statistics (rely on summary statistics)
- overview (broad overview of the data)
- pipeline (quality control pipelines)
- electronic medical record
- profile (methylation profiling)
- specific (specific and critical quality control steps)
- critical (critical steps)
- integrity (sample integrity)
- distributional (distributional evaluation)
- respect (with respect to variables)
- ensure (will ensure that ...)
- rigorously (how rigorously to perform)
- reduction (data reduction)
- limit (limit the number of variables)
- single (in a single data set)
- initial (as an initial step)
- consider (when considering data with a vast number of independent variables)
- independent (independent variables)
- cross (cross-validation)
- validation (cross-validation)
- permutation (permutation testing)
- concern (address this concern)
- filter (filtering strategy)
- facilitate (facilitates data integration analyses)
- refine (more refined subset)
- subset (more refined subset)
- efficient (efficient computation)
- computation (efficient computation)
- burden (multiple-hypothesis testing burden)
- full (full dimensionality)
- consideration (computational time, memory and sample size considerations)
- exhaustive (in an exhaustive manner)
- combinatorial (combinatorial increase in models)
- respective (and their respective computation times)
- possible (all possible pairwise models)
- pairwise (all possible pairwise models)
- choose (by choosing 2 of the 5 million variables)
- GPU (GPU clusters)
- considerably (considerably faster)
- traditional (traditional computing processors)
- practicality (reaching the limits of practicality)
- mine (data mining)
- extrinsic, intrinsic (either extrinsic ... or intrinsic)
- external (using information external to the data set itself)
- prior (prior knowledge)
- domain (in the public domain)
- system (immune system)
- time (the knowledge of the field at the time)
- feature (remove biologically important features)
- threshold (on a chosen P value threshold)
- relevant (biologically relevant variants)
- annotation (based on a Biofilter annotation)
- drive (will drive the hypothesis that can be tested)
- dominant (dominant paradigm)
- paradigm (dominant paradigm)
- stratify (by stratifying the data by type)
- alternative (Hypothesis B is the alternative possibility)
- multiple (multiple levels of molecular variation)
- contribute (contribute to disease risk)
- interactive (in a nonlinear, interactive and complex way)
- subsequently (and subsequently performing analyses would inhibit ...)
- appropriate (would be more appropriate)
- particular (association with a particular outcome)
- spurious (spurious association)
- finding (interpretations of findings)
- demographic (genetic, environmental, demographic or other technical factors)
- technical (genetic, environmental, demographic or other technical factors)
- address (address population stratification)
- surrogate (surrogate variable)
- interest (other variables of interest)
- issue (overcome the potential issues with heterogeneity)
- heterogeneity (overcome the potential issues with heterogeneity)
- comprehensive (comprehensive data integration analyses)
- AN OVERVIEW OF DATA INTEGRATION
- scale (using only two different scales at a time)
- refer (we refer to the numerical and categorical features)
- continuous (continuous values)
- reflect (this approach reflects Hypothesis A)
- fusion (fusion of scales)
- simultaneously (are combined simultaneously)
- DATA INTEGRATION: MULTI-STAGED ANALYSIS
- suggest (as its name suggests)
- signal (signals are enriched)
- enrich (signals are enriched with each step of the analyses)
- deem (SNPs deemed significant)
- option (one option is to look for ...)
- binary (on a continuous or a binary dependent variable)
- respectively (linear or logistic regression (depending on a continuous or a binary dependent variable, respectively))
- rationale (the rationale of this approach)
- arbitrary (relatively arbitrary threshold)
- combat (combat multiple testing problems)
- functional (functional SNPs)
- inference (causal inference)
- key (key drivers)
- driver (key drivers)
- exploit ((something) that exploit the naturally occurring DNA variation)
- natural (naturally occurring)
- reactive (as an independent, causative or reactive function)
- likelihood (maximum likelihood)
- fairly (are fairly powerful)
- specific (allele-specific expression)
- organism (diploid organisms)
- preferential (preferentially expressed)
- modification (epigenetic modifications)
- product (gene product)
- extra (extra resources)
- resource (extra resources used for experimentally tagging the two alleles)
- tag (experimentally tagging the two alleles)
- extend (other extended methods)
- context (used in other contexts)
- state (chromatin state)
- domain (domain knowledge-guided approaches)
- guide (domain knowledge-guided approaches)
- consolidate (is consolidated by initiatives)
- initiative (initiatives such as ENCODE)
- input (the genomic regions of interest are inputs)
- unit (functional units)
- annotate (annotate them with domain knowledge from multiple public database resources)
- current (biased by current knowledge)
- perturbation (environmental perturbations)
- applicable (a multi-staged analysis would be applicable)
- DATA INTEGRATION: META-DIMENSIONAL ANALYSIS
- concatenation (concatenation-based integration)
- transformation (transformation-based integration)
- joint (joint relationship)
- recurrence (time to recurrence)
- alteration (copy number alteration)
- via (via LASSO)
- meaningful (in a meaningful way)
- corresponding (values corresponding to the copies of a specific allele per individual)
- per (values corresponding to the copies of a specific allele per individual)
- inflate (can inflate high-dimensionality)
- intermediate (transforming each data type into an intermediate form)
- symmetrical (symmetrical ... matrix)
- positive (positive ... matrix)
- semi (semi-definite)
- definite (semi-definite)
- represent (a matrix represents the relative positions)
- position (the relative positions of all samples)
- merge (multiple graphs or kernels can then be merged)
- elaborate (before elaborating any models)
- preserve (the advantage of preserving data-type-specific properties)
- property (data-type-specific properties)
- representation (transformed into an appropriate intermediate representation)
- unifying (as long as the data contain a unifying feature, such as patient identifiers)
- identifier (patient identifiers)
- robust (robust to different data measurement scales)
- supervised (semi-supervised)
- learning (semi-supervised learning)
- space (original feature space)
- encompass (model-based integration encompasses methods)
- training (training set)
- final (a final model)
- phase (during the training phase)
- available (DNA sequence data may be available)
- suite (a suite of analysis tools)
- majority (majority voting)
- vote (majority voting)
- resistance (drug resistance)
- mutants (HIV protease mutants)
- complex (HIV protease-drug inhibitor complex)
- recognition (protein fold recognition)
- resulting (the resulting model)
- weighted (in a weighted voting scheme)
- scheme (in a weighted voting scheme)
- probabilistic (construct probabilistic causal networks)
- require (model-based integration requires a specific hypothesis)
- resultant (resultant DNA sequence model)
- incorporate (the only variables that are incorporated into the integrative analysis)
- ensemble (ensemble-based approaches)
- supervised (supervised learning)
- label (with known labels (outcome or phenotype))
- latent (latent variable)
- exploratory (exploratory learning)
- CAVEATS AND LIMITATIONS
- caveat (caveats and limitations)
- theoretical (theoretical distributions from which power calculations can be performed)
- empirical (empirical power)
- apply (these power estimates will apply only to the data set or simulation at hand)
- at hand (these power estimates will apply only to the data set or simulation at hand)
- universal (the universal power of the approach)
- pitfall (potential pitfalls)
- prohibitive (as the computation time can be prohibitive)
- orthogonal (that extract orthogonal, or independent, relationships)
- essential (which primary variables are essential)
- gold standard (the 'gold standard' in human genetics is to look for replication of results)
- replication (the 'gold standard' in human genetics is to look for replication of results)
- stringent (more stringent protection)
- protection (more stringent protection from type 1 errors)
- underlie (underlying functional genomic units)
- unit (underlying functional genomic units)
- represent (represented by each variable)
- external (external replication)
- readily (independent data sets are not often readily available)
- internal (internal replication)
- extrinsic (extrinsic data)
- corroborate (to estimate the strength of the available corroborating evidence supporting a given association)
- validation (functional validation)
- viable (viable alternative to replication)
- bench (bench science)
- literature (text mining to find literature that supports or refutes the original findings)
- refute (text mining to find literature that supports or refutes the original findings)
- in silico (in silico modelling)
- series (a series of experiments)
- kinetic (kinetic experiments)
- differential (differential equations)
- within, between (highly correlated variables both within and between data types)
- sparse (sparse data matrices)
- metric (two metrics of the models are compared)
- fitness (fitness metric)
- parsimony (parsimony metric)
- FUTURE DIRECTIONS
- crude (crude tissue extract)
- promise (showing promise)
- reductionist (reductionist paradigm)
- prevalent (less prevalent)
- affordable (readily available and affordable)
- prevail (will prevail as the dominant type of study design)
- isolation (the days of studying molecular data variability in isolation)
- CONCLUSION
- emergence (emergence of new statistical and computational techniques)
- facilitate (the emergence ... will facilitate the search)
- compensatory (compensatory mechanisms)