### Master Thesis
Programme: MSc Bioinformatics (FU Berlin)
First Supervisor: Prof. Dr. Dr. h.c. Edda Klipp
Second Supervisor: Prof. Dr. Katharina Baum
# Pipeline
## Overview
This repository provides a Python pipeline for analyzing and optimizing gene-phosphorylation site dynamics. It combines data preprocessing, parameter estimation, and visualization to characterize kinase interactions and phosphorylation mechanisms. The workflow is modular, so it can be adapted to different experimental datasets. It also handles incomplete kinase-phosphorylation site data and uses parallel processing to scale to large datasets.
---
## Features
### 1. Data Organization
- 📁 Automatically categorizes output files (`.xlsx`, `.csv`, `.png`, `.html`) into protein-specific subdirectories based on protein symbols.
- 🗂️ Establishes a structured file hierarchy that simplifies downstream analyses and minimizes manual effort in organizing outputs.
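A minimal sketch of this kind of sorting step is shown below. It is not the code in `arrange_folders.py`: the function name `organize_by_protein` and the filename convention (protein symbol first, followed by an underscore, e.g. `EGFR_fit.png`) are assumptions made for illustration.

```python
import shutil
from pathlib import Path

# File types named in the feature list above.
OUTPUT_SUFFIXES = {".xlsx", ".csv", ".png", ".html"}


def organize_by_protein(output_dir, sep="_"):
    """Move output files into subfolders named after their protein symbol.

    Assumption (hypothetical): each filename starts with the protein
    symbol followed by `sep`, e.g. "EGFR_fit.png".
    Returns a dict mapping each symbol to the filenames that were moved.
    """
    output_dir = Path(output_dir)
    moved = {}
    # sorted() materializes the listing before any folders are created.
    for path in sorted(output_dir.iterdir()):
        if not path.is_file() or path.suffix.lower() not in OUTPUT_SUFFIXES:
            continue  # leave directories and unrelated files untouched
        symbol = path.stem.split(sep)[0]
        target_dir = output_dir / symbol
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved.setdefault(symbol, []).append(path.name)
    return moved
```

Calling `organize_by_protein("results")` on a hypothetical `results` folder would leave one subfolder per symbol and return a summary of what was moved.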
### 2. Data Preprocessing
- 🧪 Processes phosphorylation time-series datasets alongside kinase interaction data.
- 🖇️ Constructs observed data matrices (`P_initial`) and kinase-specific arrays (`K_array`) to provide a robust foundation for analysis.
- 🔄 Handles incomplete kinase-phosphorylation site datasets by generating synthetic data, ensuring all relationships are considered during optimization.
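The sketch below illustrates how two such tables might be turned into arrays. The column names (`gene`, `psite`, `kinase`), the function `build_matrices`, and the use of a site-by-kinase indicator matrix as a stand-in for `K_array` are all assumptions; the repository's actual schema and its synthetic-data step are not described here and are not reproduced.

```python
import numpy as np
import pandas as pd


def build_matrices(timeseries, interactions):
    """Build an observed-data matrix and a site-by-kinase indicator matrix.

    Assumed (hypothetical) layout:
      timeseries   -> columns "gene", "psite", then one column per time point
      interactions -> columns "gene", "psite", "kinase"
    """
    time_cols = [c for c in timeseries.columns if c not in ("gene", "psite")]
    # One row per (gene, psite) pair, one column per time point.
    P_initial = timeseries[time_cols].to_numpy(dtype=float)
    site_keys = list(zip(timeseries["gene"], timeseries["psite"]))
    row_of = {key: i for i, key in enumerate(site_keys)}

    kinases = sorted(interactions["kinase"].unique())
    col_of = {kinase: j for j, kinase in enumerate(kinases)}

    # K_array[i, j] == 1 when kinase j is mapped to site i (stand-in layout).
    K_array = np.zeros((len(site_keys), len(kinases)))
    rows = interactions[["gene", "psite", "kinase"]].itertuples(index=False)
    for gene, psite, kinase in rows:
        i = row_of.get((gene, psite))
        if i is not None:
            K_array[i, col_of[kinase]] = 1.0

    # Sites without any mapped kinase are the "incomplete" cases.
    unmapped = [key for key, i in row_of.items() if K_array[i].sum() == 0]
    return P_initial, K_array, unmapped
```

Returning the unmapped sites separately makes it explicit which gene-site pairs would need extra handling before optimization.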
### 3. Parameter Estimation
- 🛠️ Estimates alpha and beta parameters with Sequential Least Squares Programming (SLSQP) and differential evolution (DE).
- 📊 Supports multiple loss functions, including mean squared error (MSE), mean absolute percentage error (MAPE), Huber loss, and weighted error metrics.
- ✅ Enforces biologically meaningful constraints, ensuring alpha values sum to unity for gene-phosphorylation site combinations and beta values reflect consistent kinase contributions.
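To illustrate the sum-to-one constraint, the sketch below fits alpha weights with SciPy's SLSQP solver. The model used here (an observed site time series approximated by a weighted sum of kinase time series) and the function name `fit_alpha` are assumptions for illustration; beta parameters and the other loss functions are left out.

```python
import numpy as np
from scipy.optimize import minimize


def fit_alpha(P_obs, K):
    """Fit non-negative weights alpha that sum to 1 by minimizing MSE.

    P_obs: observed time series, shape (T,)
    K:     kinase time series, shape (n_kinases, T)
    Assumed (hypothetical) model: P_obs ~ alpha @ K.
    """
    n = K.shape[0]

    def mse(alpha):
        return np.mean((P_obs - alpha @ K) ** 2)

    result = minimize(
        mse,
        x0=np.full(n, 1.0 / n),          # start from equal weights
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n,          # each alpha stays in [0, 1]
        constraints=[{"type": "eq", "fun": lambda a: np.sum(a) - 1.0}],
    )
    return result.x, result.fun
```

SLSQP is used in this sketch because it accepts bounds and equality constraints directly, so the sum-to-one condition does not have to be folded into the loss as a penalty term.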
### 4. Scalable Optimization
- ⚡ Utilizes multi-core parallel processing to expedite computational workflows, making it suitable for large-scale datasets.
- 🔧 Provides configurable hyperparameter settings, including parameter bounds, regularization techniques (L1/L2), and data scaling methods tailored to experimental contexts.
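As a rough illustration of the regularization and scaling options, the sketch below adds L1 and L2 penalties to a mean squared error and applies min-max scaling to a series. The function names, argument names, and default weights are hypothetical and do not reflect the pipeline's configuration interface.

```python
import numpy as np


def regularized_loss(residuals, params, l1=0.0, l2=0.0):
    """Mean squared error plus optional L1 and L2 penalties on the parameters."""
    residuals = np.asarray(residuals, dtype=float)
    params = np.asarray(params, dtype=float)
    mse = np.mean(np.square(residuals))
    return mse + l1 * np.sum(np.abs(params)) + l2 * np.sum(np.square(params))


def min_max_scale(series):
    """Rescale a series to the range [0, 1]; a constant series maps to zeros."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    if hi == lo:
        return np.zeros_like(series)
    return (series - lo) / (hi - lo)
```

The L1 term pushes small parameters toward zero, while the L2 term discourages large values overall; which one suits a dataset depends on the experimental context.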
### 5. Advanced Visualization Tools
- 🌈 Generates a wide range of visualizations, including heatmaps, scatter plots, and convergence plots, to effectively communicate optimization results.
- 🔍 Produces residual analysis outputs, such as cumulative residual distributions and goodness-of-fit plots, for rigorous error evaluation.
- 📉 Incorporates dimensionality reduction techniques, such as PCA and t-SNE, to reveal hidden patterns in complex datasets.
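The PCA step can be sketched with plain NumPy, as below. This uses a singular value decomposition directly rather than the scikit-learn implementation listed in the prerequisites, and the function name `pca_project` is hypothetical; t-SNE is not covered.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their first principal components.

    X: array of shape (n_samples, n_features), e.g. one row per
       parameter set. Returns (scores, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = X_centered @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]
```

The returned scores can be passed to a scatter plot to look for clusters among fitted parameter sets.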
---
## Structure
### Scripts
1. **`arrange_folders.py`**: 📂 Organizes output files into protein-specific subdirectories for streamlined result management.
2. **`metrics.py`**: 📈 Computes performance metrics (e.g., MSE, RMSE, R-squared) and generates comparative summaries across datasets.
3. **`optimization_posthoc.py`**: 🔄 Conducts post-optimization analyses, extracting parameters and visualizing error landscapes and parameter distributions.
4. **`optimization_1.py` and `optimization_2.py`**: 🧮 Serve as the core optimization scripts, supporting diverse algorithms and constraints for precise parameter estimation.
5. **`optimization_analysis.py`**: 🧠 Provides advanced statistical analyses, dimensionality reduction, and network visualizations to complement optimization results.
6. **`problem_diagram.py`**: 🎨 Generates network diagrams with labeled nodes, colored edges, and subscripted labels for multiple Graphviz layouts.
### Files
- **Input**:
- 📄 `input1.csv`: Contains phosphorylation time-series data for multiple genes and phosphorylation sites.
- 📄 `input2.csv`: Contains mapped kinase-phosphorylation site interactions, which form the basis for optimization.
- **Output**:
- 📊 Excel reports (`optimization_results.xlsx`) summarizing optimized parameters, residuals, and performance metrics.
- 🖼️ High-resolution visualizations (PNG format) illustrating parameter trends, error distributions, and network structures.
---
## Usage
### Prerequisites
- 🖥️ Python 3.x
- 📦 Required libraries: `numpy`, `pandas`, `matplotlib`, `seaborn`, `scipy`, `statsmodels`, `openpyxl`, `pymoo`, `tqdm`, `scikit-learn` (imported as `sklearn`)

To install dependencies, run:
```bash
pip install numpy pandas matplotlib seaborn pymoo scipy statsmodels openpyxl tqdm scikit-learn
```
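After installation, a quick check like the following can confirm that the libraries are importable. This helper is not part of the repository; the name `missing_modules` is hypothetical, and the list uses import names, which is why scikit-learn appears as `sklearn`.

```python
from importlib.util import find_spec

# Import names of the required libraries. scikit-learn is installed from
# PyPI as "scikit-learn" but imported as "sklearn".
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "seaborn", "scipy",
    "statsmodels", "openpyxl", "pymoo", "tqdm", "sklearn",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```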
### Execution
1. **Run Optimization Pipelines**:
```bash
python optimization_1.py
```
or
```bash
python optimization_2.py
```
- 🛠️ Executes parameter estimation using customizable settings, algorithms, and constraints.
2. **Conduct Advanced Analyses**:
```bash
python optimization_analysis.py
```
- 📊 Performs statistical evaluation, visualizes parameter trends, and generates detailed network diagrams.
3. **Run Post-hoc Analysis**:
```bash
python optimization_posthoc.py
```
- 📊 Performs bootstrapping to generate error landscapes, contour plots, and gradient fields, along with a waterfall plot across multiple optimizer runs.
4. **Organize Files**:
```bash
python arrange_folders.py
```
- 🗂️ Groups output files by protein symbols to create a clean and navigable layout.
5. **Generate Comparative Metrics**:
```bash
python metrics.py
```
- 🔍 Computes and compares performance metrics to support model evaluation and selection based on the lower and upper bounds.
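For reference, the metrics named for `metrics.py` (MSE, RMSE, R-squared) can be computed as in the sketch below. The function name `fit_metrics` and the returned dictionary keys are hypothetical and are not taken from the script itself.

```python
import numpy as np


def fit_metrics(observed, predicted):
    """Return MSE, RMSE, and R-squared for observed vs. predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = observed - predicted
    mse = float(np.mean(residuals ** 2))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((observed - observed.mean()) ** 2))
    # R-squared is undefined when the observed values have no variance.
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": mse ** 0.5, "r_squared": r_squared}
```

Lower MSE and RMSE and an R-squared closer to 1 indicate a closer fit, which is the basis for comparing runs.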
### Outputs
- **Reports**:
- 📄 Excel files containing optimized parameters, residuals, and comprehensive error metrics.
- **Visualizations**:
- 🖼️ High-resolution plots, including convergence trajectories, scatter plots, and network diagrams, for detailed analysis and interpretation.
---
## License
This repository is licensed under the MIT License. See LICENSE for details.
---
## Author
**Abhinav Mishra**
📧 Email: mishraabhinav36@gmail.com, abhinav.mishra@fu-berlin.de
---
## Affiliations
1. **Theoretical Biophysics Lab, Humboldt University of Berlin**
Research conducted under the supervision of [Prof. Edda Klipp](https://rumo.biologie.hu-berlin.de/tbp/index.php/en/).
Institution: Humboldt University of Berlin, Berlin, Germany.
2. **Freie Universität Berlin**
Author: MSc Bioinformatics student at [Freie Universität Berlin](https://www.fu-berlin.de/en/index.html).
Institution: Freie Universität Berlin, Berlin, Germany.
---
## Acknowledgements
🎓 Special thanks to collaborators and domain experts for their invaluable insights and guidance, which have been instrumental in shaping this pipeline. Their contributions have significantly enhanced the repository's analytical capabilities and usability. A moment of gratitude to my family and friends, who were patient and kind enough to be there from start to end.
---