🤖 AI vs Human Code Detection System

An advanced machine learning system that predicts whether code was written by an AI or a human. It combines an ensemble of classical models, GitHub repository analysis, and line-level explanations that flag contradictions between file-level and line-level predictions.

✨ Key Features

🎯 Multi-Mode Analysis

Single Code Analysis: Analyze individual code snippets with detailed breakdown
GitHub Repository Scanning: Complete repository analysis with file-by-file insights
Batch File Processing: Upload and analyze multiple files simultaneously

🧠 Intelligent Detection Engine

Ensemble ML Models: 4 classical models (LogisticRegression, RandomForest, GradientBoosting, XGBoost)
Smart Voting System: Majority vote across the models, weighted by each model's confidence
Contradiction Detection: Automatically corrects predictions when line-level analysis conflicts with file-level results
Multi-Language Support: Python, Java, and JavaScript code detection

📊 Advanced Analysis & Explanations

Line-by-Line Breakdown: Detailed analysis of individual code lines with pattern detection
Confidence Scoring: Every prediction is reported with a confidence score
Model Agreement Tracking: Shows which models agree/disagree and why
Pattern Recognition: Detects coding patterns like functions, loops, imports, etc.
Consistency Validation: Cross-validates file-level vs line-level predictions

🔍 GitHub Integration

Repository Scanning: Analyzes entire GitHub repositories automatically
Progress Tracking: Real-time analysis progress with status updates
Comprehensive Reports: Downloadable analysis reports with detailed insights
API Integration: Direct GitHub API integration for seamless repository access

🏗️ Project Architecture

Code_Detector/
├── 📱 Web Application
│   └── app.py                    # Main Streamlit application with 3 analysis modes
├── 🤖 Machine Learning Pipeline  
│   ├── ml_train.py              # Classical ML model training (4 algorithms)
│   └── dl_train.py              # Deep learning model training (Transformers)
├── 📊 Data & Models
│   ├── Dataset/                 # Training data organized by language
│   │   ├── Python/             # Python samples (AI vs HUMAN)
│   │   ├── Java/               # Java samples (AI vs HUMAN) 
│   │   └── JS/                 # JavaScript samples (AI vs HUMAN)
│   ├── model/                  # Trained classical ML models
│   │   ├── logisticregression.pkl
│   │   ├── randomforest.pkl
│   │   ├── gradientboosting.pkl
│   │   ├── xgboost.pkl
│   │   ├── vectorizer.pkl
│   │   └── labelencoder.pkl
│   └── output/                 # Trained transformer models
│       ├── CodeBERT/           # Microsoft CodeBERT model
│       ├── CodeT5/             # Salesforce CodeT5 model  
│       └── GraphCodeBERT/      # Microsoft GraphCodeBERT model
├── 📋 Documentation
│   ├── README.md               # This file
│   └── requirements.txt        # Python dependencies
└── 🗂️ Cache & Temp Files
    └── __pycache__/            # Python bytecode cache

🚀 Quick Start Guide

Prerequisites

Python 3.8 or higher
4GB+ RAM (8GB+ recommended for transformer models)
Internet connection (for GitHub repository analysis)

1. Installation

# Clone the repository
git clone https://github.com/muhammadnavas/Code_Detector.git
cd Code_Detector

# Install required dependencies
pip install -r requirements.txt

2. Launch the Application

# Start the Streamlit web interface
streamlit run app.py

🌐 Access the app at: http://localhost:8501

3. Model Training (Optional)

If you want to retrain models with custom data:

# Train classical ML models (faster, CPU-friendly)
python ml_train.py

# Train transformer models (requires GPU for optimal performance)
python dl_train.py

🎮 Usage Modes

1. 📝 Single Code Analysis

Perfect for analyzing individual code snippets:

Input Methods:
- Paste code directly into the text area
- Upload single Python/Java/JavaScript files
Analysis Output:
- 🎯 Overall Prediction: AI vs Human with confidence score
- 🔧 Model Breakdown: Individual model predictions and confidence
- 📋 Line-by-Line Analysis: Detailed analysis of each code line
- 🏷️ Pattern Detection: Identified coding patterns and structures

2. 🐙 GitHub Repository Analysis

Comprehensive analysis of entire GitHub repositories:

Repository Input:
```
https://github.com/username/repository
```
Analysis Process:
- 🔍 Auto-Discovery: Finds all Python files in the repository
- ⚡ Progress Tracking: Real-time analysis with progress indicators
- 📊 Summary Statistics: Repository-wide AI vs Human breakdown
- 📁 File-by-File Results: Detailed analysis for each file
Advanced Features:
- 🎯 Smart Corrections: Automatically corrects contradictory predictions
- ⚠️ Warning System: Flags suspicious patterns or inconsistencies
- 📄 Report Generation: Download comprehensive analysis reports

3. 📂 Batch File Analysis

Upload and analyze multiple files simultaneously:

Multi-File Upload: Support for .py, .java, .js files
Batch Processing: Analyze all files with progress tracking
Consolidated Results: Summary statistics across all uploaded files

🧠 Machine Learning Architecture

🎯 Ensemble Prediction System

The ensemble combines several classical models and reconciles their predictions:

Classical ML Models (4 Models)

🔗 Logistic Regression
- Linear classification with TF-IDF features
- Fast prediction, good baseline performance
- Confidence: Probability scores from sigmoid function
🌲 Random Forest
- Ensemble of decision trees with voting
- Handles feature interactions well
- Confidence: Vote proportion from trees
📈 Gradient Boosting
- Sequential ensemble with error correction
- Strong performance on structured data
- Confidence: Class probability from the boosted ensemble
⚡ XGBoost
- Optimized gradient boosting framework
- Regularized boosting with fast, parallelized tree construction
- Confidence: Native probability estimation

🤖 Smart Ensemble Logic

Majority Voting: 3+ models must agree for high confidence
Confidence Weighting: Uses model-specific confidence scores
Contradiction Detection: Compares file-level vs line-level predictions
Smart Corrections: Automatically adjusts predictions when inconsistencies detected
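
The voting rules above can be sketched as follows. This is a minimal illustration, assuming each model reports a label and a confidence between 0 and 1; the function name `weighted_vote` and the tie-break rule are hypothetical and not taken from app.py.

```python
from collections import defaultdict

def weighted_vote(predictions):
    """Combine (label, confidence) pairs from several models.

    Hypothetical helper: a label's score is the sum of the confidences
    of the models that voted for it.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    scores = defaultdict(float)
    votes = defaultdict(int)
    for label, confidence in predictions:
        scores[label] += confidence
        votes[label] += 1
    # Highest summed confidence wins; equal scores fall back to vote count.
    winner = max(scores, key=lambda lbl: (scores[lbl], votes[lbl]))
    total = sum(scores.values())
    share = scores[winner] / total if total else 0.0
    return winner, share, votes[winner]

label, share, agreeing = weighted_vote(
    [("AI", 0.9), ("AI", 0.7), ("HUMAN", 0.6), ("AI", 0.8)]
)
print(label, round(share, 2), agreeing)
```

Here three of the four models agree, so the result would count as a majority under the "3+ models" rule.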

🔍 Advanced Analysis Features

📋 Line-by-Line Analysis

Smart Filtering: Skips comments, imports, and trivial lines
Pattern Detection: Identifies functions, loops, conditionals, etc.
Confidence Thresholding: Only includes high-confidence line predictions (>60%)
Context Preservation: Maintains code structure understanding
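
A rough sketch of the filtering and thresholding steps above, for Python source only. The regex, the set of trivial lines, and both function names are assumptions made for illustration; the filter in app.py may use different rules.

```python
import re

# Hypothetical filter rules for Python source; app.py may use different ones.
IMPORT_RE = re.compile(r"^\s*(import\s|from\s+\S+\s+import\s)")
TRIVIAL_LINES = {"pass", "else:", "try:", "finally:", "break", "continue"}

def is_analyzable(line):
    """Return True if a line carries enough signal to score on its own."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return False          # blank line or comment
    if IMPORT_RE.match(line):
        return False          # import statement
    return stripped not in TRIVIAL_LINES

def confident_lines(scored_lines, threshold=0.6):
    """Keep (line, label, confidence) tuples whose confidence exceeds threshold."""
    return [item for item in scored_lines if item[2] > threshold]
```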

⚠️ Intelligent Contradiction Detection

Our system automatically detects and corrects contradictory predictions:

# Example: File predicted as AI, but 73% of lines are Human
Original Prediction: AI (confidence: 0.86)
Line Analysis: 73% Human lines
Smart Correction: → HUMAN (adjusted confidence: 0.72)
Status: [PREDICTION CORRECTED: AI → HUMAN]
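
One way such a correction could work is sketched below. The `flip_threshold` of 0.7 and the rule for the adjusted confidence (the fraction of disagreeing lines) are assumptions; the actual adjustment formula is not documented here, so this sketch does not reproduce the 0.72 shown in the example above.

```python
def reconcile(file_label, file_conf, line_labels, flip_threshold=0.7):
    """Flip the file-level label when most scored lines disagree with it.

    Hypothetical helper: flip_threshold and the adjusted-confidence rule
    are illustrative assumptions, not the project's actual logic.
    Returns (label, confidence, corrected).
    """
    if not line_labels:
        return file_label, file_conf, False
    disagree = sum(1 for lbl in line_labels if lbl != file_label)
    fraction = disagree / len(line_labels)
    if fraction < flip_threshold:
        return file_label, file_conf, False
    # With two classes, every disagreeing line carries the other label.
    other = "HUMAN" if file_label == "AI" else "AI"
    return other, fraction, True

result = reconcile("AI", 0.86, ["HUMAN"] * 73 + ["AI"] * 27)
print(result)
```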

🎨 Pattern Recognition Engine

Detects various coding patterns:

Structural: Functions, classes, imports
Control Flow: Loops, conditionals, exception handling
Modern Python: F-strings, list comprehensions, lambda functions
Style Indicators: Docstrings, comments, naming conventions
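
Pattern detection of this kind can be approximated with regular expressions. The sketch below covers a few of the Python patterns listed above; the pattern names and regexes are illustrative assumptions, not the project's actual rule set.

```python
import re

# Illustrative regexes for single lines of Python source.
PATTERNS = {
    "function": re.compile(r"^\s*(async\s+)?def\s+\w+\s*\("),
    "class": re.compile(r"^\s*class\s+\w+"),
    "import": re.compile(r"^\s*(import|from)\s+\w+"),
    "loop": re.compile(r"^\s*(for|while)\b"),
    "conditional": re.compile(r"^\s*(if|elif|else)\b"),
    "exception_handling": re.compile(r"^\s*(try|except|finally|raise)\b"),
    "f_string": re.compile(r"""\b[fF][rR]?["']"""),
    "lambda": re.compile(r"\blambda\b"),
    "list_comprehension": re.compile(r"\[[^\]]*\bfor\b[^\]]*\bin\b[^\]]*\]"),
}

def detect_patterns(line):
    """Return the names of all patterns found in one line of code."""
    return [name for name, rx in PATTERNS.items() if rx.search(line)]
```

Line-based regexes are cheap but miss multi-line constructs; parsing with Python's `ast` module would be more robust for Python files.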

📊 Understanding the Results

🎯 Prediction Confidence Levels

🔵 High Confidence (>0.8): Very reliable prediction
🟡 Medium Confidence (0.6-0.8): Generally reliable with some uncertainty
🔴 Low Confidence (<0.6): Results may be unreliable, manual review recommended
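
The bands above map to a small helper. A sketch, assuming a score of exactly 0.8 or 0.6 falls in the medium band; the function name is hypothetical.

```python
def confidence_level(score):
    """Map a confidence score in [0, 1] to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > 0.8:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"
```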

🤖 Model Agreement Indicators

✅ Unanimous: All models agree (highest confidence)
📊 Majority: 3/4 models agree (good confidence)
⚠️ Split Decision: 2-2 tie (requires careful interpretation)
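
The agreement indicators can be derived from the models' labels alone. A minimal sketch with a hypothetical function name:

```python
from collections import Counter

def agreement_status(labels):
    """Classify model agreement as unanimous, majority, or split."""
    if not labels:
        raise ValueError("at least one label is required")
    top_label, top_votes = Counter(labels).most_common(1)[0]
    if top_votes == len(labels):
        return "unanimous", top_label
    if top_votes * 2 > len(labels):
        return "majority", top_label
    return "split", None   # a tie has no winning label
```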

🔍 Consistency Analysis

✅ Consistent: File and line predictions align
📊 Mixed Signals: Some disagreement between levels
🔄 Auto-Corrected: System detected and fixed contradiction
❌ Major Contradiction: Significant disagreement requiring manual review

🛠️ Technical Implementation

📦 Dependencies & Requirements

# Core Framework
streamlit>=1.28.0        # Web application framework

# Machine Learning  
scikit-learn>=1.3.0      # Classical ML algorithms
xgboost>=1.7.0          # Gradient boosting framework
numpy>=1.24.0           # Numerical computing
pandas>=2.0.0           # Data manipulation

# Deep Learning (Optional)
torch>=2.0.0            # PyTorch framework
transformers>=4.30.0    # Hugging Face transformers

# Web & API
requests>=2.31.0        # HTTP requests for GitHub API

# Serialization
joblib>=1.3.0           # Model serialization

# Standard library (ships with Python; do not list in requirements.txt)
# pathlib                 # Path handling
# re                      # Regular expressions
# typing                  # Type hints

🎯 Feature Engineering

Text Preprocessing Pipeline

Code Cleaning: Remove excess whitespace, normalize line endings
TF-IDF Vectorization: Character-level n-grams (3-5) for classical models
Feature Extraction: Syntactic patterns, complexity metrics
Tokenization: Language-specific tokenization for transformers
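
To make "character-level n-grams (3-5)" concrete, the sketch below counts every substring of length 3 to 5 using only the standard library. In scikit-learn, `TfidfVectorizer(analyzer="char", ngram_range=(3, 5))` extracts n-grams of this kind and weights them by TF-IDF; whether ml_train.py uses exactly those settings is an assumption based on the description above.

```python
from collections import Counter

def char_ngrams(text, n_min=3, n_max=5):
    """Count all character n-grams with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

grams = char_ngrams("def f():")
print(grams.most_common(3))
```

Character n-grams capture spacing, punctuation, and naming habits without needing a language-specific tokenizer, which is why they suit a multi-language detector.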

Advanced Features

Syntactic Patterns: Language constructs (functions, classes, loops)
Stylistic Features: Naming conventions, spacing patterns
Complexity Metrics: Code depth, nesting levels, line lengths
AI Indicators: Patterns typical in AI-generated code
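
Complexity metrics such as nesting level and line length can be estimated without parsing. The sketch below is an assumption about what such features might look like, not the project's feature extractor; it approximates nesting from leading spaces and does not handle tabs.

```python
def complexity_metrics(code, indent_width=4):
    """Compute simple size and nesting metrics for space-indented code."""
    lines = [ln.rstrip() for ln in code.splitlines() if ln.strip()]
    if not lines:
        return {"line_count": 0, "max_line_length": 0,
                "mean_line_length": 0.0, "max_nesting": 0}
    lengths = [len(ln) for ln in lines]
    # Nesting depth is approximated from the number of leading spaces.
    depths = [(len(ln) - len(ln.lstrip(" "))) // indent_width for ln in lines]
    return {
        "line_count": len(lines),
        "max_line_length": max(lengths),
        "mean_line_length": sum(lengths) / len(lengths),
        "max_nesting": max(depths),
    }

sample = (
    "def fibonacci(n):\n"
    "    if n <= 1:\n"
    "        return n\n"
    "    return fibonacci(n-1) + fibonacci(n-2)\n"
)
metrics = complexity_metrics(sample)
print(metrics)
```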

⚙️ System Architecture

Model Loading & Caching

# Smart model loading with caching
@st.cache_resource
def load_models():
    models = {
        'logistic': joblib.load('model/logisticregression.pkl'),
        'rf': joblib.load('model/randomforest.pkl'),
        'gb': joblib.load('model/gradientboosting.pkl'),
        'xgb': joblib.load('model/xgboost.pkl')
    }
    vectorizer = joblib.load('model/vectorizer.pkl')
    return models, vectorizer

GitHub API Integration

Rate Limiting: Respects GitHub API limits
Error Handling: Robust error handling for network issues
Recursive Scanning: Deep repository traversal for Python files
Content Processing: Handles various file encodings
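
Recursive scanning can be done with a single call to GitHub's Git Trees endpoint. The sketch below uses only the standard library (the project itself lists `requests`); the function names, the default branch name `main`, and the default extension filter are assumptions for illustration.

```python
import json
import urllib.request
from urllib.parse import urlparse

def parse_repo_url(url):
    """Split https://github.com/<owner>/<repo> into (owner, repo)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url}")
    owner, repo = parts[0], parts[1]
    return owner, repo[:-4] if repo.endswith(".git") else repo

def select_source_paths(tree, extensions=(".py",)):
    """Keep file entries (type 'blob') whose path ends with a wanted extension."""
    return [entry["path"] for entry in tree
            if entry.get("type") == "blob" and entry["path"].endswith(extensions)]

def fetch_tree(owner, repo, ref="main", token=None):
    """Fetch the full file tree of a branch in one request (needs network)."""
    url = f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # An access token raises the API rate limit for authenticated requests.
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["tree"]
```

Passing `extensions=(".py", ".java", ".js")` to `select_source_paths` would extend the scan to all three supported languages.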

Performance Optimizations

Streamlit Caching: Models loaded once and cached
Batch Processing: Efficient handling of multiple files
Memory Management: Optimized for large repositories
Progress Tracking: Real-time user feedback

🎮 Advanced Usage Examples

🔧 Programmatic Usage

# Example: Analyzing code with the system
from app import CodeAnalyzer

# Initialize analyzer
analyzer = CodeAnalyzer()

# Analyze code snippet
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

results, prediction, confidence = analyzer.analyze_code(code)

print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.3f}")

# Get individual model results
for result in results:
    print(f"{result.name}: {result.prediction} ({result.confidence:.3f})")

📊 Batch Analysis

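# Example: Analyzing multiple files
# SummarizationEngine is assumed to live in app.py alongside CodeAnalyzer
from app import CodeAnalyzer, SummarizationEngine

analyzer = CodeAnalyzer()
files = ['file1.py', 'file2.py', 'file3.py']
results = []

for file_path in files:
    # Read as UTF-8 so results do not depend on the platform default encoding
    with open(file_path, 'r', encoding='utf-8') as f:
        code = f.read()

    file_result = analyzer.analyze_file(file_path, code)
    results.append(file_result)

# Generate summary
summary = SummarizationEngine.summarize_file_analysis(results)
print(f"AI Files: {summary['ai_files']}/{summary['total_files']}")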

🔧 Configuration & Customization

⚙️ Model Configuration

You can customize which models to use:

# In app.py, modify the models dictionary
model_config = {
    'logistic': True,      # Enable/disable Logistic Regression
    'random_forest': True, # Enable/disable Random Forest  
    'gradient_boost': True,# Enable/disable Gradient Boosting
    'xgboost': True       # Enable/disable XGBoost
}

🎨 UI Customization

Modify the Streamlit interface:

# Custom page configuration
st.set_page_config(
    page_title="Custom AI Detector",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom styling
st.markdown("""
<style>
    .main-header { color: #1e88e5; }
    .prediction-ai { background-color: #ffebee; }
    .prediction-human { background-color: #e8f5e8; }
</style>
""", unsafe_allow_html=True)

📈 Performance Tuning

For Large Repositories

# Adjust these parameters in app.py
MAX_FILES_ANALYZE = 100      # Limit files to analyze
LINE_CONFIDENCE_THRESHOLD = 0.7  # Higher threshold for line analysis
ENABLE_LINE_ANALYSIS = False     # Disable for faster processing

Memory Optimization

# Process files in batches
# (files is the list of paths to analyze; process_batch is your own handler)
BATCH_SIZE = 10
for i in range(0, len(files), BATCH_SIZE):
    batch = files[i:i+BATCH_SIZE]
    process_batch(batch)

🚀 Model Training Guide

📚 Dataset Preparation

Organize your training data in this structure:

Dataset/
├── Python/
│   ├── AI/           # AI-generated Python code samples
│   │   └── A1.py, A2.py, ...
│   └── HUMAN/        # Human-written Python code samples  
│       └── H1.py, H2.py, ...
├── Java/
│   ├── AI/           # AI-generated Java code samples
│   └── HUMAN/        # Human-written Java code samples
└── JS/
    ├── AI/           # AI-generated JavaScript code samples  
    └── HUMAN/        # Human-written JavaScript code samples
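
A loader for this layout might look like the sketch below. It assumes one file extension per language folder and labels taken from the `AI` and `HUMAN` directory names; `load_dataset` is a hypothetical name, not necessarily what ml_train.py uses.

```python
from pathlib import Path

# Assumed mapping from language folder to file extension.
LANG_EXTENSIONS = {"Python": ".py", "Java": ".java", "JS": ".js"}

def load_dataset(root):
    """Return a list of (code, label, language) tuples found under root."""
    samples = []
    for lang, ext in LANG_EXTENSIONS.items():
        for label in ("AI", "HUMAN"):
            folder = Path(root) / lang / label
            if not folder.is_dir():
                continue   # tolerate missing languages or classes
            for path in sorted(folder.glob(f"*{ext}")):
                code = path.read_text(encoding="utf-8", errors="replace")
                samples.append((code, label, lang))
    return samples
```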

🎯 Training Classical ML Models

# Train all classical models with cross-validation
python ml_train.py

Training Process:

Data Loading: Loads code samples from Dataset/ directories
Preprocessing: TF-IDF vectorization with character n-grams
Class Balancing: Handles imbalanced datasets with class weights
Model Training: Trains 4 different algorithms with hyperparameter tuning
Validation: Stratified cross-validation for robust evaluation
Model Saving: Saves trained models to model/ directory
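
For the class balancing step, a common choice is the "balanced" heuristic, which scikit-learn applies when a model is created with `class_weight="balanced"`: each class gets weight n_samples / (n_classes * class_count). Whether ml_train.py uses this exact scheme is an assumption; the sketch below only shows the arithmetic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

weights = balanced_class_weights(["AI"] * 300 + ["HUMAN"] * 100)
print(weights)
```

With 300 AI and 100 human samples, the minority class receives three times the weight of the majority class, so errors on human samples cost more during training.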

Expected Output (figures are illustrative and depend on your dataset):

Loading dataset...
Found 1000 Python samples (500 AI, 500 Human)
Training Logistic Regression... Accuracy: 0.85
Training Random Forest...      Accuracy: 0.88  
Training Gradient Boosting...  Accuracy: 0.87
Training XGBoost...           Accuracy: 0.89
Models saved to model/ directory

🤖 Training Deep Learning Models

# Train transformer models (requires GPU for optimal speed)
python dl_train.py

Supported Models:

CodeBERT: Microsoft's code understanding model
CodeT5: Salesforce's encoder-decoder model for code understanding and generation
GraphCodeBERT: Enhanced with data flow understanding

Training Features:

Custom Trainer: Weighted loss for class imbalance
Early Stopping: Prevents overfitting
Learning Rate Scheduling: Optimizes training convergence
Evaluation Metrics: F1-macro score for balanced evaluation
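
F1-macro averages the per-class F1 scores with equal weight, so the minority class counts as much as the majority class. The standard-library sketch below shows the computation; in practice `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

score = f1_macro(["AI", "AI", "AI", "HUMAN"], ["AI", "AI", "HUMAN", "HUMAN"])
print(round(score, 3))
```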

🔧 Performance Tips

Faster Analysis

Disable Line Analysis: For quick file-level predictions only
Use Fewer Models: Enable only fast models (Logistic, Random Forest)
Batch Processing: Analyze multiple files together
GPU Acceleration: Use CUDA for transformer models

Better Accuracy

Enable All Models: Use full ensemble for best results
Line Analysis: Enable for detailed insights
Large Training Data: More diverse training samples improve accuracy
Regular Retraining: Update models with new AI-generated code patterns

📊 Understanding Model Behavior

Why Models Disagree

Different Feature Focus: Each model looks at different code aspects
Training Data Variance: Models trained on slightly different samples
Algorithm Differences: Linear vs tree-based vs ensemble approaches
Overfitting: Some models may overfit to specific patterns

When to Trust Results

High Agreement: All 4 models agree → High confidence
High Confidence: Individual confidence scores > 0.8
Line Consistency: File prediction matches line analysis
Pattern Recognition: Clear AI/Human coding patterns detected

🤝 Contributing

We welcome contributions to improve the AI detection system!

🎯 Areas for Contribution

New Programming Languages
- Add support for C++, Go, Rust, etc.
- Language-specific pattern detection
- Training data collection
Model Improvements
- Advanced ensemble techniques
- New feature engineering approaches
- Deep learning architecture improvements
User Interface Enhancements
- Better visualization components
- Real-time analysis features
- API endpoint development
Dataset Expansion
- More diverse AI-generated code samples
- Different AI model outputs (GPT, Claude, etc.)
- Domain-specific code samples

📋 Development Setup

# 1. Fork and clone the repository
git clone https://github.com/your-username/Code_Detector.git
cd Code_Detector

# 2. Create development environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 3. Install development dependencies  
pip install -r requirements.txt
pip install pytest black flake8  # Additional dev tools

# 4. Run tests
pytest tests/

# 5. Format code
black .
flake8 .

🔄 Contribution Workflow

Create Issue: Describe the feature/bug
Fork Repository: Create your own copy
Create Branch: git checkout -b feature/your-feature
Make Changes: Implement your improvements
Add Tests: Ensure functionality works
Submit PR: Create pull request with description

🆘 Support & Community

📞 Getting Help

Issues: GitHub Issues
Discussions: GitHub Discussions

📈 Roadmap

Multi-language expansion (C++, Go, Rust)
Real-time API endpoints for integration
Advanced visualizations for pattern analysis
Cloud deployment options
Mobile app for on-the-go analysis
Plugin development for popular IDEs

🌟 Star History

If you find this project useful, please ⭐ star it on GitHub to help others discover it!

Built with ❤️ for the developer community

Empowering developers with intelligent AI detection capabilities
