Advanced machine learning system for detecting whether code was written by artificial intelligence or humans. Features intelligent ensemble models, GitHub repository analysis, and comprehensive explainability with smart contradiction detection.
- Single Code Analysis: Analyze individual code snippets with detailed breakdown
- GitHub Repository Scanning: Complete repository analysis with file-by-file insights
- Batch File Processing: Upload and analyze multiple files simultaneously
- Ensemble ML Models: 4 classical models (LogisticRegression, RandomForest, GradientBoosting, XGBoost)
- Smart Voting System: Advanced consensus mechanism with confidence weighting
- Contradiction Detection: Automatically corrects predictions when line-level analysis conflicts with file-level results
- Multi-Language Support: Python, Java, and JavaScript code detection
- Line-by-Line Breakdown: Detailed analysis of individual code lines with pattern detection
- Confidence Scoring: A confidence score for every prediction
- Model Agreement Tracking: Shows which models agree/disagree and why
- Pattern Recognition: Detects coding patterns like functions, loops, imports, etc.
- Consistency Validation: Cross-validates file-level vs line-level predictions
- Repository Scanning: Analyzes entire GitHub repositories automatically
- Progress Tracking: Real-time analysis progress with status updates
- Comprehensive Reports: Downloadable analysis reports with detailed insights
- API Integration: Direct GitHub API integration for seamless repository access
Code_Detector/
├── Web Application
│   └── app.py                      # Main Streamlit application with 3 analysis modes
├── Machine Learning Pipeline
│   ├── ml_train.py                 # Classical ML model training (4 algorithms)
│   └── dl_train.py                 # Deep learning model training (Transformers)
├── Data & Models
│   ├── Dataset/                    # Training data organized by language
│   │   ├── Python/                 # Python samples (AI vs HUMAN)
│   │   ├── Java/                   # Java samples (AI vs HUMAN)
│   │   └── JS/                     # JavaScript samples (AI vs HUMAN)
│   ├── model/                      # Trained classical ML models
│   │   ├── logisticregression.pkl
│   │   ├── randomforest.pkl
│   │   ├── gradientboosting.pkl
│   │   ├── xgboost.pkl
│   │   ├── vectorizer.pkl
│   │   └── labelencoder.pkl
│   └── output/                     # Trained transformer models
│       ├── CodeBERT/               # Microsoft CodeBERT model
│       ├── CodeT5/                 # Salesforce CodeT5 model
│       └── GraphCodeBERT/          # Microsoft GraphCodeBERT model
├── Documentation
│   ├── README.md                   # This file
│   └── requirements.txt            # Python dependencies
└── Cache & Temp Files
    └── __pycache__/                # Python bytecode cache
- Python 3.8 or higher
- 4GB+ RAM (8GB+ recommended for transformer models)
- Internet connection (for GitHub repository analysis)
# Clone the repository
git clone https://github.com/muhammadnavas/Code_Detector.git
cd Code_Detector
# Install required dependencies
pip install -r requirements.txt
# Start the Streamlit web interface
streamlit run app.py
Access the app at: http://localhost:8501
If you want to retrain models with custom data:
# Train classical ML models (faster, CPU-friendly)
python ml_train.py
# Train transformer models (requires GPU for optimal performance)
python dl_train.py
Perfect for analyzing individual code snippets:
- Input Methods:
  - Paste code directly into the text area
  - Upload single Python/Java/JavaScript files
- Analysis Output:
  - Overall Prediction: AI vs Human with confidence score
  - Model Breakdown: Individual model predictions and confidence
  - Line-by-Line Analysis: Detailed analysis of each code line
  - Pattern Detection: Identified coding patterns and structures
Comprehensive analysis of entire GitHub repositories:
- Repository Input:
  https://github.com/username/repository
- Analysis Process:
  - Auto-Discovery: Finds all Python files in the repository
  - Progress Tracking: Real-time analysis with progress indicators
  - Summary Statistics: Repository-wide AI vs Human breakdown
  - File-by-File Results: Detailed analysis for each file
- Advanced Features:
  - Smart Corrections: Automatically corrects contradictory predictions
  - Warning System: Flags suspicious patterns or inconsistencies
  - Report Generation: Download comprehensive analysis reports
Upload and analyze multiple files simultaneously:
- Multi-File Upload: Support for .py, .java, and .js files
- Batch Processing: Analyze all files with progress tracking
- Consolidated Results: Summary statistics across all uploaded files
Our ensemble combines four classical models and aggregates their individual predictions:
- Logistic Regression
  - Linear classification with TF-IDF features
  - Fast prediction, good baseline performance
  - Confidence: Probability scores from the sigmoid function
- Random Forest
  - Ensemble of decision trees with voting
  - Handles feature interactions well
  - Confidence: Proportion of trees voting for the predicted class
- Gradient Boosting
  - Sequential ensemble in which each tree corrects the errors of the previous ones
  - Strong performance on structured data
  - Confidence: Class probability from the boosted ensemble
- XGBoost
  - Optimized gradient boosting framework
  - Typically among the strongest classical ML models
  - Confidence: Native probability estimation
- Majority Voting: 3+ models must agree for high confidence
- Confidence Weighting: Uses model-specific confidence scores
- Contradiction Detection: Compares file-level vs line-level predictions
- Smart Corrections: Automatically adjusts predictions when inconsistencies detected
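The voting rules above can be sketched in a few lines. This is a minimal illustration, not the code in app.py: the function name `weighted_vote` and the idea of summing each model's confidence per label are assumptions made for the example.

```python
from collections import defaultdict


def weighted_vote(predictions):
    """Combine (label, confidence) pairs from several models.

    Hypothetical helper: each model's confidence is used as its vote weight.
    Returns (winning_label, share_of_total_weight, models_agreeing).
    """
    if not predictions:
        raise ValueError("need at least one model prediction")
    weights = defaultdict(float)
    counts = defaultdict(int)
    for label, confidence in predictions:
        weights[label] += confidence
        counts[label] += 1
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all confidences are zero")
    winner = max(weights, key=weights.get)
    return winner, weights[winner] / total, counts[winner]


# Three of four models say AI, so the "3+ models agree" rule is met.
votes = [("AI", 0.9), ("AI", 0.7), ("HUMAN", 0.6), ("AI", 0.8)]
label, share, agreeing = weighted_vote(votes)
print(label, round(share, 2), agreeing)
```

A weighted sum lets one very confident model outweigh two hesitant ones, which plain majority voting cannot do; whether that is desirable depends on how well calibrated each model's probabilities are.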
- Smart Filtering: Skips comments, imports, and trivial lines
- Pattern Detection: Identifies functions, loops, conditionals, etc.
- Confidence Thresholding: Only includes high-confidence line predictions (>60%)
- Context Preservation: Maintains code structure understanding
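The filtering and thresholding steps above might look roughly like the sketch below. The names `is_trivial`, `select_lines`, and `score_line` are hypothetical, and `score_line` stands in for whatever model call the application actually makes on a single line.

```python
LINE_CONFIDENCE_THRESHOLD = 0.6  # matches the >60% rule described above

# Assumed set of lines that carry too little signal to classify.
TRIVIAL_TOKENS = {"pass", "{", "}", "else:", "break", "continue"}


def is_trivial(line):
    """Return True for blank lines, comments, imports, and trivial tokens."""
    stripped = line.strip()
    if not stripped or stripped in TRIVIAL_TOKENS:
        return True
    if stripped.startswith(("#", "//", "/*")):
        return True
    return stripped.startswith(("import ", "from "))


def select_lines(code, score_line):
    """Keep (line_number, label, confidence) for confident, non-trivial lines."""
    kept = []
    for number, line in enumerate(code.splitlines(), start=1):
        if is_trivial(line):
            continue
        label, confidence = score_line(line)
        if confidence > LINE_CONFIDENCE_THRESHOLD:
            kept.append((number, label, confidence))
    return kept


# Dummy scorer for demonstration only; a real one would call the models.
def fake_scorer(line):
    return ("AI", 0.9) if "def " in line else ("HUMAN", 0.5)


sample = "import os\n# helper\n\ndef f(x):\n    return x + 1\n"
selected = select_lines(sample, fake_scorer)
print(selected)
```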
Our system automatically detects and corrects contradictory predictions:
# Example: File predicted as AI, but 73% of lines are Human
Original Prediction: AI (confidence: 0.86)
Line Analysis: 73% Human lines
Smart Correction: → HUMAN (adjusted confidence: 0.72)
Status: [PREDICTION CORRECTED: AI → HUMAN]
Detects various coding patterns:
- Structural: Functions, classes, imports
- Control Flow: Loops, conditionals, exception handling
- Modern Python: F-strings, list comprehensions, lambda functions
- Style Indicators: Docstrings, comments, naming conventions
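As an illustration of how such patterns could be flagged per line, here is a small regex-based sketch for Python source. The pattern names and regular expressions are assumptions for the example and are deliberately simple; a production detector would more likely parse the code with the `ast` module.

```python
import re

# Hypothetical pattern table; each regex is a heuristic, not a parser.
PATTERNS = {
    "function": re.compile(r"^\s*(async\s+)?def\s+\w+\s*\("),
    "class": re.compile(r"^\s*class\s+\w+"),
    "import": re.compile(r"^\s*(import\s+\w|from\s+[\w.]+\s+import\s)"),
    "loop": re.compile(r"^\s*(for|while)\b"),
    "conditional": re.compile(r"^\s*(if|elif|else)\b"),
    "exception_handling": re.compile(r"^\s*(try|except|finally)\b"),
    "f_string": re.compile(r"""\b[fF][rR]?["']"""),
    "lambda": re.compile(r"\blambda\b"),
    "list_comprehension": re.compile(r"\[[^\]]*\bfor\b[^\]]*\bin\b[^\]]*\]"),
}


def detect_patterns(line):
    """Return the sorted names of all patterns that match a single line."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(line))


for example in ("def add(a, b):", "squares = [x * x for x in range(10)]"):
    print(example, "->", detect_patterns(example))
```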
- High Confidence (>0.8): Very reliable prediction
- Medium Confidence (0.6-0.8): Generally reliable with some uncertainty
- Low Confidence (<0.6): Results may be unreliable, manual review recommended
- Unanimous: All models agree (highest confidence)
- Majority: 3/4 models agree (good confidence)
- Split Decision: 2/2 split (requires careful interpretation)
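The confidence bands and agreement levels listed above translate directly into two small helpers. The function names are hypothetical, and treating exactly 0.8 as "medium" is an assumption, since the thresholds above only say ">0.8" and "0.6-0.8".

```python
from collections import Counter


def confidence_band(confidence):
    """Map a confidence score in [0, 1] to high / medium / low."""
    if confidence > 0.8:
        return "high"
    if confidence >= 0.6:
        return "medium"
    return "low"


def agreement_level(labels):
    """Classify how strongly a set of model labels agree."""
    if not labels:
        raise ValueError("need at least one label")
    top_count = Counter(labels).most_common(1)[0][1]
    if top_count == len(labels):
        return "unanimous"
    if top_count * 2 > len(labels):
        return "majority"
    return "split"


print(confidence_band(0.86), agreement_level(["AI", "AI", "AI", "HUMAN"]))
```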
- Consistent: File and line predictions align
- Mixed Signals: Some disagreement between levels
- Auto-Corrected: System detected and fixed contradiction
- Major Contradiction: Significant disagreement requiring manual review
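One way to derive these four statuses is to measure how many line-level labels disagree with the file-level label. The sketch below is an assumption-heavy illustration: the function name `check_consistency` and the cut-offs `flip_at=0.7` and `mixed_at=0.3` are invented for the example, and it does not attempt to reproduce how the application adjusts the confidence score after a correction.

```python
from collections import Counter


def check_consistency(file_label, line_labels, flip_at=0.7, mixed_at=0.3):
    """Compare a file-level label with line-level labels.

    Returns (final_label, status). Thresholds are illustrative assumptions.
    """
    if not line_labels:
        return file_label, "consistent"
    disagreeing = sum(1 for label in line_labels if label != file_label)
    ratio = disagreeing / len(line_labels)
    if ratio >= flip_at:
        # Overwhelming line-level evidence: adopt the most common line label.
        corrected = Counter(line_labels).most_common(1)[0][0]
        return corrected, "auto_corrected"
    if ratio > 0.5:
        return file_label, "major_contradiction"
    if ratio >= mixed_at:
        return file_label, "mixed_signals"
    return file_label, "consistent"


# File predicted as AI, but 73% of lines look human-written.
lines = ["HUMAN"] * 73 + ["AI"] * 27
print(check_consistency("AI", lines))
```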
# Core Framework
streamlit>=1.28.0 # Web application framework
# Machine Learning
scikit-learn>=1.3.0 # Classical ML algorithms
xgboost>=1.7.0 # Gradient boosting framework
numpy>=1.24.0 # Numerical computing
pandas>=2.0.0 # Data manipulation
# Deep Learning (Optional)
torch>=2.0.0 # PyTorch framework
transformers>=4.30.0 # Hugging Face transformers
# Web & API
requests>=2.31.0 # HTTP requests for GitHub API
joblib>=1.3.0 # Model serialization
# Utilities
pathlib # Path handling (built-in)
re # Regular expressions (built-in)
typing # Type hints (built-in)
- Code Cleaning: Remove excess whitespace, normalize line endings
- TF-IDF Vectorization: Character-level n-grams (3-5) for classical models
- Feature Extraction: Syntactic patterns, complexity metrics
- Tokenization: Language-specific tokenization for transformers
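To make the first two steps concrete, the sketch below normalizes a snippet and counts its character n-grams of length 3 to 5 using only the standard library. The helper names are hypothetical. In the training script the same features are presumably produced by scikit-learn's `TfidfVectorizer` with `analyzer="char"` and `ngram_range=(3, 5)`, which additionally applies TF-IDF weighting.

```python
from collections import Counter


def normalize(code):
    """Normalize line endings and strip trailing whitespace on each line."""
    lines = code.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip("\n")


def char_ngrams(text, n_min=3, n_max=5):
    """Count every character n-gram with n_min <= n <= n_max."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


cleaned = normalize("x = 1  \r\ny = 2\r\n")
grams = char_ngrams(cleaned)
print(repr(cleaned), len(grams))
```

Character n-grams capture spacing, punctuation, and naming habits without needing a language-specific tokenizer, which is why they work across Python, Java, and JavaScript alike.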
- Syntactic Patterns: Language constructs (functions, classes, loops)
- Stylistic Features: Naming conventions, spacing patterns
- Complexity Metrics: Code depth, nesting levels, line lengths
- AI Indicators: Patterns typical in AI-generated code
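The complexity metrics above can be approximated from indentation and line length alone. This is a rough sketch with hypothetical names; it assumes 4-space indentation (tabs are expanded to that width) and ignores blank lines.

```python
def complexity_metrics(code, indent_width=4):
    """Return simple size and nesting statistics for a code snippet."""
    lines = [line for line in code.splitlines() if line.strip()]
    if not lines:
        return {"line_count": 0, "avg_line_length": 0.0,
                "max_line_length": 0, "max_nesting": 0}
    depths = []
    for line in lines:
        expanded = line.expandtabs(indent_width)
        indent = len(expanded) - len(expanded.lstrip(" "))
        depths.append(indent // indent_width)
    lengths = [len(line.rstrip()) for line in lines]
    return {
        "line_count": len(lines),
        "avg_line_length": sum(lengths) / len(lengths),
        "max_line_length": max(lengths),
        "max_nesting": max(depths),
    }


snippet = "def f(x):\n    if x:\n        return 1\n    return 0\n"
metrics = complexity_metrics(snippet)
print(metrics)
```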
# Smart model loading with caching
import joblib
import streamlit as st

@st.cache_resource  # load once; Streamlit reuses the result across reruns
def load_models():
    models = {
        'logistic': joblib.load('model/logisticregression.pkl'),
        'rf': joblib.load('model/randomforest.pkl'),
        'gb': joblib.load('model/gradientboosting.pkl'),
        'xgb': joblib.load('model/xgboost.pkl')
    }
    vectorizer = joblib.load('model/vectorizer.pkl')
    return models, vectorizer
- Rate Limiting: Respects GitHub API limits
- Error Handling: Robust error handling for network issues
- Recursive Scanning: Deep repository traversal for Python files
- Content Processing: Handles various file encodings
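A sketch of the repository-discovery step, under stated assumptions: it uses GitHub's Git Trees endpoint with `recursive=1`, which returns a `tree` list whose entries carry `path` and `type` fields. The helper names and the default branch name `main` are hypothetical, and no network request is made here; the payload in the demo is hand-written.

```python
from urllib.parse import urlparse


def parse_repo_url(url):
    """Extract (owner, repo) from a https://github.com/owner/repo URL."""
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.netloc.lower() not in ("github.com", "www.github.com") or len(parts) < 2:
        raise ValueError(f"not a GitHub repository URL: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return owner, repo


def tree_api_url(owner, repo, ref="main"):
    """Build the recursive tree listing URL (ref 'main' is an assumption)."""
    return f"https://api.github.com/repos/{owner}/{repo}/git/trees/{ref}?recursive=1"


def python_files(tree_payload):
    """Pick the paths of .py blobs out of a Git Trees API response."""
    return [entry["path"] for entry in tree_payload.get("tree", [])
            if entry.get("type") == "blob" and entry["path"].endswith(".py")]


owner, repo = parse_repo_url("https://github.com/muhammadnavas/Code_Detector.git")
sample_payload = {"tree": [
    {"path": "app.py", "type": "blob"},
    {"path": "Dataset", "type": "tree"},
    {"path": "Dataset/Python/AI/A1.py", "type": "blob"},
    {"path": "README.md", "type": "blob"},
]}
print(tree_api_url(owner, repo))
print(python_files(sample_payload))
```

A single recursive tree request costs one API call regardless of repository depth, which helps stay within rate limits; very large repositories may come back with `truncated` set to true and then need directory-by-directory listing.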
- Streamlit Caching: Models loaded once and cached
- Batch Processing: Efficient handling of multiple files
- Memory Management: Optimized for large repositories
- Progress Tracking: Real-time user feedback
# Example: Analyzing code with the system
# (assumes app.py exposes both CodeAnalyzer and SummarizationEngine)
from app import CodeAnalyzer, SummarizationEngine

# Initialize analyzer
analyzer = CodeAnalyzer()

# Analyze code snippet
code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""
results, prediction, confidence = analyzer.analyze_code(code)
print(f"Prediction: {prediction}")
print(f"Confidence: {confidence:.3f}")

# Get individual model results
for result in results:
    print(f"{result.name}: {result.prediction} ({result.confidence:.3f})")

# Example: Analyzing multiple files
files = ['file1.py', 'file2.py', 'file3.py']
results = []
for file_path in files:
    with open(file_path, 'r', encoding='utf-8') as f:
        code = f.read()
    file_result = analyzer.analyze_file(file_path, code)
    results.append(file_result)

# Generate summary
summary = SummarizationEngine.summarize_file_analysis(results)
print(f"AI Files: {summary['ai_files']}/{summary['total_files']}")
You can customize which models to use:
# In app.py, modify the models dictionary
model_config = {
    'logistic': True,        # Enable/disable Logistic Regression
    'random_forest': True,   # Enable/disable Random Forest
    'gradient_boost': True,  # Enable/disable Gradient Boosting
    'xgboost': True          # Enable/disable XGBoost
}
Modify the Streamlit interface:
# Custom page configuration
import streamlit as st

st.set_page_config(
    page_title="Custom AI Detector",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom styling
st.markdown("""
<style>
.main-header { color: #1e88e5; }
.prediction-ai { background-color: #ffebee; }
.prediction-human { background-color: #e8f5e8; }
</style>
""", unsafe_allow_html=True)
# Adjust these parameters in app.py
MAX_FILES_ANALYZE = 100 # Limit files to analyze
LINE_CONFIDENCE_THRESHOLD = 0.7 # Higher threshold for line analysis
ENABLE_LINE_ANALYSIS = False # Disable for faster processing
# Process files in batches
BATCH_SIZE = 10
for i in range(0, len(files), BATCH_SIZE):
    batch = files[i:i+BATCH_SIZE]
    process_batch(batch)
Organize your training data in this structure:
Dataset/
├── Python/
│   ├── AI/                 # AI-generated Python code samples
│   │   └── A1.py, A2.py, ...
│   └── HUMAN/              # Human-written Python code samples
│       └── H1.py, H2.py, ...
├── Java/
│   ├── AI/                 # AI-generated Java code samples
│   └── HUMAN/              # Human-written Java code samples
└── JS/
    ├── AI/                 # AI-generated JavaScript code samples
    └── HUMAN/              # Human-written JavaScript code samples
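A loader for this layout could look like the sketch below. It is not taken from ml_train.py: the function name `load_dataset`, the extension mapping, and the returned tuple shape are assumptions for illustration. The demo builds a tiny throwaway dataset in a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path

# Assumed mapping from language folder to file extension.
EXTENSIONS = {"Python": ".py", "Java": ".java", "JS": ".js"}


def load_dataset(root):
    """Return a list of (source_code, label, language) tuples."""
    samples = []
    for language, extension in EXTENSIONS.items():
        for label in ("AI", "HUMAN"):
            folder = Path(root) / language / label
            if not folder.is_dir():
                continue  # tolerate languages with no samples yet
            for path in sorted(folder.glob(f"*{extension}")):
                text = path.read_text(encoding="utf-8", errors="replace")
                samples.append((text, label, language))
    return samples


# Demo with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "Dataset"
    (root / "Python" / "AI").mkdir(parents=True)
    (root / "Python" / "HUMAN").mkdir(parents=True)
    (root / "Python" / "AI" / "A1.py").write_text("print('hi')\n", encoding="utf-8")
    (root / "Python" / "HUMAN" / "H1.py").write_text("x = 1\n", encoding="utf-8")
    demo = load_dataset(root)

print([(label, language) for _, label, language in demo])
```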
# Train all classical models with cross-validation
python ml_train.py
Training Process:
- Data Loading: Loads code samples from Dataset/ directories
- Preprocessing: TF-IDF vectorization with character n-grams
- Class Balancing: Handles imbalanced datasets with class weights
- Model Training: Trains 4 different algorithms with hyperparameter tuning
- Validation: Stratified cross-validation for robust evaluation
- Model Saving: Saves trained models to the model/ directory
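For the class-balancing step, a common choice is the "balanced" weighting, where each class gets weight n_samples / (n_classes * class_count). This is the formula scikit-learn applies for `class_weight="balanced"`; whether ml_train.py uses exactly this setting is an assumption. A standard-library sketch with a hypothetical helper name:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    weight(c) = n_samples / (n_classes * count(c))
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Six AI samples and two human samples: the rarer class gets the larger weight.
weights = balanced_class_weights(["AI"] * 6 + ["HUMAN"] * 2)
print(weights)
```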
Expected Output:
Loading dataset...
Found 1000 Python samples (500 AI, 500 Human)
Training Logistic Regression... Accuracy: 0.85
Training Random Forest... Accuracy: 0.88
Training Gradient Boosting... Accuracy: 0.87
Training XGBoost... Accuracy: 0.89
Models saved to model/ directory
# Train transformer models (requires GPU for optimal speed)
python dl_train.py
Supported Models:
- CodeBERT: Microsoft's code understanding model
- CodeT5: Salesforce's code generation model
- GraphCodeBERT: Enhanced with data flow understanding
Training Features:
- Custom Trainer: Weighted loss for class imbalance
- Early Stopping: Prevents overfitting
- Learning Rate Scheduling: Optimizes training convergence
- Evaluation Metrics: F1-macro score for balanced evaluation
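For reference, the macro-averaged F1 score computes F1 separately for each class and then takes the unweighted mean, so the minority class counts as much as the majority class. A minimal standard-library sketch (the function name is hypothetical; in practice `sklearn.metrics.f1_score` with `average="macro"` does the same job):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


truth = ["AI", "AI", "AI", "HUMAN"]
predicted = ["AI", "AI", "HUMAN", "HUMAN"]
score = macro_f1(truth, predicted)
print(round(score, 4))
```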
- Disable Line Analysis: For quick file-level predictions only
- Use Fewer Models: Enable only fast models (Logistic, Random Forest)
- Batch Processing: Analyze multiple files together
- GPU Acceleration: Use CUDA for transformer models
- Enable All Models: Use full ensemble for best results
- Line Analysis: Enable for detailed insights
- Large Training Data: More diverse training samples improve accuracy
- Regular Retraining: Update models with new AI-generated code patterns
- Different Feature Focus: Each model looks at different code aspects
- Training Data Variance: Models trained on slightly different samples
- Algorithm Differences: Linear vs tree-based vs ensemble approaches
- Overfitting: Some models may overfit to specific patterns
- High Agreement: All 4 models agree → High confidence
- High Confidence: Individual confidence scores > 0.8
- Line Consistency: File prediction matches line analysis
- Pattern Recognition: Clear AI/Human coding patterns detected
We welcome contributions to improve the AI detection system!
- New Programming Languages
  - Add support for C++, Go, Rust, etc.
  - Language-specific pattern detection
  - Training data collection
- Model Improvements
  - Advanced ensemble techniques
  - New feature engineering approaches
  - Deep learning architecture improvements
- User Interface Enhancements
  - Better visualization components
  - Real-time analysis features
  - API endpoint development
- Dataset Expansion
  - More diverse AI-generated code samples
  - Different AI model outputs (GPT, Claude, etc.)
  - Domain-specific code samples
# 1. Fork and clone the repository
git clone https://github.com/your-username/Code_Detector.git
cd Code_Detector
# 2. Create development environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install development dependencies
pip install -r requirements.txt
pip install pytest black flake8 # Additional dev tools
# 4. Run tests
pytest tests/
# 5. Format code
black .
flake8 .
- Create Issue: Describe the feature/bug
- Fork Repository: Create your own copy
- Create Branch: git checkout -b feature/your-feature
- Make Changes: Implement your improvements
- Add Tests: Ensure functionality works
- Submit PR: Create pull request with description
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Multi-language expansion (C++, Go, Rust)
- Real-time API endpoints for integration
- Advanced visualizations for pattern analysis
- Cloud deployment options
- Mobile app for on-the-go analysis
- Plugin development for popular IDEs
If you find this project useful, please ⭐ star it on GitHub to help others discover it!
Built with ❤️ for the developer community
Empowering developers with intelligent AI detection capabilities