Enhance structured_resume.py to include keyword extraction from resumes. #500
Conversation
…e to raise ResumeKeywordExtractionError when extracted keywords are not present.
Walkthrough

The prompt for structured resume parsing was updated to include keyword extraction. In the scoring service, a defensive check now raises ResumeKeywordExtractionError when the deserialized extracted_keywords is None, preventing downstream attribute access on None. No public APIs or signatures changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Backend
    participant LLM as LLM (Structured Resume Prompt)
    participant Service as score_improvement_service
    Client->>Backend: Submit resume
    Backend->>LLM: Prompt (now includes "extract keywords")
    LLM-->>Backend: JSON with extracted_keywords
    Backend->>Service: process(extracted_keywords_json)
    Service->>Service: parsed = json.loads(...)
    alt parsed is None
        Service-->>Backend: Raise ResumeKeywordExtractionError
        note right of Service: New defensive check on null
    else parsed is object
        Service-->>Backend: Continue with parsed.get(...)
    end
    Backend-->>Client: Response or error
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
apps/backend/app/services/score_improvement_service.py (3)
73-79: Symmetric None-handling missing for job keywords (possible AttributeError path)

If processed_job.extracted_keywords is "null", json.loads returns None and .get will raise AttributeError, which escapes your JSONDecodeError except. Mirror the resume fix here.
Apply this diff:

```diff
         try:
-            keywords_data = json.loads(processed_job.extracted_keywords)
-            keywords = keywords_data.get("extracted_keywords", [])
+            keywords_data = json.loads(processed_job.extracted_keywords)
+            if keywords_data is None:
+                raise JobKeywordExtractionError(job_id=job_id)
+            keywords = keywords_data.get("extracted_keywords", [])
             if not keywords or len(keywords) == 0:
                 raise JobKeywordExtractionError(job_id=job_id)
-        except json.JSONDecodeError:
+        except (json.JSONDecodeError, TypeError, AttributeError):
             raise JobKeywordExtractionError(job_id=job_id)
```
311-314: Streaming bug: iterating a string yields characters, not suggestions

updated_resume is a string (see run(), which renders it as markdown), so streaming per character is incorrect.
Use a type check and stream once (or split into chunks) instead:

```diff
-        for i, suggestion in enumerate(updated_resume):
-            yield f"data: {json.dumps({'status': 'suggestion', 'index': i, 'text': suggestion})}\n\n"
-            await asyncio.sleep(0.2)
+        if isinstance(updated_resume, str):
+            yield f"data: {json.dumps({'status': 'suggestion', 'index': 0, 'text': updated_resume})}\n\n"
+        else:
+            for i, suggestion in enumerate(updated_resume):
+                yield f"data: {json.dumps({'status': 'suggestion', 'index': i, 'text': suggestion})}\n\n"
+                await asyncio.sleep(0.2)
```
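A quick pure-Python illustration of the failure mode described above (variable names are illustrative, not the service's actual code):

```python
# enumerate() over a str yields one character per iteration, so each
# streamed SSE event would carry a single character of the resume text
# rather than a whole suggestion.
updated_resume = "Improved resume text"
events = [(i, chunk) for i, chunk in enumerate(updated_resume)]

assert events[0] == (0, "I")                 # first "suggestion" is just the letter "I"
assert len(events) == len(updated_resume)    # one event per character, not per suggestion
```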
190-191: PII leakage: full resume content logged in prompt

The structured resume prompt includes the raw resume text. This is sensitive and should not be written to logs.
Redact the log to a hash and lower the level:

```diff
-        logger.info(f"Structured Resume Prompt: {prompt}")
+        logger.debug("Structured Resume Prompt redacted. prompt_sha256=%s",
+                     hashlib.sha256(prompt.encode("utf-8")).hexdigest())
```

Add this import at the top of the file:

```diff
+import hashlib
```
🧹 Nitpick comments (7)
apps/backend/app/services/score_improvement_service.py (6)
135-141: Guard against zero vectors in cosine similarity

If either embedding has zero norm, you'll return inf/nan. Return 0.0 instead for robustness.
```diff
-        return float(np.dot(ejk, re) / (np.linalg.norm(ejk) * np.linalg.norm(re)))
+        denom = float(np.linalg.norm(ejk) * np.linalg.norm(re))
+        if denom == 0.0 or not np.isfinite(denom):
+            return 0.0
+        return float(np.dot(ejk, re) / denom)
```
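The service uses NumPy, but the failure mode and the guard are easy to see in a pure-stdlib sketch (function name and shapes are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity with a guard for zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if denom == 0.0 or not math.isfinite(denom):
        # A zero-norm embedding has no direction; without this guard the
        # naive formula divides by zero (nan/inf in NumPy, an exception here).
        return 0.0
    return dot / denom

assert cosine_similarity([0.0, 0.0], [1.0, 2.0]) == 0.0          # guarded zero vector
assert abs(cosine_similarity([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-9  # identical direction
```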
168-170: Parameter order nit for cosine similarity

The computation is symmetric, but passing arguments in the declared order improves readability.
```diff
-                score = self.calculate_cosine_similarity(
-                    emb, extracted_job_keywords_embedding
-                )
+                score = self.calculate_cosine_similarity(
+                    extracted_job_keywords_embedding, emb
+                )
```
155-179: Consider continuing all attempts and keeping the best result

Right now you return on the first improvement; given LLM stochasticity, later attempts might beat the early win.
```diff
-                if score > best_score:
-                    return improved, score
+                if score > best_score:
+                    best_resume, best_score = improved, score
+                    continue
@@
         return best_resume, best_score
```
210-218: Avoid repeated JSON loads and joins; centralize keyword extraction

The same pattern appears here and again in run_and_stream(). Consider a small helper to DRY and validate once.
Example helper (place it as a private method in this class):

```python
def _keywords_csv(self, payload: str) -> str:
    try:
        data = json.loads(payload)
        if not data:
            return ""
        return ", ".join(data.get("extracted_keywords", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        return ""
```

Then call self._keywords_csv(processed_job.extracted_keywords) etc.
70-79: Add tests to cover "null" keyword payloads and the streaming path

Given the new validation, add unit tests where extracted_keywords is "null" for both resume and job, and verify the JobKeywordExtractionError/ResumeKeywordExtractionError paths and the streaming behavior.
I can draft pytest cases that insert ProcessedResume/ProcessedJob with extracted_keywords="null" and assert the error path and the fixed streaming logic. Want me to push those?
Also applies to: 210-218, 275-283
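A standalone sketch of the validation such tests would exercise (the function and exception here mirror the service logic under review but are illustrative, not the project's actual API):

```python
import json

class ResumeKeywordExtractionError(Exception):
    """Raised when a resume's extracted_keywords payload is missing, malformed, or empty."""
    def __init__(self, resume_id: str):
        self.resume_id = resume_id
        super().__init__(f"Keyword extraction failed for resume {resume_id}")

def validate_resume_keywords(payload: str, resume_id: str) -> list:
    # json.loads("null") returns None, and payloads like "[]" or '"oops"'
    # are not dicts -- none of these may reach .get() unguarded.
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        raise ResumeKeywordExtractionError(resume_id)
    if not isinstance(data, dict):
        raise ResumeKeywordExtractionError(resume_id)
    keywords = data.get("extracted_keywords", [])
    if not keywords:
        raise ResumeKeywordExtractionError(resume_id)
    return keywords

# Happy path and every bad payload shape fail fast with the domain error:
assert validate_resume_keywords('{"extracted_keywords": ["python", "sql"]}', "r1") == ["python", "sql"]
for bad in ("null", "[]", '"oops"', "{}", "not json"):
    try:
        validate_resume_keywords(bad, "r1")
        raised = False
    except ResumeKeywordExtractionError:
        raised = True
    assert raised
```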
65-80: Optional: log context without leaking content

Exceptions raised here are user-actionable; consider including resume_id/job_id (already included) and avoid logging raw payloads anywhere else in this method.
apps/backend/app/prompt/structured_resume.py (1)
3-9: Make the keyword instruction concrete to improve LLM adherence

Be explicit about field name, type, and constraints to reduce empty or irrelevant outputs.
```diff
-- Do not format the response in Markdown or any other format. Just output raw JSON.
-- Don't forget to extract keywords from the resume.
+- Do not format the response in Markdown or any other format. Just output raw JSON.
+- Extract keywords under the field "extracted_keywords" as a non-empty array of unique, lowercase keywords/phrases (1–3 words each) relevant to the resume; exclude stopwords and boilerplate.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- apps/backend/app/prompt/structured_resume.py (1 hunks)
- apps/backend/app/services/score_improvement_service.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
apps/backend/app/{models,services}/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use SQLAlchemy ORM with async sessions for database access
Files:
apps/backend/app/services/score_improvement_service.py
apps/backend/app/services/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
apps/backend/app/services/**/*.py: Use temporary files for document processing and always clean up after processing
Implement async processing for file operations in backend services
Cache processed results when appropriate in backend services
Files:
apps/backend/app/services/score_improvement_service.py
apps/backend/app/{agent,services}/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use structured prompts and validate AI model responses against Pydantic models
Files:
apps/backend/app/services/score_improvement_service.py
🧬 Code graph analysis (2)
apps/backend/app/services/score_improvement_service.py (2)
apps/backend/app/services/exceptions.py (2)
- ResumeKeywordExtractionError (85-96)
- __init__ (90-96)

apps/backend/app/services/resume_service.py (2)
- _extract_structured_json (201-234)
- _extract_and_store_structured_resume (132-199)
apps/backend/app/prompt/structured_resume.py (1)
apps/backend/app/services/resume_service.py (1)
- _extract_structured_json (201-234)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: cubic · AI code reviewer
🔇 Additional comments (3)
apps/backend/app/services/score_improvement_service.py (2)
55-61: Solid defensive None-check after json.loads for resume keywords

Good catch handling the "null" case produced by json.dumps(None) upstream. This prevents an AttributeError on .get and turns it into a domain-specific ResumeKeywordExtractionError.
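The round-trip that produces the "null" case is worth seeing in isolation (standard-library behavior only):

```python
import json

# Upstream code that serializes a missing result with json.dumps(None)
# stores the literal string "null" in the database column.
payload = json.dumps(None)
assert payload == "null"

# Deserializing gives back None, not a dict, so calling .get() on the
# result would raise AttributeError without the new defensive check.
assert json.loads(payload) is None
```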
253-254: Confirm or add HTML sanitization for Markdown outputs

I ran a repository-wide search for common sanitization libraries and patterns (e.g. bleach, dompurify, sanitize, dangerouslySetInnerHTML) and found no matches. This suggests that the HTML generated by markdown.markdown(...) at lines 253-254 (and similarly at 320-321) is not currently being sanitized before it is sent to the frontend.

- Please verify whether the client code applies a definitive sanitization or strict CSP that would mitigate XSS risks.
- If no such sanitization exists, I strongly recommend adding server-side HTML sanitization (for example, using Bleach as shown below) or ensuring the frontend applies a vetted sanitizer before rendering.

```python
from bleach.sanitizer import Cleaner

allowed_tags = ["p", "ul", "ol", "li", "strong", "em", "h1", "h2", "h3", "code", "pre", "blockquote", "br", "a"]
allowed_attrs = {"a": ["href", "title", "rel", "target"]}
cleaner = Cleaner(tags=allowed_tags, attributes=allowed_attrs, strip=True)
safe_html = cleaner.clean(markdown.markdown(updated_resume))
```

This will help prevent any untrusted LLM output from introducing XSS vulnerabilities.
apps/backend/app/prompt/structured_resume.py (1)
8-8: Good addition: explicit instruction to extract keywords

This aligns the prompt with downstream expectations and reduces the chance of a missing extracted_keywords in the schema.
2 issues found across 2 files
```python
        try:
            keywords_data = json.loads(processed_resume.extracted_keywords)
            if keywords_data is None:
```
Validate keywords_data is a dict before calling .get; json.loads can return non-dict types (e.g., list or str), which would raise AttributeError and bypass the intended error handling
Prompt for AI agents
Address the following comment on apps/backend/app/services/score_improvement_service.py at line 57:
<comment>Validate keywords_data is a dict before calling .get; json.loads can return non-dict types (e.g., list or str), which would raise AttributeError and bypass the intended error handling</comment>
<file context>
@@ -54,6 +54,8 @@ def _validate_resume_keywords(
try:
keywords_data = json.loads(processed_resume.extracted_keywords)
+ if keywords_data is None:
+ raise ResumeKeywordExtractionError(resume_id=resume_id)
keywords = keywords_data.get("extracted_keywords", [])
</file context>
Suggested change:

```diff
-        if keywords_data is None:
+        if keywords_data is None or not isinstance(keywords_data, dict):
```
```
- Use "Present" if an end date is ongoing.
- Make sure dates are in YYYY-MM-DD.
- Do not format the response in Markdown or any other format. Just output raw JSON.
- Don't forget to extract keywords from the resume.
```
Align this instruction with the provided schema to avoid emitting fields not defined there, which can break strict JSON validation downstream
Prompt for AI agents
Address the following comment on apps/backend/app/prompt/structured_resume.py at line 8:
<comment>Align this instruction with the provided schema to avoid emitting fields not defined there, which can break strict JSON validation downstream</comment>
<file context>
@@ -5,6 +5,7 @@
- Use "Present" if an end date is ongoing.
- Make sure dates are in YYYY-MM-DD.
- Do not format the response in Markdown or any other format. Just output raw JSON.
+- Don't forget to extract keywords from the resume.
Schema:
</file context>
Suggested change:

```diff
-- Don't forget to extract keywords from the resume.
+- If the schema includes an "extracted_keywords" field, extract keywords from the resume and populate it; do not add any fields not present in the schema.
```
Related Issue
Fixes #421
Type
Checklist
Summary by cubic
Adds keyword extraction to the structured resume output and validates missing data to prevent silent scoring failures. Resumes now always include extracted keywords or fail fast.
New Features
Bug Fixes
Summary by CodeRabbit