Enhance structured_resume.py to include keyword extraction from resumes. #500
Conversation
…e to raise ResumeKeywordExtractionError when extracted keywords are not present.
Walkthrough

The prompt for structured resume parsing was updated to include keyword extraction. In the scoring service, a defensive check now raises ResumeKeywordExtractionError when the deserialized extracted_keywords is None, preventing downstream attribute access on None. No public APIs or signatures changed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Backend
    participant LLM as LLM (Structured Resume Prompt)
    participant Service as score_improvement_service
    Client->>Backend: Submit resume
    Backend->>LLM: Prompt (now includes "extract keywords")
    LLM-->>Backend: JSON with extracted_keywords
    Backend->>Service: process(extracted_keywords_json)
    Service->>Service: parsed = json.loads(...)
    alt parsed is None
        Service-->>Backend: Raise ResumeKeywordExtractionError
        note right of Service: New defensive check on null
    else parsed is object
        Service-->>Backend: Continue with parsed.get(...)
    end
    Backend-->>Client: Response or error
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
apps/backend/app/services/score_improvement_service.py (3)
73-79: Symmetric None-handling missing for job keywords (possible AttributeError path)

If processed_job.extracted_keywords is "null", json.loads returns None and .get will raise AttributeError, which escapes your JSONDecodeError except. Mirror the resume fix here.
Apply this diff:

```diff
         try:
-            keywords_data = json.loads(processed_job.extracted_keywords)
-            keywords = keywords_data.get("extracted_keywords", [])
+            keywords_data = json.loads(processed_job.extracted_keywords)
+            if keywords_data is None:
+                raise JobKeywordExtractionError(job_id=job_id)
+            keywords = keywords_data.get("extracted_keywords", [])
             if not keywords or len(keywords) == 0:
                 raise JobKeywordExtractionError(job_id=job_id)
-        except json.JSONDecodeError:
+        except (json.JSONDecodeError, TypeError, AttributeError):
             raise JobKeywordExtractionError(job_id=job_id)
```
311-314: Streaming bug: iterating a string yields characters, not suggestions

updated_resume is a string (see run(), which renders it as markdown), so streaming per character is incorrect.
Use a type check and stream once (or split into chunks) instead:

```diff
-        for i, suggestion in enumerate(updated_resume):
-            yield f"data: {json.dumps({'status': 'suggestion', 'index': i, 'text': suggestion})}\n\n"
-            await asyncio.sleep(0.2)
+        if isinstance(updated_resume, str):
+            yield f"data: {json.dumps({'status': 'suggestion', 'index': 0, 'text': updated_resume})}\n\n"
+        else:
+            for i, suggestion in enumerate(updated_resume):
+                yield f"data: {json.dumps({'status': 'suggestion', 'index': i, 'text': suggestion})}\n\n"
+                await asyncio.sleep(0.2)
```
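A quick pure-Python illustration of the failure mode described above (variable names are illustrative, not the service's actual code):

```python
# enumerate() over a str yields one character per iteration, so each
# streamed SSE event would carry a single character of the resume text
# rather than a whole suggestion.
updated_resume = "Improved resume text"
events = [(i, chunk) for i, chunk in enumerate(updated_resume)]

assert events[0] == (0, "I")                 # first "suggestion" is just the letter "I"
assert len(events) == len(updated_resume)    # one event per character, not per suggestion
```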
190-191: PII leakage: full resume content logged in prompt

The structured resume prompt includes the raw resume text. This is sensitive and should not be written to logs.
Redact the log to a hash and lower the level:

```diff
-        logger.info(f"Structured Resume Prompt: {prompt}")
+        logger.debug("Structured Resume Prompt redacted. prompt_sha256=%s",
+                     hashlib.sha256(prompt.encode("utf-8")).hexdigest())
```

Add this import at the top of the file:

```diff
+import hashlib
```
🧹 Nitpick comments (7)
apps/backend/app/services/score_improvement_service.py (6)
135-141: Guard against zero vectors in cosine similarity

If either embedding has zero norm, you'll return inf/nan. Return 0.0 instead for robustness.
```diff
-        return float(np.dot(ejk, re) / (np.linalg.norm(ejk) * np.linalg.norm(re)))
+        denom = float(np.linalg.norm(ejk) * np.linalg.norm(re))
+        if denom == 0.0 or not np.isfinite(denom):
+            return 0.0
+        return float(np.dot(ejk, re) / denom)
```
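The service uses NumPy, but the failure mode and the guard are easy to see in a pure-stdlib sketch (function name and shapes are illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity with a guard for zero-norm vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if denom == 0.0 or not math.isfinite(denom):
        # A zero-norm embedding has no direction; without this guard the
        # naive formula divides by zero (nan/inf in NumPy, an exception here).
        return 0.0
    return dot / denom

assert cosine_similarity([0.0, 0.0], [1.0, 2.0]) == 0.0          # guarded zero vector
assert abs(cosine_similarity([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-9  # identical direction
```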
168-170: Parameter order nit for cosine similarity

The computation is symmetric, but passing arguments in the declared order improves readability.
```diff
-                score = self.calculate_cosine_similarity(
-                    emb, extracted_job_keywords_embedding
-                )
+                score = self.calculate_cosine_similarity(
+                    extracted_job_keywords_embedding, emb
+                )
```
155-179: Consider continuing all attempts and keeping the best result

Right now you return on the first improvement; given LLM stochasticity, later attempts might beat the early win.
```diff
-                if score > best_score:
-                    return improved, score
+                if score > best_score:
+                    best_resume, best_score = improved, score
+                    continue
@@
         return best_resume, best_score
```
210-218: Avoid repeated JSON loads and joins; centralize keyword extraction

The same pattern appears here and again in run_and_stream(). Consider a small helper to DRY and validate once.
Example helper (place it as a private method in this class):

```python
def _keywords_csv(self, payload: str) -> str:
    try:
        data = json.loads(payload)
        if not data:
            return ""
        return ", ".join(data.get("extracted_keywords", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        return ""
```

Then call self._keywords_csv(processed_job.extracted_keywords) etc.
70-79: Add tests to cover "null" keyword payloads and the streaming path

Given the new validation, add unit tests where extracted_keywords is "null" for both resume and job, and verify the JobKeywordExtractionError/ResumeKeywordExtractionError paths and the streaming behavior.
I can draft pytest cases that insert ProcessedResume/ProcessedJob with extracted_keywords="null" and assert the error path and the fixed streaming logic. Want me to push those?
Also applies to: 210-218, 275-283
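A standalone sketch of the validation such tests would exercise (the function and exception here mirror the service logic under review but are illustrative, not the project's actual API):

```python
import json

class ResumeKeywordExtractionError(Exception):
    """Raised when a resume's extracted_keywords payload is missing, malformed, or empty."""
    def __init__(self, resume_id: str):
        self.resume_id = resume_id
        super().__init__(f"Keyword extraction failed for resume {resume_id}")

def validate_resume_keywords(payload: str, resume_id: str) -> list:
    # json.loads("null") returns None, and payloads like "[]" or '"oops"'
    # are not dicts -- none of these may reach .get() unguarded.
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        raise ResumeKeywordExtractionError(resume_id)
    if not isinstance(data, dict):
        raise ResumeKeywordExtractionError(resume_id)
    keywords = data.get("extracted_keywords", [])
    if not keywords:
        raise ResumeKeywordExtractionError(resume_id)
    return keywords

# Happy path and every bad payload shape fail fast with the domain error:
assert validate_resume_keywords('{"extracted_keywords": ["python", "sql"]}', "r1") == ["python", "sql"]
for bad in ("null", "[]", '"oops"', "{}", "not json"):
    try:
        validate_resume_keywords(bad, "r1")
        raised = False
    except ResumeKeywordExtractionError:
        raised = True
    assert raised
```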
65-80: Optional: log context without leaking content

Exceptions raised here are user-actionable; consider including resume_id/job_id (already included) and avoid logging raw payloads anywhere else in this method.
apps/backend/app/prompt/structured_resume.py (1)
3-9: Make the keyword instruction concrete to improve LLM adherence

Be explicit about field name, type, and constraints to reduce empty or irrelevant outputs.
```diff
-- Do not format the response in Markdown or any other format. Just output raw JSON.
-- Don't forget to extract keywords from the resume.
+- Do not format the response in Markdown or any other format. Just output raw JSON.
+- Extract keywords under the field "extracted_keywords" as a non-empty array of unique, lowercase keywords/phrases (1–3 words each) relevant to the resume; exclude stopwords and boilerplate.
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- apps/backend/app/prompt/structured_resume.py (1 hunks)
- apps/backend/app/services/score_improvement_service.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
apps/backend/app/{models,services}/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use SQLAlchemy ORM with async sessions for database access
Files:
apps/backend/app/services/score_improvement_service.py
apps/backend/app/services/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
apps/backend/app/services/**/*.py: Use temporary files for document processing and always clean up after processing
Implement async processing for file operations in backend services
Cache processed results when appropriate in backend services
Files:
apps/backend/app/services/score_improvement_service.py
apps/backend/app/{agent,services}/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use structured prompts and validate AI model responses against Pydantic models
Files:
apps/backend/app/services/score_improvement_service.py
🧬 Code graph analysis (2)
apps/backend/app/services/score_improvement_service.py (2)
apps/backend/app/services/exceptions.py (2)
- ResumeKeywordExtractionError (85-96)
- __init__ (90-96)

apps/backend/app/services/resume_service.py (2)
- _extract_structured_json (201-234)
- _extract_and_store_structured_resume (132-199)
apps/backend/app/prompt/structured_resume.py (1)
apps/backend/app/services/resume_service.py (1)
- _extract_structured_json (201-234)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: cubic · AI code reviewer
🔇 Additional comments (3)
apps/backend/app/services/score_improvement_service.py (2)
55-61: Solid defensive None-check after json.loads for resume keywords

Good catch handling the "null" case produced by json.dumps(None) upstream. This prevents an AttributeError on .get and turns it into a domain-specific ResumeKeywordExtractionError.
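The round-trip that produces the "null" case is worth seeing in isolation (standard-library behavior only):

```python
import json

# Upstream code that serializes a missing result with json.dumps(None)
# stores the literal string "null" in the database column.
payload = json.dumps(None)
assert payload == "null"

# Deserializing gives back None, not a dict, so calling .get() on the
# result would raise AttributeError without the new defensive check.
assert json.loads(payload) is None
```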
253-254: Confirm or add HTML sanitization for Markdown outputs

I ran a repository-wide search for common sanitization libraries and patterns (e.g. bleach, dompurify, sanitize, dangerouslySetInnerHTML) and found no matches. This suggests that the HTML generated by markdown.markdown(...) at lines 253-254 (and similarly at 320-321) is not currently being sanitized before it is sent to the frontend.

- Please verify whether the client code applies a definitive sanitization or strict CSP that would mitigate XSS risks.
- If no such sanitization exists, I strongly recommend adding server-side HTML sanitization (for example, using Bleach as shown below) or ensuring the frontend applies a vetted sanitizer before rendering.

```python
from bleach.sanitizer import Cleaner

allowed_tags = ["p", "ul", "ol", "li", "strong", "em", "h1", "h2", "h3", "code", "pre", "blockquote", "br", "a"]
allowed_attrs = {"a": ["href", "title", "rel", "target"]}
cleaner = Cleaner(tags=allowed_tags, attributes=allowed_attrs, strip=True)
safe_html = cleaner.clean(markdown.markdown(updated_resume))
```

This will help prevent any untrusted LLM output from introducing XSS vulnerabilities.
apps/backend/app/prompt/structured_resume.py (1)
8-8: Good addition: explicit instruction to extract keywords

This aligns the prompt with downstream expectations and reduces the chance of a missing extracted_keywords in the schema.
2 issues found across 2 files
```python
        try:
            keywords_data = json.loads(processed_resume.extracted_keywords)
            if keywords_data is None:
```
Validate keywords_data is a dict before calling .get; json.loads can return non-dict types (e.g., list or str), which would raise AttributeError and bypass the intended error handling
Prompt for AI agents
Address the following comment on apps/backend/app/services/score_improvement_service.py at line 57:
<comment>Validate keywords_data is a dict before calling .get; json.loads can return non-dict types (e.g., list or str), which would raise AttributeError and bypass the intended error handling</comment>
<file context>
@@ -54,6 +54,8 @@ def _validate_resume_keywords(
try:
keywords_data = json.loads(processed_resume.extracted_keywords)
+ if keywords_data is None:
+ raise ResumeKeywordExtractionError(resume_id=resume_id)
keywords = keywords_data.get("extracted_keywords", [])
</file context>
Suggested change:

```diff
-        if keywords_data is None:
+        if keywords_data is None or not isinstance(keywords_data, dict):
```
```
- Use "Present" if an end date is ongoing.
- Make sure dates are in YYYY-MM-DD.
- Do not format the response in Markdown or any other format. Just output raw JSON.
- Don't forget to extract keywords from the resume.
```
Align this instruction with the provided schema to avoid emitting fields not defined there, which can break strict JSON validation downstream
Prompt for AI agents
Address the following comment on apps/backend/app/prompt/structured_resume.py at line 8:
<comment>Align this instruction with the provided schema to avoid emitting fields not defined there, which can break strict JSON validation downstream</comment>
<file context>
@@ -5,6 +5,7 @@
- Use "Present" if an end date is ongoing.
- Make sure dates are in YYYY-MM-DD.
- Do not format the response in Markdown or any other format. Just output raw JSON.
+- Don't forget to extract keywords from the resume.
Schema:
</file context>
Suggested change:

```diff
-- Don't forget to extract keywords from the resume.
+- If the schema includes an "extracted_keywords" field, extract keywords from the resume and populate it; do not add any fields not present in the schema.
```
Related Issue
Fixes #421
Type
Checklist
Summary by cubic
Adds keyword extraction to the structured resume output and validates missing data to prevent silent scoring failures. Resumes now always include extracted keywords or fail fast.
New Features
Bug Fixes
Summary by CodeRabbit