
Conversation


@SmartManoj SmartManoj commented Aug 25, 2025

Related Issue

Fixes #421

Type

  • Bug Fix
  • Feature Enhancement
  • Documentation Update
  • Code Refactoring
  • Other (please specify):

Checklist

  • The code compiles successfully without any errors or warnings
  • The changes have been tested and verified
  • The documentation has been updated (if applicable)
  • The changes follow the project's coding guidelines and best practices
  • The commit messages are descriptive and follow the project's guidelines
  • All tests (if applicable) pass successfully
  • This pull request has been linked to the related issue (if applicable)

Summary by cubic

Adds keyword extraction to the structured resume output and rejects missing keyword data to prevent silent scoring failures. Resumes now always include extracted keywords or fail fast.

  • New Features

    • Update structured_resume prompt to explicitly require keyword extraction.
  • Bug Fixes

    • In ScoreImprovementService, raise ResumeKeywordExtractionError when extracted_keywords is null or empty.

Summary by CodeRabbit

  • Bug Fixes
    • Improved resume parsing to reliably extract keywords, leading to more complete and accurate scoring and insights.
    • Added safeguards to prevent occasional failures when keyword data is missing or unreadable, returning a clear error and maintaining app stability.

Your Name added 2 commits August 25, 2025 19:36

coderabbitai bot commented Aug 25, 2025

Walkthrough

The prompt for structured resume parsing was updated to include keyword extraction. In the scoring service, a defensive check now raises ResumeKeywordExtractionError when deserialized extracted_keywords is None to prevent downstream attribute access on None. No public APIs or signatures changed.
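For context on the `None` case: the standard library maps the JSON literal `null` to Python's `None`, so a stored `"null"` payload deserializes cleanly and only fails later on attribute access:

```python
import json

data = json.loads("null")  # the JSON literal null parses to Python None, no exception
assert data is None
# Without the new guard, the next line would raise:
# AttributeError: 'NoneType' object has no attribute 'get'
# data.get("extracted_keywords")
```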

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Prompt update: structured resume<br>`apps/backend/app/prompt/structured_resume.py` | Added an instruction to extract keywords from the resume alongside the existing raw-JSON formatting rule; the rest of the prompt (schema, formatting expectations) is unchanged. |
| Defensive null-check in scoring service<br>`apps/backend/app/services/score_improvement_service.py` | After `json.loads` on `processed_resume.extracted_keywords`, raises `ResumeKeywordExtractionError` if the result is `None`; preserves existing `JSONDecodeError` handling. No signature changes. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant Backend
  participant LLM as LLM (Structured Resume Prompt)
  participant Service as score_improvement_service

  Client->>Backend: Submit resume
  Backend->>LLM: Prompt (now includes "extract keywords")
  LLM-->>Backend: JSON with extracted_keywords
  Backend->>Service: process(extracted_keywords_json)

  Service->>Service: parsed = json.loads(...)
  alt parsed is None
    Service-->>Backend: Raise ResumeKeywordExtractionError
    note right of Service: New defensive check on null
  else parsed is object
    Service-->>Backend: Continue with parsed.get(...)
  end

  Backend-->>Client: Response or error

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • console errors fix #420 — Introduces ResumeKeywordExtractionError and related validation, aligning with this PR’s new null check and error raising.
  • Resume dashboard Previewer #371 — Also touches score_improvement_service.py, integrating resume preview logic alongside related processing paths.

Suggested reviewers

  • srbhr

Poem

I nibbled the prompt with careful cheer,
“Extract the keywords!” now loud and clear.
If null hops in, I thump—beware!
An error raised with tidy care.
Carrots of JSON, crisp and bright,
No None shall sneak by moonlit night. 🥕🐇


@qodo-code-review

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Prompt Clarity

The added instruction to extract keywords is vague; consider specifying the exact JSON field name and format expected to ensure consistent model output aligned with the validator that reads "extracted_keywords".

- Don't forget to extract keywords from the resume.

Schema:
```json
{0}
```
Validation Robustness

After json.loads, ensure the type of keywords_data is a dict before calling get; otherwise a list/str would trigger an AttributeError and bypass the intended error handling.

keywords_data = json.loads(processed_resume.extracted_keywords)
if keywords_data is None:
    raise ResumeKeywordExtractionError(resume_id=resume_id)
keywords = keywords_data.get("extracted_keywords", [])
if not keywords or len(keywords) == 0:
    raise ResumeKeywordExtractionError(resume_id=resume_id)
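To see why the reviewers push for an `isinstance` check: `json.loads` returns whatever top-level JSON value it is given, so several perfectly valid payloads slip past a plain `None` test. A quick standard-library illustration, independent of the service code:

```python
import json

print(json.loads("null"))        # None -> caught by the new check
print(json.loads('["python"]'))  # list -> passes the None check, but .get(...) raises AttributeError
print(json.loads('"python"'))    # str  -> same failure mode
print(json.loads('{"extracted_keywords": ["python"]}'))  # dict -> the only shape .get() expects
```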

@qodo-code-review

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Validate parsed JSON type

Guard against non-dict JSON to avoid an AttributeError when calling .get. Extend
the new null check to also validate the parsed JSON is a dict before accessing
keys.

apps/backend/app/services/score_improvement_service.py [56-61]

 keywords_data = json.loads(processed_resume.extracted_keywords)
-if keywords_data is None:
+if keywords_data is None or not isinstance(keywords_data, dict):
     raise ResumeKeywordExtractionError(resume_id=resume_id)
 keywords = keywords_data.get("extracted_keywords", [])
 if not keywords or len(keywords) == 0:
     raise ResumeKeywordExtractionError(resume_id=resume_id)
Suggestion importance[1-10]: 7


Why: The suggestion correctly points out that json.loads can return non-dictionary types, which would cause an unhandled AttributeError on the subsequent .get call, and the proposed fix prevents this potential crash.

Impact: Medium
Clarify schema-aligned keyword extraction

Align the instruction with the schema to prevent generating fields that might
not exist, which would break downstream validation. Specify extracting keywords
only if the schema includes the corresponding field.

apps/backend/app/prompt/structured_resume.py [8]

-- Don't forget to extract keywords from the resume.
+- If the schema includes an "extracted_keywords" field, extract keywords from the resume and populate it; do not add any fields not present in the schema.
Suggestion importance[1-10]: 6


Why: The suggestion improves the prompt's clarity by resolving a potential contradiction between instructions, making the language model's behavior more predictable and aligned with the provided schema.

Impact: Low

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
apps/backend/app/services/score_improvement_service.py (3)

73-79: Symmetric None-handling missing for job keywords (possible AttributeError path)

If processed_job.extracted_keywords is "null", json.loads returns None and .get will raise AttributeError, which escapes your JSONDecodeError except clause. Mirror the resume fix here.

Apply this diff:

         try:
-            keywords_data = json.loads(processed_job.extracted_keywords)
-            keywords = keywords_data.get("extracted_keywords", [])
+            keywords_data = json.loads(processed_job.extracted_keywords)
+            if keywords_data is None:
+                raise JobKeywordExtractionError(job_id=job_id)
+            keywords = keywords_data.get("extracted_keywords", [])
             if not keywords or len(keywords) == 0:
                 raise JobKeywordExtractionError(job_id=job_id)
-        except json.JSONDecodeError:
+        except (json.JSONDecodeError, TypeError, AttributeError):
             raise JobKeywordExtractionError(job_id=job_id)

311-314: Streaming bug: Iterating a string yields characters, not suggestions

updated_resume is a string (see run() which renders it as markdown). Streaming per-character is incorrect.

Use a type check and stream once (or split into chunks) instead:

-        for i, suggestion in enumerate(updated_resume):
-            yield f"data: {json.dumps({'status': 'suggestion', 'index': i, 'text': suggestion})}\n\n"
-            await asyncio.sleep(0.2)
+        if isinstance(updated_resume, str):
+            yield f"data: {json.dumps({'status': 'suggestion', 'index': 0, 'text': updated_resume})}\n\n"
+        else:
+            for i, suggestion in enumerate(updated_resume):
+                yield f"data: {json.dumps({'status': 'suggestion', 'index': i, 'text': suggestion})}\n\n"
+                await asyncio.sleep(0.2)

190-191: PII leakage: Full resume content logged in prompt

Structured Resume Prompt includes the raw resume text. This is sensitive and should not be written to logs.

Redact the log to a hash and lower the level:

-        logger.info(f"Structured Resume Prompt: {prompt}")
+        logger.debug("Structured Resume Prompt redacted. prompt_sha256=%s",
+                     hashlib.sha256(prompt.encode("utf-8")).hexdigest())

Add this import at the top of the file:

+import hashlib
🧹 Nitpick comments (7)
apps/backend/app/services/score_improvement_service.py (6)

135-141: Guard against zero vectors in cosine similarity

If either embedding has zero norm, you’ll return inf/nan. Return 0.0 instead for robustness.

-        return float(np.dot(ejk, re) / (np.linalg.norm(ejk) * np.linalg.norm(re)))
+        denom = float(np.linalg.norm(ejk) * np.linalg.norm(re))
+        if denom == 0.0 or not np.isfinite(denom):
+            return 0.0
+        return float(np.dot(ejk, re) / denom)

168-170: Parameter order nit for cosine similarity

Computation is symmetric, but passing arguments in the declared order improves readability.

-            score = self.calculate_cosine_similarity(
-                emb, extracted_job_keywords_embedding
-            )
+            score = self.calculate_cosine_similarity(
+                extracted_job_keywords_embedding, emb
+            )

155-179: Consider continuing all attempts and keeping the best result

Right now you return on the first improvement; given LLM stochasticity, later attempts might beat the early win.

-            if score > best_score:
-                return improved, score
+            if score > best_score:
+                best_resume, best_score = improved, score
+                continue
@@
-        return best_resume, best_score
+        return best_resume, best_score

210-218: Avoid repeated JSON loads and joins; centralize keyword extraction

Same pattern appears here and again in run_and_stream(). Consider a small helper to DRY and validate once.

Example helper (place it as a private method in this class):

def _keywords_csv(self, payload: str) -> str:
    try:
        data = json.loads(payload)
        if not data:
            return ""
        return ", ".join(data.get("extracted_keywords", []))
    except (json.JSONDecodeError, AttributeError, TypeError):
        return ""

Then call self._keywords_csv(processed_job.extracted_keywords) etc.


70-79: Add tests to cover "null" keyword payloads and streaming path

Given the new validation, add unit tests where extracted_keywords is "null" for both resume and job, and verify JobKeywordExtractionError/ResumeKeywordExtractionError and streaming behavior.

I can draft pytest cases that insert ProcessedResume/ProcessedJob with extracted_keywords="null" and assert the error path and the fixed streaming logic. Want me to push those?

Also applies to: 210-218, 275-283
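A minimal pytest sketch of one such case (the `service` fixture, the stand-in `SimpleNamespace` record, and the exact `_validate_resume_keywords` signature are assumptions inferred from the diff context in this review, not verified against the repository):

```python
import json
from types import SimpleNamespace

import pytest

from app.services.exceptions import ResumeKeywordExtractionError  # path per this review's code graph


def test_null_extracted_keywords_raises(service):
    # `service` is assumed to be a fixture yielding a ScoreImprovementService instance.
    # json.dumps(None) produces the literal string "null", the payload under discussion.
    resume = SimpleNamespace(extracted_keywords=json.dumps(None))
    with pytest.raises(ResumeKeywordExtractionError):
        service._validate_resume_keywords(resume, resume_id="resume-1")
```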


65-80: Optional: log context without leaking content

Exceptions raised here are user-actionable; consider including resume_id/job_id (already included) and avoid logging raw payloads anywhere else in this method.

apps/backend/app/prompt/structured_resume.py (1)

3-9: Make the keyword instruction concrete to improve LLM adherence

Be explicit about field name, type, and constraints to reduce empty/irrelevant outputs.

 - Do not format the response in Markdown or any other format. Just output raw JSON.
- - Don't forget to extract keywords from the resume.
+ - Extract keywords under the field "extracted_keywords" as a non-empty array of unique, lowercase keywords/phrases (1–3 words each) relevant to the resume; exclude stopwords and boilerplate.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8c75adf and 83f328e.

📒 Files selected for processing (2)
  • apps/backend/app/prompt/structured_resume.py (1 hunks)
  • apps/backend/app/services/score_improvement_service.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
apps/backend/app/{models,services}/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use SQLAlchemy ORM with async sessions for database access

Files:

  • apps/backend/app/services/score_improvement_service.py
apps/backend/app/services/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

apps/backend/app/services/**/*.py: Use temporary files for document processing and always clean up after processing
Implement async processing for file operations in backend services
Cache processed results when appropriate in backend services

Files:

  • apps/backend/app/services/score_improvement_service.py
apps/backend/app/{agent,services}/**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use structured prompts and validate AI model responses against Pydantic models

Files:

  • apps/backend/app/services/score_improvement_service.py
🧬 Code graph analysis (2)
apps/backend/app/services/score_improvement_service.py (2)
apps/backend/app/services/exceptions.py (2)
  • ResumeKeywordExtractionError (85-96)
  • __init__ (90-96)
apps/backend/app/services/resume_service.py (2)
  • _extract_structured_json (201-234)
  • _extract_and_store_structured_resume (132-199)
apps/backend/app/prompt/structured_resume.py (1)
apps/backend/app/services/resume_service.py (1)
  • _extract_structured_json (201-234)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: cubic · AI code reviewer
🔇 Additional comments (3)
apps/backend/app/services/score_improvement_service.py (2)

55-61: Solid defensive None-check after json.loads for resume keywords

Good catch handling the "null" case produced by json.dumps(None) upstream. This prevents AttributeError on .get and turns it into a domain-specific ResumeKeywordExtractionError.


253-254: Confirm or Add HTML Sanitization for Markdown Outputs

I ran a repository-wide search for common sanitization libraries and patterns (e.g. bleach, dompurify, sanitize, dangerouslySetInnerHTML) and found no matches. This suggests that the HTML generated by markdown.markdown(...) at lines 253–254 (and similarly at 320–321) is not currently being sanitized before it’s sent to the frontend.

• Please verify whether the client code applies a definitive sanitization or strict CSP that would mitigate XSS risks.
• If no such sanitization exists, I strongly recommend adding server-side HTML sanitization (for example, using Bleach as shown below) or ensuring the frontend applies a vetted sanitizer before rendering.

from bleach.sanitizer import Cleaner

allowed_tags = ["p","ul","ol","li","strong","em","h1","h2","h3","code","pre","blockquote","br","a"]
allowed_attrs = {"a": ["href","title","rel","target"]}

cleaner = Cleaner(tags=allowed_tags, attributes=allowed_attrs, strip=True)
safe_html = cleaner.clean(markdown.markdown(updated_resume))

This will help prevent any untrusted LLM output from introducing XSS vulnerabilities.

apps/backend/app/prompt/structured_resume.py (1)

8-8: Good addition: explicit instruction to extract keywords

This aligns the prompt with downstream expectations and reduces the chance of missing extracted_keywords in the schema.

@cubic-dev-ai cubic-dev-ai bot left a comment

2 issues found across 2 files

React with 👍 or 👎 to teach cubic. You can also tag @cubic-dev-ai to give feedback, ask questions, or re-run the review.


try:
    keywords_data = json.loads(processed_resume.extracted_keywords)
    if keywords_data is None:

Validate keywords_data is a dict before calling .get; json.loads can return non-dict types (e.g., list or str), which would raise AttributeError and bypass the intended error handling

Prompt for AI agents
Address the following comment on apps/backend/app/services/score_improvement_service.py at line 57:

<comment>Validate keywords_data is a dict before calling .get; json.loads can return non-dict types (e.g., list or str), which would raise AttributeError and bypass the intended error handling</comment>

<file context>
@@ -54,6 +54,8 @@ def _validate_resume_keywords(
 
         try:
             keywords_data = json.loads(processed_resume.extracted_keywords)
+            if keywords_data is None:
+                raise ResumeKeywordExtractionError(resume_id=resume_id)
 keywords = keywords_data.get("extracted_keywords", [])
</file context>
Suggested change
if keywords_data is None:
if keywords_data is None or not isinstance(keywords_data, dict):

- Use "Present" if an end date is ongoing.
- Make sure dates are in YYYY-MM-DD.
- Do not format the response in Markdown or any other format. Just output raw JSON.
- Don't forget to extract keywords from the resume.

Align this instruction with the provided schema to avoid emitting fields not defined there, which can break strict JSON validation downstream

Prompt for AI agents
Address the following comment on apps/backend/app/prompt/structured_resume.py at line 8:

<comment>Align this instruction with the provided schema to avoid emitting fields not defined there, which can break strict JSON validation downstream</comment>

<file context>
@@ -5,6 +5,7 @@
 - Use "Present" if an end date is ongoing.
 - Make sure dates are in YYYY-MM-DD.
 - Do not format the response in Markdown or any other format. Just output raw JSON.
+- Don't forget to extract keywords from the resume.
 
 Schema:
</file context>
Suggested change
- Don't forget to extract keywords from the resume.
- If the schema includes an "extracted_keywords" field, extract keywords from the resume and populate it; do not add any fields not present in the schema.



Development

Successfully merging this pull request may close these issues.

Submitting Improve always failed with status 500, details in post

1 participant