| # Recent Updates and Fixes |
|
|
| ## Overview |
|
|
| Two important fixes have been implemented based on testing feedback: |
|
|
| 1. **Leetspeak Translation** (before NER) |
| 2. **Improved Country Mapping** (check ALL tags) |
|
|
| --- |
|
|
| ## Fix 1: Leetspeak Translation |
|
|
| ### Problem |
| Names written in leetspeak (digits or symbols standing in for letters) were not being cleaned correctly: |
| - `4kira` should be `Akira` |
| - `1rene` should be `Irene` |
| - `3mma` should be `Emma` |
|
|
| ### Solution |
| Added leetspeak translation **before** other NER processing in Cell 5. |
|
|
| ### Mapping Table |
| | Leetspeak | Letter | |
| |-----------|--------| |
| | 4 | A | |
| | 3 | E | |
| | 1 | I | |
| | 0 | O | |
| | 7 | T | |
| | 5 | S | |
| | 8 | B | |
| | 9 | G | |
| | @ | A | |
| | $ | S | |
| | ! | I | |
|
|
| ### Examples |
| ``` |
| 4kira -> Akira |
| 3mma -> Emma |
| 1rene -> Irene |
| L3vi -> Levi |
| S4sha -> Sasha |
| K4te -> Kate |
| J3ssica -> Jessica |
| ``` |
|
|
| ### Implementation |
| The `translate_leetspeak()` function runs first in `clean_name()`, before emoji removal and the other cleaning steps, so every later step sees ordinary letters rather than leetspeak. |
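A minimal sketch of what such a function could look like, built only from the mapping table above. The casing rule (capitalize a substitution at the start of a word, lowercase elsewhere) and the rule that tokens without letters are left untouched are assumptions made for this example; the actual code in Cell 5 may behave differently.

```python
# Sketch only: mirrors the mapping table above; the real Cell 5 code may differ.
LEET_MAP = {
    "4": "a", "3": "e", "1": "i", "0": "o", "7": "t", "5": "s",
    "8": "b", "9": "g", "@": "a", "$": "s", "!": "i",
}


def translate_leetspeak(name: str) -> str:
    """Translate leetspeak characters in words that also contain letters."""
    translated_words = []
    for word in name.split():
        # Assumption: tokens without any letters (e.g. "2024") are left alone.
        if not any(ch.isalpha() for ch in word):
            translated_words.append(word)
            continue
        chars = []
        for index, ch in enumerate(word):
            letter = LEET_MAP.get(ch)
            if letter is None:
                chars.append(ch)  # ordinary character, keep as-is
            elif index == 0:
                chars.append(letter.upper())  # "4kira" -> "Akira"
            else:
                chars.append(letter)  # "L3vi" -> "Levi"
        translated_words.append("".join(chars))
    return " ".join(translated_words)


for raw in ["4kira", "3mma", "1rene", "L3vi", "1U", "4kira LoRA"]:
    print(raw, "->", translate_leetspeak(raw))
```

Because the table also maps `!` and `$`, a sketch like this would translate trailing punctuation inside a word as well, which is worth keeping in mind when reviewing the Cell 5 output.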
|
|
| --- |
|
|
| ## Fix 2: Improved Country Mapping |
|
|
| ### Problem |
| The country mapping returned on the first matching tag instead of considering every tag. For example: |
| - **Irene** has the tags `['girl', 'photorealistic', 'asian', 'woman', 'beautiful', 'celebrity', 'korean']` |
| - The `'korean'` tag was not being mapped to `'South Korea'` |
| - As a result, incomplete hints were sent to the LLM |
| - **Expected**: Deepseek should identify **Bae Joo-hyun (Irene)** from Red Velvet |
|
|
| ### Solution |
| Updated Cell 7 to: |
| 1. **Check ALL tags** (not just stop at first match) |
| 2. **Use a priority system** to select the best match: |
| - Priority 3: Exact country name match (highest) |
| - Priority 2: Nationality match (medium) |
| - Priority 1: Word parts (lowest) |
|
|
| ### How It Works |
|
|
| #### Before (Broken) |
| ```python |
| def infer_country_and_nationality(tags): |
|     for tag in tags: |
|         if tag in mapping: |
|             return mapping[tag]  # ❌ Stops at first match! |
|     return ("", "") |
| ``` |
|
|
| #### After (Fixed) |
| ```python |
| def infer_country_and_nationality(tags): |
|     best_match = None |
|     best_priority = 0 |
| |
|     for tag in tags:  # ✅ Check ALL tags |
|         if tag in mapping: |
|             country, nationality, priority = mapping[tag] |
|             if priority > best_priority: |
|                 best_match = (country, nationality) |
|                 best_priority = priority |
| |
|     return best_match or ("", "") |
| ``` |
|
|
| ### Example: Irene Case |
|
|
| **Input Tags**: `['girl', 'photorealistic', 'asian', 'woman', 'beautiful', 'celebrity', 'korean']` |
|
|
| **Processing**: |
| 1. Check `'girl'` → no match |
| 2. Check `'photorealistic'` → no match |
| 3. Check `'asian'` → no match (too generic) |
| 4. Check `'woman'` → no match |
| 5. Check `'beautiful'` → no match |
| 6. Check `'celebrity'` → no match |
| 7. Check `'korean'` → ✅ **MATCH!** |
| - Maps to nationality: `'South Korean'` |
| - Which maps to country: `'South Korea'` |
| - Priority: 2 (nationality match) |
|
|
| **Output**: |
| - `likely_country`: `'South Korea'` |
| - `likely_nationality`: `'South Korean'` |
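The trace above can be reproduced with a small self-contained version of the fixed function. The `SAMPLE_MAPPING` entries and the lowercase normalization of tags are assumptions for illustration; the real mapping in Cell 7 is not shown in this document.

```python
# Illustrative entries only: tag -> (country, nationality, priority).
# The real mapping in Cell 7 is not shown in this document.
SAMPLE_MAPPING = {
    "south korea": ("South Korea", "South Korean", 3),  # exact country name
    "korean": ("South Korea", "South Korean", 2),       # nationality
    "japan": ("Japan", "Japanese", 3),
    "japanese": ("Japan", "Japanese", 2),
}


def infer_country_and_nationality(tags, mapping=SAMPLE_MAPPING):
    """Scan every tag and keep the match with the highest priority."""
    best_match = None
    best_priority = 0
    for tag in tags:
        entry = mapping.get(tag.strip().lower())
        if entry is None:
            continue  # tags such as 'girl' or 'asian' carry no country info
        country, nationality, priority = entry
        if priority > best_priority:
            best_match = (country, nationality)
            best_priority = priority
    return best_match or ("", "")


irene_tags = ["girl", "photorealistic", "asian", "woman",
              "beautiful", "celebrity", "korean"]
likely_country, likely_nationality = infer_country_and_nationality(irene_tags)
print(likely_country, "/", likely_nationality)  # South Korea / South Korean
```

With this sample mapping, a tag list containing both `'korean'` (priority 2) and `'japan'` (priority 3) resolves to Japan regardless of tag order, which is the behavior the priority system is meant to give.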
|
|
| **Sent to Deepseek**: |
| ``` |
| Given 'Irene' (celebrity, South Korea), provide: |
| 1. Full legal name |
| 2. Aliases |
| 3. Gender |
| 4. Top 3 professions |
| 5. Country |
| ``` |
|
|
| **Expected Result**: Deepseek can now identify this as **Bae Joo-hyun (Irene)**, a South Korean singer/actress from the K-pop group Red Velvet. |
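One way the prompt shown above could be assembled from a cleaned name and its hints is sketched here. `build_prompt` and `PROMPT_FIELDS` are hypothetical names introduced for this example, not the pipeline's actual prompt code.

```python
# Hypothetical helper: the pipeline's real prompt-building code is not shown here.
PROMPT_FIELDS = [
    "Full legal name",
    "Aliases",
    "Gender",
    "Top 3 professions",
    "Country",
]


def build_prompt(name, hints):
    """Build the LLM prompt, skipping hints that are empty."""
    hint_text = ", ".join(hint for hint in hints if hint)
    subject = f"'{name}' ({hint_text})" if hint_text else f"'{name}'"
    lines = [f"Given {subject}, provide:"]
    lines += [f"{number}. {field}" for number, field in enumerate(PROMPT_FIELDS, start=1)]
    return "\n".join(lines)


print(build_prompt("Irene", ["celebrity", "South Korea"]))
```

Dropping empty hints keeps the prompt free of dangling commas or empty parentheses when no country could be inferred.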
|
|
| --- |
|
|
| ## Impact on Results |
|
|
| ### Better Name Recognition |
| - Leetspeak names are now properly translated |
| - LLMs receive cleaner, more recognizable names |
|
|
| ### Better Country Context |
| - All tags are now considered for country mapping |
| - More accurate country/nationality hints sent to LLMs |
| - Better identification of international celebrities |
|
|
| ### Example Improvements |
|
|
| | Name | Tags | Before | After | |
| |------|------|--------|-------| |
| | `4kira LoRA` | `['japanese', 'actress']` | `'4kira'` + no country | `'Akira'` + `'Japan'` | |
| | `Irene` | `['korean', 'celebrity']` | `'Irene'` + no country | `'Irene'` + `'South Korea'` | |
| | `1U` | `['korean', 'singer']` | `'1U'` + no country | `'IU'` + `'South Korea'` | |
| | `3lsa` | `['model']` | `'3lsa'` + no country | `'Elsa'` + country if tagged | |
|
|
| --- |
|
|
| ## Testing Recommendations |
|
|
| ### Before Running Full Pipeline |
|
|
| 1. **Test Leetspeak Translation** (Cell 5): |
| ```python |
| # Look for names with numbers in the output |
| # Verify they're properly translated |
| ``` |
|
|
| 2. **Test Country Mapping** (Cell 7): |
| ```python |
| # Check the debug output at the end: |
| # "Checking 'Irene' entries:" |
| # Verify country is properly mapped |
| ``` |
|
|
| 3. **Test Deepseek Results** (Cell 10): |
| ```python |
| # Look for Irene in the results |
| # Should now identify as Bae Joo-hyun |
| ``` |
|
|
| ### Validation Checklist |
|
|
| - [ ] Leetspeak names are translated (check console output in Cell 5) |
| - [ ] Country mapping shows high success rate (check stats in Cell 7) |
| - [ ] Irene is correctly identified as Bae Joo-hyun (check results in Cell 10) |
| - [ ] Other K-pop/Korean celebrities are properly identified |
| - [ ] Japanese/Chinese celebrities also benefit from better country mapping |
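The first checklist item can be partly automated with a rough heuristic that flags cleaned names which still look like leetspeak. The pattern and the `find_untranslated` helper are assumptions for this sketch; names that legitimately contain digits between letters would be flagged too and need manual review.

```python
import re

# Heuristic: a leetspeak character between two letters, or at the start of a word
# immediately followed by a letter. "!" is only checked between letters.
LEET_RESIDUE = re.compile(r"[A-Za-z][0-9@$!][A-Za-z]|(?:^|\s)[0-9@$][A-Za-z]")


def find_untranslated(names):
    """Return the names that still appear to contain leetspeak."""
    return [name for name in names if LEET_RESIDUE.search(name)]


cleaned_names = ["Akira", "3mma", "L3vi", "Kate 2", "Miss 3mma"]
print(find_untranslated(cleaned_names))
```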
|
|
| --- |
|
|
| ## Notes |
|
|
| ### Why Check ALL Tags? |
|
|
| Some entries have many tags, and the most informative tag might not be first: |
| ``` |
| tags = ['girl', 'sexy', 'beautiful', 'asian', 'korean', 'celebrity', 'kpop'] |
|                                                ^^^^^^ Most informative! |
| ``` |
|
|
| The old code returned as soon as any tag matched, so an earlier, less specific match could hide the `'korean'` tag further down the list. |
|
|
| ### Why Use Priority? |
|
|
| Some tags might match multiple countries. Priority ensures we get the best match: |
| - `'american'` → exact nationality match (priority 2) → USA |
| - `'america'` → could be North/South/Central America (priority 1) |
|
|
| The system picks the higher priority match. |
|
|
| ### Word Length Filter |
|
|
| Word parts only match if they are longer than 4 characters, to avoid false positives: |
| - ✅ `'china'` → matches China (5 chars) |
| - ❌ `'us'` → too short, might be part of other words |
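The length filter can be sketched as follows, assuming "word parts" means the pieces of a tag split on non-letter characters. `COUNTRY_NAMES`, `MIN_PART_LENGTH`, and `match_word_parts` are hypothetical names for this example; the real lookup lives in Cell 7.

```python
import re

# Assumed sample of country names; "us" is included to show the filter at work.
COUNTRY_NAMES = {"china": "China", "japan": "Japan", "india": "India", "us": "USA"}
MIN_PART_LENGTH = 5  # "longer than 4 characters"


def match_word_parts(tag):
    """Return the country for the first long-enough word part, else None."""
    for part in re.split(r"[^a-z]+", tag.lower()):
        if len(part) >= MIN_PART_LENGTH and part in COUNTRY_NAMES:
            return COUNTRY_NAMES[part]
    return None


for tag in ["china_actress", "us", "made in us"]:
    print(tag, "->", match_word_parts(tag))
```

Under this filter a short code such as `'us'` never matches as a word part, so it would have to be handled by an exact entry in the mapping if it should count at all.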
|
|
| --- |
|
|
| ## Future Improvements |
|
|
| Potential enhancements: |
| 1. **More leetspeak patterns**: `|\/|` for M, `(_)` for U, etc. |
| 2. **Fuzzy country matching**: Handle typos like `'corean'` → `'korean'` |
| 3. **Multi-country support**: Some celebrities work in multiple countries |
| 4. **Language detection**: Use name structure to infer origin |
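As a rough illustration of the fuzzy-matching idea, Python's standard `difflib.get_close_matches` can snap a misspelled tag to a known one. The tag list, the `normalize_tag` helper, and the `0.8` cutoff are assumptions for this sketch and would need tuning against real data.

```python
import difflib

# Assumed sample of known nationality tags.
KNOWN_NATIONALITY_TAGS = ["korean", "japanese", "chinese", "american"]


def normalize_tag(tag, cutoff=0.8):
    """Return the closest known tag, or the original tag if nothing is close enough."""
    matches = difflib.get_close_matches(
        tag.lower(), KNOWN_NATIONALITY_TAGS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else tag


print(normalize_tag("corean"))
print(normalize_tag("girl"))
```

A fairly high cutoff keeps unrelated tags such as `'girl'` from being rewritten, at the cost of missing heavier misspellings.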
|
|
| --- |
|
|
| ## Summary |
|
|
| ✅ **Leetspeak translation** ensures names are readable before NER |
| ✅ **ALL tags checked** ensures no country hints are missed |
| ✅ **Priority system** ensures best match is selected |
| ✅ **Better LLM results** from improved name quality and country context |
|
|
| These fixes should significantly improve the accuracy of person identification, especially for: |
| - International celebrities (K-pop, J-pop, C-pop) |
| - Names with leetspeak |
| - Entries where country info appears later in tag list |
|
|