
Cleaning up SPAM with NLP

[This article was first published on MeanMean, and kindly contributed to R-bloggers.]

Hey, it’s the holidays. Time to take some PTO/leave, hang out with family, relax, and work on some side projects. One of the many projects I finally have time for is removing ‘SPAM’ messages from one of the games I play from time to time. This lets me play with some NLP and solve some fun problems, like how to score a model quickly while using only a few MB of RAM in the scripting language Lua.

As someone new to Lua and only vaguely familiar with XGBoost model dumps, I had a lot of learning to do. However, I was able to achieve an overall accuracy of 95%. Not too bad for a first shot.


Introduction

Final Fantasy XI (FFXI) is a Massively Multiplayer Online game that I’ve been playing on and off for about two decades now. I primarily play the game to get in contact with old friends, enjoy the storyline, and just blow off some steam at the end of the day. However, over the last few years the primary game-wide communication system has been polluted with SPAM messages. This is problematic because this communication system is also where people find groups to do content.

Other solutions to this problem do exist, but they are largely just sets of regular expressions. So I thought I would take a stab at creating a quick natural language processing (NLP) solution to see if I can do better.

One way I can do better is to not look exclusively at two types of messages, SPAM and not SPAM. Instead, I can look at useful classes of messages that people can decide whether or not they want to see:

  1. Content: Messages recruiting players for in-game content.
  2. RMT: Real Money Trade, messages for buying/selling items using real money.
  3. Merc: People offering to sell content using in-game currency.
  4. JP Merc: People offering to ‘level’ your character for you.
  5. Chat: Conversations between players.
  6. Selling: Selling items using in-game currency.
  7. Buying: Buying items using in-game currency.
  8. Other: A catch-all for unknown messages.

Like all problems, this one has constraints. Messages need to be scored in real time, on the computer running the game. To avoid maintaining a message log, messages must be considered individually, so we can’t infer class from repeated message frequency as part of the scoring process.


Data Collection

I’m not going to go into the depths of data collection, but to do any data analysis I do need data. In particular, I need data that I can label. I utilized Windower, a project that allows third-party addons for the game. This framework lets a user interact with various components of the game, including in-game messaging, through the programming language Lua. This allowed me to write a quick plugin that records messages received by my character to a MariaDB database on my home server.

Because the addon only runs when I log into the game, it certainly is not a census of all messages, but should provide a reasonable sample of what types of messages are being sent within the game.

After running this for a few months, I was able to acquire 98,141 messages. Unfortunately, the data does not come pre-labeled, so I manually labeled 3,570 messages. This accounts for a total of 1,736 unique messages. For anyone interested, I’ve made this training data available. A sample of the first five and last five rows is provided below for context.

msg label n
1 job points 4m/500p dho gates 自動で組に入れる 4 124
2 job points š4m/500pšš all time party dho gates autoinvite 4 114
3 job points 500p/4m 0-2100 15m dho gates moh gates igeo 4song fast tachi: jinpu 99999 4 93
4 job points ššš “500p/4mil dho gates very fast. do you need it? all time party. buy? autoinvite. /tell 4 87
5 experience points pl do you need it? 1-99 3m 1-50 2m 50-99 2m escha – zi’tah 4 70
1732 i bet you are 5 1
1733 job points 500p do you need it? buy? dho gates fast kill 4 1
1734 kyou enmerkars earring mercenary can i have it? you can have this. 15m 3 1
1735 ea houppe. 1 ea slops 1 ninja nodowa 2 do you need it? buy? cheap bazaar@ guide stone. 6 1
1736 alexandrite 5.6k x6000 do you need it? buy? (f-8) bazaar 6 1

Show R Code.

library(magrittr)
library(dplyr)
library(tidyr)
library(tidytext)
library(xgboost)
library(tictoc)
library(ggplot2)
library(xtable)


##############################################
## read in a copy of the data
##############################################
# msg - message
# label - label
#   1 - Content 
#   2 - RMT
#   3 - Merc
#   4 - JP Merc
#   5 - Chat
#   6 - Selling (non-Merc) such as AH Items 
#   7 - Buying (non-Merc) such as AH Items 
#   8 - Other
# n - count
##############################################
msgs <- read.csv("http://meanmean.me/data/train_data.csv",
                 stringsAsFactors = FALSE) %>% select(-X)

# order by frequency
msgs <- msgs %>% arrange(desc(n))



# get an example of messages
msg_example <- rbind( head(msgs,5), tail(msgs,5))

## define labels for plots
description <- c("Content", "RMT", "Merc", "JP Merc", "Chat",
                 "Selling (non-Merc)", "Buying (non-Merc)","Unknown")

# convert labels to descriptions
label_desc <- data.frame(
  label = 1:8,
  desc = description,
  stringsAsFactors = FALSE
  )

# add labels to msgs
test_train <- left_join( msgs, label_desc, by="label")


# summarize message counts
message_summary <- test_train %>% group_by(desc) %>%
  summarize( Unique = n(), Duplicate = sum(n) ) %>%
  tidyr::gather( 'Messages', 'count', Unique , Duplicate)

png(filename="message_count.png", width=640, height=480)
ggplot(message_summary, aes(x= desc, y = count, fill=Messages)) +
  geom_bar( stat='identity',position='dodge' ) + xlab("Message Class") +
  ylab("Count") +
  ggtitle("Message Count by Category")
dev.off()

From a quick plot of overall counts of each message type, the JP Merc group clearly stands out. Although the messages in this group are rarely unique, they account for most of the volume. This matches my observations within the game. Likewise, the next largest group, Merc, shows a similar trend of many repeated messages. These two groups are commonly considered spam by most players. Content and Chat messages, the ones most useful to most players, are quite different from the spammy messages in both uniqueness and quantity.

Unfortunately, these observations don’t help much, since capturing message frequency would violate one of our requirements for message scoring. Instead, we need to build features around message content.

Feature Building

Messages in FFXI are brief and organized. The interquartile range of unique message lengths within my training data set is 46 to 78 characters. This is more similar to tweets than to a blog post or a novel. The messages within each message class are also quite similar. Content messages usually mention the activity the player is trying to do, what jobs are needed to do this activity, and, if needed, the distribution of items from the activity. Likewise, ‘selling’ messages usually describe the item, the price, and a method to find or contact the individual selling the item.

Because of this brevity and organization, classifying distinct messages is fairly straightforward using n-grams and game-specific terminology. The n-grams capture the distinct structure of the message, and the game-specific terminology serves as a thesaurus to link together synonyms such as ‘paladin’, ‘pld’, and ‘tank’. Implementation is a bag-of-words type approach, where each n-gram or word occurrence in a message generates a feature.
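To make the encoding concrete, here is a tiny illustration. The three features below are examples I picked for demonstration only; the real feature set is derived from the training data in the code that follows.

# Illustrative only: bag-of-words style encoding of a single message.
example_msg <- "job points 500p/4m dho gates"
c(
  `job points` = grepl("job points", example_msg, ignore.case = TRUE),
  `dho gates`  = grepl("dho gates",  example_msg, ignore.case = TRUE),
  `merc`       = grepl("merc",       example_msg, ignore.case = TRUE)
)
# job points  dho gates       merc
#       TRUE       TRUE      FALSE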

Show R Code.

##############################################
## Do test-train split early, this avoids any 
# leakage against test 
##############################################
set.seed(100)

#Create Holdout 
# holdout 20% of the data
#0-4 k-folds (k=5)
folds <- 5

test_train <- test_train %>%
  group_by(desc) %>% dplyr::mutate( U = runif(n())) %>%
  dplyr::mutate( Rank=rank(U,ties.method='min')) %>%
  arrange(Rank)  %>%
  # account for repeat messages
  #dplyr::mutate( Rank=cumsum(n) ) %>%
  dplyr::mutate( partition=
    cut( Rank/max(Rank),
         breaks=c(0,0.2,0.36,0.52,0.68,0.84,1), labels=c('holdout',1:5) ) ) %>%
  ungroup()


## all variables created past this point are features
# we will exclude all current variables later on
remove_vars <- colnames(test_train)


##############################################
## here we use tidy text to generate n-grams
##############################################

# function to build out ngrams, and keep the top 10
top_n_gram <- function( x, ngram=2, top_n=10) {
  # get ngrams
  x_token <- x %>%
    unnest_tokens(ngram, msg, token = "ngrams", n = ngram)
  # subset ngrams to the most frequent
  x_count <- x_token %>% filter(!is.na(ngram)) %>% group_by(desc) %>%
    count(ngram,sort=TRUE) %>% slice_head(n=top_n) %>% ungroup()
  x_count
}

# build out groups
# identify top most frequent 10 ngrams
test_train_word <-
  top_n_gram(test_train %>% filter(partition != 'holdout'), ngram=1)
test_train_bigram <-
  top_n_gram(test_train %>% filter(partition != 'holdout'), ngram=2)
test_train_trigram <-
  top_n_gram(test_train %>% filter(partition != 'holdout'), ngram=3)




# grab our distinct words 
key_words <- test_train_word %>% select( ngram) %>% distinct() %>%
  filter( !is.na(ngram)) %>% unlist()
key_bigrams <- test_train_bigram %>% select( ngram) %>% distinct() %>%
  filter( !is.na(ngram)) %>% unlist()
key_trigrams <- test_train_trigram %>% select( ngram) %>% distinct() %>%
  filter( !is.na(ngram)) %>% unlist()


# turn key words into features
ngram_matrix <- matrix(0,nrow=NROW(test_train),ncol=length(c(key_trigrams,key_bigrams,key_words)))
colnames(ngram_matrix) <- c(key_trigrams,key_bigrams,key_words)

for( key_word in key_words ) {
  ngram_matrix[,key_word] <- grepl(key_word, test_train$msg, ignore.case = TRUE)
}
for( key_bigram in key_bigrams ) {
  ngram_matrix[,key_bigram] <- grepl(key_bigram, test_train$msg, ignore.case = TRUE)
}
for( key_trigram in key_trigrams ) {
  ngram_matrix[,key_trigram] <- grepl(key_trigram, test_train$msg, ignore.case = TRUE)
}

test_train <- cbind( test_train, ngram_matrix %>% as.data.frame())


#plot by frequency
png(filename="test_train_word.png", width=640, height=640)

test_train_word %>%
  arrange( desc(n)) %>%
  mutate( order_helper = factor(sprintf("%d::::%s",row_number(), ngram)) ) %>%
  ggplot(aes(x=reorder(order_helper,n), y=n, fill = desc)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~desc, ncol=2, scales = "free_y") +
  scale_x_discrete(name=NULL, labels=function(x) gsub('^.*:::', '', x))+
  labs(y = "Most Frequent Words",
  x = NULL) +
  ggtitle("10 Most Frequent Words") +
  coord_flip()
dev.off()

png(filename="test_train_bigram.png", width=640, height=640)
test_train_bigram %>%
  arrange( desc(n)) %>%
  mutate( order_helper = factor(sprintf("%d::::%s",row_number(), ngram)) ) %>%
  ggplot(aes(x=reorder(order_helper,n), y=n, fill = desc)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~desc, ncol=2, scales = "free_y") +
  scale_x_discrete(name=NULL, labels=function(x) gsub('^.*:::', '', x))+
  labs(y = "Most Frequent Bigrams",
  x = NULL) +
  ggtitle("10 Most Frequent Bigrams") +
  coord_flip()
dev.off()

png(filename="test_train_trigram.png", width=640, height=640)
test_train_trigram %>%
  arrange( desc(n)) %>%
  mutate( order_helper = factor(sprintf("%d::::%s",row_number(), ngram)) ) %>%
  ggplot(aes(x=reorder(order_helper,n), y=n, fill = desc)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~desc, ncol=2, scales = "free_y") +
  scale_x_discrete(name=NULL, labels=function(x) gsub('^.*:::', '', x))+
  labs(y = "Most Frequent Trigrams",
  x = NULL) +
  ggtitle("10 Most Frequent Trigrams") +
  coord_flip()
dev.off()

Using a little bit of tidytext, it is pretty easy to build out the n-gram features and see what the most common words and phrases are. Note that common words like articles are not removed; they actually turn out to be fairly informative, since they rarely show up in the spammy messages but do show up in the ‘Chat’ class.

Rare words from a TF-IDF approach don’t really help out here: they either exclude important and frequent words used to separate classes, or they just identify words that are already covered by the game-specific terms.

Show R Code.

# build out word frequencies
test_train_word_idf<- test_train %>% unnest_tokens(word, msg) %>%
  count(desc,word,sort=TRUE)
test_train_total_words_idf <- test_train_word_idf %>% group_by(desc) %>%
  summarize(total=sum(n))
test_train_word_idf <-
  left_join( test_train_word_idf, test_train_total_words_idf)
test_train_word_idf <- test_train_word_idf %>% bind_tf_idf(word, desc,n) %>%
  group_by(desc) %>% arrange(desc(tf_idf)) %>% slice_head(n=10) %>% ungroup()

png(filename="tf_idf.png", width=640, height=800)
test_train_word_idf %>%
  arrange( desc(tf_idf)) %>%
  mutate( order_helper = factor(sprintf("%d::::%s",row_number(), word)) ) %>%
  ggplot(aes(x=reorder(order_helper,tf_idf), y=tf_idf, fill = desc)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~desc, ncol=2, scales = "free_y") +
  scale_x_discrete(name=NULL, labels=function(x) gsub('^.*:::', '', x))+
  labs(y = "TF-IDF Words",
  x = NULL) +
  ggtitle("TF-IDF Words") +
  coord_flip()
dev.off()

I used my own personal experience from reviewing messages and being aware of messaging mechanics to build the game-specific terms. These link similar game terms together, which may be difficult to learn with a small data set like the one I have. To create features using these terms I simply use common pattern-matching strings in the form of regular expressions. These strings are portable between R and Lua, which makes it fairly straightforward to implement in both languages.

Note that matching isn’t perfect: substrings of words are captured along with the word. For example, ‘is’ would be matched if ‘this’ were part of a message. It is possible to fix this, but it would require evaluating the matching implementations in both R and Lua. Furthermore, there are some nuances with the FFXI auto-translate function that is used frequently by players: auto-translated words have the unfortunate consequence of having no spaces placed between them.
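As a quick sketch of what such a fix could look like on the R side (this is not done in this post, and the equivalent behaviour in Lua would still need to be verified), word boundaries avoid the substring match:

# 'is' is matched inside 'this' with a plain substring pattern
grepl("is", "sell this rod", ignore.case = TRUE)
# [1] TRUE

# anchoring with word boundaries only matches the standalone word
grepl("\\bis\\b", "sell this rod", ignore.case = TRUE)
# [1] FALSE
grepl("\\bis\\b", "this rod is cheap", ignore.case = TRUE)
# [1] TRUE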

Show R Code.

test_train <- test_train %>% dplyr::mutate(
`__gil` = as.numeric(grepl("[0-9]+(k|M|gil)",test_train$msg, ignore.case = TRUE) ),
`__currency` = as.numeric(grepl("(alexandrite|plouton|beitsu|riftborn|montiont|jadeshell|byne|bayld|heavy[ ]metal|hmp|hpb|riftcinder)",test_train$msg, ignore.case=TRUE)),
`__jobs` = as.numeric(grepl("(war|mnk|whm|rdm|blm|thf|pld|drk|bst|brd|rng|smn|sam|nin|drg|blu|cor|pup|dnc|sch|geo|run|warrior|monk|mage|theif|paladin|knight|beastmaster|bard|ranger|summoner|samurai|ninja|dragoon|corsair|puppet|dancer|scholar|geomancer|rune)",test_train$msg, ignore.case = TRUE) ),
`__roles` = as.numeric(grepl("(support|healer|tank|dd|melee|job)",test_train$msg, ignore.case = TRUE) ) ,
`__merc` = as.numeric(grepl("merc",test_train$msg, ignore.case = TRUE) ) ,
`__omen_big_drops` = as.numeric(grepl("(regal|dagon|udug|shamash|ashera|nisroch)",test_train$msg, ignore.case = TRUE) ) ,
`__omen_drops` = as.numeric(grepl("(utu|ammurapi|niqmaddu|shulmanu|dingir|yamarang|lugalbanda|ilabrat|enmerkar|iskur|sherida)",test_train$msg, ignore.case = TRUE) ) ,
`__mythic_merc` = as.numeric(grepl("(tinnin|tyger|sarameya)",test_train$msg, ignore.case = TRUE) ) ,
`__abyssea_merc` = as.numeric(grepl("(colorless|chloris|ulhuadshi|dragua|glavoid|itzpapalot|orthus|briareus|sobek|apademak|carabosse|cirein-croin|isgebind|fistule|bukhis|alfard|azdaja)",test_train$msg, ignore.case = TRUE) ) ,
`__htbm_merc` = as.numeric(grepl("(daybreak|sacro|malignance|lilith|odin|gere|freke|palug|hjarrandi|zantetsuken|geirrothr)",test_train$msg, ignore.case = TRUE) ) ,
`__bazaar_loc` = as.numeric(grepl("[(][a-z]-[1-9][)]",test_train$msg, ignore.case = TRUE) ) ,
`__bazaar_item` = as.numeric(grepl("(blured|blurred|raetic|voodoo|jinxed|vexed)",test_train$msg, ignore.case = TRUE) ) ,
`__dyna_item` = as.numeric(grepl("(voidhead|voidleg|voidhand|voidfeet|voidtorso|voidbody|beastman|kindred)",test_train$msg, ignore.case = TRUE) ) ,
`__job_points` = as.numeric(grepl("(job points|jobpoints|merit points|meritpoint|experiencepoints|experience points|^exp[ ]|[ ]exp[ ])",test_train$msg, ignore.case = TRUE) ),
`__power_level` = as.numeric(grepl("(^| )pl[ ]" ,test_train$msg, ignore.case = TRUE) ),
`__vagary_boss` = as.numeric(grepl("(vagary|perfidien|plouton|putraxia)" ,test_train$msg, ignore.case = TRUE) ),
`__aman_orbs` = as.numeric(grepl("(mars|venus)[ ]orb" ,test_train$msg, ignore.case = TRUE) ) ,
`__dynamis` = as.numeric(grepl("(dynamis|[d]|(d))" ,test_train$msg, ignore.case = TRUE) ) ,
`__content` = as.numeric(grepl("(omen|kei|kyou|kin|gin|fu|[ ]ou|^ou|ambuscade|[ ]sr|^sr)" ,test_train$msg, ignore.case = TRUE) ),
`__buy` = as.numeric(grepl("(buy[ ]|buy$|sell[?]|wtb|reward|price)" ,test_train$msg, ignore.case = TRUE) ), # I am trying to buy 
`__sell` = as.numeric(grepl("(sell[ ]|sell$|buy[?]|wts)" ,test_train$msg, ignore.case = TRUE) ), # I am trying to sell
`__social` = as.numeric(grepl("(linkshell|schedule|event|social|^ls[ ]|[ ]ls[ ]|concierge)" ,test_train$msg, ignore.case = TRUE) )
)

Model Building

Although a neural network approach may work better in theory, I don’t have a huge amount of data. I also have a set of features that are likely to work pretty well for more traditional models, so I went with XGBoost for an initial iteration, simply because it is fairly easy to interpret the results and extremely easy to score in other languages, even for multi-class models. This latter property is due to the raw model dumps that XGBoost provides for multi-class problems.

If you are not familiar with XGBoost, it’s a popular implementation of gradient boosted tree models with regularization. It fits a succession of simple models (trees) to create an ensemble (boosting). To avoid overfitting, regularization is used to limit the impact of trees that contribute little to the model. A good overview can be found on the XGBoost site.
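For reference (this equation is my addition and follows the standard formulation in the XGBoost documentation rather than anything specific to this post), the regularized objective being minimized is

\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2

where l is the loss (multi-class log loss here), the f_k are the individual trees, T is the number of leaves in a tree, w are its leaf weights, and \gamma and \lambda are the regularization parameters that penalize overly complex trees.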

Implementation of an XGBoost model is fairly straightforward. We already split our data early on, so we just need to build a parameter grid, train XGBoost on our training set to find our best parameters, and then see which features we want to keep.

To train, I just did a grid search with cross-validation to aid in robust parameter selection. The parameters we are evaluating the model over are a reduced set from a prior, more exhaustive run. With this subset, we still get pretty good performance and a reasonable run time, usually 1-2 minutes on modern computers.

Show R Code.

####################################
# find model parameters 
####################################

#grid to search
nrounds <- c(200)
colsample_bytrees <- c(.6,.8)
min_child_weights <- c(1)
max_depths = c(6,7,8)
etas = c(0.1)
subsamples = c(.8)
cv.nfold = 5
gams=c(0.01)


# create a place to store our results
summary_cv <- expand.grid(
  nrounds,
  max_depths,
  etas,
  subsamples,
  gams,
  colsample_bytrees,
  min_child_weights,
  Inf,0)

colnames(summary_cv) <- c("nround","max_depth","eta",'subsample',"gam",
                          "colsample_bytree","min_child_weight","min_log_loss",
                          "min_log_loss_index")
summary_cv <- as.data.frame(summary_cv)



## create training data set
# data set
train <- test_train[test_train$partition != 'holdout',
                    !colnames(test_train) %in% remove_vars ] %>% as.matrix()

# label (xgboost requires labels that start with 0)
train_label <- test_train %>% filter(partition != 'holdout') %>%
  select(label) %>% unlist() -1
# folds
train_partition <- test_train %>% filter(partition != 'holdout') %>%
  select(partition) %>% unlist() %>% as.character() %>% as.numeric()
train_folds <- list()
for( k in 1:max(train_partition)) {
  train_folds[[k]] <- which(train_partition == k)
}

# find best parameters
for( i in 1:NROW(summary_cv)) {

   cur_summary <- summary_cv[i,]

   param <- list(objective = "multi:softprob",
      eval_metric = "mlogloss",
      num_class = 8,
      max_depth = cur_summary$max_depth,
      eta = cur_summary$eta,
      gamma = cur_summary$gam,
      subsample = cur_summary$subsample,
      colsample_bytree = cur_summary$colsample_bytree,
      min_child_weight = cur_summary$min_child_weight,
      max_delta_step = 0
   )

   mdcv <- xgb.cv(
     data = train,
     label = train_label,
     params = param,
     nthread=8,
     folds=train_folds,
     nrounds=cur_summary$nround,
     verbose = FALSE)

   summary_cv$min_log_loss[i] <-
     min(mdcv$evaluation_log[,'test_mlogloss_mean'] %>% unlist())
   summary_cv$min_log_loss_index[i] <-
     which.min(mdcv$evaluation_log[,'test_mlogloss_mean'] %>% unlist())

}

best_model_cv <- summary_cv[which.min(summary_cv[,'min_log_loss']),]

With our best parameters, we now build a full model using all our training data. Based on this output, we pick which features to keep using a Gain threshold. Gain gives an idea of how important a variable is via the improvement in accuracy contributed by the splits that use it.

As an implementation note, if we were going to evaluate different gain value thresholds we would do this as part of the parameter selection step.

Show R Code.

####################################
# Train Model for Evaluation
####################################

# set best parameters
param <- list(objective = "multi:softprob",
              eval_metric = "mlogloss",
              num_class = 8,
              max_depth = best_model_cv['max_depth'],
              eta = best_model_cv['eta'],
              gamma = best_model_cv['gam'],
              subsample = best_model_cv['subsample'],
              colsample_bytree = best_model_cv['colsample_bytree'],
              min_child_weight = best_model_cv['min_child_weight'],
              max_delta_step = 0
)

best_model <- xgboost(
  data = train,
  label = train_label,
  params = param, nthread=8,
  nrounds=best_model_cv$nround,
  verbose = FALSE)


####################################
# Feature Selection
####################################

importance <- xgb.importance(feature_names = colnames(train), model = best_model)
top_features <- importance %>% filter( Gain >= 0.001) %>% select(Feature) %>% unlist()

best_model_reduced <- xgboost(
  data = train[,top_features],
  label = train_label,
  params = param, nthread=8,
  nrounds=best_model_cv$nround,
  verbose = FALSE)

Model     Log Loss
Full      0.0663
Reduced   0.0670

Using our gain threshold of 0.001, we end up with 77 out of 174 variables and a minor reduction in model performance. This is a pretty good trade off and will also lower the run time for model scoring.

Model Evaluation

Now that we have our model, we want to check two important model characteristics: (1) Does the model and its output make sense? (2) How well does the model perform?

On the first characteristic: models are helpful in expanding our knowledge and challenging our hypotheses, but if a model is keying in on variables in a nonsensical way, that is usually an indication that something has gone wrong.

Show R Code.

####################################
# Evaluate Model against Holdout 
####################################

## create test data set
# data set
test <- test_train[test_train$partition == 'holdout',
                   !colnames(test_train) %in% remove_vars ] %>% as.matrix()
# label (xgboost requires labels that start with 0)
test_label <- test_train %>% filter(partition == 'holdout') %>%
  select(label) %>% unlist() -1

# predict holdout
holdout_predict <- predict(best_model_reduced, test[,top_features])


## create data set for evaluation
# convert holdout prediction to something usable
pred <- matrix(holdout_predict, ncol=8, byrow=TRUE)
pred_label <- apply(pred,1,which.max)
colnames(pred) <- description
pred <- as.data.frame(pred)
pred$predicted_label = pred_label

# bring in test label (remember we need to add 1 back)
pred$actual_label =  test_label + 1

# add on message for evaluating misclassifications
pred$msg <-
 test_train %>% filter( partition == 'holdout') %>% select(msg) %>% unlist()

# add on repeat messages
pred$n <-
 test_train %>% filter( partition == 'holdout') %>% select(n) %>% unlist()


## take a look at feature importance
n_features <- 20
importance <- xgb.importance(feature_names = colnames(test), model = best_model)

png( filename="importance.png", width = 640, height= 640)
ggplot(importance %>% slice_head(n=n_features), aes(x=reorder(Feature,Gain),Gain)) +
  geom_col(fill="steelblue") + coord_flip() + ylab("Gain") +
  ggtitle("Gain from Top 20 Features")
dev.off()

The initial way I check whether a model makes sense is through feature importance. The features here should align with our understanding of the phenomenon we are modeling. If a feature is surprising, then we should investigate further how that feature relates to the response.

Here the most important variables are primarily the game-specific term features. This ranking makes sense since these features capture variations of similar terms that are present in almost all of these messages. They were also specifically created to relate to different types of content.

Specifically, ‘__job_points’ relates primarily to people selling job points (a sort of RPG leveling system). Likewise, ‘__jobs’ captures a large number of variations of specific jobs (think D&D classes) that would be requested in content messages.

Show R Code.

## get confusion matrix
confusionMatrix <- as.data.frame( table(
    label_desc$desc[pred$actual_label],
    label_desc$desc[pred$predicted_label]) )

png( filename="confusion.png", width = 640, height= 640)
ggplot(confusionMatrix, aes(x=Var1, y=Var2, fill=Freq)) + geom_tile() +
  geom_text(aes(label=Freq)) + xlab("Actual") + ylab("Predicted") +
  scale_fill_gradient(low="white", high="orange")
dev.off()

## get TPR and TNR
# calc per-class true positive and true negative rates
tpr <- pred %>% group_by( actual_label ) %>%
  dplyr::summarize(
    tpr=mean(actual_label == predicted_label)
  )
tnr <-c()
for( i in 1:NROW(tpr) ) {
  tnr[i] <- sum( (pred$predicted_label != i) & (pred$actual_label != i))/ sum( pred$actual_label != i )
}
tpr$tnr <- tnr
tpr$actual_label <- description

Messages            Accuracy
Unique              89%
Including repeats   95%

The second model characteristic, model performance, is evaluated through a confusion matrix, model accuracy, and per-class True Positive Rates (TPR) and True Negative Rates (TNR).

The confusion matrix shows generally good performance: the diagonal clearly dominates the off-diagonal misclassifications for all classes. Likewise, the model accuracy is 89% for unique messages, and 95% when accounting for duplicate or repeated messages.
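For completeness, both accuracy figures can be computed from the pred data frame built above (a quick sketch; this calculation is not shown in the original code):

# overall holdout accuracy on unique messages
mean(pred$predicted_label == pred$actual_label)

# overall holdout accuracy weighted by how often each message repeats
sum((pred$predicted_label == pred$actual_label) * pred$n) / sum(pred$n)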

Message Category TPR TNR
1 Content 0.91 0.97
2 RMT 1.00 1.00
3 Merc 0.89 0.96
4 JP Merc 0.99 0.99
5 Chat 0.81 0.98
6 Selling (non-Merc) 0.84 0.98
7 Buying (non-Merc) 0.67 0.99
8 Unknown 0.67 1.00

The per-class TPR and TNR rates also look fairly good, except for the categories ‘Unknown’ and ‘Buying (non-Merc)’. I’m not too concerned about the ‘Unknown’ category since it’s both fairly rare and a general catch-all for strange messages.

‘Buying (non-Merc)’ is a bit more of an issue. The primary separators between ‘Merc’ and the ‘(non-Merc)’ categories are the types of content or items that are being bought or sold. As an example, a Malignance Tabard is only obtained by doing specific content; it cannot be directly bought or sold by a player, therefore it is a merc item. But a Raetic Rod +1 can be directly bought or sold by a player, so it would be a ‘non-Merc’ item. The fix for this would be to create more game-specific features to capture these distinctions (a sketch of what that could look like follows), or just to label more messages so XGBoost can identify these items on its own.
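As a sketch of the first option, another feature could be added in the same style as the mutate block above. The name and the item list here are purely illustrative, not a vetted feature:

# Hypothetical extra feature to separate merc-only drops from tradable items.
test_train <- test_train %>% dplyr::mutate(
  `__merc_only_item` = as.numeric(grepl("(malignance|regal)", test_train$msg, ignore.case = TRUE))
)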

Model Deployment in Lua

To deploy the model, I need to get it out of R and into Lua. This is a five-step process:

  1. Dump the XGBoost model to a text format
  2. Read in the trees from the XGBoost dump
  3. Wait for a new Message
  4. Extract features from message
  5. Score the message

I have already written all the code to do this in a GitHub repo. It also includes a previous model dump from XGBoost.

The first step is extremely easy: we just grab our model and dump it to a text file. Note that at this point we retrain the model using all our data, to get the best possible fit.

Show R Code.

####################################
# Retrain on all data
####################################

best_model_full <- xgboost(
  data = rbind(test[,top_features], train[,top_features]),
  label = c(test_label,train_label),
  params = param, nthread=8,
  nrounds=best_model_cv$nround,
  verbose = FALSE)

####################################
# Dump Model 
####################################

# write for windower package
xgb.dump(best_model_full, fname='ffxi_spam_model.txt', dump_format="text")

# csv to identify key variables
write.csv( test_train[1:2,top_features] ,'example.csv',row.names = FALSE)

The dump looks something like this (note this is from an earlier version of the same model):

booster[0]
0:[f11<0.5] yes=1,no=2,missing=1
  1:[f194<0.5] yes=3,no=4,missing=3
    3:[f211<0.5] yes=5,no=6,missing=5
      5:[f215<0.5] yes=7,no=8,missing=7
        7:[f64<0.5] yes=11,no=12,missing=11
          11:[f197<0.5] yes=15,no=16,missing=15
            15:[f76<0.5] yes=23,no=24,missing=23
              23:[f210<0.5] yes=27,no=28,missing=27
                27:leaf=-0.0266666692
                28:leaf=-0.00392156886
              24:[f210<0.5] yes=29,no=30,missing=29
                29:leaf=-0.0129032256
                30:leaf=-0.0380228125
            16:leaf=0.11343284
          12:[f129<0.5] yes=17,no=18,missing=17
            17:[f195<0.5] yes=25,no=26,missing=25
              25:[f201<0.5] yes=31,no=32,missing=31
                31:leaf=0.0990595669
                32:leaf=-0.0298507456
              26:leaf=-0.037894737
            18:leaf=0.259561151
        8:leaf=-0.0559531562
      6:[f215<0.5] yes=9,no=10,missing=9
        9:[f210<0.5] yes=13,no=14,missing=13
          13:[f129<0.5] yes=19,no=20,missing=19
            19:leaf=0.164210528
            20:leaf=-0.00784313772
          14:[f129<0.5] yes=21,no=22,missing=21
            21:leaf=0.189473689
            22:leaf=0.323467851
        10:leaf=-0.0519774035
    4:leaf=-0.0562807731
  2:leaf=0.366163135
...

This is in fact a single tree. Given that we ran 200 iterations and there are eight categories to classify into, that gives us 1,600 trees that we need to evaluate in Lua. This may seem a little daunting, but since these are trees, it is fairly simple to write a recursive algorithm that parses these structures. To do this, note that each line of the XGBoost dump is one of three possible line types describing the tree structure:

  1. booster[m] denotes the ‘m’-th tree.
  2. k:[f(xxx) < t] yes=h,no=i,missing=j is a branch of the tree.
    • k is the index of the node within this tree.
    • [f(xxx) < t] is the inequality that determines whether we go to ‘yes’; f(xxx) is the feature, numbered by its position in the input data set starting at 0, and t is the value the feature is compared against.
    • h, i, and j are the indexes of the nodes to move to, given the state of the inequality (or a missing value) for the row being evaluated.
  3. i:leaf=s is a leaf node with index i and log-odds value s.

In the example Lua code, parsing of the tree is accomplished by the read_booster function in xgboost_score.lua. It simply creates an implicit tree structure using Lua arrays, where each element of the array aligns with the node index in the tree. I add one to each index of both the features and the nodes, since Lua indexes from 1 while XGBoost (written in C++) indexes from 0. Each element also contains the following components: for a branch, the feature index, the lt_value it is compared against, and the yes_node, no_node, and missing_node indexes to jump to; for a leaf, just the leaf log-odds value.

This is duplicated for all 1,600 trees and internally consumes no more than 8MB of data.
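
If you want to sanity-check the dump outside of Lua, here is a rough R sketch of the same parsing idea. It mirrors what read_booster does, including the shift to 1-based indexing, but it is only illustrative and not the code from the repo.

## Rough, illustrative R parser for one tree of the XGBoost text dump.
parse_tree <- function(lines) {
  tree <- list()
  branch_re <- "(\\d+):\\[f(\\d+)<([-0-9.eE]+)\\] yes=(\\d+),no=(\\d+),missing=(\\d+)"
  leaf_re   <- "(\\d+):leaf=([-0-9.eE]+)"
  for (line in lines) {
    b <- regmatches(line, regexec(branch_re, line))[[1]]
    l <- regmatches(line, regexec(leaf_re, line))[[1]]
    if (length(b) > 0) {
      node <- as.integer(b[2]) + 1                 # shift node index to 1-based
      tree[[node]] <- list(feature      = as.integer(b[3]) + 1,
                           lt_value     = as.numeric(b[4]),
                           yes_node     = as.integer(b[5]) + 1,
                           no_node      = as.integer(b[6]) + 1,
                           missing_node = as.integer(b[7]) + 1)
    } else if (length(l) > 0) {
      tree[[as.integer(l[2]) + 1]] <- list(leaf = as.numeric(l[3]))
    }
  }
  tree
}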

When a new message comes in, we need to extract features from it that map to the features created in R. To do this we just use the regular-expression library that windower provides for Lua. Here our new message is passed in as clean_text, after removing non-ASCII text and translating all auto-translate functions to English. As an example, this is the __gil feature we created earlier; it uses the exact same regular expression to achieve a match.

if windower.regex.match(clean_text, "[0-9]+(k|m|gil)") ~= nil then 
  eval_features[i] = 1 
end

Our next step is scoring the incoming message based on the features we just generated. As it turned out, this step was one of the harder bits of this project, not because of any technical difficulty, but because of figuring out how XGBoost actually creates the score.

When a new message comes in, its features are extracted, and we then walk each tree belonging to a given message category to figure out what score the message gets for each category.

The lookup step is done through a recursive function I created called eval_tree. It simply iterates over a given tree using the structure we created while reading in the original tree. It first checks whether the given node in the tree is a branch or a leaf. If we are at a leaf, we are done and return its log-odds value; otherwise, we evaluate whether the feature extracted from the message has a value less than the stored lt_value from the XGBoost dump. If it does, we go to the yes_node; if the value is missing, we follow the default, which is equivalent to the yes_node; otherwise, we go to the no_node. A recursive call then evaluates the selected node.

eval_tree = function( cur_node, booster, eval)

   -- Is this a branch, branches have a conditional
   if booster[cur_node].lt_value ~= nil then
   
     -- this is a branch, let's check what direction to go
     if( eval[ booster[cur_node].feature ] < booster[cur_node].lt_value ) then
       cur_node = booster[cur_node].yes_node
     elseif( eval[ booster[cur_node].feature ] == 0 ) then
       cur_node = booster[cur_node].missing_node
     else
       cur_node = booster[cur_node].no_node
     end
     
   else
     return( booster[cur_node].leaf)
   end

   return( eval_tree(cur_node, booster, eval) )

end

Remember that each of our 1,600 trees belongs to one of the eight message categories, so for every message we get 200 leaf values for each category. These values, which are log-odds, are summed and added to a default value of 0.5. This process is implemented in the eval_phrase function in my Lua code.

--parse tree
eval_phrase = function( value, booster,classes )  
  xgboost_class = 0 
  score={}
  for i = 1,classes do
    score[i] = 0.5
  end

  for i = 1,table.getn(booster) do
    xgboost_class = xgboost_class + 1
    
    -- iterate over all classes
    if xgboost_class > classes then 
      xgboost_class = 1
    end
    score[xgboost_class] = score[xgboost_class] + eval_tree( 1, booster[i], value) 

  end

  -- combine score
  sum_all = 0
  for i = 1,classes do
    sum_all = sum_all + math.exp(score[i])
  end
  for i = 1,classes do
    score[i] = math.exp(score[i])/sum_all
  end

  return(score)
end

The 0.5 in the log-odds sum was a bit of a mystery to figure out; I had to search through all of the XGBoost documentation and ended up diving into the source code to work out why the sum of log-odds was not giving me the right value. However, it makes sense, since it provides a prior that dampens extreme differences in log-odds when there are few informative trees.

The final step is to use the softmax function to determine the probability that a given message belongs to a particular class. Here, I’ve just let the sum of log-odds for message category $g$ be $S_g$. Finally, to end up with our probability estimate, we feed this sum plus our prior into the softmax function:

$$P_g = \frac{\exp(0.5 + S_g)}{\sum_{g}^{G} \exp(0.5 + S_g)}$$

Now we have probabilities for each class of our message, and can block the message types we don’t want to see. The result can be seen below where the message screen on the left is fairly free of spammy messages (Necessary), and the screen on the right has primarily spammy messages (Sufficient).

Conclusion

Through this blog post we have been on quite a journey. We started off with a problem of message spam, used our ML insights to come up with a quick model, and produced a meaningful solution. The model attains a 95% accuracy rate for all messages, and fairly good TPR and TNR rates on a per-category basis. A few more features could probably be generated to improve performance, but this serves as an excellent initial solution.

From my GitHub stats, it looks like there have been over 200 users of this plugin as of December 1st, and hopefully more as I push out new features like automatic model updating.

Again, happy holidays and let’s see your holiday ML projects!

All FFXI content and images © 2002-2020 SQUARE ENIX CO., LTD.
FINAL FANTASY is a registered trademark of SQUARE ENIX CO., LTD.
