Topics
Language model
A probability distribution over word sequences.
- The idea already appears in Claude Shannon's 1948 paper that founded information theory.
- The field advanced dramatically with the arrival of large language models (LLMs).
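To make "a probability distribution over word sequences" concrete, here is a minimal sketch of a bigram model estimated by simple counting. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions made for illustration; this is not how LLMs are built, only the smallest example of assigning a probability to a sequence.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent pairs
    return contexts, bigrams

def sequence_probability(tokens, contexts, bigrams):
    """P(w_1..w_n) as a product of maximum-likelihood bigram probabilities."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no smoothing in this sketch
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob

# Toy corpus (hypothetical data)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
contexts, bigrams = train_bigram(corpus)
p_seen = sequence_probability(["the", "cat", "sat"], contexts, bigrams)
p_unseen = sequence_probability(["the", "dog", "ran"], contexts, bigrams)
```

Without smoothing, any sequence containing an unseen bigram gets probability 0, which is one reason practical models use smoothing or neural estimators.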
Language modeling
The methods used to build a language model.
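Morphological analysis
A preprocessing stage for natural language processing. It mainly consists of the following steps.
- Cleaning
  - Remove noise from the text, such as HTML tags and symbols.
- Sentence segmentation
  - Detect the boundaries between sentences and split the text there.
  - Sentence boundary analysis
- Tokenization (word segmentation)
  - Split a sentence into a sequence of words.
- Normalization
  - Unify full-width and half-width characters, uppercase and lowercase letters, and so on.
- Stopword removal (noise removal)
  - Remove words that are not needed for the task at hand.
- Vector representation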
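The steps above can be sketched with the Python standard library alone. The stopword list, the sample input, and all function names are assumptions for illustration. The regex tokenizer only works for whitespace-delimited languages; Japanese text would need a morphological analyzer such as MeCab for that step.

```python
import re
import unicodedata

STOPWORDS = {"the", "a", "is", "of"}  # hypothetical, task-specific list

def clean(text):
    """Cleaning: strip HTML tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def normalize(text):
    """Normalization: NFKC folds full-width/half-width forms; lower() folds case."""
    return unicodedata.normalize("NFKC", text).lower()

def split_sentences(text):
    """Sentence segmentation: naive split after ., !, ? or the Japanese full stop."""
    parts = re.split(r"(?<=[.!?。])\s*", text)
    return [p for p in parts if p]

def tokenize(sentence):
    """Tokenization: naive word split (stand-in for a real tokenizer)."""
    return re.findall(r"\w+", sentence)

def preprocess(raw):
    """Run cleaning, normalization, segmentation, tokenization, stopword removal."""
    text = normalize(clean(raw))
    return [
        [tok for tok in tokenize(sentence) if tok not in STOPWORDS]
        for sentence in split_sentences(text)
    ]

# Sample input mixing HTML tags and full-width characters (hypothetical)
raw = "<p>The CAT sat.</p><p>NLP is fun!</p>"
result = preprocess(raw)
```

Normalizing before sentence splitting means full-width punctuation such as `!` has already been folded to its ASCII form when the splitter runs.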
Embedding
A technique for converting text into vectors.
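As a minimal sketch of turning text into vectors, the example below uses bag-of-words counts and cosine similarity. This is a simple count-based stand-in chosen for illustration, not a learned embedding model. The documents and function names are hypothetical.

```python
import math
from collections import Counter

def build_vocab(docs):
    """Sorted list of all distinct tokens; the index in this list is the vector dimension."""
    return sorted({tok for doc in docs for tok in doc})

def vectorize(tokens, vocab):
    """Map a token list to a count vector over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Already-tokenized toy documents (hypothetical data)
docs = [["cat", "sat", "mat"], ["cat", "sat", "sofa"], ["dog", "ran"]]
vocab = build_vocab(docs)
vectors = [vectorize(doc, vocab) for doc in docs]
sim_close = cosine(vectors[0], vectors[1])  # share "cat" and "sat"
sim_far = cosine(vectors[0], vectors[2])    # share no words
```

Learned embeddings differ in that texts with similar meaning can end up close together even when they share no words, which count vectors like these cannot capture.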
English-Japanese term correspondences
- 前処理: preprocessing
Sentence segmentation
- GiNZA: an open-source Japanese natural language processing library | Recruit AI research institute
- I made a library that splits Japanese text into sentences nicely - Qiita
- Japanese natural language processing with MeCab