Full-Text Stopwords
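The stopword list is loaded and searched for full-text queries using the server character set and collation (the values of the character_set_server and collation_server system variables). If the stopword file or the columns used for full-text indexing or searches have a character set or collation different from character_set_server or collation_server, stopword lookups might produce false hits or misses.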
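Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case-insensitive if the collation is utf8mb4_0900_ai_ci, whereas lookups are case-sensitive if the collation is utf8mb4_0900_as_cs or utf8mb4_bin.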
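To check for the kind of mismatch described above, one option is to compare the server-level settings with those of the indexed columns. The following is a minimal sketch; the schema name test and the table name opening_lines are placeholders, so substitute your own.

```sql
-- Server character set and collation used for stopword lookups.
SELECT @@character_set_server, @@collation_server;

-- Character set and collation of the character columns in one table.
-- 'test' and 'opening_lines' are placeholder names for illustration.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'test'
  AND TABLE_NAME = 'opening_lines'
  AND CHARACTER_SET_NAME IS NOT NULL;
```

If the two results differ, stopword matching for that table may not behave as expected.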
- Stopwords for InnoDB Search Indexes
- Stopwords for MyISAM Search Indexes
Stopwords for InnoDB Search Indexes
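InnoDB has a relatively short list of default stopwords, because documents from technical, literary, and other sources often use short words as keywords or in significant phrases. For example, you might search for "to be or not to be" and expect to get a sensible result, rather than having all those words ignored.
To see the list of default InnoDB stopwords, query the INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD table.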
mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
+-------+
| value |
+-------+
| a     |
| about |
| an    |
| are   |
| as    |
| at    |
| be    |
| by    |
| com   |
| de    |
| en    |
| for   |
| from  |
| how   |
| i     |
| in    |
| is    |
| it    |
| la    |
| of    |
| on    |
| or    |
| that  |
| the   |
| this  |
| to    |
| was   |
| what  |
| when  |
| where |
| who   |
| will  |
| with  |
| und   |
| the   |
| www   |
+-------+
36 rows in set (0.00 sec)
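To define your own stopword list for all InnoDB tables, define a table with the same structure as the INNODB_FT_DEFAULT_STOPWORD table, populate it with stopwords, and set the value of the innodb_ft_server_stopword_table option to a value in the form db_name/table_name before creating the full-text index. The stopword table must have a single VARCHAR column named value. The following example demonstrates creating and configuring a new global stopword table for InnoDB.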
-- Create a new stopword table
mysql> CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
Query OK, 0 rows affected (0.01 sec)

-- Insert stopwords (for simplicity, a single stopword is used in this example)
mysql> INSERT INTO my_stopwords(value) VALUES ('Ishmael');
Query OK, 1 row affected (0.00 sec)

-- Create the table
mysql> CREATE TABLE opening_lines (
         id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
         opening_line TEXT(500),
         author VARCHAR(200),
         title VARCHAR(200)
       ) ENGINE = InnoDB;
Query OK, 0 rows affected (0.01 sec)

-- Insert data into the table
mysql> INSERT INTO opening_lines(opening_line, author, title) VALUES
         ('Call me Ishmael.','Herman Melville','Moby-Dick'),
         ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
         ('I am an invisible man.','Ralph Ellison','Invisible Man'),
         ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
         ('It was love at first sight.','Joseph Heller','Catch-22'),
         ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
         ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
         ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

-- Set the innodb_ft_server_stopword_table option to the new stopword table
mysql> SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';
Query OK, 0 rows affected (0.00 sec)

-- Create the full-text index (which rebuilds the table if no FTS_DOC_ID column is defined)
mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (1.17 sec)
Records: 0  Duplicates: 0  Warnings: 1
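Verify that the specified stopword ('Ishmael') does not appear by querying the words in INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE.
Note: By default, words less than 3 characters in length or greater than 84 characters in length do not appear in an InnoDB full-text search index. The maximum and minimum word length values are configurable using the innodb_ft_max_token_size and innodb_ft_min_token_size variables. This default behavior does not apply to the ngram parser plugin; ngram token size is defined by the ngram_token_size option.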
mysql> SET GLOBAL innodb_ft_aux_table='test/opening_lines';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE LIMIT 15;
+-----------+
| word      |
+-----------+
| across    |
| all       |
| burn      |
| buy       |
| call      |
| comes     |
| dalloway  |
| first     |
| flowers   |
| happened  |
| herself   |
| invisible |
| less      |
| love      |
| man       |
+-----------+
15 rows in set (0.00 sec)
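Scanning the first 15 rows confirms the stopword is absent only by inspection. A more direct check, sketched below, filters on the word itself and then runs a full-text search for it. It assumes innodb_ft_aux_table is still set to test/opening_lines as above, and that indexed words are stored in lowercase, as the sample output suggests.

```sql
-- Expect an empty set if 'Ishmael' was excluded as a stopword.
SELECT word, doc_id, position
FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE
WHERE word = 'ishmael';

-- A natural-language search for the stopword should likewise match no rows.
SELECT id, opening_line
FROM opening_lines
WHERE MATCH(opening_line) AGAINST('Ishmael');
```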
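To create stopword lists on a table-by-table basis, create additional stopword tables and use the innodb_ft_user_stopword_table option to specify the stopword table to use before you create the full-text index.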
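As an illustration of the per-table approach, here is a minimal sketch. The names my_user_stopwords, quotes, and idx_quote are hypothetical, the test database is assumed from the earlier example, and the sketch assumes innodb_ft_user_stopword_table can be set at session scope.

```sql
-- Hypothetical per-table stopword list: a single VARCHAR column named value.
CREATE TABLE my_user_stopwords (value VARCHAR(30)) ENGINE = InnoDB;
INSERT INTO my_user_stopwords (value) VALUES ('pleasure');

-- Select the list for this session, using the db_name/table_name form.
SET SESSION innodb_ft_user_stopword_table = 'test/my_user_stopwords';

-- Hypothetical table; the index created now uses the list selected above.
CREATE TABLE quotes (
  id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  quote TEXT
) ENGINE = InnoDB;

CREATE FULLTEXT INDEX idx_quote ON quotes (quote);
```

The variable is set before the index is created because the stopword list is applied when the index is built.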
Stopwords for MyISAM Search Indexes
The stopword file is loaded and searched using latin1 if character_set_server is ucs2, utf16, utf16le, or utf32.
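To override the default stopword list for MyISAM tables, set the ft_stopword_file system variable. (See "Server System Variables".) The variable value should be the path name of the file containing the stopword list, or the empty string to disable stopword filtering. The server looks for the file in the data directory unless an absolute path name is given to specify a different directory. After changing the value of this variable or the contents of the stopword file, restart the server and rebuild your FULLTEXT indexes.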
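The procedure above might look like the following sketch. The file path and the table name articles are hypothetical. Because ft_stopword_file cannot be changed at runtime, the setting appears only as a comment describing the option file.

```sql
-- Check which stopword file the server is currently using.
SHOW VARIABLES LIKE 'ft_stopword_file';

-- ft_stopword_file is not dynamic. Set it in the server option file, e.g.
--   [mysqld]
--   ft_stopword_file = '/path/to/my_stopwords.txt'   (hypothetical path)
-- and then restart the server.

-- After the restart, rebuild the FULLTEXT indexes of each affected
-- MyISAM table (articles is a hypothetical table name).
REPAIR TABLE articles QUICK;
```

REPAIR TABLE with the QUICK option rebuilds only the index file, which is usually sufficient here.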
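The stopword list is free-form: stopwords are separated by any nonalphanumeric character, such as a newline, space, or comma. The exceptions are the underscore character (_) and a single apostrophe ('), which are treated as part of a word. The character set of the stopword list is the server's default character set; see "Server Character Set and Collation".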
The following list shows the default stopwords for MyISAM search indexes. In a MySQL source distribution, you can find this list in the storage/myisam/ft_static.c file.
a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by c'mon c's came can can't cannot cant cause causes certain certainly changes clearly co com come comes concerning consequently consider considering contain containing contains corresponding could couldn't course currently definitely described despite did didn't different do does doesn't doing don't done down downwards during each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except far few fifth first five followed following follows for former formerly forth four from further furthermore get gets getting given gives go goes going gone got gotten greetings had hadn't happens hardly has hasn't have haven't having he he's hello help hence her here here's hereafter hereby herein hereupon hers herself hi him himself his hither hopefully how howbeit however i'd i'll i'm i've ie if ignored immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isn't it it'd it'll it's its itself just keep keeps kept know known knows last lately later latter latterly least less lest let let's like liked likely little look looking looks ltd mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere obviously of off often oh ok okay old on once one ones 
only onto or other others otherwise ought our ours ourselves out outside over overall own particular particularly per perhaps placed please plus possible presumably probably provides que quite qv rather rd re really reasonably regarding regardless regards relatively respectively right said same saw say saying says second secondly see seeing seem seemed seeming seems seen self selves sensible sent serious seriously seven several shall she should shouldn't since six so some somebody somehow someone something sometime sometimes somewhat somewhere soon sorry specified specify specifying still sub such sup sure t's take taken tell tends th than thank thanks thanx that that's thats the their theirs them themselves then thence there there's thereafter thereby therefore therein theres thereupon these they they'd they'll they're they've think third this thorough thoroughly those though three through throughout thru thus to together too took toward towards tried tries truly try trying twice two un under unfortunately unless unlikely until unto up upon us use used useful uses using usually value various very via viz vs want wants was wasn't way we we'd we'll we're we've welcome well went were weren't what what's whatever when whence whenever where where's whereafter whereas whereby wherein whereupon wherever whether which while whither who who's whoever whole whom whose why will willing wish with within without won't wonder would wouldn't yes yet you you'd you'll you're you've your yours yourself yourselves zero