htmlspecialchars()

(PHP 4, PHP 5, PHP 7)

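Convert special characters to HTML entities

Description

htmlspecialchars(string $string [, int $flags = ENT_COMPAT | ENT_HTML401 [, string $encoding = ini_get("default_charset") [, bool $double_encode = TRUE ]]]) : string

Certain characters have special significance in HTML and should be represented by HTML entities if they are to preserve their meanings. This function returns a string with these conversions made. If you require all input substrings that have associated named entities to be translated, use htmlentities() instead.

If the input string passed to this function and the final document share the same character set, this function is sufficient to prepare input for inclusion in most contexts of an HTML document. If, however, the input can represent characters that are not coded in the final document character set and you wish to retain those characters (as numeric or named entities), both this function and htmlentities() (which only encodes substrings that have named entity equivalents) may be insufficient. You may have to use mb_encode_numericentity() instead.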
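
As a rough illustration of the last point, the sketch below escapes the markup characters with htmlspecialchars() and then turns every non-ASCII character into a numeric entity with mb_encode_numericentity(), so the result is plain ASCII. It assumes the mbstring extension is loaded; the sample string and the conversion map (all code points from U+0080 upward) are illustrative choices, not something the manual prescribes.

```php
<?php
// "café <b>" written with explicit UTF-8 bytes so the source file encoding does not matter.
$input = "caf\xC3\xA9 <b>";

// Step 1: escape the characters that are special in HTML.
$escaped = htmlspecialchars($input, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// Step 2: convert every code point from U+0080 to U+10FFFF into a numeric entity.
// Conversion map layout: start code, end code, offset, mask.
$convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$ascii = mb_encode_numericentity($escaped, $convmap, 'UTF-8');

echo $ascii, "\n"; // caf&#233; &lt;b&gt;
```

Running htmlspecialchars() first matters here: if the order were reversed, the ampersands of the numeric entities would themselves be escaped.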

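**Performed translations**
Character	Replacement
& (ampersand)	&amp;
" (double quote)	&quot;, unless `ENT_NOQUOTES` is set
' (single quote)	&#039; (for `ENT_HTML401`) or &apos; (for `ENT_XML1`, `ENT_XHTML` or `ENT_HTML5`), but only when `ENT_QUOTES` is set
< (less than)	&lt;
> (greater than)	&gt;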

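Parameters

$string

The string being converted.

$flags

A bitmask of one or more of the following flags, which specify how to handle quotes, invalid code unit sequences and the document type used. The default is ENT_COMPAT | ENT_HTML401.

**Available $flags constants**
Constant name	Description
`ENT_COMPAT`	Converts double quotes and leaves single quotes alone.
`ENT_QUOTES`	Converts both double and single quotes.
`ENT_NOQUOTES`	Leaves both double and single quotes unconverted.
`ENT_IGNORE`	Silently discards invalid code unit sequences instead of returning an empty string. Using this flag is discouraged because it may have security implications.
`ENT_SUBSTITUTE`	Replaces invalid code unit sequences with a Unicode Replacement Character, U+FFFD (UTF-8) or &#xFFFD; (otherwise), instead of returning an empty string.
`ENT_DISALLOWED`	Replaces code points that are invalid for the given document type with a Unicode Replacement Character, U+FFFD (UTF-8) or &#xFFFD; (otherwise), instead of leaving them as is. This may be useful, for instance, to ensure the well-formedness of XML documents with embedded external content.
`ENT_HTML401`	Handle code as HTML 4.01.
`ENT_XML1`	Handle code as XML 1.
`ENT_XHTML`	Handle code as XHTML.
`ENT_HTML5`	Handle code as HTML 5.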
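
The effect of the quote flags and the document-type flags can be seen side by side in a small sketch. The input string is made up for illustration, and the flags and encoding are passed explicitly so that the result does not depend on the defaults of the PHP version in use (ENT_HTML5 requires PHP 5.4 or later).

```php
<?php
// Illustrative input containing all five special characters.
$input = "<a href=\"x\" title='y'>Tom & Jerry</a>";

$compat   = htmlspecialchars($input, ENT_COMPAT   | ENT_HTML401, 'UTF-8');
$quotes   = htmlspecialchars($input, ENT_QUOTES   | ENT_HTML401, 'UTF-8');
$noquotes = htmlspecialchars($input, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');
$html5    = htmlspecialchars($input, ENT_QUOTES   | ENT_HTML5,   'UTF-8');

echo $compat, "\n";   // &lt;a href=&quot;x&quot; title='y'&gt;Tom &amp; Jerry&lt;/a&gt;
echo $quotes, "\n";   // &lt;a href=&quot;x&quot; title=&#039;y&#039;&gt;Tom &amp; Jerry&lt;/a&gt;
echo $noquotes, "\n"; // &lt;a href="x" title='y'&gt;Tom &amp; Jerry&lt;/a&gt;
echo $html5, "\n";    // &lt;a href=&quot;x&quot; title=&apos;y&apos;&gt;Tom &amp; Jerry&lt;/a&gt;
```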

$encoding

An optional argument defining the encoding used when converting characters.

If omitted, the default value of $encoding varies depending on the PHP version in use. In PHP 5.6 and later, the default_charset configuration option is used as the default value. PHP 5.4 and 5.5 use UTF-8 as the default. Earlier versions of PHP use ISO-8859-1.

Although this argument is technically optional, you are highly encouraged to specify the correct value for your code if you are using PHP 5.5 or earlier, or if your default_charset configuration option may be set incorrectly for the given input.

For the purposes of this function, the encodings ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R are effectively equivalent, provided the $string itself is valid for the encoding, as the characters affected by htmlspecialchars() occupy the same positions in all of these encodings.

The following character sets are supported:

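**Supported charsets**
Charset	Aliases	Description
ISO-8859-1	ISO8859-1	Western European, Latin-1.
ISO-8859-5	ISO8859-5	Little used Cyrillic charset (Latin/Cyrillic).
ISO-8859-15	ISO8859-15	Western European, Latin-9. Adds the Euro sign and the French and Finnish letters missing in Latin-1 (ISO-8859-1).
UTF-8		ASCII compatible multi-byte 8-bit Unicode.
cp866	ibm866, 866	DOS-specific Cyrillic charset. Supported since PHP 4.3.2.
cp1251	Windows-1251, win-1251, 1251	Windows-specific Cyrillic charset. Supported since PHP 4.3.2.
cp1252	Windows-1252, 1252	Windows-specific charset for Western European.
KOI8-R	koi8-ru, koi8r	Russian. Supported since PHP 4.3.2.
BIG5	950	Traditional Chinese, mainly used in Taiwan.
GB2312	936	Simplified Chinese, national standard character set.
BIG5-HKSCS		Big5 with Hong Kong extensions, Traditional Chinese.
Shift_JIS	SJIS, 932	Japanese
EUC-JP	EUCJP	Japanese
MacRoman		Charset that was used by Mac OS.
''		An empty string activates detection from script encoding (Zend multibyte), default_charset and current locale (see nl_langinfo() and setlocale()), in this order. Not recommended.

Note: Any other character sets are not recognized. The default encoding will be used instead and a warning will be emitted.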

$double_encode

When $double_encode is turned off, PHP will not encode existing HTML entities. The default is to convert everything.
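
A short sketch of the difference, using a made-up string that already contains one entity next to a raw ampersand and a tag:

```php
<?php
// The input already contains "&amp;" plus a bare "&" and a tag.
$text = 'Fish &amp; chips & <b>vinegar</b>';

// Default ($double_encode = true): the existing entity is encoded again.
$twice = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// $double_encode = false: the existing entity is left alone, everything else is still escaped.
$once = htmlspecialchars($text, ENT_QUOTES, 'UTF-8', false);

echo $twice, "\n"; // Fish &amp;amp; chips &amp; &lt;b&gt;vinegar&lt;/b&gt;
echo $once, "\n";  // Fish &amp; chips &amp; &lt;b&gt;vinegar&lt;/b&gt;
```

Turning the option off is mainly useful for text that may have been escaped once already; for untrusted raw input the default is usually the safer choice, since it never interprets anything in the input as an entity.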

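Return Values

The converted string.

If the input $string contains an invalid code unit sequence within the given $encoding, an empty string will be returned, unless either the ENT_IGNORE or ENT_SUBSTITUTE flag is set.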
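
The empty-string behaviour is easy to trigger by accident, so here is a minimal sketch of it. The sample is the word "café" as ISO-8859-1 bytes, which is not valid UTF-8; the variable names are illustrative, and the flags need PHP 5.4 or later.

```php
<?php
// "café" encoded as ISO-8859-1: the lone byte 0xE9 is not a valid UTF-8 sequence.
$latin1 = "caf\xE9";

// Declared as UTF-8 without a recovery flag: the whole result is an empty string.
$strict = htmlspecialchars($latin1, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// ENT_SUBSTITUTE replaces the bad byte with U+FFFD (bytes EF BF BD in UTF-8).
$substituted = htmlspecialchars($latin1, ENT_COMPAT | ENT_HTML401 | ENT_SUBSTITUTE, 'UTF-8');

// ENT_IGNORE silently drops the bad byte (discouraged, see the flags table).
$ignored = htmlspecialchars($latin1, ENT_COMPAT | ENT_HTML401 | ENT_IGNORE, 'UTF-8');

// Declaring the encoding the data really uses keeps the byte untouched.
$declared = htmlspecialchars($latin1, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1');

var_dump($strict);               // string(0) ""
var_dump(bin2hex($substituted)); // string(12) "636166efbfbd"
var_dump($ignored);              // string(3) "caf"
var_dump(bin2hex($declared));    // string(8) "636166e9"
```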

Changelog

Version	Description
5.6.0	The default value for the $encoding parameter was changed to be the value of the default_charset configuration option.
5.4.0	The default value for the $encoding parameter was changed to UTF-8.
5.4.0	The constants `ENT_SUBSTITUTE`, `ENT_DISALLOWED`, `ENT_HTML401`, `ENT_XML1`, `ENT_XHTML` and `ENT_HTML5` were added.
5.3.0	The constant `ENT_IGNORE` was added.
5.2.3	The $double_encode parameter was added.

Examples

Example #1 htmlspecialchars() example

<?php
$new = htmlspecialchars("<a href='test'>Test</a>", ENT_QUOTES);
echo $new; // &lt;a href=&#039;test&#039;&gt;Test&lt;/a&gt;
?>

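Notes

Note:
Note that this function does not translate anything beyond what is listed above. For full entity translation, see htmlentities().

Note:
In case of an ambiguous $flags value, the following rules apply:
When none of ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES is present, the default is ENT_COMPAT.
When more than one of ENT_COMPAT, ENT_QUOTES, ENT_NOQUOTES is present, ENT_QUOTES takes the highest precedence, followed by ENT_COMPAT.
When none of ENT_HTML401, ENT_HTML5, ENT_XHTML, ENT_XML1 is present, the default is ENT_HTML401.
When more than one of ENT_HTML401, ENT_HTML5, ENT_XHTML, ENT_XML1 is present, ENT_HTML5 takes the highest precedence, followed by ENT_XHTML and ENT_HTML401.
When more than one of ENT_DISALLOWED, ENT_IGNORE, ENT_SUBSTITUTE is present, ENT_IGNORE takes the highest precedence, followed by ENT_SUBSTITUTE.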

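See Also

get_html_translation_table() - Returns the translation table used by htmlspecialchars() and htmlentities()
htmlspecialchars_decode() - Convert special HTML entities back to characters
strip_tags() - Strip HTML and PHP tags from a string
htmlentities() - Convert all applicable characters to HTML entities
nl2br() - Inserts HTML line breaks before all newlines in a string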

As of PHP 5.4 the default encoding changed from "ISO-8859-1" to "UTF-8". So if you get an empty string from htmlspecialchars() or htmlentities() where you have only written
<?php
echo htmlspecialchars($string);
echo htmlentities($string);
?>
you can fix it (assuming your data really is ISO-8859-1) by
<?php
echo htmlspecialchars($string, ENT_COMPAT,'ISO-8859-1', true);
echo htmlentities($string, ENT_COMPAT,'ISO-8859-1', true);
?>
On Linux you can find the scripts you need to fix with
grep -Rl "htmlspecialchars\|htmlentities" /path/to/php/scripts/

Unfortunately, as far as I can tell, the PHP devs did not provide ANY way to set the default encoding used by htmlspecialchars() or htmlentities() in PHP 5.4 and 5.5, even though they changed the default encoding in PHP 5.4 (*golf clap for PHP devs*). As of PHP 5.6 the default_charset setting is used as the default, but to save someone on 5.4/5.5 the time of trying it, this does not work there:
<?php
ini_set('default_charset', $charset); // has no effect on htmlspecialchars() in PHP 5.4/5.5
?>
On those versions, the only way to avoid explicitly providing the second and third parameters every single time this function is called (which gets extremely tedious) is to write your own wrapper function:
<?php
define('CHARSET', 'ISO-8859-1');
define('REPLACE_FLAGS', ENT_COMPAT | ENT_XHTML);
function html($string) {
  return htmlspecialchars($string, REPLACE_FLAGS, CHARSET);
}
echo html("ñ"); // works
?>
You can do the same for htmlentities().

I searched for a while for a script that could tell the difference between an HTML tag and a plain < or > placed in the text.
The reason is that I receive text from a database, which is inserted through an HTML form and contains both text and HTML tags.
The text can contain < and >, and so do the tags.
With htmlspecialchars() you can make your text valid XHTML, but you'll also change the tags, like <b> to &lt;b&gt;, so I needed a script that could tell those two apart...
But I couldn't find one, so I made my own.
I haven't fully tested it, but the parts I tested worked perfectly!
This is just for people who were searching for something like this.
It may look big and could probably be done more simply, but it works for me, so I'm happy.
<?php
function fixtags($text){
$text = htmlspecialchars($text);
$text = preg_replace("/=/", "=\"\"", $text);
$text = preg_replace("/&quot;/", "&quot;\"", $text);
$tags = "/&lt;(\/|)(\w*)(\ |)(\w*)([\\\=]*)(?|(\")\"&quot;\"|)(?|(.*)?&quot;(\")|)([\ ]?)(\/|)&gt;/i";
$replacement = "<$1$2$3$4$5$6$7$8$9$10>";
$text = preg_replace($tags, $replacement, $text);
$text = preg_replace("/=\"\"/", "=", $text);
return $text;
}
?>
an example:
<?php
$string = "
this is smaller < than this<br /> 
this is greater > than this<br />
this is the same = as this<br />
<a href=\"http://www.example.com/example.php?test=test\">This is a link</a><br />
<b>Bold</b> <i>italic</i> etc...";
echo fixtags($string);
?>
will echo:
this is smaller &lt; than this<br /> 
this is greater &gt; than this<br /> 
this is the same = as this<br /> 
<a href="http://www.example.com/example.php?test=test">This is a link</a><br /> 
<b>Bold</b> <i>italic</i> etc...
I hope it's helpful!

If your goal is just to protect your page from Cross Site Scripting (XSS) attacks, or just to show HTML tags on a web page (showing <body> on the page, for example), then using htmlspecialchars() is good enough and better than using htmlentities(). A minor point is that htmlspecialchars() is faster than htmlentities(). A more important point is that htmlspecialchars($s) is automatically compatible with UTF-8 strings. If we use htmlentities($s) instead and the string $s happens to contain foreign characters in UTF-8 encoding, then htmlentities() is going to mess it up (before PHP 5.4, when the default encoding was ISO-8859-1), as it converts bytes in the range 0x80 to 0xFF to entities like &eacute;, unless you specifically provide a second and a third argument to htmlentities(), with the third argument being "UTF-8".
The reason htmlspecialchars($s) already works with UTF-8 strings is that it only changes certain bytes in the range 0x00 to 0x7F to &lt; etc., while leaving bytes in the range 0x80 to 0xFF unchanged. We may wonder whether htmlspecialchars() could accidentally change a byte inside a 2 to 4 byte UTF-8 character to &lt; etc. The answer is, it won't. When a UTF-8 character is 2 to 4 bytes long, all the bytes of that character are in the 0x80 to 0xFF range. None can be in the 0x00 to 0x7F range. When a UTF-8 character is 1 byte long, it is just the same as ASCII, which is 7 bit, from 0x00 to 0x7F. As a result, when a UTF-8 character is 1 byte long, htmlspecialchars($s) will do its job, and when the UTF-8 character is 2 to 4 bytes long, htmlspecialchars($s) will just pass those bytes through unchanged. So htmlspecialchars($s) will do the same job no matter whether $s is in ASCII, ISO-8859-1 (Latin-1), or UTF-8.
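
The byte-range argument can be checked with a few lines of code. This is only a sketch: the sample characters (é and €) are arbitrary, and they are written as explicit byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "é < €" as UTF-8 bytes: é = C3 A9, € = E2 82 AC.
$utf8 = "\xC3\xA9 < \xE2\x82\xAC";

// Every byte that belongs to a multi-byte UTF-8 character is 0x80 or above,
// so none of them can collide with the ASCII characters & < > " '.
$allHigh = true;
foreach (str_split("\xC3\xA9\xE2\x82\xAC") as $byte) {
    if (ord($byte) < 0x80) {
        $allHigh = false;
    }
}

// Valid UTF-8 input gives the same output whether it is declared as UTF-8 or as ISO-8859-1.
$asUtf8   = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
$asLatin1 = htmlspecialchars($utf8, ENT_QUOTES, 'ISO-8859-1');

var_dump($allHigh);                                  // bool(true)
var_dump($asUtf8 === "\xC3\xA9 &lt; \xE2\x82\xAC");  // bool(true)
var_dump($asUtf8 === $asLatin1);                     // bool(true)
```

The equivalence only holds for input that is valid in the declared encoding: with 'UTF-8' declared, PHP 5.4 and later also validate the byte sequences and return an empty string for invalid ones unless ENT_IGNORE or ENT_SUBSTITUTE is set.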

Actually, if you're using PHP >= 4.0.5, this should theoretically be quicker (less overhead anyway) for converting the entities back to characters:
$text = str_replace(array("&gt;", "&lt;", "&quot;", "&amp;"), array(">", "<", "\"", "&"), $text);

Be careful, the "charset" argument IS case sensitive. This is counter-intuitive and serves no practical purpose, because the HTML spec actually treats charset names as case-insensitive.

I was recently exploring some code when I saw this function being used to make data safe for "SQL".
This function should not be used to make data SQL safe (although it is perfectly good for escaping data that is output as HTML).
Here is an example of how NOT to use this function:
<?php
$username = htmlspecialchars(trim("$_POST[username]"));
$uniqueuser = $realm_db->query("SELECT `login` FROM `accounts` WHERE `login` = '$username'");
?>
(The only other check on $_POST['username'] is to make sure it isn't empty, which it is after trim() on a whitespace-only name.)
The problem here is that the flags are left at their default (ENT_COMPAT), which lets single quotes through, and single quotes are what delimit the value in the SQL query. Turning on magic quotes might fix it, but you should not rely on magic quotes; in fact you should never use them and should fix the code instead. There is also the problem of \ not being escaped. Even if magic quotes were used, there would still be the problem of allowing usernames longer than the limit and of ending up with some really weird usernames, given that they are to be used outside of HTML: this page just provides a front end for registering with another system that uses MySQL. Of course, using the function on the output wouldn't cause that problem.
Another way to make something of a fix would be to use ENT_QUOTES or do:
<?php
$uniqueuser = $realm_db->query('SELECT `login` FROM `accounts` WHERE `login` = "'.$username.'";');
?>
Either way, none of these solutions is good practice and none is entirely free of flaws. This function should simply never be used in such a fashion.
I hope this will prevent newbies from using this function incorrectly (as they apparently do).
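
For contrast, one common way to keep the two concerns apart is a prepared statement for the query and htmlspecialchars() only at output time. The sketch below is not the code from the note above: it assumes PDO with the pdo_sqlite driver and uses an in-memory SQLite database as a stand-in for $realm_db, and the loginExists() helper is hypothetical.

```php
<?php
// Assumption: PDO with the SQLite driver is available; the in-memory database
// stands in for the real accounts database.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE accounts (login TEXT PRIMARY KEY)');
$db->exec("INSERT INTO accounts (login) VALUES ('alice')");

// Hypothetical helper: the value is bound as a parameter, never spliced into the SQL text.
function loginExists(PDO $db, $username) {
    $stmt = $db->prepare('SELECT COUNT(*) FROM accounts WHERE login = :login');
    $stmt->execute(array(':login' => $username));
    return (int) $stmt->fetchColumn() > 0;
}

$hostile = "x' OR '1'='1";

var_dump(loginExists($db, 'alice'));  // bool(true)
var_dump(loginExists($db, $hostile)); // bool(false): the quotes are treated as data, not SQL

// HTML escaping is applied only when the value is printed into a page.
echo htmlspecialchars($hostile, ENT_QUOTES, 'UTF-8'), "\n"; // x&#039; OR &#039;1&#039;=&#039;1
```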

Problem
In many legacy PHP products the function htmlspecialchars($string) is used to convert characters like <, > and quotes to HTML entities. That prevents HTML tags from being interpreted and avoids unbalanced quotes.
Since PHP 5.4, htmlspecialchars($string) expects $string to be UTF-8 if no charset is given explicitly as the third parameter. Legacy products are mostly in Latin-1 (alias ISO-8859-1), which makes htmlspecialchars(), htmlentities() and html_entity_decode() return empty strings if a special character, e.g. a German umlaut, is present in $string:
PHP < 5.4
echo htmlspecialchars('<b>Woermann</b>'); // Output: &lt;b&gt;Woermann&lt;/b&gt;
echo htmlspecialchars('<b>Wörmann</b>'); // Output: &lt;b&gt;Wörmann&lt;/b&gt;
PHP >= 5.4
echo htmlspecialchars('<b>Woermann</b>'); // Output: &lt;b&gt;Woermann&lt;/b&gt;
echo htmlspecialchars('<b>Wörmann</b>'); // Output: empty
Three alternative solutions
a) Do not run legacy products on PHP 5.4
b) Change every occurrence in your code from
htmlspecialchars($string) and *** to
htmlspecialchars($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1')
c) Replace all htmlspecialchars() and *** calls with a new self-made function
*** The same is true for htmlentities() and html_entity_decode();
Solution c
1 Make Search and Replace in the concerned legacy project:
Search for:    htmlspecialchars
Replace with:  htmlXspecialchars
Search for:    htmlentities
Replace with:  htmlXentities
Search for:    html_entity_decode
Replace with:  htmlX_entity_decode
2a Copy and paste the following three functions into an existing PHP file that is already included everywhere in your legacy project (of course that file must be included only once per request, otherwise you will get a "cannot redeclare function" fatal error).
function htmlXspecialchars($string, $ent=ENT_COMPAT, $charset='ISO-8859-1') {
return htmlspecialchars($string, $ent, $charset);
}
function htmlXentities($string, $ent=ENT_COMPAT, $charset='ISO-8859-1') {
return htmlentities($string, $ent, $charset);
}
function htmlX_entity_decode($string, $ent=ENT_COMPAT, $charset='ISO-8859-1') {
return html_entity_decode($string, $ent, $charset);
}
or 2b create a new PHP file containing the three functions mentioned above, e.g. htmlXfunctions.inc.php, and include it on the first line of every PHP file in your legacy product like this: require_once('htmlXfunctions.inc.php').

Just a few notes on how one can use htmlspecialchars() and htmlentities() to filter user input on forms for later display and/or database storage...
1. Use htmlspecialchars() to filter text input values for HTML input tags, e.g.
echo '<input name="userdata" type="text" value="'.htmlspecialchars($data).'" />';
2. Use htmlentities() to filter the same data values for most other kinds of HTML tags, e.g.
echo '<p>'.htmlentities($data).'</p>';
3. Use your database's escape-string function to filter the data for database updates and insertions, for instance, using PostgreSQL:
pg_query($connection,"UPDATE datatable SET datavalue='".pg_escape_string($data)."'");
This strategy seems to work well and consistently, without restricting anything the user might like to type and display, while still providing a good deal of protection against a wide variety of HTML and database escape sequence injections, which might otherwise be introduced through deliberate and/or accidental input of such character sequences by users submitting their data via HTML forms.

Also see the function urlencode(), useful for passing text with ampersands and other special characters through a URL
(i.e. the text is encoded as if sent from a form using the GET method)
e.g.
<?php
echo "<a href='foo.php?text=".urlencode("foo?&bar!")."'>link</a>";
?>
produces
<a href='foo.php?text=foo%3F%26bar%21'>link</a>
and if the link is followed, the $_GET["text"] in foo.php will contain "foo?&bar!"
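
When a link carries more than one parameter, the two kinds of escaping stack: the values are URL-encoded first, and the finished URL is then HTML-escaped for the attribute. A minimal sketch, where the script name foo.php is taken from the example above and the extra page parameter is made up:

```php
<?php
// Hypothetical parameters; 'page' is added only to get a second key=value pair.
$params = array('text' => 'foo?&bar!', 'page' => 2);

// URL layer: http_build_query() urlencodes each value and joins the pairs with "&".
$query = http_build_query($params, '', '&');

// HTML layer: the "&" that separates the pairs becomes "&amp;" inside the attribute.
$href = htmlspecialchars('foo.php?' . $query, ENT_QUOTES, 'UTF-8');

echo $query, "\n"; // text=foo%3F%26bar%21&page=2
echo '<a href="' . $href . '">link</a>', "\n";
// <a href="foo.php?text=foo%3F%26bar%21&amp;page=2">link</a>
```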

Another thing important to mention is that
htmlspecialchars(NULL)
returns an empty string and not NULL!

This may seem obvious, but it caused me some frustration. If you use htmlspecialchars() with the $charset argument set and the string you run it on is not actually in the charset you specify, you get an empty string back without any notice/warning/error. (The string literals below are placeholders for real valid and invalid byte sequences.)
<?php
$ok_utf8 = "A valid UTF-8 string";
$bad_utf8 = "An invalid UTF-8 string";
var_dump(htmlspecialchars($bad_utf8, ENT_NOQUOTES, 'UTF-8')); // string(0) ""
var_dump(htmlspecialchars($ok_utf8, ENT_NOQUOTES, 'UTF-8')); // string(20) "A valid UTF-8 string"
?>
So make sure your charsets are consistent
<?php
$bad_utf8 = "An invalid UTF-8 string";
// make sure it's really UTF-8
$bad_utf8 = mb_convert_encoding($bad_utf8, 'UTF-8', mb_detect_encoding($bad_utf8));
var_dump(htmlspecialchars($bad_utf8, ENT_NOQUOTES, 'UTF-8')); // string(23) "An invalid UTF-8 string" 
?>
I had this problem because a Mac user was submitting posts copy/pasted from a program and it contained weird chars in it.
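
A variation on the same idea is to check the bytes before converting, rather than relying on detection. This is a sketch under assumptions: it needs the mbstring extension, the toValidUtf8() helper is hypothetical, and Windows-1252 as the fallback source encoding is a guess that has to match where your data really comes from.

```php
<?php
// Hypothetical helper: returns $text unchanged if it is already valid UTF-8,
// otherwise converts it from an assumed legacy encoding.
function toValidUtf8($text, $assumedSource = 'Windows-1252') {
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

// "naïve" as Windows-1252 bytes: the lone 0xEF byte is not valid UTF-8.
$pasted = "na\xEFve";

$clean   = toValidUtf8($pasted);
$escaped = htmlspecialchars($clean, ENT_NOQUOTES, 'UTF-8');

var_dump(bin2hex($clean)); // string(12) "6e61c3af7665"
var_dump($escaped !== ''); // bool(true)
```

mb_detect_encoding() has to guess and can return false, whereas mb_check_encoding() only answers whether the bytes are valid in one named encoding, which is all htmlspecialchars() cares about.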

Be aware of the encoding of your source files!!! 
Some of the suggestions here make reference to workarounds where you hard-code an encoding.
<?php
 echo htmlspecialchars('<b>Wörmann</b>'); // Why isn't this working?
?>
As it turns out, it may actually be your text editor that is to blame.
As of PHP 5.4, htmlspecialchars now defaults to the UTF-8 encoding. That said, many text editors default to non-UTF encodings like ISO-8859-1 (i.e. Latin-1) or WIN-1252. If you change the encoding of the file to UTF-8, the code above will now work (i.e. the ö is encoded differently in UTF-8 and ISO-8859-1, and you need the UTF-8 version).
Make sure you are editing in UTF-8 Unicode mode! Check your UI or manual for how to convert files to Unicode. It's also a good idea to figure out where to look in your UI to see what the current file encoding is.

function htmlspecialchars_array($arr = array()) {
  $rs = array();
  // foreach replaces each(), which was deprecated in PHP 7.2 and removed in PHP 8.0
  foreach ($arr as $key => $val) {
    if (is_array($val)) {
      // recurse into nested arrays
      $rs[$key] = htmlspecialchars_array($val);
    }
    else {
      $rs[$key] = htmlspecialchars($val, ENT_QUOTES);
    }
  }
  return $rs;
}

If you use htmlspecialchars() with its default flags (ENT_COMPAT) to escape an HTML attribute value, make sure to wrap the attribute in double quotes instead of single quotes, because single quotes in the value are left unescaped.
For example:
> Wrapped in single quotes:
<?php
echo "<p title='" . htmlspecialchars("Hello\"s'world") . "'>";
// title will end up as Hello"s and the rest of the text after the single quote will be cut off.
?>
> Wrapped in double quotes:
<?php
echo '<p title="' . htmlspecialchars("Hello\"s'world") . '">';
// title will show up correctly as Hello"s'world
?>
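
If you cannot control which quote style the surrounding markup uses, passing ENT_QUOTES makes the value safe in both. A small sketch; the attr() helper name is made up for illustration:

```php
<?php
// Hypothetical helper: escapes a value for use inside any quoted HTML attribute.
function attr($value) {
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$title = "Hello\"s'world";

$single = "<p title='" . attr($title) . "'>text</p>";
$double = '<p title="' . attr($title) . '">text</p>';

echo $single, "\n"; // <p title='Hello&quot;s&#039;world'>text</p>
echo $double, "\n"; // <p title="Hello&quot;s&#039;world">text</p>
```

Unquoted attribute values are a different matter: spaces and other characters can end them, so no flag combination of htmlspecialchars() makes those safe.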

People, don't use ereg_replace for the most simple string replacing operations (replacing constant string with another).
Use str_replace.

For those having problems after the default value of the $encoding argument changed to UTF-8 in PHP 5.4.
If your old non-UTF-8 projects broke, please consider:
1. http://php.net/manual/en/function.override-function.php
2. http://php.net/manual/ru/function.runkit-function-redefine.php
The idea: you override the built-in htmlspecialchars() function with your own variant that respects a non-UTF-8 default encoding. This small piece of code can then easily be inserted somewhere at the start of your project. No need to rewrite every htmlspecialchars() call globally.
I've spent several hours with both approaches. Variant 1 looks good, especially in combination with http://www.php.net/manual/en/function.rename-function.php, as it allows calling the original htmlspecialchars() with just the default arguments altered. The code could be as follows:
<?php
rename_function('htmlspecialchars', 'renamed_htmlspecialchars');
function overriden_htmlspecialchars($string, $flags=NULL, $encoding='cp1251', $double_encode=true) {
  $flags = $flags ? $flags : (ENT_COMPAT | ENT_HTML401);
  return renamed_htmlspecialchars($string, $flags, $encoding, $double_encode);
}
override_function('htmlspecialchars', '$string, $flags, $encoding, $double_encode', 'return overriden_htmlspecialchars($string, $flags, $encoding, $double_encode);');
?>
Unfortunately this didn't work properly for me: my site managed to call the overridden function, but not every time I reloaded the pages. Moreover, other PHP sites on my Apache server crashed, as they suddenly started complaining that htmlspecialchars() was not defined. I suppose I would have had to spend more time to make it thread/request/site/whatever-safe.
So I switched to runkit (variant 2). It worked for me, although even after trying runkit_function_rename() + runkit_function_add() I didn't manage to call the original htmlspecialchars() function. So as a quick solution I decided to call htmlentities() instead:
<?php
function overriden_htmlspecialchars($string, $flags=NULL, $encoding='UTF-8', $double_encode=true) {
  $flags = $flags ? $flags : (ENT_COMPAT | ENT_HTML401);
  $encoding = $encoding ? $encoding : 'cp1251';
  return htmlentities($string, $flags, $encoding, $double_encode);
}
runkit_function_redefine('htmlspecialchars', '$string, $flags, $encoding, $double_encode', 'return overriden_htmlspecialchars($string, $flags, $encoding, $double_encode);'); 
?>
You may be able to implement a more powerful overridden function of your own.
Good luck!

I had problems with Spanish special characters, so I thought of using htmlspecialchars(), but my strings also contain HTML.
So I used this :) Hope it helps
<?php
function htmlspanishchars($str) 
{
  return str_replace(array("&lt;", "&gt;"), array("<", ">"), htmlspecialchars($str, ENT_NOQUOTES, "UTF-8"));
}
?>