How to Filter Garbled Chinese Text in Java

This article looks at how to detect and filter garbled Chinese text (mojibake) in Java. It walks through the relevant Unicode background and then gives a simple, workable solution.


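1. Unicode encoding

Unicode is a character encoding that aims to cover the characters and punctuation of all the world's languages; put simply, it is a universal character set. Its code points range from U+0000 to U+10FFFF. Partitioned by fixed code point ranges, Unicode is divided into blocks (Unicode blocks): each assigned code point belongs to exactly one block, and blocks do not overlap. Grouped by the properties of the characters themselves, Unicode is also divided into scripts (Unicode scripts). For example, the Han script, which covers Chinese characters and related marks, includes characters from the following blocks: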

CJK Radicals Supplement
Kangxi Radicals
CJK Symbols and Punctuation (15 characters only)
CJK Unified Ideographs Extension A
CJK Unified Ideographs
CJK Compatibility Ideographs
CJK Unified Ideographs Extension B
CJK Unified Ideographs Extension C
CJK Unified Ideographs Extension D
CJK Unified Ideographs Extension E
CJK Compatibility Ideographs Supplement

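Of these, the commonly used Chinese characters are in the CJK Unified Ideographs block; traditional and rarer characters are additionally covered by the extensions A, B, C, D, and E listed above. The Basic Latin block contains the whole of ASCII: control characters, punctuation, digits, and English letters.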
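The difference between a block and a script is easier to see on concrete code points. The following is a minimal sketch, not part of the original article: the class name `BlockScriptDemo` and its helpers `describe`, `isHan`, and `sameBlock` are hypothetical, and only the JDK's `Character.UnicodeBlock` and `Character.UnicodeScript` are used.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;

public class BlockScriptDemo {

    /** Formats a code point together with the block and script the JDK reports for it. */
    static String describe(int codePoint) {
        return String.format("U+%04X block=%s script=%s",
                codePoint,
                UnicodeBlock.of(codePoint),
                UnicodeScript.of(codePoint));
    }

    /** True if the code point belongs to the Han script. */
    static boolean isHan(int codePoint) {
        return UnicodeScript.of(codePoint) == UnicodeScript.HAN;
    }

    /** True if both code points fall in the same Unicode block. */
    static boolean sameBlock(int a, int b) {
        return UnicodeBlock.of(a) == UnicodeBlock.of(b);
    }

    public static void main(String[] args) {
        int[] samples = {
            0x4E2D, // an everyday Chinese character (CJK Unified Ideographs)
            0x2E80, // first code point of CJK Radicals Supplement
            0x3007, // ideographic number zero (CJK Symbols and Punctuation)
            0x3002, // ideographic full stop (CJK Symbols and Punctuation)
            0x0041  // Latin capital letter A (Basic Latin)
        };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
        // Same block, different scripts: only the first of the two is Han.
        System.out.println(sameBlock(0x3007, 0x3002)); // true
        System.out.println(isHan(0x3007));             // true
        System.out.println(isHan(0x3002));             // false
    }
}
```

U+3007 and U+3002 sit in the same block, yet only the former is classified as Han, which is why the list above counts just some of the characters in CJK Symbols and Punctuation.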

2. Character encoding in Java

The JDK exposes Unicode blocks and scripts through Character.UnicodeBlock and Character.UnicodeScript:

char c = '中';
Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);    // CJK_UNIFIED_IDEOGRAPHS
Character.UnicodeScript uc = Character.UnicodeScript.of(c);  // HAN

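A Java char is a UTF-16 code unit. Casting a char to int yields its Unicode code value (for characters in the Basic Multilingual Plane, this is the code point itself). UTF-8 bytes only appear when the string is encoded, for example with getBytes(StandardCharsets.UTF_8); calling getBytes() with no argument uses the platform default charset, which is not guaranteed to be UTF-8: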

import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Hex;

String s = "\u00a0";
String.format("\\u%04x", (int) s.charAt(0));              // --> \u00a0
Hex.encodeHexString(s.getBytes(StandardCharsets.UTF_8));  // --> c2a0
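The same comparison can be made with the JDK alone, and extended to a character outside the Basic Multilingual Plane. This is a small sketch under that assumption: `EncodingDemo`, `hex`, and `utf8Hex` are hypothetical names introduced here, not part of the original snippet.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    /** Renders bytes as lowercase hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Hex of the UTF-8 encoding of s, independent of the platform default charset. */
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0";                               // no-break space
        String zhong = "\u4e2d";                              // a common Chinese character
        String extB = new String(Character.toChars(0x20000)); // first CJK Extension B code point

        System.out.println(utf8Hex(nbsp));  // c2a0
        System.out.println(utf8Hex(zhong)); // e4b8ad
        System.out.println(utf8Hex(extB));  // f0a08080

        // Outside the BMP, one code point occupies two chars (a surrogate pair).
        System.out.println(extB.length());                          // 2
        System.out.println(extB.codePointCount(0, extB.length()));  // 1
        System.out.printf("%04x %04x%n",
                (int) extB.charAt(0), (int) extB.charAt(1));        // d840 dc00
    }
}
```

The surrogate pair matters for what follows: code that inspects one char at a time never sees the code point U+20000 itself, only its two halves.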

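UTF-8 is a variable-length prefix encoding of Unicode code points: each code point maps to a sequence of one to four bytes. Returning to the opening problem of filtering garbled Chinese text, one basic approach is:

Remove punctuation and control characters.
Compute the proportion of non-Chinese characters among those that remain; if it exceeds a threshold, treat the string as garbled.

The complete code is as follows: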
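To make the target of this heuristic concrete, the sketch below produces one typical kind of garbled string: Chinese text encoded as UTF-8 and then decoded with the wrong charset. It is only an illustration under that assumption (real garbling can arise in other ways), and `MojibakeDemo`, `misdecode`, and `hanCount` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes those bytes as ISO-8859-1 (the wrong charset). */
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Counts the code points whose script is Han. */
    static long hanCount(String s) {
        return s.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587\u4e71\u7801"; // four Chinese characters
        String garbled = misdecode(original);

        System.out.println(hanCount(original)); // 4
        System.out.println(garbled.length());   // 12: each UTF-8 byte became one char
        System.out.println(hanCount(garbled));  // 0
    }
}
```

After the wrong decode, none of the characters is Han any more, so the proportion of non-Chinese characters jumps from 0 to 1.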

public class ChineseUtill {

    /** Returns true if c belongs to the Han script. */
    private static boolean isChinese(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    /** Returns true if c lies in a block of punctuation, formatting, or ASCII characters. */
    public static boolean isPunctuation(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return  // punctuation, spacing, and formatting characters
                ub == Character.UnicodeBlock.GENERAL_PUNCTUATION
                // symbols and punctuation in the unified Chinese, Japanese and Korean script
                || ub == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
                // fullwidth character or a halfwidth character
                || ub == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS
                // vertical glyph variants for east Asian compatibility
                || ub == Character.UnicodeBlock.CJK_COMPATIBILITY_FORMS
                // vertical punctuation for compatibility characters with the Chinese Standard GB 18030
                || ub == Character.UnicodeBlock.VERTICAL_FORMS
                // ascii
                || ub == Character.UnicodeBlock.BASIC_LATIN;
    }

    /** Returns true for additional characters that should be skipped rather than counted. */
    private static boolean isUserDefined(char c) {
        Character.UnicodeBlock ub = Character.UnicodeBlock.of(c);
        return ub == Character.UnicodeBlock.NUMBER_FORMS
                || ub == Character.UnicodeBlock.ENCLOSED_ALPHANUMERICS
                || ub == Character.UnicodeBlock.LETTERLIKE_SYMBOLS
                || c == '\ufeff'   // byte order mark
                || c == '\u00a0';  // no-break space
    }

    /** Returns true if more than 30% of the characters left after filtering are not Chinese. */
    public static boolean isMessy(String str) {
        float chlength = 0;  // characters left after filtering
        float count = 0;     // of those, how many are not Chinese
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (isPunctuation(c) || isUserDefined(c)) {
                continue;
            }
            if (!isChinese(c)) {
                count++;
            }
            chlength++;
        }
        // Nothing left to judge (e.g. empty or pure-ASCII input): not garbled.
        if (chlength == 0) {
            return false;
        }
        return count / chlength > 0.3;
    }

}
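The class above examines one char at a time, so a character outside the Basic Multilingual Plane (for example, from CJK Unified Ideographs Extension B) is seen as two surrogate halves, and neither half is classified as Han. The following is a hedged sketch of a code point based variant rather than the original author's code: `MessyTextCheck`, `isIgnored`, and the explicit `threshold` parameter are assumptions introduced here, while the list of skipped blocks is copied from the class above.

```java
import java.lang.Character.UnicodeBlock;
import java.lang.Character.UnicodeScript;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MessyTextCheck {

    // Blocks skipped before the ratio is computed; mirrors the block checks in the class above.
    private static final Set<UnicodeBlock> IGNORED = new HashSet<>(Arrays.asList(
            UnicodeBlock.GENERAL_PUNCTUATION,
            UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION,
            UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS,
            UnicodeBlock.CJK_COMPATIBILITY_FORMS,
            UnicodeBlock.VERTICAL_FORMS,
            UnicodeBlock.BASIC_LATIN,
            UnicodeBlock.NUMBER_FORMS,
            UnicodeBlock.ENCLOSED_ALPHANUMERICS,
            UnicodeBlock.LETTERLIKE_SYMBOLS));

    /** True for code points that are skipped rather than counted. */
    static boolean isIgnored(int cp) {
        // UnicodeBlock.of may return null for unassigned code points; HashSet handles that.
        return cp == 0xFEFF || cp == 0x00A0 || IGNORED.contains(UnicodeBlock.of(cp));
    }

    /** True if the share of non-Han code points among the counted ones exceeds threshold. */
    static boolean isMessy(String str, double threshold) {
        int considered = 0;
        int nonHan = 0;
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);   // reads a full surrogate pair when present
            i += Character.charCount(cp);  // advance by 1 or 2 chars
            if (isIgnored(cp)) {
                continue;
            }
            considered++;
            if (UnicodeScript.of(cp) != UnicodeScript.HAN) {
                nonHan++;
            }
        }
        // Nothing left to judge: treat as not garbled instead of dividing by zero.
        return considered > 0 && (double) nonHan / considered > threshold;
    }

    public static void main(String[] args) {
        String clean = "\u4e2d\u6587\u6d4b\u8bd5\u3002"; // four Chinese characters and a full stop
        String garbled = new String("\u4e2d\u6587".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);             // UTF-8 bytes decoded with the wrong charset
        String extB = new String(Character.toChars(0x20000));

        System.out.println(isMessy(clean, 0.3));   // false
        System.out.println(isMessy(garbled, 0.3)); // true
        System.out.println(isMessy(extB, 0.3));    // false: read as one Han code point
        System.out.println(isMessy("", 0.3));      // false
    }
}
```

Because the whole Basic Latin block is skipped, a pure-ASCII string leaves nothing to count and is reported as not garbled; whether that is the desired outcome depends on the data being filtered.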

