MLlib provides an API that uses Pipelines to combine multiple complex machine learning algorithms into a single pipeline, or workflow. The concept is similar to pipelines in scikit-learn; according to the official documentation, the design of this abstraction was inspired by scikit-learn.

· DataFrame: Machine learning uses the DataFrame from the Spark SQL component as its dataset. A DataFrame supports many data types; for example, it can organize data from text files, databases, and other external sources into separate columns holding feature vectors, labels, and so on.
· Transformer: A Transformer converts one DataFrame into another. For example, a machine learning model transforms a DataFrame of features into a DataFrame that also contains the model's predictions.
· Estimator: An algorithm that is trained on a DataFrame to produce a machine learning model (which is itself a Transformer).
· Pipeline: Chains multiple Transformers and Estimators together into a single machine learning workflow.
· Parameter: A common API, shared by all Transformers and Estimators, for specifying parameters.
The DataFrame is a widely used data structure that can hold vectors, text, images, and structured data. Through Spark SQL, a DataFrame can be built from many data sources.
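As a quick sketch of that last point (the file path and column names here are hypothetical), a DataFrame can be loaded directly from a Spark SQL data source, and its schema carries the column types:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mllib-pipeline-demo").getOrCreate()

// Hypothetical JSON file with "label" and "text" columns; any Spark SQL
// data source (Parquet, CSV, JDBC, ...) produces a DataFrame the same way.
val df = spark.read.json("/tmp/training.json")
df.printSchema()   // the schema records each column's name and type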
The workflow is shown in the figure below:

Figure: the Pipeline workflow in machine learning
As the figure shows, this Pipeline has three stages, each of which is either a Transformer or an Estimator, and the stages run in a fixed order. During execution, the raw text (a DataFrame, drawn as a cylinder) is transformed into a new Words DataFrame, and at the end a LogisticRegressionModel is produced. Tokenizer and HashingTF in the figure are Transformers, while LogisticRegression is an Estimator (the LogisticRegressionModel it fits is itself a Transformer).
During a Transformer stage, the transform() method is called to do the computation.
During an Estimator stage, the fit() method is called to do the computation.
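A minimal sketch of what this means, unrolling the figure's three stages by hand (assuming rawText is a DataFrame with "label" and "text" columns; the stage configuration below is illustrative):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression()

// Transformer stages call transform(); the Estimator stage calls fit().
val words    = tokenizer.transform(rawText)   // Raw text -> Words
val features = hashingTF.transform(words)     // Words -> feature vectors
val lrModel  = lr.fit(features)               // fit() produces a LogisticRegressionModel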
DAG Pipelines: a Pipeline's stages normally form a linear sequence, but they may also form a directed acyclic graph (DAG), as long as each stage reads columns produced by earlier stages; in that case the stages must be listed in topological order.
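A sketch of such a non-linear (DAG) Pipeline: two branches produce columns independently and a VectorAssembler merges them. The "numericFeature" input column is hypothetical; the stages array lists the stages in topological order:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, VectorAssembler}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf        = new HashingTF().setInputCol("words").setOutputCol("tfFeatures")
val assembler = new VectorAssembler()
  .setInputCols(Array("tfFeatures", "numericFeature"))  // "numericFeature" is hypothetical
  .setOutputCol("features")

// Data flow: text -> words -> tfFeatures ----\
//                           numericFeature ---+--> features
val dagPipeline = new Pipeline().setStages(Array(tokenizer, tf, assembler))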
Runtime checking: a DataFrame can hold many different data types, but column types are not checked at compile time; instead, the Pipeline checks them at runtime against the DataFrame's schema.
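For example (a sketch; the exact error message may vary by Spark version), the following compiles but fails only when it runs, because the schema lacks the expected column:

import org.apache.spark.ml.feature.Tokenizer

val noTextCol = spark.createDataFrame(Seq((0L, 1.0))).toDF("id", "label")
val tok = new Tokenizer().setInputCol("text").setOutputCol("words")
// Compiles fine; throws at runtime (e.g. IllegalArgumentException) while
// validating the schema, because there is no "text" column.
tok.transform(noTextCol)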
Unique IDs: every stage in a Pipeline is uniquely identified by an ID. The same instance (for example, a particular HashingTF) therefore cannot be inserted into the same Pipeline twice, since each stage must carry its own unique ID.
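A small sketch of stage IDs: every instance carries a uid, which is how a Pipeline tells its stages apart:

import org.apache.spark.ml.feature.HashingTF

val tf1 = new HashingTF()
val tf2 = new HashingTF()
println(tf1.uid)   // e.g. "hashingTF_" followed by a random suffix
println(tf2.uid)   // a different uid, even though the configuration is identical
// To apply the same kind of transformation twice in one Pipeline,
// create two instances rather than reusing one.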
Code example:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row

// Prepare training data from a list of (label, features) tuples.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// Create a LogisticRegression instance. This instance is an Estimator.
val lr = new LogisticRegression()
// Print out the parameters, documentation, and any default values.
println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

// We may set parameters using setter methods.
lr.setMaxIter(10)
  .setRegParam(0.01)

// Learn a LogisticRegression model. This uses the parameters stored in lr.
val model1 = lr.fit(training)
// Since model1 is a Model (i.e., a Transformer produced by an Estimator),
// we can view the parameters it used during fit().
// This prints the parameter (name: value) pairs, where names are unique IDs
// for this LogisticRegression instance.
println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

// We may alternatively specify parameters using a ParamMap,
// which supports several methods for specifying parameters.
val paramMap = ParamMap(lr.maxIter -> 20)
  .put(lr.maxIter, 30)  // Specify 1 Param. This overwrites the original maxIter.
  .put(lr.regParam -> 0.1, lr.threshold -> 0.55)  // Specify multiple Params.

// One can also combine ParamMaps.
val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability")  // Change output column name.
val paramMapCombined = paramMap ++ paramMap2

// Now learn a new model using the paramMapCombined parameters.
// paramMapCombined overrides all parameters set earlier via lr.set* methods.
val model2 = lr.fit(training, paramMapCombined)
println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

// Prepare test data.
val test = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
  (0.0, Vectors.dense(3.0, 2.0, -0.1)),
  (1.0, Vectors.dense(0.0, 2.2, -1.5))
)).toDF("label", "features")

// Make predictions on test data using the Transformer.transform() method.
// LogisticRegression.transform will only use the 'features' column.
// Note that model2.transform() outputs a 'myProbability' column instead of the
// usual 'probability' column since we renamed the lr.probabilityCol parameter previously.
model2.transform(test)
  .select("features", "label", "myProbability", "prediction")
  .collect()
  .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
    println(s"($features, $label) -> prob=$prob, prediction=$prediction")
  }
A standalone Pipeline example:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Now we can optionally save the fitted pipeline to disk.
model.write.overwrite().save("/tmp/spark-logistic-regression-model")
// We can also save this unfit pipeline to disk.
pipeline.write.overwrite().save("/tmp/unfit-lr-model")
// And load it back in during production.
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "spark hadoop spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents.
model.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }