
Overview: Sqoop 概述 · Author: 林子系 · Source: IT165 · Published: 2020-02-19 08:08:52

Apache Sqoop - Overview


Using Hadoop for analytics and data processing requires loading data into the cluster and combining it with other data that often resides in the enterprise's production databases. Loading bulk data into Hadoop from production systems, or accessing it from map reduce applications running on large clusters, is a challenging task. Users must pay attention to details such as ensuring data consistency, the consumption of production system resources, and preparing data for provisioning downstream pipelines. Transforming data with scripts is inefficient and time consuming, while having map reduce applications access external systems directly complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.

This is where Apache Sqoop comes in. Apache Sqoop is currently an incubating project at the Apache Software Foundation. More information about the project can be found at http://incubator.apache.org/sqoop.

Sqoop makes it easy to import and export data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. With Sqoop you can load data from external systems onto HDFS and populate tables in Hive and HBase. Together with Oozie, Sqoop can help you schedule and automate import and export jobs. Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems.

Running Sqoop looks very simple, but what happens under the covers? The dataset being transferred is sliced into partitions and a map-only job is launched, with each mapper responsible for transferring one slice of the dataset. Because Sqoop uses the database metadata to infer data types, every record is handled in a type-safe manner.

In the rest of this post we will walk through an example that shows the various ways Sqoop can be used. The goal of this post is to give an overview of Sqoop operation rather than a deep dive into its advanced functionality.

Importing Data

The following command imports all data from a table named ORDERS in a MySQL database into the cluster:
---
$ sqoop import --connect jdbc:mysql://localhost/acmedb
  --table ORDERS --username test --password ****
---

The options in this command are as follows:

  • import: instructs Sqoop to initiate an import.
  • --connect <connect string>, --username <user name>, --password <password>: the parameters required to connect to the database. They are no different from the connection parameters you would use with a plain JDBC connection.
  • --table <table name>: the table to import.

The import is done in the two steps depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the metadata for the data being imported. In the second step, Sqoop submits a map-only job to the Hadoop cluster; it is this job that performs the actual data transfer using the metadata captured in the previous step.

    Figure 1: Sqoop Import Overview

The imported data is saved in an HDFS directory named after the table being imported. As with most aspects of Sqoop operation, the user can specify an alternative directory in which the files should be placed.

By default these files contain comma-delimited fields, with new lines separating records. You can easily override the format in which the data is copied by explicitly specifying the field separator and record terminator characters.
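For instance, here is a minimal sketch that overrides the defaults to produce tab-separated fields, reusing the acmedb/ORDERS connection from the example above; the output directory name is illustrative:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --fields-terminated-by '\t' --lines-terminated-by '\n' \
  --target-dir /user/arvind/ORDERS_TSV   # illustrative output path
----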

Sqoop also supports importing data in other data formats. For example, you can import data in Avro format simply by adding the --as-avrodatafile option to the import command.
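A minimal sketch of such an Avro import, assuming the same example database and table:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --as-avrodatafile
----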

Sqoop provides many other options that can be used to further tune the import operation to suit your specific requirements.
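As one illustration, options such as --columns, --where, and --num-mappers restrict which columns and rows are imported and how many parallel map tasks are used. The column names and filter below are hypothetical:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --columns "order_id,customer_id,order_date" \
  --where "order_date >= '2011-01-01'" \
  --num-mappers 4   # column names and WHERE clause are illustrative
----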

Importing Data into Hive

In most cases, importing data into Hive amounts to running the import job and then using Hive to create and load the corresponding table or partition. Doing this manually requires you to know the correct type mapping between the data and other details such as the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate table metadata and invokes the commands needed to load the table or partition. All of this is done simply by adding --hive-import to the import command.
    ----
    $ sqoop import --connect jdbc:mysql://localhost/acmedb
      --table ORDERS --username test --password **** --hive-import
    ----

When you run a Hive import, Sqoop converts the data from the native data types of the external data store into the corresponding Hive types, and automatically chooses the delimiter set natively used by Hive. If the data being imported contains new lines or other Hive delimiter characters, Sqoop lets you remove those characters so that the data is correctly populated for consumption in Hive.
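For example, a sketch of a Hive import that strips such delimiter characters from string fields using the --hive-drop-import-delims option:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --hive-import --hive-drop-import-delims
----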

Once the import completes, you can view and operate on the table just like any other table in Hive.

Importing Data into HBase

You can use Sqoop to populate a particular column family within an HBase table. Much like the Hive import, this is done by specifying additional options that identify the HBase table and column family to populate. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.

    ----
    $ sqoop import --connect jdbc:mysql://localhost/acmedb
     --table ORDERS --username test --password ****
    --hbase-create-table --hbase-table ORDERS --column-family mysql
    ----
     

The options in this command are as follows:

  • --hbase-create-table: instructs Sqoop to create the HBase table.
  • --hbase-table: specifies the name of the HBase table to use.
  • --column-family: specifies the name of the column family to use.

The remaining options are the same as for a regular import.

Exporting Data

In some cases, data processed by Hadoop pipelines may be needed back in production systems to help run additional critical business functions. Sqoop can be used to export such data into external data stores when necessary. Continuing the example above, if the data produced by the Hadoop pipeline corresponds to the ORDERS table in some database, you could populate it with the following command:


    ----
    $ sqoop export --connect jdbc:mysql://localhost/acmedb
     --table ORDERS --username test --password ****
    --export-dir /user/arvind/ORDERS
    ----
     

The options are as follows:

  • export: instructs Sqoop to initiate an export.
  • --connect <connect string>, --username <user name>, --password <password>: the parameters required to connect to the database, no different from those used with a plain JDBC connection.
  • --table <table name>: the table to be populated.
  • --export-dir <directory path>: the HDFS directory from which data will be exported.

The export is done in the two steps depicted in Figure 2 below. The first step introspects the database for metadata; the second step transfers the data. Sqoop divides the input dataset into splits and uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

     

    Figure 2: Sqoop Export Overview
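The degree of parallelism, and therefore the number of concurrent transactions against the database, can be tuned. A sketch, assuming the same export directory as above; the mapper count of 4 is arbitrary, and --batch asks the JDBC driver to batch the underlying statements:

----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  --num-mappers 4 --batch   # 4 mappers is an arbitrary choice
----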

Some connectors support staging tables, which help isolate the production tables from corruption in case a job fails for any reason. The staging table is first populated by the map tasks and then merged into the target table once all of the data has been delivered.
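A sketch of an export that goes through a staging table; ORDERS_STAGE is a hypothetical staging table that must already exist with the same structure as ORDERS:

----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --export-dir /user/arvind/ORDERS \
  --staging-table ORDERS_STAGE \
  --clear-staging-table   # ORDERS_STAGE is a hypothetical pre-created table
----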

Connectors

Using specialized connectors, Sqoop can connect to external systems that have optimized import and export facilities or that do not support native JDBC. Connectors are plugin components built on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to transfer data efficiently between Hadoop and the external store that the connector supports.

By default, Sqoop ships with connectors for popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also includes fast-path connectors for MySQL and PostgreSQL; fast-path connectors are specialized connectors that use database-specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used with any database reachable via JDBC.
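For example, the MySQL fast path is selected with the --direct option. This is only a sketch; it assumes the mysqldump tool is available on the nodes running the map tasks:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test --password **** \
  --direct   # requires mysqldump on the task nodes
----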

Apart from the built-in connectors, many companies have developed their own connectors that plug into Sqoop, ranging from specialized connectors for enterprise data warehouses to NoSQL data stores.

Wrap Up

In this post you saw how easy it is to transfer large datasets between Hadoop and external data stores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.
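As a taste of the query and compression support, here is a hedged sketch of a free-form query import. When --query is used, Sqoop requires the literal $CONDITIONS token in the WHERE clause and an explicit --target-dir; --split-by names the column used to divide the work among mappers. The column names here are illustrative:

----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
  --username test --password **** \
  --query 'SELECT order_id, total FROM ORDERS WHERE $CONDITIONS' \
  --split-by order_id \
  --target-dir /user/arvind/ORDERS_QUERY \
  --compress   # order_id and total are illustrative column names
----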
     

More information about Sqoop can be found at:
     

    Project Website: http://incubator.apache.org/sqoop

    Wiki: https://cwiki.apache.org/confluence/display/SQOOP

    Project Status:  http://incubator.apache.org/projects/sqoop.html

    Mailing Lists: https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists

The original English post follows.


    Apache Sqoop - Overview

    Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from map reduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within the map reduce applications complicates applications and exposes the production system to the risk of excessive load originating from cluster nodes.


    This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.

    Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision the data from external system on to HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector based architecture which supports plugins that provide connectivity to new external systems.

    What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe manner since Sqoop uses the database metadata to infer the data types.

    In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation without going into much detail or advanced functionality.

    Importing Data

    The following command is used to import all data from a table called ORDERS from a MySQL database:


    ---
    $ sqoop import --connect jdbc:mysql://localhost/acmedb
      --table ORDERS --username test --password ****
    ---

    In this command the various options specified are as follows:

  • import: This is the sub-command that instructs Sqoop to initiate an import.
  • --connect <connect string>, --username <user name>, --password <password>: These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection.
  • --table <table name>: This parameter specifies the table which will be imported.


    The import is done in two steps as depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the actual data transfer using the metadata captured in the previous step.

     

    Figure 1: Sqoop Import Overview

    The imported data is saved in a directory on HDFS based on the table being imported. As is the case with most aspects of Sqoop operation, the user can specify any alternative directory where the files should be populated.

    By default these files contain comma delimited fields, with new lines separating different records. You can easily override the format in which data is copied over by explicitly specifying the field separator and record terminator characters.

    Sqoop also supports different data formats for importing data. For example, you can easily import data in Avro data format by simply specifying the option --as-avrodatafile with the import command.
     

    There are many other options that Sqoop provides which can be used to further tune the import operation to suit your specific requirements.

    Importing Data into Hive

    In most cases, importing data into Hive is the same as running the import task and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct type mapping between the data and other details like the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition as the case may be. All of this is done by simply specifying the option --hive-import with the import command.

    ----
    $ sqoop import --connect jdbc:mysql://localhost/acmedb
      --table ORDERS --username test --password **** --hive-import
    ----

    When you run a Hive import, Sqoop converts the data from the native datatypes within the external datastore into the corresponding types within Hive. Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported has new line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the data correctly populated for consumption in Hive.
     

    Once the import is complete, you can see and operate on the table just like any other table in Hive.

    Importing Data into HBase

    You can use Sqoop to populate data in a particular column family within the HBase table. Much like the Hive import, this can be done by specifying the additional options that relate to the HBase table and column family being populated. All data imported into HBase is converted to their string representation and inserted as UTF-8 bytes.

    ----
    $ sqoop import --connect jdbc:mysql://localhost/acmedb
     --table ORDERS --username test --password ****
    --hbase-create-table --hbase-table ORDERS --column-family mysql
    ----

    In this command the various options specified are as follows:

  • --hbase-create-table: This option instructs Sqoop to create the HBase table.
  • --hbase-table: This option specifies the table name to use.
  • --column-family: This option specifies the column family name to use.

    The rest of the options are the same as that for regular import operation.

    Exporting Data

    In some cases data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing our example from above - if data generated by the pipeline on Hadoop corresponded to the ORDERS table in a database somewhere, you could populate it using the following command:

    ----
    $ sqoop export --connect jdbc:mysql://localhost/acmedb
     --table ORDERS --username test --password ****
    --export-dir /user/arvind/ORDERS
    ----

    In this command the various options specified are as follows:

  • export: This is the sub-command that instructs Sqoop to initiate an export.
  • --connect <connect string>, --username <user name>, --password <password>: These are connection parameters that are used to connect with the database. This is no different from the connection parameters that you use when connecting to the database via a JDBC connection.
  • --table <table name>: This parameter specifies the table which will be populated.
  • --export-dir <directory path>: This is the directory from which data will be exported.


    Export is done in two steps as depicted in Figure 2. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.

    Figure 2: Sqoop Export Overview

    Some connectors support staging tables that help isolate production tables from possible corruption in case of job failures due to any reason. Staging tables are first populated by the map tasks and then merged into the target table once all of the data has been delivered.

    Connectors

    Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop’s extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.

    By default Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.

    Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.

    Wrap Up

    In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, working with queries instead of tables, etc. We encourage you to try out Sqoop and give us your feedback.

    More information regarding Sqoop can be found at:
     

    Project Website: http://incubator.apache.org/sqoop

    Wiki: https://cwiki.apache.org/confluence/display/SQOOP

    Project Status:  http://incubator.apache.org/projects/sqoop.html

    Mailing Lists: https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists
