More than 10 years ago i did struggle in vain to establish console CLI connection with Mac OS X terminal. Adapter cables at that time needed drivers not incuded in OS X and also used different chipsets. This has changed a lot as drivers are now included and it is easy to have the serial console running !
This procedure works with the micro USB console port on 700/1400 a.o. - but what about the USB type-C to USB-2.0 type-A cable for 15xx, 1600 and 1800 ? @Mikael added the link to the needed drivers, and here is the procedure:
Note - To use the miniUSB/USB Type-C console port, a driver must be installed on the console client machine (desktop/laptop). For installation instructions and download link, see the appliance home page.
Configuring checkpointing - If the stream application requires it, then a directory in the Hadoop API compatible fault-tolerant storage (e.g. HDFS, S3, etc.) must be configured as the checkpoint directory and the streaming application written in a way that checkpoint information can be used for failure recovery
Often writing data to external systems requires creating a connection object(e.g. TCP connection to a remote server) and using it to send data to a remote system.For this purpose, a developer may inadvertently try creating a connection object atthe Spark driver, and then try to use it in a Spark worker to save records in the RDDs.For example (in Scala),
This is incorrect as this requires the connection object to be serialized and sent from thedriver to the worker. Such connection objects are rarely transferable across machines. Thiserror may manifest as serialization errors (connection object not serializable), initializationerrors (connection object needs to be initialized at the workers), etc. The correct solution isto create the connection object at the worker.
You can easily use DataFrames and SQL operations on streaming data. You have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done such that it can be restarted on driver failures. This is done by creating a lazily instantiated singleton instance of SparkSession. This is shown in the following example. It modifies the earlier word count example to generate word counts using DataFrames and SQL. Each RDD is converted to a DataFrame, registered as a temporary table and then queried using SQL.
A streaming application must operate 24/7 and hence must be resilient to failures unrelatedto the application logic (e.g., system failures, JVM crashes, etc.). For this to be possible,Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system such that it can recover from failures. There are two types of datathat are checkpointed.
To summarize, metadata checkpointing is primarily needed for recovery from driver failures,whereas data or RDD checkpointing is necessary even for basic functioning if statefultransformations are used.
Note that simple streaming applications without the aforementioned stateful transformations can berun without enabling checkpointing. The recovery from driver failures will also be partial inthat case (some received but unprocessed data may be lost). This is often acceptable and many runSpark Streaming applications in this way. Support for non-Hadoop environments is expectedto improve in the future.
Checkpointing can be enabled by setting a directory in a fault-tolerant,reliable file system (e.g., HDFS, S3, etc.) to which the checkpoint information will be saved.This is done by using streamingContext.checkpoint(checkpointDirectory). This will allow you touse the aforementioned stateful transformations. Additionally,if you want to make the application recover from driver failures, you should rewrite yourstreaming application to have the following behavior.
If the checkpointDirectory exists, then the context will be recreated from the checkpoint data.If the directory does not exist (i.e., running for the first time),then the function functionToCreateContext will be called to create a newcontext and set up the DStreams. See the Python examplerecoverable_network_wordcount.py.This example appends the word counts of network data into a file.
If the checkpointDirectory exists, then the context will be recreated from the checkpoint data.If the directory does not exist (i.e., running for the first time),then the function functionToCreateContext will be called to create a newcontext and set up the DStreams. See the Scala exampleRecoverableNetworkWordCount.This example appends the word counts of network data into a file.
If the checkpointDirectory exists, then the context will be recreated from the checkpoint data.If the directory does not exist (i.e., running for the first time),then the function contextFactory will be called to create a newcontext and set up the DStreams. See the Java exampleJavaRecoverableNetworkWordCount.This example appends the word counts of network data into a file.
In addition to using getOrCreate one also needs to ensure that the driver process getsrestarted automatically on failure. This can only be done by the deployment infrastructure that isused to run the application. This is further discussed in theDeployment section.
Note that checkpointing of RDDs incurs the cost of saving to reliable storage.This may cause an increase in the processing time of those batches where RDDs get checkpointed.Hence, the interval ofcheckpointing needs to be set carefully. At small batch sizes (say 1 second), checkpointing everybatch may significantly reduce operation throughput. Conversely, checkpointing too infrequentlycauses the lineage and task sizes to grow, which may have detrimental effects. For statefultransformations that require RDD checkpointing, the default interval is a multiple of thebatch interval that is at least 10 seconds. It can be set by usingdstream.checkpoint(checkpointInterval). Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.
Configuring checkpointing - If the stream application requires it, then a directory in theHadoop API compatible fault-tolerant storage (e.g. HDFS, S3, etc.) must be configured as thecheckpoint directory and the streaming application written in a way that checkpointinformation can be used for failure recovery. See the checkpointing sectionfor more details.
Configuring write-ahead logs - Since Spark 1.2,we have introduced write-ahead logs for achieving strongfault-tolerance guarantees. If enabled, all the data received from a receiver gets written intoa write-ahead log in the configuration checkpoint directory. This prevents data loss on driverrecovery, thus ensuring zero data loss (discussed in detail in theFault-tolerance Semantics section). This can be enabled by settingthe configuration parameterspark.streaming.receiver.writeAheadLog.enable to true. However, these stronger semantics maycome at the cost of the receiving throughput of individual receivers. This can be corrected byrunning more receivers in parallelto increase aggregate throughput. Additionally, it is recommended that the replication of thereceived data within Spark be disabled when the write-ahead log is enabled as the log is alreadystored in a replicated storage system. This can be done by setting the storage level for theinput stream to StorageLevel.MEMORY_AND_DISK_SER. While using S3 (or any file system thatdoes not support flushing) for write-ahead logs, please remember to enablespark.streaming.driver.writeAheadLog.closeFileAfterWrite andspark.streaming.receiver.writeAheadLog.closeFileAfterWrite. SeeSpark Streaming Configuration for more details.Note that Spark will not encrypt data written to the write-ahead log when I/O encryption isenabled. If encryption of the write-ahead log data is desired, it should be stored in a filesystem that supports encryption natively.
The existing application is shutdown gracefully (seeStreamingContext.stop(...)or JavaStreamingContext.stop(...)for graceful shutdown options) which ensure data that has been received is completelyprocessed before shutdown. Then theupgraded application can be started, which will start processing from the same point where the earlierapplication left off. Note that this can be done only with input sources that support source-side buffering(like Kafka) as data needs to be buffered while the previous application was down andthe upgraded application is not yet up. And restarting from earlier checkpointinformation of pre-upgrade code cannot be done. The checkpoint information essentiallycontains serialized Scala/Java/Python objects and trying to deserialize objects with new,modified classes may lead to errors. In this case, either start the upgraded app with a differentcheckpoint directory, or delete the previous checkpoint directory.
CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce theoverall processing throughput of the system, its use is still recommended to achieve moreconsistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).
When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval where N = batchInterval/blockInterval. These blocks are distributed by the BlockManager of the current executor to the block managers of other executors. After that, the Network Input Tracker running on the driver is informed about the block locations for further processing.
An RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.
35fe9a5643