I have at least 5 Xiaomi miio devices that require token extraction. In the past I used the Android emulator MEmu, messing around with Xiaomi Home app versions to extract the miio2.db file, copy it to my laptop, and then run an SQL query to extract the tokens of my miio devices.
This tool/script retrieves tokens for all devices connected to Xiaomi cloud and encryption keys for BLE devices (GitHub: PiotrMachowski/Xiaomi-cloud-tokens-extractor).
There is a much better approach. Instead of storing the watermark in the data store and then using it as filter criteria, you can convince the SAP system to manage the delta changes for you. This way, without writing any expression to compare timestamps, you can extract recently updated information.
Further steps are the same, no matter if you work with an extractor or CDS view. Click Next. The wizard automatically creates the data model and OData service, and you only have to provide the description.
Click Add Selected Services and confirm your input. You should see a popup window saying the OData service was created successfully. Verify the system alias is correctly assigned and the ICF node is active:
It tells the system that you want it to keep track of delta changes for this OData source. Then, as a result, in the response content, together with the initial full dataset, you can find an additional field __delta with the link you can use to retrieve only new and changed information.
The additional header subscribes you to the delta queue, which tracks data changes. If you follow the __delta link, which is basically the OData URL with extra query parameter !deltatoken, you will retrieve only updated information and not the full data set.
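As a rough illustration of following the __delta link, the sketch below pulls the !deltatoken value out of such a URL with Python's standard library. The host, service name, entity set, token value and the JSON layout of the sample payload are hypothetical placeholders, not taken from a real system.

```python
from urllib.parse import parse_qs, urlsplit


def extract_delta_token(delta_link):
    """Return the value of the !deltatoken query parameter, or None if absent."""
    query = urlsplit(delta_link).query
    params = parse_qs(query)
    values = params.get("!deltatoken")
    return values[0] if values else None


# Hypothetical response body: the layout and the URL are assumptions for illustration.
sample_payload = {
    "d": {
        "results": [],
        "__delta": (
            "https://sap.example.com/sap/opu/odata/sap/ZSALES_SRV/"
            "SalesOrders?!deltatoken='D20220101120000_000001'"
        ),
    }
}

delta_link = sample_payload["d"]["__delta"]
token = extract_delta_token(delta_link)
print(token)
```

Storing the whole __delta link rather than only the token is usually simpler, because the link can be requested as-is on the next run.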
There is just one small inconvenience you should know about. Because the field is meant to store an authentication key, its value is protected against unauthorized access. This means that every time you edit the linked service, you have to retype the header value, exactly as you would a password.
The above change requires us to provide the header every time we use the linked service. Therefore we need to create a new parameter in the OData dataset to pass the value. Then we can reference it using an expression:
Do you remember that when you run delta-enabled extraction, there is an additional field __delta with a link to the next set of data? Server-side paging works in a very similar way. At the end of each response, there is an extra field __next with the link to the next chunk of data. Both solutions use tokens passed as query parameters. As we can see, the URL contains the token, which proves Synapse used server-side pagination to read all data.
Feature extraction is very different from Feature selection: the former consists in transforming arbitrary data, such as text or images, into numerical features usable for machine learning. The latter is a machine learning technique applied on these features.
DictVectorizer is also a useful representation transformation for training sequence classifiers in Natural Language Processing models that typically work by extracting feature windows around a particular word of interest.
As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
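A minimal sketch of this idea, assuming a hypothetical feature window (the neighbouring words and part-of-speech tags below are made up for illustration):

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical context window around the word "sat" in "The cat sat on the mat."
pos_window = [
    {
        "word-2": "the",
        "pos-2": "DT",
        "word-1": "cat",
        "pos-1": "NN",
        "word+1": "on",
        "pos+1": "PP",
    }
]

vec = DictVectorizer()  # sparse output is the default
X = vec.fit_transform(pos_window)

print(type(X))             # a scipy.sparse matrix, not a numpy.ndarray
print(vec.feature_names_)  # one one-hot column per "key=value" pair
print(X.toarray())         # dense view, only sensible for tiny examples
```

Each string-valued entry becomes its own one-hot column, which is why the matrix gets wide quickly on a real corpus.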
Feature hashing can be employed in document classification, but unlike CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.
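Because FeatureHasher does no tokenization of its own, the caller has to split documents into tokens first. The sketch below assumes naive whitespace splitting and a deliberately small n_features, both chosen only for illustration:

```python
from scipy import sparse
from sklearn.feature_extraction import FeatureHasher

raw_docs = ["the cat sat on the mat", "the dog barked"]

# Tokenization is our job: FeatureHasher only hashes what it is given.
token_lists = [doc.split() for doc in raw_docs]

hasher = FeatureHasher(n_features=16, input_type="string")
X_hashed = hasher.transform(token_lists)

print(X_hashed.shape)  # (number of documents, n_features)
```

The hasher is stateless, so no fit step is needed and no vocabulary is kept in memory.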
Text Analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.
For instance a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up algebraic operations matrix / vector, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
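To make the sparsity concrete, here is a small sketch that vectorizes a four-sentence toy corpus (an assumption for illustration) and measures how many entries of the document-term matrix are actually non-zero:

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus assumed for illustration.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # returned as a scipy.sparse matrix

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)
print(f"{n_docs} documents, {n_terms} terms, {X.nnz} non-zero entries")
print(f"density: {density:.2f}")
```

On a corpus this small the matrix is still fairly dense; with a vocabulary in the order of 100,000 words and a few hundred distinct words per document, the density drops well below one percent.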
Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in equal vectors. In particular we lose the information that the last document is an interrogative form. To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):
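A minimal sketch of the comparison, assuming a four-sentence toy corpus in which the first and last documents share the same words:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus assumed for illustration.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# 1-grams only: word order is lost entirely.
unigram_vectorizer = CountVectorizer()
X_1 = unigram_vectorizer.fit_transform(corpus).toarray()

# 1-grams plus 2-grams: some local ordering is kept.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_2 = bigram_vectorizer.fit_transform(corpus).toarray()

same_with_unigrams = (X_1[0] == X_1[3]).all()
same_with_bigrams = (X_2[0] == X_2[3]).all()
print(same_with_unigrams, same_with_bigrams)

# The bigram "is this" only occurs in the interrogative form.
col = bigram_vectorizer.vocabulary_["is this"]
print(X_2[:, col])
```

The price is a larger vocabulary, since every distinct pair of adjacent words becomes a feature.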
This was originally a term weighting scheme developed for information retrieval (as a ranking function for search engines results) that has also found good use in document classification and clustering.
Text is made of characters, but files are made of bytes. These bytes represent characters according to some encoding. To work with text files in Python, their bytes must be decoded to a character set called Unicode. Common encodings are ASCII, Latin-1 (Western Europe), KOI8-R (Russian) and the universal encodings UTF-8 and UTF-16. Many others exist.
The text feature extractors in scikit-learn know how to decode text files, but only if you tell them what encoding the files are in. The CountVectorizer takes an encoding parameter for this purpose. For modern text files, the correct encoding is probably UTF-8, which is therefore the default (encoding="utf-8").
If the text you are loading is not actually encoded with UTF-8, however, you will get a UnicodeDecodeError. The vectorizers can be told to be silent about decoding errors by setting the decode_error parameter to either "ignore" or "replace". See the documentation for the Python function bytes.decode for more details (type help(bytes.decode) at the Python prompt).
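The sketch below walks through the three situations on a tiny byte-string corpus that is deliberately encoded as Latin-1 (the sample text is an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bytes encoded as Latin-1, which is not valid UTF-8 because of the "é".
docs = ["café au lait".encode("latin-1"), "plain tea".encode("latin-1")]

# 1. Default settings (encoding="utf-8", strict errors) fail on these bytes.
try:
    CountVectorizer().fit(docs)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 2. decode_error="ignore" silently drops the undecodable byte.
ignoring = CountVectorizer(decode_error="ignore").fit(docs)

# 3. Declaring the real encoding keeps the text intact.
latin1 = CountVectorizer(encoding="latin-1").fit(docs)

print(strict_failed)
print(sorted(ignoring.vocabulary_))
print(sorted(latin1.vocabulary_))
```

Silencing errors is convenient but lossy: the token "café" degrades to "caf", so stating the correct encoding is preferable whenever it is known.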
In the above example, char_wb analyzer is used, which creates n-grams only from characters inside word boundaries (padded with space on each side). The char analyzer, alternatively, creates n-grams that span across words:
The word boundaries-aware variant char_wb is especially interesting for languages that use white-spaces for word separation as it generates significantly less noisy features than the raw char variant in that case. For such languages it can increase both the predictive accuracy and convergence speed of classifiers trained using such features while retaining the robustness with regards to misspellings and word derivations.
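A small side-by-side sketch of the two analyzers, using 5-character n-grams on a two-word sample string chosen for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

sample = ["jumpy fox"]

# char_wb: n-grams are built inside each word, padded with a space on each side.
wb_vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(5, 5))
wb_vectorizer.fit(sample)
wb_feats = sorted(wb_vectorizer.vocabulary_)

# char: n-grams slide over the raw string and may straddle the space.
char_vectorizer = CountVectorizer(analyzer="char", ngram_range=(5, 5))
char_vectorizer.fit(sample)
char_feats = sorted(char_vectorizer.vocabulary_)

print(wb_feats)
print(char_feats)
```

Features such as "py fo" appear only with the char analyzer; they mix the end of one word with the start of the next, which is the noise char_wb avoids.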
While some local positioning information can be preserved by extracting n-grams instead of individual words, bag of words and bag of n-grams destroy most of the inner structure of the document and hence most of the meaning carried by that internal structure.
The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
You can see that 16 non-zero feature tokens were extracted in the vector output: this is less than the 19 non-zeros extracted previously by the CountVectorizer on the same toy corpus. The discrepancy comes from hash function collisions because of the low value of the n_features parameter.
In a real world setting, the n_features parameter can be left to its default value of 2 ** 20 (roughly one million possible features). If memory or downstream models size is an issue selecting a lower value such as 2 ** 18 might help without introducing too many additional collisions on typical text classification tasks.
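The effect of n_features can be sketched with HashingVectorizer on a four-sentence toy corpus (assumed here for illustration). With only 10 hash buckets, several distinct tokens are likely to share a column:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy corpus assumed for illustration.
corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Deliberately tiny feature space to provoke collisions.
small_hv = HashingVectorizer(n_features=10)
X_small = small_hv.transform(corpus)  # stateless: no fit, no vocabulary_

# Default feature space, large enough that collisions are rare.
default_hv = HashingVectorizer()
X_default = default_hv.transform(corpus)

print(X_small.shape, X_small.nnz)
print(X_default.shape, X_default.nnz)
```

Comparing the two nnz values shows how many distinct tokens were merged by collisions in the small feature space.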
it is not possible to invert the model (no inverse_transform method), nor to access the original string representation of the features, because of the one-way nature of the hash function that performs the mapping.
preprocessor: a callable that takes an entire document as input (as a single string), and returns a possibly transformed version of the document, still as an entire string. This can be used to remove HTML tags, lowercase the entire document, etc.
analyzer: a callable that replaces the preprocessor and tokenizer. The default analyzers all call the preprocessor and tokenizer, but custom analyzers will skip this. N-gram extraction and stop word filtering take place at the analyzer level, so a custom analyzer may have to reproduce these steps.
To make the preprocessor, tokenizer and analyzers aware of the model parameters it is possible to derive from the class and override the build_preprocessor, build_tokenizer and build_analyzer factory methods instead of passing custom functions.
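As one possible sketch of the subclassing approach, the hypothetical TagStrippingVectorizer below overrides build_preprocessor to remove HTML tags and then delegates to the default preprocessor, so parameters such as lowercase keep working. The class name and the simple tag regex are illustrative assumptions, not part of scikit-learn.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


class TagStrippingVectorizer(CountVectorizer):
    """Hypothetical vectorizer that drops HTML tags before the usual preprocessing."""

    def build_preprocessor(self):
        # Reuse the default preprocessor so lowercase/strip_accents are honoured.
        default_preprocess = super().build_preprocessor()
        # Naive tag pattern, good enough for a sketch but not a real HTML parser.
        return lambda doc: default_preprocess(re.sub(r"<[^>]+>", " ", doc))


vectorizer = TagStrippingVectorizer()
analyze = vectorizer.build_analyzer()
tokens = analyze("<p>Hello <b>World</b></p>")
print(tokens)
```

Because the default analyzer calls build_preprocessor internally, n-gram extraction and stop word filtering still run as usual after the tags are removed.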
The extract_patches_2d function extracts patches from an image stored as a two-dimensional array, or three-dimensional with color information along the third axis. For rebuilding an image from all its patches, use reconstruct_from_patches_2d. For example let us generate a 4x4 pixel picture with 3 color channels (e.g. in RGB format):
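A minimal sketch of that round trip, using a synthetic 4x4 image whose pixel values are simply consecutive integers (an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.feature_extraction import image

# Synthetic 4x4 picture with 3 color channels.
one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))

# All overlapping 2x2 patches: a 3x3 grid of positions, so 9 patches.
patches = image.extract_patches_2d(one_image, (2, 2))
print(patches.shape)

# Rebuild the image by averaging the overlapping patches.
reconstructed = image.reconstruct_from_patches_2d(patches, (4, 4, 3))
print(np.allclose(one_image, reconstructed))
```

Since every patch comes unmodified from the same image, averaging the overlaps gives back the original pixel values.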
Several estimators in scikit-learn can use connectivity information between features or samples. For instance Ward clustering (Hierarchical clustering) can cluster together only neighboring pixels of an image, thus forming contiguous patches:
These matrices can be used to impose connectivity in estimators that use connectivity information, such as Ward clustering (Hierarchical clustering), but also to build precomputed kernels, or similarity matrices.
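As a rough sketch of connectivity-constrained Ward clustering, the example below builds a pixel adjacency matrix with grid_to_graph and clusters a synthetic 4x4 image (dark left half, bright right half, made up for illustration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Synthetic image: left two columns dark, right two columns bright.
img = np.array(
    [
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
)

# Sparse adjacency matrix linking each pixel to its grid neighbors.
connectivity = grid_to_graph(*img.shape)

# One sample per pixel, with its intensity as the only feature.
X_pixels = img.reshape(-1, 1)

ward = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=connectivity
)
labels = ward.fit_predict(X_pixels).reshape(img.shape)
print(labels)
```

The connectivity matrix restricts merges to adjacent pixels, so each resulting cluster is a contiguous region of the image.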