Based on the findings in \autoref{sec:solution}, an implementation in Python was realized.
The following sections describe the structure and service composition used to fulfill the requirements.
\section{Code structure}
There are four packages forming the Analysis Framework project:
\begin{itemize}
\item analysis: Core analysis functionality, including log parsing, analysis, postprocessing and rendering
\item clients: Connection classes to game servers to retrieve log files and game configurations
\item selector: Web interface for non-expert users
\item tasks: Definition of asynchronous tasks
\end{itemize}
The analysis and clients packages are described in \autoref{sec:analysisframework}, while \autoref{sec:web} features selector and tasks packages.
\image{.7\textwidth}{packages}{Project package overview}{img:packages}
\subsection{Analysis Framework}\label{sec:analysisframework}
The internal structure of the analysis package is shown in \autoref{img:pack-analysis}.
Besides the subpackages for the analysis work (analyzers: \autoref{sec:analysiswork}) and log parsing (loaders: \autoref{sec:loaders}), it contains helper functionality and the Python module \texttt{log\_analyzer}, which serves as an entry point for researchers to experiment with and outlines the intended workflow.
\image{.7\textwidth}{packages-analysis}{analysis package overview}{img:pack-analysis}
\subsubsection{Log parsing}\label{sec:loaders}
Outlined in \autoref{img:pack-loader}, the parsing of log files into an internal structure happens here.
\image{.7\textwidth}{packages-loader}{loader package overview}{img:pack-loader}
\paragraph{The loader module} holds the definition of the abstract base class \texttt{Loader}.
It has two unimplemented methods: \texttt{load} and \texttt{get\_entry}.
While the first is called with a filename as argument to load a log file, the second is then called repeatedly to retrieve a single log entry for the analysis steps.
Processing stops when all log entries have been passed on by this method.
The module also defines a showcase implementation that loads a JSON file and \texttt{yield}s its items.
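The interface described above can be sketched as follows. This is a hypothetical sketch: only the method names \texttt{load} and \texttt{get\_entry} are taken from the text; the exact signatures, and the assumption that \texttt{get\_entry} is a generator, are illustrative.

```python
import json
from abc import ABC, abstractmethod


class Loader(ABC):
    """Abstract base class for log parsers (sketch; signatures assumed)."""

    @abstractmethod
    def load(self, filename):
        """Open and parse the log file at `filename`."""

    @abstractmethod
    def get_entry(self):
        """Yield one parsed log entry per call until all are exhausted."""


class JSONLoader(Loader):
    """Showcase implementation: loads a JSON file and yields its items."""

    def load(self, filename):
        with open(filename) as fh:
            self._entries = json.load(fh)

    def get_entry(self):
        yield from self._entries
```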
\paragraph{Biogames} handles the log files of Biodiv2go. A composite approach was used: the games' log files come as a ZIP archive with an SQLite database and possibly media files.
The \texttt{SQLiteLoader} contains the logic to handle a plain SQLite file according to the definition of the \texttt{Loader} from above.
By extending this class, \texttt{ZipSQLiteLoader} focuses on unzipping the archive and creating a temporary storage location, leaving interpretation of the data to its super class.
This avoids code duplication and, with little tweaking, would present a generic way to handle SQLite database files.
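Using the standard-library modules \texttt{zipfile}, \texttt{sqlite3}, and \texttt{tempfile}, the composite approach can be sketched as follows. The table name and the name of the database inside the archive are hypothetical placeholders, not the actual Biodiv2go schema.

```python
import sqlite3
import tempfile
import zipfile


class SQLiteLoader:
    """Reads log entries from a plain SQLite file (sketch; the table
    name "logs" is a hypothetical placeholder)."""

    table = "logs"

    def load(self, filename):
        self._conn = sqlite3.connect(filename)

    def get_entry(self):
        yield from self._conn.execute(f"SELECT * FROM {self.table}")


class ZipSQLiteLoader(SQLiteLoader):
    """Unzips the archive into a temporary directory, then delegates
    the actual parsing to its superclass."""

    db_name = "log.sqlite"  # assumed name of the database in the archive

    def load(self, filename):
        self._tmpdir = tempfile.TemporaryDirectory()
        with zipfile.ZipFile(filename) as archive:
            archive.extractall(self._tmpdir.name)
        super().load(f"{self._tmpdir.name}/{self.db_name}")
```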
\paragraph{Neocart(ographer)}
covers the evaluation step described in \autoref{sec:eval}.
This \texttt{Loader} deals with severely malformed XML files.
\paragraph{Module settings} are stored in the \texttt{\_\_init\_\_} module.
This is mainly a mapping that allows the JSON configuration files to reference \texttt{Loader}s by name (see \autoref{sec:settings}).
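Such a mapping might look like the following sketch; the loader names and classes are placeholders, not the actual module contents.

```python
# Placeholder loader classes standing in for the real implementations.
class JSONLoader:
    pass

class ZipSQLiteLoader:
    pass

class NeocartLoader:
    pass

# Hypothetical sketch of the mapping in the __init__ module: a JSON
# settings file references a loader by name, which is resolved to a
# class at runtime.
LOADERS = {
    "json": JSONLoader,
    "biogames": ZipSQLiteLoader,
    "neocart": NeocartLoader,
}

def get_loader(name):
    """Resolve a loader name from a settings file to its class."""
    return LOADERS[name]
```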
\subsubsection{Analysis Work package}\label{sec:analysiswork}
\autoref{img:pack-analyzers} shows the subpackages of \texttt{analysis.analyzers}.
There are subpackages for doing the actual analysis work, as well as for the postprocessing and rendering steps.
Additionally, the \texttt{settings} module defines the \texttt{LogSettings} class.
\image{.7\textwidth}{packages-analysis-analyzers}{analysis.analyzers package overview}{img:pack-analyzers}
\paragraph{LogSettings}\label{sec:settings}
This class holds the configuration for an analysis run:
\begin{itemize}
\item The type of the log parser to use
\item Information about the structure of the parsed log files, e.g.
\begin{itemize}
\item What is the key of the field from which the type of a log entry is derived?
\item Which value of this field marks spatial information?
\item Which value indicates game actions?
\item What is the path to obtain the spatial information from a spatial entry?
\end{itemize}
\item The analysis setup:
\begin{itemize}
\item Which analyzers to use,
\item and the order to apply them
\end{itemize}
\item Variable data to configure the source (see \autoref{sec:source}).
\item Rendering methods to apply to the result set
\end{itemize}
The settings are stored as JSON files and parsed at runtime into a \texttt{LogSettings} object (see \autoref{img:oebkml} for a sample JSON settings file).
The helper functions in \texttt{analysis.util} provide a very basic query language for Python dictionaries:
a dot-separated string defines the path to take through the dictionary, essentially providing syntactic sugar to avoid lines like \texttt{entry["instance"]["config"]["@id"]}.
As such nested indexing proves quite difficult to configure in JSON, the path string \texttt{"instance.config.@id"} is much more deserialization-friendly.
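Such a helper can be sketched in a few lines; the function name is hypothetical and the actual code in \texttt{analysis.util} may differ.

```python
from functools import reduce


def get_path(entry, path):
    """Resolve a dot-separated path through nested dictionaries.

    Hypothetical sketch: "instance.config.@id" is equivalent to
    entry["instance"]["config"]["@id"].
    """
    return reduce(lambda node, key: node[key], path.split("."), entry)
```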
\paragraph{The Analyzer package} defines the work classes to extract information from log entries.
The package's init module defines the \texttt{Result} and \texttt{ResultStore} classes, as well as the abstract base class for the analyzers.
As shown in \autoref{code:anaylzer}, this base class provides the basic mechanics to access the settings.
The core feature of this project is condensed in the method stub \texttt{process}.
It is fed a parsed entry from \autoref{sec:loaders}, processes it, possibly updates the internal state of the class, and can then decide to end the processing of the particular log entry or to feed it down the remainder of the analysis chain.
When all log entries of a log file are processed, the \texttt{result} method returns the findings of this analysis instance (see \autoref{par:result}).
\lstinputlisting[language=python,caption={Analyzer base class},label=code:anaylzer]{code/analyzer.py}
There are 23 classes implementing analysis functionality, split into modules for generic use, Biodiv2go analysis, and filtering purposes.
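To illustrate the contract, a hypothetical minimal analyzer might look as follows. The base class is stubbed here; the real one from the listing above also wires up settings access, and the type key would come from \texttt{LogSettings} rather than a class attribute.

```python
class Analyzer:
    """Stub of the abstract base class (sketch; the real class also
    provides access to the settings)."""

    def process(self, entry):
        raise NotImplementedError

    def result(self):
        raise NotImplementedError


class TypeCountAnalyzer(Analyzer):
    """Hypothetical example: counts log entries by their type field and
    always passes entries on down the analysis chain."""

    TYPE_KEY = "type"  # in the real framework this comes from LogSettings

    def __init__(self):
        self.counts = {}

    def process(self, entry):
        kind = entry.get(self.TYPE_KEY, "unknown")
        self.counts[kind] = self.counts.get(kind, 0) + 1
        return True  # True: continue feeding the entry down the chain

    def result(self):
        return self.counts
```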
\paragraph{Results}\label{par:result} are stored in a \texttt{Result} object (\texttt{analysis.analyzers.analyzer.\_\_init\_\_}).
This class keeps track of the origin of the resulting data to allow filtering for results by arbitrary analyzer classes.
As \autoref{code:anaylzer} shows, the \texttt{Result}s are stored in a \texttt{ResultStore}.
This store, defined alongside the \texttt{Result} class, provides means to structure the results by arbitrary measures.
By passing the store's reference into the analyzers, any analyzer can introduce categorization measures.
This allows, for example, distinguishing several log files by name, or combining log files and merging the results by events happening during a game's progress.
While the default is a dictionary of lists, the API supports a callable factory for arbitrary uses.
\paragraph{Rendering of the Results} is done in the \texttt{render} package.
Similar to the Analyzers' package, the render package defines its common base class in the initialization module, as shown in \autoref{code:render}.
It provides implementers with means to filter the result set down to the relevant analysis types through the \texttt{filter} methods.
Of course, the implementation of the rendering method is left open.
\lstinputlisting[language=python,caption={Render base class},label=code:render]{code/render.py}
There are 18 implementations, again split into generic and game-specific ones.
The most generic renderers just dump the results into JSON files or echo them to the console.
A more advanced implementation relies on the \texttt{LocationAnalyzer} and creates a KML file with a track animation (example: \autoref{img:oebge}).
Finally, e.g. \texttt{biogames.SimulationGroupRender} performs postprocessing steps on a collection of \texttt{biogames.SimulationOrderAnalyzer} results by creating a graph with matplotlib\furl{https://matplotlib.org/} to discover simulation retries (example: \autoref{img:retries}).
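As an illustration, the simplest kind of renderer could be sketched like this. It is a hypothetical stand-in, not the actual implementation; results are simplified to plain dictionaries, and the base class here only hints at the filtering described above.

```python
import json


class Render:
    """Stub of the render base class: `filter` narrows the result set
    to results from a given analyzer origin (sketch)."""

    def __init__(self, results):
        self._results = results

    def filter(self, origin):
        return [r for r in self._results if r["origin"] == origin]

    def render(self):
        raise NotImplementedError


class JSONRender(Render):
    """Hypothetical sketch of the most generic renderer: dump the
    whole result set as JSON."""

    def render(self):
        return json.dumps(self._results)
```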
\subsection{Sources}\label{sec:source} of log files are clients connecting either to game servers directly or other log providers.
There is currently a bias towards HTTP clients, as REST APIs are today's go-to default.
To keep this bias explicit, the HTTP-oriented base class is not defined at package level.
The \texttt{Client} originates from the \texttt{client.webclients} package instead.
It contains some convenience wrappers to add cookies, headers and URL-completion to HTTP calls as well as handling file downloads.
The two implementing classes are designed for Biodiv2go and a Geogames-Team log provider.
Using a REST API, the \texttt{Biogames} client integrates seamlessly into the authentication and authorization of the game server.
The client acts as a proxy for users to avoid issues with cross-site scripting (XSS) or cross-origin resource sharing (CORS).
The Geogames-Team's geogames like Neocartographer write game logs to files and only have a server running during an active game.
Therefore, an additional log providing server was created to allow access to the log files (see also: \autoref{sec:ggt-server}).
Clients can have arbitrary amounts of options, as all fields in the JSON settings file are passed through.
\subsection{Web Interface}\label{sec:web}
The selector package holds a Flask\furl{http://flask.pocoo.org/} app providing a web interface for non-expert users.
It utilizes the provided clients (see \autoref{sec:source}) for authentication, and gives users the following options:
\begin{itemize}
\item Explore available game logs
\item Configure a new analysis run
\item View the status of analysis runs
\item View the results of analysis runs
\end{itemize}
The web interface offers all available clients for the user to choose from.
With user-provided credentials, the server retrieves the available game logs and offers them, together with the predefined analysis options, to create a new analysis run.
When an analysis run is requested, the server issues a new task to be executed (see \autoref{sec:tasks}).
An overview page lists the status of the tasks from the given user, and provides access to the results once the task is finished.
When problems occur, the status page informs the user, too.
\subsection{Task definition}\label{sec:tasks}
The \texttt{tasks} package provides the tasks available for execution.
This package is the interface for Celery\furl{http://www.celeryproject.org/} workers and issuers.
The key point is the task \texttt{analyze} to start new analysis runs.
When a new task is scheduled, the issuer puts the task into the Redis DB\furl{https://redis.io/}.
A free worker node claims the task and executes it.
During the run, status updates are stored in the Redis DB to inform the issuer about progress, failures, and result artifacts.
\section{Service composition}
\image{\textwidth}{architecture.pdf}{architecture overview}{img:arch}
\subsection{Geogame Log file provider}\label{sec:ggt-server}
%TODO: end
\section{Outlook: Implementation}
\subsection{Implementation}
Analysis
\begin{itemize}
\item Python (3.6)
\item Standalone library/CLI tool
\item Web based configuration/Runner/API (Flask)
\end{itemize}
Rendering
\begin{itemize}
\item Matplotlib, Numpy
\begin{itemize}
\item Graphs
\end{itemize}
\item Javascript
\begin{itemize}
\item Leaflet
\item Web visualization: Maps, Tracks, …
\end{itemize}
\end{itemize}
\pic{.5\textwidth}{../../PresTeX/images/matplotlib}
\pic{.5\textwidth}{../../PresTeX/images/python}
\pic{.4\textwidth}{../../PresTeX/images/flask}
\pic{.4\textwidth}{../../PresTeX/images/leaflet}
\subsection{Examples}
Configuration \& results
%\twofigures{0.5}{../../PresTeX/images/oeb-kml}{Analyzer configuration}{img:oebkml}{../../PresTeX/images/oeb-ge}{Result visualized}{img:oebge}{Example: Generate KML tracks (BioDiv2Go; Oberelsbach2016)}{fig:oeb2016}
ActivityMapper
\image{.7\textwidth}{../../PresTeX/images/track-fi}{Combined screen activity and spatial progress}{img:trackfi}
Graphs
\image{\textwidth}{../../PresTeX/images/speed}{Speed distribution}{img:speed}
\image{.9\textwidth}{../../PresTeX/images/time-rel}{Time distribution}{img:time}